Smart XML, Part 2: Converting HTML Entities for XML

by Castwide on 5-2-2008 • Tags: code, html, php, xml

Yesterday I discovered a bothersome feature in PHP 5's DOM library.  Loading XML data into a DOMDocument object triggers errors if the data contains HTML entities, because XML itself predefines only five (&amp;, &lt;, &gt;, &quot; and &apos;).  I have no idea whether this issue is common in other implementations, but it was a big enough hassle that I added a new function to the Smart XML library.

This example code attempts to load an XML document that contains an HTML entity (&nbsp;):

$xml = '<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = new DOMDocument();
$dom->loadXML($xml);

The loadXML() call fails, returning false, and outputs the following warning:

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Entity 'nbsp' not defined in Entity

The DOMDocument object needs a document type declaration that defines HTML entities.  One simple solution to the problem is to include the HTML doctype at the beginning of the document, like so:

$xml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' .
    '<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = new DOMDocument();
$dom->loadXML($xml);

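This time the document loads, but PHP still emits the same warning.  As far as I can tell, the parser does not fetch the external DTD by default, so the entity is still undefined; having a doctype at all just makes the problem non-fatal.  We can use the @ operator to suppress the warning: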

$xml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' .
    '<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = new DOMDocument();
@$dom->loadXML($xml);

Now we have a valid document and no errors.
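The @ operator hides every warning from the call, including ones we might care about.  As an alternative sketch, libxml's internal error buffer can collect the messages so they can be inspected instead of discarded.  The $messages variable is just an illustrative name, and what you do with the collected messages is up to you.

```php
<?php
$xml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' .
    '<root><node>Non-breaking space: &nbsp;</node></root>';

// Collect libxml errors in a buffer instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadXML($xml);

// Copy out whatever the parser reported, then empty the buffer.
$messages = array();
foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
}
libxml_clear_errors();

// Restore the previous error-handling mode.
libxml_use_internal_errors($previous);

print_r($messages);
```

This keeps the page free of warnings while still letting you log or check the undefined-entity message.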

In cases where we can't use the HTML doctype, or we're loading from a data source we can't control, another option is to transform all the HTML entities into numeric entities.  The &nbsp; entity, for example, would become &#160;.

The Smart XML library we started in Part 1 already includes a function to retrieve the HTML entity table.  We'll use it for a function that builds a translation table from HTML to numeric entities, and then we can use strtr() to perform the translation.
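To make the strtr() idea concrete before the full listing, here is a minimal sketch with a hand-written, three-entry table.  The table and the sample string are made up for illustration; the real table is built from the entity list.

```php
<?php
// Hypothetical mini table: named entity => numeric entity.
$table = array(
    '&nbsp;'   => '&#160;',
    '&copy;'   => '&#169;',
    '&eacute;' => '&#233;',
);

$src = 'Caf&eacute;&nbsp;&copy; 2008 &amp; more';

// strtr() with an array replaces every key it finds in a single pass.
$result = strtr($src, $table);

echo $result;   // Caf&#233;&#160;&#169; 2008 &amp; more
```

Anything that isn't in the table, like &amp; here, passes through unchanged.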

Here is the code:

class SmartXML {
    public static function GetExtendedEntityTable() {
        static $entities = -1;
        if ($entities == -1) {
            $entities = get_html_translation_table(HTML_ENTITIES);
            $entities[chr(130)] = '&sbquo;';    // Single Low-9 Quotation Mark
            $entities[chr(131)] = '&fnof;';     // Latin Small Letter F With Hook
            $entities[chr(132)] = '&bdquo;';    // Double Low-9 Quotation Mark
            $entities[chr(133)] = '&hellip;';   // Horizontal Ellipsis
            $entities[chr(134)] = '&dagger;';   // Dagger
            $entities[chr(135)] = '&Dagger;';   // Double Dagger
            $entities[chr(136)] = '&circ;';     // Modifier Letter Circumflex Accent
            $entities[chr(137)] = '&permil;';   // Per Mille Sign
            $entities[chr(138)] = '&Scaron;';   // Latin Capital Letter S With Caron
            $entities[chr(139)] = '&lsaquo;';   // Single Left-Pointing Angle Quotation Mark
            $entities[chr(140)] = '&OElig;';    // Latin Capital Ligature OE
            $entities[chr(145)] = '&lsquo;';    // Left Single Quotation Mark
            $entities[chr(146)] = '&rsquo;';    // Right Single Quotation Mark
            $entities[chr(147)] = '&ldquo;';    // Left Double Quotation Mark
            $entities[chr(148)] = '&rdquo;';    // Right Double Quotation Mark
            $entities[chr(149)] = '&bull;';     // Bullet
            $entities[chr(150)] = '&ndash;';    // En Dash
            $entities[chr(151)] = '&mdash;';    // Em Dash
            $entities[chr(152)] = '&tilde;';    // Small Tilde
            $entities[chr(153)] = '&trade;';    // Trade Mark Sign
            $entities[chr(154)] = '&scaron;';   // Latin Small Letter S With Caron
            $entities[chr(155)] = '&rsaquo;';   // Single Right-Pointing Angle Quotation Mark
            $entities[chr(156)] = '&oelig;';    // Latin Small Ligature OE
            $entities[chr(159)] = '&Yuml;';     // Latin Capital Letter Y With Diaeresis
        }
        return $entities;
    }

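    public static function GetHTMLConversionTable() {
        static $entities = -1;
        if ($entities == -1) {
            $entities = array();
            $tmp = SmartXML::GetExtendedEntityTable();
            foreach ($tmp as $v) {
                // Decode the named entity to UTF-8 and read its Unicode code point.
                // ord() on the table key is wrong for multi-byte keys and for the
                // Windows-1252 characters above (&sbquo; is U+201A, not 130).
                $char = html_entity_decode($v, ENT_QUOTES, 'UTF-8');
                if ($char === $v) {
                    continue;   // Not decodable; leave this entity untouched
                }
                $code = unpack('N', iconv('UTF-8', 'UTF-32BE', $char));
                $entities[$v] = '&#' . $code[1] . ';';
            }
        }
        return $entities;
    }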

    public static function ConvertHTMLEntities($src) {
        return strtr($src, SmartXML::GetHTMLConversionTable());
    }
}

The following code loads the document with its HTML entities translated and raises no errors:

$xml = '<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = new DOMDocument();
$dom->loadXML(SmartXML::ConvertHTMLEntities($xml));
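One quick way to check that the translation did what we wanted is to read the node's text back and compare it to the real character.  This sketch uses a one-entry strtr() table as a stand-in for SmartXML::ConvertHTMLEntities() so that it runs on its own; DOM hands text back as UTF-8, where the non-breaking space is the two bytes C2 A0.

```php
<?php
$xml = '<root><node>Non-breaking space: &nbsp;</node></root>';

// Stand-in for SmartXML::ConvertHTMLEntities(), limited to one entity.
$converted = strtr($xml, array('&nbsp;' => '&#160;'));

$dom = new DOMDocument();
$loaded = $dom->loadXML($converted);

// The numeric entity is parsed into the actual character (U+00A0).
$text = $dom->getElementsByTagName('node')->item(0)->textContent;

var_dump($text === "Non-breaking space: \xC2\xA0");   // bool(true)
```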

Incidentally, after the document has been created, adding HTML entities to it will not generate errors, even without the doctype declaration.
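One way to read that last point is adding an entity reference node to a document that has already been loaded.  This sketch assumes that interpretation and uses createEntityReference(); the sample document is made up for illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root><node>Non-breaking space: </node></root>');

// Append a reference to the nbsp entity, even though no doctype defines it.
$node = $dom->getElementsByTagName('node')->item(0);
$node->appendChild($dom->createEntityReference('nbsp'));

// Serialize just the root element to see the reference in the output.
$out = $dom->saveXML($dom->documentElement);
echo $out;
```

Note that the serialized output still contains the named entity, so it would hit the original problem again if it were fed back into loadXML() as is.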
