Smart XML, Part 2: Converting HTML Entities for XML

Yesterday I discovered a bothersome feature in PHP 5's DOM library.  Loading XML data into a DOMDocument object triggers errors if it contains HTML entities.  I have no idea if this issue is common in other implementations, but it was a big enough hassle that it made me add a new function to the Smart XML library.

This example code attempts to load an XML document that contains an HTML entity (&nbsp;):

$xml = '<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = new DOMDocument();
$loaded = $dom->loadXML($xml);

Here loadXML() returns false, and PHP outputs the following error:

Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Entity 'nbsp' not defined in Entity
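
If you would rather inspect the failure in code than read it from a warning, libxml's internal error buffer is one option. The following is only a sketch of that idea; the variable names are mine, and the exact wording of the messages may differ between libxml versions.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$xml = '<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = new DOMDocument();
$loaded = $dom->loadXML($xml);

$messages = array();
foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError object with level, code, line and message.
    $messages[] = trim($error->message);
}

// Empty the buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo implode("\n", $messages), "\n";
```

After this runs, $loaded holds the return value of loadXML() and $messages holds whatever libxml reported, so the calling code can decide how to react.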

The DOMDocument object needs a document type declaration that defines HTML entities.  One simple solution to the problem is to include the HTML doctype at the beginning of the document, like so:

$xml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' .
    '<root><node>Non-breaking space: &nbsp;</node></root>';
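$dom = new DOMDocument();
$dom->loadXML($xml);

This time the document loads, but the same warning is still generated.  As far as I can tell, the parser does not fetch the external DTD by default, so the entity is still undefined; the doctype only downgrades the problem from a fatal error to a warning.  We can use @ to suppress it.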

$xml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' .
    '<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = new DOMDocument();
@$dom->loadXML($xml);

Now we have a valid document and no errors.
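
A variation on the doctype idea, when only a handful of entities are involved, is to declare them in an internal DTD subset so the parser can resolve them itself. This is a sketch rather than part of the Smart XML library: the doctype is written by hand, and LIBXML_NOENT (which substitutes entities with their values) should only be used on input you trust.

```php
<?php
// Hypothetical variant: declare &nbsp; in an internal DTD subset so the
// parser can resolve it without any external DTD.
$xml = '<!DOCTYPE root [<!ENTITY nbsp "&#160;">]>' .
    '<root><node>Non-breaking space: &nbsp;</node></root>';

$dom = new DOMDocument();
// LIBXML_NOENT asks libxml to replace entity references with their values.
$loaded = $dom->loadXML($xml, LIBXML_NOENT);

// DOM strings are UTF-8, so the non-breaking space is the bytes C2 A0.
$text = $dom->documentElement->textContent;
```

Because the entity is declared, no warning is raised and nothing needs to be suppressed with @.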

In cases where we can't use the HTML doctype, or we're loading from a data source we can't control, another option is to transform all the HTML entities into numeric entities.  The &nbsp; entity, for example, would become &#160;.

The Smart XML library we started in Part 1 already includes a function to retrieve the HTML entity table.  We'll use it for a function that builds a translation table from HTML to numeric entities, and then we can use strtr() to perform the translation.
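
To make the translation step concrete before looking at the library code, here is a minimal sketch using a hand-written, two-entry table. The table and the sample string are illustrative only; the real table is built from PHP's entity list.

```php
<?php
// Illustrative table mapping named HTML entities to numeric references.
$table = array(
    '&nbsp;' => '&#160;',
    '&copy;' => '&#169;',
);

$src = '<p>&copy;&nbsp;2008 &amp; beyond</p>';

// strtr() replaces every key it finds with the matching value in one pass.
// Entities missing from the table, such as &amp; here, are left untouched.
$converted = strtr($src, $table);
// $converted is '<p>&#169;&#160;2008 &amp; beyond</p>'
```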

Here is the code:

class SmartXML {
    public static function GetExtendedEntityTable() {
        static $entities = -1;
        if ($entities == -1) {
            $entities = get_html_translation_table(HTML_ENTITIES);
            $entities[chr(130)] = '&sbquo;';    // Single Low-9 Quotation Mark
            $entities[chr(131)] = '&fnof;';     // Latin Small Letter F With Hook
            $entities[chr(132)] = '&bdquo;';    // Double Low-9 Quotation Mark
            $entities[chr(133)] = '&hellip;';   // Horizontal Ellipsis
            $entities[chr(134)] = '&dagger;';   // Dagger
            $entities[chr(135)] = '&Dagger;';   // Double Dagger
            $entities[chr(136)] = '&circ;';     // Modifier Letter Circumflex Accent
            $entities[chr(137)] = '&permil;';   // Per Mille Sign
            $entities[chr(138)] = '&Scaron;';   // Latin Capital Letter S With Caron
            $entities[chr(139)] = '&lsaquo;';   // Single Left-Pointing Angle Quotation Mark
            $entities[chr(140)] = '&OElig;';    // Latin Capital Ligature OE
            $entities[chr(145)] = '&lsquo;';    // Left Single Quotation Mark
            $entities[chr(146)] = '&rsquo;';    // Right Single Quotation Mark
            $entities[chr(147)] = '&ldquo;';    // Left Double Quotation Mark
            $entities[chr(148)] = '&rdquo;';    // Right Double Quotation Mark
            $entities[chr(149)] = '&bull;';     // Bullet
            $entities[chr(150)] = '&ndash;';    // En Dash
            $entities[chr(151)] = '&mdash;';    // Em Dash
            $entities[chr(152)] = '&tilde;';    // Small Tilde
            $entities[chr(153)] = '&trade;';    // Trade Mark Sign
            $entities[chr(154)] = '&scaron;';   // Latin Small Letter S With Caron
            $entities[chr(155)] = '&rsaquo;';   // Single Right-Pointing Angle Quotation Mark
            $entities[chr(156)] = '&oelig;';    // Latin Small Ligature OE
            $entities[chr(159)] = '&Yuml;';     // Latin Capital Letter Y With Diaeresis
        }
        return $entities;
    }

    public static function GetHTMLConversionTable() {
        static $entities = -1;
        if ($entities == -1) {
            $entities = array();
            $tmp = SmartXML::GetExtendedEntityTable();
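            foreach ($tmp as $v) {
                // Decode the entity to find its real Unicode code point.
                // ord() on the key would map chr(130)-chr(159) to C1 control codes.
                // mb_ord() needs PHP 7.2+ with the mbstring extension.
                $char = html_entity_decode($v, ENT_QUOTES, 'UTF-8');
                $entities[$v] = '&#' . mb_ord($char, 'UTF-8') . ';';
            }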
        }
        return $entities;
    }

    public static function ConvertHTMLEntities($src) {
        return strtr($src, SmartXML::GetHTMLConversionTable());
    }
}

The following code returns a document with HTML entities translated and no errors:

$xml = '<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = new DOMDocument();
$dom->loadXML(SmartXML::ConvertHTMLEntities($xml));
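
To check that the converted entity really ends up as a non-breaking space in the document, something like the following can be used. The function convertEntitiesSketch() is a hypothetical one-entity stand-in for SmartXML::ConvertHTMLEntities(), included only so the snippet runs on its own.

```php
<?php
// Stand-in for SmartXML::ConvertHTMLEntities(), limited to a single entity.
function convertEntitiesSketch($src) {
    return strtr($src, array('&nbsp;' => '&#160;'));
}

$xml = '<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = new DOMDocument();
$loaded = $dom->loadXML(convertEntitiesSketch($xml));

// DOM strings are UTF-8, so U+00A0 shows up as the two bytes C2 A0.
$text = $dom->getElementsByTagName('node')->item(0)->textContent;
$hasNbsp = (substr($text, -2) === "\xC2\xA0");
```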

Incidentally, after the document has been created, adding HTML entities to it will not generate errors, even without the doctype declaration.
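
One way to add an entity to an existing document is createEntityReference(), which is part of PHP's DOM extension. This sketch assumes that is the kind of addition meant above; the sample document is made up for illustration.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root><node>Non-breaking space: </node></root>');

// Append an entity reference node. The DOM does not check whether
// the entity has been declared in a doctype.
$node = $dom->getElementsByTagName('node')->item(0);
$node->appendChild($dom->createEntityReference('nbsp'));

// Serialize only the root element to see the reference in the output.
$out = $dom->saveXML($dom->documentElement);
```

Keep in mind that the serialized output then contains &nbsp; again, so anything that parses it later faces the same undefined-entity problem.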
