Yesterday I discovered a bothersome feature in PHP 5's DOM library. Loading XML data into a DOMDocument object triggers errors if it contains HTML entities. I have no idea if this issue is common in other implementations, but it was a big enough hassle that it made me add a new function to the Smart XML library.
This example code attempts to load an XML document that contains an HTML entity (&nbsp;):
$xml = '<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = DOMDocument::loadXML($xml);
The loadXML() function returns a null object and outputs the following error:
Warning: DOMDocument::loadXML() [function.DOMDocument-loadXML]: Entity 'nbsp' not defined in Entity
The DOMDocument object needs a document type declaration that defines HTML entities. One simple solution to the problem is to include the HTML doctype at the beginning of the document, like so:
$xml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' .
'<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = DOMDocument::loadXML($xml);
This time the function returns a valid document, but it still generates the same error. We can use @ to suppress it.
$xml = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">' .
'<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = @DOMDocument::loadXML($xml);
Now we have a valid document and no errors.
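The @ operator hides every warning from the call, including ones we might want to see. As an alternative, here is a minimal sketch that uses libxml's internal error buffer to collect the messages. The function name loadXmlCollectingErrors() is hypothetical and not part of the Smart XML library, and the sketch uses the instance form of loadXML() rather than the static call shown above.

```php
<?php
// Hypothetical helper (not part of the Smart XML library): load XML while
// collecting libxml diagnostics instead of hiding them with @.
function loadXmlCollectingErrors(string $xml): array
{
    // Buffer libxml errors internally rather than emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $loaded = $dom->loadXML($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each entry is a LibXMLError with level, code, line, and message.
        $messages[] = trim($error->message);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    // Return the document (or null if parsing failed) plus the messages.
    return [$loaded ? $dom : null, $messages];
}

// Usage: same doctype workaround as above, but the messages are kept.
$doctype = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
         . '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">';
list($dom, $messages) = loadXmlCollectingErrors(
    $doctype . '<root><node>Non-breaking space: &nbsp;</node></root>'
);
foreach ($messages as $message) {
    error_log('libxml: ' . $message);
}
```

The caller can then decide whether a message is worth logging or safe to ignore, which @ does not allow.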
In cases where we can't use the HTML doctype, or we're loading from a data source we can't control, another option is to transform all the HTML entities into numeric entities. The &nbsp; entity, for example, would become &#160;.
The Smart XML library we started in Part 1 already includes a function to retrieve the HTML entity table. We'll use it for a function that builds a translation table from HTML to numeric entities, and then we can use strtr() to perform the translation.
Here is the code:
class SmartXML {
    public static function GetExtendedEntityTable() {
        static $entities = -1;
        if ($entities == -1) {
            // Request single-byte keys explicitly. PHP 5.4+ defaults to UTF-8,
            // which would break the ord() call in GetHTMLConversionTable().
            $entities = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'ISO-8859-1');
            $entities[chr(130)] = '&sbquo;';  // Single Low-9 Quotation Mark
            $entities[chr(131)] = '&fnof;';   // Latin Small Letter F With Hook
            $entities[chr(132)] = '&bdquo;';  // Double Low-9 Quotation Mark
            $entities[chr(133)] = '&hellip;'; // Horizontal Ellipsis
            $entities[chr(134)] = '&dagger;'; // Dagger
            $entities[chr(135)] = '&Dagger;'; // Double Dagger
            $entities[chr(136)] = '&circ;';   // Modifier Letter Circumflex Accent
            $entities[chr(137)] = '&permil;'; // Per Mille Sign
            $entities[chr(138)] = '&Scaron;'; // Latin Capital Letter S With Caron
            $entities[chr(139)] = '&lsaquo;'; // Single Left-Pointing Angle Quotation Mark
            $entities[chr(140)] = '&OElig;';  // Latin Capital Ligature OE
            $entities[chr(145)] = '&lsquo;';  // Left Single Quotation Mark
            $entities[chr(146)] = '&rsquo;';  // Right Single Quotation Mark
            $entities[chr(147)] = '&ldquo;';  // Left Double Quotation Mark
            $entities[chr(148)] = '&rdquo;';  // Right Double Quotation Mark
            $entities[chr(149)] = '&bull;';   // Bullet
            $entities[chr(150)] = '&ndash;';  // En Dash
            $entities[chr(151)] = '&mdash;';  // Em Dash
            $entities[chr(152)] = '&tilde;';  // Small Tilde
            $entities[chr(153)] = '&trade;';  // Trade Mark Sign
            $entities[chr(154)] = '&scaron;'; // Latin Small Letter S With Caron
            $entities[chr(155)] = '&rsaquo;'; // Single Right-Pointing Angle Quotation Mark
            $entities[chr(156)] = '&oelig;';  // Latin Small Ligature OE
            $entities[chr(159)] = '&Yuml;';   // Latin Capital Letter Y With Diaeresis
        }
        return $entities;
    }

    // Builds a map from named entities ('&nbsp;') to numeric ones ('&#160;').
    public static function GetHTMLConversionTable() {
        static $entities = -1;
        if ($entities == -1) {
            $entities = array();
            $tmp = SmartXML::GetExtendedEntityTable();
            foreach ($tmp as $k => $v) {
                $entities[$v] = '&#' . ord($k) . ';';
            }
        }
        return $entities;
    }

    // Replaces every named HTML entity in $src with its numeric form.
    public static function ConvertHTMLEntities($src) {
        return strtr($src, SmartXML::GetHTMLConversionTable());
    }
}
The following code returns a document with HTML entities translated and no errors:
$xml = '<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = DOMDocument::loadXML(SmartXML::ConvertHTMLEntities($xml));
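One caveat: ord() looks at a single byte, so the conversion table above assumes single-byte keys, and the Windows-1252 additions (chr(130) through chr(159)) translate to references such as &#130;, which XML reads as code point U+0082 rather than the intended punctuation mark. The following is a minimal sketch of a Unicode-aware variant. It assumes PHP 7.2 or later with the mbstring extension, and the function name namedToNumericEntities() is hypothetical, not part of the Smart XML library.

```php
<?php
// Hypothetical helper (not part of the Smart XML library): convert named
// HTML 4.01 entities to numeric references using Unicode code points.
// Assumes PHP 7.2+ with the mbstring extension for mb_ord().
function namedToNumericEntities(string $src): string
{
    static $map = null;
    if ($map === null) {
        $map = [];
        // Keys are UTF-8 characters, values are entities such as "&nbsp;".
        $table = get_html_translation_table(
            HTML_ENTITIES,
            ENT_QUOTES | ENT_HTML401,
            'UTF-8'
        );
        foreach ($table as $char => $entity) {
            $map[$entity] = '&#' . mb_ord($char, 'UTF-8') . ';';
        }
        // XML already understands these, so leave them untouched.
        unset($map['&amp;'], $map['&lt;'], $map['&gt;'], $map['&quot;'], $map['&#039;']);
    }
    // strtr() replaces each named entity found in the array with its value.
    return strtr($src, $map);
}

$xml = '<root><node>Non-breaking space: &nbsp;</node></root>';
$dom = new DOMDocument();
$dom->loadXML(namedToNumericEntities($xml));
```

Because the table is keyed by real characters instead of Windows-1252 bytes, entities such as &hellip; and &mdash; map to their Unicode code points without a hand-maintained list.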
Incidentally, after the document has been created, adding HTML entities to it will not generate errors, even without the doctype declaration.
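As an illustration, here is a short sketch of one way to add an entity to an existing document. It assumes that DOMDocument::createEntityReference() is the kind of operation meant here, which the text above does not spell out.

```php
<?php
// Sketch: add an entity reference to a document that has no doctype.
// Assumption: createEntityReference() is one way to "add an HTML entity".
$dom = new DOMDocument();
$dom->loadXML('<root><node>Non-breaking space: </node></root>');

$node = $dom->getElementsByTagName('node')->item(0);

// createEntityReference() takes the entity name without "&" and ";".
$node->appendChild($dom->createEntityReference('nbsp'));

// Serialize just the <node> element to inspect the result.
$serialized = $dom->saveXML($node);
echo $serialized, "\n";
```

The reference is stored as its own node type, so the parser's entity lookup is never involved when it is appended.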