Smart XML, Part 1: Encoding and Decoding Entities
by Castwide on 4-22-2008 Tags: code, html, php, xml
One of the major challenges in building a strong framework for web applications is XML handling. PHP 5's XML functions, such as the DOM and SimpleXML libraries, go a long way toward providing the capabilities we need, but there are still a few instances where we need more. Those additional capabilities are my goal for the SmartXML class.
Among the capabilities we need:
- Elegant handling of malformed HTML
- Whitelists for permitted elements and attributes
- Entity encoding
The first issue we'll tackle is entity encoding. In a non-trivial application, properly encoding data for web pages can be a tricky matter. The simplest solution is to encode everything that doesn't contain HTML. Unfortunately, we cannot always know whether the data used by the application has already been encoded. Consider this simple example:
<?php
$input = 'Quick & dirty';
$output = htmlentities($input);
echo $output;
?>
The above code will turn the ampersand into the &amp; entity code, which is exactly what we want. But there is an unfortunate side effect if the input has already been encoded, like so:
<?php
$input = 'Quick &amp; dirty';
$output = htmlentities($input);
echo $output;
?>
The htmlentities() function will encode the ampersand that begins the &amp; entity, and the output will look like this:
Quick &amp;amp; dirty
No matter how we try to predict when data needs to be encoded or decoded, it's nearly impossible to avoid exceptional cases. Of course, this applies not only to HTML, but also to XML in general.
Instead of endlessly chasing the exceptions, we can handle them with "smart" entity encoding. Our smart encoding function will not encode ampersands if they are already part of an entity code. Thus, the strings 'Quick & dirty' and 'Quick &amp; dirty' will both result in the same output. This function, and a couple of related functions, will be the first additions to our SmartXML class.
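For comparison, PHP's built-in htmlentities() accepts a fourth argument, double_encode, that tells it to leave existing entities alone. The sketch below assumes ISO-8859-1 input and only illustrates the "same output either way" behavior we are after; it is not a substitute for the extended entity table built later in this article.

```php
<?php
// Sketch only: with double_encode (fourth argument) set to false, existing
// entities are left alone. ISO-8859-1 is an assumption about the input.
$raw     = 'Quick & dirty';
$encoded = 'Quick &amp; dirty';

$fromRaw     = htmlentities($raw, ENT_COMPAT, 'ISO-8859-1', false);
$fromEncoded = htmlentities($encoded, ENT_COMPAT, 'ISO-8859-1', false);
// Default behavior (double_encode = true) for contrast.
$doubled     = htmlentities($encoded, ENT_COMPAT, 'ISO-8859-1');

echo $fromRaw, "\n";     // Quick &amp; dirty
echo $fromEncoded, "\n"; // Quick &amp; dirty
echo $doubled, "\n";     // Quick &amp;amp; dirty
```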
We can retrieve an array of valid HTML entities using PHP's get_html_translation_table() function. It's important to note, however, that the array is not complete. If you've ever dealt with HTML that came from Microsoft Word, you've dealt with entities that are not in the table, such as the &ldquo; and &rdquo; quotation marks. Fortunately, PHP's documentation includes a user-contributed note that extends the array to include several missing entities. We'll wrap that code in a function called GetExtendedEntityTable().
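To see the gap for yourself, you can inspect the built-in table directly. This is a small sketch; the explicit ISO-8859-1 argument is an assumption made so that every key in the table is a single byte.

```php
<?php
// Inspect the built-in character => entity table.
// Assumption: the single-byte ISO-8859-1 table is requested explicitly.
$table = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'ISO-8859-1');

$ampersand    = $table['&'];               // entity for the ampersand
$copyright    = $table[chr(169)];          // entity for the copyright sign
$hasWordQuote = isset($table[chr(147)]);   // Word's left double quotation mark

var_dump($ampersand);    // string(5) "&amp;"
var_dump($copyright);    // string(6) "&copy;"
var_dump($hasWordQuote); // bool(false)
```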
Next we'll create a function called IsValidEntity(). This function will simply read a string and determine if it exists in the array of entities. If not, it will perform a second check to see if it's part of a numeric entity.
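The two checks can be sketched on their own as follows. The function name isKnownEntity() is hypothetical and not part of SmartXML. This version uses only PHP's built-in table (without the Word extensions), flips it once so that entity names become array keys, and also accepts hexadecimal references such as &#x201C;, which goes beyond the check described above.

```php
<?php
// Hypothetical standalone helper, shown for illustration only.
function isKnownEntity($entity) {
    static $byName = null;
    if ($byName === null) {
        // Flip char => entity into entity => char so lookups can use isset().
        $byName = array_flip(
            get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'ISO-8859-1')
        );
    }
    // First check: a named entity from the table.
    if (isset($byName[$entity])) {
        return true;
    }
    // Second check: a decimal or hexadecimal numeric reference.
    return preg_match('/^&#(?:[0-9]+|[xX][0-9A-Fa-f]+);$/', $entity) === 1;
}

var_dump(isKnownEntity('&amp;'));    // bool(true)
var_dump(isKnownEntity('&#8220;'));  // bool(true)
var_dump(isKnownEntity('&bogus;'));  // bool(false)
```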
Finally, our Encode() function will transform special characters into HTML entities, but when it encounters an ampersand, it will read ahead to determine if it's already part of an entity code. If not, it will be transformed into the &amp; entity.
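The read-ahead idea can also be expressed as a regular expression with a negative lookahead. The helper below, encodeBareAmpersands(), is a hypothetical illustration rather than the SmartXML implementation: it only checks that the text after the ampersand is shaped like an entity, does not look the name up in the entity table, and leaves all other characters untouched.

```php
<?php
// Hypothetical helper: encode only the ampersands that do not begin something
// shaped like an entity (a name, a decimal reference, or a hex reference).
function encodeBareAmpersands($text) {
    return preg_replace(
        '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/',
        '&amp;',
        $text
    );
}

echo encodeBareAmpersands('Quick & dirty'), "\n";         // Quick &amp; dirty
echo encodeBareAmpersands('Quick &amp; dirty'), "\n";     // Quick &amp; dirty
echo encodeBareAmpersands('AT&T &#8220;hi&#8221;'), "\n"; // AT&amp;T &#8220;hi&#8221;
```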
Here's the complete code:
<?php
class SmartXML {
    // Returns a character => entity map: PHP's built-in table plus the
    // Windows-1252 characters (bytes 130-159) that the built-in table omits.
    public static function GetExtendedEntityTable() {
        static $entities = null;
        if ($entities === null) {
            // Request the single-byte ISO-8859-1 table explicitly. Encode() walks
            // the text one byte at a time, and PHP 5.4 and later default to UTF-8.
            $entities = get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'ISO-8859-1');
            $entities[chr(130)] = '&sbquo;'; // Single Low-9 Quotation Mark
            $entities[chr(131)] = '&fnof;'; // Latin Small Letter F With Hook
            $entities[chr(132)] = '&bdquo;'; // Double Low-9 Quotation Mark
            $entities[chr(133)] = '&hellip;'; // Horizontal Ellipsis
            $entities[chr(134)] = '&dagger;'; // Dagger
            $entities[chr(135)] = '&Dagger;'; // Double Dagger
            $entities[chr(136)] = '&circ;'; // Modifier Letter Circumflex Accent
            $entities[chr(137)] = '&permil;'; // Per Mille Sign
            $entities[chr(138)] = '&Scaron;'; // Latin Capital Letter S With Caron
            $entities[chr(139)] = '&lsaquo;'; // Single Left-Pointing Angle Quotation Mark
            $entities[chr(140)] = '&OElig;'; // Latin Capital Ligature OE
            $entities[chr(145)] = '&lsquo;'; // Left Single Quotation Mark
            $entities[chr(146)] = '&rsquo;'; // Right Single Quotation Mark
            $entities[chr(147)] = '&ldquo;'; // Left Double Quotation Mark
            $entities[chr(148)] = '&rdquo;'; // Right Double Quotation Mark
            $entities[chr(149)] = '&bull;'; // Bullet
            $entities[chr(150)] = '&ndash;'; // En Dash
            $entities[chr(151)] = '&mdash;'; // Em Dash
            $entities[chr(152)] = '&tilde;'; // Small Tilde
            $entities[chr(153)] = '&trade;'; // Trade Mark Sign
            $entities[chr(154)] = '&scaron;'; // Latin Small Letter S With Caron
            $entities[chr(155)] = '&rsaquo;'; // Single Right-Pointing Angle Quotation Mark
            $entities[chr(156)] = '&oelig;'; // Latin Small Ligature OE
            $entities[chr(159)] = '&Yuml;'; // Latin Capital Letter Y With Diaeresis
            // The table stays keyed by character: Encode() looks characters up by
            // key, and IsValidEntity() searches the entity values.
            ksort($entities);
        }
        return $entities;
    }

    // Returns true if $entity is a known named entity or a decimal numeric entity.
    public static function IsValidEntity($entity) {
        if (array_search($entity, self::GetExtendedEntityTable(), true) !== false) {
            return true;
        }
        // Check for numeric entities
        if (preg_match('/^&#[0-9]+;$/', $entity) == 1) {
            return true;
        }
        return false;
    }

    // Encodes special characters, leaving ampersands that already begin an entity.
    public static function Encode($text) {
        $result = '';
        $entities = self::GetExtendedEntityTable();
        $length = strlen($text);
        for ($cursor = 0; $cursor < $length; $cursor++) {
            $char = substr($text, $cursor, 1);
            if ($char == '&') {
                // Read ahead to the next semicolon to see whether this
                // ampersand already begins an entity.
                $end = strpos($text, ';', $cursor);
                if ($end === false) {
                    $char = '&amp;';
                } else {
                    $entity = substr($text, $cursor, $end - $cursor + 1);
                    if (!self::IsValidEntity($entity)) {
                        $char = '&amp;';
                    }
                }
            } else {
                if (isset($entities[$char])) {
                    $char = $entities[$char];
                }
            }
            $result .= $char;
        }
        return $result;
    }
}
?>
An example of our smart encoding in action:
<?php
$input = 'Quick & dirty';
$output = SmartXML::Encode($input);
echo $output, "\n"; // Quick &amp; dirty
$input = 'Quick &amp; dirty';
$output = SmartXML::Encode($input);
echo $output, "\n"; // Quick &amp; dirty
?>
The SmartXML::Encode() function will properly encode the first string and will leave the second unchanged, so both calls produce Quick &amp;amp; dirty's intended form: Quick &amp; dirty.
In future articles, we'll extend the SmartXML class with more features, such as parsing XML for well-formedness, and scrubbing input with whitelists of permitted elements and attributes.