Entity not found in XML
XML seems very simple at first glance; after all, HTML when properly written is XML, hence XHTML. Once you dive deeper into XML it gets more difficult, and you may encounter a problem like this one: Entity not found.
It usually happens when you load XML and try parsing the data. What is an even bigger problem is finding an answer for this problem because the whole topic seems to be so advanced (nerdy) that most pages on the web expect you to be already proficient in this field.
Now, this is not a quick XML-for-dummies introduction, but to understand what is going on we have to talk about some fundamentals. I will try to keep it very simple, though.
The main problem
When you load an XML file in PHP you are probably using simplexml_load_file or DOMDocument::load. Either way, the main error message is the same: "Entity 'entityname' not found ...". In addition, simplexml adds a few more lines showing the actual line of text.
It should not take you long to realize that one or more of the funny characters is the culprit, maybe a copyright sign &copy; (©) somewhere in a paragraph.
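The error is not specific to PHP. As a rough sketch, the same failure can be reproduced with Python's standard xml.etree.ElementTree parser, which is used here only for illustration. The helper name try_parse and the sample markup are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def try_parse(xml_text):
    """Hypothetical helper: return (True, text) on success, (False, message) on failure."""
    try:
        return True, ET.fromstring(xml_text).text
    except ET.ParseError as err:
        return False, str(err)


# A named entity the parser has never been told about.
ok, detail = try_parse("<p>&copy; Example Corp</p>")
print(ok, detail)
```

The exact wording of the message differs from parser to parser, but the cause is the same: the name copy is unknown to the parser.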
Computers are stupid
We sometimes forget it but computers are incredibly stupid—you have to tell them everything!
Parsing an XML document is no exception. There are in fact a couple of pitfalls in the nature of XML itself, mainly the less-than sign (<) and greater-than sign (>) that delimit tags, plus the double quotes (") around attribute values.
Imagine you want to save some HTML in XML. No doubt, you have some of the signs or double-quotes in your HTML data. Your XML parser will have a problem with them. Here is an example:
<!-- an XML element holding HTML data -->
Some HTML text with a <span class="special">special</span> word
This will not work as intended because the XML parser has no way to know that the <span> tag is meant as data and not as part of the XML structure; it simply treats it as another element. If we want the HTML stored as text in valid XML, we have to tell the parser somehow, i.e. escape these characters. The way we do that in XML is with character references.
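A minimal sketch of the difference, again using Python's standard library for illustration. The wrapper element name htmldata is hypothetical; escape comes from xml.sax.saxutils, and the extra mapping for the double quote is passed in explicitly because escape only handles &, < and > by default.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

html_data = 'Some HTML text with a <span class="special">special</span> word'

# Unescaped: the parser treats <span> as a child element of <htmldata>,
# so the element's own text stops right before it.
raw = ET.fromstring("<htmldata>" + html_data + "</htmldata>")
print(repr(raw.text), len(raw))

# Escaped: the markup characters become references and the HTML survives as text.
escaped = escape(html_data, {'"': "&quot;"})
safe = ET.fromstring("<htmldata>" + escaped + "</htmldata>")
print(escaped)
print(safe.text == html_data)
```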
This brings us to the next pitfall. XML has a basic built-in set of named character references, called entity references (&lt;, &gt;, &amp;, &quot; and &apos;), and miraculously knows what to do with these entities. What is not obvious is that an entity reference like &quot; has to be translated internally. XML parsers look for this ampersand pattern. If they recognize the pattern but the name does not match the limited internal set, they throw our error.
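To make the limited internal set concrete, here is a small sketch (Python's xml.etree.ElementTree, chosen only for illustration) that parses the five predefined entity references without any DTD and then tries one name outside that set. The variable names are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# The five predefined references need no DTD at all.
predefined = ET.fromstring("<p>&lt; &gt; &amp; &quot; &apos;</p>").text
print(predefined)

# Anything outside that set is rejected when no DTD declares it.
try:
    ET.fromstring("<p>&nbsp;</p>")
    nbsp_known = True
except ET.ParseError:
    nbsp_known = False
print(nbsp_known)
```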
The big problem: Next to this limited number of references we have a lot more references to work around some problems with character encoding in different character sets and languages. In fact there is a large list with lots of character references.
Wait a minute, you say. What about HTML and these funny character references? Why is there only a problem in XML and not HTML with these characters? Because we tell browsers how to deal with these "other" character references in HTML.
Ever wondered why web pages have this weird first line of text starting with the word DOCTYPE? No? Well, there is your answer.
Document Type Definition (DTD)
The DOCTYPE itself is meaningless, but the DTD defined inside the DOCTYPE is where all the fun stuff happens. As with CSS or scripts, we can define the data either internally, inside the document, or externally, by providing a file path to the definition. The DOCTYPE in HTML usually has the link to a .dtd file as defined at www.w3c.org. Feel free to copy the link in a DOCTYPE statement from any web page and paste it into the address field; it should pop up the file download window in your browser. Save it and take a look at it. The file is a regular text file, no secrets there.
Inside the file we have all the document type definitions telling the browser what to do with all the tags and attributes, plus some other stuff. Part of this other stuff is the information about the funny character references.
The proper name for the funny characters is Character References. You will find them almost at the top of a dtd file and labeled Character mnemonic entities in the comment line.
There we have three types of entity sets, and you will also see references to file names (as .ent files). Don't try downloading these files; they are nowhere to be found at w3c.org for download. I suspect they don't want you to link to them directly (and indeed you should not), and I think it is no secret that browsers already have them stored internally, as well as the .dtd files. These files rarely change.
So, how do we know what is in those .ent files? If you haven't looked at the whole page in the earlier link above here it is again but this time directly to the entity sets at www.w3c.org. There you see the DTD sets for Latin-1 characters, Special characters and symbols.
Entity versus Numeric
Before we can discuss the solution to our problem with the missing entity error, let's have a few words about what entity (or entities) stands for.
Let's take the copyright sign as an example. Within the DTD we see the following line:
<!ENTITY copy "&#169;"> <!-- copyright sign, U+00A9 ISOnum -->
Note the word copy and the number 169.
When it comes to characters, a computer always works with numbers internally, ultimately with their binary values stored as bytes. This means that browsers and XML parsers always know what to do with &#169;, which is a numeric character reference. Note: For simplicity we leave the hex values out of this discussion.
For humans this is not a fun thing to remember; however, copy for copyright is. Character references with words instead of numbers are called character entity references. Therefore we can say that the DTD for the entity sets is nothing more than a translation or conversion table. We tell the computer program what to do with entity references, i.e. how to translate/convert them into numeric references.
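The conversion-table idea can be seen directly in Python, whose standard library ships the HTML entity names as a plain dictionary (html.entities.name2codepoint). This is only an illustrative sketch; the variable names are assumptions.

```python
from html.entities import name2codepoint
import xml.etree.ElementTree as ET

# Look up the number behind the name "copy".
code = name2codepoint["copy"]
numeric_ref = "&#%d;" % code
print(code, numeric_ref)

# The numeric reference parses without any DTD.
parsed = ET.fromstring("<p>" + numeric_ref + "</p>").text
print(parsed)
```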
You can actually create your own entity references in a DTD. Say you want an entity reference such as &company; you simply have to add the matching ENTITY definition to a DTD.
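As a hedged sketch of such a custom entity, the following declares it in an internal DTD subset and lets Python's xml.etree.ElementTree expand it. The root element note and the replacement text "Example Corp" are made up for this example.

```python
import xml.etree.ElementTree as ET

xml_text = (
    "<!DOCTYPE note ["
    '<!ENTITY company "Example Corp">'  # hypothetical custom entity
    "]>"
    "<note>Written by &company;</note>"
)

# The parser now knows the name and replaces the reference with its value.
note_text = ET.fromstring(xml_text).text
print(note_text)
```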
Now you know why browsers for HTML can secretly work with entities. XML parsers on the other hand (like DOMDocument or simplexml) cannot as long as they don't have any form of DTD telling what to do with entities.
There are two solutions to the problem.
Provide a document type definition
Provide the definition DTD in your XML files either internally or externally with a .dtd file.
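For the internal variant, one possible approach is to generate the ENTITY declarations for just the names a document uses and put them in front of it. The sketch below does this in Python for illustration; with_internal_dtd is a hypothetical helper, the sample markup is invented, and the name-to-number table is taken from html.entities.name2codepoint. It assumes the document has no DOCTYPE of its own yet.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five references every XML parser already knows; never redeclare these.
PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}


def with_internal_dtd(root_name, body):
    """Hypothetical helper: prepend a DOCTYPE declaring the named entities used in body."""
    names = set(re.findall(r"&([A-Za-z][A-Za-z0-9]*);", body)) - PREDEFINED
    decls = "".join(
        '<!ENTITY %s "&#%d;">' % (name, name2codepoint[name])
        for name in sorted(names)
        if name in name2codepoint
    )
    return "<!DOCTYPE %s [%s]>%s" % (root_name, decls, body)


doc = with_internal_dtd("p", "<p>&copy; 2024 Caf&eacute;</p>")
print(doc)
print(ET.fromstring(doc).text)
```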
Here lies the real reason, I believe, why you cannot link to or download the definitions from w3c.org. Imagine everyone added the link to the .ent files at w3c.org to their XML files. Every time an XML file is loaded and an XML parser kicks into gear, a connection to their servers would be opened. An absolute waste of Internet bandwidth for three files that rarely, if ever, change.
So if anything, keep these files local or, better yet ...
Avoid character entity references
Instead, work exclusively with numeric character references.
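A conversion of this kind could look roughly like the sketch below, written in Python for illustration. named_to_numeric is a hypothetical helper; it relies on the html.entities.name2codepoint table, leaves the five predefined XML references alone, and passes unknown names through unchanged.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# References every XML parser understands; these must stay as they are.
PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}


def named_to_numeric(text):
    """Hypothetical helper: rewrite named entity references as numeric ones."""
    def replace(match):
        name = match.group(1)
        if name in PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave untouched
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


converted = named_to_numeric("<p>&copy; A &amp; B&nbsp;Inc.</p>")
print(converted)
print(ET.fromstring(converted).text)
```

Running the source text through such a function before parsing means the parser only ever sees numeric references and the five predefined names, so no DTD is needed.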
I will soon add some links to a DTD file you can download and then link into your XML files locally, as well as a script to translate/convert entity references into numeric references and vice versa. Choose whichever solution you prefer for your environment.