July 28, 2006
Here's a helpful article that documents some common pitfalls to avoid when composing XML documents. Nobody wants to be called an XML Bozo by Tim Bray, the co-editor of the XML specification, right?
There seem to be developers who think that well-formedness is awfully hard -- if not impossible -- to get right when producing XML programmatically and developers who can get it right and wonder why the others are so incompetent. I assume no one wants to appear incompetent or to be called names. Therefore, I hope the following list of dos and don'ts helps developers to move from the first group to the latter.
- Don't think of XML as a text format
- Don't use text-based templates
- Use an isolated serializer
- Use a tree or a stack (or an XML parser)
- Don't try to manage namespace declarations manually
- Use unescaped Unicode strings in memory
- Use UTF-8 (or UTF-16) for output
- Use NFC
- Don't expect software to look inside comments
- Don't rely on external entities on the Web
- Don't bother with CDATA sections
- Don't bother with escaping non-ASCII
- Avoid adding pretty-printing white space in character data
- Don't use
- Use XML 1.0
- Test with astral characters
- Test with forbidden control characters
- Test with broken UTF-*
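A few of the items above (use an isolated serializer, keep unescaped Unicode strings in memory, output UTF-8, test with astral characters) can be sketched with Python's standard-library `xml.etree.ElementTree`. This is only an illustration, not code from the linked article; the `build_feed` helper and the element names are made up for the example.

```python
import xml.etree.ElementTree as ET


def build_feed(title, note):
    """Build a tiny document as a tree and let the serializer do the escaping.

    `build_feed`, <feed>, <title> and <note> are hypothetical names.
    """
    root = ET.Element("feed")
    ET.SubElement(root, "title").text = title  # plain, unescaped string
    ET.SubElement(root, "note", {"kind": 'a "quoted" value'}).text = note
    # Serialize once, at the edge of the program, as UTF-8 bytes.
    return ET.tostring(root, encoding="utf-8")


# Markup characters and an astral character (U+1D11E) go in unescaped.
xml_bytes = build_feed("Tom & Jerry <live>", "clef: \U0001D11E")

# Round-trip through a parser to check well-formedness and content.
parsed = ET.fromstring(xml_bytes)
print(parsed.find("title").text)
```

The application code never touches `&amp;` or `&lt;`; escaping happens only inside the serializer, and the parser round-trip doubles as a cheap well-formedness test.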
I'm a little ambivalent about XML, largely due to what John Lam calls "The Angle Bracket Tax". I think XSLT is utterly insane for anything except the most trivial of tasks, but I do like XPath -- it's sort of like SQL with automatic, joinless parent-child relationships.
But XML is generally the least of all available evils, and if you're going to use it, you might as well follow the rules.
Posted by Jeff Atwood
I only have occasional need to deal with XML at present, so I might well be an unwitting Bozo. But many of these rules expressed as Don'ts leave questions unanswered. For example, #5: if you don't use an XML parser, what do you use?
I on the other hand love xslt :). I've yet to run into a problem that requires an impractical solution. And with grouping, regexp and all the goodies of xpath 2.0 it's even easier to use.
I have rule nr. 1 taped on the wall behind my desk. Whenever someone comes in with an xml-related issue I simply point to the poster. This is usually all it takes :).
Talking of bozos, #17: Can we find the person who came up with the term "astral plane" and beat them to death with their own dungeons and dragons books? Please?
If you need that many rules to get your document format right, you might want to think about a different format.
When I first joined this organisation, it used an XML database called Tamino, which then used XSL files to create webpages, along with the help of some Java.
The whole system was massive, complex and bloody slow.
I re-developed the whole thing using SQL Server 2000 and asp.net pages. It is a fraction of the size, runs much faster and it's very easy to make changes, unlike the XSL system :yuck:
Jeff, it should be noted that using XSLT will (pretty much) guarantee that your output will be conformant to all of those rules for generating XML.
So although on one hand you say "XSLT is insane", on the other hand this entire post seems to be an argument in favour of it.
I would like to add a question to that list - Does this problem really require XML ? (Think Ant).
Here's a good list of things to consider when writing XML:
And when converting HTML to XHTML
You obviously are not a good XSLT programmer.
And of course he recommends... the serializer/XmlWriter! Yes, let's all write at least 3 lines of code for every element, more if there are attributes!
I don't have a problem with XML, but the notion that it's perfectly okay to expect developers to write 500 lines of code comprising 46 routines and 13 classes just to spawn a single document sounds characteristic of an Architecture Astronaut.
Maybe text-based templates aren't the answer either, but you can use a single routine to escape a full XML string without the ridiculous overhead of a "writer". IMO, in order for XML to really be productive for developers, the dev tools either have to serialize it automatically (.NET Web Services), or allow it to be written "natively" (Ruby / XLinq). Without simplified support, I'd have to ask if the same problem could be solved with plain-text/CSV or an RDBMS.
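For what it's worth, the "single routine" approach might look something like the sketch below, using `escape` and `quoteattr` from Python's standard-library `xml.sax.saxutils`. The `element` helper is hypothetical, and it assumes the element and attribute names come from the programmer rather than from user input. Note that escaping alone is not enough: some control characters are forbidden in XML 1.0 even as character references, so the sketch rejects a subset of them up front.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# A subset of the characters XML 1.0 forbids outright: C0 controls other
# than tab, LF and CR, plus the noncharacters U+FFFE and U+FFFF.
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def element(name, text, **attrs):
    """Hypothetical one-routine helper: returns '<name a="v">text</name>'.

    `name` and the attribute names are assumed to be valid XML names.
    """
    for value in (text, *attrs.values()):
        if _FORBIDDEN.search(value):
            raise ValueError("character not allowed in XML 1.0")
    # quoteattr() escapes the value and adds the surrounding quotes.
    attr_str = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
    return f"<{name}{attr_str}>{escape(text)}</{name}>"


print(element("a", "Tom & Jerry", href="?a=1&b=2"))
```

This keeps the per-element cost to one call, but nesting, namespaces and output encoding are still the caller's problem, which is roughly the trade-off being argued about here.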