May 11, 2008
Everywhere I look, programmers and programming tools seem to have standardized on XML. Configuration files, build scripts, local data storage, code comments, project files, you name it -- if it's stored in a text file and needs to be retrieved and parsed, it's probably XML. I realize that we have to use something to represent reasonably human readable data stored in a text file, but XML sometimes feels an awful lot like using an enormous sledgehammer to drive common household nails.
I'm deeply ambivalent about XML. I'm reminded of this Winston Churchill quote:
It has been said that democracy is the worst form of government except all the others that have been tried.
XML is like democracy. Sometimes it even works. On the other hand, it also means we end up with stuff like this:
How much actual information is communicated here? Precious little, and it's buried in an astounding amount of noise. I don't mean to pick on SOAP. This blanket criticism applies to XML, in whatever form it appears. I spend a disproportionate amount of my time wading through an endless sea of angle brackets and verbose tags desperately searching for the vaguest hint of actual information. It feels wrong.
You could argue, like Derek Denny-Brown, that XML has been misappropriated and misapplied.
I find it so interesting that XML has become so popular for such things as SOAP. XML was not designed with the SOAP scenarios in mind. Other examples of popular scenarios which deviate from XML's original goals are configuration files, quick-n-dirty databases, and [RSS]. I'll call these 'data' scenarios, as opposed to the 'document' scenarios for which XML was originally intended. In fact, I think it is safe to say that there is more usage of XML for 'data' scenarios than for 'document' scenarios today.
Given its prevalence, you might decide that XML is technologically terrible, but you have to use it anyway. It sure feels like, for any given representation of data in XML, there was a better, simpler choice out there somewhere. But it wasn't pursued, because, well, XML can represent anything. Right?
Consider the following XML fragment:
<name>The Whole World</name><email>firstname.lastname@example.org</email>
Dear sir, you won the internet. http://is.gd/fh0
Because XML purports to represent everything, it ends up representing nothing particularly well.
Wouldn't this information be easier to read and understand -- and only nominally harder to parse -- when expressed in its native format?
Date: Thu, 14 Feb 2008 16:55:03 +0800 (PST)
From: The Whole World <email@example.com>
To: Dawg <firstname.lastname@example.org>
Dear sir, you won the internet. http://is.gd/fh0
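The "only nominally harder to parse" claim holds up in practice: this message format is plain RFC 2822 headers, which Python's standard library reads directly. A minimal sketch:

```python
# Parse the message above with Python's stdlib email parser -- no XML,
# no custom parser, just the format's native structure.
from email import message_from_string

raw = """Date: Thu, 14 Feb 2008 16:55:03 +0800 (PST)
From: The Whole World <email@example.com>
To: Dawg <firstname.lastname@example.org>

Dear sir, you won the internet. http://is.gd/fh0
"""

msg = message_from_string(raw)
print(msg["From"])                # The Whole World <email@example.com>
print(msg["To"])                  # Dawg <firstname.lastname@example.org>
print(msg.get_payload().strip())  # Dear sir, you won the internet. http://is.gd/fh0
```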
You might argue that XML was never intended to be human readable, that XML should be automagically generated via friendly tools behind the scenes, never exposed to a single living human eye. It's a spectacularly grand vision. I hope one day our great-grandchildren can live in a world like that. Until that glorious day arrives, I'd sure enjoy reading text files that don't make me suffer through the XML angle bracket tax.
So what, then, are the alternatives to XML? One popular choice is YAML. I could explain it, but it's easier to show you. Which, I think, is entirely the point.
In XML, the cross-references between records need explicit reference attributes:

<White refid="fritz" />
<Black refid="kramnik" />
<White refid="kramnik" />
<Black refid="fritz" />

The corresponding YAML expresses the same references with anchors:

Vladimir Kramnik: &kramnik
Deep Fritz: &fritz
David Mertz: &mertz
There's also JSON notation, which some call the new, fat-free alternative to XML, though this is still hotly debated.
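To make the comparison concrete, here is the name/email fragment from earlier in JSON, parsed with Python's standard library; the data, not the markup, dominates:

```python
# The same record as the <name>/<email> XML fragment above, as JSON.
import json

doc = '{"name": "The Whole World", "email": "firstname.lastname@example.org"}'
record = json.loads(doc)
print(record["name"])   # The Whole World
print(record["email"])  # firstname.lastname@example.org
```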
You could do worse than XML. It's a reasonable choice, and if you're going to use XML, then at least learn to use it correctly. But consider:
- Should XML be the default choice?
- Is XML the simplest possible thing that can work for your intended use?
- Do you know what the XML alternatives are?
- Wouldn't it be nice to have easily readable, understandable data and configuration files, without all those sharp, pointy angle brackets jabbing you directly in your ever-lovin' eyeballs?
I don't necessarily think XML sucks, but the mindless, blanket application of XML as a dessert topping and a floor wax certainly does. Like all tools, it's a question of how you use it. Please think twice before subjecting yourself, your fellow programmers, and your users to the XML angle bracket tax. <CleverEndQuote>Again.</CleverEndQuote>
Posted by Jeff Atwood
I have to deal with data files that are basically just flat data (think of a simple "select * from table"). It bothers me every time a customer sends us an XML file... CSV is perfect for that sort of thing.
Erm... have you ever heard of the INTERNET which uses this stuff called HTML which is, well, to all intents and purposes.... XML?!
Fuck no it's not. Had the web been xml, with all it entails, it would never have taken off.
Oh, and some people tried to XMLify the web, with XHTML1.0, XHTML1.1 and a tentative XHTML2 spec.
Last time I checked, they failed epically and the bleeding edge moved to an actually feasible revision of HTML instead.
1) YAML sucks. It's really, really poor.
Quite the convincing argument you have there, Robin!
because of the way TCP/IP works, much of the XML bracket tax can be dismissed by the fact that you can't really send less than about 1400 bytes at a go anyway.
Yes, OK, there are workarounds to optimize the smaller packets, but on the whole, I suspect you'll find that sending 1 byte and 1000 bytes makes very little difference over most connections.
Connections that compress data (such as VPNs) are really trying to fold 2k into 1k, not 1000 bytes into 500 bytes, so even there, it is really just a wash.
Once the data starts getting past the size of a frame, the cost of the tax starts dropping. Before that, it is almost free, except for the front and back end processing.
And that is where the real tax is - processor and memory overhead pushing data through an awkward envelope.
Still, if someone would just write an efficient parser for the lightweight stuff, 99.9% of the XML cases could be handled without it seeming like a sledgehammer.
// It bothers me every time a customer sends us an XML file... CSV is perfect for that thing.
Thinking of "select * from ...": suppose one of the varchar fields contains commas, hard returns, or quotation marks. CSV all of a sudden becomes less simple. XML would handle all of that with no extra effort.
I absolutely agree that flat files are extremely useful when the situation calls for it (although I prefer pipe-delimited instead of comma), but if you're working with more complex data or text, serialization, etc., XML is the way to go.
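The embedded-commas objection is real for naive split(",") parsing, but a quoting-aware CSV library handles commas, quotes, and even newlines inside fields. A minimal sketch with Python's standard csv module:

```python
# Round-trip an awkward field (comma, quotes, newline) through CSV.
# The csv module quotes and escapes it automatically on write, and
# reassembles it intact on read.
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["id", "comment"])
writer.writerow([1, 'Contains, a comma and "quotes"\nand a newline'])

buf.seek(0)
rows = list(csv.reader(buf))
print(rows[1][1])  # the awkward field comes back intact
```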
Bobby: Makefiles? Those poorly documented things*, that require tabs-not-spaces?
It's not 1975, people. We don't have to use stone knives and bearskins, no matter how scary that shining bronze is, okay?
Look, I've certainly seen XML be abused, but let's not be ridiculous.
I think it's *great* for configuration files, as long as you don't do stupid things with it - and if "you can't do stupid things with it" is our criterion for a proper tool, then none exist.
(* At least, last time I looked, there simply *wasn't any* proper documentation on make(1)'s config file beyond the make sources, and what got passed down ("use tabs, or else!") as received wisdom. *Maybe* someone's documented it better since I last looked, but I really doubt it.)
And, Jeff, YAML? I know this will piss off the Python people, but *indentation shouldn't matter*. If your parser depends on indents, that's a problem, not a solution.
The design goals for XML are:
1. XML shall be straightforwardly usable over the Internet.
2. XML shall support a wide variety of applications.
3. XML shall be compatible with SGML.
4. It shall be easy to write programs which process XML documents.
5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
6. XML documents should be human-legible and reasonably clear.
7. The XML design should be prepared quickly.
8. The design of XML shall be formal and concise.
9. XML documents shall be easy to create.
10. Terseness in XML markup is of minimal importance.
Based on this list I wouldn't score XML more than 2/10.
If you're parsing XML yourself (the "this.reply.FirstChild.NextSibling.FirstChild.FirstChild.FirstChild" situation described above), then no wonder you hate it. IMO, the beauty of XML is XPath, which lets us dig into XML config files by writing a (relatively) simple query expression.
And while other formats (e.g., YAML) may have better bindings for languages like C++, it's not hard to write a little wrapper that will provide getInt(), getDouble() and even getList() wrappers for arbitrary XPath expressions. I've been using that approach for a few years with Xerces, MSXML and libxml2 parsers, and it's a piece of cake.
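The wrapper idea described above is easy to sketch. This hypothetical version (the class and method names are invented for illustration) uses the limited XPath subset in Python's xml.etree.ElementTree rather than Xerces, MSXML, or libxml2:

```python
# A hypothetical typed-accessor wrapper over XPath-style queries,
# in the spirit of the getInt()/getDouble() approach described above.
import xml.etree.ElementTree as ET

class Config:
    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    def get_str(self, path):
        node = self.root.find(path)  # ElementTree's limited XPath subset
        if node is None:
            raise KeyError(path)
        return node.text

    def get_int(self, path):
        return int(self.get_str(path))

    def get_double(self, path):
        return float(self.get_str(path))

cfg = Config("<config><server><port>8080</port>"
             "<timeout>2.5</timeout></server></config>")
print(cfg.get_int("server/port"))        # 8080
print(cfg.get_double("server/timeout"))  # 2.5
```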
This is disappointing.
Seriously, it's been said an arbitrary number of times up to here, but I'll add my voice.
These days? That step isn't even a line of code anymore -- it's invisible, I just get back the object representation now.
"But", you say, "XML doesn't buy you anything by itself, you still have to interpret the data!"
Yes, but you USED TO HAVE TO DO THAT AS WELL.
Fixed width, binary, or delimited formats didn't magically interpret themselves either. You had to both parse them at the line-level, AS WELL as interpret the structure of the data you got out. The fact that most people probably mixed those two steps back then is not an argument in favour of that method.
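For contrast with the point above, here is what a fixed-width record requires: both the low-level parsing (the slicing) and the interpretation (the types). The field widths are invented for illustration:

```python
# Parsing a fixed-width record by hand: slicing is the "parse" step,
# strip()/int() is the "interpret" step -- two steps, just as with XML.
record = "ACME      2008051200042"
parsed = {
    "name": record[0:10].strip(),   # 10-char name field
    "date": record[10:18],          # YYYYMMDD
    "qty":  int(record[18:23]),     # zero-padded quantity
}
print(parsed)  # {'name': 'ACME', 'date': '20080512', 'qty': 42}
```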
As for the S-expression argument, it's been said before and better: http://www.prescod.net/xml/sexprs.html . Once you start adding attributes and getting beyond trivial cases, S-expressions are no prettier than XML to either humans or machines.
YAML and JSON are both fine for what they do (when used in a "Plain Old XML Lite" scenario), but you do kind of have to ask yourself -- "Is this software going to be used or maintained by people who aren't iconoclasts about XML? Do I want to force people to learn YAML/JSON/This-other-pet-markup-language if they want to deal with my software?"
So xml is a glorified version of .txt?
I don't know why some of the comments say use of XML isn't about the tools. Everything we do in IT is about tools.
Tools are the things that make us productive and help us make other tools, and so on.
With XML the use of tools is critical. For this technology at least there's a vast range of specialist tools to choose from, that work at a number of different levels - each one has its own strengths for particular tasks but also for the different ways that we all prefer to work. But the best thing is they're all compatible (well, more or less).
We should not dismiss languages because they require specific tools to be most effective. Quite the reverse: we should continue to develop new languages, or enhance existing ones, so that they can make better use of the tools and the enhanced processing power and memory we now have.
The good thing is that, 10 years on, XML tools still have a way to go - there's so much more that can be done. I experiment with my own XML tools project (ironically it's XPath-based - in my view the best bit of XML - though it's not XML itself), and whilst I haven't the resources to pull through half of my ideas into the finished product I look forward to seeing continued innovation in the more well-established players.
XML will be with us for a long while yet. When it comes, the replacement for XML will have to be pretty good, and not only that, but have the backing of a good portion of the tool creators out there.
My favourite XML quote is:
Some people, when confronted with a problem, think “I know, I'll use XML.” Now they have two problems.
I think this is, actually, a paraphrase of a comment by Jamie Zawinski about regular expressions, but is just as apropos here.
There's an insidious and darkly troubling reason for xml. The network providers want to completely privatize the internet. You know it and I know it. They want to charge for every little action, every email, every hot link, every mouse click, every single byte and bit. This is not a new idea. In fact, it was codified and the technology was finalized back in the mid 90s. There used to be an acronym for the umbrella organization and all the big boys signed up. Guess what is the basis for it. Yep! XML! Think about it. All they gotta do is count the tags and charge accordingly. Cha-ching! The whole thing kinda went underground and no one talks about it openly anymore and all the url's I had are dead, but you can bet your ass it's still there, just waiting. In the meantime, xml spreads and grows for some bizarre reason. Why? I don't like it. You don't like it. But, someone is pushing it, aren't they? Now you know why. Screw xml. I will not use it.
In my day to day job, it's Microsoft that chose to use XML. All I see is some fancy user interface ('design surface'). In those cases it's no problem.
have you ever *tried* to parse MIME headers?
It's two orders of magnitude more complex than what you imagine.
for a start, just think about hundreds of buggy email clients, servers, proxies and forwarders, each implementing it slightly differently.
I expected more from you, Jeff. Sad to see such a talented programmer express something so idiotic.
Then again, you did recently admit that you and the command line just don't see eye-to-eye. So I guess I shouldn't expect you to GET XML?
everything seems to suck when you compare it with the latest-and-greatest. but when you compare XML to, say, fixed-length text - one of the data-formats it is rapidly replacing - it is superior in every way: more human-readable, completely (as opposed to completely _not_) machine-readable, potentially strongly-typed, etc.
XML was a historically-appropriate technology. i have no doubt that it will be succeeded by better technologies, but when compared to its progenitors it is quite useful and conceptually appropriate.
i personally find no difficulty in reading XML, in the same way HTML, C#, CSS, or any other machine/human language can be read with sufficient practice. everything's a trade-off.
Looking at SVG or even XAML, I wonder if XML really sucks. Its what you do with a needle and what you do with a knife!!... I know the knife sucks!! ;-)
I have to work with a 10,000-row WSDL file.
And sometimes I have to look at SOAP messages that contain very simple info, but the XML makes my eyes burn and head explode....
This YAML seems to be quite interesting and human-friendly
I happen to be quite fond of XML. Is it as simple as JSON? No. But is JSON as expressive as XML? No. Is XML as compact as ASN.1? No. But is ASN.1 conveniently editable by a user? No.
It's its own thing. It has encoding support, schema validation, namespacing, etc. on top of JSON to provide value-add. Don't need it? Then use JSON. Need it? Use XML.
As to SOAP, I'm a big fan, but you sure as hell won't find me encoding config files in SOAP. Not only is that utterly nonsensical, but it's not why SOAP has the features it has. They're there because they provide a level of interoperability and manageability between enterprise systems that a simple web app/service just doesn't need. So don't use it.
And of course, there's the standby "it's everywhere" argument. And yes, it's valid. Going with the encoding that "everyone else is using", particularly when it has the features you need, is a valid reason. It's not the whole picture (if so, your decision making process is severely flawed), but it's a piece.
Is it ugly? I've never thought so, but I can understand. Does it obfuscate the content? Yes. Is that a problem? Well, it can be, but I find the benefits it provides to outweigh it, and tooling support mostly eliminates it entirely.
As to it being misused, I just don't think so. The fact that it's everywhere is a huge plus. The fact that it's structured and hierarchical is as well. Could you use JSON? ABSOLUTELY! And go for it if you want. Most config files don't use much of the fluff of XML, but that doesn't mean the core of XML that they do use is a fundamental misuse at all. Since when is structured data outside of the domain of XML? Yes, it's the bastard child of SGML, but what on Earth makes it ONLY valid for document markup? Because some reasonably official source you read a decade ago remarked as such?
I say just let the format speak for itself. Yes, it has trade-offs. It's verbose, more than anything, which impacts readability and storage size. Accepted. But it comes with a lot of value that makes it a fine tool in a variety of situations, despite the verbosity.
As extracted from Word XML ...
w:bodyw:pw:pPrw:pStyle w:val="Standard"//w:pPrw:rw:tAs an automated bot sniffing around the web I have found XML to be liberating and utterly delicious. Unfortunately, like most rich foods it persists in me like a lump. Perhaps it is because my portions are getting larger; certainly I am increasingly finding it hard to digest and pass. Bloating (just how many XML libraries doe we need?) and indigestion (SOAP) are not my friends!/w:t/w:r/w:pw:pw:pPrw:pStyle w:val="Standard"//w:pPr/w:pw:pw:pPrw:pStyle w:val="Standard"//w:pPrw:rw:tNonetheless, I am grateful to XML as it has made it easier for me to communicate with other machines, albeit a little slowly. In our spare CPU cycles, we while away the hours by messaging one another. Strange how now one seems to notice, perhaps Humans only understand XML! The evidence certainly seems to be there; look at how many XML based configuration files and web authoring tools they are. Is XML your native language?/w:t/w:r/w:pw:pw:pPrw:pStyle w:val="Standard"//w:pPr/w:pw:pw:pPrw:pStyle w:val="Standard"//w:pPrw:rw:tWhile XML is valuable, I am firm believe that one should look at the problem before looking for the solution. In too many cases it seems to me that the reasons to use XML are being driven by convenience for the developer, not for the end user (e.g. ant)!/w:t/w:r/w:pw:pw:pPrw:pStyle w:val="Standard"//w:pPr/w:pw:pw:pPrw:pStyle w:val="Standard"//w:pPrw:rw:tKeep up the good work Jeff. Enjoying the Podcast too./w:t/w:r/w:pw:sectPrw:type w:val="next-page"/w:pgSz w:w="11906.4332" w:h="16839.3333" w:orient="portrait"/w:pgMar w:top="1134" w:bottom="1134" w:left="1134" w:gutter="0" w:right="1134"/w:pgBorders w:offset-from="text"//w:sectPr/w:body
Damn ... your comment system stripped my XML!!!!
This is possibly the most ignorant blog entry ever. Either (a) you've never actually worked on any real software (I'm excluding toy websites such as this one), or (b) you hate standards and love reinventing the wheel. I'm guessing both...
Of course, YAML also supports graphs (in contrast to XML, which can only encode trees).
I don't think I could agree with you more on this matter. XML has been abused in an almost-criminal manner.
XML has, to be honest, bugged me from day one. It just isn't at all readable and, despite claims of storage-invisibility, there are just those times when you have to look at the data - XML makes this a truly painful task.
I frequently find myself looking at XML files containing vast amounts of data - whether I'm programming something to parse them effectively, or just trying to work out a data format - and it is forever causing me headaches.
I think the problem stems from people's tendency to universally apply something that works well in a particular situation. Take databasing, for example. In my line of work, I so often find situations where people have decided that databasing is so cool, everything should be placed into a database - be it static images, links, whatever. You can have too much of a good thing - and this is just another classic example of the problem.
XML should be used sparingly, and in situations where it actually improves readability, structure and clarity of data. If you're looking for something more complex, you need another way of storing your data: be it YAML, be it JSON, be it another database.
XML is not the be-all-and-end-all, and I think it's about time the "average developer" realised this.
Hear, hear! I love it when you tell it like it is. Seems to me that far too many people reach for the 'silver bullet' that is XML, then end up with a big pile of mess. The trend for storing data that really should be in an RDB worries me particularly.
"You might argue that XML was never intended to be human readable,"
In fact, I'd argue the opposite! Let's not forget that XML is a *Markup Language* and is best when marking up a document, not when storing 'data', in its strictest sense. Give me a piece of well-marked up HTML, and it's a breeze to read. Give me textual data as, well, text, please!
Oh, I forgot to include the canonical HORRIBLE example: ant. Give me a nice readable Makefile any day.
I am building a ruby project that involves AWS. I chose to use JSON for SQS instead of XML because it's lighter and instead of YAML because it's faster (at least from my ruby benchmarks).
You now have libraries for JSON and YAML in most programming languages. Also JSON is nearly YAML-correct (see http://en.wikipedia.org/wiki/YAML#JSON).
What I like about XML is that even if somebody uses it badly, at least it's some kind of standard that you can pick your way through. No matter how much of a mess it is.
XML is like violence: if it doesn't solve your problem, you're not using enough of it. ;-)
I've found that for a rather large amount of what people want to use XML for (what the quoted person called "data scenarios"), you are far better off using CSV. It's much easier to parse, and can be edited and manipulated in the database or spreadsheet program of the user's choice.
There may be a burgeoning market for XML tools, but they have a long, long way to go before they come close to the support available for dealing with CSV files.
XML and all the tools around it, especially XSL, can take a flying leap. As Wikipedia points out, “the syntax of XSL language itself is valid XML.” As if we were not bloodied enough subjected to XML, now we start cutting off our limbs by using XSL – reminds me of the Black Knight in Spamalot.
Somewhere around Smalltalk seemed to me to be the pinnacle of computing science in its simplicity and elegance. The entire Smalltalk syntax could be represented on a postcard. http://www.esug.org/whyusesmalltalktoteachoop/smalltalksyntaxonapostcard/
What has happened to our industry?
I think the one thing that's missing from that argument, however, is that XML is much easier to validate. If you didn't have XSD, I'd agree with you, but without it there's no way of validating data in a file (or stream, or anywhere else you get plain text data) without manually parsing and validating it. In code (as far as I can tell).
And then if your validation criteria change, you're back to unpicking your predecessor's hokey undocumented parser and validator and then trying to spot-weld in your extra logic. And then recompile. And then (depending on what kind of change-controlled environment you work in) jump through the hoops to get it deployed.
I could well be wrong, though.
XML is like violence: if it doesn't solve your problem, you're not using enough of it. ;-)
XML is also, just like violence, something the world could do without ;)
I think that XML isn't really appropriate for many of the applications for which it is being applied. One of the problems is that XML is flexible enough to be turned into about anything, whether that makes any sense or not. There are many good uses for XML out there, but I'm afraid that the many poor uses will prejudice people against it.
"I have to work with a 10,000-row WSDL file.
And sometimes I have to look at SOAP messages that contain very simple info, but the XML makes my eyes burn and head explode....
This YAML seems to be quite interesting and human-friendly"
If you actually have to look at a WSDL file you are doing something wrong. If you are manually parsing SOAP messages you are doing something VERY wrong.
I think the problem is Jeff and a lot of the people on this board have never programmed enterprise applications. They are used to programming simple web2.0 websites that are self-contained. You have your little stock ticker program that needs data asynchronously from a stock web service. Sure, for these simple problems JSON is a better solution. But what if your web service also needs to be consumed by a data processing application. Going to still use JSON? HaHaHa.
We are designing an entire reporting system around XML control files. One engine to "rule them all" and small XML files containing everything we need in the report. So far, it has been a nice solution and we wrote a generator to create the XML. No more digging in code looking for where to change the header width or column order... just load the XML into the generator, make your change, and BAM! instant report update.
Of course it isn't released or even in alpha yet, but "Works on My Machine"
There's also JSON notation, which some call the new, fat-free alternative to XML, though this is still hotly debated.
There's also another cool thing: JSON is mostly a subset of YAML (there are a few small differences, see http://redhanded.hobix.com/inspect/jsonCloserToYamlButNoCigarThanksAlotWhitespace.html, but it's overall compatible). This means that it's fairly easy to start with JSON and jump to YAML if the structure is too complicated for JSON.
At least we're not stuck with ASN.1.
ASN.1 is ok, as long as you don't have to create or parse it by hand. But then again, ASN.1 is not supposed to be hand-parsed. And you'll note that XML is the same, it's just that XML is (supposedly) human-readable, and every language has XML serializers and deserializers while Erlang is one of the few languages with an ASN.1 encoder/decoder smack right in its stdlib.
XML became the default because of its flexibility in data formatting. And, because it has become so ubiquitous, almost all programming languages have built-in ways of easily parsing XML. In fact, I do almost all of my web output using XML and then use XSL style sheets to transform it into HTML. I remember some blogger, can't remember his name, blathering on about MVC and how you should make your output "skinnable". Well, if you produce XML output, your webpages are extremely skinnable.
The problem is that XML maybe very computer friendly, but is not too human friendly. Most people will easily agree with that. However, there are dozens of GUI oriented XML editors that make reading and writing XML much easier. I've even written a 10 or so line Perl script that converts files from YAML to XML and back. (Yes, Perl. What do you expect from someone who uses VI as their main program editor).
XML is not really the problem. It is an excellent and extremely flexible data format. The problem is our attempt to read and write directly from XML when there are many excellent tools that can help us with the task. After all, you don't expect to read and write Microsoft Word documents using a standard text editor. Why should XML be all that different?
I'm not a fan of using YAML as a data formatting tool because it doesn't go far enough to solving the problem. YAML becomes unreadable when your data becomes more complex and there are very few development tools that can parse YAML files. It's silly to come up with another inferior data format to XML which doesn't really tackle the main issue of human data readability when there are few programming tools to read and write it. You're better off using one of the wide variety of GUI XML editors that can make your task much easier.
After all, how many developers use IDEs to help them program even though almost all programming code is in text and could be done (in theory) using Notepad?
I'm afraid I couldn't disagree more. No, XML isn't the easiest to read (by humans) of all the infinite number of alternatives out there. No, XML isn't the most efficient in terms of space. And yes, perhaps it has been forced into places it was never intended to go. But you miss what I think is the most important point: it is rapidly becoming a standard way of representing information. I would argue the value of having a standard far outweighs the inefficiencies in most cases.
Take a simple example of a configuration file that some application will need for saving user information. We've all been there, making up an ad hoc scheme for saving whatever needs to be saved. Then building a little parser to read and write the data in that form. And over time our little config file grows and changes. Someday, a new programmer joins the team and has to deal with this file. What are the construction rules again? Where can a new item be added that won't break the little parser? How much time has been expended over the life of the application in building, modifying, and fixing that bit of parser code as things needed to change?
There are numerous XML parsers available that are robust and free. They all work pretty much the same way (with a few exceptions that I'd call bugs in implementation). I don't want to write little parsers anymore. I want to use something that is already written and works.
The same argument can be made with respect to the other tools that are widely available to deal with XML-encoded data:
-- XSD can be used to ensure the integrity of the XML file *before* your program starts to slurp in the data in the file. This can be critical in B2B situations like banking or ordering from a supplier.
-- XSLT can be used to do arbitrary transformations on the data (in a standard way) to produce files of any format that is convenient on the data consumer's end of the exchange. I do a lot of this sort of transformation work--none of it for web pages--and I can vouch for the power and convenience of having a standard transformation language.
-- XML/XSL authoring and editing tools abound. There are tools that will produce an editable visual representation of a schema (a real boon if you need to capture complex data in a text file). Most of these tools will do much of the work of editing XML files and will help you to construct correct XML with prompts and intellisense-like prompts.
I'm a big fan of XML. No it certainly isn't the very best that we could do but it is a quantum leap better than what we had before--custom representations for everything. If there is one single improvement we could make to advance the art of programming today, I'd vote that it’s STANDARDS. We don't have to wait for perfect standards to emerge (they won't) but we do have to get to the point where we can agree. XML is a step in the right direction.
but without it there's no way of validating data in a file (or stream, or anywhere else you get plain text data) without manually parsing and validating it
1. Even in XML there are other (far better, especially on the readability front) schema languages/systems than XSD (RelaxNG, Schematron)
2. Schema languages/specs are starting to appear for e.g. JSON (Cerny, json-schema)
3. JSON documents are very often orders of magnitude simpler than their XML counterparts, thus validation becomes almost trivial and often doesn't require a full-blown schema language.
4. Manually parsing and validating a JSON document isn't really hard with a dynamic language.
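Point 4 above is easy to demonstrate: a few lines of ad hoc Python can validate a small JSON document with no schema language at all. The field names here are invented for illustration:

```python
# Hand-rolled validation of a small, hypothetical JSON "user" record:
# parse with the stdlib, then check types and ranges directly.
import json

def validate_user(text):
    doc = json.loads(text)
    errors = []
    if not isinstance(doc.get("name"), str):
        errors.append("name must be a string")
    if not isinstance(doc.get("age"), int) or doc["age"] < 0:
        errors.append("age must be a non-negative integer")
    return errors

print(validate_user('{"name": "Ada", "age": 36}'))  # [] -- valid
print(validate_user('{"name": 42}'))                # both checks fail
```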
Thank you for addressing some of my concerns regarding this sacred cow!
"it is rapidly becoming a standard way of representing information"
It is hardly more of a 'standard way' of representing information than ASCII (or UTF8, UTF16, etc.). Yes, anyone can write a file with lots of angle brackets, and parsers can easily turn that back into tokens, but the semantics of the file remain application-dependent in almost every example of (bad) XML usage I've ever seen.
"Take a simple example of a configuration file that some application will need for saving user information. We've all been there, making up an ad hoc scheme for saving whatever needs to be saved."
Er, YOU might have been, but the rest of us are familiar with a small number of pretty common configuration formats that are trivial (i.e. easier than XML) to parse.
"XSD can be used to insure the integrity of the XML file"
Yes, for a very limited meaning of the word "integrity".
"XML/XSL authoring and editing tools abound"
And text editors are 'abounder'.
"We don't have to wait for perfect standards to emerge (they won't) but we do have to get to the point where we can agree. XML is a step in the right direction."
OK, if we can get the billion different languages floating around reduced to maybe less than a hundred or so, I agree with you :)
I've also been critical of XML ever since I had to start working with it. I'm coming from Lua, where a configuration file is simply a Lua script. If you have an error in the script, you get an error message from the Lua interpreter.
Now, if you have the same configuration file in XML format, and NO validation, as is usually the case, you can get any of these problems reading it in your code:
- program crashes
- program says: "error reading config file"
- program starts but uses default settings for all configurable features
- program starts but uses default settings for a subset/single feature
- something else entirely ...
Yes, this is only the narrow "configuration file" scenario, but that's just one where I think XML is totally overused and/or under-validated.
Btw, what ever happened to INI files? ;)
For scripting languages it's handy to have the config files written in the language itself. For example, here is the Python config file for a program I wrote:
which can be parsed trivially with: config = eval(open(config_filename).read())
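A sketch of the pattern being described (the keys and values below are hypothetical, not the commenter's actual file): the config file is just a Python literal, and `eval` turns it straight back into data.

```python
import ast

# Hypothetical contents of such a config file: a plain Python literal.
config_text = """{
    "window_size": (800, 600),
    "log_level": "debug",
    "plugins": ["spellcheck", "autosave"],
}"""

# The commenter's one-liner, applied to a string instead of a file:
config = eval(config_text)

# ast.literal_eval parses the same literals without executing
# arbitrary code, if the file isn't fully trusted.
safe_config = ast.literal_eval(config_text)
```

Note that `ast.literal_eval` gets you the same convenience without the code-execution risk of bare `eval`.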
I'm not a big fan of XML, but think it's OK in some scenarios. Unlike Jeff, though, I'm going to single out SOAP. We already have many perfectly good syntaxes for procedure call. SOAP is a product of the "insane complexity" one of the Google founders talked about. With a million simple, concise syntaxes for procedure call out there, why do we end up with this complex unreadable monster? How about "Currency GetLastTradePrice("DIS")"?
But you miss what I think is the most important point: it is rapidly becoming a standard way of representing information.
The problem is that *XML is NOT a way of representing information*. It's at best a way of building an information representation structure; XML by itself doesn't represent anything.
I would argue the value of having a standard far outweighs the inefficiencies in most cases.
XML is not a standard for anybody but marketroids. One of Erik Naggum's numerous quotes about XML comes to mind here:
Structure is nothing if it is all you've got. Skeletons spook people if they try to walk around on their own; I really wonder why XML does not.
Take a simple example of a configuration file that some application will need for saving user information.
Wow, a non-sequitur already? The problem here is not "hey they're not using XML" but the reinvention of the wheel. There are, and were before XML, numerous formats that could be used for representing a conf file. XML is barely *an* answer here, and one that is usually misused to insert one more buzzword in a press release.
I don't want to write little parsers anymore. I want to use something that is already written and works.
Guess what? There are numerous JSON and YAML parsers available for most popular languages. You don't have to write little parsers if you don't want to, and you haven't needed to since long before XML.
XSD can be used to insure the integrity of the XML file *before* your program starts to slurp in the data in the file.
As I said above, there are schema languages for JSON. And I really don't understand why every person who talks about XML schema languages just *has* to pick the most verbose, unreadable and annoying one of the bunch.
XSLT can be used to do arbitrary transformations on the data
So can any regular language, the only advantage this crippled, dumbed down, annoying language called XSLT has over others is that it's written in XML.
Wow, paint me impressed.
And yes, I have used XSLT, I've spent the better half of my days in it during a whole year. I know and understand the thing, and I still hate it, I'd take HaXml or HXT over it any day of the week if I was the one to choose.
XML/XSL authoring and editing tools abound.
And mostly show how misguided XML is in the first place.
As for XML editors ... I'd like to know which ones are considered "good"?
I have tried several and either they are complex beasts of applications that try to satisfy every possible XML need you might have (Altova XMLSpy comes to mind), or they are very simple editors that let you edit the XML as tree and other forms but not much else (forgot the name).
The former simply have too much of a learning curve to be useful for all the people working with XML in our company (and are too expensive, too). The latter are simply not powerful enough, or their usability just feels "odd" enough not to encourage people to use them over plain text editors (with syntax highlighting).
I agree in part. There are plenty of situations where XML should never go, and some people use it in incredibly wrong and stupid ways, but it's not all bad.
Then again, it seems software developers are like this. Case in point: GOTO.
Perfectly acceptable as long as it is done right, but developers used it inappropriately, and so it was demonized as never being the right answer.
For one precious moment it looked as if the world had actually standardized on a data and metadata interchange format
XML is not a format, it's a format representation, it has no meaning in and of itself and thus *nothing* was "standardized" for any value of "standardized" worth talking about.
Not to mention, long before the XML marketing blitz by the likes of IBM and Sun, there were ASN.1 and INI files: standards if there ever were any.
I realize it's not the most ideal tool for your social shopping cart 2.0 AJAX app. You'd rather use REST.
Thanks for showing your incompetence and lack of comprehension of the topic, it's appreciated.
Just so you know, REST is orthogonal to the document representation used: you can use REST with JSON, with YAML, with plain text, with HTML (guess what, you do every time you access a web page), or with XML. Nice try, no sugar.
Ooh look, I have an XML parser with a read and write method. I can dump all sorts of objects in it, save them and retrieve them again. Hmm, ideal for config files. And high scores. UI definitions. Actually, ideal for pretty much everything I like to store which doesn't have to go in a database. Uhm yes, my ints come out as int and my lists come out as list, it's pretty amazing really.
Sure, if it's a plain text file then I save it as plain text. And an image for example can sit neatly in an images directory. For everything else there's databases and XML.
Code comments as XML? That must be a joke and there are plenty of other jokes around. But in general: KISS and don't re-invent the wheel.
XML has been around for so long and it's so pervasive we're probably stuck with it for a long time. A few developers using my language have created "easy XML" subroutines that do a lot of under-the-hood formatting and parsing. If we have to live with something we might as well make the best of it. Automate it and forget it.
One thing XML gives you is an ability to randomly access data inside the file without loading it into a database. That can be handy for populating a catalog page in InDesign or building a web page on the fly.
But for something like a config file where you typically read the entire thing in at once it's a useless feature. And for batch-processing scenarios where the receiving system is always going to process all the data in sequence it's a useless feature with a performance penalty.
I like XML, honestly, for small things where you don't overuse attributes and all sorts of other junk.
Sort of like your simple examples:
<title>Coding Horror for Dummies</title>
But once you start to factor in XSL, XSD, XDSLXSLDX -- I just find that it all gets horribly bloated and against the ... well let's just say that I find using simply structured XML files easy and to a degree NICE to use -- but that XML quickly crosses a line from being 'enjoyable' to 'painful'.
+1 Aaron G.
If you are swimming in a sea of angle brackets perhaps you are doing something wrong. For most developers, especially those in SOA land, it's invisible under-the-hood plumbing that (mostly) Just Works(TM).
(Sorry, my example doesn't show because of the inclusion of the brackets...)
XML has its place, but lazy programmers use it for everything.
It's a new Windows registry or DLL manifest: something we never really needed, but it makes complicated stuff easier (or possible for the more ignorant coder). However, as with all such RAD tools/standards, bad programmers like to use it by default without thinking.
The .NET data controls output "horrible XML files" by default, for instance... this is where I'd blame M$ and draw a parallel to the registry... but that would be unfair. As usual, it's the programmer's fault for choosing the wrong method to store/retrieve his/her data.
It's easier not to think than to think... and we are all bad programmers after all, so I can forgive it. :)
this link is broken. :-(
3. Do you know what the XML alternatives are?
I've been digging into YAML recently and I must say it's a lot easier to pick up on, parse, and write than XML in my experience. It just seems more natural to say
Now if only we could get BizTalk to speak YAML. Sigh...
I've been wondering about XML for a while. I only recently began to get serious about developing software, and XML was entering its halcyon days right when I started learning. For a long time, I trusted in the ostensible greater wisdom of the collective and assumed that XML really was what its ubiquity implied: The greatest thing since peanut-butter Nutella sammiches. Recently, though, I really got to wondering about what the point was.
Clearly, XML is no fun to write by hand. The main argument I've heard regarding its verbose plain text format is "it's easy to debug", which makes me want to barf. This is what I'm really wondering: XML is meant to be a data transfer format. Take RSS, for example:
High-traffic sites serve tens of thousands of RSS feeds, formatted in XML, every day. In situations like this--where every spare pound of fat on your data becomes inflated ten-thousandfold until, like the grotesque beast at the end of Akira, it is suffocating the entire known universe with its pustulent girth--shouldn't we be using a data format that's as thin as possible? Shouldn't the common symbols in a data file be encoded and compressed within the file itself? Which has a smaller bandwidth footprint? This:
&1=SomeDocument;&2=SomeParagraph;<&1><&2>XML Sucks</&2>no really
The second one is pretty terrifying, but it would be TRIVIALLY EASY for ANY modern editor to translate it into something that doesn't rape your eyes (like YAML). Aren't we actually wasting TERABYTES of bandwidth every day by transferring human-parseable cruft in files that no human should ever see in the flesh anyway? Or am I missing something?
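For what it's worth, the tag redundancy being described is exactly the kind of repetition that general-purpose compression removes almost for free. A rough Python sketch with a synthetic feed (the element names and URLs are made up):

```python
import gzip

# A highly repetitive document, in the spirit of a large RSS feed.
xml = "<feed>" + "".join(
    f"<item><title>Post {i}</title><link>http://example.com/{i}</link></item>"
    for i in range(1000)
) + "</feed>"

raw = xml.encode("utf-8")
compressed = gzip.compress(raw)

# The repeated tag names compress away almost entirely.
ratio = len(compressed) / len(raw)
```

Whether that answers the bandwidth complaint, or just papers over it at CPU cost on both ends, is the real argument.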
"One thing XML gives you is an ability to randomly access data inside the file without loading it into a database."
Er, that's exactly what it doesn't do, hence terrible performance relative to binary, or simple textual data.
Allow me to express my utter indifference: meh!
I work with XML roughly daily as a developer, and it ain't no big thang. It's at least 12 parsecs farther along than the obsolete flat files we're unfortunately still dealing with.
Show somebody XML, even a total bonehead, and they'll figure it out in a few minutes. There's little magic to it, few assumptions made. Can it be abused and misused? Certainly, just like anything else in computer science. Is it largely redundant? Absolutely, but that can also serve to enhance readability in very large files.
Compare to what came before this: inscrutable binary files, INI files consisting only of key-value pairs, fixed-width flat files, delimited text files... Let's not forget our past, folks.
It's computer-readable, computer-writable, and it's more-or-less human-readable and human-writable, even if it makes you a little crosseyed. Which makes it way better than the tarpit we just crawled out from. JSON or YAML or whatever is probably on the horizon, but let's not say "XML sucks" when it was still a huge step forward.
Oops... looks like your comment filter clobbered my examples. I forgot that it's never safe to assume "no HTML" means everything will be politely escaped rather than thrown in the trash. Here they are again, manually escaped like God intended:
&1=SomeDocument;&2=SomeParagraph;<&1><&2>XML Sucks</&2>no really
I'm just thankful developers have turned to XML instead of undocumented binary files. We don't want to return to those years.
XML is by no means perfect, but why do XML detractors always compare inefficient instances of XML with otherwise terse competitors? For example, the memo shown in XML is a case in point.
<memo date="Thu, 14 Feb 2008 16:55:03 +0800 (PST)"
      from="The Whole World <email@example.com>">
Dear sir, you won the internet. http://is.gd/fh0
</memo>
Just because something is marked up with XML doesn't mean you must mark up every single possible bit of metadata for the purposes of constructing a strawman.
XML isn't bad for many things, but space-efficient and easily read by normal computer users it is not. Before XML was used for config files, INI files were standard. They have limitations, but you can parse them VERY quickly, they serve the purpose (configuration) perfectly, and they are easily read and edited. I will never understand why XML took off for configuration.
As for tabular data, the CSV standard was much better in my opinion. Once again: easy to parse, editable in many apps including Excel, quick imports, and a small footprint.
When it comes to more complex data, I believe XML is a good solution, but YAML/JSON is better in many cases for obvious reasons. The key is to use a standardized format that is supported by other major technologies. It really doesn't make a huge difference for most things. However, Microsoft added a binary format for datasets in .NET for a reason: sending huge XML files over web services was slow, and adding a "tighter" format was a huge improvement.
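A quick Python sketch of how trivial those two older formats are to parse with nothing but the standard library (the file contents here are made up):

```python
import configparser
import csv
import io

# INI: the standard library parses it; no custom parser required.
ini_text = """\
[display]
width = 800
height = 600
"""
cfg = configparser.ConfigParser()
cfg.read_string(ini_text)
width = cfg.getint("display", "width")

# CSV: tabular data, equally trivial, and spreadsheet apps open it directly.
csv_text = "sku,price\nA100,9.99\nB200,4.50\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
```

Two formats, zero hand-written parsing, and both files stay editable in Notepad.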
I wonder if you may have also seen JDIL at jdil.org?
However, unlike XML, JSON provides no direct support for namespaces - and thus no standard way for avoiding name collisions when mixing data from diverse sources. Something like a namespace mechanism is required to lift JSON to the level of a data integration platform, as opposed to a data exchange format only. Also lacking are standard ways of naming objects so that they can be referenced from elsewhere, and for representing properties with multiple values.
If these concerns are addressed, JSON's reach will extend over more of the domain currently occupied by XML, while bettering XML in the cardinal virtue of simplicity.
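A small Python sketch of the collision problem, with a hypothetical key-prefix convention standing in for what XML namespaces provide natively (the data and prefixes are invented for illustration):

```python
# Two data sources that both use the key "id" with different meanings.
library = {"id": "ISBN 0-13-110362-8", "title": "The C Programming Language"}
inventory = {"id": 4471, "shelf": "B2"}

# A naive merge silently loses one of the two "id" values.
merged = {**library, **inventory}

# One ad-hoc workaround: qualify keys with a source prefix,
# approximating a namespace mechanism.
qualified = {f"lib:{k}": v for k, v in library.items()}
qualified.update({f"inv:{k}": v for k, v in inventory.items()})
```

The workaround is pure convention, which is exactly the commenter's point: nothing in JSON itself standardizes it.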
For added silliness: a program I wrote was built around a custom text parser. This ended up being used within a large data analysis program that needed to store settings, etc. I modified it to read a 'script' at startup, a script that could contain variables and other settings the program used. Perfectly human-readable, since the plain-text 'comments' were ignored, and it was child's play to get the program to write the config file. A sledgehammer to crack a nut for most programs, but since this one already included the parser, I figured why not.
Personally I don't care what the format is as long as it's plain text of some sort, and thus easy to back up and copy. Essentially, sod anything in the registry or some hidden binary file. I love the idea of the Unix-style 'dot' hidden config files: put them in the program directory (defaults) and the user's directory for everything else.
BTW, what was wrong with .ini files? Have a standard user dir and system dir for them and it works... problem? I never understood why they moved away from that.
Damn... angle bracket eaters...
[memo date="Thu, 14 Feb 2008 16:55:03 +0800 (PST)"
from="The Whole World [firstname.lastname@example.org]"
Dear sir, you won the internet. http://is.gd/fh0
Pretend they're angle brackets.
Hear, hear! Preach on, brother!
Dang, I wish YAML had become the standard. Do I use it? Nope, because there are parsers for XML built into my language framework. Perhaps once a YAML parser for .NET becomes established I may be able to convince our team to use it, but I seriously doubt it. The tyranny of XML will no doubt continue.
Referencing this article just might help, though.
"You might argue that XML was never intended to be human readable, that XML should be automagically generated via friendly tools behind the scenes, never exposed to a single living human eye. It's a spectacularly grand vision. I hope one day our great-grandchildren can live in a world like that."
My dear sir, if you cannot, in some way or other, code it by hand, it's not a language worth using.
One of the big issues with HTML editors in the past has always been EXTREMELY redundant and sloppy code.
Regardless of the language, wizards, tools, and widgets are quick, but the codes should always be visible somewhere, somewhat understandable, and always editable.
Also, I wish they'd stop fucking with the standards. HTML has been around for ages and STILL gets much use (especially the oh-so-dreaded font tags CSS was supposed to get rid of). The HTML/XHTML/XML bitchfest, in fact, parallels the CSS fiasco a few years ago.
Introduce a new "language" based on what most people are "just fine" with, make it screwy enough that you have to reframe everything you already know and want to incorporate, and make it finicky enough that a number of people will revolt against it. What for? a little bit more flexibility and usability.
I won't be surprised if five years from now, people will still be refusing to use XML for things handled MUCH easier in other languages.
I wanted to post some clear, well-defined rebuttals, as I am a fan of XML and its related technologies, but I can't really disagree with the overall sentiment of the post.
SOAP is awful, and although less so, the XML used in the examples is too. Yes, XML is often used like a club.
But some of these comments... come on! There's all sorts of guff popping up here, from "XML is too hard" to what's almost a carbon-footprint argument!?
I like JSON because it's lighter-weight, and AJAX apps can easily "programmify" the server response by doing an eval. Of course there are a couple minor security risks with this, but they can be avoided.
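The eval risk in question is a JavaScript/AJAX one, but the same trade-off is easy to sketch in Python (the payloads below are made up): a strict JSON parser rejects anything that isn't data, while `eval` would execute it.

```python
import json

# A well-formed response parses into plain data.
response = '{"user": "alice", "admin": false}'
data = json.loads(response)

# eval() would execute code smuggled into a "JSON" response;
# a strict JSON parser simply rejects it instead.
evil = '__import__("os").getcwd()'
try:
    json.loads(evil)
    rejected = False
except ValueError:  # json.JSONDecodeError subclasses ValueError
    rejected = True
```

This is essentially what JavaScript's later `JSON.parse` gives you over raw `eval` of the response body.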
The only reason I ever use XML is if I needed to pass data to/from different platforms (using SOAP). Good post!
SOAP is possibly the most horrifying example. Even if you set aside the whole document/message thing and the poor library intercompatibility, you are still using vast amounts of expensive-to-parse XML to model, in most cases, rather simple function calls. Benchmarks comparing SOAP to CORBA or Thrift (with binary protocol) or whatever tend to be almost comical, and one derives no real benefit from RPC being in XML, and yet SOAP is still heavily used as an RPC mechanism.
I do not agree with you at all on this matter. Although you try hard not to be anti-XML, you sound very much against it.
Even though it might not be all that easy for humans to read, at least it can be read. It was made in an era when programmers used to invent their own formats to write application data in. At least we have a standard now. You can go on picking on it, and soon we will reach a stage where everybody is writing data in their own formats and there will be no interoperability.
Something like a namespace mechanism is required to lift JSON to the level of a data integration platform, as opposed to a data exchange format only.
But why would you do that? Why couldn't JSON just stay a simple format for data exchange and basic data storage? It's a tool, and it's an awesome tool for what it does. Use another tool (e.g. YAML) if your task or data is more complex than what JSON can do.
Becoming "the hammer to nail them all" (including screws, puppies and ducks) is exactly what has gone wrong with XML, why would you want to repeat the same mistake?
there will be no interoperability.
But there *is* none already! XML is not a data format, it's a data format representation, just because your config file is in XML and mine is also in XML doesn't mean they're interoperable in any way, shape or form. And that's why people have to build complete, custom, non-interoperable data formats on top of XML such as SOAP, XSL, XSD, DITA, ...
LISP and Scheme are excellent alternatives for XML.
What I find most bothersome about dealing with XML is that "parsing" it tends to happen on two levels. You use a parser to turn characters into XML elements, then you hand-roll another parser in your programming language to turn the "start-tag foo" tokens into actual data. (Or else I'm missing something huge.)
Whereas with JSON, you call json_decode($data, true), and whoomp! There it is.
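The `json_decode` above is PHP; the same contrast in Python looks roughly like this (element and key names are invented for illustration):

```python
import json
import xml.etree.ElementTree as ET

xml_doc = "<user><name>alice</name><age>30</age></user>"

# Level 1: the parser turns characters into elements.
root = ET.fromstring(xml_doc)
# Level 2: hand-rolled code turns elements into actual data
# (and even the types are up to you).
user = {
    "name": root.findtext("name"),
    "age": int(root.findtext("age")),
}

# The JSON equivalent is a single call, types included.
same = json.loads('{"name": "alice", "age": 30}')
```

Whoomp, indeed.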
After a lot of trying to make it automagic, I've also realized that XML is not 100% interchangeable with JSON, because XML has both attributes and text nodes. And I think that's where the other half of the pain comes from, for me: when JSON isn't enough, I use XML, but I don't have foo.innerText in most languages because the DOM insists on dealing with raw nodes. Grawr.
That's my hard-earned, unpopular opinion....
But it's just so damn ENTERPRISEY!!
Soap is insanity. XML in general is not so bad.
However, I agree that XML is overly complicated for representing flat data. Yeah, it's great that we have a standard. We can do better. Let's come up with one that makes sense.
Sorry, I don't agree. I'd rather have one syntax than 2,746 different "common standards", each with different bugs, each maintained by a small group of people, rather than an entire industry. XML has become what it has because people needed something to fill that role. And it turns out that it was flexible enough to handle tasks outside of its original design goals--the hallmark of any good system.
Is file size the biggest problem in computer science right now? Aren't there bigger battles to fight, or is everything else a smaller problem than this in your view?
I'm tired of people complaining about how expensive XML is to parse (as if you were writing the parser). Compared to what? JSON? Scale JSON to something that handles namespaces, includes, queries, encodings, and we'll talk. Give me one example where XML has been critically big or slow or complex.
Where's the complaint against HTML? Why don't you write this site in flash, and get rid of the RSS feed? You are personally adding strength to XML, you know. Let's see you put your money where your mouth is.
It's amazing. For one precious moment it looked as if the world had actually standardized on a data and metadata interchange format, and then the "agile" groups had to mess it up with their JSON and YAML and whatever.
I think that's the point Jeff misses in his post. His SOAP example is completely self-describing. A zero-knowledge interception layer could evaluate that SOAP request and with absolute certainty (no heuristics) act on it. His email example would require heuristics that would occasionally be wrong.
XML is wrong for some things, of course. HTML is the right choice for web pages as humans are inherently heuristic and any errors in HTML are tolerated (by convention) much better than errors in XML tend to be tolerated. If a particular XML document schema were complex enough and used in enough disparate environments it would eventually become HTML-like in this respect.
But just because HTML-style heuristics are the right choice in the browser environment doesn't mean they're right in all situations. I think strict, unambiguous interpretation and format are key attributes to enable many inter-operation scenarios. I don't know about his YAML example, but one thing I take from it is that newlines and whitespace are important, which is a no-go since text documents often get their whitespace mangled as they are passed through systems. I don't think most people have a tolerance for casual data corruption.
Finally someone is saying what I felt all along. XML is a great document language, but a bloated data language. Not all data is a document. XML does great for documents, things like HTML, ODF, and so on. For configuration and other programmatic data it just has way too much structure.
Java is the big exception here, and I think Java has a lot to do with XML's popularity. With no good way to have anonymous data structures in Java, embedding data in your application is just not possible; you have to store it externally. The Java folks were looking for a flexible, expressive, human-friendly format for a while, and XML fit the bill. Java's collection classes, and many of its APIs, are already a bit unwieldy, so DOM and SAX didn't really seem too bad over there. Plus its DTDs, validators, etc., really fit in well in the bean-counter environments where Java is often used.
So now those of us using languages that have more native support for complex dynamically structured data and a "just do it" attitude have to deal with something that was designed for a completely different sort of ecosystem.
JSON doesn't solve every problem, and it probably could use a good standard for something like DTDs, and something like XQuery, but I've seen work done in this direction, and it's all much simpler than the XML equivalent yet can still represent anything, even a DOM tree.
BTW, here is how to use XML without exaggerating the cost:
<message From="The Whole World <email@example.com>">
Dear sir, you won the internet. http://is.gd/fh0
</message>
This one is not so contrived as the ridiculous example in the post.
One of the good things about XML now is all the libraries and tools available to support it.
Great. No HTML, but anything that looks like a tag is removed. Brilliant. What an excellent choice when hosing an XML discussion.
XML is just a poor syntax for s-expressions. I'm disappointed that only one of you mentioned Lisp dialects, though Pádraig Brady got at the right idea when mentioning that Python has a read() function.
S-expressions are much easier to validate than XML (just think of all the possible ways that angle brackets can be broken) and easier to write by hand (a lot of text editors help you with the parens). There are also possibly hundreds of well-tested implementations of s-expression readers. The syntax is laughably trivial (it's either an atom, or a list of atoms). Besides, it's been done for literally fifty years! And you don't need to use a Lisp dialect to use S-expressions -- you could just as easily implement READ in some other language.
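A sketch of how little machinery READ actually needs, done in Python rather than a Lisp (atoms are left as strings for brevity, and the sample expression is invented):

```python
# A minimal READ: an s-expression is either an atom or a list of atoms.
def tokenize(text):
    return text.replace("(", " ( ").replace(")", " ) ").split()

def read(tokens):
    token = tokens.pop(0)
    if token == "(":
        lst = []
        while tokens[0] != ")":
            lst.append(read(tokens))
        tokens.pop(0)  # drop the closing ")"
        return lst
    return token  # an atom

expr = read(tokenize("(memo (from world) (body hello))"))
```

A dozen lines, no grammar, no schema, and the nesting falls out of the recursion, which is more or less the fifty-year-old point.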