Why XML is Garbage

XML, the ugly step-child of the uglier HTML, is everywhere in computing these days. For as long as I can remember, my bosses have tried to get me to adopt XML behind the scenes in my designs, and from the beginning I’ve been defiant. So what’s wrong with XML? Why do I hate it so much?

Well, I think a better question is, “what’s right with XML?”.  In the opinion of this programmer, very little is right with XML.  But I suppose I’ll throw it a few bones.


The Good

  • XML is human readable and just about any idiot can understand what it means.  However, we all hope we don’t have to work with an idiot in the next cubicle.
  • It is hierarchical.
  • That’s all.  There’s nothing else good about XML.

The Bad

  • Expensive to Parse.  XML is expensive to parse and not at all easy to address.  Back in the early days I developed my own “dot notation” for addressing different parts of XML documents quickly and easily, similar to some of the techniques people use today. But I also wondered why the hell a kid in a cubicle in a Minneapolis suburb had to invent something that I thought a monkey should have invented already.  It was common sense.  There’s not enough of that in an open source community plagued by wanna-be programmers. Have you ever tried parsing a giant XML document, like the infamous planet.xml, which contains a fully annotated, community-generated map of planet Earth?  Try it… then come crying to me and tell me I’m right.
  • “AMP Encoding”.  XML reemploys useful ASCII/ANSI/Unicode characters as markup, which forces you to encode those characters whenever they appear in your data.  Maybe “difficult” is too strong a word for it, but we’ve all bumped into those situations where the idiot in the next cubicle builds a document with a body that contains these special characters and forgets to encode it properly.  The fact remains that there are plenty of unused characters on the ASCII chart… and even more in the Unicode space… so why didn’t they use them instead? Traditionally (long ago) markup languages used the ESC character to denote out-of-band formatting data and commands.  In my opinion this is a bit more technically sound because when dealing with text data… no one ever uses ASCII 27 for anything… ever.  Under those old systems, unless you were slipping truly binary data into the stream, you’d basically never have to encode anything.  Again… it’s called “common sense”.
  • Using XML in a read-write random access fashion is prohibitively expensive. XML isn’t really designed for that, nor is it ever used that way.  But why shouldn’t you want to use it that way?
  • Binary data encoding has no real universal standard.  There are sorta standards, but not really.  Encoding of binary data should be specified and handled in one universally useful way.
  • XML is “Chatty”.  Full of redundancy.   If you have 50,000 “user” tags in your document, you’ll have a document that contains the word “user” at least 50,000 times, and twice that once you count the closing tags.   In most cases its chattiness is acceptable, but it makes XML less than universally useful.
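
To make the encoding and addressing complaints concrete, here is a minimal Python sketch using only the standard library. The `dot_find` helper and the sample `users` document are hypothetical, made up for illustration; this is not the dot notation described above, just one plausible way such a thing could look.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def dot_find(root, dotted):
    """Hypothetical helper: resolve 'a.b' to the text of the first matching element."""
    node = root.find(dotted.replace(".", "/"))
    return None if node is None else node.text


raw = "if a < b & c > d"

# Forgetting to encode the body produces a document that will not parse.
try:
    ET.fromstring(f"<note>{raw}</note>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False

# escape() rewrites &, < and > as entities so the body is safe to embed.
encoded = escape(raw)

doc = f"<users><user><name>bob</name><note>{encoded}</note></user></users>"
root = ET.fromstring(doc)

name = dot_find(root, "user.name")  # text of the first <user><name>
note = dot_find(root, "user.note")  # the parser decodes the entities again
```

Note that every value has to make a round trip through the entity encoding, and that a single forgotten `escape()` call is enough to make the whole document unreadable.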

If not XML, then what?  What would be better?

Honestly, just about anything would be better than XML.  I resisted transporting data in XML format because I already had a robust suite of code that could marshal and transport literally any kind of data efficiently, and that was less prone to encoding errors than XML ever was or ever would be.  A simple protocol using binary header blocks is pretty easy for any beginning programmer to deal with these days.  But even a much simpler approach might be more powerful than XML while still being painfully simple to understand…

Let’s start with this painfully simple XML document:

Yes, the above document is simple to understand, but it has all the bad things that I listed earlier.

What if we simply stored all this information in name-value pairs?

Yes, the above formatting is a smidge chattier than the XML, however it has several BIG advantages.

1. It is sortable.
2. Since it is sortable, it is searchable using a binary search and therefore lightning fast.
3. It is random-accessible.  You can easily find the value you’re reading, but you can also easily add new values into the middle or the end of the document without problems, particularly if you load it as a string list, dictionary, or other kind of string array of your liking.
4. It is still pretty darn easy to understand.
5. It has no problem encoding characters like “<” and “>”  and ” “.  The only reserved characters are “=” in the key names and CR/LF in the values.
6. There is only one place for data, making it less confusing.  XML has two places for data, in the “body” of a key or as an “attribute” of the key.
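
The sortable, searchable, and random-access claims can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a finished format: the key names in `SAMPLE` and the helpers `parse_pairs`, `lookup`, and `upsert` are all hypothetical, and each line is split at the first “=” so that values may contain “=” freely.

```python
import bisect

# Hypothetical sample data; the keys are invented for this sketch.
SAMPLE = (
    "user.1.name=alice\n"
    "user.0.note=a < b & c > d\n"
    "user.0.name=bob\n"
    "expr=x=1\n"
)


def parse_pairs(text):
    """Split each 'key=value' line at the first '=' and return pairs sorted by key."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition("=")
        pairs.append((key, value))
    pairs.sort()
    return pairs


def lookup(pairs, key):
    """Binary search for key; return its value, or None if it is absent."""
    i = bisect.bisect_left(pairs, (key,))
    if i < len(pairs) and pairs[i][0] == key:
        return pairs[i][1]
    return None


def upsert(pairs, key, value):
    """Replace an existing key or insert a new one, keeping the list sorted."""
    i = bisect.bisect_left(pairs, (key,))
    if i < len(pairs) and pairs[i][0] == key:
        pairs[i] = (key, value)
    else:
        pairs.insert(i, (key, value))


pairs = parse_pairs(SAMPLE)
note = lookup(pairs, "user.0.note")   # "<", "&" and ">" need no encoding
upsert(pairs, "user.0.role", "admin")  # lands in the middle of the list
```

The lookup is a binary search over the sorted list, so it needs only a logarithmic number of comparisons. Inserting into a Python list still shifts the elements after the insertion point, so a dictionary or tree would be the better choice for write-heavy use.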

*Poof* *Mind Blown*

All brashness aside, the use of name-value pairs hardly constitutes out-of-the-box thinking, nor is it the most useful and innovative thing that I or a team of engineers could come up with, but that just goes to further emphasize my point that XML is a garbage format.  I swear this planet is just doomed to blindly follow bad standards and it, frankly, drives this programmer mad!