XML Crash Course

Published: 2010-12-17
Category: Code
Tags: HTML, Star Trek, XML

A non-nerdy friend of mine (yes, they exist, moving on) changed jobs at her company last week and had a question:

So, it appears that i will have to learn how to use XML for my job. What would you suggest? And by “suggest”, i mean, “talk to me like i’m four.”

I suggest you get a better description of what you need to learn because they didn’t give you enough info. XML is a really general tool used in a lot of different things. (Some of us nerds hate XML and would prefer to see different things in its place, but knowing that doesn’t help you learn XML so I’m not going to talk about it except to finish with a terrible joke.)

XML is a way of marking up data. Imagine I have a catalog card for the (excellent) book I am rereading now:

The Design of Everyday Things
Donald A. Norman
272p, 8.3 x 5.5 x 0.8

Anybody who has ever complained that “they don’t make things like they used to” will immediately connect with this book. Norman’s thesis is that when designers fail to understand the processes by which devices work, they create unworkable technology.
978-0465067107
TS171.4 .N67 2002

It’s a bunch of data. Now, you are a human. You have vast life experience and your brain is a massively parallel neural network, so you are excellent at pattern matching. You know which bits of data are the title or the description or the LOC number, and you’d know this even if you weren’t old enough to have used a physical card catalog (oh god we’re decrepit, why didn’t I use an iPhone or a pop star as an example?).

Computers suck at this. They like to have their data explained to them. XML is one, fairly human-readable way of doing so.

XML would “mark up” the card by labeling the different fields of information.

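```xml
<card>
  <title>The Design of Everyday Things</title>
  <author>Donald A. Norman</author>
  <pages>272</pages>
  <dimensions units="inches">
    <height>8.3</height>
    <width>5.5</width>
    <depth>0.8</depth>
  </dimensions>
  <description>Anybody who has ever complained that “they don’t make things like they used to” will immediately connect with this book. Norman’s thesis is that when designers fail to understand the processes by which devices work, they create unworkable technology.</description>
  <isbn>978-0465067107</isbn>
  <loc>TS171.4 .N67 2002</loc>
</card>
```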

The bits in angle brackets are called “tags”. It’s easy to tell a computer program to look for the <title> tag and save that to a particular field in a database, or display it on a web page, or whatever.
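To make "easy" concrete, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. The trimmed-down card and the `get_title` helper are made up for illustration; a real program would more likely read the XML from a file than from a string.

```python
import xml.etree.ElementTree as ET

# A cut-down card, just enough to have a <title> to look for.
CARD_XML = """<card>
  <title>The Design of Everyday Things</title>
  <author>Donald A. Norman</author>
</card>"""


def get_title(xml_text):
    """Return the text inside the first <title> tag, or None if there isn't one."""
    root = ET.fromstring(xml_text)  # root is the outermost tag, here <card>
    return root.findtext("title")   # look for a <title> directly inside it


print(get_title(CARD_XML))  # prints: The Design of Everyday Things
```

From there, saving the title to a database field or dropping it into a web page is ordinary programming.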

There are also “attributes”, like the units="inches" on the <dimensions> tag. They’re usually used to describe the way the data is formatted.
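Here is a rough sketch of how a program might read an attribute, again with Python's `xml.etree.ElementTree`. The `<height>`, `<width>`, and `<depth>` tag names and the `read_dimensions` helper are my own assumptions for the example, not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical <dimensions> fragment; the inner tag names are assumptions.
DIMENSIONS_XML = """<dimensions units="inches">
  <height>8.3</height>
  <width>5.5</width>
  <depth>0.8</depth>
</dimensions>"""


def read_dimensions(xml_text):
    """Return (units, sizes): the units attribute and a dict of measurements."""
    dims = ET.fromstring(xml_text)
    units = dims.get("units")  # attribute lookup; None if the attribute is missing
    # Each tag directly inside <dimensions> becomes one entry in the dict.
    sizes = {child.tag: float(child.text) for child in dims}
    return units, sizes


units, sizes = read_dimensions(DIMENSIONS_XML)
print(units, sizes["height"])  # prints: inches 8.3
```

The attribute tells the program how to interpret the numbers, which is exactly the "describe the way the data is formatted" job.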

Tags can be nested - notice everything is in a <card> tag and the measurements are nested in the <dimensions> tag. Let’s make one more nesting:

```xml
<authors>
  <author>Donald A. Norman</author>
</authors>
```

This book only has one author, but there are lots of books with multiple authors, like The Illuminatus Trilogy:

```xml
<authors>
  <author>Robert Anton Wilson</author>
  <author>Robert Shea</author>
  <author>fnord</author>
</authors>
```
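The payoff of that extra nesting is that a program can treat one author and many authors the same way. A small sketch, assuming the `<authors>`/`<author>` layout described above (`list_authors` is a made-up helper):

```python
import xml.etree.ElementTree as ET

# Sample card with the nested author layout; the structure is an assumption.
BOOK_XML = """<card>
  <title>The Illuminatus Trilogy</title>
  <authors>
    <author>Robert Anton Wilson</author>
    <author>Robert Shea</author>
  </authors>
</card>"""


def list_authors(xml_text):
    """Return the text of every <author> found inside <authors>."""
    root = ET.fromstring(xml_text)
    # findall returns a list, so zero, one, or many authors all work.
    return [author.text for author in root.findall("authors/author")]


print(list_authors(BOOK_XML))  # prints: ['Robert Anton Wilson', 'Robert Shea']
```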

The set of design decisions about what tags to invent, how to nest them, and how to format the data inside them is called a “schema”. XML files should each have a schema to eliminate the Dread Pirate Ambiguity. Is it “Robert Shea” or “Shea, Robert”? Or should we break it out into <first-name> and <last-name> tags, but then how would we deal with a book by Prince, when he even deigns to use an alphabet at all? Either you adopt a schema or you create your own, and hopefully you spend a lot of time thinking about all these weird things that can go wrong because it’s painful to have to change the schema and get everybody to agree to use the new thing.

The only other major concept is “validity”, which means making sure that tags are written properly (no “<auth Robert Shea</author>” or other typos) and used properly (no “<author>Robert Shea and Robert Anton Wilson</author>”). (Pedants will tell you the first half is technically called “well-formedness” and only the second half, following the schema, is “validity”, but you need both.) Programmers, being a lazy and untrustworthy lot (and our XML tools often being big pains in the ass), often slap XML files together by hand and generate subtly invalid XML.
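Both kinds of mistake can be caught by a program. The sketch below uses Python's `xml.etree.ElementTree` to catch the typo kind, since its parser refuses to read badly written tags. The second function is only a crude, hand-rolled stand-in for a real schema check (Python's standard library doesn't ship a schema validator), and both function names are made up.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser can read the text at all, False on tag typos."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def has_one_name_per_author(xml_text):
    """Toy rule standing in for a schema: no ' and ' inside any <author> tag."""
    root = ET.fromstring(xml_text)
    return all(" and " not in (author.text or "") for author in root.iter("author"))


print(is_well_formed("<author>Robert Shea</author>"))  # prints: True
print(is_well_formed("<auth Robert Shea</author>"))    # prints: False
print(has_one_name_per_author(
    "<author>Robert Shea and Robert Anton Wilson</author>"
))                                                     # prints: False
```

A real project would check files against the actual schema with a proper validator rather than rules like this one.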

XML gets used for a huge amount of stuff nowadays because, despite some shortcomings (the big ones being verbosity, the difficulty of validity, and a lot of crappy programming tools), it’s a decent way of sharing data. And you don’t even have to know much about the schema to look at a file and start pulling data out of it.
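To show what "not knowing much about the schema" looks like in practice, here is a sketch that walks over every tag in a file and lists whatever text it finds, without being told any tag names in advance. The sample XML and the `dump_fields` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Pretend we were handed this with no documentation.
MYSTERY_XML = """<card>
  <title>The Design of Everyday Things</title>
  <authors>
    <author>Donald A. Norman</author>
  </authors>
</card>"""


def dump_fields(xml_text):
    """Return (tag, text) pairs for every tag that directly contains some text."""
    root = ET.fromstring(xml_text)
    fields = []
    for element in root.iter():  # visits the root and everything nested in it
        text = (element.text or "").strip()
        if text:  # skip tags that only hold other tags or whitespace
            fields.append((element.tag, text))
    return fields


for tag, text in dump_fields(MYSTERY_XML):
    print(tag, "=", text)
```

Ten minutes of poking around like this is often enough to guess what the fields mean.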

If you’ve ever used a feed reader to follow a blog, those feeds it reads are stored in XML. In your browser, pull up my blog feed, click the View menu, and choose View Source. You’ll see the XML describing my recent posts as <item>s and, inside each, the text of each post in a <description>.
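A feed reader does roughly the following. This is only a sketch: the feed below is a tiny made-up sample in the general shape of an RSS file, not my actual feed, and `list_posts` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed; a real one has more fields and many more <item>s.
FEED_XML = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>XML Crash Course</title>
      <description>&lt;p&gt;Made-up post text.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>"""


def list_posts(feed_text):
    """Return (title, description) for every <item> in the feed."""
    root = ET.fromstring(feed_text)
    return [
        (item.findtext("title"), item.findtext("description"))
        for item in root.iter("item")  # finds <item> tags at any depth
    ]


for title, description in list_posts(FEED_XML):
    print(title)
    print(description)  # the post body comes back as a string of HTML
```

Note the `&lt;` and `&gt;` in the sample: that is one common way a feed tucks the post's own markup inside the XML without confusing the parser.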

A little confusingly, that text is marked up with HTML, which is very similar to XML. XML and HTML are basically brothers, they’re based on the same older work (SGML), except that XML is like the Borg and tried to assimilate HTML in the early ’00s (XHTML) and then the Federation (web nerds) realized that was a crappy idea and set off at maximum warp (let’s go build some awesome things and let the documentation catch up later) in a different direction (HTML5) and soon it’ll be holodecks for everybody (canvas, video, audio tags) and fire photon torpedoes at the Ferengi (Flash). Exactly like that.

So there’s a crash course in XML. I can’t tell you a lot more because I don’t know what your job will have you do with it. Saying they use XML is not much better than saying they use electricity. I know that you’re probably going to take some data from one place, format it, and put it in some other place but that’s such a vague job description it even applies to those guys with the vacuum trucks who empty latrines (a metaphor that will become increasingly rich, predictive, and pungent to you the more time you spend with XML).

Questions?