Ripping Unicode
I love shoving around large amounts of data. Unicode is an industry standard for encoding data in most every written script there’s ever been. It has over 97,000 characters. A while ago I read about a guy who made his own Unicode poster and I realized I had an opportunity for a fun project. I think Unicode is an invaluable and beautiful project, and this is my tribute to it.
Unicode puts out PDF code charts of each and every glyph so people can have some idea of what they should look like (though the actual display to the user is left up to fonts). If you’d like to see one, the chart for Linear B is a nice little example. Because I didn’t want to mess around with putting dozens of PDFs together, I just grabbed the single 30MB 777-page PDF for the latest version, 4.1.0. (New versions are released as they add more scripts -- cuneiform is scheduled for 5.0, yay!)
Looking at the PDF (whether you grabbed the small example or decided to be macho and pulled up the whole thing), it’s obvious that the “one big table per page” approach makes it pretty easy to extract the glyphs. I wrote a program to do this, because it would be stupid and ultimately self-defeating to open the PDF with GIMP to manually select and save each glyph.
First, I used Ghostscript to convert the PDF to a series of 600 DPI PNGs, one image per page. This took about 39 minutes and 471MB of disk space. (I’ll explain why I converted at such a high resolution in about two paragraphs.)
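For the curious, a conversion like this can be driven from Python. This is a minimal sketch rather than the exact command from the project: the `gs` executable name, the `pngmono` (1-bit black-and-white) device, the output file pattern, and the PDF filename in the comment are all assumptions.

```python
import subprocess


def ghostscript_command(pdf_path, out_pattern="page-%04d.png", dpi=600):
    """Build a Ghostscript command that renders each PDF page to its own PNG."""
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER",  # run unattended, then exit
        "-sDEVICE=pngmono",                 # assumed: 1-bit black-and-white PNGs
        "-r%d" % dpi,                       # render resolution in DPI
        "-sOutputFile=" + out_pattern,      # %04d is replaced by the page number
        pdf_path,
    ]


# To actually run it (hypothetical filename):
# subprocess.run(ghostscript_command("CodeCharts.pdf"), check=True)
```

Building the argument list separately from running it makes it easy to print and check the command before committing to a long conversion.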
Next, I wrote a Python script using the Python Imaging Library to analyze every page, locate the glyph table if the page has one, and walk through the table exporting each glyph to its own individual PNG. The 769MB of individual glyph images took about 5.6 hours to rip. And that’s OK! I kept the code simple to read because performance doesn’t matter. That’s an odd thing to say for a program that takes hours to finish, but you only ever run this program once. Note, that’s never an excuse for shoddy quality, because you never run a program just once. Yes, those two statements are at odds. If you don’t see that they’re both true, you have not yet learned the Tao of Code.
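The table walk boils down to cropping evenly spaced cells out of a page image. Here is a rough sketch of that idea, not the original script: it assumes the table's bounding box has already been located (that detection step is not shown), that the grid is evenly spaced, and the function names and output filenames are made up for illustration.

```python
def cell_boxes(table_box, columns, rows):
    """Yield (column, row, box) for every cell of an evenly spaced grid.

    table_box is (left, top, right, bottom) in pixels; each yielded box uses
    the same convention, which is what PIL's Image.crop() expects.
    """
    left, top, right, bottom = table_box
    cell_w = (right - left) / columns
    cell_h = (bottom - top) / rows
    for col in range(columns):
        for row in range(rows):
            box = (
                round(left + col * cell_w),
                round(top + row * cell_h),
                round(left + (col + 1) * cell_w),
                round(top + (row + 1) * cell_h),
            )
            yield col, row, box


def export_cells(page_path, table_box, columns, rows, out_dir):
    """Crop every cell out of one page image and save it as its own PNG."""
    from PIL import Image  # imported here so cell_boxes works without PIL

    page = Image.open(page_path)
    for col, row, box in cell_boxes(table_box, columns, rows):
        # Placeholder filename; the real name comes from the glyph's label.
        page.crop(box).save("%s/cell-%02d-%02d.png" % (out_dir, col, row))
```

Keeping the box arithmetic separate from the image handling means the geometry can be checked on its own before spending hours ripping pages.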
Well, there’s one important step elided above: figuring out which glyph is which so I know what the filename should be. Every glyph is labeled with a four- or five-digit hexadecimal number, but of course the label is part of the image rather than text. Converting images to text is known as OCR (optical character recognition) and it’s difficult to avoid errors. In my favor I had images that were big (600 DPI; see, I told you I’d get back to it), error-free, and black-and-white; but a strike against me was that an OCR program can’t spell-check a word like “A29C”. After trying a few different OCR programs, I got ocrad working well with a little tinkering (it often mistook the number 0 for the letter O).
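The 0-versus-O problem has a convenient property: the letter O is never a valid hexadecimal digit, so any O in a label can safely be read as a zero. A minimal sketch of that cleanup, assuming the OCR output for one label arrives as a string (the function name is hypothetical, and this is not the tinkering actually applied to ocrad):

```python
import re

# Labels in the charts are four or five hexadecimal digits.
HEX_LABEL = re.compile(r"[0-9A-F]{4,5}")


def clean_label(ocr_text):
    """Return the code point label as uppercase hex, or None if it is not one."""
    label = ocr_text.strip().upper().replace("O", "0")
    if HEX_LABEL.fullmatch(label):
        return label
    return None
```

Returning `None` for anything that still doesn't look like a label gives a short list of cells to check by hand instead of a pile of misnamed files.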
Well, mission accomplished. I’ve ripped 97,715 glyphs from the Unicode code charts to individual PNGs. So what should I do with them all?
I work at Hostway, a web hosting company with about a dozen offices worldwide. I picked up a copy of our logo in vector format from one of our graphic designers and resized it until it had about the same number of black pixels as I have glyphs.
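A shortcut for the resizing, offered as a sketch rather than what I actually did: the number of black pixels grows with the square of the linear scale, so a single square root gives a good first guess. The function name is hypothetical, and in practice rounding at the edges of letters means recounting and nudging the size afterwards.

```python
import math


def scale_for_glyph_count(black_pixels, glyph_count):
    """Estimate the linear scale factor that brings black_pixels near glyph_count."""
    # Area scales with the square of the linear size, hence the square root.
    return math.sqrt(glyph_count / black_pixels)
```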
Then, of course, I wrote another script to render a final image. It creates a huge new image in which every pixel from the source is replaced by a downsampled glyph (downsampled because I’m printing at 300 DPI), and it tries to put the darkest glyphs in place of the darkest pixels so I have a (very) rough antialiasing effect to soften the edges of letters. (If it’s not already painfully obvious, yes, I did ASCII art as a teenager.) This script immediately crashed and burned.
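The darkest-glyph-for-darkest-pixel matching can be done by sorting both sides by darkness and pairing them off in order. This is a minimal sketch of that idea under assumptions of my choosing, not the rendering script itself: darkness is any number where bigger means darker (for a glyph, say, the fraction of black pixels), and the function and argument names are hypothetical.

```python
def assign_glyphs(pixel_darkness, glyph_darkness):
    """Pair each source pixel with a glyph so darker pixels get darker glyphs.

    pixel_darkness maps pixel position -> darkness; glyph_darkness maps glyph
    name -> darkness. Returns a dict of position -> glyph name.
    """
    if len(glyph_darkness) < len(pixel_darkness):
        raise ValueError("not enough glyphs to fill every pixel")
    # Sort both sides from darkest to lightest, then pair them off in order.
    pixels = sorted(pixel_darkness, key=pixel_darkness.get, reverse=True)
    glyphs = sorted(glyph_darkness, key=glyph_darkness.get, reverse=True)
    return dict(zip(pixels, glyphs))
```

If there are more glyphs than pixels, this version simply leaves the lightest glyphs unused.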
The problem is scale. The source image is 1618 x 151 pixels, and I’m outputting at 300 DPI: 1618 x 151 x 300 x 300 = 20.9GB. As cool as it would be, I just don’t have that much RAM. I had to break up the image into pieces three feet wide, which is good because that’s also how wide the biggest printer in the office is. And I got to write a great error message: “Can’t create images this big. Reduce max_piece_width or buy more RAM.”
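Splitting the output into pieces is mostly a matter of carving the full width into column ranges no wider than the limit. A minimal sketch, assuming widths are measured in output pixels; `max_piece_width` is borrowed from the error message, while the function name is made up:

```python
def piece_ranges(total_width, max_piece_width):
    """Split total_width pixels into (start, stop) column ranges."""
    if max_piece_width < 1:
        raise ValueError("max_piece_width must be at least 1")
    return [
        (start, min(start + max_piece_width, total_width))
        for start in range(0, total_width, max_piece_width)
    ]


# Three feet at 300 DPI is 36 * 300 = 10800 pixels per piece.
pieces = piece_ranges(46 * 12 * 300, 36 * 300)
```

Using the rounded 46-foot length of the finished logo, this works out to 16 pieces, with the last one narrower than the rest.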
Made of 16 images totaling 310MB (rendering took 45 minutes), the Hostway logo worked out to be 6 feet tall and 46 feet long. My boss told me it was “too big” to hang up. I told him that was “exactly the point of the whole thing”. I’ll change his mind yet -- everyone in the office I show it to loves it, so it’s just a matter of time.
I presented the code at the June 2006 meeting of ChiPy and it was well-received. One attendee told me afterwards it was the funniest, nerdiest presentation he’d ever seen, which I consider high praise. As promised, I’m sharing the source code.
You’ll have to be a little familiar with graphics and Python to make use of this. It is not a copy of all the glyph graphics, and I will not provide you with them or make posters for you, so please don’t ask. I’d love to hear about what you do with it, so please do something cool and post a comment or trackback.