There are two ways for programs to collaborate: they can communicate their shared state or they can partition the work so they don’t need to know each other’s state. As I’ve been rehabbing the code for my mailing list archive Chibrary I ran into this issue, and the design constraints made for an interesting puzzle. I need to generate unique IDs in multiple programs at once, and they need to be short enough to fit in URLs.
Writing is nature’s way of letting you know how sloppy your thinking is. Guindon
At more than 3,000 words, this is an unusually long blog post for me. This problem has a lot of considerations to balance (and you’ll see a lot of parenthetical caveats). I like to work through problems by writing out my thoughts, constantly tinkering with possibilities and revisiting ideas in light of new constraints or insight. I thought my (heavily) revised notes might make for an interesting post as I attack the problem from several different approaches and end with a solution that nicely connects theoretical and practical considerations. As a bonus, maybe someone will tell me I’m approaching this all wrong or ignorant of a known best practice for this sort of thing before I go live. With that preamble rambled, let’s dig into the problem.
Every message in Chibrary needs to be assigned a unique identifier, a “call number”, as it’s imported. This name is a little too cute because call numbers in library classification systems usually embed some kind of addressing information (that is, books are topologically sorted, like in the classic Dewey decimal system) while my users never care where on my hard drive a message happens to be stored. But the obvious term “Message ID” was already in use as a (hopefully) unique ID attached to each email by the program first used to send it, and synonyms like “Email ID” would have been confusing.
A call number could be as simple as a monotonically increasing number (first message gets the call number “1″, second gets “2″, etc.) but this has two obstacles.
I opened this post with the big problem: collaboration. There’s always a program running to watch the system’s email inbox for new messages from lists the system is subscribed to, so what happens I want to run the bulk importer on a new archive? If both programs run at the same time they might both see the highest ID in use is 123, add one to it, and both try to add a message under the call number 124. Avoiding this collision means writing another fairly complex program to uniquely dole out IDs and all the importers slowing down to request an ID for each message they want to store.
It’s tempting to send all messages to a single importer, but this creates a bottleneck and wastes the concurrent insertions available from storing messages in the distributed database Riak. And if I’m entirely honest, I didn’t want to give up this puzzle. Chibrary exists in part for me to learn from experiments.
The second, less important problem is that sequential IDs invite abuse. I’d rather people downloaded packaged archives from me rather than writing a trivial program to get message 1, then 2, and so on. (If this was a commercial app, there might also be competitive concerns with URLs advertising how many items are in my database, what order they were added, and how many German tanks I’ve rolled off the line.)
The solution I originally used had some rough edges, I hadn’t thought through all the consequences. I’m looking for a new approach without these risks, and I’m not maintaining the old IDs because no one relies on them; the site has been offline almost four years and nothing still links to it.
I partitioned the space of all possible IDs so programs wouldn’t have to coordinate. I gave each server a unique ID number, and the operating system gives each running program a unique process ID (“pid”). With each process guaranteed to have different values for those first two fields, each instance of a running program keeps a simple sequence number of its messages. A 40 bit call number was the combination of these three fields:
There is a non-obvious communication problem here. Each time a program is run, that’s called a “process”. For instance, I might have three copies of the import program running at once to import three archives — one program has three processes. This is all fine, because each combination of a server ID and pid is unique. Well, it’s unique while a process is running. After a process ends, the operating system may reuse the pid, for a different or same program.
Each process needs to start its sequence number from the last sequence number used by that server and pid. That number was in a small database, and the different runs of the programs were communicating temporally. Even though this was only a single number, there was significant code to persist this communication until it was next needed.
This approach doesn’t remove the problem of programs coordinating, it shifts it to server startup time (for the server ID) and the end of the process.
numbers how do they add
There’s an important side topic here of how those three ID numbers were combined into a single call number and used publicly. If I’d just added them together there’s the chance of collision, of a duplicate ID being generated. On server 1, the process ID 1′s second call number generated would be 1 + 1 + 2 = 4, and process ID 2′s first number would be 1 + 2 + 1 = 4.
The right way to combine them is to concatenate the digits in order: 1, 1, 2 becomes 112 and 1, 2, 1 becomes 121 (technically this is a series of arithmetic right shifts and logical ORs, but that distinction doesn’t improve this post). And to avoid collisions as these numbers get large and roll over to multiple places (eg. pid 34 sequence 5 vs. pid 3 sequence 45), we have to pad with zeroes for the largest number we can support: 010000200001. (This also means a little more state for the program to handle, if it has exhausted all the sequence digits available for a given server and pid.)
Meanwhile, a wrinkle. A pid is a 15 bit binary number (on Linux by default; I don’t want to change that or have to worry about other architectures, though I did goof and allocate 16 bits for it) so the max is 32,768. If I use 5 decimal digits to store it I’m wasting space because there are 100,000 – 32,768 = 67,232 values I’ll never use. (Here’s a nice intro to number bases, if you need it.) I really want to think about these as binary numbers so I can have that efficiency.
So here’s an example call number according to the generation scheme I described above:
Why be efficient? It’s a fair question, it’s not like I’m going to run out of numbers to count up to. I could pad a hundred extra digits and be happy until the heat death of the universe. The answer is that call numbers get used in URLs. I want a link to a specific message on Chibrary to fit in a tweet, not be a full page long. If people ever have to type them in from a book or read them over the phone, they’ll be sad to type an extra eighty 0s.
I can keep URLs short by generating IDs in binary for efficiency but representing them to users in a higher base, to pack the most number of bits in each character of the URL. Base 10 is obvious, but base 16 uses 0-9 and A-F to allow 16 values per character instead of 10. We can do better than that, though.
When I first designed this system, I tried to fit 64 values per character by using 0-9, A-Z, a-z, and – and _. Base 64 has the nice property of being a power of 2, so each character fits exactly 6 binary digits. But that dash and underscore, that’s trouble. In the middle of a call number they work fine (8laN-Ft2 or 0m_mDW4z), but when they appeared next to each other (AT-_7AWL) or start/end a call number (_8YDVzJa or A5vvY63-) they’re simply confusing. Programs may not even correctly highlight them as clickable links. There aren’t any other good characters to use because URLs have a limited character set (before percent-encoding, which doesn’t help me keep my character count down). It’s a shame thorn didn’t survive.
With the lamentations of English orthography as regards binary encoding duly noted, I decided to drop – and _ to use a base 62 for call numbers. This is only log2(62) = ~5.954 bits per character, but I’m OK with that small waste.
The old system was spending a lot of bits (4 for server ID and 16 for pid) just to ensure that processes didn’t generate collisions. The standard tool for creating short collision-free identifiers is hashing.
Hashes are typically long (128 to 512 bits), but can be truncated to reduce length at increased risk of collision (this may be unsafe in cryptography, but thankfully this exercise doesn’t call for that level of insanity). The real problem is that I don’t have anything good to hash.
The message IDs I mentioned a thousand words ago are not reliably unique; a few mail programs reuse them and many archives I’ve scraped don’t include them at all. I could also hash some combination of a message’s subject, from address, and date, but all of these (and other, less stable fields) are user-controlled (when they’re not improperly edited by poor archiving programs) and someone could overwrite a message by duplicating the relevant fields in another.
GUIDs could be used to avoid collisions, but they’re 128 bits wide, which encodes to 22 base 62 digits. The idea of GUIDs is to have such a huge number that the chance of a collision is infinitesimal, but it means a lot of wasted bits in long URLs. Unlike hashes, they’re not safely truncated.
There are a lot of hashing/random generation schemes, but they all entail trading off length vs. the birthday paradox. Generating a duplicate ID in Chibrary doesn’t mean twice the cake, it means losing what may be the world’s only copy of a message.
A Different Approach
With all this in mind, I restated my goal: to create a unique integer that’s as short as possible. I pondered “shortness” and looked at it from the other side: how may bits do I get for each base 62 character?
|characters||bits||number of messages|
That’s quite a lot. I only have a half-million messages now, and my imagination for the scale of Chibrary tops out around billion messages, which fits in six characters. And I was happy with eight-character identifiers, which would let me have 218 trillion!
What would it take to use that up? Well, if I imported a thousand records every second… wait, that’s an interesting idea. What if I used a timestamp instead of server and process ID? Unix systems keep time by counting the seconds using (effectively) a 31 bit integer. Unix timestamps roll over in 2038, so add a bit and it’s fine until 2106; another gets to 2242. For that thousand-per-second sequence number, I can count to 1024 in exactly 10 bits.
If I have 47 bits available and spent 33 on time and 10 on how many were imported that second, I have 4 left over to avoid parallel generation. That’s a little tight, so I tinkered and came up with:
Version tracks the scheme used for laying out this information. If I got anything wrong I have 7 more tries to get it right before I have to add a character. It also could be considered to be 3 more bits I could later allocate to another field.
Timestamp doesn’t need the full 33 bits I mentioned before. I was thinking about bit patterns and remembered the billennium when the highest bit went on. I don’t need to track the number of seconds since 1970 because I’m not going to be importing any messages from before this system is installed. I can subtract the timestamp of when I turn it on from the current timestamp to get an extra 34 years… and as mentioned, it can steal a bit from the version field in 2150 when it would otherwise roll over.
Sequence allows 256 messages imported per process per second. This doesn’t seem like much, an importer may hit that limit and wind up sleeping until the clock ticks another second. But there’s only so many messages a process needs to load at once. The most popular mailing lists get 200-300 messages per day, so if one averaged 256 messages per day for 10 years, that’s only 256 * 365 * 10 / 60 / 60 / 24 = 11 days to import the 9.3 million messages.
Parallelization: A replacement for server ID + pid. Rather than maintain state over time, this scheme only needs to ensure two processes don’t use the same value for these 4 bits at the same time. For now the simplest way to deal with this is hardcode a hostname/program —> value table and only run one at a time. I can’t imagine needing to run 15 imports at the same time (leaving one value for the subscription watcher).
But that brings up an interesting point: if I wanted to import that hypothetical huge list, I could partition the messages 15 ways and have it imported in 17 hours.
What’s happening there is a very pure expression of the interchangeability of bits. Rather than change bits in the timestamp or sequence, this changes bits in the parallelization field. Because these call numbers are arbitrary, opaque IDs, changing one bit is as good as changing any other, as long as there’s no chance of collision.
Let’s back up and reconsider this realization in light of the goals for ID generation: unique IDs, generated across several hosts, that are not trivially sequenced by outsiders.
The parallelization field, or server + pid fields in the old scheme, keep two processes from generating the same sequence numbers. The timestamp keeps a process from generating the same ID numbers in two different runs. What all three have in common is that they mean doing a single check against global state at startup and after that the process is free to generate sequence numbers indefinitely.
The timestamp is an improvement over the server + pid field because a process doesn’t have to touch that shared state again at the end of its run. That’s not a significant performance/networking cost, it’s a risk: maybe an entire process or server dies in such a way that it fails to contact the global state to note what sequence number it got up to (eg the server is unplugged as it’s running). The next time that server + pid combination comes up it will generate duplicate IDs and overwrite messages.
This benefit of the timestamp’s has an efficiency cost: every second that it’s not inserting 256 messages (times the number of values in the parallelization field…), it’s wasting potential call numbers. I do have 218 trillion potential call numbers available, but if I waste a billion here and a billion there, sooner or later you’re talking real numbers.
I look at those tradeoffs and I recognize: to generate short IDs there’s no getting around some amount of collaboration. I don’t mind a process touching global state every startup, I can’t rely on it touching global state when it completes, and I don’t want to pay the performance hit of touching global state for every message. A lot of what I’m doing is trading efficiency for parallelism.
So let me just do that explicitly. A thousand words ago I noted parenthetically that my original scheme would sometimes run out of sequence numbers for a given server ID + process. The strategy for dealing with that was simple: die. A new process would have a new process ID and could go on importing. I can reuse that strategy:
That global state will be a “run ID”. A central coordinating process will dole out unique run IDs to each process that starts up and requests one. That coordinator can have all the paranoid sanity checks to make sure it doesn’t re-use shared state rather than the processes sharing any code for that.
Once it has obtained a run ID, each process has 14 bits for its sequence number, allowing it to insert 16,384 messages before it dies or re-contacts the coordinator to obtain a new run ID. (Which strategy it uses can depend on what it does). The number of bits allocated to run ID vs. sequence number is a balancing act of how long I’ll want to be running this system vs. how long each process can insert before needing to contact global state. This feels right to me now, but I have that version ID (and any unused bits in the run ID) to try again if it doesn’t work out. So here’s an example ID:
This is a lovely balance of efficiency and collaboration.
You may remember that I’d like these ID numbers to not be trivially enumerable so visitors who want messages in bulk prefer my packaged archives to scraping the site. With all these reminders of how interchangeable one bit is, the answer is clearly a simple transformation. Generate call numbers as described above, but stably shuffle the bits. Rearrange them so they are in a new order. As long as that order is consistent I won’t generate collisions.
This drives it home at a visceral level that it doesn’t matter whether a given bit partially identifies the schema revision or is part of an integer. Every bit disambiguates the message from half of the other 2 ** 47 possible messages. What I’m using a bit for doesn’t matter as long as it’s not reused.
This meets all my needs and even leaves me room for mistakes. Unless someone pipes up in the comments to tell me what a mistake this is, this is the system I’ll be coding and deploying for Chibrary. Thanks for reading.
id = rand(2 ** 64)