Distributed ID Generation and Bit Packing

There are two ways for programs to collaborate: they can communicate their shared state or they can partition the work so they don’t need to know each other’s state. As I’ve been rehabbing the code for my mailing list archive Chibrary I ran into this issue, and the design constraints made for an interesting puzzle. I need to generate unique IDs in multiple programs at once, and they need to be short enough to fit in URLs.

Writing is nature’s way of letting you know how sloppy your thinking is. Guindon

At more than 3,000 words, this is an unusually long blog post for me. This problem has a lot of considerations to balance (and you’ll see a lot of parenthetical caveats). I like to work through problems by writing out my thoughts, constantly tinkering with possibilities and revisiting ideas in light of new constraints or insight. I thought my (heavily) revised notes might make for an interesting post as I attack the problem from several different approaches and end with a solution that nicely connects theoretical and practical considerations. As a bonus, maybe someone will tell me I’m approaching this all wrong or ignorant of a known best practice for this sort of thing before I go live. With that preamble rambled, let’s dig into the problem.

Call Numbers

Every message in Chibrary needs to be assigned a unique identifier, a “call number”, as it’s imported. This name is a little too cute because call numbers in library classification systems usually embed some kind of addressing information (that is, books are topologically sorted, like in the classic Dewey decimal system) while my users never care where on my hard drive a message happens to be stored. But the obvious term “Message ID” was already in use as a (hopefully) unique ID attached to each email by the program first used to send it, and synonyms like “Email ID” would have been confusing.

A call number could be as simple as a monotonically increasing number (first message gets the call number “1″, second gets “2″, etc.) but this has two obstacles.

I opened this post with the big problem: collaboration. There’s always a program running to watch the system’s email inbox for new messages from lists the system is subscribed to, so what happens I want to run the bulk importer on a new archive? If both programs run at the same time they might both see the highest ID in use is 123, add one to it, and both try to add a message under the call number 124. Avoiding this collision means writing another fairly complex program to uniquely dole out IDs and all the importers slowing down to request an ID for each message they want to store.

It’s tempting to send all messages to a single importer, but this creates a bottleneck and wastes the concurrent insertions available from storing messages in the distributed database Riak. And if I’m entirely honest, I didn’t want to give up this puzzle. Chibrary exists in part for me to learn from experiments.

The second, less important problem is that sequential IDs invite abuse. I’d rather people downloaded packaged archives from me rather than writing a trivial program to get message 1, then 2, and so on. (If this was a commercial app, there might also be competitive concerns with URLs advertising how many items are in my database, what order they were added, and how many German tanks I’ve rolled off the line.)

Previous Approach

The solution I originally used had some rough edges, I hadn’t thought through all the consequences. I’m looking for a new approach without these risks, and I’m not maintaining the old IDs because no one relies on them; the site has been offline almost four years and nothing still links to it.

I partitioned the space of all possible IDs so programs wouldn’t have to coordinate. I gave each server a unique ID number, and the operating system gives each running program a unique process ID (“pid”). With each process guaranteed to have different values for those first two fields, each instance of a running program keeps a simple sequence number of its messages. A 40 bit call number was the combination of these three fields:

bitsfor…
4server
16pid
20sequence number

There is a non-obvious communication problem here. Each time a program is run, that’s called a “process”. For instance, I might have three copies of the import program running at once to import three archives — one program has three processes. This is all fine, because each combination of a server ID and pid is unique. Well, it’s unique while a process is running. After a process ends, the operating system may reuse the pid, for a different or same program.

Each process needs to start its sequence number from the last sequence number used by that server and pid. That number was in a small database, and the different runs of the programs were communicating temporally. Even though this was only a single number, there was significant code to persist this communication until it was next needed.

This approach doesn’t remove the problem of programs coordinating, it shifts it to server startup time (for the server ID) and the end of the process.

numbers how do they add

There’s an important side topic here of how those three ID numbers were combined into a single call number and used publicly. If I’d just added them together there’s the chance of collision, of a duplicate ID being generated. On server 1, the process ID 1′s second call number generated would be 1 + 1 + 2 = 4, and process ID 2′s first number would be 1 + 2 + 1 = 4.

The right way to combine them is to concatenate the digits in order: 1, 1, 2 becomes 112 and 1, 2, 1 becomes 121 (technically this is a series of arithmetic right shifts and logical ORs, but that distinction doesn’t improve this post). And to avoid collisions as these numbers get large and roll over to multiple places (eg. pid 34 sequence 5 vs. pid 3 sequence 45), we have to pad with zeroes for the largest number we can support: 010000200001. (This also means a little more state for the program to handle, if it has exhausted all the sequence digits available for a given server and pid.)

Meanwhile, a wrinkle. A pid is a 15 bit binary number (on Linux by default; I don’t want to change that or have to worry about other architectures, though I did goof and allocate 16 bits for it) so the max is 32,768. If I use 5 decimal digits to store it I’m wasting space because there are 100,000 – 32,768 = 67,232 values I’ll never use. (Here’s a nice intro to number bases, if you need it.) I really want to think about these as binary numbers so I can have that efficiency.

So here’s an example call number according to the generation scheme I described above:

fieldvalue
server2
pid14013
+sequence number983

=0010000001001111010000000000001111010111

Why be efficient? It’s a fair question, it’s not like I’m going to run out of numbers to count up to. I could pad a hundred extra digits and be happy until the heat death of the universe. The answer is that call numbers get used in URLs. I want a link to a specific message on Chibrary to fit in a tweet, not be a full page long. If people ever have to type them in from a book or read them over the phone, they’ll be sad to type an extra eighty 0s.

I can keep URLs short by generating IDs in binary for efficiency but representing them to users in a higher base, to pack the most number of bits in each character of the URL. Base 10 is obvious, but base 16 uses 0-9 and A-F to allow 16 values per character instead of 10. We can do better than that, though.

@pushcx tweets: I wish English had one more letter so I could pack 6 bits in [0-9A-Za-z].

When I first designed this system, I tried to fit 64 values per character by using 0-9, A-Z, a-z, and – and _. Base 64 has the nice property of being a power of 2, so each character fits exactly 6 binary digits. But that dash and underscore, that’s trouble. In the middle of a call number they work fine (8laN-Ft2 or 0m_mDW4z), but when they appeared next to each other (AT-_7AWL) or start/end a call number (_8YDVzJa or A5vvY63-) they’re simply confusing. Programs may not even correctly highlight them as clickable links. There aren’t any other good characters to use because URLs have a limited character set (before percent-encoding, which doesn’t help me keep my character count down). It’s a shame thorn didn’t survive.

With the lamentations of English orthography as regards binary encoding duly noted, I decided to drop – and _ to use a base 62 for call numbers. This is only log2(62) = ~5.954 bits per character, but I’m OK with that small waste.

Parallel Generation

The old system was spending a lot of bits (4 for server ID and 16 for pid) just to ensure that processes didn’t generate collisions. The standard tool for creating short collision-free identifiers is hashing.

Hashes are typically long (128 to 512 bits), but can be truncated to reduce length at increased risk of collision (this may be unsafe in cryptography, but thankfully this exercise doesn’t call for that level of insanity). The real problem is that I don’t have anything good to hash.

The message IDs I mentioned a thousand words ago are not reliably unique; a few mail programs reuse them and many archives I’ve scraped don’t include them at all. I could also hash some combination of a message’s subject, from address, and date, but all of these (and other, less stable fields) are user-controlled (when they’re not improperly edited by poor archiving programs) and someone could overwrite a message by duplicating the relevant fields in another.

GUIDs could be used to avoid collisions, but they’re 128 bits wide, which encodes to 22 base 62 digits. The idea of GUIDs is to have such a huge number that the chance of a collision is infinitesimal, but it means a lot of wasted bits in long URLs. Unlike hashes, they’re not safely truncated.

There are a lot of hashing/random generation schemes, but they all entail trading off length vs. the birthday paradox. Generating a duplicate ID in Chibrary doesn’t mean twice the cake, it means losing what may be the world’s only copy of a message.

A Different Approach

With all this in mind, I restated my goal: to create a unique integer that’s as short as possible. I pondered “shortness” and looked at it from the other side: how may bits do I get for each base 62 character?

characters bits number of messages
4 23 14,776,336
5 29 916,132,832
6 35 56,800,235,584
7 41 3,521,614,606,208
8 47 218,340,105,584,896
9 53 13,537,086,546,263,552

That’s quite a lot. I only have a half-million messages now, and my imagination for the scale of Chibrary tops out around billion messages, which fits in six characters. And I was happy with eight-character identifiers, which would let me have 218 trillion!

What would it take to use that up? Well, if I imported a thousand records every second… wait, that’s an interesting idea. What if I used a timestamp instead of server and process ID? Unix systems keep time by counting the seconds using (effectively) a 31 bit integer. Unix timestamps roll over in 2038, so add a bit and it’s fine until 2106; another gets to 2242. For that thousand-per-second sequence number, I can count to 1024 in exactly 10 bits.

If I have 47 bits available and spent 33 on time and 10 on how many were imported that second, I have 4 left over to avoid parallel generation. That’s a little tight, so I tinkered and came up with:

bitsfor…
3version
32timestamp
8sequence number
4parallelization

Version tracks the scheme used for laying out this information. If I got anything wrong I have 7 more tries to get it right before I have to add a character. It also could be considered to be 3 more bits I could later allocate to another field.

Timestamp doesn’t need the full 33 bits I mentioned before. I was thinking about bit patterns and remembered the billennium when the highest bit went on. I don’t need to track the number of seconds since 1970 because I’m not going to be importing any messages from before this system is installed. I can subtract the timestamp of when I turn it on from the current timestamp to get an extra 34 years… and as mentioned, it can steal a bit from the version field in 2150 when it would otherwise roll over.

Sequence allows 256 messages imported per process per second. This doesn’t seem like much, an importer may hit that limit and wind up sleeping until the clock ticks another second. But there’s only so many messages a process needs to load at once. The most popular mailing lists get 200-300 messages per day, so if one averaged 256 messages per day for 10 years, that’s only 256 * 365 * 10 / 60 / 60 / 24 = 11 days to import the 9.3 million messages.

Parallelization: A replacement for server ID + pid. Rather than maintain state over time, this scheme only needs to ensure two processes don’t use the same value for these 4 bits at the same time. For now the simplest way to deal with this is hardcode a hostname/program —> value table and only run one at a time. I can’t imagine needing to run 15 imports at the same time (leaving one value for the subscription watcher).

But that brings up an interesting point: if I wanted to import that hypothetical huge list, I could partition the messages 15 ways and have it imported in 17 hours.

@pushcx: The bit length of IDs in #chibrary is inversely proportionate to my cleverness * my need for uncoordinated parallel generation.

What’s happening there is a very pure expression of the interchangeability of bits. Rather than change bits in the timestamp or sequence, this changes bits in the parallelization field. Because these call numbers are arbitrary, opaque IDs, changing one bit is as good as changing any other, as long as there’s no chance of collision.

Final Form

Goofy anime character saying "You fool, this isn't even my final form!"

Let’s back up and reconsider this realization in light of the goals for ID generation: unique IDs, generated across several hosts, that are not trivially sequenced by outsiders.

The parallelization field, or server + pid fields in the old scheme, keep two processes from generating the same sequence numbers. The timestamp keeps a process from generating the same ID numbers in two different runs. What all three have in common is that they mean doing a single check against global state at startup and after that the process is free to generate sequence numbers indefinitely.

The timestamp is an improvement over the server + pid field because a process doesn’t have to touch that shared state again at the end of its run. That’s not a significant performance/networking cost, it’s a risk: maybe an entire process or server dies in such a way that it fails to contact the global state to note what sequence number it got up to (eg the server is unplugged as it’s running). The next time that server + pid combination comes up it will generate duplicate IDs and overwrite messages.

This benefit of the timestamp’s has an efficiency cost: every second that it’s not inserting 256 messages (times the number of values in the parallelization field…), it’s wasting potential call numbers. I do have 218 trillion potential call numbers available, but if I waste a billion here and a billion there, sooner or later you’re talking real numbers.

I look at those tradeoffs and I recognize: to generate short IDs there’s no getting around some amount of collaboration. I don’t mind a process touching global state every startup, I can’t rely on it touching global state when it completes, and I don’t want to pay the performance hit of touching global state for every message. A lot of what I’m doing is trading efficiency for parallelism.

So let me just do that explicitly. A thousand words ago I noted parenthetically that my original scheme would sometimes run out of sequence numbers for a given server ID + process. The strategy for dealing with that was simple: die. A new process would have a new process ID and could go on importing. I can reuse that strategy:

bitsfor…
3version
30run ID
14sequence number

That global state will be a “run ID”. A central coordinating process will dole out unique run IDs to each process that starts up and requests one. That coordinator can have all the paranoid sanity checks to make sure it doesn’t re-use shared state rather than the processes sharing any code for that.

Once it has obtained a run ID, each process has 14 bits for its sequence number, allowing it to insert 16,384 messages before it dies or re-contacts the coordinator to obtain a new run ID. (Which strategy it uses can depend on what it does). The number of bits allocated to run ID vs. sequence number is a balancing act of how long I’ll want to be running this system vs. how long each process can insert before needing to contact global state. This feels right to me now, but I have that version ID (and any unused bits in the run ID) to try again if it doesn’t work out. So here’s an example ID:

fieldvalue
version0
run8,452
+sequence number11,827

=00000000000000000001000010000010010111000110011

This is a lovely balance of efficiency and collaboration.

You may remember that I’d like these ID numbers to not be trivially enumerable so visitors who want messages in bulk prefer my packaged archives to scraping the site. With all these reminders of how interchangeable one bit is, the answer is clearly a simple transformation. Generate call numbers as described above, but stably shuffle the bits. Rearrange them so they are in a new order. As long as that order is consistent I won’t generate collisions.

generated00000000000000000001000010000010010111000110011
shuffled01000010000100100110000100001000100000010000100

This drives it home at a visceral level that it doesn’t matter whether a given bit partially identifies the schema revision or is part of an integer. Every bit disambiguates the message from half of the other 2 ** 47 possible messages. What I’m using a bit for doesn’t matter as long as it’s not reused.

This meets all my needs and even leaves me room for mistakes. Unless someone pipes up in the comments to tell me what a mistake this is, this is the system I’ll be coding and deploying for Chibrary. Thanks for reading.

TLDR: id = rand(2 ** 64)

Posted in Code at 2014-04-14 14:59 | No comments | Tags: , , , , ,

Ironwood, a Roguelike Game in 7 Days

Ironwood is a roguelike game developed for the 2014 7 Day Roguelike Challenge. “Roguelike” is a pretty poorly defined genre, but generally speaking it’s a turn-based game with generated environments, emergent gameplay from simulationism, and a steep learning curve. (For more, see an informal definition, a formal attempt, or a current take on its broad use.)

Updated 2014-04-07: You can play Ironwood in your browser right now! Big thanks to Snarky for the Javascript port: Ironwood

(more…)

Posted in Games at 2014-03-19 21:26 | 3 comments | Tags: , , , ,

Extracting Immutable Objects

In the last few weeks I’ve been rehabilitating some of the first object-oriented code I wrote in the hopes of bringing my mailing list archive back online. Lately I’ve been refactoring some of the earliest, core code: the Message class. It manages the individual emails in the system and, because I didn’t understand how to extract functionality, had turned into something of a God Class.

Yesterday I tweeted about a really satisfying cleanup:

tweet

Scott Vesely replied to ask to see the code, so I figured I’d post it.

First off, I broke the Message down into many classes:

Classes Extracted

The MessageStorage, EmailStorage, and ListAddressStorage classes are part of a Table Data Gateway layer that separates object de/serialization, saving, and searching from the domain objects themselves.

I’ve been increasingly dissatisfied with the Active Record pattern. Domain and persistence code comingled in the same object quickly become hopelessly intertwined, leading to tests that have to persist large object graphs to test simple things (eg. any time FactoryGirl is needed outside of a full-stack integration test something is very wrong), and code that treats the database as the ultimate global variable. Gary Bernhardt’s videos “What Goes In Active Records” Part 1 and Part 2 crystallized this realization. In them he strips out his domain logic to service classes in a Rails-sick implementation of the Row Data Gateway pattern (though he never uses that name).

An Aside About PoEAA

You may have noticed I’ve linked to the site for Martin Fowler’s book Patterns of Enterprise Application Architecture three times in the last two paragraphs. It’s a great book on organizing code, and I wanted to point out two cute things.

First, the site mentions “Many of these diagrams demonstrate the rather poor GIF output of Visio. The nice diagrams were redrawn for me by David Heinemeier Hansson” and you can see the patterns that appear in Rails are all nicer.

Second, the book presages the last two years of the Rails community struggling loudly with “fat models” and “fast tests“:

As the domain logic gets more complicated and you begin moving toward a rich Domain Model (116), the simple approach of an Active Record starts to break down. The one-to-one match of domain classes to tables starts to fail as you factor domain logic into smaller classes. Relational databases don’t handle inheritance, so it becomes difficult to use strategies [Gang of Four] and other neat OO patterns. As the domain logic gets feisty, you want to be able to test it without having to talk to the database all the time.

What about Immutability?

Right, the reason I started this post. (I know I’ve gone too long between blog posts when I can’t say anything without digressing into mini-rants).

Every email client should generate a globally unique “Message Id” for every email it sends; in practice there are significant numbers of omissions, mis-formattings, and duplications to deal with. The Message class had a fair amount of code related to extracting them from headers, validating them, and generating them. And it was a soupy mess:

 
# If you skimmed down here: this is awful old code I would never write now
class Message
  attr_reader :message_id
 
  # ... lots of other code ...
 
  # called from initialize()
  def load_message_id
    if message_id = get_header('Message-Id') and message_id =~ /^<?[a-zA-Z0-9%+\-\.=_]+@[a-zA-Z0-9_\-\.]+>?$/ and message_id.length < 120
      @message_id = /^<?([^@]+@[^\>]+)>?/.match(message_id)[1].chomp
    else
      generate_message_id
    end
  end
 
  def get_header header
    match = /^#{header}:\s*(.*?)^\S/mi.match(headers + "\n.")
    return nil if match.nil?
    # take first match so that lines we add_header'd take precedence
    match.captures.shift.sub(/(\s)+/, ' ').sub(/\n[ \t]+/m, " ").strip
  end
 
  def headers
    return '' if message.nil?
    message.split(/\n\r?\n/)[0]
  end
 
  def add_header(header)
    name = header.match(/^(.+?):\s/).captures.shift
    new_headers = "#{header.chomp}\n"
    new_headers += "X-ListLibrary-Added-Header: #{name}\n" unless name.match(/^X-ListLibrary-/)
    @message = new_headers + message
  end
 
  def generate_message_id
    @message_id = "#{call_number}@generated-message-id.listlibrary.net"
    add_header "Message-Id: <#{@message_id}>"
  end
end

There’s a lot wrong with this code, and it can be laid at the lack of separation of concerns. My representation of a Message is bound to the email it came from, so Message knows how to parse email headers, validate message ids, generate replacements (and this gets called by outside code to prevent duplicates), and — perhaps worst for an archive — it edits its generated ids into the original email headers! I knew that was bad when I wrote it (hence adding an extra header to tell me which I edited so I could later clean them out), but with messages so closely linked I felt like I didn’t have anywhere else to store the data.

I extracted this to a MessageId class (which a Message‘s Email creates from its EmailHeaders), but it continued to bother me that load_message_id (now MessageId.extract_id) would set its @message_id instance variable to what it extracted (a simple memoization) or, if that turned out to be invalid, generate a new message id and overwrite @message_id with that.

Mutation makes code much harder to reason about. Tests have to check interactions and properties rather than just return values. So I pushed that conditional up into a new class factory method extract_or_generate that creates a MessageId with the given message id and, if it’s not valid, instead returns a new, generated MessageId. Here’s the complete listing; you’ll see I’ve also broken things like the validity check down into small, well-named methods:

 
class MessageId
  attr_reader :raw, :message_id
 
  def initialize raw
    @raw = raw || ''
  end
 
  def valid?
    !raw.empty? and raw.length <= 120 and has_id?
  end
  
  def has_id?
    raw =~ /^<?[a-zA-Z0-9%+\-\.=_]+@[a-zA-Z0-9_\-\.]+>?$/
  end
 
  def extract_id
    @message_id ||= /^<?([^@]+@[^\>]+)>?/.match(raw)[1].chomp
  end
 
  def to_s
    if valid?
      extract_id
    else
      "[invalid or missing message id]"
    end
  end
 
  def == other
    other = MessageId.new(other.to_s)
    return false unless valid? and other.valid?
    to_s == other.to_s
  end
 
  def eql? other
    self == other
  end
 
  def hash
    to_s.hash
  end
 
  def inspect
    "#<MessageId:%x '%s'>" % [(object_id << 1), to_s]
  end
 
  def self.generate_for call_number
    raise ArgumentError, "Missing call number to generate MessageId" if (call_number || '') == ''
 
    new "#{call_number}@generated-message-id.chibrary.org"
  end
 
  def self.extract_or_generate str, call_number
    mid = new(str)
    return mid if mid.valid?
    return generate_for call_number
  end
end

MessageId objects are now small value objects. They’re easy to use and understand. Previously the tests were a slow, cluttered mess that instantiated Messages from fixtures or reached deep inside to set headers. As Sandi Metz said, “When testing is hard, I’m missing an object.”.

Now I test more conditions more thoroughly and they’re totally straightforward reading. Because you really get to see code shine when it’s used, here’s the complete spec for MessageId:

describe MessageId do
  describe '#valid?' do
    it 'is valid if well-formatted' do
      expect(MessageId.new('<valid@example.com>')).to be_valid
    end
 
    it 'is invalid if nil' do
      expect(MessageId.new nil).to_not be_valid
    end
 
    it 'is invalid if empty string' do
      expect(MessageId.new('')).to_not be_valid
    end
 
    it 'is invalid if longer than 120 characters' do
      str = ('x' * 109) + "@example.com"
      expect(MessageId.new str).to_not be_valid
    end
 
    it 'is invalid if it does not include an id' do
      expect(MessageId.new 'I like cats').to_not be_valid
    end
  end
 
  describe '#has_id?' do
    it 'does with and without angle brackets' do
      expect(MessageId.new '<with@example.com>').to have_id
      expect(MessageId.new 'without@example.com').to have_id
    end
 
    it 'does not with multiple @s' do
      expect(MessageId.new 'bad@heart@example.com').to_not have_id
    end
 
    it 'does not with no text before or after the @' do
      expect(MessageId.new 'bad@').to_not have_id
      expect(MessageId.new '@example.com').to_not have_id
    end
  end
 
  describe '#to_s' do
    it 'returns the extracted id' do
      expect(MessageId.new('<id@example.com>').to_s).to eq('id@example.com')
    end
    it 'does not pass invalid ids' do
      s = MessageId.new('cats are great').to_s
      expect(s).to_not include('cats')
      expect(s).to include('invalid')
    end
  end
 
  describe '#inspect' do
    it 'includes the .to_s' do
      expect(MessageId.new('id@example.com').inspect).to include('id@example.com')
      expect(MessageId.new('cats rule').inspect).to include('invalid')
    end
  end
 
  describe '#==' do
    it 'considers equal based on extracted id, not raw' do
      expect(MessageId.new('id@example.com')).to eq(MessageId.new('<id@example.com>'))
    end
 
    it 'coerces strings to MessageIds to test' do
      expect(MessageId.new('id@example.com')).to eq('id@example.com')
    end
 
    it 'does not consider invalid ids equal to themselves' do
      expect(MessageId.new('cat')).to_not eq(MessageId.new('cat'))
    end
  end
 
  describe '#hash' do
    it 'hashes consistently' do
      expect(MessageId.new('id@example.com').hash).to eq(MessageId.new('id@example.com').hash)
    end
 
    it 'uniqs' do
      # http://stackoverflow.com/questions/20388090/arrayuniq-ignoring-identical-hash-values
      a = [MessageId.new('id@example.com'), MessageId.new('id@example.com')]
      a.uniq!
      expect(a.length).to eq(1)
    end
  end
 
  describe '::generate_for' do
    it 'creates based on call number' do
      expect(MessageId.generate_for('0123456789').to_s).to include('0123456789')
    end
 
    it 'raises without a call_number' do
      expect {
        MessageId.generate_for ''
      }.to raise_error(ArgumentError)
    end
  end
 
  describe '::extract_or_generate' do
    it 'given a valid message id, creates from that' do
      mid = MessageId.extract_or_generate('id@example.com', 'call')
      expect(mid.to_s).to include('id@example.com')
    end
 
    it 'given an invalid message id, generates from call_number' do
      mid = MessageId.extract_or_generate('srsly cats man', 'call')
      expect(mid.to_s).to include('call')
    end
  end
 
end

Reward for scrolling down here: a sneak peek of the keyboard I assembled myself and typed this post on.

CTF Liveblog

Friday

It’s on Friday afternoon and time to start working on Capture the Flag game. After chatting with friends I have some vague title/theme ideas. In the sprites there’s this row of pieces that look like a broken crest:

crest pieces

I’m not really sure what they’re supposed to be, but they prompted some theme ideas that fit with teams playing CTF over and over on random maps. Maybe there was a great evil sealed away that has broken free. It was defeated, but its wild magic broke the world into ever-changing pieces. Which is not so bad as things go, but it would be nice to put things back together again. There’s a bit of a tragedy of the commons, though – whoever is the one to reassemble the seal to fix the world will have the power to shape it, so rival groups (families? clans? guilds?) are trying to take the pieces away from each other and remake the world to their design.

Yeah, kinda cheesy. Sue me, I played a lot of video games in the 90s, that does things to your brain. But it explains why people would be playing CTF, and points towards some kind of name about a shattered crest, or collecting the seal, or something.

Please do leave better suggestions in the comments. And leave this post open in a browser tab; I’ll be updating it and twitter over the next two and a half days as I make my game.

Set up a new Rails 4.0.0 project with Ruby 2, added pg, ActiveModel::Serializers, rspec, better-errors. Oh, and lunch. And tunes.

Stood up a Heroku app with backups, Mandrill, papertrail. Very slow initial deploy, hope that’s a one-off. Removed the secret_key_base from the initializer; why is that in source to get checked in by default?

Hm, a shame that the terrains in my tiles aren’t set up for Tiled. This map is going to be ugly. :)

Welcome to uglyworld. I’ve hand-created the first map.

Uglyworld

Map modeled. And I always forget I need the Heroku user env plugin because it is inexplicably not the default.

Game, player, and squad models. Maybe a little out of order here, and no serializers/controllers yet.

I’ve dropped in Crafty, it got a few recommendations and looks nice and small. Let’s see how painful the asset pipeline makes it to add this…

Added Haml and game creation. Going to be seeing uglyworld in a game shortly…

Hey look, it’s time to waste a half an hour debugging. Yes, it’s Asset Pipeline O’Clock.

Lots of plumbing – assets, CoffeeScript, passing json, initializing some game objects… but made the first render of the map with Crafty:

first render of map with crafty

First display of a unit on the map exposes some kind of scaling bug (which only took a few minutes to track down and fix; Tiled will export images that are the size of your current zoom, crazy as that sounds):

002 first render with unit

Spent the last two hours learning the ins and outs of Crafty’s click handling. The user clicks to select units and can click off onto nothing to deselect the current unit. In Crafty, this meant adding an invisible Entity the same size and shape as the Stage:

    @nothing = Crafty.e("2D, Canvas, Mouse")
      .attr({ z: 0, x: 0, y: 0, w: Crafty.stage.elem.offsetWidth, h: Crafty.stage.elem.offsetHeight, z: 0 })
      .bind 'Click', (e) ->
        console.log('nothing clicked', e, this)

And with that, it’s time to call it a night. Good sleep means good code. I’ve now touched all the major tools I’ll be using (which always means burning some time to learn some little particular) so I should be able to get a lot done tomorrow.

Saturday

I woke up and surfed a little, saw a Reddit thread for game design progress and posted in it. I was committing the sin of looking at the internet before I even really woke up, so I have a half-hour of morning chores, breakfast, etc. before I actually get started.

OK, took a lot longer than expected to get everything done, but I’m glad to keep up my daily routine. Time to add UI to selecting characters and doing A* to move them.

Speaking of taking longer than expected, I got movement cost calculating correctly. I meant to do this in more, smaller steps but there were just too many things tangled up together in terrain costs, looping over tiles, and calculating. But now I can click on a Unit to see its movement options and click off it to deselect the unit and hide the movement options.

movement costs correct

Decrementing movement points as a Unit moves around:

spending movement points

Oh right, the other major system I hadn’t touched was animation. But now I have Units taking the shortest path to where you click, walking around obstacles as needed. And then I can click again to move it some more. No screenshot for this one because I don’t want to figure out how to take a video of my desktop and then make an animated gif. But it works.

This went a lot slower than I’d hoped, and all to reinvent the basic movement I’ve seen in dozens of tactical games, which is kind of disheartening. But it’s a hell of a lot more progress than I’ve made on a tactical game, so I’m pretty happy. I’m bumping stances off the weekend to-do list (it’s fiddly). Tomorrow is fog of war (a variation on movement, actually), sending movement info up to the server to be synced between players (should be straightforward), playback of turn events (a little tricky), guard/tagging (uh… probably painful), and flag capturing (should be straightforward).

Which is to say, a lot of stuff. I’ll definitely have something where you can move characters around, but a game that one player can win or lose is unlikely. Hm. I wonder what’s on the calendar for next weekend…

Sunday

I’ve been up for an hour or so, converting the weird recursive code that animated movement (Crafty doesn’t have an animation queueing system, so I had to create callbacks that called tween() to walk to the next tile, and so on) into an AnimationQueue Crafty component that works by iteration (with per-step and completion callbacks). So now I’ve added that to my Unit entity and animation is just matter of converting the unit’s Route into an array of locations to move to, making it much clearer. This was optional for the moment but will be a huge help for playing back turns. Here’s that Queue, it is MIT/GPL dual-licensed like Crafty itself, if you’d like to use it:

# an interative animation queue for Crafty
Crafty.c "AnimationQueue",
  init: ->
    @requires("Tween")
    @step_done_callback = undefined

  # animations should be an array of hashes with the keys
  #   tween: a hash that gets passed to tween()
  #   frames: number of frames, passed to tween()
  #   callback: optional, a callback to run after this step completes
  # done is a callback to run when all steps are completed
  animate: (animations=[], done) ->
    @animation_queue ||= []
    Array::push.apply @animation_queue, animations

    # all done callbacks will fire, even if animate() is called multiple times
    # before finishing. Queueing a new animation from a callback results in
    # that callback executing immediately, because the queue is in the middle
    # of finishing. Use window.setInterval(function () {foo.animate(...)}, 0)
    # to queue after your callback is done.
    @done_callbacks ||= []
    @done_callbacks.push done if done

    # unbind any existing copies of our callback so that multiple calls to
    # animate() don't trigger multiple step endings
    @unbind('TweenEnd').bind 'TweenEnd', (property) =>
      # ignore TweenEnd for anything else
      return unless property == 'AnimationQueueStepDone'

      @step_done_callback() if @step_done_callback
      @step_done_callback = undefined

      if @animation_queue.length == 0
        done() for done in @done_callbacks
        @done_callbacks = []
        return

      # pop off next animation
      next = @animation_queue[0]
      @animation_queue = @animation_queue[1..]

      # get the next animation tweening
      next.tween['AnimationQueueStepDone'] = true
      @tween(next.tween, next.frames)
      @step_done_callback = next.callback

    # kick off TweenEnd callback loop
    @trigger('TweenEnd', 'AnimationQueueStepDone')

    this

It actually turned out to be really easy in Ubuntu to use gtk-recordmydesktop, ffmpeg, and gifsicle to make an animated gif of a guy moving. There’s terrible video compression artifacts (I’m not going to dick around with settings to fix that right now), but here it is:

animated movement

Oh right, FOV is not a variation on movement; it can’t turn corners. Thank you, Peter of 2009, for cleaning up and commenting the shadowcasting code in that roguelike you never finished. This is inverse shading (a little easier to debug) and boolean (see or don’t, no reduction over distance), but I saved myself a ton of time:

inverse fog with hard obstacles

Or not. I forgot this algorithm doesn’t walk tiles going out from the unit, so I can’t decrement terrain visibility costs. I’m going to eat and read raycasting algorithms. If there’s a simple algorithm that works I’ll implement it, but otherwise terrain visibility costs just got bumped off the weekend plans, and trees are going to be obscenely good for hiding in (because a unit can’t see past them, so in the trees visibility is the immediate ring of 8 trees).

Ouch. It took me a couple hours to evaluate other algorithms and then implement a raycasting one (and I lost 90m to roxterm deciding to eat any “i”s I typed, that was painful to track down). I knew this afternoon I should’ve punted on this, but I really, really wanted this mechanic and decided to go for quality over completion. Which partly defeats the point of trying to hack out a game in a weekend… I already knew I tend to doggedly pursue a problem until it’s dead rather than back off and re-evaluate. Tenacity helps make me a good debugger, but a worse developer when I’m failing to prioritize.

But still, I’m happy with this weekend. I made great strides on a game and exercised the dusty algorithms part of my brain. If I can get another free weekend, I can have a playable demo. :)

I’ll close with a couple pictures of how that fog turned out. Here it is with some artifacts: notice the tiles immediately southwest and southeast of the unit have a visibility score like the unit is looking through the tree rather than the grass. It’s arbitrary which their vision passes through, so I’d prefer to give them a little extra sight distance.

raycasting fog with artifacts

Here’s that artifact fixed:

correct vision costs

Here’s an experiment with fog shading. The unit can see into the trees they’re next to, but not into the farther trees, just as I wanted:

fog

If the unit is in the trees, they’ll have a little bit better vision out than someone looking in. A unit looking into the trees would pay the visibility cost for the trees this unit is standing on, but this unit doesn’t have to pay for the trees they’re standing on. Sneaky, sneaky:

fog hiding in trees

And I’ll end with a snap of the Trello board before I close it:

final trello board
Posted in Games at 2013-08-09 18:23 | 2 comments | Tags:

Random The Flag

Oryx 16bit Fantasy Characters

I’m going to spend next Friday to Monday making a game. I was inspired by Oryx’s 16bit sprites, they’re a beautiful, cheap resource for game prototyping (the license is a bit confused for business purposes). I’ve been wanting to (and failing to) make games for years, so I’m going to ensure I succeed by defining the game by the time I spend rather than the infinitely long wish list plan I can invent.

I’m going to make a turn-based strategy multiplayer game of capture the flag. (I’m partially inspired by the old DOS CTF game and also League of Legends, but mostly by the lovely fantasy art.) I don’t want to go into too much detail about the gameplay, suffice to say that it’s largely about making good decisions with very incomplete information.

I have a public Trello board with design notes and my to-do list. If you’re curious to follow along, watch that or follow me on Twitter. That public board accepts comments and some items are tagged as open questions, so please do contribute your thoughts.

Capture The Flag

Visibility and Randomness

Making the most of limited visiblity is the core of the game, directing your units to evade notice while searching out your opponents. This mechanic needs to feel believable at first play and be simple enough for experienced players to think through in detail without painfully slowing down the game.

Visibility starts with a unit’s Sight stat. Then subtract the visibility penalty of intervening terrain. If the resulting number is higher than another unit’s stealth stat (modified by whether they’re running or hiding), they’re seen. Here’s an example of a knight with 28 sight looking across some terrain to perhaps see a rogue with 7 stealth standing in some trees:

Visibility example

This is where randomness might come in. Maybe instead of a simple subtraction, the searching unit could roll 4d6, or the hiding unit could roll for its stealth score, etc. The searcher might be unlucky and miss a poorly-hidden unit, or get lucky and see even a well-hidden one.

This would allow for false negatives: the searcher might be wrong (“false”) in thinking they see no one (a “negative” result). And it opens the door to adding false positives: occasionally display a unit where there is none, just to confuse the searcher. With the randomness, the player must wonder if the unit they saw for just a turn was a glimpse of a well-hidden unit or a mistake.

Rather than keep digging into that, let me explore other opportunities for randomness. Though I doubt I’ll be able to fit it into the weekend, I’d like to randomly generate maps.

Each player will control five or six units in each game, but I want to eventually have something scores of character classes (but only three or four this weekend). So there must be a selection process of some kind, which I’ve created a Trello card for.

I’m leaning strongly towards the Highgrounds-like system, where a player selects a pool of 10 characters and are given a random five when the game begins. Perhaps a new, unused charackter will be chosen to replace units that get tagged out. The player will have choice in team composition, but also have to adapt to the circumstances. (If players have perfect choice in their squads, it’s likely a few combinations will come to dominate unless the game is balanced incredibly well.)

Internal vs. External

I’m averse to randomness in visibility (or tagging, for that matter) but eager for randomness in maps and squads. I’m writing this post after kicking around game design ideas with @karstencode and @sourcecodenomad all day and struggling to articulate why.

After writing all this, the best I can explain is that the difference between randomness in visibility and in map or squad selection is whether that randomness is in the core game actions or not.

Maybe I’ve been brainwashed by David Sirlin, but I don’t want randomness in the heart of the game. I want it to be a test of skill. Randomness there is noise.

When the map or squad layout is random, that’s external to the core mechanics of game. It sets up the circumstances under which the players act rather than tinker directly with their ability to execute on their plans. I like Chess960 for similar reasons, randomness changes the early game from a test of how well the player has memorized standard openings to dynamic reaction to changing circumstances.

What I really don’t want is for the outcome of a game to be based on randomness. Late in the game a single missed spot, false positive, or other quirk of fate could push a pivotal unit out of place and cost a player the game. I suppose another difference is that the places I like randomness are the setup of the game rather than during interactive gameplay.

Designing

I’m having a lot of fun designing this game and talking through tradeoffs with friends. If you liked this glimpse into the process, please drop by my Trello board to see more and throw in your two cents. And watch this space; I’ll be tweeting and liveblogging as I go.

Posted in Games at 2013-08-04 14:08 | 6 comments | Tags: , ,

Personal Workflow

For about a year I’ve been using Trello, a free web app for organizing notes, to track my personal to-do lists across various projects. I’ve used it to create the Well-Sorted Version (which included repeatedly proofing 600 pages of gibberish) and update NearbyGamers from Rails 2.1 to 3.2.13 (while moving it from a VPS to Heroku and from MySQL to Postgres — a yak-shaving marathon) while staying on top of daily chores and other life maintenance. For the first time I feel reliably productive and in control of the overwhelming procrastination that’s kept me from from finishing these and many other projects for years.

The general idea started with the book Getting Things Done. This gave me the core idea of having one place to track everything with specific steps for what do next so I don’t get hung up on knowing what I should do or where/how to start. The system it describes was too complex and heavy for me, but it gave me a rough guide.

I don’t want to delve into the nuts and bolts of my procrastination, but I do want to mention the books The Now Habit and The Procrastination Equation as the only two helpful books of the many I’ve read on the topic. The former is a bit more touchy-feely and has useful strategies and thinking habits. The latter has hard science and insight into specific ways to get stuck and unstuck.

Overview

Personal Trello board

Trello breaks notes down into cards, which can have notes, checklists, and file attachments. Cards are manually sorted into columns called lists, and a set of lists is a board. I only use one board so that I can see everything in front of me at a glance, and cards generally move across lists from left to right as things get done. The lists are “Big Picture”, “Later”, “This Week”, “Today”, “Done <year>-<month>”, and “Waiting On”.

The first list is “Big Picture”. It holds template cards for my daily and monthly checklist templates. At the start of each day or month I copy them to new cards over in my “Today” list. The list also checklist templates for travel and “Clean all the things“.

Trello includes colored labels you can add to cards. I use them to mark what project something is related to: happy green is personal stuff, like spending time with friends; boring yellow is life maintenance like taxes and chores; orange is NearbyGamers because that site is kind of orange, calm blue used to be the Well-Sorted Version and is now unassigned, active red is studying, and I don’t use purple. (Keeping myself to one or two projects instead of a dozen fun ideas that I drop whenever I get stuck is part of my success.) I keep a card for each of these in the Big Picture list to track long-term progress on the projects and remind me of what my overall goals are: to have a good life where the necessities are squared away and I’m making things.

Daily and Monthly

The daily card as a checklist has things that need to get done every day. Every day I copy it from “Big Picture” to a new card named for the day. It sounds simple enough, but it’s where most of my thought has gone into this system. Working through this checklist sets the rhythm for my day. By following it I don’t have to think and worry about a lot of little distractions; it is the skeleton that I flesh out with creative work.

It starts with “Make the bed”, continues through a few daily chores, includes “Inbox Zero” and “Browser Zero”, and ends with “Floss and brush”. I have only rarely missed those first and last tasks, but they’re still important to have on the checklist. It’s caught me from forgetting the obvious, and I like starting and ending my day with the certainty that I’m doing what I need to. If you’re at all in doubt for the value of checklists, read the excellent book The Checklist Manifesto. Also, it helps to see that daily progress bar fill up a bit right from the start, it gives momentum for attacking the things I don’t enjoy.

“Inbox Zero” means I need to clean out my email and other inboxes rather than use it for reminders and a to-do list. It’s good at getting me information, I need to get that squared away. “Browser Zero” means I want to see all the tabs closed on my browser rather than have a dozen things I plan to read later. The browser is not a reading queue. I get to check these boxes if I reach both of these at some point during the day, though I don’t tend to spend a lot of time there.

My daily checklist is now 13 items, and I only change it deliberately and infrequently. In my first attempt I started with two dozen things that needed to be done or checked. I think I made it two days before I abandoned that attempt for months, disheartened. This time I started really small: it had “Make the bed” and “Floss and brush”, both things I knew I could do. After I made the habit of checking things off and felt the glow of a completed to-do list, I added a third thing. As a rule of thumb, I’ll only add something if I’ve completed the list for as many days as there are items on the list. If I want to add a 14th item (and I do), I won’t add it until I’ve gotten the current 13 things done for 13 days straight. It seems like a silly rule, to not add things I need to do to my to-do list, but it’s more important to reinforce that I really can do everything I set out to do and grow at a sustainable pace.

The monthly checklist is a lot less interesting, it’s a list of chores. Check on backups, refill soap dispensers, get a haircut in even months, etc. I also make a new Trello list for archiving cards I’ve finished this month; I’ll explain that below.

Later and This Week

Each card is one decent-sized thing to do. More than a few minutes (or I’d just do it immediately) and less than a half day (or I get discouraged at how long it lingers). The “Later” list holds a month or so of cards, and daydreamy “wouldn’t it be cool if” ideas go off into a text file so I don’t stare at them and get discouraged at how infrequently they get done. Most cards start on “Later”, though time-sensitive ones will start in the “This Week” or “Today” lists if needed.

Cards always have to be flowing towards the “Done” list. I had one card titled “Migrate NearbyGamers to current Rails” that lingered for months because I couldn’t predict smaller tasks. I had to fix one bug and see what new bug shook out and repeat that endlessly. I should have made a task for each bug as I went so I was building up a record of my progress. With one card that never moved, even as I coded and coded I felt like I was standing still.

I fill the “This Week” list on Monday mornings. I haven’t yet gotten a feel for estimating the size of things so that I am consistently emptying it each week, though I’d like to. The cards tagged with the green personal label are the best ones. I use them to “unschedule” as suggested by The Now Habit. I start my planning with the things like friends, family, good books, and downtime that make life worth living to make sure that I have enough of that in every week. Then there’s essential life maintenance, starting with things that are due this week. As suggested by Now Habit, I stop to think about the consequences of not doing them. Some of them, like updating old passwords, can be safely ignored a while as long as I accept the security risk. An unpaid bill might bring a penalty fee, phone call, or black mark on my credit report — but is not the end of the world. Ignoring deadlines happens only rarely, but knowing these tasks are choices I can make rather than Official Stuff I Have To Do is liberating.

Today

The most important thing about the “Today” list is that it not have too many things. It needs to be empty at the end of the day, not have a bunch of sad cards that I have to drag back to “This Week” to replan. Sometimes this means breaking down cards I know I won’t finish in a day into several cards, but mostly it means assigning less than I know I can do in a day. There will always be surprises and distractions in every day. It’s discouraging to not finish and heartening to be happily working away and pull in an extra card because everything else is done.

I’ve tried task-tracking systems with significant estimating and scheduling components — the book Time Management for System Administrators is an excellent one that I had a lot of success with at a previous job. I currently don’t get a lot of value out of that, so I dropped it. I try to arrange the tasks so I have large blocks of creative time with breaks for food and little chores.

Monthly and Waiting On

As I finish tasks I drag the cards to the “Done <year>-<month>” list rather than delete or archive them. I like seeing my accomplishments and I can glance at the colors to see if I haven’t been spending enough time on myself or a project. Every month I archive this list and create one for the new month. I like the regular opportunity to reflect on what I’ve done and what I’m going to do next.

Last, I have a “Waiting On” list for things I’m waiting to hear back about. It was too frustrating to have cards lingering from day to day or week to week because I was waiting for paperwork in the mail. I glance at it daily and drag it back to “Today” to prod people as needed. If you let the FBI read your email without a warrant I’ve heard nice things about Boomerang, but I haven’t needed an automated system.

Workflow

When I sat down to write about how I drag little digital cards from left to right I didn’t think I’d spend 1,800 words on it, but I’ve put a lot of thought into a system for doing a lot of work. I wrote not because I think this system is perfect or for anyone else. I wrote because I was stuck and frustrated for a long time and I was helped by seeing what other people did (well, when that wasn’t a form of procrastination). I hope it gives you ideas on how to improve your own processes.

Posted in Life at 2013-06-18 19:42 | 4 comments | Tags: , , ,

The Well-Sorted Version

The Well-Sorted Version is a longtime project I recently finished. I wanted to blog a bit about the technical production of it, so please check out that link if you want this discussion to make any sense.

a page from Well Sorted Version

I’m glad I didn’t start this project ten or twenty years ago or it would’ve been an order of magnitude harder to prepare camera-ready copy for printing. Instead, I only had to send a PDF and a check to receive my 26 bound volumes. Let me work backwards from that and down into the lowest levels of what it took to produce the WSV.

I found Grimm Bindery after an exhaustive search of several hundred print-on-demand printers. Most had websites that made it clear they weren’t able to print with the quality and materials I needed, but I had to email and call dozens to find the handful that met my needs. I picked Grimm because they were the best-organized and had no trouble answering my many questions. If I were printing a run that was larger or using more common options, though, I think the only way to pick a printer would be to winnow down to finalists and have each print a single copy of your work so that you can judge based on actual output. POD is cheap enough that this is entirely reasonable.

Before I had a PDF to send them I had to write the code to generate it. I believe almost all books are produced with Adobe InDesign nowadays, but I was already familiar with the free LaTex typesetting system and it was easy to integrate into the alphabetizing code I was writing, so I never seriously considered it.

Typesetting the book was a long, interesting challenge. I had eventually had a 32 item task list for things to address, from laying out each type of text (book heading, chapter heading, verse and chapter numbers, main text) to laying out the cover to checking the letterspacing on every combination of lower and upper-case letters (104 in all). Accomplishing this meant finding TeX packages to achieve the effects I wanted (like Lettrine for the inset chapter numbers). As an aside, TeX produces significantly better output if you use semantic markup like \par instead of \hspace{0 pt}\newline.

TeX shows its age with some annoying misfeatures (still there for backwards compatability), though, like it assumes things are installed system-wide instead of bundled with your project, and it treats relative imports as relative to the directory you ran the tex command from rather than relative to file doing the include. In the end I had a shell script to work around these and generate the pdf:

 
#!/bin/bash
 
set -e
 
cp input/wov.tex output/wov.tex
cp garamond/*ggm* .
TEXINPUTS=.:input//:/usr/share/texlive// /usr/bin/pdflatex -jobname kjv wov.tex
TEXINPUTS=.:output//:/usr/share/texlive// /usr/bin/pdflatex wov.tex
/usr/bin/pdflatex two-up.tex
#/usr/bin/pdflatex sample.tex
 
rm -f *.aux *.log *ggm*

That sample.tex you see commented out there was a simple test file I used for experimenting with packages or small snippets of markup. I also organized the bible by breaking out each book into its own .tex file that would be included by a master project file that included all the styling. When I wanted to test layout or book-specific issues I could comment out all the other includes and render a pdf with just one book (~3s) instead of the entire bible (~70s). Small, rapid iteration was vital as I pushed text around by 72nds of an inch.

I had to write a program to alphabetize TeX files with the bible in them. Here’s an example:

\begin{BibleBook}
\BBook{The First Book of Moses, called Genesis}\BFont
\BChap{1}\BVerseOne{}In the beginning God created the heaven and the 
earth. \BVerse{2}And the earth was without form, and void; and darkness was
upon the face of the deep. And the Spirit of God moved upon the face 
of the waters.

The text is clearly there to alphabetize, but the alphabetization code had to be smart. It couldn’t alphabetize the text of commands (\BBook) but it DID have to alphabetize the arguments to some of those commands (\BBook but not \begin). I was tempted to do awful things with regular expressions (it’s a weakness) but instead I wrote my first real parser using Parslet.

I slowly built up the parser from individual rules to recognize commands and text. I ignored a lot of the complexity of TeX (like optional arguments and all math) because I was only interested in parsing this one set of files that didn’t use those things. A lot of this work was a dance between improving the parser and tweaking the source markup to make it easier to parse.

 
class Tex < Parslet::Parser
  rule(:backslash) { str '\' }
  rule(:command) { (backslash >> match('[a-zA-Z]').repeat(1)).as(:command) >> option?.as(:options) }
  rule(:command?) { command.maybe }
 
  rule(:option)   { str('{') >> (command | match('[a-zA-Z0-9\[\]. ,\r\n]').repeat()).as(:option) >> str('}') }
  rule(:option?)  { option.repeat(0) }
 
  rule(:comment) { str('%') >> any.repeat.as(:comment) }
 
  rule(:texspace) { str('\/').as(:texspace) }
 
  rule(:text) { (backslash.absent? >> any).repeat(1).as(:text) }
 
  rule(:line) { comment | (command | texspace | text).repeat }
  root(:line)
end

Then there’s a bunch of glue code to schlep in the data from the files and out to sorted versions. This was originally a big imperative mess, but I used some of the ideas in Gary Bernhardt’s excellent Functional Core, Imperative Shell to separate out the logic of pulling out and replacing letters from reading and writing files.

I’m not going to paste all that code, but it looks pretty much like you’d expect. It pulls in all the files, parses them, extracts and sorts all the letters, and then pours them back into files. There’s also a couple hundred lines of specs.

I’ve wanted to type the words “the hard part was” several times while writing this, but really there were several large, hard parts: parsing TeX, typesetting, find a reliable printer, and having the endurance to keep moving forward one small step at a time over 14 months. This blog has been pretty quiet lately, in part because I’ve been busy with the WSV and in part because I’ve been discouraged by feeling like I can’t finish things. Now I’ve finished a big thing and I’m feeling really good about it, so fingers crossed for more blog posts. :)

You sit in a room for years making up a story & drawing little pictures & then someone asks “what was the hardest part?” … It was the years Bryan Lee O’Malley
The rather thick WSV
Posted in Code at 2013-05-31 04:15 | 5 comments | Tags: , , , ,

Inheriting From Built-In Ruby Types

This post originally appeared on the DevMynd blog.

The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information.
– Alan Perlis

I was teaching a class on refactoring and wanted a real-world example to demonstrate finding a class to extract. My students were Rails developers, so I immediately knew I wanted to show an example of an ActiveRecord model.

After skinny controllers and fat models became a Rails best practice ActiveRecord models have tended to grow without bounds. It’s easy to push code down into them, but developers seem to have some kind of mental block causing them to think they can’t create classes that aren’t more ActiveRecord models.

Though I’ve seen this often enough I couldn’t use any client code for my example. I scratched my head a moment and thought of Discourse, a new forum project. I went directly to the User model because it’s so commonly a god object. (Just to be clear, I’m not trying to pick on Discourse. This is a fairly minor improvement to address a very common problem.)

I scrolled down past the usual mass of associations, validations, and hooks to the method definitions. Here’s the first eight:

 
  def self.username_length
  def self.suggest_username(name)
  def self.create_for_email(email, options=())
  def self.username_available?(username)
  def self.username_valid?(username)
  def enqueue_welcome_message(message_type)
  def self.suggest_name(email)
  def change_username(new_username)

From this quick glance there’s a Username class waiting to be extracted. It’s in the name of several methods, it’s the sole argument to many methods, and most of the methods are class methods, implying they don’t maintain any state (class-level variables being rare in Ruby).

It’s discouraging to think about extracting a Username. It’s stored as a varchar in the database and ActiveRecord will choke on validations if it can’t treat it as a string.

The solution is straightforward: Username should inherit from String. Username is-a String, and keeping it in the built-in type is why this code sprawls.

I extracted the below class, changing as little as possible, mostly just removing ‘username’ from the start of method names and using self where appropriate. I left behind the unrelated create_for_email, enqueue_welcome_message, suggest_name, and the temptingchange_username, which was about editing aUser.

 
  class Username < String
    def username_length
      3..15
    end
 
    def suggest
      name = self.dup
      if name =~ /([^@]+)@([^\.]+)/
        name = Regexp.last_match[1]
 
        # Special case, if it's me @ something, take the something.
        name = Regexp.last_match[2] if name == 'me'
      end
 
      name.gsub!(/^[^A-Za-z0-9]+/, "")
      name.gsub!(/[^A-Za-z0-9_]+$/, "")
      name.gsub!(/[^A-Za-z0-9_]+/, "_")
 
      # Pad the length with 1s
      missing_chars = username_length.begin - name.length
      name << ('1' * missing_chars) if missing_chars > 0
 
      # Trim extra length
      name = name[0..username_length.end-1]
 
      i = 1
      attempt = name
      while !attempt.available?
        suffix = i.to_s
        max_length = username_length.end - 1 - suffix.length
        attempt = "#{name[0..max_length]}#{suffix}"
        i+=1
      end
      attempt
    end
 
    def available?
      !User.where(username_lower: lower).exists?
    end
 
    # export a business name for this operation
    alias :lower :downcase
  end

And it’s used as:

 
  u = Username.new 'eric_blair'
  u.available?
  new = u.suggest
  User.new username: Username.new('Samuel Clemens').suggest

The User class got a lot simpler now that it doesn’t know all the business rules about usernames. I left the validation in User because it’s the thing being persisted to the database, though if it wasn’t for the Active Record pattern I’d want to move that over as well.

Now Username is a simple immutable value object that’s easier to reason about and test. It adheres nicely to the SOLID principles, and it’s an uncommon nice example of inheritance working well in Ruby.

One caveat is that User objects Rails instantiantes will return Strings on calls to User#username. We’ll need to write a getter to instantiate and return a Username object, though Rails before 3.2 included composed_of for this:

 
  class User < ActiveRecord::Base
    def username
      Username.new(read_attribute(:username))
    end
 
    # or, pre Rails 3.2:
    composed_of :username, converter: :new
  end

(And to the curious: no, I haven’t contributed a patch back to Discourse. They have a Contributor License Agreement to backdoor their ostensible open source licensing.)

Posted in Code at 2013-04-12 15:02 | No comments | Tags: , , ,

Social Media for Programmers

A friend of mine has been accepted to DevBootcamp and I’m pleased to see the coursework is beginning even before classes. He’s not so pleased — really, he’s puzzled by the emphasis on social media:

So far the hardest part of this entire Bootcamp experience is looking to be the social aspect of it.

The endless chatter on facebook that I have to pay attention to in case someone leaves a comment of actual importance. I had to create a twitter account for the application and now that is being reinforced with Socrates but I’d be surprised if I’ve checked the twitter account more than one in the last 3 months. They also want me to use Tumblr, LinkedIn, and Quora. I feel like I’ll have to spend 2 hours a day looking through the social media instead of actually working. [...]

Is this just a necessity of the field? Do you actually use all of these?

Wow, that’s tossing a lot of social media at you. I guess they want to be really sure that you guys talk and form a community. :) I think you’re experiencing a clash between the stereotype of a programmer as an extreme introvert who hides away in a dark office and the reality of programmers as a friendly, collaborative, argumentative community..

Yes, I use a lot of these tools, probably an hour or so a day of reading and chatting. It is not surprising that you might spend two hours if you have to look everything up along the way. :) You’ll hear about new techniques and tools, get help finding bugs, learn how other people have approached problems, see industry trends, and more. It is an investment in your future.

Actually, a lot of programming is keeping up with trends. I suspect web development moves faster than most programming sub-topics because we’re all online and we can build our own communication tools.

If you go work in a cave and stop paying attention to the outside world you’ve got about four years before you’re not familiar with current tools and will have trouble finding a job. The foundational skills of programming all stay the same (and those *will* be improved by reading the articles you find on social media), but tools turn over fairly rapidly. I didn’t think I’d bring it up until you were a couple years into your career, but the book The Passionate Programmer has excellent advice on how to build and maintain a good career.

You need to pick one consistent username for all these sites. Not only is it easier for you to keep track of, it helps people learn about you and hire you. Mine used to be ‘harkins’ or ‘malaprop’ but neither was unique, so I’ve moved to ‘pushcx’ to echo my blog. Opaque handles like this are really uncommon, most people use a variation of or wordplay on their name.

So let me go through the sites you mention and I use one-by-one:

Tumblr: I only browse this; there’s not much programming content. But if I ever want a cute animated gif or porny art, sure, I suppose. This is the only one in their list that seems odd to me.

Quora: A Q+A site that’s popular for biz/marketing topics. They were a social media darling a year or two back and are a decent place for finding info, but not a lot of code talk. I expect them to flame out before 2015.

LinkedIn: Facebook for business. A great way to find jobs and help friends find jobs, a must-have. I am probably shortchanging my prospects, even though I’m happy with my job and network, by not being on the site.

Gravatar: An avatar that can appear on every site you use the same email address on, generally only supported by small sites. Includes a broken movie-style censorship system. Incredibly annoying if you have more than one email address.

Twitter: Very popular among programmers. You should follow me, @pushcx there. Lots of links and community-building stuff happens here, this one I use actively. Though it’s really hard to separate signal from noise; some of the best programmers post also most the most random annoying bullshit.

App.net: Exists because Twitter is increasingly terrible assholes to people who build Twitter clients and services. If Twitter goes full-evil everyone will move here, but it’s safe to ignore it completely until then.

StackOverflow: Programming Q+A, an absolute must. You will find their pages showing up in search results for all sorts of programming topics, and you can ask questions here. This is an amazing resource.

Hacker News: Code + startup news. The comments used to be a goldmine, but in the last year or two they’ve sunk into general bickering. Worth keeping an eye on (nicer interface: Hckrnews) for industry events and the occasional tech post.

Lobsters: Tech-only Hacker News; sadly only has a tiny fraction of the traffic. Recently interviewed me.

Reddit: Don’t visit yet, it’s an incredible time-sink of humor and brain candy. There are a couple small subforums (“subreddits”) worth following, but that’ll keep until you’re done with DevBootcamp.

GitHub: Required. A really nice set of tools on top of git, a tool you haven’t seen yet for tracking work as you go and collaborating. Nearly all web developers are on the site. After a personal blog, this is where you can best build a reputation for yourself.

Because I am a voracious reader (and because I procrastinate by finding interesting things to read), if I read a good blog post I tend to subscribe for a while to read more posts, and I always have a technical book or two around. So in addition to interacting on the above social sites, I’ll spend at least an hour or two a day reading more about programming. I’m pretty sure it’s uncommon to read this much, but I really enjoy programming and care about learning everything I can.

Posted in Code at 2013-03-14 08:25 | No comments | Tags: , , , , , , ,

Backpack Criteria

I spent all of 2011 traveling through interesting cities in the U.S. and Canada with a small backpack. I put a lot of thought into what I needed, so I wanted to share that in case anyone else finds it useful.

  • Simple dark color scheme, no logos to attract attention. When you are carrying everything you own, you do not want thieves to notice you.
  • Right size: 28L is mine, I could see a few L up or down
  • Waterproof material (preferred) or builtin waterproof cover. Waterproof means I could wear in a nor’easter and everything inside will be bone dry, not “water-resistant”, which means “made of tissue paper”.
  • Chest strap that connects the shoulder straps to better carry weight.
  • External frame, maybe mesh back, to lift the pack off my sweaty back
  • One main compartment, one quick-access compartment on the front. Maybe a mesh side pocket for a water bottle, maybe an internal pocket for small goods (use mesh so I can see inside), but a big flexible space and packing crates are the best way I’ve found to organize. It’s painful to have to search for things and carry extra locks.
  • Front-loading is superior to top-loading. That is, rather than a zip along the top of the backpack, it should be possible to unzip the entire front face.
  • cloth/rubber/plastic tabs on the zipper pulls so that I do not jingle like Santa’s reindeer
  • Some kind of compression straps. Compression/adjustment straps should not dangle like anime tentacles.
  • Laptop pocket in the main compartment is a nice-to-have; my bag has a Camelbak space that works.
  • Pockets with a single zipper should have some kind of small loop on the side the zipper closes on so that every zipper can be locked (I didn’t get this and regret it).
  • Solid materials and workmanship, don’t want to have to replace on the road. Cost should be $100 – $200 USD; any less and you skimped.
  • Equipment straps so I can tie/rubberband down gear when I’m traveling heavy or drying laundry (I have no shame).
  • No zip-off day bag – if a pack includes a day bag, it’s too big to carry every day.
  • Rides comfortably – I am taller and skinnier than average, this was a hard find.
  • Fits in the (roughly) 14 in x 9 in x 22 box standing at the airport gate so I can prove it’s a carry-on and avoid checking luggage.
  • Seriously, no goddamn logos. It’s a backpack, not a billboard.
Posted in Life at 2013-03-12 01:02 | 2 comments | Tags: , ,
  • Twitter: pushcx

  • More tweets below and @pushcx