Have You Seen This Cache?


It looks like syntax highlighting, image thumbnails, and compiling object files. Let me explain.

$ time vi -i NONE -u NONE app/models/god_object.rb -c ":quit"
real    0m0.020s
user    0m0.010s
sys     0m0.007s

The client’s GodObject is 2,253 lines long and Vim takes 0.020 seconds to load it.

$ time vi -i NONE -u NONE --cmd "syn on" app/models/god_object.rb -c ":quit"
real    0m0.079s
user    0m0.070s
sys     0m0.007s

Syntax highlighting adds 0.059 seconds. A twentieth of a second is barely noticeable to humans. At twice the speed of the fastest blink, it feels like the smallest possible pause.

That was enough time to plant the seed of this idea.

A function is “referentially transparent” when it depends only on its arguments: any later call with the same arguments could be replaced by the value returned by the first call.

Common referentially transparent functions do things like perform arithmetic, split a string into an array, or parse the bytes of a file into a data structure representing how to color Ruby source code.
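To make the distinction concrete, here’s a minimal Ruby sketch (the function names are mine, invented for illustration): one function that depends only on its argument, and one that sneaks in outside state.

```ruby
# Referentially transparent: the output depends only on the argument,
# so any call could be replaced by a previously computed value.
def split_words(line)
  line.split(/\s+/)
end

# Not referentially transparent: the clock is a hidden input, so two
# calls with the same argument can return different strings.
def stamp(line)
  "#{Time.now.to_f} #{line}"
end

split_words("foo bar baz") # => ["foo", "bar", "baz"], every time
```

Only the first kind can be safely swapped for a cached value.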

That last one is exactly the situation Vim is in: there’s some uncertainty to reading a file off disk, maybe it’s there one run and not the next, but somewhere downstream there’s a function that takes the contents of the file as its argument and returns a data structure annotating where every token starts and ends so that the frontend can highlight them in the proper colors. Any time this function is given the same bytes it generates the same data structure.

It doesn’t care what day of the week it is, how many rows are in my postgres tables, what a random number generator invents, or anything else. Stable input equals stable output.

This is very similar to a key -> value dictionary. The key is the arguments to the function. The value is whatever the function returns for those keys. Looking up the answer is the same as calculating it and, indeed, many dictionaries can be used as caches this way. For an arithmetic example in Ruby:

square_of = Hash.new do |hash, key|
  hash[key] = key * key
end

square_of[3] # => 9

When you call square_of[19] you might be running a function or you might be retrieving a cached value; it doesn’t matter unless you have a practical reason to care about the details of CPU and memory usage. This isn’t useful for a simple operation like squaring numbers, but when there are thousands of slow steps it’s quite valuable.
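The same Hash.new trick pays off once the work is genuinely slow. Fibonacci is the classic (if contrived) demonstration of thousands of repeated steps collapsing into lookups:

```ruby
# Naive recursion: exponential, endlessly repeated work.
def slow_fib(n)
  n < 2 ? n : slow_fib(n - 1) + slow_fib(n - 2)
end

# The same function behind a read-through Hash cache: each value is
# computed once, then every later reference is a lookup.
FIB = Hash.new do |cache, n|
  cache[n] = n < 2 ? n : FIB[n - 1] + FIB[n - 2]
end

FIB[100] # => 354224848179261915075, effectively instant
```

slow_fib(100) would never finish; the cached version does the same arithmetic once per value.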

Every time I open god_object.rb in vim it reparses the Ruby to figure out how to highlight it. Even if the data hasn’t changed, the function runs again. It’s referentially transparent, it’s slow enough to be noticeable, so why not cache it?

Well, maintaining this kind of cache (a “read-through cache”) has a lot of busywork. Aside from the reading and writing to some data structure, there has to be an eviction policy to determine when to throw away data that’s unlikely to be requested or to free up room for new data. People get grumpy when their text editor or web browser swells to eat two gigabytes of RAM, and they don’t connect this to usage being 10 or 50% faster as the program avoids repeating work.

Additionally, Vim would really like that cache to persist across program runs. Why re-parse a file that hasn’t changed just because someone quit Vim for a few minutes?

This prompts a whole new round of busywork managing disk quota and, as large as hard drives are getting, you’d have increased hassles because a program wouldn’t be able to free up space until it happened to run again.
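Here’s a minimal sketch of such a persistent read-through cache in Ruby, keyed by a hash of the input bytes; the directory and the Marshal format are my own choices for illustration, not anything Vim does. Notice what’s missing: nothing is ever evicted, which is exactly the busywork in question.

```ruby
require "digest"
require "fileutils"
require "tmpdir"

# The cache directory is arbitrary for this sketch; a real tool would
# pick somewhere like ~/.cache and then have to worry about quotas.
CACHE_DIR = File.join(Dir.tmpdir, "parse-cache-demo")

def cached(input_bytes)
  # Keying on a content hash means a changed file can never match a
  # stale entry; no timestamps are involved.
  key = Digest::SHA256.hexdigest(input_bytes)
  path = File.join(CACHE_DIR, key)
  return Marshal.load(File.binread(path)) if File.exist?(path)

  value = yield(input_bytes)            # the slow, pure function
  FileUtils.mkdir_p(CACHE_DIR)
  File.binwrite(path, Marshal.dump(value))
  value
end

cached("def foo; end") { |src| src.scan(/\w+/) }
# => ["def", "foo", "end"], computed on the first run, read back ever after
```

Twenty lines to write, and an unbounded pile of stale files to manage forever.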

I was kicking this around in my head, and I realized I’d seen it done before.

When I browse my folders and see thumbnails for images, they’re stored in ~/.cache/thumbnails so that when I re-open the folder they appear instantly instead of taking a half-second per file.

When I build a C or C++ project, the compiler outputs a bunch of object (.o) files, one per source file. If I build the project a second time, only the source files that have changed are rebuilt (though this is based on the timestamp of the source file rather than its contents – with a whole host of predictable bugs ensuing).
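The two staleness rules can be contrasted in a few lines of Ruby; the helper names are mine, sketching the idea rather than any build tool’s actual code:

```ruby
require "digest"

# make-style check: trusts the filesystem clock. A `touch`, a restore
# from backup, or clock skew can trigger (or suppress) a rebuild even
# though the bytes are identical.
def stale_by_mtime?(source, object)
  !File.exist?(object) || File.mtime(source) > File.mtime(object)
end

# Content-based check: rebuild only when the bytes actually changed,
# at the cost of hashing the file.
def stale_by_hash?(source, recorded_digest)
  Digest::SHA256.file(source).hexdigest != recorded_digest
end
```

The content check is exactly the key scheme a referentially transparent cache wants.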

In fact, Python is quite similar to Ruby and generates .pyc files to cache its compilation of source code.

Which reminds me, every time I start rails server to load up my development server for this client, Ruby has to re-parse source code like Vim. (That’s not to say they should share a cache, they build different data structures and don’t want to have to synchronize releases, but it’s the same problem again.) Wait, how many files is that each time?

$ bundle clean --force
$ find app lib -name "*\.rb" | wc -l
$ find $GEM_HOME/gems -wholename "*/lib/*\.rb" | wc -l

Oh, it’s 6,997 files. That’s going to take a little while. And Ruby’s going to do it all from scratch every time it starts, even though the parsing is a referentially transparent, temptingly cacheable function.

Over in the web world, there’s a really nice cache system called memcached that’s often used as a read-through cache. Memcached is a key -> value store that evicts data when it needs room, generally on a “Least Recently Used” (LRU) basis, since old data is least likely to be asked for again. The usual memcached use looks like this with the dalli gem:

def action
  key = request.url
  page = Rails.cache.fetch key do
    # page wasn't found, so generate it
    # whatever the block returns is cached under the key,
    # and is returned for the `page` variable
  end
  render html: page, layout: false
end

Let me generalize that a little:

def read_or_generate *args, &blk
  key = Digest::MD5.hexdigest(args.map(&:to_s).join)
  Rails.cache.fetch key, &blk
end

def action
  page = read_or_generate request.url do
    # generate and return page, may not be called
  end
  render html: page, layout: false
end

Squint a little and this is our pattern again: read_or_generate takes arguments and generates or retrieves the value; we don’t care which happens. (And squint a lot more for the fact that the block is unlikely to be referentially transparent; it probably queries a database but that input is stable until the cache is deliberately cleared, or “stable enough” until it expires.)

I’d like to see a filesystem-level cache like this for Vim, for Ruby, for Python, for C, for every random program that has a referentially transparent function that might as well be a cached value. It’s enough functionality that an individual program doesn’t want to take on the problem itself; it wants to call a cache system. (The programs that do roll their own usually dump to files, like the image thumbnails and object files, ignoring expiration: browsing my 556M thumbnail folder shows tons of images I deleted months ago; `find ~ -name "*\.o" | wc -l` turns up 1,020 object files littered through my home directory.)

The computer would run a daemon like memcached that saved keys to disk, managed expiration, and kept the cache to a particular size. Vim doesn’t have to take on the whole problem, and I don’t have to run out of disk space because a program cached two gigs of data when I last ran it a year ago.
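I can’t show the daemon, but the policy it would enforce fits in a few lines. Here’s an in-process stand-in, a byte-budgeted LRU store invented purely for illustration (a real version would sit behind a socket, like memcached):

```ruby
# A toy LRU store with a byte budget. Ruby hashes preserve insertion
# order, so the first key in @data is always the least recently used.
class LruStore
  def initialize(max_bytes)
    @max = max_bytes
    @data = {}
    @size = 0
  end

  def fetch(key)
    if @data.key?(key)
      value = @data.delete(key)  # re-insert to mark as most recent
      @data[key] = value
      return value
    end
    value = yield                # miss: run the slow function
    @data[key] = value
    @size += value.bytesize
    evict!
    value
  end

  private

  def evict!
    # Drop least-recently-used entries until we fit the budget.
    while @size > @max && !@data.empty?
      _key, evicted = @data.shift
      @size -= evicted.bytesize
    end
  end
end

store = LruStore.new(10)
store.fetch("a") { "12345" }
store.fetch("b") { "12345" }
store.fetch("a") { "recomputed" }  # hit: block not called
store.fetch("c") { "12345" }       # over budget: evicts "b", the LRU entry
```

The program calling fetch never finds out whether it computed or looked up; the daemon quietly keeps the total under budget.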

I went looking for this software and couldn’t find it. I’d love to set aside a gig or two of disk space for faster operations and to keep my directories free of .o and .pyc clutter. There’d have to be some locking (like holding file handles) so that when, say, gcc finishes compiling 30 files, it doesn’t go to link them into a binary only to find that half of them have been evicted from the cache because I was downloading podcasts at the same time.

Does this system sound useful to you?

Before you answer, I thought of something clever for a second version.

Back when Vim read god_object.rb off the disk, the kernel did quite a bit of clever caching to speed up reads. The short version is that the kernel caches recent file reads and writes in RAM. Rather than allocate some fixed amount of RAM for this, the kernel uses all the free RAM that programs haven’t asked for. When a program requests more RAM, the kernel shrinks the file cache and gives RAM to the program. There’s as much room for the cache as possible, and when there’s no room free everything continues to work (just slower).

This cache system I’m considering gets a nice benefit from this feature: if Vim caches the couple kilobytes of parsed Ruby code, it’ll probably be read back from very fast RAM without ever touching the disk. The kernel has lots of very clever and reliable code for doing this responsibly; it’s a wheel that shouldn’t be reinvented.

But the clever thing is that if this cache system were in the kernel, it could use all free disk space as a cache, just like the kernel file cache uses all free RAM. There’d be no fixed-size allocation weighing convenience against resources.

This seems like a nice big win to me. Enough of one that I’m puzzled that I haven’t seen anything like it. Maybe I’m not searching well, maybe I haven’t explored enough unix esoterica. Would anyone be able to point me to something like this?

Or be able to build it with me?


  1. The disk cache works on data at the block-device level. Did you know the Linux kernel uses an entirely different cache in its memory allocator? Why don’t they just use the same code? Probably because the problems require rather different approaches…

    A wrinkle is that the caches you are referring to (AST, resized images) are caches for computation as opposed to straight-up data transfer. A good (bad?) cache for calculating Fibonacci numbers can be different from a cache for rows that match a given query in a database, and may be different again from a good cache for the result of a convolution on an image. The way you might wish to represent the key and the data can vary wildly, as would things like the eviction policy and the choice not to cache at all, let alone working out the point at which you have to insert the cache usage into the code.
