Now Do You Know It Works?

You’re writing code to store a file on Amazon S3. It’s a popular, powerful, widely used and highly reliable service, and you know the Amazon S3 API pretty well. So you write a function that takes a file and a key name (filename), then issues an HTTP PUT to store the data. Do you know it works?

Well, you test it once or twice with a file or two you have lying around and don’t get an exception and there’s the file in your S3 file browser. Now do you know it works?

There’s that nagging doubt that your code might be imperfect, the kind you only get after you’ve been programming long enough to really screw something up. So you write some unit tests to make sure you’re calling the API correctly with neat mocking and all that jazz. You’re an experienced (read: paranoid) developer, so you do more than “happy path” testing: you make sure it raises errors when given keys that contain invalid characters or are too long. Now do you know it works?

Well, no, you haven’t seen it in production. So you test it there and you learn that every few thousand HTTP requests, S3 throws a temporary error and it works fine on retry. You change your code to notice this and try again a few times and only then raise an exception. You’re smart, you encapsulate this somehow so that all your S3 requests use it and you Don’t Repeat Yourself. Now do you know it works?
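Here’s a minimal sketch of what that encapsulated retry might look like. Everything in it is made up for illustration: `TransientError` is a hypothetical stand-in for whatever your S3 client raises on a temporary failure, and `with_retries` and its defaults are my own names, not part of any library.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a temporary S3 failure."""


def with_retries(operation, attempts=3, delay=0.1, sleep=time.sleep):
    """Call operation(); on TransientError, wait and try again.

    Re-raises the last TransientError once all attempts have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of retries: let the caller see the failure
            sleep(delay * attempt)  # wait a little longer after each failure


# Usage: a fake PUT that fails twice, then succeeds on the third try.
calls = []


def flaky_put():
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("temporary error")
    return "stored"


result = with_retries(flaky_put, sleep=lambda seconds: None)
```

Passing `sleep` in as a parameter is only there so the example (and your unit tests) can run without actually waiting.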

There are no more of those random errors in production, but you don’t know it works. You know that you PUT a file to S3 without errors, but you don’t know that you succeeded at the original goal, which was to store a file. The HTTP PUT is the means, not the end.

So you turn around and do an HTTP HEAD against that key you just PUT to make sure the Content-Length matches the size of the file you just uploaded. Heck, if you’re really paranoid, you turn around and GET the file from S3 and hash it to make sure they both have the same SHA1. You’re ensuring that S3 serves up the file you thought you stored. (Maybe you even do this from EC2 to avoid the time and charge of downloading the file, but I’m not going to go into it.) Now do you know it works?
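The check itself is small. This sketch assumes you’ve already done the HEAD and the GET elsewhere and are holding the results; `verify_upload` and its parameter names are hypothetical, and it works on bytes in memory rather than streaming a large file.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of data as a hex string."""
    return hashlib.sha1(data).hexdigest()


def verify_upload(local: bytes, head_content_length: int, fetched: bytes) -> bool:
    """Return True only if the stored object matches what was uploaded.

    head_content_length is the Content-Length reported by the HEAD;
    fetched is the body returned by the GET.
    """
    if head_content_length != len(local):
        return False  # cheap check first: the sizes disagree
    return sha1_hex(fetched) == sha1_hex(local)


# Usage: the same bytes come back, so the check passes.
matches = verify_upload(b"hello", 5, b"hello")
```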

No. Here your troubles begin.

You have a race condition. If another process does a PUT between your PUT and your HEAD/GET, your function will wrongly think Amazon failed to store the file. You can check whether the HEAD’s Last-Modified matches the Date on the PUT to find out if the file was replaced since you PUT it there, but at that point all your code can do is PUT it again (and boy will it suck when two copies of this program are running and they both want to have the last word) or throw its hands up and pretend things went fine. Now do you know it works?
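You can watch the race happen with a toy. `FakeStore` below is an in-memory dictionary, not the S3 API, and the `between` hook is a hypothetical stand-in for whatever another process manages to do in the gap between your write and your read.

```python
class FakeStore:
    """In-memory stand-in for a bucket; this is not the S3 API."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]


def put_and_verify(store, key, data, between=lambda: None):
    """PUT, then read back and compare. Returns True if the data matches."""
    store.put(key, data)
    between()  # anything another writer does between our PUT and our GET
    return store.get(key) == data


store = FakeStore()

# Nobody else is writing: the verification passes.
ok_alone = put_and_verify(store, "report.txt", b"mine")

# Another writer sneaks in: our PUT succeeded, but the check says it failed.
ok_raced = put_and_verify(
    store,
    "report.txt",
    b"mine",
    between=lambda: store.put("report.txt", b"theirs"),
)
```

Here `ok_alone` is `True` and `ok_raced` is `False`, even though the store did exactly what it was asked both times.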

No, you don’t. You can’t know. Computers promise to be deterministic and reliable, but anything with a little complexity gives a computer system room to sneak in side effects and temporal coupling and other beasties, and that’s even before you have to deal with hardware failures or someone turning on the microwave and disrupting your wifi connection...

Finally, You Come to the Point

It’s a trade-off. Even in computers, there is no such thing as perfectly reliable code; there’s only how many resources you’re willing to pay for the next improvement. It’s easy to be oblivious to this. I can’t even count the number of times I said some variation of “OK, now I know it works” before I learned that it will always break. Always.

Failure is inevitable. If you want to build anything large and built to last, you have to incorporate failure into your design and build organic systems that monitor, recover, repair, and, above all, fail gracefully when they themselves inevitably fail.

P.S. to folks coding in Ruby: don’t use AWS::S3, it only checks for valid keys on HTTP PUT and doesn’t retry transient errors. RightAWS is an improvement.