Craftsmanship Tour: New York Times

Published: 2011-06-22
Category: Code
Tags: New York Times, craftsmanship tour, idempotency, journalism

In May, while visiting New York City, I dropped by the New York Times to code with Derek Willis and, impromptu, Dan Berko. I worked with both at the Washington Post (and saw many other familiar names on doors; online journalism is a small town).

Derek’s got a great career arc. He climbed up the ranks of journalism, covered Congress, and got involved in data-heavy projects. “Computer assisted reporting” refers to collecting and analyzing data in databases; it’s one of those terms that nobody quite loves but nobody’s successfully replaced (though it seems “database journalism” is gaining ground). Derek got as interested in the “computers” as the “reporting” and has deliberately pushed his skills and career into software development. (*cough* Sounds like a good topic for a quiet blog, eh, Derek? *cough*) I look at his GitHub profile today and he’s been busily merging in contributions to his open source projects like his USA Today Census gem.

We started out the day looking around the FEC scraping code. In the States, the Federal Election Commission gives out lots of data from candidates filing required disclosure statements. We tidied up the database a little and then turned to the project for the day, which has just been publicly announced:

Today we’re announcing the addition of paper campaign filings from Senate candidates and two party committees to our Campaign Finance API, which previously had only provided details of electronically filed reports. Now users can request and view the filings of any committee registered with the Federal Election Commission.

Unlike House and presidential candidates, current and would-be senators file their campaign reports first with the Secretary of the Senate, who then forwards them to the F.E.C. That agency then scans in the images from the paper filings and makes them available for viewing (an example). While an effort to require electronic filing for Senate candidates hasn’t gotten much traction this year, we have at least made the API’s set of filings more complete.

New in the Campaign Finance API: Paper Filings

The scraper was previously ignoring the paper-only reports, but we updated it to recognize and categorize them. The categorization was a huge bit of nostalgia for me: take the noisy and sometimes inconsistent provided categories and map them onto a standard set of database categories (form_type, in the screenshot in the announcement).

When we ran the scraper, it would complain and halt each time it reached an unknown category. We’d add that category to a mapping table and restart from that point, but it was frustrating to have to keep an eye on it. So we set the scraper to ignore records that it didn’t have a mapping for and warn about the problem. We set the scraper to run (and hit the amazing Shake Shack for burgers) and came back to find a list of missing mappings. After adding those, we ran the scraper again to fill in the missing entries.
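Here's a minimal sketch of that skip-and-warn approach. It is not the actual scraper code (which isn't shown in this post); it's written in JavaScript only to match the snippet later on. The names `FORM_TYPE_MAP` and `categorize`, the sample category strings, and the form type codes are all made up for illustration.

```javascript
// Hypothetical mapping from noisy source categories to standard form types.
const FORM_TYPE_MAP = new Map([
  ['quarterly report', 'Q'],
  ['qtrly rpt', 'Q'],
  ['year-end', 'YE'],
]);

// Categorize what we can; collect unknown categories instead of halting.
function categorize(records, mapping) {
  const categorized = [];
  const missing = new Set();
  for (const record of records) {
    // Normalize whitespace and case so trivial variations share a key.
    const key = record.category.trim().toLowerCase();
    if (mapping.has(key)) {
      categorized.push({ ...record, form_type: mapping.get(key) });
    } else {
      missing.add(key); // remember it for the warning, then keep going
    }
  }
  return { categorized, missing: [...missing] };
}

const result = categorize(
  [
    { id: 1, category: 'Quarterly Report' },
    { id: 2, category: 'Amended Termination' },
    { id: 3, category: ' QTRLY RPT ' },
  ],
  FORM_TYPE_MAP
);
console.warn('Missing mappings:', result.missing);
```

The point of returning `missing` as one list is that a long run produces a single to-do list at the end, rather than one halt per surprise.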

This worked because the scraper only added entries it didn’t already have recorded. The term for this is idempotency, and it’s useful from the level of individual functions up to large, fairly complex programs like web scrapers. Every program fails. An idempotent approach means you don’t have to keep careful track of many types of failure, because you can fix things and re-run your program without worrying about duplicate records or updating things twice.
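To make the idea concrete, here's a small sketch of an idempotent save, assuming each filing has a unique `id` to check against. The in-memory `Map` stands in for the database, and `saveFiling` and `runScraper` are hypothetical names, not the scraper's real functions.

```javascript
// In-memory stand-in for the database table, keyed by a unique filing id.
const filings = new Map();

// Idempotent: saving the same filing twice leaves the store unchanged.
function saveFiling(store, filing) {
  if (store.has(filing.id)) return false; // already recorded, nothing to do
  store.set(filing.id, filing);
  return true;
}

// One "scraper run": returns how many new entries were added.
function runScraper(store, scraped) {
  let added = 0;
  for (const filing of scraped) {
    if (saveFiling(store, filing)) added += 1;
  }
  return added;
}

const firstRun = runScraper(filings, [{ id: 'A1' }, { id: 'A2' }]);
// The second run sees the same filings plus one that was skipped earlier.
const secondRun = runScraper(filings, [{ id: 'A1' }, { id: 'A2' }, { id: 'A3' }]);
console.log(firstRun, secondRun, filings.size); // 2 1 3
```

Re-running after a fix only fills in what was missing, so there's no need to track where the previous run stopped.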

Derek had to run off to catch a train, so I dropped in on Dan Berko. He was on one of the Post’s several other “web innovation” teams while I was there, so we helped each other with code occasionally but didn’t spend a lot of time coding together.

The New York Times has large and well-maintained internal tools for reporters and editors. The reporters have a CMS for writing stories and the editors have a budgeting system for planning what goes where in the paper. We improved communication between these two a bit, so the budgeting tool could refer to a story in the CMS and pull metadata from there instead of requiring an editor to re-input it.

The UI was simple: if the editor links a story, several fields should be grayed out and a checkbox should indicate the link. If the editor unchecks the box, the link is broken and the fields become editable again. This started out with two code paths - one for linking a story, one for unlinking - plus making sure on pageload that the UI was in the proper state. We’d barely started writing that when we saw it could be implemented even more simply:

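```javascript
// Excerpt from the Budget object (Prototype.js); other members are elided.
var Budget = {
  // Disable the CMS-backed fields while a story is linked, enable them otherwise.
  disable_if_linked_to_cms: function() {
    var checked = $('asset_cms_id').checked;
    ['asset_home_status_id', 'headline' /* , etc. */].each(function(id) {
      $(id).disabled = checked;
    });
  }
};

// Run once on pageload and again every time the checkbox is toggled.
document.observe('dom:loaded', function() {
  Event.observe('asset_cms_id', 'click', Budget.disable_if_linked_to_cms);
  Budget.disable_if_linked_to_cms();
});
```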

When the checkbox is checked, all the form fields are disabled. When it’s unchecked, they’re not. The code runs on pageload and anytime the checkbox is toggled. I really liked this bit of code: we started out writing the simplest thing that came to mind, but soon we realized it could be reduced. The resulting code probably elicits a “So what, it’s not doing much?” reaction, which is far better than the previous “Now, let’s see, what’s this doing?” we would’ve had at first. The sign of the best code is that you immediately understand it, not that you have to stretch yourself to follow its solution.