Clean Up Your Mess



Too many sysadmins is a bad thing, especially if one of them doesn’t care about keeping the servers up.

The development box at work wasn’t letting me check anything into Subversion: commits were just sitting there, not even timing out. In fact, so were updates. Something was seriously wrong.

I talked about it with a coworker and went to look at the box. After a few minutes of poking around, the problem became clear: someone had installed a backup program that was trying to do some kind of fake-filesystem trick and had wedged the box. Any process that tried to read from disk froze and couldn’t even be kill -9’d.

And thanks to this odd little behavior, I could see three reboot processes frozen, presumably trying to read the shutdown scripts. So the person who wedged the box knew they had wedged it, but they just left it that way.
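For the curious, here is a rough sketch of how you might spot processes stuck like this on a Linux box. It is not what I ran that morning; it assumes a Linux-style /proc, where the third field of /proc/<pid>/stat is the process state and "D" means uninterruptible sleep (usually waiting on IO). The function names are made up for illustration.

```python
import os


def parse_stat(line):
    """Return (pid, command name, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')' rather than on spaces.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_str), comm, state


def uninterruptible(proc_root="/proc"):
    """List (pid, command name) for every process currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the entry was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible():
        print(pid, comm)
```

Since /proc is served by the kernel rather than the disk, reading it should not hang the way a real file read would, though starting an interpreter that is not already cached in memory very well might on a box this wedged.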

I got the coworker in the office to pull the plug on the box and it came up OK, but I edited /etc/init.d/arkeia to spit out the following note instead of trying to start the backup program:

Dear whoever the hell decided to install arkeia:

You left the dev box wedged overnight, wasting at least an hour of two coders’ time to figure out what you did and fix it. And we know that you know you broke it: we could see that you tried to reboot and then LEFT IT FOR SOMEONE ELSE TO DEAL WITH rather than actually fix it.

Don’t be a jerk! Clean up after yourself!

Please talk to Jim and Harkins and explain why you left the box broken before you try playing with arkeia and wedge the box again.


Comments

  1. Heck yea! Thankfully I’ve never been in a situation such as the above, mainly due to my age, but I’ve come close with some labs on campus.

    There was a time in one of my CS classes where we were in a UNIX lab. One guy figured out he could log in (via ssh) to everyone’s computers, and pop up messages. Very annoying. My solution was to teach him how to make messages really fast (bash scripts). It seemed counter-intuitive, but he ended up locking up a computer, and the sysadmins were able to find out who did it (the person with a coupla thousand processes running).

    That kinda relates to the link David posted above, kinda your story. Thought I’d share as I can’t really sleep.

  2. > Any process that tried to read from disk froze and couldn’t even be kill -9’d.

    Yeah. That’s standard fare for processes that are blocked on disk IO. In fact, whenever a process refuses to respond to kill -9, it is almost certainly blocking on an unresponsive device.

    When this happens to me, it’s usually about the same time that I start wondering if we’re experiencing a hardware failure. ;)
