2011-05-05

Alarums and doldrums

This morning I woke up at half past seven with a vague feeling of unease.

I reached for my phone, checked my e-mail, and sure enough: I had gotten an alert during the night from PageKite's Yamon monitors, telling me that our website was down. I had slept right through it, but apparently my subconscious felt guilty about not waking me up to check what the e-mail was when it arrived.

Rightly so: bad subconscious, bad!

So I got up, blearily stumbled in my birthday suit to my laptop and restarted the system that had crashed.

Now I am pondering how to:

  1. Keep it from happening again
  2. Figure out why it happened

You might think that I got the order wrong and 1) cannot be achieved without 2), but you'd be so wrong. The ways of the sysadmin are subtle and mysterious. In situations like this, it is common practice to add a watchdog to the system which makes sure that whatever crashed gets automatically restarted next time. This basic automation would have reduced last night's event from hours of unavailability to mere seconds.
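
For illustration, here is a minimal sketch of what such a watchdog does, written in Python. It is not the tool I ended up using; the function name supervise and its max_restarts and delay parameters are made up for this example. It simply runs a command, waits for it to exit, and starts it again.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=None, delay=1.0):
    """Run cmd, restarting it whenever it exits. Returns the exit codes seen.

    max_restarts=None means restart forever; a number caps the restarts,
    which is mostly useful for trying the function out.
    """
    exit_codes = []
    while True:
        # Blocks until the child process exits, cleanly or otherwise.
        exit_codes.append(subprocess.call(cmd))
        if max_restarts is not None and len(exit_codes) > max_restarts:
            return exit_codes
        # Brief pause so a crash loop does not spin the CPU.
        time.sleep(delay)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python watchdog.py some-command --its-args
    supervise(sys.argv[1:])
```

Note that this restarts the command blindly and records nothing but exit codes, which is exactly the rug-sweeping discussed next.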

Unfortunately for the sysadmin, sweeping the problem under the rug like that creates a new one: if the crash never visibly happens again, we'll never know why it happened in the first place, and if we never know why it happened, then we can't be sure that we really fixed it and can't be sure it won't happen again...

More to the point, if the problem is happening now, while we have relatively few users, it is reasonable to expect it to get progressively worse as our user base grows, until it reaches the point where a watchdog can't keep up or those seconds of unavailability accumulate to the point of still being a serious problem.

This leads to a more complex sequence:

  1. Keep everybody except me from noticing it happened
  2. Figure out why it happened
  3. Keep it from happening again

And that, my friends, is Internet systems administration in a nutshell.

Updated later ...

One of the keys to solving this was the daemon utility:

$ sudo apt-get install daemon

That wraps pagekite.py on the service front-ends, so if they crash they get restarted again right away.

The other half of the solution was making pagekite.py report its start-up time in its Yamon variables, and making Yamon alert if the start-up time is too recent. That guarantees the service gets restarted ASAP, but I still get notified. I can then browse the program logs and crash reports to try and figure out what the problem was.
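
The check itself is tiny. Here is a rough sketch of the idea in Python; the names and the ten-minute threshold are assumptions made up for this example, and this is not Yamon's actual configuration or API.

```python
import time

# Assumed threshold for the example: anything younger than ten minutes
# counts as "recently restarted". Pick whatever suits your monitoring interval.
RECENT_RESTART_SECONDS = 600


def restarted_recently(started_at, now=None, threshold=RECENT_RESTART_SECONDS):
    """Return True if a process that started at `started_at` is suspiciously young.

    `started_at` and `now` are Unix timestamps in seconds; `now` defaults
    to the current time.
    """
    if now is None:
        now = time.time()
    return (now - started_at) < threshold
```

The monitor would raise an alert whenever this returns True for the start-up time a front-end reports, so a silent restart by the watchdog still ends up in my inbox.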

Not rocket science. :-)

Tags: tech, pagekite

