BY BLACK DUCK

Unexpected Outage

Our website suffered unplanned downtime from 1am to 10am this morning (PDT). The web growth we’ve seen lately came back to haunt us: our servers drew too much power and tripped a circuit breaker. We’ve reconfigured our power circuits to address this.

Hardware failure aside, the real embarrassment for Ohloh is how long it took for us to respond – it was unacceptable. While our site is currently monitored, none of us caught the alerts at 1am. The SMS/emails didn’t wake us up (as they have in the past). We clearly need a better system.

My first step is to find a better web monitoring service – hopefully one with some type of escalation procedure. Ideally I’d like a service that starts by sending email/SMS but then escalates to calling alternative phone numbers until someone responds. I’d welcome any suggestions below (or at jason@ohloh.net if you prefer).
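To make the escalation idea concrete, here is a minimal sketch in Python of the behavior described above: notify the first contact, wait for an acknowledgment, and move on to the next contact if nobody responds. It is an illustration under assumptions, not a description of any particular monitoring service. The names `Step` and `escalate`, the `notify` and `acknowledged` callbacks, and the five-minute wait are all hypothetical; a real service would supply its own delivery and acknowledgment mechanisms.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """One rung of the escalation ladder (hypothetical structure)."""
    channel: str  # e.g. "email", "sms", "call"
    target: str   # address or phone number; placeholders only


def escalate(
    steps: List[Step],
    notify: Callable[[Step], None],
    acknowledged: Callable[[], bool],
    wait_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Step]:
    """Try each step in order until someone acknowledges the alert.

    `notify` delivers the alert for one step and `acknowledged` reports
    whether anyone has responded yet. Both are supplied by the caller,
    since how alerts are sent and confirmed is an assumption here.
    Returns the step after which the alert was acknowledged, or None
    if every step was tried without a response.
    """
    for step in steps:
        notify(step)
        sleep(wait_seconds)  # give the contact time to respond
        if acknowledged():
            return step
    return None
```

A ladder for the scenario in this post might start with email and SMS to the primary contact, then place calls to one alternative number after another. Passing `sleep` as a parameter keeps the function easy to exercise without real delays.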

Our sincerest apologies. The beer’s on us next time we meet.

-jay

  • VxJasonxV (Jason)

    I’m a big fan of NAGIOS. Free and Open Source to boot. (And if you want to pay, you can! Professional support contracts have been available this year and last, IIRC.)

    Rolling your own monitoring system for internal and external uses is so helpful. Not to mention you don’t have to worry about giving some external services internal insight into your network.

    Seriously. NAGIOS. Do it.

    [edit]
    Yes, I am aware this post is almost 3 months old.
    However:
    1. No one has responded to it.
    2. There was an implicit call for help in the body of the message.

  • maxhex (MAXHEX KSA MY CODE ;-) AND RESOURS)

    conficker by maxhex

  • http://blogs.oracle.com/vlad/ vladak (Vladimir Kotal)

    Time to buy new servers that draw less power? (Like those with hardware threads.)