LessWrong downtime 2012-03-26, and site speed

post by matt · 2012-04-03T04:15:09.856Z · LW · GW · Legacy · 7 comments

Our investigation into last week's LW downtime is complete: here (Google Docs).

Executive summary:

We failed to update our AWS configuration after changes at Amazon, which caused a cycle of servers being spawned and then killed before they could properly boot. Our automated testing should have notified us of this failure immediately, but it had a predictable failure mode (identified by us last year but not fixed). We became aware of the downtime when I checked my email, and we then worked on it until it was resolved.

I personally feel very bad about our multiple failures leading to this incident.

ref. the last time I did this to you: http://lesswrong.com/lw/29v/lesswrong_downtime_20100511_and_other_recent/

Actions:

  1. We have reconfigured AWS and the tools we use to communicate with it to avoid this failure in the future.
  2. Improvements to our automated site testing system (Nagios) are underway and expected to be live before 2012-04-13. The new tests will detect greater-than-X-failures-from-Y-trials, rather than the current zero-successes-from-Z-trials.
  3. We have changed our staffing, partly in recognition that some systems (including this one) had been allowed to fall out of date, and have allocated a developer to review our system administration project planning.
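
To illustrate the difference between the two alerting rules in action 2, here is a minimal Python sketch. It is not our Nagios configuration: the function names, the thresholds, and the sample check history are all hypothetical, chosen only to show why a site that occasionally responds can slip past a zero-successes rule.

```python
from typing import Sequence


def alert_zero_successes(results: Sequence[bool], z: int) -> bool:
    """Current-style rule: alert only if none of the last z trials succeeded."""
    recent = results[-z:]
    # Require a full window of z trials, all of them failures.
    return len(recent) == z and not any(recent)


def alert_too_many_failures(results: Sequence[bool], x: int, y: int) -> bool:
    """Planned-style rule: alert if more than x of the last y trials failed."""
    recent = results[-y:]
    failures = sum(1 for ok in recent if not ok)
    return failures > x


# Hypothetical check history (True = check succeeded): a mostly-down site
# that still answers now and then, e.g. while servers are cycling.
history = [False, False, True, False, False, False, True, False, False, False]

# Thresholds below are illustrative assumptions, not real settings.
print(alert_zero_successes(history, z=10))          # False: two successes hide the outage
print(alert_too_many_failures(history, x=3, y=10))  # True: 8 failures out of 10 trials
```

With the zero-successes rule, a single lucky success in the window suppresses the alert; the failure-count rule fires as soon as failures exceed the chosen threshold.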

Further actions - site speed:

We're unhappy with the site's speed. We plan on spending some time next week doing what we can to improve it.

(If you upvote this post, please downvote my "Karma sink" comment below - I would prefer not to earn karma from an event like this.)

7 comments

Comments sorted by top scores.

comment by David_Gerard · 2012-04-03T09:51:53.645Z · LW(p) · GW(p)

I upvoted to encourage transparent reporting. (But downvoted the sink per your wishes.) Every sysadmin knows this stuff happens, and describing a disaster in detail is a small useful thing to humanity :-)

comment by kilobug · 2012-04-03T07:41:58.587Z · LW(p) · GW(p)

Good luck to you, and thanks for your efforts at running the site.

And I do think you deserve karma for an honest explanation on what went wrong and what you'll do to fix it, but I'll respect your wish.

comment by John_Maxwell (John_Maxwell_IV) · 2012-04-03T23:58:17.204Z · LW(p) · GW(p)

We should have an annual "thank you Trike" day so we can shower them with appreciation when things silently keep going right.

http://sysadminday.com/

Looks like it's July 27 this year. I'll try to remember to send Matt a personal message telling him to create a discussion post and collect his karma.

comment by keefe · 2012-04-06T19:26:52.234Z · LW(p) · GW(p)

I'm pretty familiar with the codebase, though I transitioned to eBay before getting too much done on it. Send me an email if you want some feedback; I have more free time these days and am looking to contribute to open source for long-term strategic reasons.

comment by MixedNuts · 2012-04-04T17:18:53.137Z · LW(p) · GW(p)

Upvoted both the post and the karma sink just to spite you. I like you.

Replies from: wedrifid
comment by wedrifid · 2012-04-05T02:13:09.386Z · LW(p) · GW(p)

Upvoted both the post and the karma sink just to spite you. I like you.

Downvoted both the post and the karma sink so as to adhere to Matt's wishes. I respect him. ;)

comment by matt · 2012-04-03T04:07:13.739Z · LW(p) · GW(p)

Karma sink.