Post-mortem for September 13th 2012
September 14, 2012 | Justin Dorfman
Yesterday there was an unfortunate, or rather an unacceptable, 2-hour downtime on the website and the BootstrapCDN service. I (Justin Dorfman) am taking full responsibility for not putting proper monitoring and escalation procedures in place. My biggest mistake was assuming something like this would never happen, and I am the one to blame, not the company I work for, NetDNA.
At 8:30 PM (PST) the BootstrapCDN service went down hard due to an “account suspension”. The suspension was triggered by an “Unpaid invoice” that I created while doing some QA for our upcoming Control Panel (CP3). Watchmouse alerted me by email 9 minutes later, but I didn’t notice the email until 10:55 PM. I then saw that the CDN URL (bootstrapcdn.jdorfman.netdna-cdn.com) was pointed at our Suspension web server, and I updated the record to the dedicated Anycast IP (220.127.116.11). By 11:03 PM all services were restored.
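The failure mode above (the CDN hostname quietly pointing at the wrong server) is the kind of thing a small DNS check can catch. Here is a minimal sketch, not what we actually run: the function names `resolve_ipv4` and `check_dns` are made up for illustration, and the hostname and IP are simply the ones quoted in this post.

```python
import socket

# Values quoted in this post; everything else here is a hypothetical sketch.
CDN_HOST = "bootstrapcdn.jdorfman.netdna-cdn.com"
EXPECTED_IPS = {"220.127.116.11"}


def resolve_ipv4(host):
    """Return the set of IPv4 addresses the host currently resolves to."""
    infos = socket.getaddrinfo(host, 80, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET the sockaddr is (address, port).
    return {info[4][0] for info in infos}


def check_dns(host, expected_ips, resolver=resolve_ipv4):
    """Return (ok, resolved).

    ok is False if the lookup fails, returns nothing, or returns any
    address outside expected_ips (for example a suspension server).
    """
    try:
        resolved = set(resolver(host))
    except OSError:  # socket.gaierror is a subclass of OSError
        return False, set()
    ok = bool(resolved) and resolved <= set(expected_ips)
    return ok, resolved
```

Calling `check_dns(CDN_HOST, EXPECTED_IPS)` from a scheduled job and alerting whenever `ok` is false would flag a re-pointed record even if the wrong server still answers HTTP requests. The `resolver` argument is only there so the logic can be exercised without touching the network.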
How will we keep this from happening again?
- BootstrapCDN will have its own account, meaning no other Pull/Push zones will be on the same account.
- Watchmouse alerts now go to the full support team rather than just one person.
- I will get text messages from Watchmouse if there is any downtime (even false positives).
- We have set up 4 Panopta alerts (checked once a minute from 26 different global nodes).
- We have set up a public Panopta page as well: http://reports.panopta.com/bootstrapcdn
- Since we have 24/7/365 support, agents have been instructed to escalate any downtime reports immediately to my cell phone no matter what.
- We are in the process of getting @BootstrapCDN from a squatter so all related news & updates will be tweeted in real time.
- We set up support-at-bootstrapcdn.com which will alert me via text message.
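The common thread in the list above is that one alert should reach several people and channels, so that a single missed email can't hide an outage again. A rough sketch of that fan-out idea follows. It is an illustration only, not Watchmouse's or Panopta's API: the `Alert` type, the `escalate` function, and the channel names in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    check: str   # which check fired, e.g. "bootstrapcdn http"
    detail: str  # human-readable description of the failure


# A notifier is any callable that delivers an alert (email, SMS, ...).
Notifier = Callable[[Alert], None]


def escalate(alert: Alert, channels: Dict[str, Notifier]) -> List[str]:
    """Send the alert to every channel and return the names that failed.

    A failure in one channel must not stop delivery to the others,
    so each notifier is called inside its own try/except.
    """
    failed = []
    for name, notify in channels.items():
        try:
            notify(alert)
        except Exception:
            failed.append(name)
    return failed
```

In use, `channels` would map names such as `"support-team-email"` and `"justin-sms"` to functions that actually send the message, and a non-empty return value from `escalate` would itself be worth alerting on.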
Our Panopta Dashboard that was set up today (09-14-2012)