Production Outage Planning - 10 (or so) points

I've recently gone through a number of planned production outages for a group of systems we've recently taken over. I like these outages because of that magical word: planned. This isn't planned like you put it in your planner and write down a time; this is a concrete list with no surprises and no unclear roles, where everything is laid out and everyone knows what they are doing. This doesn't seem hard and it really isn't; it just takes some attention. This is my 2nd or 3rd one on the current project, and I'm noticing some good practices and some possible fail routes that can easily be avoided.

0. Know as much as humanly possible about the systems. If you have never restarted server X, how do you know it will come back and operate as you expect? Are services set to start manually? How do you know? "It should" doesn't cut it. Worse, what happens if it does not come back at all? Do you have another system to replace it? How do you direct traffic to that server? Does the backup server work? Does it work completely, 100%? Do you have the access to stand up another server from nothing if you need to? These questions absolutely need answers, and not having them can cause you to invoke the UpdateResume(); method -- and quite frankly, this is the scariest part. A lack of documentation is your enemy here.

1. Solid communication among the team is a total, 100% must. It only takes one parasitic team member to destroy this whole process and render the outage a total failure. Ask other team members "what do you think?" if you feel certain members are uncomfortable or one person is doing all the talking. There should be a lot of people jumping in and commenting, confirming, asking "what if". Talk it out as much as possible.

2. Everyone's role is the same as it is day to day. This means your DBA talks about risks and procedures to the database, the devs talk about the code and the risks to the app, etc. This doesn't necessarily mean there isn't cross talk but it does mean the DBA should _NOT_ outline a code deployment.

3. Include at least one of every role on your project, even if they're not directly involved in the deployment. Just because your developer isn't pushing any code doesn't mean they shouldn't be aware that you are changing a script to start a particular service. That change could affect something the dev team is doing down the road, and they need to know about it to keep the internal systems mimicking production. Do not forget QA if there is a code push of any kind; they should be involved in every outage as well.

4. Run through the plan, out loud, multiple times, in the middle of the room. This goes back to number 3 and covers you if you forgot to include someone you didn't know you should have -- sometimes a BA will ask "hey, will this have any impact on <some obscure system you didn't know about that will be totally screwed when you do this>?" This happens almost every time in the beginning; as time goes on, the question goes the other direction and it's the sys admin or devs or DBA asking the BA.

5. Ask the tough questions and don't rush it. If the team member responsible for system X can't explain why a certain upgrade, setting change, etc. will or will not be beneficial, then it is not to be included in the plan. On the same note, allow that person to figure out why and/or verify what they believe. If that means putting it off completely, then so be it. Don't force the answer that you want to hear -- expect and ask for real evidence, something that says "THIS is why we are doing this". Along the same lines, do NOT be rushed into an outage, because it will be nothing but pain. Push back and be clear about WHY.

6. Have a failover/rollback plan. Code breaks, servers fail, systems stop working -- it happens, so plan for it. Backup/zip/move/copy what you are changing out -- and that means onto a totally different system -- and understand how to get it back to where it was. Make sure you've tested this type of rollback as well. Knowing "how to do it" is very different from "I know it will work". Understand the risks, be realistic about their impact and prepare for it to all go wrong.

7. Pre-outage steps are solid gold! Do everything you can early to make it easier. If code can be copied up into a pre-determined directory, do it. If a config file can be set aside and ready, do it. Verify the outage window everyone's thinking about has been given the rubber stamp. Verify everyone has access, etc. Pre-steps can save you a ton of time and make your outage that much cleaner.

8. Make a timeline and start from the ends. Say your window is 6 hours. It'll take you 5 hours to complete your task, so you've got one hour extra, right? Wrong, if it takes 3 hours to roll back your changes. Start at the ends. If it'll take you 30 minutes to shut down the systems and 3 hours to roll back, you've only got 2.5 hours to do your work. Make your timelines from the ends and work toward the middle. This will save much pain and suffering.

9. Execute the plan and nothing more. When the devs say "well, if we can just change this while we’re at it", bring down the hammer immediately. It's too late and not part of the plan, BUT it can go into the next one. Have them write it down, include it in the next plan, and stand your ground.

9b. Have alternate ways to do the same thing. Sometimes reindexing a database takes 2 hours; other times it takes 15 minutes. Consider service shutdowns and "cleaner" ways of running. This will increase your risk, but it can sometimes turn an outage from a completely worthless, fail-whale waste of time into a total success (been there, done that, got the t-shirt).

10. Do a post-op review. What did you learn? What surprises did you get and why? Can they be stopped from happening again? How long did the whole thing take? Why did it take longer or shorter than you expected? What can you do differently? These answers will make subsequent plans far more refined and get you that much closer to perfection.

11. Catch up on sleep. Ok, this one is more of a reminder for myself to cash in on weekends and on time that I don't spend on outages to restock on sleep. Make no mistake; being up from midnight to 5am and into work at 8:30am isn't glorious by any means, but the rewards will come when you return to a normal day and maybe even leave early a few times.
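
To make point 6 a little more concrete, here is a rough Python sketch of "back it up, then prove you can get it back". Everything in it is made up for illustration: the file name app.conf, the pool_size setting, and the helper names back_up and roll_back are my assumptions, not part of any real system. It rehearses inside a throwaway temp directory; in a real outage the backup location would sit on a totally different system.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file's contents so two copies can be compared byte for byte."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(source, backup_dir):
    """Copy source into backup_dir and stop if the copy differs from the original."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy = Path(shutil.copy2(source, backup_dir))
    if sha256_of(copy) != sha256_of(source):
        raise RuntimeError(f"backup of {source} does not match the original")
    return copy


def roll_back(backup, target):
    """Put the backed-up file back and confirm the restore really took."""
    shutil.copy2(backup, target)
    if sha256_of(target) != sha256_of(backup):
        raise RuntimeError(f"restore of {target} failed verification")


# Rehearsal with hypothetical names, all inside a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "app.conf"
    config.write_text("pool_size=10\n")        # state before the outage
    saved = back_up(config, Path(tmp) / "backup")
    config.write_text("pool_size=50\n")        # the planned change
    roll_back(saved, config)                   # the rehearsed rollback
    restored = config.read_text()

print(restored == "pool_size=10\n")
```

The point of the checksum comparison is the difference between "how to do it" and "I know it will work": the script refuses to carry on unless the copy and the restore both verify.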
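
The back-to-front timeline math from point 8 is simple enough to sketch in a few lines of Python. The function name work_budget and its parameters are just my own labels; the numbers are the ones from the example above (6 hour window, 30 minute shutdown, 3 hour rollback, 5 hours of planned work).

```python
from datetime import timedelta


def work_budget(window, shutdown, rollback):
    """Reserve the ends of the window first; whatever is left is for the work."""
    remaining = window - shutdown - rollback
    if remaining < timedelta(0):
        raise ValueError("window is too short to shut down and roll back")
    return remaining


budget = work_budget(
    window=timedelta(hours=6),
    shutdown=timedelta(minutes=30),
    rollback=timedelta(hours=3),
)
planned_work = timedelta(hours=5)
fits = planned_work <= budget

print(budget)  # 2:30:00
print(fits)    # False
```

A 5 hour task does not fit in that window, even though 6 minus 5 looked like an hour to spare.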

I'm sure that if you follow these, your plans will be utterly painful at first but will get better over time. It's not fun, easy, or exciting, but it's a good, solid way to make sure you and your client are not surprised (or at least that surprises are kept to a minimum).