April 18, 2005

Mastering Changes and Upgrades to Mission Critical Systems

The afternoon tutorial on the first day of the MySQL User's Conference is Andrew Cowie talking about Mastering Changes and Upgrades to Mission Critical Systems. Andrew is a pretty engaging speaker. He's got some interesting papers and presentations, including a paper on recovering systems from disaster in lower Manhattan in September 2001.

I certainly didn't capture everything that Andrew said, but I got some of the interesting points.

We need to think about changes and upgrades as a life cycle. Every so many years things will be replaced: upgrading software, deploying patches, replacing hardware. It's not a rare circumstance that we have to do these things, so they can be planned for.

Our systems as a whole are very complex, and are often required to be up at all times, with nothing going wrong. The longer a system goes without being refreshed, the more of a problem it becomes. When a cluster goes live and is working, it's hard to justify taking it down to make a configuration change or update the code. The problem is that the longer you postpone a change, the riskier it gets to make and the more likely it is that something will break.

To look at how changes get made, it's important to think about:

Organizational Blueprints -> Process -> Procedures
Procedures are the actual steps executed; processes involve policies.

Andrew argues that when we're talking about management of our systems we're not in the technology business, we're in the operations business. We shouldn't be looking to programmers to learn how to best get things done, we should be looking to places like NASA for best practices on designing, documenting and carrying out procedures. In IT a lot of work is ad hoc, with procedures not being followed. In mission critical systems we should look to experts in operations.

One mistake that techies make is that they give too much procedural information to the manager, and don't explain things in the context that a manager needs to make a decision. When writing procedures you should provide an executive summary, especially before embarking on some risky change.

Beyond the summary, you need to put together a list of steps, organized into sections and tasks.

-Section
--Step
---Name
----Task
----Task
----Task

Steps must be done sequentially, but named items within a step can be done simultaneously.
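
To make that layout a bit more concrete, here's a rough sketch in Python. This is my own illustration, not something Andrew showed. I'm assuming that steps run one after another while the named items inside a step may overlap, and the class names, the `outline` and `run` helpers, and the sample upgrade procedure are all made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class NamedItem:
    """One person's (or team's) task list within a step."""
    name: str
    tasks: list = field(default_factory=list)


@dataclass
class Step:
    title: str
    items: list = field(default_factory=list)


@dataclass
class Section:
    title: str
    steps: list = field(default_factory=list)


def outline(sections):
    """Render the procedure in the dashed Section/Step/Name/Task layout."""
    lines = []
    for section in sections:
        lines.append(f"-{section.title}")
        for step in section.steps:
            lines.append(f"--{step.title}")
            for item in step.items:
                lines.append(f"---{item.name}")
                lines.extend(f"----{task}" for task in item.tasks)
    return "\n".join(lines)


def run_item(item):
    """Work through one named item's tasks in order and return a log."""
    return [f"{item.name}: {task}" for task in item.tasks]


def run(sections):
    """Run steps one after another; named items in a step run in parallel."""
    log = []
    for section in sections:
        for step in section.steps:
            # The pool is closed before the next step starts, so a step
            # only begins once every named item in the previous one is done.
            with ThreadPoolExecutor() as pool:
                # map() returns results in item order even if work overlaps.
                for entries in pool.map(run_item, step.items):
                    log.extend(entries)
    return log


# Hypothetical procedure, invented purely for illustration.
procedure = [
    Section("Upgrade database", [
        Step("Prepare", [
            NamedItem("DBA", ["Take backup", "Verify backup"]),
            NamedItem("Sysadmin", ["Drain load balancer"]),
        ]),
        Step("Cut over", [
            NamedItem("DBA", ["Install new version"]),
        ]),
    ]),
]

print(outline(procedure))
```

The point of keeping it as plain data is that the same structure can be printed for review, kept under version control, and walked through step by step on the day.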

Times are critical, and making a big picture estimate is good. But to assign specific times to all the small tasks is kind of pointless (Andrew doesn't like MS Project because everything must be assigned a date).

Andrew recommends good versioning, and not keeping things in Word docs that don't get versioned. A wiki is on his list of recommendations.

Senior people shouldn't be doing tasks all the time. Devolution should be happening, where tasks get moved to more junior people when they've been ironed out.

You do not need some large document for disaster recovery; you need people who have experience with your system, who can think on their feet, and who have access to the process steps.

Posted by mike at April 18, 2005 4:41 PM