May 10, 2006
MySQL Scale-Up Plan
I spent Monday writing a document to summarize the current MySQL system architecture and lay out the plan for where we're going with scaling our MySQL architecture at OpenAir. Yesterday I met with the CTO and the Director of Engineering (my manager) to go through the plan, which has evolved considerably since I started working on it. The three of us have been meeting somewhat regularly since I took the helm of all things MySQL.
This is not the document; it is a summary of what's in the document. For our environment there are three major things to think about when it comes to scaling up MySQL: performance, redundancy, and monitoring.
A key component of scaling MySQL is taking steps to make sure we're getting the most from our existing systems. For us this has primarily been performance tuning to process queries faster. The more we can get out of existing hardware, the less we need to be loading up on new hardware.
When I first started working on the scaling plan we thought we were pushed up against the upper limits of the hardware. Turns out that with some tuning the databases are no longer maxing out existing hardware, and we can breathe a bit while planning instead of rushing a solution.
When I first got to OpenAir I was tasked with scaling up MySQL in the context of building a clustered database for scalability and redundancy. The idea was to build something where queries could be spread across multiple machines seamlessly to spread the load around and to prevent application interruptions when a machine went down.
Since I'd done a bit with MySQL Cluster I spent some time getting that set up to experiment with it in the OpenAir environment. Bottom line is we need disk-based storage, and MySQL Cluster currently keeps its data in memory (we're watching the 5.1 release, which adds disk-based tables, with anticipation). We also looked at m/Cluster, a third-party tool that sits on top of MySQL and provides database clustering. Somewhere in the midst of the cluster exploration we realized that we weren't sure clustering was worth the work to build and maintain.
Around this time I headed off to the MySQL Users Conference and had a chance to see Jeremy Cole's talk about scaling and redundancy at Yahoo!. My report back to the folks at OpenAir was received with excitement. The redundancy plan is now something I'm much more familiar with: replication to hot spare machines and manual failover (with good monitoring alerts). We've already got some of this in place for backups, so the implementation is more of a refactoring plus building a process for quick manual failover. We did this at Tufts too, which means I'm bringing a lot (5 years) of experience to this solution.
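One step in a manual failover is deciding which hot spare to promote. A minimal sketch in Python, assuming each spare's `SHOW SLAVE STATUS` row has already been fetched into a dict; the helper names and host names here are hypothetical and not part of the OpenAir process. It simply picks the spare that has applied the most of the master's binary log.

```python
def binlog_coordinate(status):
    """Return (file_index, position) for how far a spare has applied the master's binlog.

    `status` is one row of SHOW SLAVE STATUS as a dict. Binlog file names are
    assumed to end in a numeric suffix, e.g. "mysql-bin.000042".
    """
    log_file = status["Relay_Master_Log_File"]
    file_index = int(log_file.rsplit(".", 1)[1])
    return (file_index, int(status["Exec_Master_Log_Pos"]))


def pick_spare_to_promote(spares):
    """Given {host: status_row}, return the host that is furthest along."""
    if not spares:
        raise ValueError("no hot spares available")
    # Tuples compare file index first, then position within the file.
    return max(spares, key=lambda host: binlog_coordinate(spares[host]))


if __name__ == "__main__":
    # Hypothetical hosts and positions, for illustration only.
    spares = {
        "spare1": {"Relay_Master_Log_File": "mysql-bin.000042", "Exec_Master_Log_Pos": 1500},
        "spare2": {"Relay_Master_Log_File": "mysql-bin.000042", "Exec_Master_Log_Pos": 98},
    }
    print(pick_spare_to_promote(spares))
```

A person still makes the call to fail over; a helper like this only takes the guesswork out of which spare is the most current.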
The last piece of the plan is to build better monitoring. To have a highly available database means that if something is failing, or in danger of failing, you should know about it. For us this includes a better replication heartbeat and more MySQL-specific monitoring. We have a general monitoring tool that does a good job of checking the servers in general but doesn't get into things specific to how MySQL is running. We will be adding in some of those checks.
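A minimal sketch of what a MySQL-specific replication check might look like, in Python. It assumes the monitoring tool has already fetched a `SHOW SLAVE STATUS` row as a dict and has measured the age of a heartbeat row (a timestamp the master writes periodically and the spare reads back). The function name, the heartbeat mechanism, and the 60-second threshold are illustrative assumptions, not the checks we actually run.

```python
def check_replication(status, heartbeat_age_seconds, max_lag_seconds=60):
    """Return a list of alert strings; an empty list means replication looks healthy.

    status: one SHOW SLAVE STATUS row as a dict.
    heartbeat_age_seconds: seconds since the newest heartbeat timestamp seen on
        the spare (hypothetical heartbeat table written on the master).
    """
    alerts = []

    # Both replication threads must be running on the spare.
    if status.get("Slave_IO_Running") != "Yes":
        alerts.append("replication I/O thread is not running")
    if status.get("Slave_SQL_Running") != "Yes":
        alerts.append("replication SQL thread is not running")

    # Seconds_Behind_Master is NULL when the lag cannot be determined.
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        alerts.append("replication lag is unknown (Seconds_Behind_Master is NULL)")
    elif int(lag) > max_lag_seconds:
        alerts.append("replication is %d seconds behind the master" % int(lag))

    # The heartbeat catches stalls that the status fields alone can miss.
    if heartbeat_age_seconds > max_lag_seconds:
        alerts.append("heartbeat is stale (%d seconds old)" % heartbeat_age_seconds)

    return alerts
```

Returning a list of messages rather than a single pass/fail keeps the alert specific, which matters when someone has to decide quickly whether to fail over.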
In yesterday's meeting it became clear that going through the process of looking at the pie-in-the-sky solution was helpful in determining just how much it would take to have a cluster, or cluster-like technology, and what that was worth to us. After a good look at it we decided that, with the technology that's available now, replication to hot spares is the right spot for us. Yes, we could dedicate a chunk of resources (time and money) to having a cluster, but replication gets us to where we need to be and allows us to focus those resources in other areas (like feature enhancements) that give better returns.
Posted by mike at May 10, 2006 9:30 AM