April 24, 2006
MySQL Cluster Configuration, Tuning, and Maintenance
First tutorial of Monday morning at the MySQL Users Conference is Johan Anderson on MySQL Cluster Configuration, Tuning and Maintenance. Found a spot next to Giuseppe (the data charmer) and a power strip (which are in ample supply here).
Johan came with the cluster aquisition from Ericsson to MySQL in 2003 and works mostly as a consultant with other folks using MySQL cluster for telecom, search engines and e-commerce. Likes to keep it simple, or be sure to avoid areas where the cluster isn't the best (ie MySQL cluster isn't great at joins).
Johan starts with a general overview, using the typical diagram of the cluster nodes and walks through the pieces of the cluster. MySQL is currently working on a new feature where you can add a new storage node online. The management server isn't critical to a running cluster, only used on startup or for failure conditions.
(some working out of logistics, turning up the mic, getting a lazer pointer, etc)
Data DistributionData distribution is done by each table being divided into fragments, based on the number of storage nodes. One node will be have the primary fragment of the data and another node will have the secondary fragment. Decision on where to store the data is based on a hash value of the primary key. Each data node in the cluster can handle transactions.
Node FailureWhen a node fails the data from that machine is then pulled from the other node in that node group that has a copy. The other node then has to assume the full responsibility of serving that data until the failed node recovers. When the failed node comes back it knows that it's joining a running cluster and will grab a copy of all the information it needs from the running node. Once it has a complete copy of the data it will rejoin the cluster as a healthy node. 5.1 will have an option to recover a node just applying the changes that have happened since the node failed.
BackupsCluster has it's own backup that's a part of the cluster engine that runs on each node. You start the backup from the management server which tells all of the nodes to take a backup. The nodes each have a copy of the data. When a backup is taken the fragments are written to disk and you are then responsible for copying those files to another location. You can take a backup from a 4-node cluster and restore it to a cluster with different configuraitons.
The changelog has data changes that happen during the backup process. You cannot use the changelog outside of a backup. So there's no log file that can be applied down the road somewhere to roll the database.
If a node fails during backup it will be aborted and the backup will have to be restarted.
If you lose all the nodes in a node group you have a system crash and you need to do a restore from the most recent checkpoint.
TransactionsEach data node has a transaction coordinator. The SQL node picks one of the data nodes (round-robin fashion) and tells it to start a transaction (5.1 may include option to specify node). Each transaction coordinator handles the two-phase commit protocol for a transaction. The cluster uses a linear two-phase commit.
It's important to note that there's a 50% chance that the transaction will start on a node that doesn't have the data it needs, so data nodes have to be able to communicate between each other.
For an update on a two-replica cluster the TC sends a message to the LQH with the primary fragment for the tuple. Once that LQH gets the lock it needs it sends the message along to the LQH on to the other node. When the second node has the lock on the LQH it returns a message to the transaction coordinator. After the prepare is done (locks obtained) the TC sends a commit. The transaction coordinator recieves a notice from the first LQH once the change is committed. The TC then sends a message to the other LQH saying that the commit is finished.
Once the commit starts it must be finished. Even if a node crashes between one node committing and the other committing the transaction has to be completed.
What happens if the TC dies? Another transactions coordinator takes over and aborts transactions that haven't made it to the commit phase. The application must be able to handle this. If the commit has started the new TC will complete the transaction.
LockingWhen working with ranges a batch of records (likely a subset of all that is needed) is brought up to the MySQL client. The lock is obtained on the subset of the records that is obtained, but that doens't mean the lock is obtained on all records that will be processed. The subset is changed and then returned to the cluster where the lock is released and then the next batch is locked and sent up to the client.
If an operation takes a lock and it doesn't unlock in a timely fashion the TransactionDeadlockDetectionTimeout setting controls how long a second lock will wait before aborting. The default is 1200ms. TransactionInactiveTimeout can be used to maximum time between operations, or how long a transaction can hold a lock. The default is 0, which is not good. A better setting is 5000-10000ms. Set TimeBetweenInactiveTransactionAbortCheck (perhaps 1000 ms) to specify frequency of checks for inactive transactions.
Memory is split into two pieces, data and index. Data stores tuples and ordered indexes. Index data stored primary keys in a hash index. Memory page sizes are 32KB, index memory page size is 8KB. Max tuple storage is 8052 bytes except for blobs. Use ndb_size.pl script to determine storage requirements for a given MyISAM or InnoDB table in your MySQL instance. Do not use more than 90% of your data or index memory, the cluster needs headroom.
Disk usage is often neglected, but is important to think about about because you have checkpoint data needing to be written to the disk. Each node is writing checkpoint data, and UNDO/REDO logs. Every change is put into the REDO log. Every time three checkpoints have passed the REDO log is cycled. A local checkpoint can be intensive. You can set the time between the checkpoints using TimeBetweenLocalCheckpoints. The default is 20, which means that reaching 4MB of data in the REDO log will trigger a checkpoint. During a checkpoint the UNDO log is written to provide data on changes that happen during the checkpoint.
In addition to writes, this data needs to be read from disk on recovery.
Johan gets into some pretty detailed diagrams of the timeline for disk writes and configuration of disk utilization. It appears to be quite a science to calculate how much disk space you should have available for your cluster. Some of this calculating goes away with 5.1 where the checkpoint has been rewritten and the UNDO log is no longer used.
The bottom line is to be aware of how many transaction records will be used in your largest transactions and identify how many operations records they will use and be sure to have room in memory to process those records.
Transport BuffersEach connection uses a transporter for sending and receiving data. The SendBufferMemory/ReceiveBufferMemory can be configuerd to allow more memory for transporting data in cases where records are large.
MaintenanceA few maintenance items:
- logs are circular
- memory isn't reclaimed unless a table is dropped
- develop a backup strategy
- rolling restart will let you update config.ini and then restart each NDB node one at a time
- rolling restart resolves memory fragmentation
Hardware RecommendationsGet good hardware, with good disks. Data nodes shoud be Dual 64-bit machines with SCSI disks and lots of RAM. The network should be gigabit or SCI.
Best PracticesA few ideas for utilizing the cluster efficiently (not everything, just what catches my eye):
- use primary key operations, it's good at them
- scans are expensive
- maintaining indexes is expensive
- avoid joins
- denormalize data to have a more horizontal layout
- engine-condition-pushdown=1 to pushdown where clauses
Posted by mike at April 24, 2006 11:37 AM