When S#!t Hits the Fan...
That is a good reason to prepare for failure to minimize data loss and downtimeCluster Design Documentation
First document your cluster, and verify you have:- Odd number of instances (>=3) on at least 3 independent location
- At least a daily backup using XtraBackup is saved remotely
- Enabled Monitoring is enabled (warning you from low disk space or under performing instances).
- Enabled slow queries monitoring to make sure query performance is monitored and you take care of slow queries to maximize UX
Data Recovery Plan (DRP)
DR Cases
- Data was deleted/modified accidentally. This case will require either:
- Accept the data loss
- Get back to daily backup and lose any data collected since last backup (T1+T2).
- Recover the database on a new node, and cherry picking the changes on the current cluster (T1)
- Single node was crushed
Galera support an automatic recovery of a node w/o significant work. Recovery can be accelerated by recovering the node from the daily backup (T1) - All Cluster is not working
- Requires recovery of a single node from daily backup (T1)
- Setup the cluster (T2)
Technical Procedures
T1: Restore a node from daily backup
- Bring back the files from the the remote backup to /mnt/backup/
- Uncompress the files
sudo tar xvfz $file_name - Shutdown the MySQL
sudo service stop mysqld - Copy the files to your data folder (/var/lib/mysql)
sudo rm -rf /var/lib/mysql/*
sudo innobackupex --copy-back /mnt/datadrive/mysqlbackup/ - Verify the folder permissions
sudo chown -R mysql:mysql /var/lib/mysql - Restart the MySQL and verify everything is working.
sudo service mysql start
T2: Setup a cluster
- Verify Galera was defined in my.cnf and define
[mysqld]
wsrep_cluster_address=gcomm://10.10.10.10
wsrep_provider=/usr/lib64/libgalera_smm.so - Start the first node:
sudo service mysql start --wsrep-new-cluster - Verify the cluster size
mysql> SHOW STATUS LIKE 'wsrep_cluster_size';
+--------------------+-------+
| Variable_name | Value |
+--------------------+-------+
| wsrep_cluster_size | 1 |
+--------------------+-------+ - Repeat the process on the other nodes, this time just w/ a simple MySQL restart
sudo service mysql start
T3: Add/Remove a node
- Restore the node from the nightly backup (T1)
- Perform step 4 in setup a cluster (T2)
Bottom Line
Being ready for the worst, can help you mitigate it with minimal data loss and minimal downtime
Keep Performing,
Moshe Kaplan
Keep Performing,
Moshe Kaplan