Jan 9, 2019

Disaster Recovery Plan (DRP) for MySQL/MariaDB Galera Cluster

When S#!t Hits the Fan...

That is a good reason to prepare for failure to minimize data loss and downtime

Cluster Design Documentation

First document your cluster, and verify you have:
  1. Odd number of instances (>=3) on at least 3 independent location
  2. At least a daily backup using XtraBackup is saved remotely
  3. Enabled Monitoring is enabled (warning you from low disk space or under performing instances). 
  4. Enabled slow queries monitoring to make sure query performance is monitored and you take care of slow queries to maximize UX

Data Recovery Plan (DRP)

DR Cases

  1. Data was deleted/modified accidentally. This case will require either:
    1. Accept the data loss
    2. Get back to daily backup and lose any data collected since last backup (T1+T2).
    3. Recover the database on a new node, and cherry picking the changes on the current cluster (T1)
  2. Single node was crushed
    Galera support an automatic recovery of a node w/o significant work. Recovery can be accelerated by recovering the node from the daily backup (T1)
  3. All Cluster is not working
    1. Requires recovery of a single node from daily backup (T1)
    2. Setup the cluster (T2)

Technical Procedures

T1: Restore a node from daily backup
  1. Bring back the files from the the remote backup to /mnt/backup/
  2. Uncompress the files
    sudo tar xvfz $file_name
  3. Shutdown the MySQL
    sudo service stop mysqld
  4. Copy the files to your data folder (/var/lib/mysql)
    sudo rm -rf /var/lib/mysql/*
    sudo innobackupex --copy-back /mnt/datadrive/mysqlbackup/
  5. Verify the folder permissions
    sudo chown -R mysql:mysql /var/lib/mysql
  6. Restart the MySQL and verify everything is working.
    sudo service mysql start
  1. Verify Galera was defined in my.cnf and define
    [mysqld]
    wsrep_cluster_address=gcomm://10.10.10.10
    wsrep_provider=/usr/lib64/libgalera_smm.so
  2. Start the first node:
    sudo service mysql start --wsrep-new-cluster
  3. Verify the cluster size
    mysql> SHOW STATUS LIKE 'wsrep_cluster_size';
    +--------------------+-------+
    | Variable_name      | Value |
    +--------------------+-------+
    | wsrep_cluster_size | 1     |
    +--------------------+-------+
  4. Repeat the process on the other nodes, this time just w/ a simple MySQL restart
    sudo service mysql start
T3: Add/Remove a node
  1. Restore the node from the nightly backup (T1)
  2. Perform step 4 in setup a cluster (T2)

Bottom Line

Being ready for the worst, can help you mitigate it with minimal data loss and minimal downtime

Keep Performing,
Moshe Kaplan

ShareThis

Intense Debate Comments

Ratings and Recommendations