Migrate from on-premises MySQL Server to Amazon Aurora with near-zero downtime

Introduction

Many companies are considering moving their infrastructure to the cloud, and one of the biggest show-stoppers is downtime, especially when it comes to database migration. In this post, I am going to show you how we migrated our on-premises MySQL Server to Amazon Aurora MySQL with near-zero downtime. Most of the process is written as bash scripts, which I will explain step by step; you can find all of the scripts in this repository.

Simplified Process

A typical process for migrating a database looks something like this:

  1. Create a backup from the source database
  2. Restore the backup to the destination
  3. Set up replication from the source to the destination
  4. Stop applications
  5. Make sure no writes are happening to either database
  6. Wait for the destination to catch up
  7. Change connection strings to use the new database
  8. Start applications

It seems simple, doesn’t it? But let’s be honest: database migration is only that simple if your database is fairly small and you are allowed several hours of downtime. If that’s the case, don’t bother designing a zero-downtime migration process; just follow those steps and you’re done. You are probably here, though, because you can’t afford downtime, and you know that steps 4-7 mean downtime. We call this the cut-over. Migrating the database of any live application involves a cut-over: the moment when you stop writes to the current database, activate the new database, and continue writing to it. The cut-over is necessary to keep the data consistent, because writes must NEVER happen to both databases at the same time. The faster the cut-over, the less downtime you will have.

Architecture

Our architecture before the migration looked like this:


We used ProxySQL to provide high availability and read-only load balancing. Having a proxy gives you the ability to route your database requests to the right destination at any moment. ProxySQL is a high-performance proxy for MySQL, and you can use it as the pivot point of the cut-over without ever changing your application configuration or even stopping the applications. You can add a proxy layer to your architecture with zero downtime; it has many benefits, and I highly recommend doing so even if you are not planning a migration. We can also script (and automate, if you want) the whole cut-over process in bash, so you can be confident the cut-over will happen flawlessly. Let’s do it.
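
Because all of the routing lives in ProxySQL, you can inspect the current traffic split at any time from its admin interface. A minimal sketch, assuming the default admin interface on port 6032 with the default admin/admin credentials:

mysql -h 127.0.0.1 -P 6032 -u admin -padmin \
  -e "SELECT hostgroup_id, hostname, port, status, weight, comment FROM mysql_servers;"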

Preparation

Preparation involves these steps:

  1. Create a backup from the source MySQL database
  2. Restore the backup to the destination Aurora cluster
  3. Set up replication from source to destination

There are plenty of resources on the web that cover these topics, so I am not going to repeat them here; you can find some of them in the Useful Links section at the end of this article. Make sure to study them before starting the migration, as they all have points that will make your life easier. I would, however, like to highlight some important factors to consider before moving on to the cut-over.

  • Ensure that the charset and collation configuration for the servers and databases is identical
  • Ensure that the replication lag is minimal, using the SHOW SLAVE STATUS statement (see the verification sketch after this list)
  • Set the binlog retention configuration on the Aurora cluster using this statement:
    call mysql.rds_set_configuration('binlog retention hours', X); where X is the number of hours you want to keep the binlogs
  • Check connectivity on your database port (e.g. 3306) from ProxySQL to the destination Aurora cluster
  • Import the time zone tables from Aurora into your current MySQL database using the following commands:
mysqldump --login-path=your_aurora --no-create-db --no-create-info mysql time_zone time_zone_name time_zone_transition time_zone_transition_type > aurora-timezones.sql

mysql --login-path=your_mysql mysql < aurora-timezones.sql
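
Before moving on, it is worth verifying the replication and retention settings from the list above. A minimal sketch, assuming the aurora login path that is set up later in the Scripts Prerequisites section:

# replication lag on the Aurora cluster should be close to 0 and both threads running
mysql --login-path=aurora -e "SHOW SLAVE STATUS\G" | grep -E "Seconds_Behind_Master|Slave_IO_Running|Slave_SQL_Running"

# confirm the binlog retention setting has been applied
mysql --login-path=aurora -e "CALL mysql.rds_show_configuration;"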

Create Users In Aurora

Make sure to create the application users before the cut-over. You can use mysqlpump or pt-show-grants to migrate your users with their permissions. For mysqlpump, a command like this can be used:

mysqlpump --exclude-databases=% --no-create-db --no-create-info --users --skip-dump-rows --skip-defer-table-indexes --skip-routines --include-users=user1,user2,user3,... > dump-users.sql

Make sure to remove or update any host/IP restrictions on the newly created users in the destination database.
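
If you prefer pt-show-grants from Percona Toolkit, a rough equivalent could look like this (the host and user are placeholders for your source server):

# dump all grants from the source MySQL server and replay them on Aurora
pt-show-grants --host=mysql-node-1 --user=admin --ask-pass > grants.sql
mysql --login-path=aurora < grants.sql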

ProxySQL Configuration

Before the migration, the mysql_servers table looked like this (a sketch with placeholder hostnames follows below):

Host groups 10 and 11 handle write and read requests respectively, as described in read/write split with ProxySQL. We are going to add the Aurora endpoints to each group with higher weights and distinct comments.
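
A representative layout, queried from the ProxySQL admin interface; mysql-node-1 and mysql-node-2 are hypothetical hostnames, not the real ones:

Admin> SELECT hostgroup_id, hostname, port, status, weight, comment FROM mysql_servers;
+--------------+--------------+------+--------+--------+--------------+
| hostgroup_id | hostname     | port | status | weight | comment      |
+--------------+--------------+------+--------+--------+--------------+
| 10           | mysql-node-1 | 3306 | ONLINE | 1000   | MYSQL WRITER |
| 11           | mysql-node-1 | 3306 | ONLINE | 1000   | MYSQL READER |
| 11           | mysql-node-2 | 3306 | ONLINE | 1000   | MYSQL READER |
+--------------+--------------+------+--------+--------+--------------+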

Add Aurora Endpoints to ProxySQL

At this point, the applications connect to ProxySQL, ProxySQL routes the database requests to the current MySQL Server, and the Aurora cluster is kept in sync with the current database via replication. Now that everything is ready, it’s time to add the Aurora endpoints to ProxySQL. We initially add them with the OFFLINE_HARD status so that they don’t accept any requests (see Configuring Backend MySQL Servers in ProxySQL).

insert into mysql_servers (hostgroup_id, hostname, port, status, weight, compression, max_connections, max_replication_lag, use_ssl, max_latency_ms, comment)
values (10, 'aurora-endpoint.cluster.rds.amazonaws.com', 3306, 'OFFLINE_HARD', 2000000, 0, 1000, 0, 0, 0, 'RDS WRITER');

If you are using the read/write split feature of ProxySQL, you can add the Aurora reader endpoint as well. This also lets us cut over the readers first and confirm that the connection between ProxySQL and Aurora is available and stable.

insert into mysql_servers (hostgroup_id, hostname, port, status, weight, compression, max_connections, max_replication_lag, use_ssl, max_latency_ms, comment)
values (11, 'aurora-endpoint.cluster-ro.rds.amazonaws.com', 3306, 'OFFLINE_HARD', 2000000, 0, 1000, 0, 0, 0, 'RDS READER');
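
Remember that ProxySQL keeps its admin configuration separate from the runtime configuration, so the change has to be loaded to runtime and persisted after every update to mysql_servers (the SQL files used by the scripts below presumably do the same):

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;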

The mysql_servers table should look like this now:
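
Again with the hypothetical hostnames, selecting the newly added rows from the admin interface should show both Aurora endpoints parked as OFFLINE_HARD:

Admin> SELECT hostgroup_id, hostname, status, weight, comment FROM mysql_servers WHERE comment LIKE 'RDS%';
+--------------+----------------------------------------------+--------------+---------+------------+
| hostgroup_id | hostname                                     | status       | weight  | comment    |
+--------------+----------------------------------------------+--------------+---------+------------+
| 10           | aurora-endpoint.cluster.rds.amazonaws.com    | OFFLINE_HARD | 2000000 | RDS WRITER |
| 11           | aurora-endpoint.cluster-ro.rds.amazonaws.com | OFFLINE_HARD | 2000000 | RDS READER |
+--------------+----------------------------------------------+--------------+---------+------------+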

Scripts Prerequisites

I am going to explain the cut-over process and share the scripts that I used in the following sections. All of the scripts can be found in this repository. There are a few assumptions I make before moving on to the scripts:

  • The scripts run on a machine with Ubuntu 16.04 or higher that has connectivity to the MySQL, ProxySQL and Aurora instances; it could be the ProxySQL machine itself
  • mysql-client is installed
  • Three login paths named mysql, proxysql and aurora have been set up and tested using mysql_config_editor (see the sketch after this list)
  • I only share the bash scripts here; the SQL files can be found in the repo
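
Setting up the login paths is a one-off step. A minimal sketch, assuming hypothetical hosts and users, could look like this:

# store credentials for the three endpoints used by the scripts (hosts and users are placeholders)
mysql_config_editor set --login-path=mysql --host=mysql-node-1 --user=admin --password
mysql_config_editor set --login-path=proxysql --host=127.0.0.1 --port=6032 --user=admin --password
mysql_config_editor set --login-path=aurora --host=aurora-endpoint.cluster.rds.amazonaws.com --user=admin --password

# quick smoke test of the aurora login path
mysql --login-path=aurora -e "SELECT @@aurora_version;"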

Cut-Over The Reader Endpoints

If you have added the Aurora reader endpoint to ProxySQL, it’s a good idea to cut over the readers first. It’s a safe activity that helps you identify potential connection issues before the writer cut-over. Even if something goes wrong when you route the read traffic to Aurora, it will probably only impact a small part of your application, such as reports and dashboards, and you can quickly revert the change and troubleshoot the problem without losing any transactions. You can use this script to cut over the reader endpoints:

#!/bin/bash
set -e
proxy_login_path="proxysql"
echo "setting rds readers online, they can accept queries now…"
mysql --login-path=$proxy_login_path < ./01_set_rds_readers_online.sql
echo "waiting for 5 seconds for the current queries to complete…"
sleep 5
echo "setting mysql readers offline, all read traffic should go to rds now…"
mysql --login-path=$proxy_login_path < ./02_set_mysql_readers_offlline.sql
echo "CUT-OVER READERS COMPLETED - Please monitor readers."

Failback The Reader Endpoints

You should always be prepared for things to go wrong. For every cut-over, we need a plan to fail back to the previous state so we can troubleshoot what went wrong later. This is how we can fail back the reader endpoints if something goes wrong in the previous step:

#!/bin/bash
set -e
proxy_login_path="proxysql"
echo "failing back readers…"
mysql --login-path=$proxy_login_path < ./03_failback_readers.sql
echo "FAILBACK READERS COMPLETED"

Cut-Over The Writer Endpoints

We have finally reached the REAL CUT-OVER, where you will have a short downtime, and the shorter the better. Our focus here is to minimize downtime by scripting the process, and you should also be prepared to fail back in case something goes wrong. As soon as you cut over the requests to the new Aurora cluster, new transactions will run against it, and you don’t want to lose them if you have to fail back to your old MySQL cluster. So, right before the cut-over, we have to stop the replication and start a reverse replication from Aurora to MySQL. Note that replicating out of Aurora requires binary logging to be enabled on the cluster (the binlog_format parameter in the DB cluster parameter group), which goes hand in hand with the binlog retention setting from the Preparation section. Create a user in the Aurora cluster with the appropriate permissions for the reverse replication:

CREATE USER 'reverse_replication' IDENTIFIED BY '<password>';
GRANT REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'reverse_replication';
FLUSH PRIVILEGES;

Now that everything is ready, let’s walk through the actual cut-over process, which requires no changes to the application configuration:

  1. In ProxySQL: Set the MySQL writer endpoints to OFFLINE_SOFT; this stops new requests from reaching them while allowing the current requests to finish
  2. Wait 1-2 seconds to allow the current requests to finish (depending on the average transaction time in your database, you may need to wait a shorter or longer time)
  3. In ProxySQL: Set the MySQL writer endpoints to OFFLINE_HARD. This is where the downtime starts, because there is no ONLINE writer endpoint left to route requests to
  4. Wait until replication catches up on your Aurora cluster
  5. In Aurora: Stop replication
  6. In MySQL: Start the reverse replication for failback (you need to ignore the RDS-specific tables because they don't exist in MySQL; you can ignore the tables that start with rds_ or simply ignore the entire mysql database)
  7. Make sure the reverse replication has no lag
  8. In ProxySQL: Set the Aurora writer endpoints to ONLINE

Doing this manually would probably take 4-5 minutes, which means 4-5 minutes of downtime. So, let’s script it:

#!/bin/bash
set -e

proxy_login_path="proxysql"
mysql_login_path="mysql"
rds_login_path="aurora"

function progress {
    echo "-----------------------------"
    date
    echo $1
}

function read_slave_status {
    progress "reading slave status…"
    slave_status=$(mysql --login-path=$1 -e "SHOW SLAVE STATUS\G")
    slave_seconds_behind_master=$(echo "$slave_status" | grep "Seconds_Behind_Master" | awk '{ print $2 }')
    slave_master_log_file=$(echo "$slave_status" | grep "[^_]Master_Log_File" | awk '{ print $2 }')
    slave_read_master_log_pos=$(echo "$slave_status" | grep "Read_Master_Log_Pos" | awk '{ print $2 }')
    slave_exec_master_log_pos=$(echo "$slave_status" | grep "Exec_Master_Log_Pos" | awk '{ print $2 }')
    echo "-Seconds_Behind_Master=$slave_seconds_behind_master"
    echo "-Master_Log_File=$slave_master_log_file"
    echo "-Read_Master_Log_Pos=$slave_read_master_log_pos"
    echo "-Exec_Master_Log_Pos=$slave_exec_master_log_pos"
}

function read_master_status {
    progress "reading master status…"
    master_status=$(mysql --login-path=$1 -e "SHOW MASTER STATUS\G")
    master_log_file=$(echo "$master_status" | grep "File" | awk '{ print $2 }')
    master_log_position=$(echo "$master_status" | grep "Position" | awk '{ print $2 }')
    echo "-Master_Log_File=$master_log_file"
    echo "-Master_Log_Position=$master_log_position"
}

function start_reverse_replication {
    progress "starting reverse replication…"
    # make MySQL a replica of the Aurora cluster, ignoring the mysql schema (the rds_* tables do not exist on-premises)
    sql="CHANGE MASTER TO "
    sql+="MASTER_HOST='aurora-endpoint.cluster.rds.amazonaws.com', "
    sql+="MASTER_USER='reverse_replication', "
    sql+="MASTER_PASSWORD='password', "
    sql+="MASTER_LOG_FILE='$master_log_file', "
    sql+="MASTER_LOG_POS=$master_log_position; "
    sql+="CHANGE REPLICATION FILTER REPLICATE_IGNORE_DB = (mysql); "
    sql+="START SLAVE; "
    mysql --login-path=$mysql_login_path -e "$sql"
}

# ---------- main module ----------
progress "starting cutover…"

progress "setting mysql writers offline soft…"
mysql --login-path=$proxy_login_path < ./04_set_mysql_writers_offlline_soft.sql

progress "waiting 2 seconds to finish active transactions…"
sleep 2

progress "setting mysql writers offline hard…"
mysql --login-path=$proxy_login_path < ./05_set_mysql_writers_offlline_hard.sql

# wait for the Aurora replica to fully catch up with the MySQL master
slave_seconds_behind_master=1
while :
do
    read_slave_status $rds_login_path
    read_master_status $mysql_login_path
    if [ $slave_seconds_behind_master -eq 0 ] && [ $slave_read_master_log_pos -eq $slave_exec_master_log_pos ] && [ "$master_log_file" = "$slave_master_log_file" ] && [ $master_log_position -eq $slave_read_master_log_pos ]; then
        progress "RDS replication caught up."
        break
    fi
    progress "waiting 1 second for replication to catch up…"
    sleep 1
done

progress "stopping replication…"
mysql --login-path=$rds_login_path < ./06_stop_rds_replication.sql
progress "replication stopped."

# capture Aurora's binlog coordinates and start replicating back to MySQL for a possible failback
read_master_status $rds_login_path
start_reverse_replication

# wait for the reverse replication to catch up before going live
slave_seconds_behind_master=1
while :
do
    read_slave_status $mysql_login_path
    read_master_status $rds_login_path
    if [ $slave_seconds_behind_master -eq 0 ] && [ $slave_read_master_log_pos -eq $slave_exec_master_log_pos ] && [ "$master_log_file" = "$slave_master_log_file" ] && [ $master_log_position -eq $slave_read_master_log_pos ]; then
        progress "reverse replication caught up."
        break
    fi
    progress "waiting 1 second for replication to catch up…"
    sleep 1
done

progress "setting rds writers online…"
mysql --login-path=$proxy_login_path < ./07_set_rds_writers_online.sql

progress "CUT-OVER COMPLETED. MONITOR TRAFFIC NOW."

Failback The Writer Endpoints

Finally, we also need to prepare a failback process for the writers. Since we already have the reverse replication in place, we can quickly fail back to the previous state if anything goes wrong. It is basically the same process as the cut-over, just in the opposite direction:

#!/bin/bash
set -e

proxy_login_path="proxysql"
mysql_login_path="mysql"
rds_login_path="aurora"

function progress {
    echo "-----------------------------"
    date
    echo $1
}

function read_slave_status {
    progress "reading slave status…"
    slave_status=$(mysql --login-path=$1 -e "SHOW SLAVE STATUS\G")
    slave_seconds_behind_master=$(echo "$slave_status" | grep "Seconds_Behind_Master" | awk '{ print $2 }')
    slave_master_log_file=$(echo "$slave_status" | grep "[^_]Master_Log_File" | awk '{ print $2 }')
    slave_read_master_log_pos=$(echo "$slave_status" | grep "Read_Master_Log_Pos" | awk '{ print $2 }')
    slave_exec_master_log_pos=$(echo "$slave_status" | grep "Exec_Master_Log_Pos" | awk '{ print $2 }')
    echo "-Seconds_Behind_Master=$slave_seconds_behind_master"
    echo "-Master_Log_File=$slave_master_log_file"
    echo "-Read_Master_Log_Pos=$slave_read_master_log_pos"
    echo "-Exec_Master_Log_Pos=$slave_exec_master_log_pos"
}

function read_master_status {
    progress "reading master status…"
    master_status=$(mysql --login-path=$1 -e "SHOW MASTER STATUS\G")
    master_log_file=$(echo "$master_status" | grep "File" | awk '{ print $2 }')
    master_log_position=$(echo "$master_status" | grep "Position" | awk '{ print $2 }')
    echo "-Master_Log_File=$master_log_file"
    echo "-Master_Log_Position=$master_log_position"
}

function start_reverse_replication {
    progress "starting reverse replication…"
    # point the Aurora cluster back at the on-premises MySQL master
    sql="CALL mysql.rds_set_external_master ('mysql-node-1', 3306, 'rds_replication', 'password', '$master_log_file', $master_log_position, 0);"
    echo "$sql"
    mysql --login-path=$rds_login_path -e "$sql"
    start_repl="CALL mysql.rds_start_replication;"
    mysql --login-path=$rds_login_path -e "$start_repl"
}

# ---------- main module ----------
progress "starting failback…"

progress "setting rds writers offline soft…"
mysql --login-path=$proxy_login_path < ./08_set_rds_writers_offlline_soft.sql

progress "waiting 2 seconds to finish active transactions…"
sleep 2

progress "setting rds writers offline hard…"
mysql --login-path=$proxy_login_path < ./09_set_rds_writers_offlline_hard.sql

# wait for MySQL (currently a replica of Aurora) to fully catch up
slave_seconds_behind_master=1
while :
do
    read_slave_status $mysql_login_path
    read_master_status $rds_login_path
    if [ $slave_seconds_behind_master -eq 0 ] && [ $slave_read_master_log_pos -eq $slave_exec_master_log_pos ] && [ "$master_log_file" = "$slave_master_log_file" ] && [ $master_log_position -eq $slave_read_master_log_pos ]; then
        progress "MySQL replication caught up."
        break
    fi
    progress "waiting 1 second for replication to catch up…"
    sleep 1
done

progress "stopping replication…"
mysql --login-path=$mysql_login_path < ./10_stop_mysql_replication.sql
progress "replication stopped."

# capture MySQL's binlog coordinates and start replicating back to Aurora again
read_master_status $mysql_login_path
start_reverse_replication

# wait for Aurora to catch up before routing traffic back to MySQL
slave_seconds_behind_master=1
while :
do
    read_slave_status $rds_login_path
    read_master_status $mysql_login_path
    if [ $slave_seconds_behind_master -eq 0 ] && [ $slave_read_master_log_pos -eq $slave_exec_master_log_pos ] && [ "$master_log_file" = "$slave_master_log_file" ] && [ $master_log_position -eq $slave_read_master_log_pos ]; then
        progress "RDS replication caught up."
        break
    fi
    progress "waiting 1 second for replication to catch up…"
    sleep 1
done

progress "setting mysql writers online…"
mysql --login-path=$proxy_login_path < ./11_set_mysql_writers_online.sql

progress "FAILBACK COMPLETED. MONITOR TRAFFIC NOW."

Summary

Migrating a live database to a new server or platform is challenging, and it can cause long downtime if you are not prepared for it. In this post, I explained how you can migrate a live MySQL database to a new Aurora cluster with near-zero downtime by using ProxySQL and scripting the process in bash. Although these scripts are not meant to be copied and run as-is, you can use them to build your own migration process that fits your system. Leave a comment below if you have any questions.
