A few weeks ago, I spun up an RDS Aurora Multi-AZ instance. It automatically created two instances: the main one and a read-only replica.
Last week I used the mysql command-line interface to log in to the main MySQL instance, and I successfully created a new table. Today, I used the mysql command-line interface to log in to the main instance again, tried to make a change, and got an error saying the database was read-only. I then looked in the AWS RDS console, and it appears that the main instance and the replica have switched: the main is now read-only and the replica is the writer.
I noticed that about 2 hours ago, and the situation has not changed. So this is not happening because of a maintenance window (since the maintenance windows are only 30 minutes long).
Why would this have happened? Is there something I should do to prevent this from happening in the future?
They may have switched due to maintenance. There's a pending upgrade to Aurora 1.7.1, dated 2016-09-20, showing for one of my Aurora clusters now (on 2016-10-15,
SELECT @@AURORA_VERSION;
shows 1.6). It would make sense if the replicas were upgraded first, then a failover event was triggered, and then the master was upgraded, but I'm speculating -- I can't find this explicitly stated in the documentation. Or, there could have been a failure of the original master that resulted in a failover, followed by recovery of the original master.
Either way, you should find evidence of something in the instance event logs, assuming it was recent -- see "Events" on the left-hand side of the RDS console.
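The same events are also available programmatically, via the `DescribeEvents` RDS API. A minimal sketch of scanning those records for failover messages -- the sample event data below is invented for illustration, shaped like the entries boto3's `describe_events` returns:

```python
# Hypothetical sketch: scan RDS event records for messages that explain a
# failover. The sample data below is invented for illustration; in practice
# you would fetch it with boto3's rds client (describe_events).
sample_events = [
    {"SourceIdentifier": "node-1", "Date": "2016-10-15T03:12:00Z",
     "Message": "Completed failover to DB instance: node-2"},
    {"SourceIdentifier": "node-1", "Date": "2016-10-14T09:00:00Z",
     "Message": "Backing up DB instance"},
]

def failover_events(events):
    """Return only the events whose message mentions a failover."""
    return [e for e in events if "failover" in e["Message"].lower()]

for e in failover_events(sample_events):
    print(e["Date"], e["SourceIdentifier"], e["Message"])
```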
As to why they switched and then didn't switch back -- that part is potentially easier to answer: I don't believe there's any reason to expect them to switch back.
At any point in time, one of your instances is the "master" -- but unlike MySQL/MariaDB native replication, calling it the "master" isn't really accurate, because the instances in an Aurora cluster all share a common backing store. They don't have individual copies of the data; they're all peers accessing a shared, replicated storage back-end.

Rather than a master and slaves/replicas, one of them is a writer (can read and write) and the others -- if they exist; a single-instance "cluster" is valid -- are readers (read-only). Any one of the instances can become the writer due to a failover event, which may be triggered for reasons other than an actual failure. It's possible to prioritize the instances so that failover causes a switch to a preferred instance (the instances in an Aurora cluster don't have to be the same instance class), but this only seems relevant when the number of nodes is greater than two.
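That prioritization works by assigning each instance a promotion tier (0 through 15, lower is preferred); on failover, Aurora promotes the reader in the lowest tier, breaking ties in favor of the larger instance. A rough sketch of that selection rule -- the instance names and sizes here are invented for illustration:

```python
# Illustrative sketch of Aurora's failover-target selection: among the
# surviving readers, promote the one with the lowest promotion tier
# (0-15, lower wins); break ties by preferring the larger instance.
# Instance names and sizes below are invented for illustration.
def pick_failover_target(readers):
    return min(readers, key=lambda r: (r["tier"], -r["size_gib"]))

readers = [
    {"name": "node-2", "tier": 1, "size_gib": 16},
    {"name": "node-3", "tier": 0, "size_gib": 8},
    {"name": "node-4", "tier": 0, "size_gib": 32},
]
print(pick_failover_target(readers)["name"])  # node-4: tier 0, larger instance
```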
Fundamentally, though, the design of Aurora appears to be such that you shouldn't be thinking of your instances as though a specific one of them is the master... and the infrastructure provides a way for it not to matter.
An Aurora cluster has a cluster name assigned by you and an alphanumeric cluster identifier assigned by the system, and each instance in the cluster has a name assigned by you.
Aurora, as is standard behavior for RDS, creates a hostname in DNS for each instance, based on the name you give the instance and the cluster identifier. But an Aurora cluster also gets two additional hostnames: one that connects you to the writer, and another that connects you to one of the readers (or, when the cluster has only one member, to that sole member, which is by definition the writer).
So let's say your cluster name is prod-db, your system-assigned identifier is xyzzyexample, the nodes you created are named node-1 and node-2... and the region is us-east-1. The instance hostnames look like this:

    node-1.xyzzyexample.us-east-1.rds.amazonaws.com
    node-2.xyzzyexample.us-east-1.rds.amazonaws.com

But the hostnames you should be using to access Aurora are not those. The ones you should be using, unless you have a specific reason to do otherwise, such as pinning a job to a specific replica, look like this:

    prod-db.cluster-xyzzyexample.us-east-1.rds.amazonaws.com
    prod-db.cluster-ro-xyzzyexample.us-east-1.rds.amazonaws.com

The first is the cluster ("writer") endpoint; the second, with cluster-ro, is the reader endpoint.
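The naming scheme behind these hostnames is mechanical enough to derive in code. A minimal sketch, assuming AWS's cluster- / cluster-ro- endpoint convention as described here (the helper function itself is hypothetical):

```python
# Hypothetical helper: build Aurora endpoint hostnames from the cluster
# name, the system-assigned cluster identifier, the region, and the
# instance names, following the naming scheme described above.
def aurora_endpoints(cluster_name, cluster_id, region, instances):
    suffix = f"{region}.rds.amazonaws.com"
    return {
        "writer": f"{cluster_name}.cluster-{cluster_id}.{suffix}",
        "reader": f"{cluster_name}.cluster-ro-{cluster_id}.{suffix}",
        "instances": {i: f"{i}.{cluster_id}.{suffix}" for i in instances},
    }

eps = aurora_endpoints("prod-db", "xyzzyexample", "us-east-1",
                       ["node-1", "node-2"])
print(eps["writer"])  # prod-db.cluster-xyzzyexample.us-east-1.rds.amazonaws.com
```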
These are implemented as CNAMEs in DNS, managed by RDS, so each time you connect, you get an answer appropriate to the current configuration of your cluster. The TTLs are 5 seconds on the writer address, and 1 second on the reader address, so the odds are pretty good that the answer will be correct. By using these addresses to connect, you don't have to be concerned with the machines switching roles.
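The practical consequence of those short TTLs is that your application should resolve the endpoint freshly on every (re)connect rather than caching an IP at startup. An illustrative sketch, using a dict as a stand-in for DNS (names here reuse the hypothetical cluster from above):

```python
# Illustrative sketch: because the writer endpoint is a short-TTL CNAME,
# resolve it per connection instead of caching the answer. The fake_dns
# dict below stands in for real DNS, purely for illustration.
fake_dns = {"prod-db.cluster-xyzzyexample.us-east-1.rds.amazonaws.com": "node-1"}

def connect(endpoint, resolve=fake_dns.get):
    # A real client would let the resolver follow the CNAME here; the
    # point is that the lookup happens on each connect, not once at startup.
    return resolve(endpoint)

ep = "prod-db.cluster-xyzzyexample.us-east-1.rds.amazonaws.com"
print(connect(ep))               # currently the writer is node-1
fake_dns[ep] = "node-2"          # failover: the CNAME now points elsewhere
print(connect(ep))               # a fresh connect follows the new target
```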