tl;dr: Replication is stalled on a specific binlog and position, and I'm not sure why
I have a MySQL Replication setup with MySQL 5.5.
This replication setup has no history of ever falling behind and has always been solid.
This morning, I noticed that the Slave was behind the Master by 17 hours.
After doing more research, it looks to be an issue with the SQL_Thread.
The current master log file, according to the slave (via SLAVE STATUS), is mysql-bin.001306 @ position 20520499. This is in line with the MASTER STATUS output from the master.
However, SLAVE STATUS shows that the Relay_Master_Log_File is currently mysql-bin.001302 with an Exec_Master_Log_Pos of 36573336. Neither Relay_Master_Log_File nor Exec_Master_Log_Pos has advanced at all while I've been monitoring them this morning.
Looking into the binlogs on the master, this is the statement located at mysql-bin.001302@36573336:
# at 36573053
#170221 14:33:48 server id 1 end_log_pos 36573130 Query thread_id=96205677 exec_time=0 error_code=0
SET TIMESTAMP=1487716428/*!*/;
BEGIN
/*!*/;
# at 36573130
# at 36573213
#170221 14:33:48 server id 1 end_log_pos 36573213 Table_map: `database-name`.`table-name` mapped to number 5873
#170221 14:33:48 server id 1 end_log_pos 36573309 Write_rows: table id 5873 flags: STMT_END_F
### INSERT INTO `database-name`.`table-name`
### SET
### @1='xxxxxxxx'
### @2=6920826
### @3='xxxxxxxx'
### @4='GET'
### @5='address'
### @6=2017-02-21 14:40:24
### @7=2017-02-21 14:40:24
# at 36573309
#170221 14:33:48 server id 1 end_log_pos 36573336 Xid = 1668637037
COMMIT/*!*/;
# at 36573336
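(As a sanity check, you can also look at the events around that position straight from the mysql client on the master; it won't decode the row images the way the verbose mysqlbinlog output above does, but it confirms which events sit there. The file and start position below are the ones from the excerpt above.)
mysql> show binlog events in 'mysql-bin.001302' from 36573053 limit 10;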
Around this time yesterday, I executed a few large queries to migrate data to a new table. The process looked a bit like this:
mysql> insert into tmp_table ( select <rows> from origin_table ); -- 44 million rows
mysql> insert into dest_table ( select * from tmp_table ); -- 44 million rows
The two tables in question did not have a primary or unique key on them, which I've read could be a problem. However, while the database and table shown in the binlog entry above are the destination table here -- the inserted record shown is not one that was generated during the migration.
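(For what it's worth, confirming the lack of keys is just a quick check like this, using the anonymized table name from the binlog entry above:)
mysql> show index from `database-name`.`table-name`;
mysql> show create table `database-name`.`table-name`\G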
If you've gotten this far, you deserve internet points.
At this point, I'm not sure what else to consider or where else to look to find the reason for the log stall. Any insight is appreciated.
Thanks.
For reference, here is the MASTER STATUS and SLAVE STATUS output as of the time of this post:
MASTER STATUS
mysql> show master status;
+------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+----------+--------------+------------------+
| mysql-bin.001306 | 20520499 | | |
+------------------+----------+--------------+------------------+
1 row in set (0.00 sec)
SLAVE STATUS
mysql> show slave status \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: master-host
Master_User: replication-user
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.001306
Read_Master_Log_Pos: 20520499
Relay_Log_File: relay-bin.002601
Relay_Log_Pos: 36573482
Relay_Master_Log_File: mysql-bin.001302
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 36573336
Relay_Log_Space: 3565987462
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 63435
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
1 row in set (0.00 sec)
It turns out I was on the right track here with the large query transactions from yesterday.
After I migrated the data, I executed a DELETE statement on the original table, to get rid of the rows that I had migrated away.
These tables are just full of tracking data and therefore don't have any primary or unique keys on them.
Due to how ROW-based replication works, the slave, instead of executing the identical DELETE statement that was executed on the master, executes a DELETE for each affected row, which ends up looking something like this:
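Roughly, each of those per-row deletes is the equivalent of this (the table is the origin table from the migration above, column names are made up for illustration, and the real WHERE clause carries a value for every column of the deleted row, since there is no key to match on):
DELETE FROM `database-name`.`origin_table`
WHERE col1 = '...'
  AND col2 = '...'
  AND col3 = '...'
  -- ...and so on, one predicate per column, values taken from the deleted row...
LIMIT 1;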
And since there is no index that matches that query, the single-threaded replication SQL thread executed 40+ million DELETE statements (or... was trying to), which took a long time to run because of all the scanning that had to be done to identify each row (the table was about 80 million rows in size at the time).
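If you want to see why that's so painful, an EXPLAIN of the equivalent lookup tells the story (names and values are illustrative again); with no usable index it should come back with type: ALL, i.e. a full scan of an ~80-million-row table for every single row being deleted:
mysql> explain select * from `database-name`.`origin_table` where col1 = '...' and col2 = '...' \G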
In the end, I dealt with this by stopping the slave thread (STOP SLAVE), skipping a single slave transaction (SET GLOBAL sql_slave_skip_counter = 1;), and restarting the slave thread (START SLAVE).
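That is, on the slave:
mysql> stop slave;
mysql> set global sql_slave_skip_counter = 1;
mysql> start slave;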
This resulted in my Master and Slave being out of sync on the table in question here -- but I was able to leverage the nature of row-based replication to bring it back in sync by executing the following on the Master:
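In rough shape, it was something like this (table names are illustrative; the trailing RENAME is optional, just to take the original name back):
mysql> create table origin_table_new like origin_table;
mysql> insert into origin_table_new select * from origin_table; -- only the kept rows exist on the Master at this point
mysql> drop table origin_table;                                 -- a single DDL statement; removes the out-of-sync copy on the Slave
mysql> rename table origin_table_new to origin_table;           -- optional: give the new table the original name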
Since the DELETE was executed on the Master, the INSERT here only inserted the records I wanted to keep (the deleted ones are gone). And since row-based replication inserts each row individually instead of executing the same INSERT INTO ... SELECT statement, the Slave's table was populated with only the desired data. Then the subsequent DROP TABLE statement drops the table on the Slave without having to address each row individually.
The caveat here is that because the Master's version of the table was still 30-40 million rows, the INSERT and the resulting replication end up locking up the Slave for a little while (duplicating the problem above), but it's a much shorter stall (it ended up being about 20 minutes) because MySQL doesn't have to scan the table for rows to delete.
I hope this can be of help to someone in the future. Sorry it's long-winded, but I hope it was informative and helpful.