I have a master/slave replication setup with a mix of InnoDB and MyISAM tables spread across over 7000 databases, and I want to copy the data from my master to the slave to restore replication.
Both servers were running Ubuntu 10.04.2 LTS (which uses the mysql-server 5.1.41-3ubuntu12 package). Recently I tried to upgrade MySQL in the hope that I was hitting some bug that a newer version had resolved -- so my slave is now Ubuntu 10.10. However, the problem appears to be the same.
I'd prefer not to disrupt my master, so I have tried taking an LVM snapshot of my entire disc so that I can copy my data and log directories via rsync to my slave (roughly as sketched below):
/var/lib/mysql : Where my ibdata1 and ib_logfile0, as well as all my .ibd and .frm files, are stored. I use innodb_file_per_table, so there are a lot of .ibd files.
/var/log/mysql : Where I keep all my binary logs
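Roughly, the snapshot-and-copy step I'm describing looks like this (the volume group, snapshot size, mount point and slave hostname are placeholders, not my exact setup):

# on the master: snapshot the volume that holds the MySQL data and logs
lvcreate --snapshot --size 10G --name mysql-snap /dev/vg0/root
mkdir -p /mnt/mysql-snap
mount -o ro /dev/vg0/mysql-snap /mnt/mysql-snap

# copy both directories to the slave
rsync -avz /mnt/mysql-snap/var/lib/mysql/ slave:/var/lib/mysql/
rsync -avz /mnt/mysql-snap/var/log/mysql/ slave:/var/log/mysql/

# clean up the snapshot afterwards
umount /mnt/mysql-snap
lvremove -f /dev/vg0/mysql-snap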
Once copied, I reset the permissions:
chown mysql.mysql /var/lib/mysql -R
chown mysql.mysql /var/log/mysql -R
I remove the master.info and relay-log.info files from the /var/lib/mysql directory (since my master is actually a slave to another master, for certain tables).
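In other words, on the slave:

rm /var/lib/mysql/master.info /var/lib/mysql/relay-log.info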
Then I try to start mysql on the slave. Soon, I start to see lots and lots of errors like the following in /var/log/mysql.err:
InnoDB: Error: tablespace id is 150238 in the data dictionary
InnoDB: but in file ./1_107789/email.ibd it is 150747!
or:
InnoDB: Error: trying to add tablespace 148302 of name './23_4377/link.ibd'
InnoDB: to the tablespace memory cache, but tablespace
InnoDB: 148302 of name './1_68522/open.ibd' already exists in the tablespace
InnoDB: memory cache!
And then every now and then:
110207 13:55:45  InnoDB: Assertion failure in thread 2979265392 in file ../../../storage/innobase/fil/fil0fil.c line 603
InnoDB: Failing assertion: 0
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
InnoDB: about forcing recovery.
110207 13:55:45 - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=1
max_threads=10000
threads_connected=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 868418 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0xbc5a7138
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0xb193f13c thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x2d) [0xb7638c4d]
/usr/sbin/mysqld(handle_segfault+0x494) [0xb7304854]
[0xb707f400]
/lib/tls/i686/cmov/libc.so.6(abort+0x182) [0xb6d89a82]
/usr/sbin/mysqld(+0x477790) [0xb7514790]
/usr/sbin/mysqld(+0x47795e) [0xb751495e]
/usr/sbin/mysqld(fil_space_get_size+0xdc) [0xb751966c]
/usr/sbin/mysqld(buf_read_page+0xad) [0xb75015dd]
/usr/sbin/mysqld(buf_page_get_gen+0x331) [0xb74fab21]
/usr/sbin/mysqld(btr_get_size+0x190) [0xb75b02b0]
/usr/sbin/mysqld(dict_update_statistics_low+0x50) [0xb7503e70]
/usr/sbin/mysqld(dict_table_get+0xec) [0xb750682c]
/usr/sbin/mysqld(+0x4cde5f) [0xb756ae5f]
/usr/sbin/mysqld(row_ins+0x157) [0xb756d3c7]
/usr/sbin/mysqld(row_ins_step+0x110) [0xb756d710]
/usr/sbin/mysqld(row_insert_for_mysql+0x37e) [0xb75754de]
/usr/sbin/mysqld(ha_innobase::write_row(unsigned char*)+0xf9) [0xb74e1299]
/usr/sbin/mysqld(handler::ha_write_row(unsigned char*)+0x6d) [0xb7412d3d]
/usr/sbin/mysqld(write_record(THD*, st_table*, st_copy_info*)+0x3ba) [0xb7391e2a]
/usr/sbin/mysqld(mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool)+0x1122) [0xb73967c2]
/usr/sbin/mysqld(mysql_execute_command(THD*)+0xc85) [0xb7317c95]
/usr/sbin/mysqld(mysql_parse(THD*, char const*, unsigned int, char const**)+0x3ae) [0xb731f45e]
/usr/sbin/mysqld(Query_log_event::do_apply_event(Relay_log_info const*, char const*, unsigned int)+0x47d) [0xb73dbe9d]
/usr/sbin/mysqld(Query_log_event::do_apply_event(Relay_log_info const*)+0x26) [0xb73dca76]
/usr/sbin/mysqld(apply_event_and_update_pos(Log_event*, THD*, Relay_log_info*)+0x137) [0xb7463cc7]
/usr/sbin/mysqld(handle_slave_sql+0x1094) [0xb74662e4]
/lib/tls/i686/cmov/libpthread.so.0(+0x596e) [0xb706396e]
/lib/tls/i686/cmov/libc.so.6(clone+0x5e) [0xb6e29a4e]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0xb183bdc6 is an invalid pointer
thd->thread_id=2
thd->killed=NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
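The log keeps pointing me at the forcing-recovery page; as I understand it, that comes down to a my.cnf setting along these lines (the value 1 here is just an illustration, I have not confirmed this is the right fix):

[mysqld]
# 1 is the least drastic setting; values up to 6 skip progressively more of crash recovery
innodb_force_recovery = 1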
I have been fiddling with various options and trying to understand why it thinks there is a tablespace mismatch. As far as I can tell there should be no mismatch, because I'm copying the ibdata1 file and the InnoDB log files as well as all the .ibd files. So why doesn't it just recover and get on with it, so that I can restore replication? I'm clearly missing something, but I cannot find it.
Any clues or suggestions appreciated. Thanks
I believe that you have an inconsistent snapshot, especially given the tablespace errors above.
It may not be LVM's fault. Googling here and here, my guess is that you need to make sure MySQL has written everything to disk (nothing left in buffers) and that no changes can happen while the snapshot is taken, by locking the tables to be on the safe side. It could also be due to the different MySQL versions, on the small chance that something has changed in the InnoDB code. You could rule that out by trying the exact same snapshot on a clone/similar server of your master. Please see this too.
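A minimal sketch of that idea, assuming you take the snapshot while the read lock is held (the volume and snapshot names are placeholders):

# session 1: flush everything to disk and block writes, keep this session open
mysql> FLUSH TABLES WITH READ LOCK;
mysql> SHOW MASTER STATUS;   -- note the binlog file and position for the slave

# session 2: take the LVM snapshot while the lock is still held
lvcreate --snapshot --size 10G --name mysql-snap /dev/vg0/root

# session 1: release the lock once the snapshot exists
mysql> UNLOCK TABLES;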
I think the problem was in the way I was copying the data. Since my old slave already had some of the databases on it, I was using rsync to save time when copying the data. But once I added the -I option, it worked for me successfully. The -I (--ignore-times) option tells rsync not to skip files that match in size and modification time. Presumably small sub-second changes to files (which changed neither the file size nor the file timestamp) were causing the problem.
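For illustration, the difference between the two runs is something like this (the paths and destination are placeholders, not the exact commands I used):

# original copy: files whose size and mtime match are skipped
rsync -avz /mnt/mysql-snap/var/lib/mysql/ slave:/var/lib/mysql/

# with -I (--ignore-times): every file is compared and transferred if its
# contents differ, even when size and mtime match
rsync -avzI /mnt/mysql-snap/var/lib/mysql/ slave:/var/lib/mysql/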