Apparently I started a SELECT
query 20 days ago that never finished. It kept running even though the client had disconnected, and ran for so long that one of the undo logs grew to 230 gb in size. (The database in question is 320 gb of data). I've force killed the query, but the undo log just won't shrink back to normal size, causing the disk to be full.
For the record, this is the sizes on disk:
root@the-db:/var/lib/mysql# du -h undo_00* erik_*
11G undo_001
244G undo_002
1.5G erik_temporary_undo_004.ibu
22G erik_undo_003.ibu
One interesting thing is that all undo logs are pretty massive?
I've tried to make sense the documentation, but can't get it to work: https://dev.mysql.com/doc/refman/8.0/en/innodb-undo-tablespaces.html
Version running:
root@the-db:/var/lib/mysql# mysql --version
mysql Ver 8.0.18 for Linux on x86_64 (MySQL Community Server - GPL)
Automatic truncation
If I understand correctly, to have the purge thread truncate an undo log automatically, three condition needs to be met:
Step 1. The undo log needs to be larger than @@innodb_max_undo_log_size
, which is:
mysql> SELECT @@innodb_max_undo_log_size;
+----------------------------+
| @@innodb_max_undo_log_size |
+----------------------------+
| 1073741824 |
+----------------------------+
Step 2. @@innodb_undo_log_truncate
needs to be ON
mysql> select @@innodb_undo_log_truncate;
+----------------------------+
| @@innodb_undo_log_truncate |
+----------------------------+
| 1 |
+----------------------------+
Step 3. There needs to be a total of 2 active undo logs Automated truncation of undo tablespaces requires a minimum of two active undo tablespaces
(not sure if these means 2 active before one is taken out of rotation for truncating).
mysql> SELECT NAME, SPACE_TYPE, STATE FROM INFORMATION_SCHEMA.INNODB_TABLESPACES WHERE SPACE_TYPE = 'Undo' ORDER BY NAME;
+-------------------------+------------+----------+
| NAME | SPACE_TYPE | STATE |
+-------------------------+------------+----------+
| erik_temporary_undo_003 | Undo | active |
| erik_temporary_undo_004 | Undo | active |
| innodb_undo_001 | Undo | active |
| innodb_undo_002 | Undo | inactive |
+-------------------------+------------+----------+
According to above, we have 3 active and 4 total undo logs.
However, nothing seems to be happening.
Manual truncation
There is also a manual process for truncating a log, which has another set of prerequisites.
Step 1 and 2 are the same for automatic truncation.
Step 3 is different in that Manual truncation of undo tablespaces requires a minimum of three active undo tablespaces
(still not sure if this means 3 active before one is taken out of rotation). To get around my uncertainty, I created 2 extra undo table spaces.
CREATE UNDO TABLESPACE erik_temporary_undo_003 ADD DATAFILE 'erik_undo_003.ibu';
CREATE UNDO TABLESPACE erik_temporary_undo_004 ADD DATAFILE 'erik_temporary_undo_004.ibu';
Step 4 is to deactivate the table space needing a truncation:
ALTER UNDO TABLESPACE innodb_undo_002 SET INACTIVE;
If I understand correctly this will trigger truncation and after it is completed the STATUS
will be empty
. Again, here is the current status:
mysql> SELECT NAME, SPACE_TYPE, STATE FROM INFORMATION_SCHEMA.INNODB_TABLESPACES WHERE SPACE_TYPE = 'Undo' ORDER BY NAME;
+-------------------------+------------+----------+
| NAME | SPACE_TYPE | STATE |
+-------------------------+------------+----------+
| erik_temporary_undo_003 | Undo | active |
| erik_temporary_undo_004 | Undo | active |
| innodb_undo_001 | Undo | active |
| innodb_undo_002 | Undo | inactive |
+-------------------------+------------+----------+
It was set to inactive about 3 days ago, and it hasn't changed since.
One weird thing is that according to the following query, there are only 2 active undo logs, when the previous query shows 3. Am I missing something?
mysql> SHOW STATUS LIKE 'Innodb_undo_tablespaces%';
+----------------------------------+-------+
| Variable_name | Value |
+----------------------------------+-------+
| Innodb_undo_tablespaces_total | 4 |
| Innodb_undo_tablespaces_implicit | 2 |
| Innodb_undo_tablespaces_explicit | 2 |
| Innodb_undo_tablespaces_active | 2 |
+----------------------------------+-------+
Shouldn't Innodb_undo_tablespaces_active
be 3 at least?
Frequency
The documentation mentions innodb_purge_rseg_truncate_frequency
as a way to get the purge thread to run more often. The default is 128
and the example shows setting it to 32
. What this actually means in terms of time is very unclear. It just mentions "every 32 runs".
To increase that frequency, decrease the innodb_purge_rseg_truncate_frequency setting. For example, to have the purge thread look for undo tabespaces once every 32 timees[sic] that purge is invoked, set innodb_purge_rseg_truncate_frequency to 32.
For good measure, I've set it to 1.
mysql> show variables like "%truncate%";
+--------------------------------------+-------+
| Variable_name | Value |
+--------------------------------------+-------+
| innodb_purge_rseg_truncate_frequency | 1 |
| innodb_undo_log_truncate | ON |
+--------------------------------------+-------+
Metrics
I just recently understood how to get the InnoDB metrics:
mysql> set global innodb_monitor_enable = all;
And will update with metrics when they become available.
Relevant configs
mysql> show variables like "%undo%";
+--------------------------+------------+
| Variable_name | Value |
+--------------------------+------------+
| innodb_max_undo_log_size | 1073741824 |
| innodb_undo_directory | ./ |
| innodb_undo_log_encrypt | OFF |
| innodb_undo_log_truncate | ON |
| innodb_undo_tablespaces | 2 |
+--------------------------+------------+
mysql> show variables like "%truncate%";
+--------------------------------------+-------+
| Variable_name | Value |
+--------------------------------------+-------+
| innodb_purge_rseg_truncate_frequency | 1 |
| innodb_undo_log_truncate | ON |
+--------------------------------------+-------+
What am I missing to get MySQL to truncate the undo log size?
I just assume that truncating means that the actual file on disk will shrink. Maybe it means something else?
After spending a few days fighting with an ever-growing undo tablespace, we finally figured it out so I'm gonna share the results:
We had 2/2 active (default) undo tablespaces. 1 was around 1GB and other one was 90GB and growing.
Per docs:
That part was the key. To my understanding, all transactions need to finish before cleanup can take place.
We looked for all running transactions:
In the list of transaction, there was one that was 2 days old.
If you have performance schema enabled, you can get the process and query that holds it:
We killed that process and everything recovered within 20 minutes. All storage was reclaimed back.