Context
We saw sporadic ActiveRecord::StatementInvalid errors from our MySQL servers over the course of 4 hours yesterday. The errors haven't recurred since.
Strangely, the SQL statements seem to have been corrupted: one or more characters had their byte value shifted by 2, up or down. For example, we got "Unknown table 'erades'" for a query against a `grades` table. This wasn't limited to table names: column names, SQL keywords, and literal values were all affected in those queries.
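To make the byte shift concrete, here's a quick check in Ruby of the example above (the snippet is just an illustration; only the 'grades' / 'erades' pair comes from the actual error):

    # Compare the expected identifier with the one MySQL reported and show the shift.
    expected = "grades"
    observed = "erades"
    expected.bytes.zip(observed.bytes).each_with_index do |(a, b), i|
      next if a == b
      printf("position %d: %s (0x%02x) -> %s (0x%02x), shift %+d\n", i, a.chr, a, b.chr, b, b - a)
    end
    # => position 0: g (0x67) -> e (0x65), shift -2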
The error
A sample of the error messages, listed as original character -> corrupted character, followed by the error message:
    . -> , : Table '[database name].id' doesn't exist: SELECT [..] FROM [..] INNER JOIN [..] ON [..] = `[table name]`.`id` ..
    ` -> b : Incorrect table name 'topic_setsb ON ': SELECT [..] FROM [..] INNER JOIN `topic_sets` ON ...
    s -> q : Unknown column 'aqsigned_[redacted]' in 'field list'
    U -> W : You have an error in your SQL syntax; [..] near 'NWLL GROUP BY ...
    8 -> : : You have an error in your SQL syntax; [..] near ':887)' at line 1: SELECT [..] WHERE [..] IN (48846, 48901, 48887)
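Every one of those character pairs differs by exactly 2 in byte value; a quick sanity check in Ruby (the pairs are copied from the samples above):

    # Original character from each sample vs. the corrupted character that replaced it.
    pairs = { '.' => ',', '`' => 'b', 's' => 'q', 'U' => 'W', '8' => ':' }
    pairs.each do |orig, bad|
      printf("%s (0x%02x) -> %s (0x%02x): shift %+d\n", orig, orig.ord, bad, bad.ord, bad.ord - orig.ord)
    end
    # => . (0x2e) -> , (0x2c): shift -2
    #    ` (0x60) -> b (0x62): shift +2
    #    s (0x73) -> q (0x71): shift -2
    #    U (0x55) -> W (0x57): shift +2
    #    8 (0x38) -> : (0x3a): shift +2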
Since MySQL likely reports only the first error in a statement, I couldn't tell the exact number of corruptions per statement. The full SQL query included in the exception message appears to be a copy of the statement held on the client side; there was no corruption in that portion of the message.
In some cases I couldn't tell from the error message which character had been corrupted, e.g. "the right syntax to use near '' at line 1", so not every corruption was necessarily a byte shift: the corrupted bytes might have been unprintable characters, or characters might have been inserted or deleted.
The corruption followed a few loose patterns rather than any clear-cut rule: each of a handful of patterns recurred 10-20 times, with the same statement corrupted in the same way, but the position of the corruption varied from pattern to pattern.
The error log I can obtain from the RDS console is empty. There was no AWS service degradation reported for the time period.
There were a total of 143 errors reported to our exception tracking tool. They originated from 4 Passenger workers: two on one EC2 instance and two on another. The error counts across these workers were 1, 41, 42, and 50. In total that's about 0.001% of the requests served during the 4 hours this happened.
For each worker the errors came in a trickle: each one reported errors for around 5 minutes, with 1-2 errors every few seconds to every 2-3 minutes - aside from the worker that had only 1 error.
Some queries were against the master database and some were against the replica database.
Environment and other facts:
- AWS RDS MySQL 5.6.39
- AWS EC2 instances c4.4xlarge. There were 15+ instances serving the same website at the time.
- Apache 2, mpm module: event
- Passenger 5.3.3, concurrency model: process. Typically there are 30+ worker processes in a single EC2 instance.
- Rails 4.2.10, database pool size: 2.
- MySQL client library: mysql2 0.4.10
- Ruby 2.3.7
Timeline
instance A launch 2018-11-13 12:00 UTC
instance A process P errors 2018-11-13 13:40-13:43 UTC
instance A process Q errors 2018-11-13 14:22-14:26 UTC
instance A shutdown 2018-11-13 20:00 UTC
instance B launch 2018-11-13 15:00 UTC
instance B process R errors 2018-11-13 15:39-15:40 UTC
instance B process S errors 2018-11-13 17:39-17:43 UTC
instance B shutdown 2018-11-13 23:00 UTC
Possible cause
An acquaintance mentioned that they saw the same kind of error in AWS on the same day, and referred me to this talk [1] about network-level data corruption. It explains how network switches recompute a packet's Ethernet CRC when forwarding it, so corruption introduced inside a switch can end up with a "valid" CRC, and how the TCP checksum has a loophole of its own. (Textual notes by kevinchen.co.)
[1] !!Con 2017: Corruption in the Data Center! TCP can fail to keep your data safe! by Evan Jones
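To see how a pair of +/-2 corruptions could slip past the TCP checksum, here is a minimal sketch of the RFC 1071-style ones'-complement checksum in Ruby. It sums only the payload bytes (the real TCP checksum also covers a pseudo-header and the TCP header), and the query strings are made up for illustration, but it shows that two offsetting shifts at like-parity offsets leave the checksum unchanged:

    # RFC 1071-style Internet checksum (the algorithm used by IP/TCP/UDP) over a byte string.
    def internet_checksum(data)
      bytes = data.bytes
      bytes << 0 if bytes.length.odd?                         # pad to a 16-bit boundary
      sum = bytes.each_slice(2).sum { |hi, lo| (hi << 8) | lo }
      sum = (sum & 0xffff) + (sum >> 16) while sum > 0xffff   # fold the carries back in
      ~sum & 0xffff
    end

    original  = "SELECT * FROM grades WHERE id = 48887"   # hypothetical query
    corrupted = "SELECT * FROM erades WHERE id = 48:87"   # 'g' shifted -2, '8' shifted +2, both at even offsets

    internet_checksum(original) == internet_checksum(corrupted)   # => true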
Their recommendation was to use TLS for all communication, since the TLS layer will fail to decrypt a corrupted record rather than silently passing the corruption along.
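For reference, here is a minimal sketch of what that could look like with the mysql2 client we use: turn on TLS and server-certificate verification on the connection (the endpoint, credentials, and CA-bundle path below are placeholders; with Rails the same options can be set in database.yml):

    require "mysql2"

    # Connect over TLS and verify the server certificate, so a corrupted TLS record
    # fails authentication instead of reaching the SQL parser as a garbled statement.
    client = Mysql2::Client.new(
      host:      "mydb.example.rds.amazonaws.com",       # placeholder endpoint
      username:  "app",
      password:  ENV["DB_PASSWORD"],
      database:  "app_production",
      sslca:     "/etc/ssl/rds-combined-ca-bundle.pem",   # RDS CA bundle; path is an example
      sslverify: true
    )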
I also found this page describing the limitations of CRCs and checksums. There's a similar-looking error in this serverfault question, and the cause there was also network-related.
I'm increasingly convinced that the errors were caused by something faulty at the network level.
Question
Does the Ethernet / TCP theory fit the erroneous behavior I described? I'm not sure whether it's plausible for just one process at a time to see these errors, although I suppose it could happen if a switch routes packets differently based on the source/destination port pair, since each connection uses a different source port.
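For what it's worth, per-flow load balancing (ECMP / LAG hashing) works roughly like this: the switch hashes the connection's 5-tuple and pins all of its packets to one of several equal-cost links, so a single faulty link could affect some connections while sparing others from the same host. A toy sketch (the hash function and addresses are made up, not what any particular switch does):

    require "digest"

    # Toy per-flow path selection: hash the 5-tuple and pick one of N equal-cost links.
    LINKS = 4

    def pick_link(src_ip, dst_ip, src_port, dst_port, proto = "tcp")
      key = [src_ip, dst_ip, src_port, dst_port, proto].join("|")
      Digest::MD5.hexdigest(key).to_i(16) % LINKS
    end

    # Two connections from the same app server to the same MySQL endpoint, differing
    # only in the ephemeral source port, can land on different links.
    pick_link("10.0.1.10", "10.0.2.20", 51344, 3306)   # => some link in 0..3
    pick_link("10.0.1.10", "10.0.2.20", 51907, 3306)   # => often a different link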
I'd be happy to reframe the question if this turns out to be an XY problem.
Note: I've posted a request on the AWS forum asking them to confirm / investigate whether this came from AWS's infrastructure. Here on serverfault, I'm interested in the plausibility of the hypothesis and in how it could (or couldn't) produce the behavior I saw.