Context
We saw sporadic ActiveRecord::StatementInvalid errors from our MySQL servers over the course of 4 hours yesterday. The errors haven't recurred since.
Strangely, the SQL statements seem to have been corrupted: one or more characters had their byte value shifted by 2, up or down. For example, we got "Unknown table 'erades'" for a query against a `grades` table. This wasn't limited to table names: column names, SQL keywords, and literal values were all affected in those queries.
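To make the byte shift concrete, here's a quick check in Ruby of the example above (the snippet is just an illustration; only the 'grades' / 'erades' pair comes from the actual error):

    # Compare the expected identifier with the one MySQL reported and show the shift.
    expected = "grades"
    observed = "erades"
    expected.bytes.zip(observed.bytes).each_with_index do |(a, b), i|
      next if a == b
      printf("position %d: %s (0x%02x) -> %s (0x%02x), shift %+d\n", i, a.chr, a, b.chr, b, b - a)
    end
    # => position 0: g (0x67) -> e (0x65), shift -2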
The error
A sample of the error messages, listed as original character -> corrupted character, followed by the error message:
    . -> , : Table '[database name].id' doesn't exist: SELECT [..] FROM [..] INNER JOIN [..] ON [..] = `[table name]`.`id` ..
    ` -> b : Incorrect table name 'topic_setsb ON ': SELECT [..] FROM [..] INNER JOIN `topic_sets` ON ...
    s -> q : Unknown column 'aqsigned_[redacted]' in 'field list'
    U -> W : You have an error in your SQL syntax; [..] near 'NWLL GROUP BY ...
    8 -> : : You have an error in your SQL syntax; [..] near ':887)' at line 1: SELECT [..] WHERE [..] IN (48846, 48901, 48887)
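Every one of those character pairs differs by exactly 2 in byte value; a quick sanity check in Ruby (the pairs are copied from the samples above):

    # Original character from each sample vs. the corrupted character that replaced it.
    pairs = { '.' => ',', '`' => 'b', 's' => 'q', 'U' => 'W', '8' => ':' }
    pairs.each do |orig, bad|
      printf("%s (0x%02x) -> %s (0x%02x): shift %+d\n", orig, orig.ord, bad, bad.ord, bad.ord - orig.ord)
    end
    # => . (0x2e) -> , (0x2c): shift -2
    #    ` (0x60) -> b (0x62): shift +2
    #    s (0x73) -> q (0x71): shift -2
    #    U (0x55) -> W (0x57): shift +2
    #    8 (0x38) -> : (0x3a): shift +2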
Since MySQL likely reports only the first error in a statement, I couldn't tell the exact number of corruptions per statement. The full SQL query included in the exception message appears to be a copy of the statement held on the client side; there was no corruption in that portion of the message.
In some cases I couldn't tell from the error message which character had been corrupted, e.g. "the right syntax to use near '' at line 1", so not every corruption was necessarily a byte shift: the corrupted bytes might have been unprintable characters, or characters might have been inserted or deleted.
The corruption followed a few loose patterns rather than any clear-cut rule: each of a handful of patterns recurred 10-20 times, with the same statement corrupted in the same way, but the position of the corruption varied from pattern to pattern.
The error log I can obtain from the RDS console is empty. There was no AWS service degradation reported for the time period.
There were a total of 143 errors reported to our exception tracking tool. They originated from 4 Passenger workers: two on one EC2 instance and two on another. The error counts across these workers were 1, 41, 42, and 50. In total that's about 0.001% of the requests served during the 4 hours this happened.
For each worker the errors came in a trickle: each one reported errors for around 5 minutes, with 1-2 errors every few seconds to every 2-3 minutes - aside from the worker that had only 1 error.
Some queries were against the master database and some were against the replica database.
Environment and other facts:
- AWS RDS MySQL 5.6.39
- AWS EC2 instances c4.4xlarge. There were 15+ instances serving the same website at the time.
- Apache 2, mpm module: event
- Passenger 5.3.3, concurrency model: process. Typically there are 30+ worker processes in a single EC2 instance.
- Rails 4.2.10, database pool size: 2.
- MySQL client library: mysql2 0.4.10
- Ruby 2.3.7
Timeline
instance A launch 2018-11-13 12:00 UTC
instance A process P errors 2018-11-13 13:40-13:43 UTC
instance A process Q errors 2018-11-13 14:22-14:26 UTC
instance A shutdown 2018-11-13 20:00 UTC
instance B launch 2018-11-13 15:00 UTC
instance B process R errors 2018-11-13 15:39-15:40 UTC
instance B process S errors 2018-11-13 17:39-17:43 UTC
instance B shutdown 2018-11-13 23:00 UTC
Possible cause
An acquaintance mentioned that they saw the same kind of error in AWS on the same day, and referred me to this talk [1] about network-level data corruption. It explains how network switches recompute a packet's Ethernet CRC when forwarding it, so corruption introduced inside a switch can end up with a "valid" CRC, and how the TCP checksum has a loophole of its own. (Textual notes by kevinchen.co.)
[1] !!Con 2017: Corruption in the Data Center! TCP can fail to keep your data safe! by Evan Jones
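To see how a pair of +/-2 corruptions could slip past the TCP checksum, here is a minimal sketch of the RFC 1071-style ones'-complement checksum in Ruby. It sums only the payload bytes (the real TCP checksum also covers a pseudo-header and the TCP header), and the query strings are made up for illustration, but it shows that two offsetting shifts at like-parity offsets leave the checksum unchanged:

    # RFC 1071-style Internet checksum (the algorithm used by IP/TCP/UDP) over a byte string.
    def internet_checksum(data)
      bytes = data.bytes
      bytes << 0 if bytes.length.odd?                         # pad to a 16-bit boundary
      sum = bytes.each_slice(2).sum { |hi, lo| (hi << 8) | lo }
      sum = (sum & 0xffff) + (sum >> 16) while sum > 0xffff   # fold the carries back in
      ~sum & 0xffff
    end

    original  = "SELECT * FROM grades WHERE id = 48887"   # hypothetical query
    corrupted = "SELECT * FROM erades WHERE id = 48:87"   # 'g' shifted -2, '8' shifted +2, both at even offsets

    internet_checksum(original) == internet_checksum(corrupted)   # => true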
Their recommendation was to use TLS for all communication, since the TLS layer will fail to decrypt a corrupted record rather than silently passing the corruption along.
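For reference, here is a minimal sketch of what that could look like with the mysql2 client we use: turn on TLS and server-certificate verification on the connection (the endpoint, credentials, and CA-bundle path below are placeholders; with Rails the same options can be set in database.yml):

    require "mysql2"

    # Connect over TLS and verify the server certificate, so a corrupted TLS record
    # fails authentication instead of reaching the SQL parser as a garbled statement.
    client = Mysql2::Client.new(
      host:      "mydb.example.rds.amazonaws.com",       # placeholder endpoint
      username:  "app",
      password:  ENV["DB_PASSWORD"],
      database:  "app_production",
      sslca:     "/etc/ssl/rds-combined-ca-bundle.pem",   # RDS CA bundle; path is an example
      sslverify: true
    )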
I also found this page describing the limitations of CRCs and checksums. There's a similar-looking error in this serverfault question, and the cause there was also network-related.
I'm increasingly convinced that the errors were caused by something faulty at the network level.
Question
Does the Ethernet / TCP theory fit the erroneous behavior I described? I'm not sure whether it's plausible for just one process at a time to see these errors, although I suppose it could happen if a switch routes packets differently based on the source/destination port pair, since each connection uses a different source port.
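For what it's worth, per-flow load balancing (ECMP / LAG hashing) works roughly like this: the switch hashes the connection's 5-tuple and pins all of its packets to one of several equal-cost links, so a single faulty link could affect some connections while sparing others from the same host. A toy sketch (the hash function and addresses are made up, not what any particular switch does):

    require "digest"

    # Toy per-flow path selection: hash the 5-tuple and pick one of N equal-cost links.
    LINKS = 4

    def pick_link(src_ip, dst_ip, src_port, dst_port, proto = "tcp")
      key = [src_ip, dst_ip, src_port, dst_port, proto].join("|")
      Digest::MD5.hexdigest(key).to_i(16) % LINKS
    end

    # Two connections from the same app server to the same MySQL endpoint, differing
    # only in the ephemeral source port, can land on different links.
    pick_link("10.0.1.10", "10.0.2.20", 51344, 3306)   # => some link in 0..3
    pick_link("10.0.1.10", "10.0.2.20", 51907, 3306)   # => often a different link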
I'd be happy to reframe the question if this turns out to be an XY problem.
Note: I've posted a request on the AWS forum asking them to confirm / investigate whether this came from AWS's infrastructure. Here on serverfault, I'm interested in the plausibility of the hypothesis and in how it could (or couldn't) produce the behavior I saw.