Will

Asked: 2016-01-04 15:31:45 +0800 CST

After Upgrading to Ubuntu 15.10 from 15.04, EC2 Webservers are Crashing

I have a variety of Ubuntu machines on EC2 running in production, about 30 of which were upgraded from 15.04 to 15.10. On most of the machines, the upgrade went flawlessly and I have had no issues at all.

However, 10 of my webservers have started crashing immediately following the 15.10 upgrade. By "crash" I mean that Instance Status Checks fail and I can no longer SSH to the machine. Background daemons running on the system stop responding, and nothing is written to the logs. The most recent log entries I see on one machine show:

/var/log/syslog:Dec 18 00:28:58 xxx-web-4a dhclient: DHCPREQUEST of 10.xxx.xxx.104 on eth0 to 10.xxx.xxx.1 port 67 (xid=0x616a091d)
/var/log/syslog:Dec 18 00:28:58 xxx-web-4a dhclient: DHCPACK of 10.xxx.xxx.104 from 10.xxx.xxx.1
/var/log/syslog:Dec 18 00:28:58 xxx-web-4a dhclient: bound to 10.xxx.xxx.104 -- renewal in 1640 seconds.

But my Instance Status Checks didn't begin failing until 00:32:00 (when the first of several checks failed to respond). There is absolutely nothing in the logs during the period following the entries above.

Now, as I said, my ~20 other 15.10 instances have not crashed once in the more than 6 weeks since their upgrade. Only this set of webservers is affected, and every one of them is crashing. So, what's different about these machines? Only two things, really.

They're my highest-traffic 15.10 instances, sending and receiving about 5-10Mb/sec on average, occasionally peaking at a bit over 30-40.
They're my only instances of type c4.xlarge or m4.xlarge. Originally, they were all c4.xlarge, but I replaced them with m4.xlarge to try to isolate the problem. It seems to be less frequent with the m4.xlarge, but I'm still seeing 3 or 4 crashes a day across the 10 webservers. Generally, I'm seeing each instance crash at least once a day, at seemingly random times.

These instances are running Apache 2.4.x, mod_php 5.6.11, and memcached 1.4.24, but I have other machines receiving less traffic on a smaller instance type that are perfectly stable.

Not sure if related, but all of these machines are seeing periodic ifquery segfaults, for example:

/var/log/syslog:Dec 17 14:02:05 xxx-web-3a kernel: [   22.592488] ifquery[476]: segfault at 1 ip 0000000000403187 sp 00007ffde8596050 error 4 in ifup[400000+d000]
/var/log/syslog:Dec 17 14:02:05 xxx-web-3a kernel: [   23.593774] ifquery[510]: segfault at 1 ip 0000000000403187 sp 00007ffde6087b90 error 4 in ifup[400000+d000]
/var/log/syslog:Dec 17 14:02:05 xxx-web-3a kernel: [   24.594994] ifquery[531]: segfault at 1 ip 0000000000403187 sp 00007ffe70747a50 error 4 in ifup[400000+d000]
/var/log/syslog:Dec 17 14:04:12 xxx-web-3a kernel: [    2.623024] ifquery[367]: segfault at 1 ip 0000000000403187 sp 00007ffefc980f60 error 4 in ifup[400000+d000]

One system, prior to the c4.xlarge --> m4.xlarge migration, saw a General Protection Fault logged a single time in the system console log, but I have not seen this again.

I'm not seeing these segfaults on my other 15.10 machines which are not crashing.
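To compare machines more systematically, the segfault lines can be tallied per host. Below is a minimal Python sketch, assuming the grep-style `filename:` prefix and syslog layout shown in my excerpts above; the names `SEGFAULT_RE`, `parse_segfaults`, `tally_by_host`, and `SAMPLE` are made up for illustration. The bracketed kernel timestamp is seconds since boot, so it also shows how soon after startup each segfault happened.

```python
import re
from collections import Counter

# Hypothetical helper pattern: matches kernel segfault lines in the
# grep-style syslog format shown in the excerpts above.
SEGFAULT_RE = re.compile(
    r"^(?:[^:\s]+:)?"                          # optional "filename:" prefix added by grep
    r"(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+"    # syslog timestamp, e.g. "Dec 17 14:02:05"
    r"(?P<host>\S+)\s+kernel:\s+"              # hostname, then the kernel tag
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+"          # kernel timestamp: seconds since boot
    r"(?P<proc>[^\[\s]+)\[(?P<pid>\d+)\]:\s+segfault\b"
)

# Sample input copied from the log excerpts in this question, plus one
# unrelated dhclient line that should be ignored.
SAMPLE = """\
/var/log/syslog:Dec 17 14:02:05 xxx-web-3a kernel: [   22.592488] ifquery[476]: segfault at 1 ip 0000000000403187 sp 00007ffde8596050 error 4 in ifup[400000+d000]
/var/log/syslog:Dec 17 14:02:05 xxx-web-3a kernel: [   23.593774] ifquery[510]: segfault at 1 ip 0000000000403187 sp 00007ffde6087b90 error 4 in ifup[400000+d000]
/var/log/syslog:Dec 17 14:02:05 xxx-web-3a kernel: [   24.594994] ifquery[531]: segfault at 1 ip 0000000000403187 sp 00007ffe70747a50 error 4 in ifup[400000+d000]
/var/log/syslog:Dec 17 14:04:12 xxx-web-3a kernel: [    2.623024] ifquery[367]: segfault at 1 ip 0000000000403187 sp 00007ffefc980f60 error 4 in ifup[400000+d000]
/var/log/syslog:Dec 18 00:28:58 xxx-web-4a dhclient: DHCPACK of 10.xxx.xxx.104 from 10.xxx.xxx.1
"""


def parse_segfaults(lines):
    """Yield (host, process, uptime_seconds) for every segfault line."""
    for line in lines:
        match = SEGFAULT_RE.match(line.strip())
        if match:
            yield match.group("host"), match.group("proc"), float(match.group("uptime"))


def tally_by_host(lines):
    """Count segfault lines per (host, process) pair."""
    return Counter((host, proc) for host, proc, _ in parse_segfaults(lines))


if __name__ == "__main__":
    for (host, proc), count in tally_by_host(SAMPLE.splitlines()).items():
        print(f"{host}: {count} segfault(s) in {proc}")
```

In practice the same functions could be fed the output of a grep for "segfault" over each machine's syslog instead of the inline sample.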

These are all Enhanced Networking instances with Intel 82599 10G Ethernet, which I slightly suspect may contribute to the issue. However, I have other (much-lower-traffic) machines with the same adapter running 15.10 that have never crashed.

Is anyone seeing similar problems, or does anyone have ideas for debugging or fixing this? Thanks!

Edit

Looking at the Console Log, one of my frequently-crashing systems reported a General Protection Fault right before rebooting:

[171009.844097] general protection fault: 0000 [#1]
[ 0.000000] Initializing cgroup subsys cpuset


0 Answers