I have a variety of Ubuntu machines on EC2 running in production, with about 30 that were upgraded from 15.04 to 15.10. With most of the machines, the upgrade went flawlessly and experienced no issues at all.
However, 10 of my webservers have started crashing immediately following the 15.10 upgrade. As far as what exactly defines a "crash", Instance Status Checks fail, and I can no longer SSH to the machine. Background daemons running on the system stop responding, and nothing is written to the logs. The most recent log entries I see on one machine show:
/var/log/syslog:Dec 18 00:28:58 xxx-web-4a dhclient: DHCPREQUEST of 10.xxx.xxx.104 on eth0 to 10.xxx.xxx.1 port 67 (xid=0x616a091d)
/var/log/syslog:Dec 18 00:28:58 xxx-web-4a dhclient: DHCPACK of 10.xxx.xxx.104 from 10.xxx.xxx.1
/var/log/syslog:Dec 18 00:28:58 xxx-web-4a dhclient: bound to 10.xxx.xxx.104 -- renewal in 1640 seconds.
But my Instance Status Checks didn't begin failing until 00:32:00
(when the first of several checks failed to respond). There is absolutely nothing in the logs during the period following the entries above.
Now, like I said, my ~20 other 15.10 instances have never crashed in the over 6 weeks since their upgrade, only this set of webservers, and they're all crashing. So, what's different about these machines? Only two things, really.
- They're my highest-traffic 15.10 instances, sending and receiving about 5-10Mb/sec on average, peaking to a bit over 30-40 on occasion.
- They're my only instances of type
c4.xlarge
orm4.xlarge
. Originally, they were allc4.xlarge
, but I replaced them withm4.xlarge
to try to isolate the problem. It seems to be less frequent with them4.xlarge
, but I've still seen 3 or 4 or so crashes a day between the 10 webservers. Generally, I'm seeing each instance crash at least once a day, at seemingly random times.
These instances are running Apache 2.4.x, mod_php 5.6.11, and memcached 1.4.24, but I have other machines receiving less traffic on a smaller instance type that are perfectly stable.
Not sure if related, but all of these machines are seeing periodic ifquery
segfaults, for example:
/var/log/syslog:Dec 17 14:02:05 xxx-web-3a kernel: [ 22.592488] ifquery[476]: segfault at 1 ip 0000000000403187 sp 00007ffde8596050 error 4 in ifup[400000+d000]
/var/log/syslog:Dec 17 14:02:05 xxx-web-3a kernel: [ 23.593774] ifquery[510]: segfault at 1 ip 0000000000403187 sp 00007ffde6087b90 error 4 in ifup[400000+d000]
/var/log/syslog:Dec 17 14:02:05 xxx-web-3a kernel: [ 24.594994] ifquery[531]: segfault at 1 ip 0000000000403187 sp 00007ffe70747a50 error 4 in ifup[400000+d000]
/var/log/syslog:Dec 17 14:04:12 xxx-web-3a kernel: [ 2.623024] ifquery[367]: segfault at 1 ip 0000000000403187 sp 00007ffefc980f60 error 4 in ifup[400000+d000]
One system, prior to the c4.xlarge
--> m4.xlarge
migration, saw a General Protection Fault
logged a single time in the system console log, but I have not seen this again.
I'm not seeing these segfaults on my other 15.10 machines which are not crashing.
These are all Enhanced Networking instances with Intel 82599 10G Ethernet, which I slightly suspect may contribute to the issue, but, I have other (much-lower-traffic) machines with the same adapter running 15.10 without ever crashing.
Is anyone seeing similar problems, or have any ideas for debugging or fixing this? Thanks!
Edit
Looking at the Console Log, one of my frequently-crashing systems reported a General Protection Fault right before rebooting:
[171009.844097] general protection fault: 0000 [#1] [ 0.000000] Initializing cgroup subsys cpuset
0 Answers