SnapOverflow


Questions tagged [server-crashes]

smartenbergen
Asked: 2016-12-24 07:10:25 +0800 CST

Server freezes without kernel panic

  • 6

We are running a KVM node that crashes at irregular intervals and shows very strange behaviour. Interestingly, we already had this problem with another node, which crashed every 1-2 weeks. As we could not find a hardware issue, we began migrating the VMs to a new node. About one week after we had migrated 50% of the VMs, the new node crashed, while the "old" one has been running fine since then (uptime of 3 weeks; we have not seen such a long uptime for months).

When a node crashes, we sometimes see these strange things on the Supermicro IPMI:

[two screenshots of the IPMI console; images not preserved]

We also saw:

  • "No signal" like the server has been powered off (of course it was not, and it was also never shown as powered off on the IPMI main page)
  • The normal login screen or other normal output from the server, but frozen

What we never saw was a kernel panic, or even any messages in the logs before the crash. There is complete silence until suddenly the lights go out.

As the problem "moved" from one server to another (a brand-new machine), there are only a few options left in my opinion:

  • A specific VM is causing the issue
  • Kernel bug
  • Hardware issue regarding our setup

More information about the machines:

  • CentOS 7 with latest kernel (3.10.0-514.2.2.el7.x86_64)
  • Supermicro Case with redundant power supplies
  • Supermicro X10DRi / X10DRWi with latest BIOS version
  • Intel Xeon E5-2630 v3 / v4
  • 512 GB DDR4 ECC RAM (Samsung Server RAM)
  • 145 VMs running (RAM and CPU far away from being saturated, also thanks to KSM)
  • Software RAID-10 with 8 / 16 SSDs

Has anyone seen this behaviour, or can anyone say something about the strange "messages" on the console? I have never seen anything like this and do not even know how to describe it for a Google search. At the moment we have no good idea of what to do next, as it could be anything.

Thanks in advance!

hardware kvm-virtualization kernel server-crashes supermicro
  • 2 Answers
  • 2232 Views
kwb
Asked: 2016-07-13 12:35:18 +0800 CST

How can you distinguish between a crash and a reboot on RHEL7?

  • 11

Is there a way to determine whether a RHEL7 server was rebooted via systemctl (or reboot / shutdown aliases), or whether the server crashed? Pre-systemd this was fairly easy to determine with last -x runlevel, but with RHEL7 it's not so clear.
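One heuristic that is sometimes used for this can be sketched as follows: in `last -x` output (newest record first), a `reboot` record whose next-older record is a `shutdown` suggests a clean shutdown, while two `reboot` records in a row suggest the machine went down without logging one. This is only a sketch; whether RHEL7 writes the `shutdown` record reliably is an assumption to verify on the system in question, and the function name and the sample text are made up for illustration.

```python
def classify_boots(last_x_output):
    """Label each 'reboot' record in `last -x` output, newest first.

    Returns a list of "clean", "unclean" or "unknown" strings.
    """
    # Keep only the record types that matter, in printed (newest-first) order.
    kinds = []
    for line in last_x_output.splitlines():
        fields = line.split()
        if fields and fields[0] in ("reboot", "shutdown"):
            kinds.append(fields[0])

    verdicts = []
    for i, kind in enumerate(kinds):
        if kind != "reboot":
            continue
        if i + 1 >= len(kinds):
            verdicts.append("unknown")   # oldest record, nothing precedes it
        elif kinds[i + 1] == "shutdown":
            verdicts.append("clean")     # a shutdown was logged before this boot
        else:
            verdicts.append("unclean")   # previous boot ended without a shutdown record
    return verdicts


# Made-up sample in the general shape of `last -x` output (not real data).
SAMPLE = """\
reboot   system boot  3.10.0-327.el7.x Wed Jul 13 10:00 - 12:00  (02:00)
shutdown system down  3.10.0-327.el7.x Wed Jul 13 09:58 - 10:00  (00:01)
reboot   system boot  3.10.0-327.el7.x Tue Jul 12 08:00 - 09:58 (1+01:58)
reboot   system boot  3.10.0-327.el7.x Mon Jul 11 08:00 - 08:00  (00:00)
"""

if __name__ == "__main__":
    print(classify_boots(SAMPLE))
```

On a real host the input would come from running `last -x` and passing its text to the function; the verdicts are hints, not proof of a crash.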

server-crashes systemd rhel7 system-monitoring
  • 4 Answers
  • 8216 Views
Bron Gondwana
Asked: 2012-07-01 08:15:09 +0800 CST

Anyone else experiencing high rates of Linux server crashes during a leap second day?

  • 363
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.

NOTE: if your server still has issues due to confused kernels, and you can't reboot, the simplest solution proposed, with GNU date installed on your system, is: date -s now. This will reset the kernel's internal "time_was_set" variable and fix the CPU-hogging futex loops in Java and other userspace tools. I have straced this command on my own system and confirmed it's doing what it says on the tin.

POSTMORTEM

Anticlimax: only thing that died was my VPN (openvpn) link to the cluster, so there was an exciting few seconds while it re-established. Everything else was fine, and starting up ntp went cleanly after the leap second had passed.

I have written up my full experience of the day at http://blog.fastmail.fm/2012/07/03/a-story-of-leaping-seconds/

If you look at Marco's blog at http://my.opera.com/marcomarongiu/blog/2012/06/01/an-humble-attempt-to-work-around-the-leap-second - he has a solution for phasing the time change over 24 hours using ntpd -x to avoid the 1 second skip. This is an alternative smearing method to running your own ntp infrastructure.


Just today, Sat June 30th, 2012, starting soon after the start of the day GMT, we've had a handful of servers in different datacentres, managed by different teams, all go dark: not responding to pings, screen blank.

They're all running Debian Squeeze - with everything from stock kernel to custom 3.2.21 builds. Most are Dell M610 blades, but I've also just lost a Dell R510 and other departments have lost machines from other vendors too. There was also an older IBM x3550 which crashed and which I thought might be unrelated, but now I'm wondering.

The one crash which I did get a screen dump from said:

[3161000.864001] BUG: spinlock lockup on CPU#1, ntpd/3358
[3161000.864001]  lock: ffff88083fc0d740, .magic: dead4ead, .owner: imapd/24737, .owner_cpu: 0

Unfortunately the blades all supposedly had kdump configured, but they died so hard that kdump didn't trigger - and they had console blanking turned on. I've disabled console blanking now, so fingers crossed I'll have more information after the next crash.

Just want to know if it's a common thread or "just us". It's really odd that they're different units in different datacentres bought at different times and run by different admins (I run the FastMail.FM ones)... and now even different vendor hardware. Most of the machines which crashed had been up for weeks/months and were running 3.1 or 3.2 series kernels.

The most recent crash was a machine which had only been up about 6 hours running 3.2.21.

THE WORKAROUND

Ok people, here's how I worked around it.

  1. disabled ntp: /etc/init.d/ntp stop
  2. created http://linux.brong.fastmail.fm/2012-06-30/fixtime.pl (code stolen from Marco, see blog posts in comments)
  3. ran fixtime.pl without an argument to see that there was a leap second set
  4. ran fixtime.pl with an argument to remove the leap second

NOTE: depends on adjtimex. I've put a copy of the squeeze adjtimex binary at http://linux.brong.fastmail.fm/2012-06-30/adjtimex — it will run without dependencies on a squeeze 64 bit system. If you put it in the same directory as fixtime.pl, it will be used if the system one isn't present. Obviously if you don't have squeeze 64-bit… find your own.
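The contents of fixtime.pl are not reproduced here, so the following is not that script. It is only a small sketch of the flag arithmetic behind "see that there was a leap second set" and "remove the leap second", assuming the kernel clock status word is already available as an integer (for example, as reported by an adjtimex tool). The bit values are the STA_INS and STA_DEL constants from the Linux timex header; the function names are hypothetical.

```python
# Status bits as defined in the Linux <linux/timex.h> header.
STA_INS = 0x0010  # a leap second will be inserted at the end of the UTC day
STA_DEL = 0x0020  # a leap second will be deleted at the end of the UTC day


def pending_leap(status):
    """Return "insert", "delete" or None for a kernel clock status word."""
    if status & STA_INS:
        return "insert"
    if status & STA_DEL:
        return "delete"
    return None


def clear_leap(status):
    """Return the status word with both leap-second bits cleared."""
    return status & ~(STA_INS | STA_DEL)


if __name__ == "__main__":
    status = 0x0011  # illustrative value: PLL bit plus a pending insert
    print(pending_leap(status))
    print(hex(clear_leap(status)))
```

Writing the cleared value back to the kernel is the part that needs adjtimex (or setting the time, as suggested further down) and is outside this sketch.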

I'm going to start ntp again tomorrow.

As an anonymous user suggested - an alternative to running adjtimex is to just set the time yourself, which will presumably also clear the leapsecond counter.

linux debian ntp server-crashes leapsecond
  • 5 Answers
  • 152304 Views
Ehsan
Asked: 2011-09-23 17:48:58 +0800 CST

Memory and swap are full, can't SSH; any option other than a physical restart?

  • 7

By mistake, I ran some applications that used all the memory (and, I think, swap) on my Ubuntu server, and it has now crashed: SSH doesn't work and freezes. Do you know of any options other than the following?

  1. physically restart the server.
  2. wait until a process ends.

Is there any way to remotely restart the server when SSH is not working? I can still ping the server, so I am wondering whether any memory is reserved for killing unfriendly processes or for basic commands such as restarting the OS.

*The commands were executed with "nohup", so they didn't end when the SSH sessions were closed.

ubuntu ssh outofmemoryerror server-crashes
  • 2 Answers
  • 5559 Views
AppleGrew
Asked: 2011-07-30 23:46:51 +0800 CST

What is wrong in my php-fpm configuration?

  • 8

I have a 64-bit server but only 256MB of RAM. So, I moved to nginx with FastCGI to connect to PHP. I have PHP 5.3.6 running.

The issue is that every two or three days, when I try to access any PHP page, I get an internal server error. The only workaround is to restart php-fpm manually. This suggests I have set some parameters wrongly, which is causing it to choke. Below I have listed the relevant configs.

/etc/php-fpm.conf :-

include=/etc/php-fpm.d/*.conf
log_level = error
;emergency_restart_threshold = 0
;emergency_restart_interval = 0
;process_control_timeout = 0

/etc/php-fpm.d/www.conf :-

[www]
pm = dynamic
pm.max_children = 10
pm.start_servers = 3
pm.min_spare_servers = 2
pm.max_spare_servers = 5
pm.max_requests = 500

/etc/nginx/php.conf :-

location ~ \.php {
        fastcgi_param  QUERY_STRING       $query_string;
        fastcgi_param  REQUEST_METHOD     $request_method;
        fastcgi_param  CONTENT_TYPE       $content_type;
        fastcgi_param  CONTENT_LENGTH     $content_length;

        fastcgi_param  SCRIPT_NAME        $fastcgi_script_name;
        fastcgi_param  SCRIPT_FILENAME    $document_root$fastcgi_script_name;
        fastcgi_param  REQUEST_URI        $request_uri;
        fastcgi_param  DOCUMENT_URI       $document_uri;
        fastcgi_param  DOCUMENT_ROOT      $document_root;
        fastcgi_param  SERVER_PROTOCOL    $server_protocol;

        fastcgi_param  GATEWAY_INTERFACE  CGI/1.1;
        fastcgi_param  SERVER_SOFTWARE    nginx;

        fastcgi_param  REMOTE_ADDR        $remote_addr;
        fastcgi_param  REMOTE_PORT        $remote_port;
        fastcgi_param  SERVER_ADDR        $server_addr;
        fastcgi_param  SERVER_PORT        $server_port;
        fastcgi_param  SERVER_NAME        $server_name;

        fastcgi_pass unix:---some-location---;
}

Update 1

And I have four nginx processes running. On average, each php-fpm process takes 35MB of RAM (virtual memory size 320MB each). I also have a MySQL process running.
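As a rough back-of-the-envelope check of these numbers (a sketch, not a diagnosis): with pm.max_children = 10 and about 35MB per php-fpm process, the pool alone could need more memory than the 256MB the server has. The 100MB reserved for nginx, MySQL and the OS below is a made-up figure for illustration, and the function names are hypothetical.

```python
def worst_case_mb(max_children, per_child_mb):
    """Memory the pool could use if every allowed child is running."""
    return max_children * per_child_mb


def children_that_fit(total_ram_mb, reserved_mb, per_child_mb):
    """How many children fit in RAM after setting aside reserved_mb."""
    return max(0, (total_ram_mb - reserved_mb) // per_child_mb)


if __name__ == "__main__":
    total_ram_mb = 256   # from the question
    per_child_mb = 35    # from the question (average resident size)
    max_children = 10    # pm.max_children in www.conf
    reserved_mb = 100    # assumption: nginx + MySQL + OS, not from the question

    print(worst_case_mb(max_children, per_child_mb))                   # 350
    print(children_that_fit(total_ram_mb, reserved_mb, per_child_mb))  # 4
```

Under that assumed reserve, only about four children would fit without swapping, which is well below the configured limit of ten.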

Update 2

I forgot to paste the logs.

php-fpm error log :-

WARNING: [pool www] seems busy (you may need to increase start_servers, or min/max_spare_servers), spawning 8 children, there are 1 idle, and 7 total children
WARNING: [pool www] server reached max_children setting (10), consider raising it
NOTICE: Terminating ...

php-fpm www.error log :-

PHP Fatal error:  Allowed memory size of 33554432 bytes exhausted (tried to allocate 122880 bytes) in /home/webadmin/blog.applegrew.com/html/wordpress/wp-content/plugins/jetpack/class.jetpack-signature.php on line 137
PHP Fatal error:  Allowed memory size of 33554432 bytes exhausted (tried to allocate 122880 bytes) in /home/webadmin/blog.applegrew.com/html/wordpress/wp-content/plugins/jetpack/class.jetpack-signature.php on line 137
PHP Fatal error:  Allowed memory size of 33554432 bytes exhausted (tried to allocate 122880 bytes) in /home/webadmin/blog.applegrew.com/html/wordpress/wp-content/plugins/jetpack/class.jetpack-signature.php on line 137
fastcgi nginx php-fpm server-crashes
  • 3 Answers
  • 36679 Views

© 2022 SOF-TR. All Rights Reserved