We've had it a few times now. Suddenly our production server won't respond because a process is in an infinite loop, or the MySQL server stops serving new requests because one query is blocking everything...
We SSH to the server and use ps aux or top to find the culprit, or mytop or SHOW FULL PROCESSLIST in MySQL to find the offending process ID and kill it. Then of course we try to recreate the situation on the test server and fix the bug.
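For reference, this is roughly what that workflow looks like (the PID and MySQL thread ID below are made up):

    # find the most CPU-hungry processes, then kill the culprit
    ps aux --sort=-%cpu | head -n 5
    kill 12345          # polite SIGTERM first
    kill -9 12345       # SIGKILL if it won't listen

    # same idea in MySQL: list running queries, then kill the offending thread ID
    mysql -e 'SHOW FULL PROCESSLIST'
    mysql -e 'KILL 67890'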
But sometimes the server is hung so badly that ps aux / top / mytop / SHOW FULL PROCESSLIST won't go through - even the admins are blocked.
What is the best way to ensure an admin can always access the server and kill offending processes or queries (both on Linux and MySQL)?
- Can we allocate priorities to different users?
- Reserve a part of the resources for root?
I've looked at nice(1), but keeping a connection open at nice -20 all the time seems a bit excessive and hard to work with (not to mention dangerous as root).
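For context, this is the kind of thing I mean (just a sketch):

    # renice the current shell (and anything it spawns) to the highest priority
    sudo renice -n -20 -p $$

    # or keep a dedicated high-priority shell open "just in case"
    sudo nice -n -20 bash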
The Magic SysRq key lets you talk to the kernel directly from the console (sync discs, kill processes, reboot) even when userspace is unresponsive: http://en.wikipedia.org/wiki/Magic_SysRq_key
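A rough sketch of using it, assuming the kernel was built with SysRq support (CONFIG_MAGIC_SYSRQ), run as root:

    # make sure SysRq is enabled (1 = all functions allowed)
    sysctl -w kernel.sysrq=1

    # on the console: Alt+SysRq+<key>; the same keys can be sent via /proc:
    echo f > /proc/sysrq-trigger   # invoke the OOM killer on the biggest memory hog
    echo s > /proc/sysrq-trigger   # sync discs
    echo b > /proc/sysrq-trigger   # immediate reboot (no sync/unmount)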
We use Dell servers that have a remote access network card (DRAC) installed that allows us to access the server out of band via ssh or a web browser. We can get to a console screen, or power cycle the server. Most major server vendors support some similar device.
This doesn't help if you need to log into a server that has zero resources left to allow a login. Short of reserving resources for logins, though, it's the next best thing to physical access to the server.
It sounds like you have issues surrounding problem applications. Why do you have apps that are going into infinite loops and MySQL queries that are exhausting your server resources?
monit or god can preempt runaway processes by setting limits. If you're running virtual iron rather than bare metal, assign all but one core to your workload and keep one core free for console access. KVM over IP can sometimes let you enter a key combination on the console. If network activity is the problem, shut off eth0 until things calm down and connect on eth1.
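A minimal sketch of the core-reservation idea, assuming a four-core box and that mysqld is the single process you want to fence in:

    # pin all of mysqld's threads to cores 1-3, keeping core 0 free (as root)
    taskset -acp 1-3 "$(pidof mysqld)"

    # run the emergency/admin shell on the reserved core
    taskset -c 0 bash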
The pam_limits.so module is a nifty tool for limiting memory, open files and so on, and for setting the nice priority for users and groups.
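A sketch of what that could look like; the group name and numbers below are purely illustrative, and pam_limits.so has to be enabled in the relevant /etc/pam.d service files for the limits to apply:

    sudo tee -a /etc/security/limits.conf > /dev/null <<'EOF'
    # limits for a (hypothetical) developers group
    @developers  hard  nproc     200
    # address space limit, in KB
    @developers  hard  as        2097152
    # run their processes at nice 5, and root's a little higher
    @developers  -     priority  5
    root         -     priority  -5
    EOF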
Maybe SLURM could be the answer. It's a resource manager with QoS support for Linux-based cluster systems.