Set-Up
I've been a programmer for quite some time now but I'm still a bit fuzzy on deep, internal stuff.
Now, I am well aware that it's not a good idea to do either of the following:
- kill -9 a process (bad)
- spontaneously pull the power plug on a running computer or server (worse)
However, sometimes you just plain have to. Sometimes a process just won't respond no matter what you do, and sometimes a computer just won't respond, no matter what you do.
Let's assume a system running Apache 2, MySQL 5, PHP 5, and Python 2.6.5 through mod_wsgi.
Note: I'm most interested in Mac OS X here, but an answer that pertains to any UNIX system would help me out.
My Concern
Each time I have to do either one of these, especially the second, I'm very worried for a period of time that something has been broken. Some file somewhere could be corrupt -- who knows which file? There are over 1,000,000 files on the computer.
I'm often using OS X, so I'll run a "Verify Disk" operation through the Disk Utility. It will report no problems, but I'm still concerned about this.
What if some configuration file somewhere got screwed up? Or even worse, what if a binary file somewhere is corrupt? Or a script file somewhere is corrupt now? What if some hardware is damaged?
What if I don't find out about it until next month, in a critical scenario, when the corruption or damage causes a catastrophe?
Or, what if valuable data is already lost?
My Hope
My hope is that these concerns and worries are unfounded. After all, after doing this many times before, nothing truly bad has happened yet. The worst that's happened is that I've had to repair some MySQL tables, but I don't seem to have lost any data.
But, if my worries are not unfounded, and real damage could happen in either situation 1 or 2, then my hope is that there is a way to detect it and prevent against it.
My Question(s)
Could this be because modern operating systems are designed to ensure that nothing is lost in these scenarios? Could it be because modern software is designed to ensure that nothing is lost? What about modern hardware design? What measures are in place when you pull the power plug?
My question is, for both of these scenarios, what exactly can go wrong, and what steps should be taken to fix it?
I'm under the impression that one thing that can go wrong is that some programs might not have flushed their data to disk, so any data that was supposed to be written very recently (say, in the few seconds before the power pull) might be lost. But what about beyond that? And can this very issue of 5-second data loss screw up a system?
What about corruption of random files hiding somewhere in the huge forest of files on my hard drives?
What about hardware damage?
What Would Help Me Most
Detailed descriptions about what goes on internally when you either kill -9 a process or pull the power on the whole system. (it seems instant, but can someone slow it down for me?)
Explanations of all things that could go wrong in these scenarios, along with (rough of course) probabilities (i.e., this is very unlikely, but this is likely)...
Descriptions of measures in place in modern hardware, operating systems, and software, to prevent damage or corruption when these scenarios occur. (to comfort me)
Instructions for what to do after a kill -9 or a power pull, beyond "verifying the disk", in order to truly make sure nothing is corrupt or damaged somewhere on the drive.
Measures that can be taken to fortify a computer setup so that if something has to be killed or the power has to be pulled, any potential damage is mitigated.
Some information about binary files -- isn't it true that the apache binary file or some library could have a random byte or two corrupted in the middle, that wouldn't come out and cause a problem until later? How can I assure myself that this didn't happen as a result of the power pull or the kill?
Thanks so much!
Pulling the power causes everything to stop in flight, with no warning. kill -9 has the same effect on a single process, forcefully terminating it with a SIGKILL.
If a process is killed by the kernel or by a power outage, it doesn't do any clean-up. That means you could be left with half-written files, inconsistent state, or lost caches. You usually don't have to worry about any of this, thanks to journaling filesystems, exit-status checking, and battery backup.
Temporary files in /tmp will be automatically gone if they live in tmpfs, but you may still have application-specific lock files lying around to remove, like the lock and .parentlock files for Firefox.
Most software is smart enough to retry a transaction if it doesn't record a successful exit status. A good example of this is a typical mail system. If a message is being delivered, but gets cut off in the middle, the sender will retry later until it gets a success.
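As a rough illustration of that retry-until-confirmed pattern, here is a minimal Python sketch (the deliver callable and the retry budget are hypothetical; a real mail system would also persist its queue on disk):

```python
import time

MAX_ATTEMPTS = 5  # hypothetical retry budget

def deliver_with_retry(message, deliver):
    """Keep retrying until delivery is confirmed successful.

    `deliver` is assumed to raise IOError on failure and to return
    only once the receiving side has durably accepted the message,
    so a crash or kill -9 between attempts just means we retry later.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            deliver(message)
            return True                # success recorded; stop retrying
        except IOError:
            time.sleep(2 ** attempt)   # back off before the next try
    return False                       # give up; leave it in the queue
```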
Your filesystem is probably journaled. If you are moving or writing a file and the system dies mid-stream, the journaled filesystem will still reference the original. A journaled filesystem makes changes non-destructively, leaving the old copy in place, and only references the new copy as the last step before reclaiming the space the old copy occupied on disk.
Now if you have a RAID array, it has all kinds of memory buffers to increase performance and provide reliability in a power failure. Most likely your filesystem will not know about the caches in the device and their state, so it thinks a change has been committed to disk, but it is still in the RAID cache somewhere. So what happens when the power dies? Hopefully you have a functional battery in your RAID enclosure and you monitor it. Otherwise you have a corrupt file system to fsck.
Yes, a few bits can become corrupted in a binary, but I would not worry about that much on modern hardware. If you are really paranoid, you can monitor the health of your disks and RAID with the appropriate tools, but you should be doing that anyway. Do regular backups and get an Uninterruptible Power Supply.
In an unexpected shutdown, the only files which should be corrupted are files which are open for writing. On most systems at any given instant in time, you're probably not writing to a file. Probably.
1 - kill -9
This sends POSIX SIGKILL. Exactly how the kernel tears the process down is implementation dependent, but the process that receives this signal will not be given an opportunity to handle it.
1 - Power off
What happens depends on the hardware. The heads auto-park using the drive's remaining momentum, and everything in your write cache loses DRAM refresh and decays into irretrievable corruption within seconds. The same happens to your system memory, CPU cache, registers, etc.
From wdc.com (google: site:wdc.com Protective Head Parking)
Power is lost: Hard drive is reset. Head is parked in the landing zone using spindle energy. Spindle motor stopped.
2 - what can go wrong
Files left open are incompletely written out. If a file is open for writing at the moment of the cut, its data may be corrupted. But file writes on modern hardware are fast, and modern PCs are not normally stressed with IO, so at any given instant you're probably not mid-write. It's like walking blindfolded across a quiet country road: most of the time, you'll be fine.
3 - countermeasures
see above for what disks do.
Look up journaled file systems; they're the norm now: http://en.wikipedia.org/wiki/Journaling_file_system
Software like MS Word or vi will write to a temporary file rather than the original. The objective is to never leave the system in a state where there is no consistent copy on disk.
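A minimal Python sketch of that write-then-swap idea (the paths are hypothetical; on POSIX filesystems os.rename() replaces the destination atomically, so a crash at any point leaves either the old file or the new one, never a torn mix):

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace `path` with `data` without ever exposing a torn file."""
    dirname = os.path.dirname(path) or '.'
    fd, tmp_path = tempfile.mkstemp(dir=dirname)  # same fs as the target
    try:
        os.write(fd, data)
        os.fsync(fd)               # make sure the bytes hit the disk
    finally:
        os.close(fd)
    os.rename(tmp_path, path)      # atomic swap; old copy kept until here

atomic_write('/tmp/settings.conf', b'max_clients = 100\n')  # hypothetical
```

A crash before the rename leaves a stray temp file behind, but the original stays intact, which is exactly the property you want.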
Windows keeps copies of the registry (it's just too important). Wikipedia: "Windows 2000 keeps an alternate copy of the registry hives (.ALT) and attempts to switch to it when corruption is detected." (I haven't done heavy tech support since Win2k, so I'm not sure what MS's newer mechanisms are.)
4 - what to do
In order of difficulty (easy-hard)
Keeping backups is the most appropriate answer; good backups should let you go back to the previously modified version.
5
Redundant power? End-user education? Put tape and cardboard over the power button?
6
Short of hardware malfunctions, corrupted disk drivers, a broken OS kernel, an absence of checksums, or crashes during upgrades, binaries and libraries are not opened read-write, so they don't get corrupted. It happens, but it's rare.
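If you want to reassure yourself anyway, the idea behind integrity checkers like Tripwire is simple: hash everything while the system is known-good, then re-hash and compare after an incident. A minimal Python sketch (the directory is hypothetical):

```python
import hashlib
import os

def sha256_of(path):
    """Hash a file in chunks so large binaries don't fill memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map every file under `root` to its digest: a baseline to diff."""
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[full] = sha256_of(full)
    return manifest

# Record a baseline while things are healthy, compare after a power pull:
# baseline = build_manifest('/usr/sbin')   # hypothetical directory
# after = build_manifest('/usr/sbin')
# changed = [p for p in baseline if after.get(p) != baseline[p]]
```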
As for a kill -9, this sends a signal to the process to "die" right on the spot. The process dies (unless it is in uninterruptible sleep, in which case the kill stays pending until the sleep ends; the dead process then lingers as a zombie only until its parent reaps it). No files are closed, no data is written out, and the program cannot catch this signal and do something else. No cleanup, no nothing: it just dies.
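To make that concrete, here's a small sketch: a process may install a handler for SIGTERM and clean up gracefully, but the OS refuses to let it register a handler for SIGKILL at all:

```python
import signal
import sys

def cleanup(signum, frame):
    # A polite kill (SIGTERM) lands here: flush, close files, exit.
    sys.stderr.write('caught signal %d, cleaning up\n' % signum)
    sys.exit(0)

signal.signal(signal.SIGTERM, cleanup)       # allowed: graceful shutdown

try:
    signal.signal(signal.SIGKILL, cleanup)   # refused: SIGKILL is final
except (RuntimeError, OSError) as e:         # RuntimeError on Python 2
    sys.stderr.write('cannot handle SIGKILL: %s\n' % e)

signal.pause()  # `kill -TERM <pid>` runs cleanup; `kill -9 <pid>` doesn't
```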
File systems today are very robust; things like XFS, JFS, ext3, and ext4 all have journals and other things to keep the filesystem metadata intact.
Binaries like Apache itself are not likely to get corrupted by a sudden loss of power or by a system kill, as they are only ever in memory or being read, never written. If one is being read at the moment of the cut (say, Apache is just starting up), it is possible that a power surge could corrupt the binary on disk, but it seems unlikely.
I have a Mac Mini that people seem to like to shut off cold (no matter how many times I tell them...), and it just keeps on going.
For the most part, as long as you don't rely on kill -9 or pulling the power regularly, I wouldn't worry too much. Things were much worse in the past; I'd worry more about (for instance) Solaris 2.6 than I would about Solaris 10 (and so on).
A "kill -9" won't sync a pending IO operation. This often isn't an issue, but if the system is under heavy IO load, you may lose data.
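If you control the application, the standard defence is to flush and fsync at the points where you can't afford loss. A minimal sketch (the log path is hypothetical):

```python
import os

def durable_append(path, line):
    """Append a line and force it to stable storage before returning."""
    with open(path, 'a') as f:
        f.write(line + '\n')
        f.flush()              # drain the userspace buffer; after this,
                               # a kill -9 can no longer eat the line
        os.fsync(f.fileno())   # push the kernel's page cache to disk,
                               # guarding against a power pull as well

durable_append('/tmp/app.log', 'transaction 42 committed')  # hypothetical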
Lost writes are more of a problem with servers, where a RAID controller without a battery-backed cache may cache writes and lose your data.
Edit: One more thing... if you depend on network-mounted drives and have open file handles, you are very likely to leave files inconsistent or corrupted. On Windows, the classic example is users who mount Outlook PST files on a share and lose power or network connectivity.