I recently wrote a shell script that crashed a server (and damaged a partition) by consuming all of its resources. It was hooked up to a cron job, and it seems each run took longer than the interval between executions, snowballing out of control over time.
I've since modified it to record its running state so that no more than one instance runs at a time. My question is: are there other simple ways to safeguard a script against causing harm? Is there a standard list of things a script should do to behave properly: not consume too many resources, fail gracefully, alert the right people, and so on?
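For reference, this is roughly the guard I added, as a simplified sketch using flock(1); the lock file path is just a placeholder and the real script records more state:

    #!/bin/bash
    # Abort if another copy of this script already holds the lock.
    # /var/lock/myscript.lock is a placeholder path.
    exec 200>/var/lock/myscript.lock
    if ! flock -n 200; then
        echo "$(date '+%F %T') another instance is still running, exiting" >&2
        exit 1
    fi

    # ... actual work goes here ...

    # The lock is released automatically when the script exits
    # and file descriptor 200 is closed.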
Basically: what other pitfalls should I avoid?
Computers do exactly what they are told. The only way to ensure that a script "behaves properly" is to write it so that it will behave properly, under all scenarios.
Some basic advice:
The fact that your system blew up without you knowing it was coming tells me you either do not have a monitoring system, or your current system isn't good enough.
Invest some time in making sure that your servers tell you that there's a problem before they fall over.
Your script stepped on its own tail. That shouldn't happen.
You've learned the hard way that you need to guard against this sort of thing (and have the system notify you if it happens).
Carefully evaluate every script you are going to deploy to make sure it won't produce undesirable side effects. If you can imagine a failure scenario, test for it (and handle it properly!).
Take the time to simulate failures, either by hard-coding the failure condition to true in your script, or by generating the circumstances that exercise your detection logic.
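One way to make that kind of drill painless is to build the failure path into the script and expose a switch for it. A rough sketch, where the mail command, the address, and the SIMULATE_FAILURE variable are just placeholders for whatever alerting you actually use:

    #!/bin/bash
    set -euo pipefail

    # Hypothetical alert address; replace with your team's real alerting hook.
    ALERT_EMAIL="ops@example.com"

    fail() {
        echo "$(date '+%F %T') FAILURE: $1" >&2
        # mail(1) is one option; any notification mechanism works here.
        echo "$1" | mail -s "cron script failed on $(hostname)" "$ALERT_EMAIL"
        exit 1
    }

    # Flip this to 1 to test that the detection and alerting path really works.
    SIMULATE_FAILURE=${SIMULATE_FAILURE:-0}

    if [ "$SIMULATE_FAILURE" -eq 1 ]; then
        fail "simulated failure (triggered deliberately for testing)"
    fi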
The safeguards you are talking about depend on what your script is doing. For example, it is better to back up an important file before modifying it automatically: if the script fails partway through and corrupts that file, you are safe because you have a backup to restore from.
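For instance, a simple timestamped copy before touching the file, with a basic sanity check afterwards; the paths here are made up:

    #!/bin/bash
    set -euo pipefail

    CONFIG=/etc/myapp/important.conf          # placeholder path
    BACKUP="${CONFIG}.$(date +%Y%m%d-%H%M%S).bak"

    # Keep a copy we can restore from if the edit goes wrong.
    cp -p "$CONFIG" "$BACKUP"

    # ... modify "$CONFIG" here ...

    # Basic sanity check: restore the backup if the result is empty.
    if [ ! -s "$CONFIG" ]; then
        echo "modification produced an empty file, restoring backup" >&2
        cp -p "$BACKUP" "$CONFIG"
        exit 1
    fi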
One important thing to mention is logging, logging, and logging. If your script runs in the background without a log file showing its progress and what it is doing, you will have no idea about potential problems now or later. Don't forget to include a timestamp in each log entry, and enable the NTP service so you know exactly when something happened.
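A minimal pattern for that, assuming a log location of your choosing (the path and timestamp format below are just examples):

    #!/bin/bash
    LOGFILE=/var/log/myscript.log   # placeholder location

    log() {
        # Timestamp on every entry; assumes the clock is kept sane by NTP.
        echo "$(date '+%Y-%m-%d %H:%M:%S') $*" >> "$LOGFILE"
    }

    log "starting run (pid $$)"
    # ... do work, calling log at each significant step ...
    log "finished run"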
In the end, we now run the script inside a VM. That vastly limits the scope of damage that can be caused.
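If a full VM is overkill, a lighter-weight complement is to cap the resources the script may consume with standard tools; a rough sketch, where the limits and the job path are arbitrary examples:

    #!/bin/bash
    # Cap CPU time (seconds) and virtual memory (KB) for this shell and its children.
    ulimit -t 300          # at most 5 minutes of CPU time
    ulimit -v 524288       # at most ~512 MB of virtual memory

    # Run the real job at low CPU/IO priority and kill it if it overruns wall-clock time.
    nice -n 19 ionice -c3 timeout 10m /usr/local/bin/real-job.sh   # placeholder path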
The frightening thing about Linux (for me, at least) is that minor typos or bugs can have devastating effects. Even something like running a command with a ${VARIABLE} can have a totally different (and destructive) meaning if that variable is blank, or contains a space.
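The classic case is a cleanup command where the variable turns out to be empty. A couple of defensive idioms, using a made-up directory name:

    #!/bin/bash
    # set -u makes the script die on unset variables instead of silently
    # expanding them to the empty string.
    set -u

    BUILD_DIR=/srv/builds/current    # placeholder

    # If BUILD_DIR were empty, an unguarded rm -rf "$BUILD_DIR/old" would quietly
    # become rm -rf /old. ${VAR:?} aborts with an error if the variable is unset
    # or empty, so this expansion can never collapse that way.
    rm -rf "${BUILD_DIR:?}/old"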