Using two Debian servers, I need to set up a strong failover environment for cron jobs that may only run on one server at a time.
Moving a file in /etc/cron.d should do the trick, but is there a simple HA solution to orchestrate such an action? And if possible, not with Heartbeat ;)
I think Heartbeat/Pacemaker would be the best solution, since they can take care of a lot of race conditions, fencing, etc. for you in order to ensure the job only runs on one host at a time. It's possible to design something yourself, but it likely won't account for all the scenarios those packages do, and you'll eventually end up reinventing most, if not all, of the wheel.
If you don't really care about such things and want a simpler setup, I suggest staggering the cron jobs on the servers by a few minutes. Then, when the job starts on the primary, it can somehow leave a marker on whatever shared resource the jobs operate on (you don't specify this, so I'm being intentionally vague). If it's a database, the job can update a field in a table; if it's a shared filesystem, it can lock a file.
When the job runs on the second server, it can check for the presence of the marker and abort if it is there.
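As a rough illustration, assuming the shared resource is a filesystem mounted at /shared on both servers (the path and job name are my own placeholders):

    #!/bin/bash
    # nightly-job.sh -- run from cron on both servers, staggered a few minutes apart.
    # /shared is assumed to be the common resource (e.g. an NFS mount).
    MARKER="/shared/nightly-job.ran.$(date +%F)"   # one marker per day

    # The later-scheduled server sees the primary's marker and aborts quietly.
    if [ -e "$MARKER" ]; then
        exit 0
    fi
    touch "$MARKER"

    # ... the actual work goes here ...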
We use two approaches depending on the requirements. Both involve having the crons present and running on all machines, but with a bit of sanity checking:
If the machines are in a primary/secondary relationship (there may be more than one secondary), then the scripts are modified to check whether the machine they are running on is in the primary state. If not, they simply exit quietly. I don't have an HB setup to hand at the moment, but I believe you can query Heartbeat for this information.
If all machines are eligible primaries (such as in a cluster), then some locking is used, by way of either a shared database or a PID file. Only one machine ever obtains the lock, and those which don't exit quietly.
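For the lock-file variant, a rough sketch of what each node's job could do (the lock path on shared storage is an assumption, and flock's behaviour over NFS depends on your client and server versions):

    #!/bin/bash
    # Installed in cron on every eligible node; only the node that wins the lock works.
    LOCKFILE=/shared/locks/report-job.lock   # must be visible to all nodes

    exec 9>"$LOCKFILE"
    if ! flock -n 9; then
        # Another node already holds the lock -- exit quietly.
        exit 0
    fi

    # ... do the work while holding the lock ...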
To make a long story short, you have to turn your cron scripts into some kind of cluster-able applications. Whether the implementation is as lightweight or as heavyweight as you need, they still need one thing: they must be able to properly resume/restart the action (or recover their state) after a primary-node failover. The trivial case is that they are stateless programs (or "stateless enough" programs) that can simply be restarted at any time and will do just fine. This is probably not your case. Note that for stateless programs you don't need failover, because you could simply run them in parallel on all the nodes.
In the normally complicated case, your scripts should live on the cluster's shared storage, should store their state in files there, should change the state stored on disk only atomically, and should be able to continue their action from any transient state they detect on startup.
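A minimal sketch of the "only change state atomically" part, assuming the state lives in a file on the cluster's shared storage (the paths and the state contents are made up for illustration):

    #!/bin/bash
    # Update the job's state file atomically: write a temp file, then rename it
    # over the old one. A node taking over only ever sees the old state or the
    # new state, never a half-written file.
    STATE=/shared/myjob/state
    TMP=$(mktemp "${STATE}.XXXXXX")

    echo "last_completed_step=3" > "$TMP"
    sync                          # flush the data before publishing it
    mv "$TMP" "$STATE"            # rename() is atomic within one filesystem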
Actually, there is no satisfactory solution in this area. We have tried them all: scripting solutions, cron with Heartbeat/Pacemaker, and more. Until recently, the only real option was a grid solution, which is naturally not what we want, seeing as a grid is more than overkill for this scenario.
That's why I started the CronBalancer project. It works exactly like a normal cron server except that it's distributed, load-balanced and HA (when finished). Currently the first two points are finished (beta) and work with a standard crontab file.
The HA framework is in place; all that's left is the signaling needed to determine the failover and recovery actions.
http://sourceforge.net/projects/cronbalancer/
chuck
I had been using a Nagios event handler as a simple solution.
On the NRPE server: don't forget to add the nagios user to the sudoers group and to disable requiretty for it.
On the Nagios server: the relevant pieces are the service definition (services.cfg), the event-handler command definition (commands.cfg), and the event-handler script itself (autostart_crond.sh).
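For illustration, the event-handler script could follow the standard Nagios pattern; this is only a sketch, not the original autostart_crond.sh, and it assumes the handler is called with $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ and that the nagios user may start crond via sudo:

    #!/bin/bash
    # autostart_crond.sh -- restart crond when the check reaches a HARD CRITICAL state.
    case "$1" in
    OK|WARNING|UNKNOWN)
        ;;                                   # nothing to do
    CRITICAL)
        if [ "$2" = "HARD" ]; then
            sudo /etc/init.d/crond start
        fi
        ;;
    esac
    exit 0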
But I have since switched to Pacemaker and Corosync, since that is the best solution to ensure the resource only runs on one node at a time.
Here are the steps I followed:
Verify that the crond init script is LSB compliant. On my CentOS box, I had to change the exit status from 1 to 0 (when starting an already-running service or stopping an already-stopped one) to match the requirements:
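A quick way to check the two cases Pacemaker cares about (both sequences should print 0):

    /etc/init.d/crond start; /etc/init.d/crond start; echo $?   # start an already-running service
    /etc/init.d/crond stop;  /etc/init.d/crond stop;  echo $?   # stop an already-stopped service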
Then it can be added to Pacemaker:
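Something along these lines (the resource name Crond and the monitor interval are my own choices, not mandated):

    crm configure primitive Crond lsb:crond \
        op monitor interval="30s"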
You can review the resulting configuration with crm configure show and watch the resource with crm status.
Test failover by stopping Pacemaker and Corosync on 3.145, then check the cluster status on 2.93:
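Roughly like this (host names and command output omitted; the service commands assume a CentOS-style init setup):

    # on 3.145, where the resource is currently running
    service pacemaker stop
    service corosync stop

    # on 2.93 -- the Crond resource should now be reported as Started here
    crm status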
I prefer Rcron for this particular problem. You have a state file which simply says "active" or "passive", and if it's active your cron will run on that machine. If the state file is set to passive, it won't run. Simple as that.
Now, you can use Red Hat Cluster Suite or any other clustering middleware to manage the state files across your cluster, or you can manually set active on a certain node, and that's it.
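A hypothetical /etc/cron.d entry wrapped with rcron might look like this; the paths and the state-file location are assumptions on my part, so check the rcron documentation for its actual configuration:

    # /etc/cron.d/report -- installed identically on both machines
    */10 * * * *  root  /usr/bin/rcron /usr/local/bin/report.sh

    # Promote or demote a node by flipping the state file rcron is configured to read:
    #   echo active  > /var/run/rcron/state     # the node that should run the jobs
    #   echo passive > /var/run/rcron/state     # the other node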
Making it execute or not execute on a particular machine is trivial. Either have a script put a cron job in /etc/cron.d, as you suggest, or keep the cron job permanently in /etc/cron.d but have the script itself do the failover checking and decide whether to execute.
The common (missing) part in both of these is how the script checks to see if the script on the other machine is running.
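One naive sketch of such a check, assuming passwordless SSH between the two servers and a known job path (both assumptions on my part):

    #!/bin/bash
    # Skip this run if the same job is already active on the peer server.
    PEER=server2.example.com                 # hypothetical peer hostname
    JOB=/usr/local/bin/nightly-job.sh

    if ssh -o ConnectTimeout=5 "$PEER" pgrep -f "$JOB" >/dev/null 2>&1; then
        exit 0                               # the peer is running it, stay quiet
    fi

    # ... otherwise carry on and do the work locally ...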
Without more information about what you're trying to do, this is hard to answer.