I am running into a couple of issues with some applications we've deployed and maintain. I have the feeling we have approached this with some anti-patterns up to now, but I would like to see how to make this more flexible and stable.
In one situation, we have a server at a client which pushes data to us to parse every night (yes, Windows Task Scheduler). This is highly unstable however, so once every month this doesn't happen because of reasons out of our control. This heavily impacts our business since we run with stale data in that situation.
In another scenario we have a lot of background job processes that should be running. We already keep them up using bluepill ( http://www.github.com/arya/bluepill ) but obviously restarts happen, both automatically and manually, and people forget things or systems mess up.
What I would like to track is events that should occur or should be available. Like the existence of a process, the execution of a program, or the creation/age of a file, and track it when they don't happen or exist.
We develop most things in Ruby on Rails, use NewRelic, Bluepill and Munin, and run on Ubuntu. I've been toying around with counting ps aux | grep processname | wc -l
in Munin scripts, or capturing the age of a file and raising alerts over 24-26 hours, stuff like that.
Is there better tooling to track things that should happen, and raise alerts if they don't?
P.S. I know some things are suboptimal, like manually having to define bluepill for applications and then forgetting to do so. The same goes for the push based approach of the first application, a dedicated daemon that manages that on the client side that we control and can track its connection to us might be a much better solution.