I'm wondering what software the web scale guys are using to monitor their n arrays of servers in the server farm(s).
What do Facebook, Twitter, and Digg use? How does Google do it?
I'm looking for a solution to our own monitoring requirements. Our servers sit in the cloud, on AppEngine and EC2. We want to monitor the "application" (which is built from many small services), meaning the end result should be a system that can monitor both response time (plus aliveness and so on) and application validity: if I do X then Y should happen, and then, 2 hours later, verify that Z was processed and T was appended to the correct log...
The ideal solution would be a system that I can deploy unit tests to, the same unit tests I'm using to test the software while developing.
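To make the idea concrete, here is a minimal sketch in Python of what I have in mind: take an existing `unittest` test case and run it as a monitoring check that reports both validity and elapsed time. The names `ExampleServiceTest` and `run_as_check` are made up for illustration, and the test body is a placeholder rather than a call to a real service. The delayed "verify 2 hours later" part would still need some kind of scheduler, which is not shown.

```python
import time
import unittest


class ExampleServiceTest(unittest.TestCase):
    """Hypothetical test; stands in for a real test from the development suite."""

    def test_x_leads_to_y(self):
        # Placeholder for "if I do X then Y should happen".
        result = sum([1, 2, 3])  # stand-in for calling the real service
        self.assertEqual(result, 6)


def run_as_check(test_case_class):
    """Run every test in one TestCase class and summarise it as a check result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_case_class)
    result = unittest.TestResult()
    started = time.monotonic()
    suite.run(result)  # collects failures/errors into `result` without printing
    elapsed = time.monotonic() - started
    return {
        "ok": result.wasSuccessful(),
        "tests_run": result.testsRun,
        "failures": len(result.failures) + len(result.errors),
        "seconds": elapsed,
    }


if __name__ == "__main__":
    print(run_as_check(ExampleServiceTest))
```

A monitoring system could call something like `run_as_check` on a schedule and alert when `ok` is false or `seconds` crosses a threshold.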
Recommendations, pointers, comments are highly welcome - I'm looking for directions to attack this issue.
Thanks, Maxim.
I watched this a while ago. It's 'A day in the life of Facebook operations'. They use cfengine2 (deployment), Nagios (monitoring), and Ganglia (monitoring and trending), plus a lot of in-house tools. It's funny to see some of the tools we use being used at such a massive scale (60,000+ servers).
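If you go the Nagios route, a custom check is just a program that prints one line of status and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). Below is a rough sketch of a response-time check along those lines. The function names, the thresholds, and the `probe` callable are my own assumptions for illustration, not something taken from the Facebook talk.

```python
import sys
import time

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(seconds, warn_after, crit_after):
    """Map a measured response time to a Nagios status code."""
    if seconds >= crit_after:
        return CRITICAL
    if seconds >= warn_after:
        return WARNING
    return OK


def run_check(probe, warn_after=1.0, crit_after=5.0):
    """Time `probe` (any callable) and return (status_code, one-line message)."""
    started = time.monotonic()
    try:
        probe()
    except Exception as exc:
        # A probe that raises counts as the service being down.
        return CRITICAL, "CRITICAL - probe failed: %s" % exc
    elapsed = time.monotonic() - started
    status = classify(elapsed, warn_after, crit_after)
    # Text after "|" is performance data that trending tools can graph.
    message = "%s - responded in %.3fs | time=%.3fs" % (LABELS[status], elapsed, elapsed)
    return status, message


if __name__ == "__main__":
    code, message = run_check(lambda: None)  # replace the lambda with a real probe
    print(message)
    sys.exit(code)
```

The probe could be anything that exercises your application, including one of your existing tests, as long as it raises on failure.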