We're building a medical image processing software stack, currently hosted on various AWS resources. As part of this application, we have a handful of long-running servers (database, load balancers, web application, etc.). Collecting performance data on those servers is quite simple - my go-to recipe of Nagios (for monitoring/notifications) and Munin (for collecting performance data and displaying trends) will work just fine.
However - as part of this application, we are constantly starting up and terminating compute instances on EC2. In typical usage, these compute instances start up, configure themselves, receive a job from a message queue, and then get to work processing that job, which takes anywhere from 15 minutes to over 8 hours. After job completion, these instances get terminated, never to be heard from again.
What is a decent strategy for collecting performance data on these short-lived instances?
I don't necessarily need monitoring on them - if they fail for whatever reason, our application will detect this and handle re-starting the job on another instance or raising the flag so an administrator can take a look at things. However, it still would be useful to collect information like CPU (user, idle, iowait, etc.), memory usage, network traffic, disk read/write data, etc. In our internal database, we track the instance ID of the machine that runs each job, and it would be quite helpful to be able to look up performance data for a specific instance ID for troubleshooting and profiling.
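For what it's worth, each instance can discover its own instance ID from the EC2 metadata service at boot and tag whatever data it ships off-box with it. A minimal Python sketch, assuming the standard unauthenticated metadata endpoint:

```python
# Discover this instance's EC2 instance ID from the metadata service,
# so performance data shipped off-box can be keyed by it.
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/instance-id"

def get_instance_id(timeout=2):
    """Return this instance's EC2 instance ID, e.g. 'i-0abc123...'."""
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```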
Munin doesn't seem like a great candidate, as it requires maintaining a list of Munin nodes in a text file - far from ideal for a high-churn environment. Also, given how briefly each node runs, I'd rather keep the full-resolution data indefinitely than have RRD average it away over time.
In the end, my guess is that this will require a monitoring engine that:
- uses a database (MySQL, SQLite, etc.) for configuration and data storage
- exposes an API for adding/removing hosts and services
Are there other things I should be thinking about when evaluating options?
Perhaps I'm over-thinking this, though, and just ought to run sar at 1-minute intervals on these short-lived instances and collect the sar data files prior to termination.
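If I go the sar route, the collection step would look roughly like this - a sketch, assuming sysstat is already writing its binary sa files under /var/log/sa (the bucket name is a placeholder):

```python
# Sketch: just before termination, copy the binary sar data files to S3,
# keyed by instance ID, so they can be fetched later for troubleshooting.
# Assumes sysstat has been sampling (e.g. sadc via cron at 1-minute
# intervals) and that boto3 credentials are available on the instance.
import glob
import os
import boto3

BUCKET = "my-perf-archive"   # placeholder bucket name
SA_DIR = "/var/log/sa"       # default sysstat data directory

def ship_sar_files(instance_id):
    s3 = boto3.client("s3")
    for path in glob.glob(os.path.join(SA_DIR, "sa[0-9]*")):
        key = "sar/%s/%s" % (instance_id, os.path.basename(path))
        s3.upload_file(path, BUCKET, key)
```

Reading the data back later is just `sar -f` against the downloaded file.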
Zenoss has an EC2Manager plugin which automatically adds all of your EC2 instances (even in the open source version) and monitors EC2 for changes. Zenoss may be more heavyweight than you really want, though.
For ephemeral instances, Nagios is not a great fit, as you would constantly have to rewrite the config files.
I recently researched which "well known/maintained" monitoring systems are a good fit in situations like these - specifically, which ones make it easy to add and remove hosts.
I think you will need some sort of central CMDB to act as the source of truth. You can then use the CMDB as the data source for Puppet/Chef/etc., which can configure the monitored hosts and add them to the monitoring server.
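A minimal sketch of that shape, with a hypothetical `hosts` table in SQLite and a hypothetical REST endpoint on the monitoring server (neither is any particular product's API):

```python
# Sketch: treat the CMDB as the source of truth and push its active host
# list to the monitoring server. The schema and the /api/hosts endpoint
# are hypothetical stand-ins for whatever your tools actually expose.
import json
import sqlite3
import urllib.request

def sync_hosts(cmdb_path, monitor_url):
    conn = sqlite3.connect(cmdb_path)
    rows = conn.execute(
        "SELECT instance_id, ip_address FROM hosts WHERE active = 1")
    for instance_id, ip in rows:
        payload = json.dumps({"name": instance_id, "address": ip}).encode()
        req = urllib.request.Request(
            monitor_url + "/api/hosts", data=payload,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)
```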
If you care about collecting performance data but not monitoring, and you want something that doesn't require configuration each time an instance is born or dies, collectd would be a great fit.
Set up one instance as a server, meaning the network plugin is configured to receive data and the RRDtool plugin is configured to write data. Set up your ephemeral instances with whatever plugins you need to gather the relevant performance data, with the network plugin configured to send data to the server.
Since these instances are short-lived, you'll want to change the default RRATimespan. If you don't want to store the data in RRD files, collectd can send it to other data stores like Graphite or MongoDB.
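Roughly, the two configs look like this (the IP address and paths are placeholders; 25826 is collectd's default network port):

```
# Server instance: collectd.conf (path varies by distro)
LoadPlugin network
LoadPlugin rrdtool
<Plugin network>
  Listen "10.0.0.1" "25826"       # placeholder private IP
</Plugin>
<Plugin rrdtool>
  DataDir "/var/lib/collectd/rrd"
  RRATimespan 14400               # example value; the defaults water down short-lived data
</Plugin>

# Ephemeral instances: gather the basics and stream them to the server
LoadPlugin cpu
LoadPlugin memory
LoadPlugin interface
LoadPlugin disk
LoadPlugin network
<Plugin network>
  Server "10.0.0.1" "25826"
</Plugin>
```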
Zabbix would be an excellent choice here.
It's easy to set up, and you can configure auto-registration and discovery, collect performance data, and have the records cleaned up after X days.
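On the agent side, auto-registration only takes a few lines of zabbix_agentd.conf, plus an auto-registration action on the server that matches the metadata and links a template (the server address and metadata string below are placeholders):

```
# zabbix_agentd.conf on each ephemeral instance
Server=10.0.0.2                    # placeholder Zabbix server address
ServerActive=10.0.0.2
# Register under the EC2 instance ID so the data maps back to your
# job records; set this at boot from the instance metadata service.
Hostname=i-0123456789abcdef0
# Matched by the auto-registration action on the server
HostMetadata=ephemeral-worker
```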
I think the right approach with ephemeral instances is to leverage Amazon CloudWatch or the CloudWatch API in some manner. But that largely depends on what you really need to see...
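For instance, pulling CPU history for a specific instance ID is a single API call; a sketch using boto3 (keep in mind that the stock EC2 metrics cover CPU, disk, and network but not memory, and that CloudWatch only retains datapoints for a limited window, so you'd archive anything you want to keep long-term):

```python
# Sketch: fetch average CPU utilization for one instance from CloudWatch.
# Assumes boto3 credentials are available; the AWS/EC2 namespace and
# CPUUtilization metric are standard, the rest is illustrative.
from datetime import datetime, timedelta
import boto3

def cpu_for_instance(instance_id, hours=8):
    cw = boto3.client("cloudwatch")
    end = datetime.utcnow()
    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,                 # 5-minute datapoints; 1-minute needs detailed monitoring
        Statistics=["Average"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
```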
If you're using a quality load balancing solution in The Cloud, that can almost be more beneficial than per-instance monitoring, as the load balancer can make more informed routing decisions based on real-time conditions (e.g. number of connections, node response time/latency, geographical location).
However, we're looking to do the same, and potentially integrate with the commercial monitoring suite we use. Otherwise, Zenoss seems to have a canned solution.