I am using Amazon Spot Instances to crawl a lot of data. Most instances run until Amazon terminates them when the spot price exceeds our max bid.
I need to monitor and, mainly, archive the logs generated on those spot instances. These logs are very important for debugging and analytics. We have application logs and system logs such as syslog and the secure log. Below are the options I could think of:
- Use Chukwa/Flume. (I'm not listing Facebook's Scribe here because I think the project is dead.) There is a small chance of losing a few logs with this approach.
- Attach an EBS volume to each spot instance. But then managing those volumes after the spot instances are terminated will be a pain.
- Mount an NFS volume and write the logs there. The performance is sometimes really bad with this approach.
Also, the ability to run Linux commands such as grep and awk on those archived files is important. What are people using in this situation?
P.S. We are already using Splunk, but I don't want to archive logs in Splunk.
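To make the grep/awk requirement concrete: if the archive ends up as gzipped files in S3, plain Unix tools still work by streaming each object through a pipe. A sketch, with hypothetical bucket and key names; the same pipeline is demonstrated on a local gzipped log:

```shell
# In production you would stream straight from S3 (requires a CLI S3 client,
# e.g. the AWS CLI):
#   aws s3 cp s3://my-log-archive/i-0123abcd/syslog.gz - | zcat | grep ERROR

# The same pipeline, demonstrated on a local gzipped log:
printf 'Dec  1 10:00:01 host app: INFO started\nDec  1 10:00:02 host app: ERROR timeout\n' \
  | gzip > /tmp/syslog.gz

# grep works on the decompressed stream:
zcat /tmp/syslog.gz | grep ERROR
# -> Dec  1 10:00:02 host app: ERROR timeout

# awk works the same way, e.g. printing the log-level field:
zcat /tmp/syslog.gz | awk '{print $6}'
```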
Two approaches that I've used:
Ship your logs to another location using syslog. With AWS we use a VPC with a VPN connection to a private rack in a local datacenter. All of our instances run syslog-ng and send their logs to a server in our datacenter, where the data is stored in MongoDB.
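The client side of that setup is small. A sketch of the syslog-ng config on each spot instance, with placeholder hostname, port, and application-log path:

```
# Collect local system messages plus an application log file
source s_local {
    system();            # /dev/log and kernel messages
    internal();          # syslog-ng's own messages
    file("/var/log/myapp.log" follow-freq(1));  # example app log path
};

# Forward everything to the central server over the VPN
destination d_central {
    tcp("logs.example.internal" port(514));
};

log { source(s_local); destination(d_central); };
```

Because the instances push logs out continuously, a spot termination costs you at most whatever was buffered locally at that moment.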
Use logrotate to archive your logs to S3. It's not as real-time as syslog, but it's simpler to set up and maintain, especially if you're generating a lot of data. AWS's newly announced Data Pipeline could also be a good addition to this solution, because you could use it to automatically process your logs with Elastic MapReduce jobs.
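A sketch of such a logrotate rule, assuming a CLI S3 client such as s3cmd is installed and configured on the instance (bucket name and log paths are placeholders):

```
/var/log/myapp/*.log {
    daily
    rotate 4
    compress
    missingok
    notifempty
    sharedscripts
    postrotate
        # Upload the freshly rotated, compressed logs, keyed by hostname
        # so each instance's logs land in their own prefix.
        s3cmd put /var/log/myapp/*.log.1.gz s3://my-log-archive/$(hostname)/ || true
    endscript
}
```

Rotating frequently narrows the window of logs you can lose when a spot instance is terminated, at the cost of more, smaller objects in S3.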
As you have already written, Chukwa/Flume is an option.
I believe that's an efficient way to process and store logs, but I would suggest Logstash for the same job.
Logstash is pretty efficient and supports a lot of input formats. It also provides a front-end where you can search with regular expressions and check the results.
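A minimal Logstash pipeline sketch to give an idea of the shape: tail the application log, accept syslog from other instances, parse with a grok pattern, and index the events for searching from the front-end. Paths, the port, the grok pattern, and the Elasticsearch host are placeholders, and exact option names vary between Logstash versions:

```
input {
  file {
    path => "/var/log/myapp/*.log"
    type => "app"
  }
  syslog {
    port => 5514
    type => "system"
  }
}

filter {
  grok {
    match => [ "message", "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" ]
  }
}

output {
  elasticsearch {
    host => "es.example.internal"
  }
}
```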
For the front-end, though, I would suggest Graylog2, which has a lot more functionality than the Logstash front-end.
That said, if you already have Splunk, I don't understand why you don't want to store the data there. Could it be the licensing fee? I'm not sure about their fee structure, but I know it's a lot :)