We're looking at a long-term migration towards the cloud. The plan is to start small, and gradually move less essential parts of the infrastructure into the cloud. All good so far.
Part of this migration includes log files from web servers and whatnot. Keep in mind that the servers are still in datacenters outside the cloud. A cron job could grab the log files at the end of each day, compress them, and shove them into Amazon S3, with a possible backup to Glacier. That's the easy part.
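For example, a minimal sketch of that nightly job using boto3 (the AWS SDK for Python); the bucket name, log path, and key layout are all assumptions:

```python
import gzip
import shutil
from datetime import date, timedelta

import boto3  # AWS SDK for Python

BUCKET = "example-log-archive"   # hypothetical bucket name
LOG_DIR = "/var/log/apache2"     # hypothetical log location

def archive_yesterdays_log():
    day = date.today() - timedelta(days=1)
    src = f"{LOG_DIR}/access.log.{day.isoformat()}"
    dst = src + ".gz"

    # Compress the rotated log so transfer and storage stay small.
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

    # Key the object by date so later searches can narrow by prefix.
    key = f"logs/apache/{day:%Y/%m/%d}/access.log.gz"
    boto3.client("s3").upload_file(dst, BUCKET, key)

if __name__ == "__main__":
    archive_yesterdays_log()
```

Run it from cron shortly after log rotation; an S3 lifecycle rule can then transition older objects to Glacier for the backup part.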
The problem comes when S3 is the only place you store the logs and you want to search them for various events. If you don't know the time interval, you may have to download all the logs from S3 for a comprehensive search, and that turns out to be expensive - moving data into the cloud is cheap, but getting it back out is not.
Or I could set up an EC2 instance template. When someone wants to do a log search, fire up the instance, download the logs from S3 to it, and grep away. Downloading files from S3 to EC2 is cheap. But the download may take a while, and again, if you don't know what you're looking for, you need to download a lot of logs, which means using lots of space on EBS.
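As a rough sketch of what that search step could look like on the EC2 box, again assuming boto3 and the hypothetical date-prefixed keys from above; this version streams each object and greps it on the fly rather than staging everything on EBS first:

```python
import gzip
import re

import boto3  # AWS SDK for Python

BUCKET = "example-log-archive"   # same hypothetical bucket as above

def search_s3_logs(prefix, pattern):
    """Stream every gzipped object under a date prefix and grep it line by line."""
    s3 = boto3.client("s3")
    regex = re.compile(pattern)
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
            with gzip.open(body, "rt", errors="replace") as lines:
                for line in lines:
                    if regex.search(line):
                        print(obj["Key"], line, end="")

# e.g. search all of January 2014 for a client IP:
# search_s3_logs("logs/apache/2014/01/", r"203\.0\.113\.7")
```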
Another way is to upload the logs into DynamoDB or something. Price might be an issue. Another issue is that the logs are completely unstructured (Apache, Squid, and the like), so queries might take a very long time.
We're talking 500 GB/year of compressed logs, stored for up to 5 years, so roughly 2.5 TB in total.
To me, storing logs in the cloud like this is starting to sound like a bad idea. Maybe just use Glacier as a "tape backup", but keep the logs locally for now, on a couple of hard drives.
Which way do you lean?
Logstash + Elasticsearch + Kibana. That is the combination you need.
Sounds like you're looking towards AWS, so build up a modest EC2 cluster for this - perhaps a single Logstash "router/broker" box and two Elasticsearch cluster nodes.
Keep a reasonable amount of data "online" in Elasticsearch indices, and archive older indices to S3. Elasticsearch supports this via its snapshot/restore mechanism (with the S3 repository plugin) and can export data to S3 and import it back relatively seamlessly.
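A sketch of what that archiving step can look like against the snapshot API, using Python's requests; the repository name, bucket, region, and index pattern are assumptions, and the S3 repository type comes from a plugin rather than the bare install:

```python
import requests

ES = "http://localhost:9200"   # hypothetical Elasticsearch endpoint

# Register an S3-backed snapshot repository (needs the S3 repository plugin;
# bucket name and region are assumptions).
requests.put(ES + "/_snapshot/s3_logs", json={
    "type": "s3",
    "settings": {"bucket": "example-log-archive", "region": "us-east-1"},
}).raise_for_status()

# Snapshot last month's Logstash indices into S3; once that succeeds they
# can be deleted from the cluster and restored later if a search needs them.
requests.put(
    ES + "/_snapshot/s3_logs/logstash-2014.01",
    params={"wait_for_completion": "true"},
    json={"indices": "logstash-2014.01.*"},
).raise_for_status()

# Restore on demand:
# requests.post(ES + "/_snapshot/s3_logs/logstash-2014.01/_restore")
```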
As far as how to get logs up to EC2, I'd just use an IPsec tunnel from your self-hosted servers to the EC2 cluster and send logs using whatever protocol you'd like. Logstash has broad support for a bunch of input formats.
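For instance, the shipping side can be as simple as tailing a file and writing lines to a Logstash tcp input over the tunnel; the host, port, and log path below are assumptions:

```python
import socket
import time

LOGSTASH_HOST = "10.0.0.10"  # private address reachable over the IPsec tunnel (assumption)
LOGSTASH_PORT = 5000         # must match whatever tcp input port Logstash listens on

def follow(path):
    """Yield lines appended to a log file, tail -f style."""
    with open(path) as f:
        f.seek(0, 2)               # jump to the end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(0.5)    # wait for new data

sock = socket.create_connection((LOGSTASH_HOST, LOGSTASH_PORT))
for line in follow("/var/log/squid/access.log"):   # hypothetical log path
    sock.sendall(line.encode("utf-8"))
```

In practice you'd probably use an existing shipper (rsyslog, syslog-ng, or a Logstash agent on each server) rather than rolling your own, but anything that can reach one of Logstash's inputs over the tunnel will do.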
What about running a Splunk server in the cloud? Keep it up all the time on a small instance; it can use EBS volumes, or you could even try S3-backed storage if I/O doesn't become a bottleneck.
We launched a cloud log management service (slicklog.com) which might address your issues. When your logs arrive on the platform, they are searchable.
After a period of time (minimum 1 month) configurable via the UI, the logs are archived ($0.001/GB, uncompressed); once archived, the logs are no longer searchable.
Given your requirements, it might be considered a solution if you can live with the fact that the archived log files are not searchable.
Hope this helps.