I am looking into building a syslog/logging infrastructure and am thinking through some architecture best practices. Essentially, I see that a syslog system needs to support two conflicting workloads:
- Log collection. Potentially massive streams of data need to be written to disk quickly and indexed.
- Log querying. Logs will be queried both by fixed fields, such as date and source, and by free-text search (roughly the kind of request sketched below).
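To make the query side concrete, here is roughly the kind of request I expect to run, sketched with the elasticsearch Python client. The index pattern, field names and client argument style are my own placeholders, not an existing schema, and exact client arguments vary between client versions:

```python
# Sketch of the two query patterns combined in one request: fixed-field
# filters plus a free-text match. Index/field names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

result = es.search(
    index="logs-*",
    body={
        "query": {
            "bool": {
                # Fixed-field constraints: cheap, cacheable filters.
                "filter": [
                    {"term": {"source": "web-01"}},
                    {"range": {"@timestamp": {"gte": "now-24h"}}},
                ],
                # Free-text part: scored full-text match on the message body.
                "must": [{"match": {"message": "connection timeout"}}],
            }
        },
        "size": 50,
        "sort": [{"@timestamp": {"order": "desc"}}],
    },
)

for hit in result["hits"]["hits"]:
    print(hit["_source"]["@timestamp"], hit["_source"]["message"])
```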
What is the best disk/system setup, assuming I'd like to keep it to a single server for now? Should I use SSDs or a ramdisk to offload some processing? Some disks striped and some in RAID 5?
I am particularly eyeing Graylog2 with ElasticSearch/MongoDB.
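For completeness, ingestion on the Graylog2 side would look roughly like this: a minimal GELF-over-UDP sketch, assuming the default input port 12201; the host name and extra field are made up:

```python
# Push one GELF message to a Graylog2 UDP input (default port 12201).
# Host and field values are placeholders.
import json
import socket
import time
import zlib

message = {
    "version": "1.1",  # GELF spec version; older Graylog2 inputs may expect "1.0"
    "host": "web-01.example.com",
    "short_message": "connection timeout to upstream",
    "timestamp": time.time(),
    "level": 4,                # syslog severity "warning"
    "_facility": "nginx",      # custom fields are underscore-prefixed
}

payload = zlib.compress(json.dumps(message).encode("utf-8"))
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(payload, ("graylog.example.com", 12201))
sock.close()
```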
First off, I think it helps to define the value of the logs. For example, if these are high-volume financial transaction logs, you may opt for very high-end RAID controllers with lots of battery-backed cache and high-end disks with tagged command queuing or NCQ.
In the general case, the ZFS filesystem is pretty helpful: you are free to use HDDs for the cheap capacity they offer and then add SSDs as a read cache (the L2ARC in ZFS) once you need them. If writes become a bottleneck, you can put the ZIL (the ZFS intent log, which absorbs synchronous writes and so effectively acts as a write cache) on SSDs. The good thing is it all just works, and really well in my experience.
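As a rough sketch (the pool name and device paths are placeholders for your hardware), the layout can grow in exactly that order, using the standard zpool commands:

```python
# Sketch of the ZFS layout described above, expressed as the zpool commands
# I would run. Pool name and device paths are placeholders.
import subprocess

def zpool(*args):
    """Run a zpool command and fail loudly if it errors."""
    subprocess.run(["zpool", *args], check=True)

# Bulk capacity on cheap HDDs (RAID-Z2 here, i.e. double parity).
zpool("create", "logpool", "raidz2", "/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde")

# Add an SSD as a read cache (L2ARC) once the working set outgrows RAM.
zpool("add", "logpool", "cache", "/dev/nvme0n1")

# Put the ZIL on mirrored SSDs if synchronous writes become the bottleneck.
zpool("add", "logpool", "log", "mirror", "/dev/nvme1n1", "/dev/nvme2n1")
```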
Taking it further to address the conflicting-workload concern, a product like Cassandra (there are many other options too) has a log-structured, append-heavy architecture that neatly meets these requirements in an efficient way.
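To illustrate, here is one way syslog data could be laid out in Cassandra using the DataStax Python driver. The keyspace, table and partitioning scheme below are just one plausible design of mine, not something prescribed by Cassandra itself:

```python
# Sketch: partition log rows by (source, day) so ingest is a sequential
# append and fixed-field queries stay on a single partition.
# Contact point, names and values are placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS logs
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS logs.syslog (
        source  text,
        day     text,          -- e.g. 2012-06-01, keeps partitions bounded
        ts      timeuuid,
        message text,
        PRIMARY KEY ((source, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# Ingest is a plain append into the partition for that source and day.
session.execute(
    "INSERT INTO logs.syslog (source, day, ts, message) VALUES (%s, %s, now(), %s)",
    ("web-01", "2012-06-01", "connection timeout to upstream"),
)

# Fixed-field query: newest messages from one source on one day.
rows = session.execute(
    "SELECT ts, message FROM logs.syslog WHERE source=%s AND day=%s LIMIT 50",
    ("web-01", "2012-06-01"),
)
for row in rows:
    print(row.ts, row.message)
```

Note that this covers the fixed-field side well; free-text search would still want an indexer such as ElasticSearch alongside it.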