We have some fairly large datasets (user events and server logs, >100 GB) that are becoming unwieldy to process. I've seen lots of activity around NoSQL/Hadoop/etc., and I was wondering what SV had to say about pairing one of them with MySQL. The absolute ideal situation would be:
- A "master-slave" like synchronization between our live MySQL and the NoSQL/Hadoop servers, but different enough that we can build custom indexes/etc.
- The ability to run standard aggregations on the NoSQL/Hadoop side in trivial time, such as:
- <1 sec for count(*) where event_type = 'blarg' between 'date 1' and 'date 2'
- Give me all the incoming search terms (which we record) for this page and the children of this page over an arbitrary time period, with their counts (see the Python sketch just after this list)
- A simple (<10-minute) way to bring a developer's machine up to date.
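To pin down the semantics of that second query, here is the answer we want computed, written as naive in-memory Python over an assumed (timestamp, page_id, term) record stream; the whole point, of course, is getting this result without a full scan:

```python
from collections import Counter

def is_descendant(page, root, parent_of):
    """Walk child -> parent links until we hit `root` or run out of ancestors."""
    while page in parent_of:
        page = parent_of[page]
        if page == root:
            return True
    return False

def search_term_counts(records, parent_of, root, start, end):
    """records: iterable of (ts, page_id, term); parent_of: child -> parent map."""
    pages = {p for p in parent_of if is_descendant(p, root, parent_of)} | {root}
    return Counter(
        term
        for ts, page_id, term in records
        if page_id in pages and start <= ts <= end
    )
```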
Thoughts? We've tried a number of solutions built around MySQL, and none of them meets all of these elegantly.
MongoDB is simple and now has auto-sharding. It is not very efficient with disk usage, so you will need to give it a lot of disk space. It can handle these queries, but it will need indexes unless you want it scanning billions of records. What we actually do is store pre-computed summaries in Mongo: if you know your queries ahead of time, you can build an optimized data structure around them and be extremely efficient.
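A minimal sketch of that pre-aggregation pattern with pymongo; the `analytics` database, `daily_counts` collection, and field names are all made up for illustration:

```python
from datetime import datetime
from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://localhost:27017")["analytics"]

# Supporting index so range queries over the summaries stay fast.
db.daily_counts.create_index([("event_type", ASCENDING), ("day", ASCENDING)])

def record_event(event_type: str, ts: datetime) -> None:
    """On each incoming event, bump a per-day counter (one upserted $inc)."""
    db.daily_counts.update_one(
        {"event_type": event_type, "day": ts.strftime("%Y-%m-%d")},
        {"$inc": {"count": 1}},
        upsert=True,
    )

def count_events(event_type: str, start_day: str, end_day: str) -> int:
    """Answer "count events of this type between two dates" by summing
    one tiny summary document per day in the range."""
    cursor = db.daily_counts.find(
        {"event_type": event_type, "day": {"$gte": start_day, "$lte": end_day}},
        {"count": 1},
    )
    return sum(doc["count"] for doc in cursor)
```

The write path costs one extra upsert per event; the count query touches at most one small document per day in the range, which is how you get sub-second answers no matter how many raw events there are.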
Have you tried Infobright with MySQL? It does automatic compression and is fast; it could be enough for your needs. You'll need to code some kind of load adapter yourself, though!
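There's no off-the-shelf replicator I know of, so that adapter ends up being something like the sketch below: poll new rows out of the live MySQL, stage them as CSV, and bulk-load them into Infobright with LOAD DATA INFILE (its fast ingest path). Table names, columns, and connection details are all assumptions:

```python
import csv
import os
import tempfile

import pymysql  # both ends speak the MySQL protocol

SRC = dict(host="mysql-live", user="ro_user", password="...", db="app")
DST = dict(host="infobright", user="loader", password="...", db="warehouse",
           local_infile=True)

def sync_events(last_id: int) -> int:
    """Copy rows with id > last_id into Infobright; return the new high-water mark."""
    src, dst = pymysql.connect(**SRC), pymysql.connect(**DST)
    try:
        with src.cursor() as cur:
            cur.execute(
                "SELECT id, event_type, created_at FROM events "
                "WHERE id > %s ORDER BY id",
                (last_id,),
            )
            rows = cur.fetchall()
        if not rows:
            return last_id
        # Stage the batch as CSV for Infobright's bulk loader.
        with tempfile.NamedTemporaryFile(
            "w", suffix=".csv", newline="", delete=False
        ) as f:
            csv.writer(f).writerows(rows)
            path = f.name
        try:
            with dst.cursor() as cur:
                cur.execute(
                    "LOAD DATA LOCAL INFILE %s INTO TABLE events "
                    "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'",
                    (path,),
                )
            dst.commit()
        finally:
            os.unlink(path)
        return rows[-1][0]  # highest id copied
    finally:
        src.close()
        dst.close()
```

Run it from cron with the high-water mark persisted somewhere; tailing the binlog instead would cut replication lag but adds a lot more moving parts.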