I want to migrate an existing application that has approx. 10 million records stored in a relational database to CouchDB. The thing that I love about CouchDB is easy replication and fast cached views. The thing I don't like is the write and view creation speeds which will be very slow with 10 million documents.
One idea I have to get around these potential bottlenecks is to have three CouchDB instances:
- Write only instance: This is the master instance. Our single point of truth. Only updates, inserts and deletes allowed here. There are no reads and no views on this instance.
- View creation only instance: Only used to create and cache views. There are no reads or writes on this instance.
- Read only instance: Read access via replicated views.
Instance 2 is replicated from instance 1. Since there will not be any applications that use instance 2, it makes it feasible to create new views without affecting production applications.
Instance 3 is replicated from instance 2 which includes all the cached views.
Is this a feasible solution?
I have never run CouchDB, only researched it, so don't take my suggestions here as true without verification...
First off, I would highly recommend reading John P. Wood's series on his experiences with production use of CouchDB: http://johnpwood.net/2009/06/15/couchdb-a-case-study/
Next, when you say instances, is that a physical server with a single CouchDB instance? If we're only talking 3 servers, I don't think that splitting up capacity by assigning different roles is optimal. My gut feeling would be to keep all 3 servers identical and loaded with the full data set, to allow for parallel read queries...?
If it's just 3 servers, I would consider traditional RDBMS and a traditional replication setup. I doubt CouchDB will make that great a difference for you with this relatively small amount of compute power?
Another thing, take a good look at HBase, buildt on top of Hadoop. The Hadoop framework is getting excellent corporate sponsorship now, with both Yahoo and Facebook being big users. Given this, HBase might develop faster and be more well-tested than some of the competition.
HTH
I'm fairly sure CouchDB doesn't replicate view caches (since they're caches, after all), so you'd have to replicate those out-of-band (which kind of misses the point, IMO).
CouchDB is probably just not that nice for write-heavy loads. If your load is read-heavy after all, I guess you can just call the views after each insertion/update, so that the views are always fully cache-backed.
Disclaimer: I'm using CouchDB in a few sites, but nowhere near the size you're talking about.