I'm exploring different ways of storing user-uploaded files (all MS Office documents or similar) on our high-load website. It's currently designed to store documents as files, with a SQL database holding all the metadata for those files. I'm concerned about outgrowing the storage server, and about SQL Server performance once the number of documents reaches hundreds of millions. I've read a lot of good things about CouchDB, including its built-in scalability and performance, but I'm not sure how storing files as attachments in CouchDB would compare to storing them on a file system in terms of performance.
Has anybody used CouchDB clusters to store LARGE numbers of documents in a high-load environment?
In reply to Redmumba: the CouchDB dev team would be interested in the crashes you are seeing.
On top of that, CouchDB's whole architecture is based on the fail-early principle: all subsystems, as well as the main server, are designed to terminate and recover immediately when an error occurs. "Crashes" are part of normal operation, and they make for much more reliable software (ironic, perhaps, but that's the whole Erlang philosophy).
As for the question, CouchDB should fit the requirements well. CouchDB's attachment streaming is I/O-bound and runs very close to filesystem speed. CouchDB documents give you all the space you need for metadata, and attachments keep the binary data close by, so there's no need to run two separate systems for that.
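To make that concrete, here's a minimal sketch of the pattern over CouchDB's plain HTTP API using Python's requests library; the server URL, database name, document ID, and metadata fields are all hypothetical:

```python
# Sketch: metadata lives in the JSON document, and the binary is attached
# to the same document. Assumes CouchDB at localhost:5984 and a database
# named "docs" (both hypothetical).
import requests

COUCH = "http://localhost:5984"
DB = "docs"

# 1. Create a document that holds the file's metadata.
meta = {"filename": "report.docx", "owner": "alice", "size": 48213}
rev = requests.put(f"{COUCH}/{DB}/report-001", json=meta).json()["rev"]

# 2. Attach the binary to that document, referencing the current revision.
with open("report.docx", "rb") as f:
    requests.put(
        f"{COUCH}/{DB}/report-001/file?rev={rev}",
        data=f,  # streamed from disk, not buffered in memory
        headers={"Content-Type": "application/octet-stream"},
    )
```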
Our experience with CouchDB in a high-load environment hasn't been that great; we've seen a lot of instability (frequent crashing), which the mailing lists tend to suggest can simply be solved by installing a monitoring daemon to restart it when it fails. We don't use large value sets, but we do hit it fairly frequently. Keep this in mind, though: larger files mean longer connection times, so going down mid-transfer would be even more painful, depending on bandwidth and file size.
I would recommend looking into MongoDB with GridFS. MongoDB looks like a good fit for you because, based on your description, you have additional metadata that you want stored alongside the files; since it's document-oriented, you can keep that metadata next to the binary data. GridFS, in turn, lets you store large files in the database.
The BBC seem to be using it successfully; I believe there is a video on TED discussing what they are doing with it.
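For illustration, a minimal GridFS sketch with pymongo; the database name, filename, and metadata fields are hypothetical:

```python
# Sketch: GridFS splits the binary into chunks and stores extra keyword
# arguments as metadata on the file document. Assumes a local mongod and
# a database named "uploads" (hypothetical).
import gridfs
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["uploads"]
fs = gridfs.GridFS(db)

# Store the file together with its metadata in one call.
with open("report.docx", "rb") as f:
    file_id = fs.put(f, filename="report.docx", owner="alice")

# Read it back; the chunks are streamed in order.
data = fs.get(file_id).read()
```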
I have not used CouchDB, but I do have experience with SQL Server. If you store the files in SQL Server (with FILESTREAM, varbinary(max) data is physically stored on the file system), I think you'll be better off. It will scale to billions of rows, and performance, regardless of the database used (Oracle, SQL Server, etc.), will depend on the application design and the hardware. I think this is the key: performance issues are almost always the result of poorly designed applications or infrastructure, not of the underlying enterprise-class database.
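As a rough sketch of that approach via pyodbc (the connection string, table, and column names are all hypothetical, and the FILESTREAM-enabled table must already exist):

```python
# Sketch: insert a file into a varbinary(max) column. With FILESTREAM,
# the table needs a ROWGUIDCOL and a FILESTREAM filegroup, e.g.
# (hypothetical DDL):
#
#   CREATE TABLE Documents (
#       Id       UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
#       Filename NVARCHAR(260)  NOT NULL,
#       Content  VARBINARY(MAX) FILESTREAM
#   );
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=Docs;Trusted_Connection=yes;"
)
with open("report.docx", "rb") as f:
    content = f.read()

# A bytes parameter binds to varbinary(max) automatically.
conn.execute(
    "INSERT INTO Documents (Filename, Content) VALUES (?, ?)",
    "report.docx", content,
)
conn.commit()
```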