Which database to decide upon? Any comparisions?
- Existing: postgresql
- Issues
- Not easily scalable horizontal. Needs sharding etc
- Clustering does not solve the data growth problem
- Looking for: Any database that is easily horizontally scalable
- Cassandra (Twitter uses that?)
- MongoDB (rapidly gaining popularity)
- Voldemort
- Other?
- Why?
- Data growing with snowball effect
- existing postgresql locks table etc for vaccuum tasks periodically
- Archiving data is tideous currently
- Human interaction involved in existing archive, vaccuum, ... process periodically
- Need a 'set it. forget it. just add another server when data grows more.' type of solution
First question: Why are you on a relational database to begin with, if you don't need ACID properties? It sounds like you're doing some kind of non-transactional work, so getting a RDMBS with transactions is probably too heavy for your environment.
Second question: What kind of data are you storing? You sound like you need a column-store database, and that it's for some kind of data warehouse project.
Third question: If you're stuck with PostgreSQL (which is a fine database as-is) is it the current version? Older pre-8.x versions are notoriously slow, but a lot of work has gone into improvements since then, and some of the issues you mention - like autovacuum - are now easily addressed with "set-and-forget" settings.
Some additional info on this would be nice. Why is it snowballing? Can you normalize it to reduce storage?
If this is an issue, I can tell already that you're running an older version. Newer versions have per-table controls for this and you can even turn it off entirely.
Hard to make any kind of judgement here because there isn't much to work with. What media is the archive being dumped to? How much sustained I/O is involved? What time frame are you operating under? How much data? Does it need to be a "hot" dump or can it be "cold"?
I'm trying to see how "normal" usage would require manual intervention, because it shouldn't. Vacuum is automatic now and (as mentioned before) can be set to not occur at all, and most backups are scripted (and when you can script, you can schedule). So how is this occuring?
You're talking about a clustered-server arrangement.
It sounds like the following to me:
Other than that, there's not enough info here to figure out what a good fit would be.
You might consider looking into HBase and HyperTable as well; but again, as Avery Payne mentioned, you don't give us any information about your current application, just your database platform.
Some things to keep in mind:
Joins are done manually on the non-SQL platforms. They won't do things like foreign keys, aggregates, etc. All of that is manual.
Existing applications are not necessarily easy to port. Depending on what it'll cost you to port, it may be more cost-effective for you to scale your PostgreSQL server vertically (rather than horizontally).
You don't get ACID, and you have to manually manage concurrency. Depending on your application, this may be a problem. You also can't enforce global conservation rules in the traditional way, again due to lack of atomicity.
Cassandra is the best option where you know you need to scale.
I'd recommend some of the Case Studies articles from http://wiki.apache.org/cassandra/ArticlesAndPresentations
What you can do to solve some of your issues is:
Table is not locked, it just performs slow. This is done by postgresql to prevent transaction ID wraparound. You can reduce the frequency by writing multiple rows in batches and then commit. You could use a queue (like rabbitmq) for intermediate writes: application->queue->db. This will also increase your write performance a lot.
If your data is too big in the orders of several TB I would suggest you to move to the cloud, because dumping is not an option. Use AWS or Google Cloud, and use snapshots. E.g. EBS snapshots, which are very fast, are replicated accross continents and solve the need for backup.
If by archiving you mean delete data and move to an "archive", then use tablespaces, which are rotated by date. There are some implementations online for this.