We currently have 12 nodes running in our Cassandra cluster. Ultimately even if a couple of the nodes go down, we're still up and running. The paranoia in me would like to do at least one backup a day and store it on Amazon S3. My question is the following:
When backing up Cassandra, is it sufficient to run the backup from one node, or do I have to run a backup script on each of the 12 nodes and push each node's backup to S3? If a restore is ever required, do we have to restore from each node's individual backup, or is there a way to "aggregate" the backups (assuming they have to be taken from each node individually) into one large restore process?
Slightly confused by the documentation. Just want to get an efficient backup process rolling on my Cassandra cluster.
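You need to back up every node, because each node normally holds only part of the data. The exception is when every node stores 100% of the data (for example, when the replication factor equals the number of nodes); in that case backing up a single node is enough.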
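To make that rule concrete, here is a minimal sketch in Python. The function name and the node names are made up for illustration, and it assumes a single keyspace in a single datacenter, so that a replication factor equal to the node count really does mean every node holds everything.

```python
def nodes_to_back_up(node_names, replication_factor):
    """Return the nodes whose data must be backed up to cover the whole cluster.

    Assumption (not from the thread): one keyspace, one datacenter, so a
    replication factor >= node count means every node stores 100% of the data.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    if replication_factor >= len(node_names):
        # Every node is a full copy, so any single node is enough.
        return list(node_names[:1])
    # Otherwise each node holds only a slice of the data: back up all of them.
    return list(node_names)
```

With 12 nodes and a replication factor of 3 this returns all 12 nodes; only with a replication factor of 12 does it return a single node.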
The easiest way to back up Cassandra is to back up each node. I've used tablesnap before to do this automatically and it's pretty good. There's also Priam from Netflix, but I haven't tried it personally: https://github.com/Netflix/Priam
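If you would rather roll your own daily job than use one of those tools, a per-node script could look roughly like the sketch below. This is not how tablesnap or Priam work internally. It assumes `nodetool` and the AWS CLI (`aws`) are installed on each node, that snapshots land under `<data_dir>/<keyspace>/<table>/snapshots/<tag>`, and that the data directory is `/var/lib/cassandra/data`; the bucket name, node id, tag, and function names are all placeholders.

```python
import glob
import os
import subprocess


def snapshot_command(tag):
    # "nodetool snapshot -t <tag>" takes a snapshot of this node under the given tag.
    return ["nodetool", "snapshot", "-t", tag]


def find_snapshot_dirs(data_dir, tag):
    # Assumed layout: <data_dir>/<keyspace>/<table>/snapshots/<tag>
    return sorted(glob.glob(os.path.join(data_dir, "*", "*", "snapshots", tag)))


def s3_sync_command(snapshot_dir, data_dir, bucket, node_id, tag):
    # Keep each node's backup under its own prefix so a restore can put
    # every node's files back on the node they came from.
    parts = os.path.relpath(snapshot_dir, data_dir).split(os.sep)
    keyspace, table = parts[0], parts[1]
    dest = "s3://%s/%s/%s/%s/%s" % (bucket, node_id, tag, keyspace, table)
    return ["aws", "s3", "sync", snapshot_dir, dest]


def backup_node(bucket, node_id, tag, data_dir="/var/lib/cassandra/data"):
    # Run this on every node, e.g. once a day from cron.
    subprocess.run(snapshot_command(tag), check=True)
    for snapshot_dir in find_snapshot_dirs(data_dir, tag):
        cmd = s3_sync_command(snapshot_dir, data_dir, bucket, node_id, tag)
        subprocess.run(cmd, check=True)
```

Each node would call something like `backup_node("my-backup-bucket", "node01", "daily-2024-01-01")` with its own node id. Keeping a separate prefix per node matches the point above: the backups are taken per node, so the restore is per node as well.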