Currently we run a 4-node Cassandra ring in each of two datacenters. We would like to rebuild them into a single 8-node ring spanning both datacenters. All else being equal, we would really like to have consistent reads, so we currently run QUORUM reads and writes. However, if we lose a datacenter, it appears that many or all requests would fail because the ConsistencyLevel could no longer be met. Since we plan to send requests to both datacenters, switching to LOCAL_QUORUM wouldn't be enough to guarantee consistency either: a LOCAL_QUORUM write in one datacenter isn't guaranteed to be visible to a LOCAL_QUORUM read in the other.
Cassandra appears to be sorely missing ConsistencyLevel settings that are measured against only the nodes that are currently available.
What can be done to get maximum consistency without availability failures in this scenario, and what has to be traded off to get it?
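For concreteness, here is roughly how I understand the arithmetic, sketched with the DataStax Python driver. The keyspace, table, contact points, DC names, and RF=3 per DC are just illustrative assumptions, not our exact setup:

```python
from uuid import uuid4

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Placeholder contact points, one per datacenter.
cluster = Cluster(["dc1-node1", "dc2-node1"])
session = cluster.connect()

# Assumed layout: RF=3 in each DC (DC names must match the snitch), 6 replicas total.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}
""")
session.execute(
    "CREATE TABLE IF NOT EXISTS app.users (id uuid PRIMARY KEY, name text)"
)

# Cross-DC QUORUM needs floor(6 / 2) + 1 = 4 replicas, but a single surviving
# DC has only 3, so this read raises Unavailable as soon as the other DC is gone.
read = SimpleStatement(
    "SELECT name FROM app.users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(read, (uuid4(),))
```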
This simply isn't possible. When your network becomes partitioned (i.e., the link between the data centers goes down) and then comes back together, how will you reconcile the changes made within each data center during the outage? I'm asking specifically about records that have changed in BOTH data centers.
There's a reason that in distributed systems, things like ConsistencyLevel and quorum take planning by the administrator and are not left to the system to decide automatically. If they were, then (again, using your example) two adjacent nodes could get partitioned off, decide between themselves that they have quorum, and become inconsistent with the rest of the nodes.
You can have your app read and write at QUORUM in normal operation, then fail over to LOCAL_QUORUM in the case of a DC failure. This is something you'll have to do yourself, as Cassandra won't do it automatically. Optionally, when the failed DC comes back, you can run a nodetool repair on it before opening it up for read/write access again. Obviously, QUORUM in a multi-DC scenario may mean latency issues depending on the pipe between the data centers, but that's a tradeoff you'll have to weigh.
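A minimal sketch of what that application-level fallback might look like, using the DataStax Python driver. The contact points, keyspace, and the retry-on-Unavailable helper are my own illustrative assumptions; Cassandra itself won't downgrade the consistency level for you:

```python
from uuid import uuid4

from cassandra import ConsistencyLevel, Unavailable
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["dc1-node1", "dc2-node1"])  # placeholder contact points
session = cluster.connect("app")               # placeholder keyspace

def execute_with_dc_fallback(query, params=None):
    """Attempt a cross-DC QUORUM first; if the coordinator reports Unavailable
    (not enough live replicas, e.g. a whole DC is down), retry at LOCAL_QUORUM
    so the surviving DC keeps serving, at the cost of cross-DC consistency."""
    try:
        stmt = SimpleStatement(query, consistency_level=ConsistencyLevel.QUORUM)
        return session.execute(stmt, params)
    except Unavailable:
        stmt = SimpleStatement(query, consistency_level=ConsistencyLevel.LOCAL_QUORUM)
        return session.execute(stmt, params)

# Example: a write that degrades gracefully during a DC outage.
execute_with_dc_fallback(
    "INSERT INTO users (id, name) VALUES (%s, %s)", (uuid4(), "alice")
)
```

When the dead DC comes back, flipping back to QUORUM only after the repair has run is again a decision your application has to make.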