I have posted this on the cassandra-user mailing list but haven't had any responses yet, so I was wondering whether someone here on serverfault.com might have any ideas.
I seem to have come across a rather weird (at least for me!) problem / behaviour with Cassandra.
I am running a 4-node cluster on Cassandra 0.8.7. The keyspace in question has RF=3 and SimpleStrategy, with multiple ColumnFamilies inside the KeySpace. One of the ColumnFamilies, however, seems to have its data distributed across only 3 out of the 4 nodes.
Apart from the problematic ColumnFamily, the data on the cluster seems to be distributed more or less evenly.
# nodetool -h localhost ring
Address DC Rack Status State Load Owns Token
127605887595351923798765477786913079296
192.168.81.2 datacenter1 rack1 Up Normal 7.27 GB 25.00% 0
192.168.81.3 datacenter1 rack1 Up Normal 7.74 GB 25.00% 42535295865117307932921825928971026432
192.168.81.4 datacenter1 rack1 Up Normal 7.38 GB 25.00% 85070591730234615865843651857942052864
192.168.81.5 datacenter1 rack1 Up Normal 7.32 GB 25.00% 127605887595351923798765477786913079296
The schema for the relevant bits of the keyspace is as follows:
[default@A] show schema;
create keyspace A
with placement_strategy = 'SimpleStrategy'
and strategy_options = [{replication_factor : 3}];
[...]
create column family UserDetails
with column_type = 'Standard'
and comparator = 'IntegerType'
and default_validation_class = 'BytesType'
and key_validation_class = 'BytesType'
and memtable_operations = 0.571875
and memtable_throughput = 122
and memtable_flush_after = 1440
and rows_cached = 0.0
and row_cache_save_period = 0
and keys_cached = 200000.0
and key_cache_save_period = 14400
and read_repair_chance = 1.0
and gc_grace = 864000
and min_compaction_threshold = 4
and max_compaction_threshold = 32
and replicate_on_write = true
and row_cache_provider = 'ConcurrentLinkedHashCacheProvider';
And now the symptoms - output of 'nodetool -h localhost cfstats' on each node. Please note the figures on node1.
node1
Column Family: UserDetails
SSTable count: 0
Space used (live): 0
Space used (total): 0
Number of Keys (estimate): 0
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Key cache capacity: 200000
Key cache size: 0
Key cache hit rate: NaN
Row cache: disabled
Compacted row minimum size: 0
Compacted row maximum size: 0
Compacted row mean size: 0
node2
Column Family: UserDetails
SSTable count: 3
Space used (live): 112952788
Space used (total): 164953743
Number of Keys (estimate): 384
Memtable Columns Count: 159419
Memtable Data Size: 74910890
Memtable Switch Count: 59
Read Count: 135307426
Read Latency: 25.900 ms.
Write Count: 3474673
Write Latency: 0.040 ms.
Pending Tasks: 0
Key cache capacity: 200000
Key cache size: 120
Key cache hit rate: 0.999971684189041
Row cache: disabled
Compacted row minimum size: 42511
Compacted row maximum size: 74975550
Compacted row mean size: 42364305
node3
Column Family: UserDetails
SSTable count: 3
Space used (live): 112953137
Space used (total): 112953137
Number of Keys (estimate): 384
Memtable Columns Count: 159421
Memtable Data Size: 74693445
Memtable Switch Count: 56
Read Count: 135304486
Read Latency: 25.552 ms.
Write Count: 3474616
Write Latency: 0.036 ms.
Pending Tasks: 0
Key cache capacity: 200000
Key cache size: 109
Key cache hit rate: 0.9999716840888175
Row cache: disabled
Compacted row minimum size: 42511
Compacted row maximum size: 74975550
Compacted row mean size: 42364305
node4
Column Family: UserDetails
SSTable count: 3
Space used (live): 117070926
Space used (total): 119479484
Number of Keys (estimate): 384
Memtable Columns Count: 159979
Memtable Data Size: 75029672
Memtable Switch Count: 60
Read Count: 135294878
Read Latency: 19.455 ms.
Write Count: 3474982
Write Latency: 0.028 ms.
Pending Tasks: 0
Key cache capacity: 200000
Key cache size: 119
Key cache hit rate: 0.9999752235777154
Row cache: disabled
Compacted row minimum size: 2346800
Compacted row maximum size: 62479625
Compacted row mean size: 42591803
When I go to the 'data' directory on node1 there are no files at all for the UserDetails ColumnFamily.
I tried performing a manual repair in the hope that it would heal the situation, but without any luck.
# nodetool -h localhost repair A UserDetails
INFO 15:19:54,611 Starting repair command #8, repairing 3 ranges.
INFO 15:19:54,647 Sending AEService tree for #<TreeRequest manual-repair-89c1acb0-184c-438f-bab8-7ceed27980ec, /192.168.81.2, (A,UserDetails), (85070591730234615865843651857942052864,127605887595351923798765477786913079296]>
INFO 15:19:54,742 Endpoints /192.168.81.2 and /192.168.81.3 are consistent for UserDetails on (85070591730234615865843651857942052864,127605887595351923798765477786913079296]
INFO 15:19:54,750 Endpoints /192.168.81.2 and /192.168.81.5 are consistent for UserDetails on (85070591730234615865843651857942052864,127605887595351923798765477786913079296]
INFO 15:19:54,751 Repair session manual-repair-89c1acb0-184c-438f-bab8-7ceed27980ec (on cfs [Ljava.lang.String;@3491507b, range (85070591730234615865843651857942052864,127605887595351923798765477786913079296]) completed successfully
INFO 15:19:54,816 Sending AEService tree for #<TreeRequest manual-repair-6d2438ca-a05c-4217-92c7-c2ad563a92dd, /192.168.81.2, (A,UserDetails), (42535295865117307932921825928971026432,85070591730234615865843651857942052864]>
INFO 15:19:54,865 Endpoints /192.168.81.2 and /192.168.81.4 are consistent for UserDetails on (42535295865117307932921825928971026432,85070591730234615865843651857942052864]
INFO 15:19:54,874 Endpoints /192.168.81.2 and /192.168.81.5 are consistent for UserDetails on (42535295865117307932921825928971026432,85070591730234615865843651857942052864]
INFO 15:19:54,874 Repair session manual-repair-6d2438ca-a05c-4217-92c7-c2ad563a92dd (on cfs [Ljava.lang.String;@7e541d08, range (42535295865117307932921825928971026432,85070591730234615865843651857942052864]) completed successfully
INFO 15:19:54,909 Sending AEService tree for #<TreeRequest manual-repair-98d1a21c-9d6e-41c8-8917-aea70f716243, /192.168.81.2, (A,UserDetails), (127605887595351923798765477786913079296,0]>
INFO 15:19:54,967 Endpoints /192.168.81.2 and /192.168.81.3 are consistent for UserDetails on (127605887595351923798765477786913079296,0]
INFO 15:19:54,974 Endpoints /192.168.81.2 and /192.168.81.4 are consistent for UserDetails on (127605887595351923798765477786913079296,0]
INFO 15:19:54,975 Repair session manual-repair-98d1a21c-9d6e-41c8-8917-aea70f716243 (on cfs [Ljava.lang.String;@48c651f2, range (127605887595351923798765477786913079296,0]) completed successfully
INFO 15:19:54,975 Repair command #8 completed successfully
As I am using SimpleStrategy, I would expect the keys to be split more or less equally across the nodes; however, this doesn't seem to be the case.
Has anyone come across similar behaviour before? Does anyone have any suggestions as to what I could do to bring some data onto node1? Obviously, this kind of data split means that node2, node3 and node4 have to do all the read work, which is not ideal.
Any suggestions much appreciated.
Kind regards, Bart
SimpleStrategy means Cassandra distributes the data without respect for racks, datacenters, or other geography. That's important information for understanding data distribution, but it's not enough to fully analyze your situation.
If you want to understand how rows are distributed over a cluster, the partitioner you use also matters. The random partitioner hashes row keys before deciding which cluster members should hold them; the order-preserving partitioner does not, which can create hot spots (including total non-use of a node!) even if your nodes own equal divisions of the ring. You can experiment with how Cassandra distributes different keys by running the following command on one of your nodes, to see where Cassandra thinks different keys (actual or hypothetical) belong:
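Presumably the command in question is nodetool's getendpoints, which prints the replica nodes responsible for a given key (whether it is available may depend on your Cassandra version; the key below is just a placeholder):
# nodetool -h localhost getendpoints A UserDetails some_row_key
If none of the keys you try ever list node1 among the endpoints, that points at the keys themselves rather than the ring layout.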
If other column families are distributing their data properly over the cluster, I would look into the partitioner and keys you're using.
It turned out to be a problem with the schema -- instead of having multiple rows (1 row per user) we had one massive row with over 800,000 columns.
What I suspect was happening was: with only a single row key, that key hashed to a single token on the ring, so with SimpleStrategy and RF=3 the whole ColumnFamily lived on just the three replicas responsible for that token, and node1 happened not to be one of them.
We have changed the way the application stores this data, i.e. it now stores a single row per user's details, and the problem is gone.
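For anyone hitting the same thing, a quick sanity check (again assuming your nodetool has getendpoints; the keyspace and keys below are just illustrative) is to ask Cassandra where a few of the per-user keys live:
# nodetool -h localhost getendpoints A UserDetails user_00001
# nodetool -h localhost getendpoints A UserDetails user_00002
With one row per user, different keys should come back with different replica sets, whereas the old single massive row always returned the same three endpoints.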