I seem to have hit a rather interesting problem recently while investigating my previous one -- One ColumnFamily places data on only 3 out of 4 nodes.
We had a very long row with over 800,000 columns in it. It stored user details, one column per user. Without getting into the reasons behind this kind of design, according to the documentation it should be fine; however, we were having massive performance problems.
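To make the layout concrete, here is a minimal Python sketch (not Cassandra code; the names, the JSON stand-in for the on-disk format, and the scaled-down column count are all made up for illustration) of why a single wide row is costly when any single-column read has to deserialise the whole row:

```python
import json

# One logical row keyed by "user_details", one column per user
# (scaled down from the real 800,000 columns for illustration).
wide_row = {f"user_{i}": f"details_{i}" for i in range(1000)}
serialized = json.dumps(wide_row)  # stand-in for the serialised row

def get_column(blob, column):
    # The entire row is deserialised even though only one column is needed.
    row = json.loads(blob)
    return row[column]

print(get_column(serialized, "user_42"))  # → details_42
```

The point of the sketch is only that the cost of `get_column` scales with the size of the whole row, not with the size of the one column the application asked for.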
It seemed as if the whole row was cached by the operating system, and Cassandra -- because the row was read fairly often -- was spending most of its CPU time serialising the row in order to get data out of it (even though the application's queries were about a specific column within the row).
The first thing I did was enable the row cache on the ColumnFamily which stored this row. That seemed to instantly solve the high CPU usage and the performance problems... until the first scheduled manual repair.
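If I remember the 0.8-era cassandra-cli syntax correctly, enabling the row cache looked roughly like this (the keyspace and ColumnFamily names here are made up, and the cache size is an arbitrary example, not what we actually used):

```
connect localhost/9160;
use MyKeyspace;
update column family UserDetails with rows_cached=200000;
```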
At some point during a manual repair (reproducible; it happened every single time I ran one), a node would lose contact with the other nodes (and vice versa!), and all affected nodes (usually two out of four) would report that the heap was almost full (Heap is 0.9977127109734825 full...). The affected nodes would then simply die (as in, the Cassandra process, not the host itself).
We have since solved the root problem by changing the way we store data within the problematic ColumnFamily. However, I still don't understand why I ran into these problems in the first place, and I would appreciate it if someone could shed some light on the potential source of the problem.
Was there anything really bad about what we did?
We are using Cassandra 0.8.7 on a 4-node cluster with RF=3. The heap is set to 16GB and these are 48GB boxes.
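For reference, the heap size in that version is set in conf/cassandra-env.sh; a fragment matching the numbers above would look like this (the HEAP_NEWSIZE value is an assumption, not something stated above):

```shell
# conf/cassandra-env.sh -- heap settings matching the setup described above
MAX_HEAP_SIZE="16G"
HEAP_NEWSIZE="800M"   # assumed value; not part of the original setup description
```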