Right now I have a 6-node Riak cluster that is experiencing very high latency and timeouts. When I check riak-admin transfers,
I get the following:
ubuntu@ip-172-31-38-8:~$ riak-admin transfers
'riak@prod-riak-19' waiting to handoff 54 partitions
'riak@prod-riak-18' waiting to handoff 54 partitions
'riak@prod-riak-17' waiting to handoff 53 partitions
'riak@prod-riak-16' waiting to handoff 53 partitions
'riak@prod-riak-15' waiting to handoff 53 partitions
'riak@prod-riak-14' waiting to handoff 53 partitions
I've since turned off Active Anti-Entropy and am still experiencing high latency, but nothing else seems to be giving us a problem. When I check the error logs, there aren't any errors for the last 5 hours.
CPU usage looks like this:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4016 riak 20 0 3775m 564m 6224 S 9 3.8 3:34.90 beam.smp
so the machine obviously isn't maxed out. Is this a sign of data corruption? What could possibly be going on here? Thanks
When a Riak node is started it spawns a vnode for every partition in the ring, even those that it doesn't own. Each vnode that it doesn't own will attempt a handoff with the node that does own it, and after a successful handoff will shut down. These handoffs are subject to the transfer-limit.
Assuming you have a ring size of 64, there would be 10 or 11 vnodes owned by each node, which leaves 53 or 54 vnodes per node that it started but does not own. The transfers output you have shown would be expected if no handoffs had completed since the last time the entire cluster was restarted.
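To make the arithmetic concrete, here is a minimal sketch of the counts you would expect right after a full cluster restart. The ring size of 64 is an assumption (it is not stated in the question), and the function name is made up for illustration; it is not part of Riak. The sketch only assumes partitions are spread as evenly as possible across nodes.

```python
def expected_pending_handoffs(ring_size, node_count):
    """Per-node count of vnodes started but not owned, largest first.

    Assumes partitions are distributed as evenly as possible, and that
    every node starts one vnode per partition in the ring.
    """
    base, extra = divmod(ring_size, node_count)
    # 'extra' nodes own one more partition than the rest.
    owned = [base + 1 if i < extra else base for i in range(node_count)]
    # Every vnode a node does not own is a pending handoff.
    return sorted((ring_size - o for o in owned), reverse=True)


# Assumed ring size of 64 across the 6 nodes in the question.
print(expected_pending_handoffs(64, 6))  # [54, 54, 53, 53, 53, 53]
```

Those numbers match the two nodes waiting on 54 partitions and the four waiting on 53 in the transfers output, which is consistent with no handoff having completed yet.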