I had a pretty bad time this evening. I had to move LVM2 LVs from one PV to another (the source PV backed by an NFS-stored vdisk, the target PV backed by an iSCSI LUN). Moving the small LVs of this VG (a few gigabytes each) went fine, but I also had a 400 GB LV, and after a while that move drove my guest's load average above 150, to the point where it got stuck and I had to hard reboot it.
I tried to resume the pvmove after doubling the memory and CPU sizing (16 GB and 4 vCPUs). The load climbed very high almost immediately. When the 5-minute load average reached 60, I decided to kill the pvmove process (fingers crossed). The process was killed properly, or at least it no longer appeared in the process table according to ps and top, but the load kept increasing, passing 90 before I decided a reboot was my only option. Even though the pvmove process was no longer running, the load never decreased and the CPUs were almost exclusively waiting on I/O, as shown below (roughly 40 minutes after I killed the process, which had run for at most 5 minutes).
top - 21:18:44 up 12:26, 1 user, load average: 93.07, 92.53, 89.07
Tasks: 405 total, 1 running, 402 sleeping, 2 stopped, 0 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 0.0%id, 99.8%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16021672k total, 15363796k used, 657876k free, 427060k buffers
Swap: 2095100k total, 36k used, 2095064k free, 11856520k cached
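For reference, here is roughly what I would check next time to see what is still blocked on I/O once the userspace process is gone (a sketch written from memory, not something I actually captured during the incident; as far as I understand, pvmove does its copying through a temporary device-mapper mirror, so the kernel can keep working after the command itself dies):

# Tasks in uninterruptible sleep (D state) are what keep the load average climbing
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# How much dirty data is still queued for writeback to the disks
grep -E '^(Dirty|Writeback):' /proc/meminfo

# The temporary pvmove mirror and its sync progress should show up here
dmsetup status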
I still had an ssh session open and responsive. Actions on the filesystem seemed reasonably responsive (listing directories), but restarting a daemon took a very long time, and it was not possible to open new ssh connections.
Does anybody have an explanation for this behaviour, and in particular why the load kept increasing even though the process was no longer there?
I suspect my iSCSI initiator is simply not good enough for this kind of operation, but I am eager to hear about anybody else's experience with such setups. P.S.: I have found this similar question, but imho it never really got a clear answer:
https://serverfault.com/questions/268907/high-load-and-oom-killer-on-domus-while-pvmove#=
Regards.
See that ~99%wa value? That's your problem: you're running into severe resource contention in your storage subsystem. Keep in mind that the Linux load average counts tasks stuck in uninterruptible I/O sleep (D state) as well as runnable ones, so the load can keep climbing after the process that queued the I/O is gone, for as long as the kernel is still flushing it.
You'll need to implement some monitoring so you can collect metrics and determine if the bottleneck is at the network level, at the physical disk level, or somewhere else entirely.
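Assuming the sysstat tools are installed on the guest, something along these lines would give you the numbers to compare (an illustrative sketch; intervals and device selection are up to you):

# Per-device utilisation, queue size and await times: is the iSCSI LUN itself saturated?
iostat -x 1

# Block-device and network throughput over time, to separate disk limits from network limits
sar -d 1
sar -n DEV 1

# Overall view: the 'b' column is processes blocked on I/O, 'wa' is iowait
vmstat 1

If you are using open-iscsi and the initiator side is the suspect, iscsiadm -m session -P 3 will also dump the session state and negotiated parameters for comparison against what the target expects.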