We are seeing the following:
[root@primary data]# netstat -s | grep buffer ; sleep 10 ; netstat -s | grep buffer
20560 packets pruned from receive queue because of socket buffer overrun
997586 packets collapsed in receive queue due to low socket buffer
20587 packets pruned from receive queue because of socket buffer overrun
998646 packets collapsed in receive queue due to low socket buffer
[root@primary data]#
Bear in mind, the above is from a freshly rebooted box... about 1 hour of uptime. We recently had a box that had been up for 2 months, and these counters were well into the high millions (XXX million).
We have tried changing various sysctl variables...
Here are our sysctl variables which I believe are related:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
Does anyone know how to resolve these pruned packets due to socket buffer overrun / packets collapsing (which I understand isn't as bad as the pruned packets)?
Thanks.
Judging from the information you have provided, and since you seem to have already increased the buffers, the problem most likely lies with your application. The fundamental issue is that even though the OS receives the network packets, the application does not process them fast enough, so they fill up the receive queue.
This does not necessarily mean that the application itself is too slow; it may also be that it does not get enough CPU time because too many other processes are running on the machine.
Actually, you haven't necessarily increased the buffers; you have merely raised the maximum possible size of the queues.
When you open a socket, the queues are set to the default values:
net.core.rmem_default = 212992
net.core.wmem_default = 212992
So increasing the maxima will do nothing unless the application is calling setsockopt() to increase the queue size (and a request larger than the maximum is silently capped to it rather than failing).
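If you are not sure whether your application ever calls setsockopt() on its sockets, one rough way to check (assuming you can attach strace to the process and have ss installed; <pid> below is just a placeholder for your application's process ID):

# Watch the running process for setsockopt() calls (Ctrl-C to stop):
strace -f -e trace=setsockopt -p <pid>

# Inspect the receive buffer actually in use on its TCP sockets;
# the rb value inside the skmem:(...) field is the current receive buffer limit:
ss -tmp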
If the application never asks for a larger buffer, try increasing those default values instead.
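For example (the 1 MB values below are only an illustrative starting point, not a tuned recommendation; adjust them to your workload and available memory):

sysctl -w net.core.rmem_default=1048576
sysctl -w net.core.wmem_default=1048576

# To make the change persistent across reboots, add the same two settings
# to /etc/sysctl.conf and reload with:
sysctl -p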