I have a small farm of web servers (HP ProLiant and IBM x, with Broadcom Corporation NetXtreme II BCM5 NICs) running Apache 2.2.15 on CentOS 6, behind a Cisco ACE load balancer, serving a PHP/JS-based web portal. The farm receives a lot of requests daily (it serves a whole small country) for a splash page, from which users go on to the index page.
I've been struggling with the following problem:
I've noticed that requests to the web servers sometimes take quite a long time to be answered (from the client's point of view), and sometimes they are not answered at all (timeout on the web client side). In the latter case, I don't even see the request in the Apache logs.
I've also noticed that netstat reports an increasing number of TCP resets being sent:
netstat -st | grep 'resets sent'
Also,
dropwatch -l kas
shows that many packets are being dropped:
Initalizing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
53 drops at tcp_v4_md5_hash_skb+248 (0xffffffff8149fa08)
26 drops at tcp_rcv_established+926 (0xffffffff814981b6)
3 drops at tcp_v4_reqsk_destructor+fa (0xffffffff814a104a)
1 drops at netlink_unicast+251 (0xffffffff81471b11)
56 drops at tcp_v4_md5_hash_skb+248 (0xffffffff8149fa08)
29 drops at tcp_rcv_established+926 (0xffffffff814981b6)
4 drops at tcp_v4_reqsk_destructor+fa (0xffffffff814a104a)
51 drops at tcp_v4_md5_hash_skb+248 (0xffffffff8149fa08)
32 drops at tcp_rcv_established+926 (0xffffffff814981b6)
2 drops at tcp_v4_reqsk_destructor+fa (0xffffffff814a104a)
1 drops at ip_rcv_finish+199 (0xffffffff8147ea49)
1 drops at tcp_v4_destroy_sock+115 (0xffffffff814a0cf5)
1 drops at tcp_v4_reqsk_destructor+fa (0xffffffff814a104a)
22 drops at tcp_rcv_established+926 (0xffffffff814981b6)
36 drops at tcp_v4_md5_hash_skb+248 (0xffffffff8149fa08)
2 drops at tcp_v4_reqsk_destructor+fa (0xffffffff814a104a)
49 drops at tcp_v4_md5_hash_skb+248 (0xffffffff8149fa08)
29 drops at tcp_rcv_established+926 (0xffffffff814981b6)
26 drops at tcp_rcv_established+926 (0xffffffff814981b6)
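To compare runs more easily, the drop lines above could be totalled per kernel symbol. The following is only a minimal sketch, assuming Python 3 is available somewhere the captured output can be pasted; `tally_drops` and `DROP_RE` are names made up for this illustration and are not part of dropwatch.

```python
import re
from collections import Counter

# Matches dropwatch lines such as:
#   53 drops at tcp_v4_md5_hash_skb+248 (0xffffffff8149fa08)
# The offset after '+' and the address are hexadecimal and are ignored here.
DROP_RE = re.compile(r"(\d+) drops at (\w+)\+[0-9a-f]+ \(0x[0-9a-f]+\)")


def tally_drops(text):
    """Sum the reported drop counts per kernel symbol (hypothetical helper)."""
    totals = Counter()
    for count, symbol in DROP_RE.findall(text):
        totals[symbol] += int(count)
    return totals


# Example use with output saved to a file (the file name is an assumption):
#   with open("dropwatch.txt") as fh:
#       for symbol, total in tally_drops(fh.read()).most_common():
#           print(total, symbol)
```

The pattern works whether the entries are one per line or run together, because it matches each "N drops at symbol+offset (address)" group independently.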
I've been following the recommendations from Red Hat (Red Hat Enterprise Linux Network Performance Tuning Guide), even though I haven't seen some of the symptoms described there on my servers. In short:
- I've increased the NIC ring buffers to maximum.
- I've fiddled with (increased or changed) several kernel parameters (tcp_syncookies, netdev_budget, tcp_timestamps, tcp_window_scaling, tcp_rmem, dev_weight, tcp_tw_reuse...)
- I've modified the Apache config according to several "Apache optimization guides" found on the web (even though there were, and still are, idle workers in the Apache stats)
- I've stopped/disabled any system service/daemon not required (basically all that remains is sshd, httpd and snmpd)
All of the above with no luck.
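For reference, the current values of the kernel parameters listed above can be captured in one go, so that they can be posted or compared between servers. This is a rough sketch, assuming Python 3 and the usual mapping of dotted sysctl keys to files under /proc/sys; `PARAMS`, `sysctl_path` and `read_sysctls` are hypothetical names, and the `net.core` / `net.ipv4` prefixes are where I would expect these parameters to live.

```python
import os

# Parameters mentioned in the list above, written as dotted sysctl keys.
PARAMS = [
    "net.ipv4.tcp_syncookies",
    "net.core.netdev_budget",
    "net.ipv4.tcp_timestamps",
    "net.ipv4.tcp_window_scaling",
    "net.ipv4.tcp_rmem",
    "net.core.dev_weight",
    "net.ipv4.tcp_tw_reuse",
]


def sysctl_path(name, root="/proc/sys"):
    """Translate a dotted sysctl key into its file path under root."""
    return os.path.join(root, *name.split("."))


def read_sysctls(names, root="/proc/sys"):
    """Return {key: value string}; the value is None if the file cannot be read."""
    values = {}
    for name in names:
        try:
            with open(sysctl_path(name, root)) as fh:
                # Collapse tabs/newlines so multi-value entries fit on one line.
                values[name] = " ".join(fh.read().split())
        except OSError:
            values[name] = None
    return values
```

Calling `read_sysctls(PARAMS)` on each server and diffing the resulting dictionaries would show whether every node really ended up with the same tuning.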
All NICs are linked at 1000 Mb/s, CPU and disk usage are low, and neither netstat nor ethtool shows errors.
Any ideas what else can be done?