I'm having an intermittent issue connecting to a database slave with InnoDB. Intermittently I get connections taking longer than 2 seconds. These servers are hosted on Amazon's EC2.
The app server is PHP 5.2/Apache running on Ubuntu. The DB slave is running Percona's XtraDB 5.1 on Ubuntu 9.10. It's using an EBS Raid array for the data storage.
We already use skip name resolve and bind to address 0.0.0.0.
This is a stub of the PHP code that's failing
$tmp = mysqli_init(); $start_time = microtime(true); $tmp->options(MYSQLI_OPT_CONNECT_TIMEOUT, 2); $tmp->real_connect($DB_SERVERS[$server]['server'], $DB_SERVERS[$server]['username'], $DB_SERVERS[$server]['password'], $DB_SERVERS[$server]['schema'], $DB_SERVERS[$server]['port']); if(mysqli_connect_errno()){ $timer = microtime(true) - $start_time; mail($errors_to,'DB connection error',$timer); }
There's more than 300Mb available on the DB server for new connections and the server is nowhere near the max allowed (60 of 1,200). Loading on both servers is < 2 on 4 core m1.xlarge instances.
Some highlights from the mysql config
max_connections = 1200 thread_stack = 512K thread_cache_size = 1024 thread_concurrency = 16 innodb-file-per-table innodb_additional_mem_pool_size = 16M innodb_buffer_pool_size = 13G
Any help on tracing the source of the slowdown is appreciated.
[EDIT] I have been updating the sysctl values for the network but they don't seem to be fixing the problem. I made the following adjustments on both the database and application servers.
net.ipv4.tcp_window_scaling = 1 net.ipv4.tcp_sack = 0 net.ipv4.tcp_timestamps = 0 net.ipv4.tcp_fin_timeout = 20 net.ipv4.tcp_keepalive_time = 180 net.ipv4.tcp_max_syn_backlog = 1280 net.ipv4.tcp_synack_retries = 1 net.core.rmem_max = 16777216 net.core.wmem_max = 16777216 net.ipv4.tcp_rmem = 4096 87380 16777216 net.ipv4.tcp_wmem = 4096 87380 16777216
[EDIT] Per jaimieb's suggestion, I added some tracing and captured the following data using time. This server handles about 51 queries/second at this the time of day. The connection error was raised once (at 13:06:36) during the 3 minute window outlined below. Since there was 1 failure and roughly 9,200 successful connections, I think this isn't going to produce anything meaningful in terms of reporting.
Script:
date >> /root/database_server.txt (time mysql -h database_Server -D schema_name -u appuser -p apppassword -e '') > /dev/null 2>> /root/database_server.txt
Results:
=== Application Server 1 === Mon Feb 22 13:05:01 EST 2010 real 0m0.008s user 0m0.001s sys 0m0.000s Mon Feb 22 13:06:01 EST 2010 real 0m0.007s user 0m0.002s sys 0m0.000s Mon Feb 22 13:07:01 EST 2010 real 0m0.008s user 0m0.000s sys 0m0.001s === Application Server 2 === Mon Feb 22 13:05:01 EST 2010 real 0m0.009s user 0m0.000s sys 0m0.002s Mon Feb 22 13:06:01 EST 2010 real 0m0.009s user 0m0.001s sys 0m0.003s Mon Feb 22 13:07:01 EST 2010 real 0m0.008s user 0m0.000s sys 0m0.001s === Database Server === Mon Feb 22 13:05:01 EST 2010 real 0m0.016s user 0m0.000s sys 0m0.010s Mon Feb 22 13:06:01 EST 2010 real 0m0.006s user 0m0.010s sys 0m0.000s Mon Feb 22 13:07:01 EST 2010 real 0m0.016s user 0m0.000s sys 0m0.010s
[EDIT] Per a suggestion received on a LinkedIn question, I tried setting the back_log value higher. We had been running the default value (50) and increased it to 150. We also raised the kernel value /proc/sys/net/core/somaxconn (maximum socket connections) to 256 on both the application and database server from the default 128. We did see some elevation in processor utilization as a result but still received connection timeouts.