We are currently experiencing a lot of Redis errors with the message
Unable to connect: read error on connection, trying next server
We run Redis on FreeBSD with phpredis, and we have a hard time reproducing the error on Ubuntu, so this might be a hint. There's a long-running issue on that topic on GitHub.
Basically, phpredis obtains a socket from the operating system with a call to connect(host, port, timeout), but when we then call select(db_index), we get an exception.
Could there be an issue with persistence? I assume that connect() does nothing in the background, and that select() is the first call to actually touch the connection, which at that point is already closed.
We don't run into a timeout. We tried tuning TIME_WAIT without success.
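As a client-side stopgap, one option is to catch the failure and retry against the next server. A minimal sketch in Python, where connect_once is a hypothetical stand-in for the phpredis connect() + select() sequence (not the actual API):

```python
import time

def connect_with_retry(servers, connect_once, attempts=3, backoff=0.1):
    """Try each server in turn; retry the whole list a few times.

    `connect_once(host, port)` is a hypothetical callable that performs
    the equivalent of phpredis connect() + select() and raises
    ConnectionError on failure.
    """
    last_error = None
    for attempt in range(attempts):
        for host, port in servers:
            try:
                return connect_once(host, port)
            except ConnectionError as exc:
                last_error = exc  # "read error on connection", try next server
        time.sleep(backoff * (2 ** attempt))  # brief exponential backoff
    raise last_error
```

This only papers over the symptom, of course; if the server is blocked (see the update below), retries just shift the load.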
Any other ideas on where the problem might come from? What is the best way to track the issue down? dtrace maybe?
Update
We are currently looking into our BGSAVE settings. Interestingly, it takes half a second or more to fork the process that regularly writes the data to disk (persistence), and maybe Redis can't respond to connect() requests during that timespan.
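The fork latency itself is easy to measure with a small script. A sketch (not specific to Redis) that times a single os.fork() of the current process:

```python
import os
import time

def time_fork():
    """Return the time in seconds that a single fork() call takes.

    On a copy-on-write system this mostly reflects the cost of copying
    the parent's page tables, which grows with the process's memory size.
    """
    start = time.monotonic()
    pid = os.fork()
    if pid == 0:            # child: exit immediately
        os._exit(0)
    elapsed = time.monotonic() - start  # parent: fork() has returned
    os.waitpid(pid, 0)      # reap the child
    return elapsed

if __name__ == "__main__":
    print(f"fork took {time_fork() * 1000:.2f} ms")
```

For a process with a Redis-sized heap under write load, that copy alone can plausibly account for the half second observed above.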
We reduced the error rate by 90% with the following redis command:
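Disabling automatic RDB snapshots at runtime is done by clearing the save configuration, so assuming a stock setup the command would be along these lines:

```shell
# clear all "save <seconds> <changes>" snapshot rules at runtime
redis-cli config set save ""
```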
This disables BGSAVE, which regularly stores all database changes on disk. The connect errors most likely come from the blocking fork() the main Redis process performs to start the BGSAVE child. The redis.conf says:
Also see how the mechanism is implemented with a simple fork() here. We are thinking about dedicating one Redis server from our pool to the BGSAVE operations and using the others only for reading/writing. From IRC chat, it seems a couple of other companies ran into the same error. Bump was using a master/slave setup as well: the slave does not accept connections and is only there to persist the data (see the discussion on Hacker News here).
Hulu says the following: "To keep performance consistent on the shards, we disabled the writing to disk across all the shards, and we have a cron job that runs at 4am everyday doing a rolling “BGSAVE” command on each individual instance." (see here)
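The rolling-BGSAVE approach Hulu describes amounts to one crontab entry per shard, staggered so the forks never overlap. A sketch, with ports and spacing purely as assumptions:

```shell
# hypothetical crontab: trigger BGSAVE on each shard a few minutes apart, starting at 4am
0 4 * * *   redis-cli -p 6379 bgsave
10 4 * * *  redis-cli -p 6380 bgsave
20 4 * * *  redis-cli -p 6381 bgsave
```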
Edit:
It turns out that this was just a temporary fix. Load increased and we are back at high error rates. Nevertheless, I'm quite confident that a background operation (e.g. a fork, or a short-running background process) is causing the errors, as the error messages always appear in blocks.
Edit2:
Since Redis is single-threaded, always keep an eye on long-running operations, because they block everything else. An example is the KEYS * command: avoid it and use SCAN instead.
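The difference is visible in the command semantics: KEYS walks the whole keyspace in one blocking call, while SCAN returns a cursor plus a small batch per call, bounding how long the server is blocked at a time. A toy simulation of the cursor contract in Python (this mimics the calling pattern, not Redis internals):

```python
def scan(keyspace, cursor=0, count=10):
    """Toy model of SCAN: return (next_cursor, batch).

    A returned cursor of 0 means the iteration is complete, mirroring
    the real command's contract. Each call touches at most `count` keys,
    so a single-threaded server never stalls on the whole keyspace.
    """
    keys = sorted(keyspace)
    batch = keys[cursor:cursor + count]
    next_cursor = cursor + count
    if next_cursor >= len(keys):
        next_cursor = 0
    return next_cursor, batch

def scan_all(keyspace):
    """Drive the cursor loop the way a client iterates SCAN."""
    cursor, result = 0, []
    while True:
        cursor, batch = scan(keyspace, cursor)
        result.extend(batch)
        if cursor == 0:
            break
    return result
```

The client does more round trips, but each server-side step is cheap, which is exactly the trade-off you want on a single-threaded server.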