We have a Sphinx install (2.0.3) running on a cluster of 3 EC2 instances (currently m3.large).
Currently we have workers = threads and max_children = 30 in our Sphinx config (same on each box). We are periodically getting the dreaded "temporary searchd error: server maxed out, retry in a second". Our instances are hovering around 5% CPU utilization. Some example top output:
top - 19:51:56 up 22:15, 1 user, load average: 0.08, 0.04, 0.01
Tasks: 82 total, 2 running, 80 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.0%us, 0.0%sy, 0.0%ni, 98.5%id, 0.3%wa, 0.0%hi, 0.0%si, 0.2%st
Mem: 7872040k total, 2911920k used, 4960120k free, 245168k buffers
Swap: 0k total, 0k used, 0k free, 2190992k cached
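For reference, the relevant part of the searchd section in our sphinx.conf looks roughly like this (the listen port, log paths, and pid_file path here are illustrative placeholders rather than our exact values):

searchd
{
    listen       = 9312
    workers      = threads
    max_children = 30
    query_log    = /var/log/sphinx/query.log
    log          = /var/log/sphinx/searchd.log
    pid_file     = /var/run/sphinx/searchd.pid
}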
All the Sphinx documentation seems to say about setting max_children is that it is "useful to control server load". While searching I found a forum post indicating that setting it either too high or too low can cause "server maxed out" - I presume the "too high" case is because individual queries get starved of resources - but the post offered no further tips on choosing the right level. (I can't find the link to that post again to save my life. Sorry.)
Two related questions:
- Am I right in thinking the low CPU suggests max_children could/should be higher than 30?
- How can I find the optimal number (i.e., the maximum number of children which [usually] does not lead to query slowdown)? I'm not entirely sure what kind of information Sphinx logs beyond query.log. Is there a tool I can use to determine whether query slowdown is occurring (due to too many parallel queries), and if not, whether queries are CPU-bound or memory-bound (or should I be looking at some other metric entirely)?
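So far the only timing data I've been looking at is the per-query wall time recorded in query.log, which I summarize with a quick throwaway script along these lines (this assumes the default plain log format; the log path and the "N.NNN sec" pattern are placeholders to adjust):

#!/usr/bin/env python
# Rough summary of query wall times from a Sphinx plain-format query.log.
# The default log path and the "] N.NNN sec" pattern below are assumptions;
# adjust them to match your own config and log format.
import re
import sys

log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/sphinx/query.log"

# Plain-format lines look roughly like:
# [Fri Jun 29 21:17:58 2007] 0.004 sec [ext/0/rel 35254 (0,20)] [myindex] some query
time_re = re.compile(r"\]\s+(\d+\.\d+) sec")

times = []
with open(log_path) as log:
    for line in log:
        match = time_re.search(line)
        if match:
            times.append(float(match.group(1)))

if not times:
    sys.exit("no query times found - is this a plain-format query log?")

times.sort()
count = len(times)
print("queries: %d" % count)
print("mean:    %.3f sec" % (sum(times) / count))
print("median:  %.3f sec" % times[count // 2])
print("p95:     %.3f sec" % times[int(count * 0.95)])
print("max:     %.3f sec" % times[-1])

That at least shows whether tail latency creeps up as load rises, but it doesn't tell me why, which is what I'm really after.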