We have a Graphite server that collects data through collectd, statsd, JMXTrans ... For the past few days, we have frequently had holes in our data. Digging through the data we still have, we can see an increase in the carbon cache size (from 50K to 4M). We don't see an increase in the number of metrics collected (metricsReceived is stable at around 300K), but we do see an increase in the number of cache queries, from 1000 to 1500 on average.
Strangely, cpuUsage decreases from around 100% (we have 4 CPUs) to 50% when the cache size increases.
Strangely again, we see an increase in the number of bytes read from disk, and a decrease in the number of bytes written.
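For reference, these numbers come from carbon's self-reported metrics. This is roughly how we pull them out of the render API to see the trend (the hostname is a placeholder for our graphite-web host):

```
# carbon publishes its own stats under carbon.agents.<instance>.*
curl -s "http://graphite.example.com/render?target=carbon.agents.*.cache.size&target=carbon.agents.*.metricsReceived&target=carbon.agents.*.cache.queries&from=-2days&format=json"
```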
We have carbon configured mostly with default values:
- MAX_CACHE_SIZE = inf
- MAX_UPDATES_PER_SECOND = 5000
- MAX_CREATES_PER_MINUTE = 2000
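For completeness, the relevant part of our carbon.conf looks roughly like this (only the values above differ from a stock install; other settings omitted):

```
[cache]
# unbounded cache: carbon keeps buffering points in memory
MAX_CACHE_SIZE = inf
# throttle on whisper updates per second
MAX_UPDATES_PER_SECOND = 5000
# throttle on new whisper file creations per minute
MAX_CREATES_PER_MINUTE = 2000
```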
Obviously something has changed in our system, but we don't understand what, or how we can find the cause ...
Any help?
This is not a bug in the Graphite stack, but rather an IO bottleneck, most probably because your storage does not have high enough IOPS. Because of this, the queue keeps building up and eventually overflows, here at around 4M. At that point you lose that much queued data, which shows up later as random 'gaps' in your graphs. Your system simply cannot keep up with the rate at which it is receiving metrics: the cache keeps filling up and overflowing.
The drop in CPU usage is a symptom of the same thing: your system begins swapping and the CPUs spend a lot of time idle because of IO wait.
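You can confirm this by watching the disks that hold your whisper files while the cache grows, for example with iostat from the sysstat package (column names vary a bit between versions, but sustained high %util and long await times point at the storage rather than at carbon itself):

```
# extended device statistics, refreshed every 5 seconds
iostat -x 5
```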
To add context: I have 500 provisioned IOPS on AWS on a system that receives some 40K metrics, and there the queue is stable at around 50K.
The other answer mentioned the disk I/O bottleneck. I'll describe a network bottleneck as another possible cause of this.
In my environment, we run a cluster of front-end UI servers (httpd, memcached), a cluster of middle-layer relays (carbon-c-relay performing forwarding and aggregation), and a backend layer (httpd, memcached, carbon-c-relay, and carbon-cache). Each of these clusters consists of multiple instances in EC2, and in total they process 15 million metrics per minute.
We had a problem where we were seeing gaps in the metrics generated by the aggregate "sum" function, and the aggregated values were also incorrect (too low). Restarting carbon-c-relay in the middle layer would relieve the problem, but the gaps would start appearing again after several hours.
We had aggregation taking place in both the middle layer and the backend layer (the backend layer aggregated the aggregated metrics passed to it from the middle layer).
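To give an idea of what that involves, here is a simplified sketch of the kind of carbon-c-relay rules in play; the cluster name, addresses, regex, and metric names are placeholders, not our actual configuration:

```
# backend carbon-cache instances, consistent-hashed
cluster backend
    carbon_ch
        10.0.0.10:2103
        10.0.0.11:2103
    ;

# roll per-host counters up into one aggregated series
aggregate ^servers\.web[0-9]+\.requests\.count$
    every 60 seconds
    expire after 75 seconds
    compute sum write to
        aggregated.web.requests.count
    ;

# everything else goes straight to the backend
match * send to backend;
```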
The middle-layer hosts were not CPU bound, not disk bound, and had no memory constraints. That, combined with the fact that the problem would only appear a few hours after restarting the relay processes, meant the bottleneck was in the network. Our solution was simply to add more hosts to the middle layer. Doing this instantly resulted in the aggregated metrics being correct and the gaps disappearing.
Where exactly in the network stack the bottleneck was, I couldn't tell you. It could have been on the Linux hosts; it could have been on the Amazon side.