We have a 4-core production system that runs a lot of cron jobs, with a constantly non-empty process queue and a usual load of ~1.5.
During the night we do some IO-intensive work with Postgres. We also generate a graph showing load/memory usage (rrd-updates.sh). This "fails" sometimes under high IO load: it happens nearly every night, but not in every high-IO situation.
My "normal" solution would be to nice and ionice the postgres stuff and increase the prio of the graph generation.
However this still fails.
The graph generation is semi-protected against concurrent runs with flock.
I log the execution times; during high IO load the graph generation takes up to 5 min, seemingly resulting in a missing graph for up to 4 min. The timeframe exactly matches the Postgres activity (this sometimes happens during the day as well, though not as often).
Ionicing up to realtime priority (C1 N6 for graph_cron vs C2 N3 for postgres) and nicing well above Postgres (-5 graph_cron vs 10 postgres) did not solve the issue. Assuming the data is simply not collected in time, the additional issue is that ionice/nice somehow still do not seem to work.
Even with 90% IOwait and a load of 100 I was still able to run the data-collection command (free) with no more than maybe a 5 sec delay (in testing at least).
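One thing worth verifying is whether the IO class/priority actually stuck on the running processes. A minimal check (the pgrep patterns are assumptions about the process names on this box):

# print the effective IO scheduling class and priority of each process
for pid in $(pgrep -f rrd-updates.sh); do ionice -p "$pid"; done
for pid in $(pgrep -u postgres); do ionice -p "$pid"; done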
Sadly, I have not been able to reproduce this exactly in testing (having only a virtualized dev system).
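If it helps, one way to approximate sustained IO load on the dev system might be a plain dd write (path and size are placeholders; oflag=direct bypasses the page cache):

# generate background direct-IO write load for testing
dd if=/dev/zero of=/tmp/ioload.bin bs=1M count=4096 oflag=direct &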
Versions:
Kernel 2.6.32-5-686-bigmem
Debian Squeeze
rrdtool 1.4.3
Hardware: SAS 15K RPM HDD with LVM in hardware RAID1
mount options: ext3 with rw,errors=remount-ro
Scheduler: CFQ
crontab:
* * * * * root flock -n /var/lock/rrd-updates.sh nice -n-1 ionice -c1 -n7 /opt/bin/rrd-updates.sh
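Since flock -n exits silently when it cannot get the lock, and cron only sends mail when a job produces output, a variant worth trying could be the same entry with error reporting bolted on (sketch):

* * * * * root flock -n /var/lock/rrd-updates.sh nice -n-1 ionice -c1 -n7 /opt/bin/rrd-updates.sh || echo "rrd-updates.sh skipped or failed, exit code $?"

Any nonzero exit would then produce output and thus a cron mail.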
There seems to be a possibly related bug reported by Mr. Oetiker on GitHub for rrdcached:
https://github.com/oetiker/rrdtool-1.x/issues/326
This actually could be my issue (concurrent writes), but it does not explain why the cron job does not appear to fail.
Assuming I actually had 2 concurrent writes, flock -n would return exit code 1 (per the man page, confirmed in testing).
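For reference, the exit code is easy to confirm (the lock file path is arbitrary):

flock -n /tmp/test.lock sleep 10 &     # first instance holds the lock for 10 s
flock -n /tmp/test.lock true; echo $?  # prints 1 while the lock is held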
As I do not get an email with any output either, and the cron job actually runs fine all the other times, I am somewhat lost.
Example output: (graph screenshot not reproduced here)
Based on a comment, I have added the relevant part of the update script:
rrdtool update /var/rrd/cpu.rrd $(vmstat 5 2 | tail -n 1 | awk '{print "N:"$14":"$13}')   # second vmstat sample is a 5 s average, so this line alone blocks for ~5 s
rrdtool update /var/rrd/mem.rrd $(free | grep Mem: | awk '{print "N:"$2":"$3":"$4}')   # total:used:free memory, instantaneous snapshot
rrdtool update /var/rrd/mem_bfcach.rrd $(free | grep buffers/cache: | awk '{print "N:"$3+$4":"$3":"$4}')   # used/free without buffers+cache; $3+$4 reconstructs the total
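To narrow down which step stalls, one option might be to log each update's duration and exit code, e.g. (sketch; the log path is a placeholder):

t0=$(date +%s)
rrdtool update /var/rrd/cpu.rrd $(vmstat 5 2 | tail -n 1 | awk '{print "N:"$14":"$13}')
echo "cpu update rc=$? took $(( $(date +%s) - t0 ))s" >> /var/log/rrd-updates.log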
What am I missing, or where should I check further?
Remember: it is a production system, so no dev tools; no stack traces or similar are available or installable.
I guess it is not rrdtool that cannot update the graph, but rather that the data cannot be measured at that point. By the way, your method of measuring CPU and memory stats is flawed, because it gives you an instantaneous result. CPU and memory load can change drastically over the 60-second interval, but you take only one value. You should really consider using SNMP data, which gives averaged data over an interval. Plus, the whole pipeline seems to be more expensive and slower than an snmpget call. It could be the main reason for the gaps.
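For illustration, a minimal sketch of what that could look like with Net-SNMP's UCD MIB (this assumes a local snmpd with community "public"; the OIDs come from UCD-SNMP-MIB):

# averaged CPU percentages and memory figures from the local snmpd
snmpget -v2c -c public localhost UCD-SNMP-MIB::ssCpuUser.0 UCD-SNMP-MIB::ssCpuSystem.0
snmpget -v2c -c public localhost UCD-SNMP-MIB::memTotalReal.0 UCD-SNMP-MIB::memAvailReal.0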