Does anyone have any formulas, or maybe some sample data from their environment that can help me estimate how much disk space will be used by graphite per datapoint?
whisper-info.py gives you a lot of insight into what and how each file is aggregated, including the file's size. However, it's only useful for existing whisper files.
When you want to see predictive sizing of a schema before putting it in place, try a Whisper Calculator, such as the one available at https://gist.github.com/jjmaestro/5774063
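For a rough idea of the arithmetic such a calculator performs, here is a minimal sketch, assuming the standard whisper on-disk layout (a 16-byte file header, 12 bytes per archive header, and 12 bytes per stored point: a 4-byte timestamp plus an 8-byte double). The retention parser is deliberately simplified and only handles single-letter suffixes.

```python
# Rough whisper file-size estimate: 16-byte file header, 12 bytes per
# archive header, 12 bytes (4-byte timestamp + 8-byte double) per point.
# Simplified sketch, not the official whisper code.

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def parse_duration(token):
    """Turn '10s', '6h', '5y', ... into a number of seconds."""
    return int(token[:-1]) * UNIT_SECONDS[token[-1]]

def whisper_file_size(retention):
    """Estimate on-disk size (bytes) of one whisper file, e.g. '10s:6h,1m:7d,10m:5y'."""
    archives = [r.split(":") for r in retention.split(",")]
    points = sum(parse_duration(length) // parse_duration(step) for step, length in archives)
    return 16 + 12 * len(archives) + 12 * points

print(whisper_file_size("10s:6h,1m:7d,10m:5y"))  # ~3.3 MB for 275040 points
```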
EDIT: when asked for an example...

Looking at my file applied-in-last-hour.wsp against its storage_schema entry, ls -l yields the file's size on disk, and whisper-info.py ./applied-in-last-hour.wsp shows how that size breaks down across the configured archives.

So, basically, you combine your hosts per retention-match per retention-period-segment per stat, multiply by the number of systems you intend to apply this to, and factor in the number of new stats you're going to track. Then take whatever amount of storage that is and at least double it (because we're buying storage, and we know we'll use it...).
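To turn that into a number, here is a rough sketch building on the whisper_file_size() helper above; the retention string, metric count, host count, and growth factor are hypothetical placeholders, so substitute your own.

```python
# Back-of-envelope capacity estimate, building on whisper_file_size() above.
# All counts below are hypothetical placeholders -- substitute your own.

per_metric_bytes = whisper_file_size("1m:31d,15m:1y")  # hypothetical retention; use your storage-schemas entry
metrics_per_host = 50          # stats collected per machine
hosts = 200                    # machines you intend to apply this to
growth_factor = 2              # "at least double it"

total_bytes = per_metric_bytes * metrics_per_host * hosts * growth_factor
print(f"{total_bytes / 1024**3:.1f} GiB")
```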
In the documentation for statsd they give an example data retention policy. The retentions are 10s:6h,1min:7d,10min:5y, which is 2160 + 10080 + 262800 = 275040 data points, and they give an archive size of 3.2 MiB. Assuming a linear relationship, this would be approximately 12.2 bytes per data point.
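That lines up with whisper's on-disk format of 12 bytes per point plus a small fixed header; a quick sanity check, assuming the layout sketched in the first answer:

```python
# Sanity check of the ~12.2 bytes/point figure against whisper's 12-byte points.
points = 2160 + 10080 + 262800              # 275040 data points
archive_bytes = 3.2 * 1024**2               # the 3.2 MiB quoted for statsd
print(archive_bytes / points)               # ~12.2 bytes per point
print((16 + 3 * 12 + 12 * points) / 2**20)  # predicted file size: ~3.15 MiB
```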
No direct experience with Graphite, but I imagine the same logic we used for Cacti, or anything else that's RRD- or time-rollover-driven, would apply (Graphite no longer uses RRD internally, but the storage logic seems comparable).
The quick answer is "Probably not as much space as you think you'll need."
The long answer involves some site-specific math. For our monitoring system (InterMapper) I figure out the retention periods, resolutions, and datapoint size, do some multiplication, and add in overhead.
As an example I'll use disk space - we store figures with a 5-minute precision for 30 days, a 15-minute precision for a further 60 days, and then an hourly precision for a further 300 days, and we're using a 64-bit (8 byte) integer to store it:

- 8640 samples for the 30 days of 5-minute data
- 5760 samples for the 60 days of 15-minute data
- 7200 samples for the 300 days of hourly data

At 8 bytes per sample that's about 173KB, and a healthy overhead for storage indexing and the like brings it to about 200KB for one partition's disk-usage data (any error tending toward overestimation).
From the base metrics I can figure out an average "per machine" size (10 disk partitions, swap space, RAM, load average, network transfer, and a few other things), which works out to about 5MB per machine.
I also add a healthy 10% on top of the final number and round up, so I size things at 6MB per machine.
Then I look at the 1TB of space I have lying around for storing metrics data for charting and say "Yeah, I'm probably not running out of storage in my lifetime unless we grow a whole lot!" :-)
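The same back-of-envelope math in code, using the rough figures from this answer (the per-metric and per-machine numbers are the approximations stated above, not measurements):

```python
# Back-of-envelope sizing, following the reasoning above.

samples = 30 * 24 * 12 + 60 * 24 * 4 + 300 * 24   # 8640 + 5760 + 7200 = 21600 samples
raw_kb = samples * 8 / 1000                        # ~173 KB of raw 8-byte samples
padded_kb = 200                                    # with indexing overhead, rounded up

per_machine_mb = 5   # ~10 partitions plus swap, RAM, load, network, ... (figure from above)
sized_mb = 6         # plus ~10%, rounded up

print(f"one metric: ~{raw_kb:.0f} KB raw, sized at {padded_kb} KB")
print(f"1 TB holds roughly {1_000_000 // sized_mb:,} machines at {sized_mb} MB each")
```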
I have 70 nodes that generate a lot of data. Using Carbon/Whisper, one node alone created 91k files (the node generates multiple schemas, each with multiple counters and variable fields that need to be selectable, e.g. (nodename).(schema).(counter).(subcounter).(etc), and so on).

This provided the granularity I needed to plot any graph I want. After running the script to populate the remaining 69 nodes, I had 1.3 TB of data on disk, and that is only 6 hours' worth of data per node. What gets me is that the flat CSV file for 6 hours' worth of data is about 230 MB per node; for 70 nodes that's ~16 GB of data. My storage-schema was 120s:365d.
I'm relatively new to databases, so I might be doing something wrong, but I'm guessing it's all the overhead for each sample.
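A likely explanation, worth checking against your files with whisper-info.py: whisper preallocates every archive slot when a file is created, so a 120s:365d file takes up its full year of 12-byte slots even though only six hours of data have been written. Roughly:

```python
# Why 6 hours of data already occupies a year's worth of space:
# whisper allocates every slot of every archive at file-creation time.

points = 365 * 24 * 3600 // 120             # 120s:365d -> 262800 slots per file
file_mib = (16 + 12 + 12 * points) / 2**20  # ~3.0 MiB per file, data or no data
print(f"{file_mib:.1f} MiB per whisper file")
print(f"91,000 files -> ~{91_000 * file_mib / 1024:.0f} GiB for that one node")
```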
So it was a fun experiment, but I don't think it makes sense to use whisper for the kind of data I'm storing. MongoDB seems like a better solution, but I need to figure out how to use it as a backend to Grafana.