We use Graphite to track the history of disk utilisation over time. Our alerting system looks at the data from Graphite and alerts us when free space falls below a certain number of blocks.
I'd like to get smarter alerts - what I really care about is "how long do I have before I have to do something about the free space?" E.g. if the trend shows I'll run out of disk space in 7 days, raise a Warning; if it's less than 2 days, raise an Error.
Graphite's standard dashboard interface can be pretty smart with derivatives and Holt-Winters confidence bands, but so far I haven't found a way to turn those into actionable metrics. I'm also fine with crunching the numbers some other way (just extract the raw numbers from Graphite and run a script over them).
One complication is that the graph is not smooth - files get added and removed, but the general trend over time is for disk space usage to increase, so perhaps there is a need to look at the local minima (if looking at the "disk free" metric) and draw a trend line through the troughs.
Has anyone done this?
Honestly "Days Until Full" is really a lousy metric anyway -- filesystems get REALLY STUPID as they approach 100% utilization.
I really recommend using the traditional 85%, 90%, 95% thresholds (warning, alarm, and critical you-really-need-to-fix-this-NOW, respectively). This should give you plenty of warning time on modern disks: on a 1 TB drive, 85% still leaves you lots of space but makes you aware of a potential problem; by 90% you should be planning a disk expansion or some other mitigation; and at 95% you've got 50 GB left and should darn well have a fix in motion.
This also ensures that your filesystem functions more-or-less optimally: it has plenty of free space to deal with creating/modifying/moving large files.
If your disks aren't modern (or your usage pattern involves bigger quantities of data being thrown onto the disk) you can easily adjust the thresholds.
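If you want those thresholds in code, a trivial sketch (the percentages are just the ones suggested above; adjust them to your environment):

def disk_alert_level(used_pct, warn=85.0, alarm=90.0, critical=95.0):
    """Map a utilization percentage onto the traditional severity ladder."""
    if used_pct >= critical:
        return "critical"
    if used_pct >= alarm:
        return "alarm"
    if used_pct >= warn:
        return "warning"
    return "ok"

print(disk_alert_level(91.2))  # -> "alarm"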
If you're still set on a "days until full" metric, you can extract the data from Graphite and do some math on it. IBM's monitoring tools implement several days-until-full metrics which can give you an idea of how to implement it, but basically you're taking the rate of change between two points in history.
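At its simplest, that's two samples and a subtraction. A rough Python sketch (the sample values are made up):

def days_until_full_two_points(used_pct_then, used_pct_now, days_between):
    """Estimate days to 100% from the rate of change between two samples."""
    rate_per_day = (used_pct_now - used_pct_then) / days_between
    if rate_per_day <= 0:
        return None  # usage flat or shrinking; no projected fill date
    return (100.0 - used_pct_now) / rate_per_day

# Example: 70% used a week ago, 78% used now -> roughly 19 days of headroom left
print(days_until_full_two_points(70.0, 78.0, 7))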
For the sake of your sanity you could use the derivative from Graphite (which will give you the rate of change over time) and project using that, but if you REALLY want "smarter" alerts I suggest using daily and weekly rate of change (calculated based on peak usage for the day/week).
The specific projection you use (smallest rate of change, largest rate of change, average rate of change, weighted average, etc.) depends on your environment. IBM's tools offer so many different views because it's really hard to nail down a one-size-fits-all pattern.
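To make that concrete, here's one hedged way of doing it in Python: take the day-over-day change in peak usage and compute a few competing projections (the function and its labels are mine, not something out of IBM's tooling):

def projections_from_daily_peaks(daily_peak_pct):
    """Given daily peak utilization percentages, return days-until-full
    estimates based on the smallest, largest, and average day-over-day growth."""
    deltas = [b - a for a, b in zip(daily_peak_pct, daily_peak_pct[1:])]
    growth = [d for d in deltas if d > 0]  # ignore days where usage shrank
    if not growth:
        return {}
    headroom = 100.0 - daily_peak_pct[-1]
    rates = {
        "optimistic": min(growth),     # slowest observed growth -> longest estimate
        "pessimistic": max(growth),    # fastest observed growth -> shortest estimate
        "average": sum(growth) / len(growth),
    }
    return {name: headroom / rate for name, rate in rates.items()}

# Example: a week of daily peaks; the "pessimistic" number is the one to alert on
print(projections_from_daily_peaks([70, 71, 73, 72, 75, 76, 78]))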
Ultimately no algorithm is going to be very good at doing the kind of calculation you want. Disk utilization is driven by users, and users are the antithesis of the Rational Actor model: All of your predictions can go out the window with one crazy person deciding that today is the day they're going to perform a full system memory dump to their home directory. Just Because.
We've recently rolled out a custom solution for this using linear regression.
In our system the primary source of disk exhaustion is stray log files that aren't being rotated.
Since these grow very predictably, we can perform a linear regression on the disk utilization, e.g.
z = numpy.polyfit(times, utilization, 1)
and then solve the linear model for the 100% mark, e.g.
(100 - z[1]) / z[0]
The deployed implementation is written in Ruby using GSL, though numpy works quite well too.
Feeding this a week's worth of average utilization data at 90 minute intervals (112 points) has been able to pick out likely candidates for disk exhaustion without too much noise so far.
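For reference, here is a minimal numpy sketch of that approach (this is not the deployed Ruby/GSL code from the gist; the sample data and 90-minute spacing are just illustrative):

import numpy as np

def hours_until_full(times_hours, used_pct):
    """Fit a straight line to utilization history and project when it crosses 100%.

    times_hours: sample timestamps in hours; used_pct: percent used at each sample.
    Returns None when usage is flat or shrinking (the trend never reaches 100%).
    """
    slope, intercept = np.polyfit(times_hours, used_pct, 1)
    if slope <= 0:
        return None
    crossing = (100.0 - intercept) / slope   # time at which the fitted line hits 100%
    return max(crossing - times_hours[-1], 0.0)

# Example: a week of samples every 90 minutes (112 points) with a slow upward drift
times = np.arange(112) * 1.5
usage = 60 + 0.05 * times
print(hours_until_full(times, usage))  # roughly (100 - 68.3) / 0.05 ≈ 634 hours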
The class in the gist is wrapped by another class that pulls data from Scout, alerts to Slack, and sends some runtime telemetry to statsd. I'll leave that bit out since it's specific to our infrastructure.
We keep a "mean time till full" or "mean time to failure" metric for this purpose, using the statistical trend and its standard deviation to add the smarter (less dumb) logic over a simple static threshold.
Simplest Alert: Just an arbitrary threshold. Doesn't consider anything to do with the actual diskspace usage.
Simple TTF: A little smarter. Take the unused percentage minus a buffer and divide it by the zero-protected consumption rate (a rough sketch is at the end of this answer). Not very statistically robust, but it has saved my butt a few times when my users upload their cat video corpus (true story).
Better TTF: But I wanted to avoid alerting on static read-only volumes sitting at 99% (unless they actually change), I wanted more proactive notice for noisy volumes, and I wanted to detect applications with unmanaged disk space footprints. Oh, and the occasional negative values in the Simple TTF just bothered me.
I still keep a static buffer of 1%. Both the standard deviation and the consumption rate increase on abnormal usage patterns, which sometimes overcompensates. In Grafana or Alertmanager speak you'll end up with some rather expensive sub-queries, but I did get the smoother time series and less noisy alerts I was seeking.
clamp_min((100 - 1 - stddev_over_time(usedPct{}[12h:]) - max_over_time(usedPct{}[6h:])) / clamp_min(deriv(usedPct{}[12h:]),0.00001), 0)
Quieter drives make for very smooth alerts.
Longer ranges tame even the noisiest public volumes.
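For anyone who wants the Simple TTF in plain code, here's a minimal Python sketch of the formula (the 1% buffer and the tiny rate floor mirror the query above; the names and sample numbers are just placeholders):

def simple_ttf_hours(used_pct, growth_pct_per_hour, buffer_pct=1.0, min_rate=0.00001):
    """Hours until the volume reaches (100 - buffer)% at the current growth rate.

    The rate is clamped to a small positive floor so flat or shrinking volumes
    yield a very large time-to-full instead of a negative or infinite one.
    """
    rate = max(growth_pct_per_hour, min_rate)  # zero-protected rate
    return max((100.0 - buffer_pct - used_pct) / rate, 0.0)

# Example: 92% used and growing 0.4 percentage points per hour -> about 17.5 hours left
print(simple_ttf_hours(92.0, 0.4))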