The setup
I am collecting statistics from Varnish with Logstash, which is configured to increment statsd counters based on the vhost in the server logs and the result code. I also have carbon creating whisper archives for graphite.
I'm reading logs from varnishncsa, which is configured to add the vhost and request disposition to the standard log format:
VARNISHNCSA_LOG_FORMAT="%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\" %{Host}i %{Varnish:hitmiss}x"
My logstash shipper config looks like this:
input {
  file {
    path => "/var/log/varnish/varnishncsa.log"
    type => varnish
  }
}
filter {
  grok {
    type => varnish
    pattern => "%{COMBINEDAPACHELOG} %{NOTSPACE:vhost} %{WORD:varnish_handling}"
    pattern => "%{COMBINEDAPACHELOG}"
  }
  mutate {
    rename => [ 'response', 'status' ]
  }
}
output {
  statsd {
    type => varnish
    host => "my-statsd-host"
    port => 8125
    sender => "%{@fields.vhost}"
    increment => "varnish.response.%{@fields.status}"
    increment => "varnish.handling.%{@fields.varnish_handling}"
  }
}
The problem
Hundreds of distinct counters are being created by carbon due to variations in the domain entered into users' browsers. So, for example, I have
www_mywebsite_com
WWW_MyWebsite_Com
www_mywebsite_net <-- an alias
...etc...
Obviously these are then missed by my graphs, which only look at statistics under the vhost's canonical name.
What I'd like is for some canonicalising process to happen beforehand. I can write a script to take a 'raw' domain and spit out a 'real' vhost name, but I'm not sure how to integrate that. Do I put it in the logstash config, or in statsd, or carbon? Could I do something with carbon's storage aggregation feature?
Update: I've worked around the worst cases by running carbon's aggregator daemon in front of the cache and adding rules to rewrite-rules.conf. However, there's very little documentation for that file, and I can't do more powerful things like smash everything down to lowercase.
You can lowercase a field with the mutate filter's lowercase option.
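A minimal sketch of that, assuming the field is called vhost (the name captured by the grok pattern in the question) and that this block is placed after the grok filter so the field already exists:

```
filter {
  mutate {
    # Assumption: "vhost" is the field extracted by the question's grok pattern.
    type => varnish
    # Fold the field to lowercase before the statsd output uses it as the sender.
    lowercase => [ "vhost" ]
  }
}
```

This only folds case, so variants such as WWW.MyWebsite.Com and www.mywebsite.com should end up under a single sender name. Aliases like the .net domain would still need a separate mapping.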
Logstash 1.1.13 Docs
Cheers, Jan