I have read and understood "Can you help me with my capacity planning?", but I'm not sure what my next steps are in a DNS server scenario. I think my CPU load is high or that I might be starting to drop queries, but I'd like to better understand the load on my server before I take action. This is particularly concerning to me because it's common knowledge that scaling your infrastructure to DDoS loads is a losing battle.
What should I be analyzing in order to understand my environment?
On Server Fault, we usually tell you that we can't help with your capacity planning. This is for good reason: we don't know the specifics of your environment, and the answers on how to measure it are pretty much the same. Unfortunately, DNS capacity measurement is a poorly understood topic, and most admins will assume that high CPU usage means it's time to consider adding capacity. This is a really, really bad idea: scaling to a DNS DDoS will inevitably lead to your network devices choking (or worse, people reaching out to your legal department).
Server logs and packet captures are what most admins will try leveraging first, but the simple truth is that SNMP can tell you far more about the environment than your logs can. Don't ignore logs and packet captures, but SNMP can usually help you spot the existence of a problem faster.
In addition to tracking the default system stats provided by your SNMP monitoring tool (which should include CPU load, per-interface throughput and packet counters, disk I/O, etc.), I recommend adding the following OIDs:
udpInErrors (angry red color strongly recommended)
udpInDatagrams, udpOutDatagrams
udpNoPorts
tcpInSegs, tcpOutSegs
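If your monitoring tool doesn't already turn these counters into rates, here is a minimal sketch of the arithmetic. It assumes you are polling the standard Counter32 scalars from the UDP and TCP MIB groups and that you already have two samples in hand; the dictionary layout and the function names are my own, not part of any SNMP library.

```python
# Scalar OIDs from the standard UDP and TCP MIB groups (instance suffix .0).
OIDS = {
    "udpInDatagrams": "1.3.6.1.2.1.7.1.0",
    "udpNoPorts": "1.3.6.1.2.1.7.2.0",
    "udpInErrors": "1.3.6.1.2.1.7.3.0",
    "udpOutDatagrams": "1.3.6.1.2.1.7.4.0",
    "tcpInSegs": "1.3.6.1.2.1.6.10.0",
    "tcpOutSegs": "1.3.6.1.2.1.6.11.0",
}

# These objects are Counter32, so they wrap back to zero at 2**32.
COUNTER32_MODULUS = 2 ** 32


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Increase between two samples, allowing for a single wraparound.

    A counter reset (for example after a reboot) is indistinguishable from a
    wrap here and will show up as one implausibly large delta.
    """
    return (current - previous) % modulus


def per_second_rates(previous, current, interval_seconds):
    """Convert two {name: counter_value} samples into {name: rate per second}."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {
        name: counter_delta(previous[name], current[name]) / interval_seconds
        for name in current
        if name in previous
    }
```

Feed it two polls taken a known number of seconds apart and graph the result; the raw counter values themselves are not meaningful on a graph.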
Interpreting the graphs
These graphs can be lumped into two categories: metrics that indicate a problem, and metrics that help you diagnose it.
Indicators
udpInErrors is your primary sign of a capacity problem. This counter increments every time the kernel drops a UDP datagram because the application isn't processing traffic fast enough. This means that your DNS service is overloaded and not able to keep up with the incoming traffic.
If you cannot correlate increases in this counter to other performance problems on the system, congratulations: you are legitimately nearing/over capacity and it's time to add servers. Consider me impressed. :)
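To put a number on "starting to drop queries", one rough approach is to express the udpInErrors increase as a share of the UDP datagrams that arrived in the same polling interval. This is only a sketch: the function name is hypothetical, and it assumes that udpInDatagrams counts delivered datagrams while udpInErrors counts undelivered ones (datagrams counted under udpNoPorts are left out).

```python
def udp_drop_percent(in_datagrams_delta, in_errors_delta):
    """Percentage of arriving UDP datagrams that were counted as errors.

    Both arguments are increases over the same polling interval, not raw
    counter values. Delivered plus errored datagrams is used as the total.
    """
    total = in_datagrams_delta + in_errors_delta
    if total == 0:
        # No UDP traffic in the interval, so nothing was dropped.
        return 0.0
    return 100.0 * in_errors_delta / total
```

What counts as an acceptable percentage is your call; the point is to watch the trend next to CPU load and disk I/O rather than to alert on a single sample.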
Diagnosing
This covers DNS-specific items only. Use your head here, and don't expect this to be all-inclusive. (Example: disk I/O saturation is not a problem specific to DNS.)
Side note: udpNoPorts isn't really a capacity metric, but it's useful for identifying cache poisoning attempts. This counter increments every time a UDP datagram arrives for a port that no application is listening on, and a sustained wall of these during normal operation can suggest that someone is trying to forge a reply. (Either that, or one of your listeners isn't running: turn that back on foo'!)
With DNS servers (indeed, with any type of server), you sometimes need to look at and analyse the requests being made of it, in case a misconfiguration (possibly elsewhere) is amplifying request volume (see, for example, Windows DNS servers repeatedly requesting records in a zone when they get a SERVFAIL response). Look at the proportions of queries and responses, and then try to find a comparator to check for normality.
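As a crude illustration of checking proportions against a comparator, the sketch below compares the ratio of outgoing to incoming UDP datagrams with a baseline you measured during a known-normal period. Using the UDP counters as a stand-in for queries and responses is an assumption (a recursive resolver also sends its own upstream queries), and the function names and the 25% tolerance are hypothetical.

```python
def response_ratio(in_delta, out_delta):
    """Outgoing datagrams per incoming datagram over one polling interval."""
    if in_delta == 0:
        # No incoming traffic, so the ratio is undefined.
        return None
    return out_delta / in_delta


def deviates_from_baseline(ratio, baseline, tolerance=0.25):
    """True when the ratio differs from the baseline by more than the tolerance.

    tolerance is a fraction of the baseline; 0.25 is an arbitrary starting
    point, not a recommendation.
    """
    return abs(ratio - baseline) > tolerance * baseline
```

A flagged interval doesn't tell you what is wrong, only that it's worth going to the query logs or a packet capture to see who is asking for what.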