I have four NetApp 2240-4 filer heads. They're the single-chassis 'cluster in a box' configuration, so that's two separate units with two heads each.
Over the last few days, at about the same time, all of them started logging a lot of low water mark consistency points.
Running wafl_susp -w gives me cp_from_low_water clocking up at a rate of 10/sec or more. Before this started, they were almost entirely cp_from_timer, at a rate of about one every 10 seconds.
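For reference, I'm pulling those numbers from the elevated-privilege console, roughly like this (the exact privilege level wafl_susp needs may differ on your release):

    priv set advanced        # wafl_susp isn't available at the normal admin level
    wafl_susp -w             # dump WAFL suspend / consistency point statistics
    priv set admin           # drop back to the normal privilege level

sysstat -x 1 also tags each CP with a type code as it happens ('T' for timer, 'L' for low water mark), which is a quick way to watch it live.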
Two of my boxes have become unresponsive and been rebooted, and on those the problem has now gone away again. I'm not 100% sure that's connected, but it seems a reasonable bet as to the culprit.
The other two are completely idle: they have a base OS and a couple of vfilers, and nothing else. Yet the low water mark CPs suggest they're running out of memory for some reason. I can only assume some sort of denial-of-service condition is occurring (failed SSH logins, perhaps?).
Can anyone offer any insight into how to troubleshoot this? Specifically, from a NetApp perspective, I'm looking for hints on how to work out what's hogging my memory.
Open a ticket. This is an indication that there's a lack of system memory, and if there's little work being done and you still had boxes go unresponsive, there's something screwy happening. I've walked through the process of inspecting internal memory usage before with support on the line, but it's not something clients are supposed to do on their own: you'd need to use a priv set command and check the running processes.
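If you do end up walking through it with support on the line, the general shape is a privilege escalation, some read-only inspection, then dropping straight back down - something like the following, with the diag level in particular only used when support asks for it:

    priv set advanced        # or 'priv set diag' if support directs you there
    # ...run whatever inspection commands support dictates...
    priv set admin           # return to the normal privilege level afterwards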
Case opened with the vendor regarding the problem. Low water mark CPs are the result of memory exhaustion: (Vendor link)
To work with the vendor, we ran a perfstat - a downloadable NetApp tool for gathering performance-related support information. This led us to bug ID 697790 (support login required), which was present on the version of code we were running and is fixed in ONTAP 8.2.3.
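Two quick checks worth doing alongside the perfstat: the running release (the fix is in 8.2.3) and how much physical memory the head actually has:

    version          # running Data ONTAP release; the fix for bug 697790 landed in 8.2.3
    sysconfig -a     # hardware summary, including installed system memory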
Specifically, it is a memory leak that occurs when LDAP authentication fails. Because all four hosts were using the same account, and because at some point the account lockout had tripped, they were all failing absurdly frequently. (And they are very low-memory systems in the first place.)
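If you want to check whether the same pattern applies to you, the LDAP settings and the message log are both readable at the normal privilege level (whether the bind failures actually appear in /etc/messages may depend on your logging options):

    options ldap             # current ldap.* settings, including the bind account in use
    rdfile /etc/messages     # look for repeated LDAP bind / authentication failures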
I have looked at other systems where this bug is present, and there are some signs of it happening, but even on systems with 700+ days of uptime only an insignificant amount of leakage had occurred.
In general (with the caveat that diag-level commands are potentially dangerous, so they should be used with extreme caution and ideally not without talking to the vendor), we could identify the problem by looking at mem_stat: the second column is bytes, and the thing to look for is 'sasl'. I don't know at what level the problem actually bites - I'm waiting for the systems to crash again to check - but I would suggest that at over 5% memory utilisation you should be considering taking action. A reboot fixes it, as does a code update.
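As a concrete illustration (again, diag level, so only with the vendor's blessing), this is roughly how I'd capture it from an admin host, since the ONTAP console has no filtering of its own - the -q flag and the semicolon chaining over ssh are assumptions to verify on your release:

    ssh filer01 "priv set -q diag; mem_stat" | grep -i sasl    # filer01 is a placeholder hostname

The second column is bytes; once the sasl allocations are eating more than about 5% of system memory, that's the point to act.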
I'm now capturing CP types and memory footprint as part of my monitoring regime, so I can observe it occurring (a sketch of how is below), and I'm being a bit more proactive about spotting LDAP account lockouts.
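A sketch of the collection side, for anyone wanting to do the same - the hostname, log path and the ssh command chaining are all assumptions to adapt to your environment:

    #!/bin/sh
    # Append CP-type counters and any sasl memory allocations to a log.
    # Run from cron on an admin host with key-based ssh to the filer.
    FILER=filer01                       # placeholder hostname
    LOG=/var/log/netapp-cp-mem.log

    {
      date
      ssh "$FILER" "priv set -q advanced; wafl_susp -w" | grep cp_from
      ssh "$FILER" "priv set -q diag; mem_stat" | grep -i sasl
    } >> "$LOG"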