I have four NetApp 2240-4 filer heads. They're the single-chassis 'cluster in a box' configuration, so that's two separate units with two heads each.
Over the last few days, at about the same time, all of them started logging a lot of low water mark consistency points.
Running wafl_susp -w gives me cp_from_low_water clocking up at a rate of 10/sec or more. Before this started, they were almost entirely cp_from_timer, at a rate of about one every 10 seconds.
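For reference, I'm pulling those numbers from the elevated-privilege console, roughly like this (the exact privilege level wafl_susp needs may differ on your release):

    priv set advanced        # wafl_susp isn't available at the normal admin level
    wafl_susp -w             # dump WAFL suspend / consistency point statistics
    priv set admin           # drop back to the normal privilege level

sysstat -x 1 also tags each CP with a type code as it happens ('T' for timer, 'L' for low water mark), which is a quick way to watch it live.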
Two of my boxes have become unresponsive and been rebooted, and on those the problem has now gone away again. I'm not 100% sure that's connected, but it seems a reasonable bet as to the culprit.
The other two are completely idle: they have a base OS and a couple of vfilers, and nothing else. Yet the low water mark CPs suggest they're running out of memory for some reason. I can only assume some sort of denial-of-service condition is occurring (failed SSH logins, perhaps?).
Can anyone offer any insight into how to troubleshoot this? Specifically, from a NetApp perspective, I'm looking for hints on how to work out what's hogging my memory.
Open a ticket. This is an indication that there's a lack of system memory, and if there's little work being done and you still had boxes go unresponsive, there's something screwy happening. I've walked through the process of inspecting internal memory usage before with support on the line, but it's not something clients are supposed to do on their own: you'd need to use a priv set command and check the running processes.
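If you do end up walking through it with support on the line, the general shape is a privilege escalation, some read-only inspection, then dropping straight back down - something like the following, with the diag level in particular only used when support asks for it:

    priv set advanced        # or 'priv set diag' if support directs you there
    # ...run whatever inspection commands support dictates...
    priv set admin           # return to the normal privilege level afterwards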
Case opened with the vendor regarding the problem. Low water mark CPs are the result of memory exhaustion: (Vendor link)
To work with the vendor, we ran a perfstat - a downloadable NetApp tool for gathering performance-related support information. This led us to bug ID 697790 (support login required), which was present on the version of code we were running and is fixed in ONTAP 8.2.3.
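Two quick checks worth doing alongside the perfstat: the running release (the fix is in 8.2.3) and how much physical memory the head actually has:

    version          # running Data ONTAP release; the fix for bug 697790 landed in 8.2.3
    sysconfig -a     # hardware summary, including installed system memory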
Specifically, it is a memory leak that occurs when LDAP authentication fails. Because all four hosts were using the same account, and because at some point the account lockout had tripped, they were all failing absurdly frequently. (And they are very low-memory systems in the first place.)
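If you want to check whether the same pattern applies to you, the LDAP settings and the message log are both readable at the normal privilege level (whether the bind failures actually appear in /etc/messages may depend on your logging options):

    options ldap             # current ldap.* settings, including the bind account in use
    rdfile /etc/messages     # look for repeated LDAP bind / authentication failures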
I have looked at other systems where this bug is present, and there are some signs of it happening, but even on systems with 700+ days of uptime only an insignificant amount of leakage had occurred.
In general (with the caveat that diag-level commands are potentially dangerous, so they should be used with extreme caution and ideally not without talking to the vendor), we could identify the problem by looking at mem_stat: the second column is bytes, and the thing to look for is 'sasl'. I don't know at what level the problem actually bites - I'm waiting for the systems to crash again to check - but I would suggest that at over 5% memory utilisation you should be considering taking action. A reboot fixes it, as does a code update.
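As a concrete illustration (again, diag level, so only with the vendor's blessing), this is roughly how I'd capture it from an admin host, since the ONTAP console has no filtering of its own - the -q flag and the semicolon chaining over ssh are assumptions to verify on your release:

    ssh filer01 "priv set -q diag; mem_stat" | grep -i sasl    # filer01 is a placeholder hostname

The second column is bytes; once the sasl allocations are eating more than about 5% of system memory, that's the point to act.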
I'm now capturing CP types and memory footprint as part of my monitoring regime, so I can observe it occurring (a sketch of how is below), and I'm being a bit more proactive about spotting LDAP account lockouts.
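A sketch of the collection side, for anyone wanting to do the same - the hostname, log path and the ssh command chaining are all assumptions to adapt to your environment:

    #!/bin/sh
    # Append CP-type counters and any sasl memory allocations to a log.
    # Run from cron on an admin host with key-based ssh to the filer.
    FILER=filer01                       # placeholder hostname
    LOG=/var/log/netapp-cp-mem.log

    {
      date
      ssh "$FILER" "priv set -q advanced; wafl_susp -w" | grep cp_from
      ssh "$FILER" "priv set -q diag; mem_stat" | grep -i sasl
    } >> "$LOG"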