I use a redundant pair of OpenLDAP servers for PAM auth and directory services via NSS. It's been 100% reliable so far, but nothing runs flawlessly forever.
What steps should I take now so I have a fighting chance of recovering from failure of the LDAP server(s)? In my informal testing, it appears that even already authenticated shells are largely useless as all username/uid lookups hang until the directory server comes back.
So far I've come up with only two things:
- Do not use NSS-LDAP and PAM-LDAP on the LDAP servers themselves.
- Create a root-level account on all boxes that only accepts publickey authentication from our local subnet, and protect that key well. I'm not sure how much good this would do me: once I'm logged in, I suspect I wouldn't be able to accomplish anything because all the userid lookups would hang.
Any other suggestions?
Rule #1 of network-based authentication: Always have a local account available.
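For the local account to resolve without touching LDAP, NSS has to consult local files first. A minimal sketch of the relevant /etc/nsswitch.conf lines, assuming the LDAP service is registered under the name ldap (as nss_ldap does) and that your distribution uses these three databases:

```
# /etc/nsswitch.conf (sketch): local files are consulted before LDAP
passwd: files ldap
group:  files ldap
shadow: files ldap
```

With this ordering, a user defined in /etc/passwd is found locally, but lookups for names or UIDs that are not in the local files still go to the LDAP server.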
Beyond rule #1 (and in order to make it useful without getting blocked up behind nss_ldap trying to talk to a dead server):
Using pam_ldap/nss_ldap you can set the bind_policy to "soft" (return immediately on server failure), which eliminates the blocking problem. You can also set the timelimit values to make nss_ldap return if it can't contact the LDAP server. Note that this has other implications: a soft fail while you are SSHing in will make your LDAP account inaccessible, and sporadic failures will result in unknown usernames for some LDAP UIDs.
There are also some undocumented nss_ldap options:
nss_reconnect_tries, nss_reconnect_sleeptime, nss_reconnect_maxsleeptime, and nss_reconnect_maxconntries, which do what their names imply and will help you work around failures of your LDAP server without setting your bind_policy to soft. This is what I'm doing: 3 reconnect tries with a 10 second max sleep means a max 30 second delay waiting for the LDAP server.
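As a rough sketch, the two approaches described above might look like this in the nss_ldap configuration file (often /etc/ldap.conf, though the path varies by distribution). Only the 3 tries and the 10 second max sleep come from the description above; the timelimit numbers and the nss_reconnect_sleeptime value are illustrative assumptions, so check them against your own nss_ldap version before relying on them.

```
# Approach A: fail fast instead of blocking.
# Side effects: LDAP logins fail during an outage, and some UIDs
# may show up without usernames.
# The two time limits below are assumed example values, in seconds.
#bind_policy soft
#timelimit 10
#bind_timelimit 10

# Approach B: keep the hard bind policy but cap the reconnect loop.
bind_policy hard
nss_reconnect_tries 3
# Assumed initial sleep between tries, in seconds.
nss_reconnect_sleeptime 1
nss_reconnect_maxsleeptime 10
# nss_reconnect_maxconntries is left at its default here.
```

Pick one approach or the other. A quick way to check the effect is to stop the LDAP server on a test box and time a lookup such as `getent passwd` for an LDAP-only user.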