Chris Phillips

Asked: 2011-10-11 03:41:15 +0800 CST

nss_ldap failover with partially dead server

I've implemented a pair of OpenLDAP proxies running a few meta databases to merge and filter an AD DC cluster. I'm having issues with clients failing over between the boxes when one service is not functioning. One example: a central syslog server was down and slapd couldn't send logs to it, so although the TCP socket opened, the process would not respond because it was waiting to clear a backlog of syslog messages first. I'd have expected the clients to fail over to the second server, but they didn't, despite reporting that they were connecting to it:

2011-10-10T11:45:01.220367+01:00 gibsvlin-mkt-product worker_nscd: nss_ldap: could not search LDAP server - Server is unavailable
2011-10-10T11:45:01.231725+01:00 gibsvlin-mkt-product worker_nscd: nss_ldap: could not search LDAP server - Server is unavailable
2011-10-10T11:45:01.235354+01:00 gibsvlin-mkt-product worker_nscd: nss_ldap: could not search LDAP server - Server is unavailable
2011-10-10T11:45:01.242156+01:00 gibsvlin-mkt-product worker_nscd: nss_ldap: reconnected to LDAP server ldap://10.5.10.117:389/
2011-10-10T11:45:01.248505+01:00 gibsvlin-mkt-product worker_nscd: nss_ldap: could not search LDAP server - Server is unavailable

So we see the "reconnected to" message for the backup server, yet the client never seems to get anything out of it. If I suspend the VM that is misbehaving, so that no TCP connectivity is possible, then everything fails over nicely.
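
To tell the two failure modes apart from a client, a probe has to go one step further than opening the socket. The following is only a minimal sketch, assuming Python 3 on the client and a proxy that answers an anonymous LDAPv3 bind with some reply; the function name `probe_ldap` and the status strings are made up for illustration and are not part of nss_ldap:

```python
import socket

# LDAPv3 anonymous simple bind request, BER-encoded (messageID 1).
ANON_BIND = bytes.fromhex("300c020101600702010304008000")


def probe_ldap(host, port=389, timeout=2.0):
    """Classify one LDAP endpoint as 'unreachable', 'hung', 'responding' or 'bad-reply'.

    Hypothetical helper: it only checks that *some* LDAP message comes back
    in time, not that the bind succeeded.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Connection refused, no route, or the connect itself timed out.
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            sock.sendall(ANON_BIND)
            reply = sock.recv(64)
        except socket.timeout:
            # TCP accepted the connection but nothing answered: the
            # "partially dead" case described above.
            return "hung"
        except OSError:
            return "unreachable"
    # Every LDAP message is a BER SEQUENCE, so the first byte is 0x30.
    if reply[:1] == b"\x30":
        return "responding"
    return "bad-reply"


if __name__ == "__main__":
    # Addresses taken from the uri line of the ldap.conf in this question.
    for host in ("10.3.110.117", "10.5.10.117"):
        print(host, probe_ldap(host))
```

A server that is suspended entirely should come back as "unreachable", while one that accepts the connection but is blocked (as with the syslog backlog) should come back as "hung".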

It feels like there is a subtlety in the failover logic that would sort this out, but I've been unable to tune it, assuming tuning is the solution. The matching client ldap.conf reads:

scope sub
ldap_version 3
nss_base_passwd          dc=domain,dc=local?sub?&(uidNumber=*)
nss_base_group          dc=domain,dc=local?sub?&(gidNumber=*)
nss_initgroups_ignoreusers root,ldap,dbus,xfs,haldaemon,nscd,nocpulse
bind_timelimit 1
timelimit 5
idle_timelimit 5
nss_reconnect_tries 3
nss_reconnect_sleeptime 1
nss_reconnect_maxconntries 3
bind_policy soft
uri ldap://10.3.110.117:389/ ldap://10.5.10.117:389/
base dc=bwinparty,dc=local
nss_initgroups backlink
pam_login_attribute uid
ssl no

It seems strange that while nscd etc. are connecting to the second server, we cannot log in. Previously bind_timelimit and timelimit were both 5, which suggested to me that if the bind used up all of its time, there was no time left to do anything else within the overall timelimit window. Changing that made no noticeable improvement, though.


0 Answers