I've implemented a pair of OpenLDAP proxies, each running a few meta databases, to merge and filter an AD DC cluster. I'm having a few issues with clients failing over between the boxes when one service is not functioning. One example: a central syslog server was down and slapd couldn't send logs to it, so although the TCP socket opened, the process would not respond because it was waiting to clear a backlog of syslog messages first. I'd have expected the clients to fail over to the second server, but they didn't, despite reporting that they were connecting to it:
2011-10-10T11:45:01.220367+01:00 gibsvlin-mkt-product worker_nscd: nss_ldap: could not search LDAP server - Server is unavailable
2011-10-10T11:45:01.231725+01:00 gibsvlin-mkt-product worker_nscd: nss_ldap: could not search LDAP server - Server is unavailable
2011-10-10T11:45:01.235354+01:00 gibsvlin-mkt-product worker_nscd: nss_ldap: could not search LDAP server - Server is unavailable
2011-10-10T11:45:01.242156+01:00 gibsvlin-mkt-product worker_nscd: nss_ldap: reconnected to LDAP server ldap://10.5.10.117:389/
2011-10-10T11:45:01.248505+01:00 gibsvlin-mkt-product worker_nscd: nss_ldap: could not search LDAP server - Server is unavailable
So we see the "reconnected to" message for the backup server, yet the client never seems to get anything out of it. If I suspend the VM that is misbehaving, so that no TCP connectivity is possible at all, then everything fails over nicely.
It feels like there is a subtlety in the failover logic that would sort this out, but I've been unable to tune it, assuming tuning is the solution. The matching client ldap.conf reads:
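To make the two failure modes concrete, here is a rough Python sketch of a probe that separates "no TCP connectivity" from "TCP connects but slapd never answers". It is not part of my setup: the names probe and first_responsive are made up for illustration, the timeouts are arbitrary, and it treats any reply to an anonymous LDAPv3 bind as a sign of life without checking the result code.

```python
import socket

# BER-encoded anonymous LDAPv3 simple bind request, message ID 1.
ANON_BIND = bytes.fromhex("300c020101600702010304008000")


def probe(host, port=389, timeout=1.0):
    """Return 'unreachable', 'hung' or 'ok' for one LDAP server.

    'unreachable': the TCP connection could not be established (or was dropped).
    'hung':        TCP connected, but no reply arrived within the timeout.
    'ok':          the server sent back some bytes in response to the bind.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            sock.sendall(ANON_BIND)
            reply = sock.recv(64)
        except socket.timeout:
            return "hung"
        except OSError:
            return "unreachable"
    # An empty read means the peer closed the connection without answering.
    return "ok" if reply else "unreachable"


def first_responsive(servers, timeout=1.0):
    """Return the first (host, port) pair that answers, or None."""
    for host, port in servers:
        if probe(host, port, timeout) == "ok":
            return (host, port)
    return None
```

Run against the two proxies from the uri line, e.g. first_responsive([("10.3.110.117", 389), ("10.5.10.117", 389)]), I would expect the stuck slapd to show up as 'hung' rather than 'unreachable', which is exactly the case where the clients did not fail over.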
scope sub
ldap_version 3
nss_base_passwd dc=domain,dc=local?sub?&(uidNumber=*)
nss_base_group dc=domain,dc=local?sub?&(gidNumber=*)
nss_initgroups_ignoreusers root,ldap,dbus,xfs,haldaemon,nscd,nocpulse
bind_timelimit 1
timelimit 5
idle_timelimit 5
nss_reconnect_tries 3
nss_reconnect_sleeptime 1
nss_reconnect_maxconntries 3
bind_policy soft
uri ldap://10.3.110.117:389/ ldap://10.5.10.117:389/
base dc=bwinparty,dc=local
nss_initgroups backlink
pam_login_attribute uid
ssl no
It seems strange that whilst nscd etc. are connecting to the second server, we cannot log in. Previously bind_timelimit and timelimit were both 5, which suggested to me that if the bind used up all of its time, there was no time left to do anything else within the overall timelimit window. Lowering bind_timelimit to 1 brought no noticeable improvement, though.