Several hours ago, a handful of our member servers became unable to authenticate against the two domain controllers they should be using. The member servers and DCs are located in the same datacenter and are in their own "site" in AD. Running DCDiag shows no problems, and we've confirmed that the servers and DCs have network connectivity with each other. Running nslookup on the member servers shows the proper DC listed as the name server in each case.
LDAP authentication seems to be working; however, Kerberos authentication has stopped working. As a result, all of our key internal services have stopped.
Here are specifics on some of the problems we are having with member servers:
Exchange - Topology Service cannot find any domain controllers. Therefore, the Exchange Information Store cannot start.
SharePoint - Authentication is failing at the IIS level and between IIS and SQL (this farm has been up for multiple years).
Additional troubleshooting:
NLTEST /DCLIST:domainname - No DC can be found to get a DC List
NLTEST /Server:Servername - Both DCs Complete Successfully.
NLTEST /DSGetDC:Domain - Commands complete successfully.
NLTEST /dsgetsite - Completes successfully.
GPUpdate - User cannot be found. No domain exists
Output of nslookup -type=SRV _kerberos._tcp.dc._msdcs.subdomain.mydomain.com on the Exchange server:
Server: colo-dc-001.subdomain.mydomain.com
Address: 10.11.2.20
_kerberos._tcp.dc._msdcs.subdomain.mydomain.com SRV service location:
priority = 0
weight = 100
port = 88
svr hostname = branchf-dc-001.subdomain.mydomain.com
_kerberos._tcp.dc._msdcs.subdomain.mydomain.com SRV service location:
priority = 0
weight = 100
port = 88
svr hostname = colo-dc-001.subdomain.mydomain.com
_kerberos._tcp.dc._msdcs.subdomain.mydomain.com SRV service location:
priority = 0
weight = 100
port = 88
svr hostname = hq-dc-003.subdomain.mydomain.com
_kerberos._tcp.dc._msdcs.subdomain.mydomain.com SRV service location:
priority = 0
weight = 100
port = 88
svr hostname = colo-dc-002.subdomain.mydomain.com
_kerberos._tcp.dc._msdcs.subdomain.mydomain.com SRV service location:
priority = 0
weight = 100
port = 88
svr hostname = hq-dc-004.subdomain.mydomain.com
_kerberos._tcp.dc._msdcs.subdomain.mydomain.com SRV service location:
priority = 0
weight = 100
port = 88
svr hostname = branchc-dc-002.subdomain.mydomain.com
_kerberos._tcp.dc._msdcs.subdomain.mydomain.com SRV service location:
priority = 0
weight = 100
port = 88
svr hostname = branchm-dc-001.subdomain.mydomain.com
_kerberos._tcp.dc._msdcs.subdomain.mydomain.com SRV service location:
priority = 0
weight = 100
port = 88
svr hostname = branchs-dc-001.subdomain.mydomain.com
branchf-dc-001.subdomain.mydomain.com internet address = 10.10.2.22
colo-dc-001.subdomain.mydomain.com internet address = 10.11.2.20
hq-dc-003.subdomain.mydomain.com internet address = 10.1.2.20
colo-dc-002.subdomain.mydomain.com internet address = 10.11.2.21
hq-dc-004.subdomain.mydomain.com internet address = 10.1.2.21
branchc-dc-002.subdomain.mydomain.com internet address = 10.5.2.21
branchm-dc-001.subdomain.mydomain.com internet address = 10.6.2.21
branchs-dc-001.subdomain.mydomain.com internet address = 10.7.2.22
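For anyone comparing this kind of output across several member servers, here is a minimal Python sketch that parses nslookup SRV output of the shape shown above into records. The function name parse_srv_output and the shortened SAMPLE text are illustrative assumptions, not part of any tool; the parser only understands the line layout seen in this post and may need adjusting for other nslookup versions.

```python
import re

# Shortened copy of the output above, used only as sample input.
SAMPLE = """\
Server: colo-dc-001.subdomain.mydomain.com
Address: 10.11.2.20
_kerberos._tcp.dc._msdcs.subdomain.mydomain.com SRV service location:
priority = 0
weight = 100
port = 88
svr hostname = branchf-dc-001.subdomain.mydomain.com
_kerberos._tcp.dc._msdcs.subdomain.mydomain.com SRV service location:
priority = 0
weight = 100
port = 88
svr hostname = colo-dc-001.subdomain.mydomain.com
branchf-dc-001.subdomain.mydomain.com internet address = 10.10.2.22
colo-dc-001.subdomain.mydomain.com internet address = 10.11.2.20
"""

FIELD_RE = re.compile(r"^(priority|weight|port|svr hostname)\s*=\s*(\S+)$")
ADDR_RE = re.compile(r"^(\S+)\s+internet address\s*=\s*(\S+)$")


def parse_srv_output(text):
    """Parse nslookup SRV output into a list of dicts.

    Each dict may hold: priority, weight, port, host, address.
    Lines that match none of the known shapes are ignored.
    """
    records = []
    addresses = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("SRV service location:"):
            # A new SRV record starts here.
            current = {}
            records.append(current)
            continue
        addr = ADDR_RE.match(line)
        if addr:
            # Trailing "host internet address = ip" lines map hosts to IPs.
            addresses[addr.group(1)] = addr.group(2)
            current = None
            continue
        field = FIELD_RE.match(line)
        if field and current is not None:
            key, value = field.groups()
            if key == "svr hostname":
                current["host"] = value
            else:
                current[key] = int(value)
    # Attach the resolved address (if any) to each SRV record.
    for rec in records:
        rec["address"] = addresses.get(rec.get("host"))
    return records


if __name__ == "__main__":
    for rec in parse_srv_output(SAMPLE):
        print(rec.get("host"), rec.get("port"), rec.get("address"))
```

Running this against output captured on each affected server makes it easy to confirm that every server sees the same set of DCs in DNS, which helps separate a DNS problem from a connectivity problem.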
We can RDP to any of the servers that are hosting the above services, but the services will not work.
System logs on the member servers include some error messages about not being able to find a DC.
So basically, the network seems to be up, and the DCs seem to be up, but member servers right there on the same network segment can't find them. Where should we look for the problem?
I'd start looking at DNS. This really smells like DNS to me.
Does it look like things are missing from the _msdcs.domain.com forward-lookup zone?
If you run nslookup -type=SRV _kerberos._tcp.dc._msdcs.domain.com, what are you getting back for output?
Sniff the traffic on either the DCs or the member servers while you're running your failing diagnostic commands, and post the output here if the problem isn't glaringly obvious. The NLTEST /DCLIST:domain.com command, for example, should cause the client to emit some DNS queries looking for an LDAP server in its site, followed by a couple of RPC binds.
This issue was caused by a group policy change that was intended for end-user workstations but was mistakenly applied to some member servers. The group policy change enabled DirectAccess.
For our servers at the hosting facility, applying this policy caused those servers to conclude they were on an untrusted network. Thus, they enabled Windows Firewall, which prevented them from locating or communicating with our domain controllers.
We rolled back the changes applied by group policy, removed the member servers from the domain, and rejoined them to the domain. That fixed the problem.
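Since the root cause turned out to be a host firewall rather than DNS, a quick TCP probe from a member server to its DCs can help narrow down a similar situation. The following is a minimal Python sketch, not a definitive diagnostic. The port list (DNS 53, Kerberos 88, RPC endpoint mapper 135, LDAP 389, SMB 445) is an assumption about what a member server typically needs, and the names check_tcp and probe_dc are hypothetical. It covers TCP only, and a failed connection shows only that something is dropping traffic between the two hosts, not that Windows Firewall specifically is responsible. A successful connection likewise does not prove that DC location works.

```python
import socket

# Assumed set of TCP ports a member server commonly needs on a DC.
DC_PORTS = {
    "DNS": 53,
    "Kerberos": 88,
    "RPC endpoint mapper": 135,
    "LDAP": 389,
    "SMB": 445,
}


def check_tcp(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe_dc(host, ports=None, timeout=3.0):
    """Probe each named port on one DC and return {service name: reachable}."""
    ports = DC_PORTS if ports is None else ports
    return {name: check_tcp(host, port, timeout) for name, port in ports.items()}
```

For example, calling probe_dc("10.11.2.20") on an affected member server and again on a healthy one, then comparing the two result dictionaries, gives a rough picture of which services are unreachable from the broken host.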