We have a customer with a Windows Server 2016 domain controller. It's a small business so their server infrastructure consists of a Hyper-V host and this DC. The DC hosts file shares and Azure AD Connect for syncing identity with Office 365.
We monitor for event ID 4625 and have an alerting threshold to help us identify potential brute force attacks against the network.
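As a rough illustration of that kind of threshold alert, here is a minimal Python sketch. It is not the monitoring product actually in use; the function name `exceeds_threshold` and its `limit` and `window` parameters are hypothetical, and it assumes the timestamps of the 4625 events have already been collected from the Security log.

```python
from collections import deque
from datetime import datetime, timedelta


def exceeds_threshold(timestamps, limit, window):
    """Return True if more than `limit` events fall inside any span of length `window`.

    `timestamps` is an iterable of datetime objects, one per event ID 4625.
    The names and the sliding-window approach are illustrative assumptions.
    """
    recent = deque()
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop events that are older than `window` relative to the current event.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) > limit:
            return True
    return False
```

With a limit of 10 and a 5-minute window, for example, a burst of a dozen failures produced by one backup run would trip the alert even though no real attack is under way, which matches the situation described here.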
In October of last year we began receiving alerts that the failed logon alert threshold had been exceeded. Upon investigation we have the following description of the problem:
- The failed logon events occur organically whenever our VSS (Datto) backups run or whenever AADC syncs
- Backups succeed and AADC syncs normally with no errors
- There are two accounts that fail to log on:
  - SERVERNAME$ (i.e. the SYSTEM account) whenever the backups run
  - AAD_* whenever AADC runs
- The event STATUS is 0xC000006D - failed username or password
- The event SUB STATUS is 0x0 - Status OK
- The SYSTEM logon failure can be easily replicated by running `vssadmin list writers`
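To summarize symptoms like these across many events, the failures can be grouped by account and status code. The following is a minimal Python sketch under stated assumptions: the dictionary keys (`account`, `status`, `sub_status`) and the helper names are hypothetical, the events are assumed to have been exported from the Security log already, and the table covers only a few well-known NTSTATUS values rather than the full set.

```python
from collections import Counter

# A few well-known NTSTATUS codes seen in event 4625 (not an exhaustive list).
NTSTATUS = {
    0x00000000: "STATUS_SUCCESS",
    0xC0000064: "STATUS_NO_SUCH_USER",
    0xC000006A: "STATUS_WRONG_PASSWORD",
    0xC000006D: "STATUS_LOGON_FAILURE",
    0xC0000234: "STATUS_ACCOUNT_LOCKED_OUT",
}


def describe(code):
    """Map a hex string such as '0xC000006D' to its symbolic name."""
    return NTSTATUS.get(int(code, 16), "unknown")


def tally(events):
    """Count failed logons per (account, status, sub status) combination."""
    return Counter(
        (e["account"], describe(e["status"]), describe(e["sub_status"]))
        for e in events
    )
```

A tally dominated by the machine account and the AAD_* account, all with STATUS_LOGON_FAILURE and a sub status of STATUS_SUCCESS, points at a local service rather than at password guessing from outside.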
The list of troubleshooting steps tried over the last several months is long. This is not a comprehensive list:
- Uninstall RRAS and WID (including deleting WID folders to ensure permissions are set correctly when roles are reinstalled)
- Clearing SYSTEM credential cache (with psexec & `rundll32 keymgr.dll,KRShowKeyMgr` - no credentials cached)
- Logging in to WID with SQL Server Management Studio and verifying database permissions (for LOCALSERVICE and NETWORKSERVICE accounts)
- Adjusting permissions on various registry keys (this did get rid of unrelated CAPI2 & WIDWRITER errors in the application event log)
- Running DCDIAG & reviewing application event logs and clearing up any errors & warnings (including DNS warnings, adding SPNs, and reregistering AD DNS entries, and running a D4 authoritative restore of DFSR to clear up warnings from the server migration)
- Monitoring with sysinternals ProcessMonitor to identify any access denied or other errors (this got me on to adjusting permissions on folders & registry entries to make sure that both LOCALSERVICE and NETWORKSERVICE (the service accounts running the WIDWRITER & other VSS services) had access)
- Verifying service startup type and logon accounts for VSS services & writers, AADC, etc.
- Stopping services and running tests (stopping AADC sync service resolves all logon failures. That's how I narrowed it down to AADC)
- Uninstalling AADC
- Run a repair on SQL Express Local DB
- Calling Microsoft Support using our MS partnership support contract - who said "there's no loss of functionality and you don't have an actual user who can't log on, so we can't help you, sign up for Premier support!" (I'm sorry, does SYSTEM not count as an important user????)
- Banging my head against several walls and many other things
Some useful things learned during all of this:
- When uninstalling and reinstalling AADC while running `vssadmin list writers` continually during the installation process, the errors begin immediately after the SQL components are installed, before the installer has even finished running.
- When SSMS is installed and I log in to a database, I also get failed logon events for the domain admin account I logged in with, though my SSMS session seems unaffected.
The problem is clearly related to AADC because I can stop the AAD sync service or uninstall AADC and all failed logon events go away. But uninstalling AADC, deleting AADC folders, cleaning out AADC user accounts, and clearing AADC registry entries to try to get a truly fresh install has no effect; the errors return immediately when I reinstall AADC.
At this point I'm at my wits' end and I don't know what else to do or where else to even look. I'm hoping someone out there in the aether knows more than I do (likely) or has experienced this before and found a fix.
One final note - the server's DNS name is 9 characters long, meaning that it does not match its NETBIOS name. I don't think this is the cause, but if necessary I can rename the server. It's just a bit of a headache to do for an in-production DC & file server.
This problem originally began occurring in October of 2019. It took almost a year but I finally found a solution which hints at a possible explanation.
The solution was to configure the following registry key and value:
This resolved the failed logins whenever VSS ran against a SQL database.
This is part of a security function introduced in Windows Server 2003 called Loopback Check Functionality.
From what I've read about how Loopback Check Functionality works, what I believe is going on is that whenever VSS logs on to SQL to perform a backup, it logs on as SYSTEM. LSA expects the logon for SYSTEM to come from the server's DNS name, but the logon is actually coming from the server's NETBIOS name. Because the DNS name does not match the NETBIOS name in this case, LSA fails the Kerberos authentication and the login falls back to NTLM which accepts the NETBIOS name.
By configuring `BackConnectionHostNames` we tell LSA to accept the connection from both the NETBIOS and DNS names and Kerberos authentication succeeds.
I was able to trace the error by using Sysinternals ProcessMonitor to track down everything that VSS was doing when the errors occurred. I found VSS accessing folders located at `C:\Users\{AzureADConnect Account}\AppData\Local\Microsoft\Microsoft SQL Server Local DB\Instances\ADSync` where I found error.log files. These logs contained the following error:
This was the breakthrough I needed, since that error information led me to several locations, such as this SE question, which recommended disabling loopback checks entirely. Not wanting to disable a security feature, I continued searching until I found sources (1) and (2) that described how to configure Loopback Check Functionality without disabling it, by creating the registry entry for `BackConnectionHostNames` as I outlined above.
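For anyone wanting to script the change, here is a minimal Python sketch that builds the corresponding `reg add` command. It assumes the value lives under `HKLM\SYSTEM\CurrentControlSet\Control\Lsa\MSV1_0`, which is where Microsoft documents `BackConnectionHostNames` as a multi-string value. The function name and the host names in the usage note are hypothetical placeholders, not the values from this server.

```python
# Assumed location: the key Microsoft documents for BackConnectionHostNames.
LSA_MSV1_0 = r"HKLM\SYSTEM\CurrentControlSet\Control\Lsa\MSV1_0"


def back_connection_command(host_names):
    """Build a `reg add` argument list that sets BackConnectionHostNames.

    `host_names` should contain every name the server answers to locally,
    e.g. its NetBIOS name and its fully qualified DNS name.
    """
    if not host_names:
        raise ValueError("at least one host name is required")
    return [
        "reg", "add", LSA_MSV1_0,
        "/v", "BackConnectionHostNames",
        "/t", "REG_MULTI_SZ",
        # reg.exe separates REG_MULTI_SZ entries with a literal backslash-zero.
        "/d", r"\0".join(host_names),
        "/f",  # overwrite an existing value without prompting
    ]
```

On the server itself, something like `subprocess.run(back_connection_command(["SERVERNAME", "servername.example.local"]), check=True)` from an elevated session would apply it. Exporting the key first is a sensible precaution, and a service restart or reboot may be needed before the change takes effect.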