Since a site-wide upgrade to Windows 7 on the desktop, I've started having a problem with virus checking, specifically when doing a rename operation on a (filer-hosted) CIFS share. The virus checker seems to be triggering a set of messages on the filer:
[filerB: auth.trace.authenticateUser.loginTraceIP:info]: AUTH: Login attempt by user server-wk8-r2$ of domain MYDOMAIN from client machine 10.1.1.20 (server-wk8-r2).
[filerB: auth.dc.trace.DCConnection.statusMsg:info]: AUTH: TraceDC- attempting authentication with domain controller \\MYDC.
[filerB: auth.trace.authenticateUser.loginRejected:info]: AUTH: Login attempt by user rejected by the domain controller with error 0xc0000199: STATUS_NOLOGON_WORKSTATION_TRUST_ACCOUNT.
[filerB: auth.trace.authenticateUser.loginTraceMsg:info]: AUTH: Delaying the response by 5 seconds due to continuous failed login attempts by user server-wk8-r2$ of domain MYDOMAIN from client machine 10.1.1.20.
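To get a feel for how often this happens and from which clients, the trace lines above can be tallied with a small script. This is only a rough sketch: the regular expressions are written against the four messages quoted here, and the names (`summarise`, `SAMPLE`) are my own, not anything the filer provides.

```python
import re
from collections import Counter

IP = r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})"

# Matches the initial "Login attempt by user X of domain Y from client machine Z" line.
LOGIN_RE = re.compile(
    r"Login attempt by user (?P<user>\S+) of domain (?P<domain>\S+) "
    r"from client machine " + IP
)
# Matches the "Delaying the response by N seconds ..." line.
DELAY_RE = re.compile(
    r"Delaying the response by (?P<secs>\d+) seconds .* by user (?P<user>\S+) "
    r"of domain (?P<domain>\S+) from client machine " + IP
)

def summarise(lines):
    """Return (login attempts per (user, ip), total delay seconds per ip)."""
    attempts = Counter()
    delayed = Counter()
    for line in lines:
        m = DELAY_RE.search(line)
        if m:
            delayed[m.group("ip")] += int(m.group("secs"))
            continue
        m = LOGIN_RE.search(line)
        if m:
            attempts[(m.group("user"), m.group("ip"))] += 1
    return attempts, delayed

# Sample input copied from the messages quoted above.
SAMPLE = [
    "[filerB: auth.trace.authenticateUser.loginTraceIP:info]: AUTH: Login attempt by user server-wk8-r2$ of domain MYDOMAIN from client machine 10.1.1.20 (server-wk8-r2).",
    "[filerB: auth.trace.authenticateUser.loginRejected:info]: AUTH: Login attempt by user rejected by the domain controller with error 0xc0000199: STATUS_NOLOGON_WORKSTATION_TRUST_ACCOUNT.",
    "[filerB: auth.trace.authenticateUser.loginTraceMsg:info]: AUTH: Delaying the response by 5 seconds due to continuous failed login attempts by user server-wk8-r2$ of domain MYDOMAIN from client machine 10.1.1.20.",
]

if __name__ == "__main__":
    attempts, delayed = summarise(SAMPLE)
    for (user, ip), count in attempts.items():
        print(user, ip, count, "attempt(s),", delayed[ip], "s delayed")
```

Feeding it the filer's messages file instead of `SAMPLE` should show which machine accounts are being throttled, assuming the log format matches the lines above.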
This seems to trigger specifically on a rename, so what we think is going on is that the virus checker sees a 'new' file and tries to do an on-access scan. The virus checker, previously running as LocalSystem and thus sending null as its authentication request, now looks rather like a DoS attack and causes the filer to temporarily blacklist the client.
This 5s lockout on each 'access attempt' is a minor nuisance most of the time, but really quite significant for some operations, e.g. large file transfers, where every file takes 5s.
Having done some digging, this seems to be related to NTLM authentication:
Symptoms
Error message:
System error 1808 has occurred.
The account used is a computer account. Use your global user account or local user account to access this server.
A packet trace of the failure will show the error as:
STATUS_NOLOGON_WORKSTATION_TRUST_ACCOUNT (0xC0000199)
Cause
Microsoft has changed the functionality of how a Local System account identifies itself
during NTLM authentication. This only impacts NTLM authentication. It does not impact
Kerberos Authentication.
Solution
On the host, please set the following group policy entry and reboot the host.
Network Security: Allow Local System to use computer identity for NTLM: Disabled
Defining this group policy makes Windows Server 2008 R2 and Windows 7 function like Windows Server 2008 SP1.
So we've now got a couple of workarounds, neither of which is particularly nice: one is to change this security option; the other is to disable virus checking, or otherwise exempt part of the infrastructure.
And here's where I come to my request for assistance from ServerFault - what is the best way forwards? I lack Windows experience to be sure of what I'm seeing.
I'm not entirely sure why NTLM is part of this picture in the first place - I thought we were using Kerberos authentication. I'm not sure how to start diagnosing or troubleshooting this. (We are going cross domain - workstation machine accounts are in a separate AD and DNS domain to my filer. Normal user authentication works fine however.)
And failing that, can anyone suggest other lines of enquiry? I'd like to avoid a site wide security option change, or if I do go that way I'll need to be able to supply detailed reasoning. Likewise - disabling virus checking works as a short term workaround, and applying exclusions may help... but I'd rather not, and don't think that solves the underlying problem.
EDIT: Filers in AD ldap have SPNs for:
nfs/host.fully.qualified.domain
nfs/host
HOST/host.fully.qualified.domain
HOST/host
(Sorry, have to obfuscate those).
Could it be that without a 'cifs/host.fully.qualified.domain' it's not going to work? (or some other SPN? )
Edit: As part of the searching I've been doing I've found: http://itwanderer.wordpress.com/2011/04/14/tread-lightly-kerberos-encryption-types/
Which suggests that several encryption types were disabled by default in Win7/2008R2. This might be pertinent, as we've definitely had a similar problem with Kerberized NFSv4. There is a hidden option which may help some future Kerberos users: options nfs.rpcsec.trace on (This hasn't given me anything yet though, so may just be NFS specific).
Edit: Further digging has me tracking it back to cross domain authentication. It looks like my Windows 7 workstation (in one domain) is not getting Kerberos tickets for the other domain, in which my NetApp filer is CIFS joined. I've done this separately against a standalone server (Win2003 and Win2008) and didn't get Kerberos tickets for those either.
Which means I think Kerberos might be broken, but I've no idea how to troubleshoot further.
Edit: A further update: It looks like this may be down to Kerberos tickets not being issued cross domain. This then triggers NTLM fallback, which then runs into this problem (since Windows 7). First port of call will be to investigate the Kerberos side of things, but in neither case do we have anything pointing at the filer being the root cause. As such, as the storage engineer, it's out of my hands.
However, if anyone can point me in the direction of troubleshooting Kerberos spanning two Windows AD domains (Kerberos Realms) then that would be appreciated.
Options we're going to be considering for resolution:
- Amend policy option on all workstations via GPO (as above).
- Talking to AV vendor about the rename triggering scanning.
- Talking to AV vendor regarding running AV as service account.
- investigating Kerberos authentication (why it's not working, whether it should be).
I would modify your antivirus policies to not scan files shared over the network. You could potentially have a dozen clients trying to AV scan the same file across the network simultaneously.
So in Windows 2000, 2003, Windows XP, Vista, and 2008, the default behavior is this:
But in Windows 7 and 2008 R2 and above, the default behavior was changed to this:
Source: http://technet.microsoft.com/en-us/library/jj852275.aspx
You say that you'd like to avoid a site-wide security option change, but you already made one when you upgraded all clients to Windows 7.
As for why you aren't using Kerberos in the first place, that is an entirely different question that you've not given us enough data to be able to answer. For Kerberos to work, the CIFS service needs a trust relationship with the domain and registered service principal names, and the client must address the service with hostname or FQDN, not IP address.
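The "hostname or FQDN, not IP address" part is easy to check mechanically. Here is a minimal sketch that pulls the server name out of a UNC path and reports whether it is an IPv4/IPv6 literal. The function names are made up for illustration, and passing this check is necessary rather than sufficient for Kerberos to be used.

```python
import ipaddress

def unc_host(path):
    r"""Return the server part of a UNC path such as \\filer\share\dir."""
    if not path.startswith("\\\\"):
        raise ValueError("not a UNC path: %r" % path)
    return path[2:].split("\\", 1)[0]

def addressed_by_ip(path):
    """True if the UNC path names the server by IP address rather than by name."""
    host = unc_host(path)
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal, so a hostname or FQDN
    return True

if __name__ == "__main__":
    for p in (r"\\10.1.1.20\share", r"\\filerB\share\file.txt"):
        print(p, "-> IP address" if addressed_by_ip(p) else "-> hostname")
```

If the clients map the share by IP address, NTLM is what you would expect to see regardless of SPNs or trusts.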
Are your Filers domain joined? If so, do they have CIFS/* SPNs?
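If you can dump the filer's servicePrincipalName values, a quick check along these lines shows which service classes are registered for it. This is a sketch only: the helper names are hypothetical, and the `host_alias_counts` switch reflects my understanding that AD's default SPN mappings let a HOST/ entry stand in for cifs/, which you should verify in your own directory.

```python
def service_classes(spns, host):
    """Return the lower-cased service classes registered for `host`.

    `host` may be an FQDN; the short name (first DNS label) is matched too.
    """
    wanted = host.lower()
    short = wanted.split(".", 1)[0]
    found = set()
    for spn in spns:
        svc, sep, target = spn.partition("/")
        if not sep:
            continue  # not in service/host form
        # Drop an optional /service-name or :port suffix before comparing.
        target = target.split("/", 1)[0].split(":", 1)[0].lower()
        if target in (wanted, short):
            found.add(svc.lower())
    return found

def has_cifs_spn(spns, host, host_alias_counts=True):
    """True if a cifs/ SPN (or, optionally, a HOST/ SPN) exists for `host`."""
    classes = service_classes(spns, host)
    return "cifs" in classes or (host_alias_counts and "host" in classes)

if __name__ == "__main__":
    # The (obfuscated) SPNs listed in the question.
    spns = [
        "nfs/host.fully.qualified.domain",
        "nfs/host",
        "HOST/host.fully.qualified.domain",
        "HOST/host",
    ]
    print(sorted(service_classes(spns, "host.fully.qualified.domain")))
    print(has_cifs_spn(spns, "host.fully.qualified.domain", host_alias_counts=False))
```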
I've come to the end of the run on this one, and now know why it's happening.
In summary:
Since Windows 7/2008 R2 the default behaviour for 'LocalSystem' on a client machine changed. Where before it would use a 'null' login, it now uses the machine account for NTLM.
Because we are going between two AD forests, Kerberos isn't being used. This is by design. http://technet.microsoft.com/en-us/library/cc960648.aspx "Kerberos authentication uses transparent transitive trust among domains in a forest, but it cannot authenticate between domains in separate forests"
Sophos is scanning files 'on access' which is triggered by a rename. For security policy reasons, this includes network drives.
Because Sophos is running as LocalSystem, it's presenting the machine account via NTLM to the filer. This account is then rejected, with STATUS_NOLOGON_WORKSTATION_TRUST_ACCOUNT and after 10 retries the filer triggers a lockout.
Because of this lockout, subsequent virus scan attempts will stall for 5s per attempt. This is the root of our problem, because our process copies and renames hundreds of files, and after the 10th, each will take 5s.
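To put a number on that last point, here is a back-of-envelope sketch. It assumes one rejected scan attempt per file, that the failure counter never resets during the run, and that the first 10 attempts are not delayed; the function name and defaults are mine, taken from the 10-retry and 5s figures above.

```python
def estimated_stall_seconds(files, free_attempts=10, delay_per_attempt=5):
    """Rough total stall for a batch of files once the filer starts delaying."""
    throttled = max(0, files - free_attempts)  # attempts made after the lockout kicks in
    return throttled * delay_per_attempt

if __name__ == "__main__":
    for n in (5, 10, 100, 500):
        secs = estimated_stall_seconds(n)
        print(n, "files:", secs, "s (about", round(secs / 60, 1), "min)")
```

Under those assumptions a batch of 500 files picks up 2450 seconds of stalls, roughly 40 minutes, which matches the scale of slowdown we were seeing.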
This leaves us with solutions of:
Amend the security policy option as mentioned above: Network Security: Allow Local System to use computer identity for NTLM: Disabled
Apply an exclusion in the virus checker for network drives
Merge your separate domain into the same forest, so that Kerberos works. (Another option is outlined here: http://xitnotes.wordpress.com/2012/03/29/kerberos-in-an-active-directory-forest-trust-vs-external-trust/ and involves upgrading the relationship between the domains such that Kerberos works again.)
Use a vfiler, and CIFS join it to the other domain.
There is also an option on the filer to up the number of retries before this lockout occurs - it's a hidden option, and I don't have precise syntax handy.