I have a SQL Server instance (SQL Server 2008 R2 on Windows Server 2008 R2) that, for short, random periods of about 15-20 seconds, complains that some of its I/O requests are taking longer than 15 seconds to complete ("SQL Server has encountered x occurrence(s) of I/O requests taking longer than 15 seconds to complete on file x"). The disks in question are on a SAN. In this kind of scenario you would typically see IOPS or throughput demand on the disk spike and produce the latency, which would suggest that the LUNs need to be beefed up to match the server's needs. In this case, however, there is no such spike. On the contrary, according to perfmon, activity on the affected disk drops from a steady state to almost nothing, and latency actually improves a good deal. (I should add that we've searched on the SQL Server side for evidence of any sudden burst of activity, to no avail. The nature of the workload is such that a sudden drop in server activity is not possible.) After each slow I/O incident there is a brief compensatory spike as requests catch up after the interruption.
The SAN folks have gone over everything with a fine-toothed comb (including the host configuration) and declare that nothing is wrong on their end. As it happens, this server runs both anti-virus (with proper file exclusions) and an encryption solution that operates as a file system driver, so I am naturally suspicious that one or both of these may be the source of the problem. But I'd like to be able to present a smoking gun when I call everyone into the sitting room to reveal the murderer. Other than consulting the vendors (which naturally we are doing), does anyone have suggestions for troubleshooting intermittent latency that may be caused by an application intercepting file system requests? Are there any tools or techniques that might show exactly what's slowing things down? I'm afraid that turning off the AV or the encryption to see what happens is a non-starter. Just to complicate matters, so far this problem cannot be reproduced on demand.