Apologies in advance for the length of the question... I couldn't find the right balance between detail and brevity.
We are having issues with the DB server for our web application where queries that should (and usually do) run in a very short time (< 10ms) are, on random occasions, taking anywhere between 1~30 seconds to execute, with no obvious pattern. According to our profiler traces, some of them are even "do-nothing" queries, such as "exec sp_reset_connection" (usually runs in 0ms; observed peaks of 3~6s), "SET NO_BROWSETABLE ON", etc. Some examples are:
SELECT * FROM [Localisation].[TimeZoneRule] WHERE [Name] = 'EU'
Where TimeZoneRule has about 500,000 rows in 5 columns, with a surrogate primary key and an index on Name. Usually takes 0.97ms, peaks at 11s. The table is NEVER written to (it was pre-populated prior to go-live). Profiler records it as taking 0-15 CPU, 18-25 Reads, 0-1 Writes (no idea why the write).
UPDATE [Core].[User] SET [LastUsed] = GETUTCDATE() WHERE Id = '<uid>'
Where User has about 30,000 rows over about 10 columns (one of which is an Xml column). Id is the clustered primary key. The table is regularly written to and read from. Usually takes 10~20ms, peaks at 26s. Profiler records it as taking 0 CPU, 15-36 Reads, 0-1 Writes.
INSERT INTO [Log].[Session] (ASPSessionId, Start, ClientAddress, ClientSoftware, ProxyAddress, ProxySoftware)
VALUES(<number>, GETUTCDATE(), '<ipv4address>', '<User agent string>', '<ipv4address>', '<proxy software name (if present)>')
Where Session has about 1,000,000 rows in it over about 8 columns, with a surrogate primary key (identity) and an index on ASPSessionId. The table is regularly written to, but rarely read from (only by us, directly from SSMS). Usually takes 15~150ms, peaks at 5s. I don't have the profiler record for it at hand, but from memory, CPU is around 0, and Reads and Writes were between 0 and 100 each.
The setup we're using is a mirrored pair, with a Dell 2950 as the principal (2x 4-core Xeon 2.6GHz, 16GB RAM) and a Dell 6850 as the mirror (4x HT Xeon 3.2GHz, 8GB RAM), both running SQL 2005 SP4 64-bit. The database in question is not particularly large, around 16GB in size. The principal has 6 SAS disks divided into 3 RAID-1 volumes: one for System + Page + TempDB, one for the database's MDF, and one for the transaction log + hourly log backup + daily DB backup. I know the log situation is far from ideal, both in terms of disk IO (see below) and data security.
So far, we think we've eliminated:
- The mirror. We separated the servers, and ran using one of them (and then switched to the other), but the performance issues remained.
- Blocking due to locks (*). TimeZoneRule is never written to and, by my reckoning, should never have an exclusive lock on it. Additionally, we've checked through the traces, and on many occasions the "problem query" is the only one running; the only other activity is other connections disconnecting.
- Poor indexing. The low figures for Reads and CPU suggest that SQL Server is using indexes effectively.
- Disk IO. PerfMon indicates some odd figures for the data file drive (but only that drive): whilst data read/write rates rarely seem to exceed 32KB/s, the Current Disk Queue Length spikes to around 215 for durations of 2-5 seconds at approximately 45-60 minute intervals, with no fixed pattern. However, these spikes do not correlate with the times of poor query performance. The disk queue lengths for the other two drives [system + page + tempdb] and [log + backups] never exceed 3.
(*) We've tried getting Profiler to capture events related to lock acquisition, but the trace bloats to unreadable proportions and, worse, the web application grinds to a halt.
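Something we may try instead of a full lock trace (a sketch using the SQL 2005 DMVs, polled manually from SSMS while the problem is happening, rather than traced):

-- Show any requests that are currently blocked, and by whom.
-- sys.dm_exec_requests / sys.dm_exec_sql_text are available from SQL 2005;
-- requires VIEW SERVER STATE permission.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,          -- milliseconds spent waiting so far
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;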
Not being DBAs, we're rapidly running out of ideas. Can anyone think of anything I should consider looking at next or anything I've stupidly missed?
A few things that may or may not be helpful:
If this is happening to stored procedures, it could be parameter sniffing -> http://omnibuzz-sql.blogspot.com/2006/11/parameter-sniffing-stored-procedures.html
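If that turns out to be the cause, a common workaround is to force a fresh plan per execution, or to pin the plan to a representative value. A sketch only; the procedure and parameter names here are hypothetical, borrowing the table from the question:

-- Hypothetical procedure; only the table and column come from the question.
CREATE PROCEDURE dbo.GetTimeZoneRule
    @Name NVARCHAR(100)
AS
BEGIN
    SELECT *
    FROM [Localisation].[TimeZoneRule]
    WHERE [Name] = @Name
    OPTION (RECOMPILE);   -- recompile per call, avoiding a sniffed plan
    -- Or, instead of RECOMPILE, pin the plan to a representative value:
    -- OPTION (OPTIMIZE FOR (@Name = 'EU'));
END

Both hints are available in SQL 2005; RECOMPILE trades plan-cache reuse for a plan that fits each call's actual parameter.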
Are you using ASP for the web application? We had a similar issue, but it was related to ASP + IIS and SQL using stored procedures. I seem to remember it being semaphore timeouts that would cause this: it'd take 30+ seconds to run a query, but then everything was fine for a while. I can't find my info on it, but I seem to remember it being related to IIS timeouts, on the IIS side.
This tool might be helpful too -> http://blog.brianhartsock.com/2008/12/16/quick-and-dirty-sql-server-slow-query-log/
As you are running SQL 2005, you can take your SQL Profiler data and compare it against the PerfMon data to see if there is a correlation. Save your trace data and PerfMon data to files using the normal techniques, then open the trace within Profiler; one of the options in the File menu will be Import Performance Data. This lets you select a query and see what the counters were doing at that time (or close to it, depending on your PerfMon collection interval).
Disk queue spikes are never good, especially that high. What IO are you pushing to the disk when the queue gets that high? As a rule of thumb, you don't want a disk queue any higher than 2*n, where n is the number of disks in the array. Since you are using a 2-disk RAID 1, n=1 in your case (you only get the speed of a single disk), so anything above a queue length of 2 is a problem.
In PerfMon there are counters for seconds per read and seconds per write. What do these look like when the queries start taking a long time to run, and what about normally? (Anything over .02 seconds is bad.) What's the page life expectancy? (Anything under 300 seconds is usually bad, but this can vary.) What's the SQL Server buffer cache hit ratio? (Anything below ~97% is usually bad; I like mine above 99.9%.)
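For reference, the last two can also be pulled from the DMVs rather than PerfMon (a sketch; on a named instance the object_name prefix differs, hence the LIKE):

-- Page life expectancy and buffer cache hit ratio from the DMVs (SQL 2005+).
-- The hit ratio is only a percentage after dividing by its 'base' counter:
--   hit ratio % = 100.0 * 'Buffer cache hit ratio' / 'Buffer cache hit ratio base'
SELECT [object_name], [counter_name], [cntr_value]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Buffer Manager%'
  AND [counter_name] IN ('Page life expectancy',
                         'Buffer cache hit ratio',
                         'Buffer cache hit ratio base');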
Are you seeing database and/or log growth events? Such events would show up in ERRORLOG and in the performance counters.
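One way to check is the default trace, which is enabled out of the box on SQL 2005 and records auto-grow events. A sketch:

-- EventClass 92 = Data File Auto Grow, 93 = Log File Auto Grow.
DECLARE @path NVARCHAR(260);
SELECT @path = [path] FROM sys.traces WHERE is_default = 1;

SELECT te.name AS event_name, t.DatabaseName, t.FileName,
       t.StartTime, t.Duration   -- Duration is in microseconds in SQL 2005
FROM fn_trace_gettable(@path, DEFAULT) AS t
JOIN sys.trace_events AS te ON t.EventClass = te.trace_event_id
WHERE t.EventClass IN (92, 93)
ORDER BY t.StartTime DESC;

If growth events line up with your slow moments, pre-sizing the files and using fixed growth increments should remove the pauses.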
A couple of things to try; the most useful are Display Estimated Execution Plan and Include Actual Execution Plan in SSMS.
If you enable Include Actual Execution Plan before running the query, then after the query executes it will show you where the costs of the query were. Based on the costs, it's usually pretty easy to find out where it went wrong. If it's a SORT, that's a bad index. If it's building a hash table, that's a bad index or a bad join. There are all sorts of things that can go wrong that you might not even know happen during a simple SELECT * query.
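If you'd rather capture this from a script than via the SSMS buttons, the SET options below do the same job (a sketch, using the query from the question):

SET STATISTICS PROFILE ON;   -- actual plan, one row per operator
SET STATISTICS IO ON;        -- logical/physical reads per table
SET STATISTICS TIME ON;      -- parse/compile and execution times

SELECT * FROM [Localisation].[TimeZoneRule] WHERE [Name] = 'EU';

SET STATISTICS PROFILE OFF;
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;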
The second thing to try is to run the query through SQL Profiler (highlight the query, right-click, Trace Query in SQL Profiler). It will also find inefficiencies that can be eliminated.
That said, your queries are all pretty simple and don't point to database design flaws; still, it might at least give you an idea of where to go next (obviously you'll want to inspect it at a time when the execution takes longer than expected).
Another place to look, though it can be information overload at times, is SQL Server Profiler itself (which you mention you've already used, but did you catch it during the 26s execution?). You can watch everything the SQL Server does in (almost) real time. Depending on how long your window of this kind of activity is, you could have it standing by ready to run a trace; as soon as things start to get chuggy, enable the trace and see if there's something the SQL Server is waiting for.
Are you manually updating the statistics on your tables regularly? If they're out of date and the Auto Update Statistics option is set, then queries can pause whilst the stats are rebuilt.
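A quick way to check how stale they are, and to refresh them by hand (a sketch, using the TimeZoneRule table from the question):

-- When were the index statistics on this table last updated?
SELECT name AS index_name,
       STATS_DATE(object_id, index_id) AS stats_last_updated
FROM sys.indexes
WHERE object_id = OBJECT_ID('Localisation.TimeZoneRule');

-- Refresh them manually (sampled by default; add WITH FULLSCAN if needed):
UPDATE STATISTICS [Localisation].[TimeZoneRule];

-- Or refresh the whole database in one go:
EXEC sp_updatestats;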
As well as manually updating your stats, you could also consider enabling asynchronous statistics updates.
Here's the T-SQL:
ALTER DATABASE dbName SET AUTO_UPDATE_STATISTICS_ASYNC ON
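To confirm the setting afterwards (sys.databases exposes both flags from SQL 2005 onwards; 'dbName' is a placeholder for your database):

SELECT name,
       is_auto_update_stats_on,
       is_auto_update_stats_async_on
FROM sys.databases
WHERE name = 'dbName';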
Further reading:
http://msdn.microsoft.com/en-us/library/ms190397.aspx
I'm not convinced that this is the root cause of your problem - but it may be worth ruling it out.