In advance: sorry for the length of the question... unable to get the right balance between detail and brevity.
We are having issues with the DB Server for our web application where queries that should (and usually do) run in a very short time (< 10ms) are, on random occasions, taking anywhere between 1 ~ 30 seconds to execute - with no obvious pattern. According to our profiler traces, some of them are even "do-nothing" queries, such as "exec sp_reset_connection"
(usually runs in 0ms; observed peaks of 3~6s), and "SET NO_BROWSETABLE ON"
, etc. Some examples are:
SELECT * FROM [Localisation].[TimeZoneRule] WHERE [Name] = 'EU'
Where TimeZoneRule
has about 500,000 rows in 5 columns. Has a surrogate primary key, and an index on Name
. Usually takes 0.97ms, peaks at 11s. Table is NEVER written to (was pre-populated prior to go-live). Profiler records it as taking 0-15 CPU, 18-25 Reads, 0-1 Writes (no idea why the write).
UPDATE [Core].[User] SET [LastUsed] = GETUTCDATE() WHERE Id = '<uid>'
Where User
has about 30,000 rows over about 10 columns (one of which is an Xml column). Id
is the clustered primary key. Table is regularly written to and read from. Usually takes 10~20ms, peaks at 26s. Profiler records it as taking 0 CPU, 15-36 Reads, 0-1 Writes.
INSERT INTO [Log].[Session] (ASPSessionId, Start, ClientAddress, ClientSoftware, ProxyAddress, ProxySoftware)
VALUES(<number>, GETUTCDATE(), '<ipv4address>', '<User agent string>', '<ipv4address>', '<proxy software name (if present)>')
Where Session
has about 1,000,000 rows in it over about 8 columns. Has a surrogate primary key (identity), and an index on ASPSessionId
. Table is regularly written to, but rarely read from (only by us directly from SSMS). Usually takes 15~150ms, peaks at 5s. I don't have the profile record for it at hand, but from memory, CPU is around 0, Reads and writes were between 0 and 100 each.
The setup we're using is a mirrored setup with a Dell 2950 as the principle (2 4-core xeon 2.6, 16Gb RAM), and a Dell 6850 as the mirror (4 HT Xeon 3.2, 8Gb RAM). Both running SQL 2005 SP4 64-bit. The database in question is not particularly large, around 16Gb in size. The primary has 6 SAS disks divided into 3 RAID-1 volumes; one for System + Page + TempDB, one for the database's MDF, and one for the transaction log + hourly log backup + daily DB backup. I know the log situation is far from the best - in terms of disk IO (see below), and data security.
So far, we think we've eliminated:
- The mirror. We separated the servers, and ran using one of them (and then switched to the other), but the performance issues remained.
- Blocking due to locks(*).
TimeZoneRule
is never written to and, by my reckoning, should never have an exclusive lock on it. Additionally, we've checked through the traces and on many occasions the "problem query" is the only one running - the only other activity is other connections disconnecting - Poor indexing. With the low figures for Reads and CPU, it would suggest that SQL Server is using indexes effectively.
- Disk IO. PerfMon indicates some odd figures for the data file drive (but only that drive) - whilst data read/ write rates rarely seem to exceed 32KB/s, the Current Disk Queue Length spikes to around 215 for durations of 2-5 seconds at approximately 45-60minute intervals with no fixed pattern. However, these do not correlate with times of poor query performance. The disk queue lengths for the other two drives [system + page + tempdb] and [log + backups] never exceeds 3.
(*) We've tried getting profiler to capture events related to lock acquisition, but the trace bloats to unreadable proportions and, worse, the web application grinds to a halt.
Not being DBAs, we're rapidly running out of ideas. Can anyone think of anything I should consider looking at next or anything I've stupidly missed?