Related to: Current wisdom on SQL Server and Hyperthreading
Recently we upgraded our Windows 2008 R2 database server from an X5470 to an X5560. In theory both CPUs have very similar performance; if anything, the X5560 is slightly faster.
However, SQL Server 2008 R2 performance has been poor over the last day or so, and CPU usage has been high.
Page life expectancy is massive and we are getting an almost 100% cache hit ratio for pages, so memory is not the problem.
When I ran:
SELECT *
FROM sys.dm_os_wait_stats
ORDER BY signal_wait_time_ms DESC;
I got:
wait_type                     waiting_tasks_count  wait_time_ms  max_wait_time_ms  signal_wait_time_ms
----------------------------  -------------------  ------------  ----------------  -------------------
XE_TIMER_EVENT                115166               2799125790    30165             2799125065
REQUEST_FOR_DEADLOCK_SEARCH   559393               2799053973    5180              2799053973
SOS_SCHEDULER_YIELD           152289883            189948844     960               189756877
CXPACKET                      234638389            2383701040    141334            118796827
SLEEP_TASK                    170743505            1525669557    1406              76485386
LATCH_EX                      97301008             810738519     1107              55093884
LOGMGR_QUEUE                  16525384             2798527632    20751319          4083713
WRITELOG                      16850119             18328365      1193              2367880
PAGELATCH_EX                  13254618             8524515       11263             1670113
ASYNC_NETWORK_IO              23954146             6981220       7110              1475699

(10 row(s) affected)
I also ran:
-- Isolate top waits for server instance since last restart or statistics clear
WITH Waits AS (
SELECT
wait_type,
wait_time_ms / 1000. AS [wait_time_s],
100. * wait_time_ms / SUM(wait_time_ms) OVER() AS [pct],
ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS [rn]
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('CLR_SEMAPHORE','LAZYWRITER_SLEEP','RESOURCE_QUEUE',
'SLEEP_TASK','SLEEP_SYSTEMTASK','SQLTRACE_BUFFER_FLUSH','WAITFOR','LOGMGR_QUEUE',
'CHECKPOINT_QUEUE','REQUEST_FOR_DEADLOCK_SEARCH','XE_TIMER_EVENT','BROKER_TO_FLUSH',
'BROKER_TASK_STOP','CLR_MANUAL_EVENT','CLR_AUTO_EVENT','DISPATCHER_QUEUE_SEMAPHORE',
'FT_IFTS_SCHEDULER_IDLE_WAIT','XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN'))
SELECT W1.wait_type,
CAST(W1.wait_time_s AS DECIMAL(12, 2)) AS wait_time_s,
CAST(W1.pct AS DECIMAL(12, 2)) AS pct,
CAST(SUM(W2.pct) AS DECIMAL(12, 2)) AS running_pct
FROM Waits AS W1
INNER JOIN Waits AS W2 ON W2.rn <= W1.rn
GROUP BY W1.rn, W1.wait_type, W1.wait_time_s, W1.pct
HAVING SUM(W2.pct) - W1.pct < 95; -- percentage threshold
And got
wait_type            wait_time_s  pct    running_pct
-------------------  -----------  -----  -----------
CXPACKET             554821.66    65.82  65.82
LATCH_EX             184123.16    21.84  87.66
SOS_SCHEDULER_YIELD  37541.17     4.45   92.11
PAGEIOLATCH_SH       19018.53     2.26   94.37
FT_IFTSHC_MUTEX      14306.05     1.70   96.07
That shows huge amounts of time spent synchronizing parallel queries (high CXPACKET). Additionally, anecdotally, many of these problem queries are being executed across multiple cores (we have no MAXDOP hints anywhere in our code).
The server has not been under load for more than a day or so. We are experiencing a large amount of variance in query execution times: many queries appear to be slower than they were on our previous DB server, and CPU is really high.
Will disabling Hyperthreading help reduce our CPU usage and increase throughput?
I agree that it looks like we should be tuning two things:
MAXDOP (Maximum Degree of Parallelism). Everything I read indicates that leaving this unbounded is probably a bad idea, and the Microsoft documentation says anything higher than 8 is not generally recommended, so I set it to 4 for now. It was zero (unbounded) initially.
Cost Threshold for Parallelism. The default of 5 here is considered pretty low according to a few SQL MVP posts I've found; we can tune it up to reduce how much parallelism is even attempted by the scheduler.
But honestly these feel like workarounds; I think the true solution for our workload (full-text index heavy) is to disable HT.
I still feel that testing your specific workload, as per the original answer, is the only way to be sure. It's not an ideal answer when you're trying to tune a production system (so I'd ask if it was possible to get an identical testbed in systems where both performance and availability really matter) but it's the only one I'm really comfortable with.
We can talk about the theory of whether Hyperthreading should hurt or improve things in general (I find it more likely to hurt than help on servers, so for a "generic" deployment I'd probably disable it), but there is only one way to see for sure if it's going to make a difference in your specific case, and that is to try it and see.
Anandtech found that with a pure read load it hurt a little, and with a write-heavy load it was a bit of a win. I've not seen anything to make me think it is going to get you a hit much worse than -5%, or a win much better than 15%. Note that with an Atom it is a huge win, but that is a very odd CPU.
All you changed was the CPU? You went from 12 MB of cache and 4 threads (3 MB of cache per thread) to 8 MB of cache and 8 threads (1 MB per thread). Now, that is oversimplifying, but I bet that is what is killing you: you used to run queries in cache, and now run them from RAM because they need more than 1 MB but less than 3 MB. Turning off HT will probably help, but I'd go back to the old CPU. Turn off HT and you get 2 MB per thread, but if your workload thrashes with that much, it will not help. It may well be that the old 12 MB cache CPU is hugely faster for your workload.
I would try turning HT off and see if that is an improvement, but I suspect that cache is king for your workload, and you may well need to go back to the 12 MB chip.
Hyperthreading is, at best, just a way of abstracting task switching away from the operating system and placing it on-die, with direct access to the L1 and L2 cache, which makes task switching a crapload faster.
Testing with VMware has indicated that disabling HT made no discernible difference under standard load, and gave a 5% increase under heavy load, because ESXi is smart enough to know the difference between the "real" thread and the "fake" thread (there's a lot more to it than that, but that's it in layman's terms). SQL Server 2005 isn't quite that smart, but combined with an up-to-date operating system there should be little advantage to disabling HT.
All that said, I agree with Ronald that it's most likely going to be your L2 cache. A 33% drop in cache size is substantial, and when we spec our SQL Servers we go for cache over raw clock speed every time.
Based on my experience, HT was making I/O operations take forever on one of my active nodes on a Windows 2008 R2 cluster (running SQL Server 2008 R2). An interesting fact was that it was reflected neither in the wait stats nor in the pssdiag I ran for Microsoft support.
The way I noticed the low I/O was by watching the OS counters for physical disk. As Sam pointed out, I wrote about it here and here.
If you do NOT experience I/O problems and are CPU bound I suggest you start this way:
Pinpoint which processes and T-SQL blocks are causing the most CPU utilization. In our experience, after we fixed the I/O problem (by turning HT off), we identified code that was performing horribly on 2008 R2 and doing fine on 2005. I wrote about it here.
While under high load, run Adam Machanic's sp_whoisactive. You can download it from here. We were experiencing very high CPU utilization due to an excessive number of logical reads (20 million per query) caused by a really bad plan. Our processes were performing anti semi joins against partitioned tables.
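For reference, that kind of capture during a CPU spike looks roughly like this (a sketch; @get_plans and @sort_order are standard sp_WhoIsActive parameters, and the sort choice here is just illustrative):

```sql
-- Snapshot current activity, pull the query plans, and sort the
-- heaviest CPU consumers to the top.
EXEC sp_WhoIsActive
    @get_plans  = 1,
    @sort_order = '[CPU] DESC';
```

Run it a few times during the spike; the sessions that stay at the top with huge reads are your candidates.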
My next recommendation is to run Profiler to identify T-SQL statements that are high in both CPU and logical reads.
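If you'd rather not attach a Profiler trace to a struggling production box, the plan cache DMVs give a rough equivalent (a sketch; it only sees statements still in cache, so numbers reset on recompile or restart):

```sql
-- Top cached statements by total CPU, with logical reads alongside.
SELECT TOP (20)
    qs.total_worker_time / 1000 AS total_cpu_ms,
    qs.total_logical_reads,
    qs.execution_count,
    SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
        ((CASE qs.statement_end_offset
              WHEN -1 THEN DATALENGTH(st.text)
              ELSE qs.statement_end_offset
          END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```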
With the steps above we were able to tune the offending processes and go from 85% sustained CPU utilization to almost nil.
Good Luck and please feel free to drop me a line if you find a fix as I would like to add the case to my blog.
Thanks
Oscar
Whether HT is good or bad is hard to pin down.
In my experience and reading, it really does depend on the server load pattern: when HT affects performance it does so badly; otherwise you don't notice it.
The theory I read was that the threads share cache which means under adverse conditions each thread can overwrite the other thread's cache. If you don't have much parallelism, or your load is many short queries, then it may not affect you.
I've tried with MAXDOP and processor affinity (back in my last real DBA role, on SQL Server 2000) but could never find anything conclusive, though that was only for my shop at the time.
As a quick test, you can set processor affinity to use only physical cores (the lower numbers) and see what happens.
However, at most you lose half your cores. Nowadays that may not matter compared to what I was playing with a few years ago when it was 2 vs 4 or 4 vs 8. Now it's 8 vs 16 or 16 vs 32.
Edit: A test by Slava Oks
Unfortunately, I don't think you are going to get any more definitive answer than "try turning hyperthreading off and see if that helps".
Despite the helpful answer from Jonathan in my original thread (which you linked in your question), I was never able to get any definitive evidence about the impact of HT on the specific servers I was investigating. In my case, the servers were already scheduled for replacement, so we simply let those replacements "take care of the issue" so to speak.
My advice:
Try a server-level Max Degree of Parallelism setting of 1. Parallelism in SQL Server is most useful for larger, longer-running queries, and your load (I assume) consists of a massively high number of smaller queries. This should entirely eliminate CXPACKET waits. It could make certain individual queries run slightly longer, but should allow more "throughput" of total queries on the server.
I've had good results doing this on OLTP servers. Other kinds of servers (reporting servers, processing servers, data warehousing) definitely need the MAXDOP set higher.
And just to be clear, this setting would still allow SQL to use multiple threads for each individual table in a JOIN, so you're not really eliminating parallelism entirely.
At least worth a try, since this setting change takes effect immediately and doesn't even require you to restart the SQL service: http://msdn.microsoft.com/en-us/library/ms181007.aspx
This means you could switch it back immediately if things started going to hell.
Turning off hyperthreading in the BIOS would require a full server reboot, so is a bit more risky.
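Before scheduling that reboot, you can at least confirm what the OS is currently exposing to SQL Server (a sketch; note the documented caveat that hyperthread_ratio counts logical processors per physical socket, so on its own it cannot distinguish HT from extra physical cores):

```sql
-- Logical CPUs visible to SQL Server vs. physical sockets.
SELECT cpu_count,                                 -- logical processors
       hyperthread_ratio,                         -- logical processors per socket
       cpu_count / hyperthread_ratio AS sockets   -- physical socket count
FROM sys.dm_os_sys_info;
```

Comparing cpu_count against the known core count of your CPUs tells you whether HT is currently on.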
For the record, we also had unexpectedly bad performance after a server upgrade. It turned out to be due to issues with the BIOS and CPU power saving. The default setting on the server (HP) was to ignore the OS control of CPU speed and use its own algorithm. Changing this to OS control, and updating the BIOS, resulted in significant improvements. There were some release notes (can't find them now) that there was a BIOS bug that was locking the CPU at the lowest performance state.
https://serverfault.com/a/196329/6390