Jeff Atwood

Asked: 2016-06-28 21:28:25 +0800 CST

Dealing with NTP reflection attacks in IPTables

16

We're dealing with an NTP reflection / amplification attack at our colocated servers. This question is specific to responding to NTP reflection attacks, and not directed at DDoS in general.

Here's the traffic:

It's churning a bit of CPU on our router:

Unfortunately it is not big enough to cause our upstream provider to blackhole the traffic, which means it's passing through to us.

We've used the following rule to block the NTP traffic, which originates on port 123:

-p udp --sport 123 -j DROP

This is the first rule in IPTables.

I searched a lot, and I can't find very much information on how to use IPTables to mitigate an NTP reflection attack. And some of the information out there seems flat out incorrect. Is this IPTables rule correct? Is there anything else we can add, or do, to mitigate an NTP reflection / amplification attack other than contacting our upstream network provider?

Also: since these attackers must be using networks that

allow spoofing of IP addresses in packets
have unpatched, circa 2010 NTP code

is there any global clearinghouse we can report these IP addresses to, so that they get fixed to stop allowing spoofed packets, and patch their NTP servers?

Jeff Atwood

Asked: 2015-08-25 12:56:07 +0800 CST

Workarounds for maximum DNS-Interactive terms limit exceeded in SPF record?

16

As a hosting provider, we send email on behalf of our clients, so we help them set up DKIM and SPF email records in their DNS to get email deliverability just right. We've been advising them to use http://mail-tester.com to test that they didn't miss anything, and I like this tool a lot.

One problem we've run into a few times, and I'm not sure about, is the DNS "limit" on the SPF record based on domain name. So if you have this:

v=spf1 a include:aspmx.googlemail.com include:campaignmonitor.com include:authsmtp.com include:mail.zendesk.com include:salesforce.com include:_hostedspf.discourse.org ~all

You'll get

example.com ... campaignmonitor.com: Maximum DNS-interactive term limit (10) exceeded

Like so:

[Image: mail-tester results]

I have some questions about this.

~~I count six domain names here, not 10, so why is it hitting "ten" DNS requests here?~~ Answered here
~~Is this 10 DNS interactive term limit a warning or a real error? e.g. should we care? It is nagging our customers a bit and they email us for support.~~ Answered here
Is this 10 DNS interactive term limit a real problem on today's web? As you can see, this customer has a lot of services sending email for them and they are all legitimate. Perhaps this DNS limit was set in the year 2000 when delegating email services like this was not common?
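
To see how a record with only six domain names can approach the limit, here is a minimal Python sketch that counts the DNS-querying terms in a single SPF record. It assumes the term list from RFC 7208 section 4.6.4 (include, a, mx, ptr, exists, redirect); the function name count_lookup_terms is made up for illustration, and the sketch does not follow the includes, each of which contributes its own terms to the same budget of 10.

```python
import re

# Terms that cause a DNS query during SPF evaluation (RFC 7208, section 4.6.4).
LOOKUP_MECHANISMS = {"include", "a", "mx", "ptr", "exists"}


def count_lookup_terms(record):
    """Count DNS-querying terms in one SPF record, without following includes."""
    count = 0
    for term in record.split()[1:]:  # skip the leading "v=spf1"
        term = term.lstrip("+-~?").lower()  # drop the qualifier, if any
        if term.startswith("redirect="):
            count += 1
            continue
        # "a:host", "a/24" and plain "a" all name the same mechanism
        name = re.split(r"[:/]", term, maxsplit=1)[0]
        if name in LOOKUP_MECHANISMS:
            count += 1
    return count


record = (
    "v=spf1 a include:aspmx.googlemail.com include:campaignmonitor.com "
    "include:authsmtp.com include:mail.zendesk.com include:salesforce.com "
    "include:_hostedspf.discourse.org ~all"
)
print(count_lookup_terms(record))  # the "a" term plus six includes
```

The record above already uses 7 of the 10 terms before any included record is evaluated, so a handful of nested includes is enough to exceed the limit.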

Yes, we can have our customers change the includes to IPs in the SPF record, but that puts us in a bind: if we ever change IPs, a bunch of customers' stuff will break. We really don't want to do that.

What workarounds are there for this?

Jeff Atwood

Asked: 2011-05-23 01:25:48 +0800 CST

Setting up an anonymous Windows Server 2008 network share?

17

How do I create a truly anonymous read/write network share in Windows Server 2008?

That is ...

no login dialog shown
no domain credentials required

... so that any Windows machine can get to this share no matter who is logged in, whether the machine is on a domain or not?

Jeff Atwood

Asked: 2010-06-11 17:46:14 +0800 CST

postfix takes 60-90ms to queue email -- normal?

8

We're seeing some (maybe?) strange delays when submitting individual emails to our local Postfix server.

To help diagnose the issue, I wrote a little test program which sends 5 emails:

get smtp 1ms   (  1 ms)
email 0 677ms  (676 ms)
email 1 802ms  (125 ms)
email 2 890ms  ( 88 ms)
email 3 973ms  ( 83 ms)
email 4 1088ms (115 ms)

Discounting the handshaking in the first email, that's about 90ms per email. These timings have also been corroborated with another test app written by someone else using a different codepath, so it appears to be server related.

I turned on detailed logging and I can see that the delay is between the end-of-message \r\n.\r\n and the server's response:

[16:31:29.95] [SEND] \r\n.\r\n
[16:31:30.05] [RECV] 250 2.0.0 Ok: queued as B128E1E063\r\n

[16:31:30.08] [SEND] \r\n.\r\n
[16:31:30.17] [RECV] 250 2.0.0 Ok: queued as 4A7DE1E06E\r\n

[16:31:30.19] [SEND] \r\n.\r\n
[16:31:30.27] [RECV] 250 2.0.0 Ok: queued as 68ACC1E072\r\n

[16:31:30.28] [SEND] \r\n.\r\n
[16:31:30.34] [RECV] 250 2.0.0 Ok: queued as 7EFFE1E079\r\n

[16:31:30.39] [SEND] \r\n.\r\n
[16:31:30.45] [RECV] 250 2.0.0 Ok: queued as 9793C1E07A\r\n

The time intervals tell the story (discounting the handshaking required for the initial email) -- each email is waiting about 60-90 milliseconds for postfix to queue!
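
The per-message delays can be computed directly from the log excerpt above. This is a small Python sketch; the helper name queue_delays_ms is hypothetical, and it assumes every [SEND] of the end-of-message marker is followed by exactly one [RECV] line. The timestamps only have 10 ms resolution, so the results are approximate.

```python
import re

LOG = r"""
[16:31:29.95] [SEND] \r\n.\r\n
[16:31:30.05] [RECV] 250 2.0.0 Ok: queued as B128E1E063\r\n
[16:31:30.08] [SEND] \r\n.\r\n
[16:31:30.17] [RECV] 250 2.0.0 Ok: queued as 4A7DE1E06E\r\n
[16:31:30.19] [SEND] \r\n.\r\n
[16:31:30.27] [RECV] 250 2.0.0 Ok: queued as 68ACC1E072\r\n
[16:31:30.28] [SEND] \r\n.\r\n
[16:31:30.34] [RECV] 250 2.0.0 Ok: queued as 7EFFE1E079\r\n
[16:31:30.39] [SEND] \r\n.\r\n
[16:31:30.45] [RECV] 250 2.0.0 Ok: queued as 9793C1E07A\r\n
"""

LINE = re.compile(r"\[(\d\d):(\d\d):(\d\d)\.(\d\d)\] \[(SEND|RECV)\]")


def queue_delays_ms(log):
    """Return the SEND -> RECV gap for each message, in milliseconds."""
    delays = []
    sent_at = None
    for match in LINE.finditer(log):
        hours, minutes, seconds, centis = (int(g) for g in match.groups()[:4])
        # integer arithmetic avoids floating-point noise in the differences
        stamp_ms = ((hours * 60 + minutes) * 60 + seconds) * 1000 + centis * 10
        if match.group(5) == "SEND":
            sent_at = stamp_ms
        elif sent_at is not None:
            delays.append(stamp_ms - sent_at)
            sent_at = None
    return delays


print(queue_delays_ms(LOG))
```

For this excerpt the gaps come out between 60 and 100 ms per message.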

This seems .. excessive .. to me. Is it "normal" for postfix to take 60-90 ms for every email you send it? Or do I just have unreasonable expectations? I would expect the local postfix server to queue the email in about 20ms, tops!

Jeff Atwood

Asked: 2010-05-01 03:26:17 +0800 CST

How much network latency is "typical" for east - west coast USA?

113

At the moment we're trying to decide whether to move our datacenter from the west coast to the east coast.

However, I am seeing some disturbing latency numbers from my west coast location to the east coast. Here's a sample result, retrieving a small .png logo file in Google Chrome and using the dev tools to see how long the request takes:

West coast to east coast:
215 ms latency, 46 ms transfer time, 261 ms total
West coast to west coast:
114 ms latency, 41 ms transfer time, 155 ms total

It makes sense that Corvallis, OR is geographically closer to my location in Berkeley, CA, so I expect the connection to be a bit faster.. but I'm seeing an increase in latency of about 100 ms when I perform the same test to the NYC server. That seems .. excessive to me. Particularly since the time spent transferring the actual data only increased about 12% (41 ms to 46 ms), yet the latency nearly doubled (114 ms to 215 ms)!
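
For reference, the relative changes can be worked out from the two measurements quoted above. This is just arithmetic on those numbers, written as a short Python sketch; the dictionary keys are labels chosen for illustration.

```python
def percent_increase(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Chrome dev tools numbers quoted in the question, in milliseconds
west = {"latency": 114, "transfer": 41, "total": 155}
east = {"latency": 215, "transfer": 46, "total": 261}

for key in west:
    delta = east[key] - west[key]
    change = percent_increase(west[key], east[key])
    print(f"{key}: +{delta} ms ({change:.0f}%)")
```

This gives roughly +101 ms (89%) for latency, +5 ms (12%) for transfer time and +106 ms (68%) overall.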

That feels... wrong... to me.

I found a few links here that were helpful (through Google no less!) ...

... but nothing authoritative.

So, is this normal? It doesn't feel normal. What is the "typical" latency I should expect when moving network packets from the east coast <--> west coast of the USA?

Jeff Atwood

Asked: 2010-02-05 16:51:49 +0800 CST

Windows services not starting automatically?

14

We've had some nasty time sync problems on our Windows Server 2008 R2 servers lately.

I traced this back to something very simple: the Windows Time Service was not started! The time can't possibly sync via NTP when the time service isn't running...

The Windows Time Service was set to start "automatically" in the services control panel, which I double and triple checked. I also checked the event logs and I didn't see any service failures or anything like that. In fact, it looked a heck of a lot like the Windows Time Service never started up automatically after the weekly Windows Updates were installed and the servers were rebooted. (this is set to happen every Saturday at 7 PM.)

The minute I started the Time Service, the time synced fine.

So, then, the question: why would a service set to start "Automatically" ... not be started automatically? That seems sort of crazy to me.

Jeff Atwood

Asked: 2010-01-27 12:41:46 +0800 CST

Tuning Linux IP routing parameters -- secret_interval and tcp_mem

33

We had a little failover problem with one of our HAProxy VMs today. When we dug into it, we found this:

Jan 26 07:41:45 haproxy2 kernel: [226818.070059] __ratelimit: 10 callbacks suppressed
Jan 26 07:41:45 haproxy2 kernel: [226818.070064] Out of socket memory
Jan 26 07:41:47 haproxy2 kernel: [226819.560048] Out of socket memory
Jan 26 07:41:49 haproxy2 kernel: [226822.030044] Out of socket memory

Which, per this link, apparently has to do with low default settings for net.ipv4.tcp_mem. So we increased them by 4x from their defaults (this is Ubuntu Server, not sure if the Linux flavor matters):

current values are:    45984   61312   91968
new values are:       183936  245248  367872
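
To put those numbers in perspective: net.ipv4.tcp_mem is expressed in memory pages, not bytes. The Python sketch below redoes the 4x scaling and converts the result to MiB. It assumes a 4 KiB page size (typical on x86, but worth checking with getconf PAGESIZE); the helper names are made up for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages; verify with `getconf PAGESIZE`


def scale_tcp_mem(values, factor):
    """Multiply each of the three tcp_mem thresholds by factor."""
    return [value * factor for value in values]


def pages_to_mib(pages, page_size=PAGE_SIZE):
    """Convert a page count to mebibytes."""
    return pages * page_size / (1024 * 1024)


current = [45984, 61312, 91968]  # min, pressure, max (in pages)
new = scale_tcp_mem(current, 4)

for name, pages in zip(("min", "pressure", "max"), new):
    print(f"{name}: {pages} pages = {pages_to_mib(pages):.1f} MiB")
```

Under that page-size assumption, the new upper threshold of 367872 pages corresponds to about 1437 MiB of memory for all TCP sockets combined.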

After that, we started seeing a bizarre error message:

Jan 26 08:18:49 haproxy1 kernel: [ 2291.579726] Route hash chain too long!
Jan 26 08:18:49 haproxy1 kernel: [ 2291.579732] Adjust your secret_interval!

Shh.. it's a secret!!

This apparently has to do with /proc/sys/net/ipv4/route/secret_interval, which defaults to 600 and controls periodic flushing of the route cache:

The secret_interval instructs the kernel how often to blow away ALL route hash entries regardless of how new/old they are. In our environment this is generally bad. The CPU will be busy rebuilding thousands of entries per second every time the cache is cleared. However we set this to run once a day to keep memory leaks at bay (though we've never had one).

While we are happy to reduce this, it seems odd to recommend dropping the entire route cache at regular intervals, rather than simply pushing old values out of the route cache faster.

After some investigation, we found /proc/sys/net/ipv4/route/gc_elasticity which seems to be a better option for keeping the route table size in check:

gc_elasticity can best be described as the average bucket depth the kernel will accept before it starts expiring route hash entries. This will help maintain the upper limit of active routes.

We adjusted elasticity from 8 to 4, in the hopes of the route cache pruning itself more aggressively. The secret_interval does not feel correct to us. But there are a bunch of settings and it's unclear which are really the right way to go here.

/proc/sys/net/ipv4/route/gc_elasticity (8)
/proc/sys/net/ipv4/route/gc_interval (60)
/proc/sys/net/ipv4/route/gc_min_interval (0)
/proc/sys/net/ipv4/route/gc_timeout (300)
/proc/sys/net/ipv4/route/secret_interval (600)
/proc/sys/net/ipv4/route/gc_thresh (?)
rhash_entries (kernel parameter, default unknown?)
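
Before changing anything, it can help to record what the box is actually using. Here is a minimal Python sketch that reads the files listed above; the function name read_route_params is hypothetical, and a file that is missing or unreadable on a given kernel is reported as None rather than treated as an error. rhash_entries is a boot parameter, not a /proc file, so it is not covered.

```python
from pathlib import Path

ROUTE_DIR = Path("/proc/sys/net/ipv4/route")
PARAMS = [
    "gc_elasticity",
    "gc_interval",
    "gc_min_interval",
    "gc_timeout",
    "secret_interval",
    "gc_thresh",
]


def read_route_params(base=ROUTE_DIR, names=PARAMS):
    """Return {name: value-as-string, or None if the file cannot be read}."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:  # missing file, permission denied, not Linux, ...
            values[name] = None
    return values


for name, value in read_route_params().items():
    print(f"{name}: {value if value is not None else 'not available'}")
```

Saving this output before and after each change makes it easier to roll back a single setting if routing gets worse.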

We don't want to make the Linux routing worse, so we're kind of afraid to mess with some of these settings.

Can anyone advise which routing parameters are best to tune, for a high traffic HAProxy instance?

Jeff Atwood

Asked: 2010-01-26 10:57:05 +0800 CST

Should network hardware be set to "autonegotiate" speeds or fixed speeds?

90

We recently had a little problem with networking where multiple servers would intermittently lose network connectivity in a fairly painful-to-resolve way (required hard reboot). This has been going on for about two weeks, seemingly at random, on different servers. No particular pattern that we could discern to it.

After some digging into it, we saw that the switch was reporting 100 Mbps for the problem port:

This sounds remarkably like what happened in the Joel Spolsky article Five Whys:

Michael spent some time doing a post-mortem, and discovered that the problem was a simple configuration problem on the switch. There are several possible speeds that a switch can use to communicate (10, 100, or 1000 megabits/second). You can either set the speed manually, or you can let the switch automatically negotiate the highest speed that both sides can work with. The switch that failed had been set to autonegotiate. This usually works, but not always, and on the morning of January 10th, it didn’t.

We have now disabled auto-negotiate on our network hardware and set it to a fixed rate of 1000 Mbps (gigabit).

My questions to those with more server hardware networking expertise:

How common are auto-negotiate problems with modern networking hardware?
Is it considered good, standard networking practice to disable auto-negotiate and set fixed speeds when setting up networking?

Jeff Atwood

Asked: 2010-01-09 19:01:11 +0800 CST

Is Round-Robin DNS "good enough" for load balancing static content?

70

We have a set of shared, static content that we serve up between our websites at http://sstatic.net. Unfortunately, this content is not currently load balanced at all -- it's served from a single server. If that server has problems, all the sites that rely on it are effectively down because the shared resources are essential shared JavaScript libraries and images.

We are looking at ways to load balance the static content on this server, to avoid the single server dependency.

I realize that round-robin DNS is, at best, a low end (some might even say ghetto) solution, but I can't help wondering -- is round robin DNS a "good enough" solution for basic load balancing of static content?

There is some discussion of this in the [dns] [load-balancing] tags, and I've read through some great posts on the topic.

I am aware of the common downsides of DNS load balancing through multiple round-robin A records:

there are typically no heartbeats or failure detection with DNS records, so if a given server in the rotation goes down, its A record must be removed manually from the DNS entries
the time to live (TTL) must necessarily be set quite low for this to work at all, since DNS entries are cached aggressively throughout the internet
the client computers are responsible for seeing that there are multiple A records and picking the correct one
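
The last point can be made concrete with a sketch of what a client has to do when a name resolves to several A records: collect all the addresses and try them in turn until one accepts a connection. This is a Python illustration only, using the standard socket module; resolve_all and first_reachable are hypothetical helper names, and real browsers and resolvers implement their own (varying) logic.

```python
import socket


def resolve_all(host, port=80):
    """Return every distinct IPv4 address the resolver hands back for host."""
    infos = socket.getaddrinfo(
        host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # dict.fromkeys drops duplicates while keeping the resolver's order
    return list(dict.fromkeys(info[4][0] for info in infos))


def first_reachable(addresses, port=80, timeout=2.0,
                    connect=socket.create_connection):
    """Try each address in order; return the first that accepts a TCP connection."""
    for address in addresses:
        try:
            connect((address, port), timeout).close()
            return address
        except OSError:  # refused, timed out, unreachable: try the next record
            continue
    return None


# Example (needs network access):
#   addresses = resolve_all("sstatic.net")
#   print(first_reachable(addresses))
```

A client that stops at the first address instead of walking the list sees an outage whenever that one server is down, which is why the lack of failure detection matters so much.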

But, is round robin DNS good enough as a starter, better than nothing, "while we research and implement better alternatives" form of load balancing for our static content? Or is DNS round robin pretty much worthless under any circumstances?

Jeff Atwood

Asked: 2009-12-15 09:24:06 +0800 CST

Does CPU power management affect server performance?

44

I was doing some simple hand benchmarking on our (live) database server during non-peak hours, and I noticed that queries returned somewhat erratic benchmark results.

I had enabled the "Balanced" power saving plan on all our servers a while ago, because I figured they were nowhere near high utilization and this way we could save some energy.

I had assumed this would have no significant, measurable impact on performance. However, if CPU power saving features are impacting typical performance -- particularly on the shared database server -- then I am not sure it's worth it!

I was a little surprised that our web tier, even when at 35-40% load, is down-clocking from 2.8 GHz @ 1.25V to 2.0 GHz @ 1.15V.

I fully expect the down-clocking to save power, but that load level seems high enough to me that it should be kicking up to full clock speed.

Our 8-cpu database server has a ton of traffic, but extremely low CPU utilization (just due to the nature of our SQL queries -- lots of them, but really simple queries). It's usually sitting at 10% or less. So I expect it was downclocking even more than the above screenshot. Anyway, when I turned power management to "high performance" I saw my simple SQL query benchmark improve by about 20%, and become very consistent from run to run.

I guess I was thinking that power management on lightly loaded servers was win-win -- no performance loss, and significant power savings because the CPU is commonly the #1 or #2 consumer of power in most servers. That does not appear to be the case; you will give up some performance with CPU power management enabled, unless your server is always under so much load that the power management has effectively turned itself off. This result surprised me.

Does anyone have any other experience or recommendations to share on CPU power management for servers? Is it something you turn on or off on your servers? Have you measured how much power you are saving? Have you benchmarked with it on and off?

Jeff Atwood

Asked: 2009-12-05 22:16:30 +0800 CST

How do I remove a specific bad plan from the SQL Server query cache?

42

We have one particular SQL Server 2008 query (not a stored proc, but the same SQL string -- executes every 5 minutes) that intermittently caches a very bad query plan.

This query normally runs in a few milliseconds, but with this bad query plan, it takes 30+ seconds.

How do I surgically remove just the one bad cached query plan from SQL Server 2008, without blowing away the entire query cache on the production database server?

Jeff Atwood

Asked: 2009-12-04 09:53:12 +0800 CST

Are IP addresses "trivial to forge"?

70

I was reading through some of the notes on Google's new public DNS service:

I noticed under the security section this paragraph:

Until a standard system-wide solution to DNS vulnerabilities is universally implemented, such as the DNSSEC protocol, open DNS resolvers need to independently take some measures to mitigate against known threats. Many techniques have been proposed; see IETF RFC 5452: Measures for making DNS more resilient against forged answers for an overview of most of them. In Google Public DNS, we have implemented, and we recommend, the following approaches:

Overprovisioning machine resources to protect against direct DoS attacks on the resolvers themselves. Since IP addresses are trivial for attackers to forge, it's impossible to block queries based on IP address or subnet; the only effective way to handle such attacks is to simply absorb the load.

That is a depressing realization; even on Stack Overflow / Server Fault / Super User, we frequently use IP addresses as the basis for bans and blocks of all kinds.

To think that a "talented" attacker could trivially use whatever IP address they want, and synthesize as many unique fake IP addresses as they want, is really scary!

So my question(s):

Is it really that easy for an attacker to forge an IP address in the wild?
If so, what mitigations are possible?

Jeff Atwood

Asked: 2009-08-04 21:30:17 +0800 CST

Delete Files Older than (x) Days?

35

What's a good Windows command line option for deleting all files in a given folder older than (n) days?

Also note there may be many thousands of these files, so forfiles with a shell to cmd is not a great idea here.. unless you like spawning thousands of command shells. I consider that a pretty nasty hack, so let's see if we can do better!

Ideally, something built into (or easily installable into) Windows Server 2008.
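
To pin down the behaviour being asked for, here is a sketch in Python. It is not a built-in Windows command and assumes a Python interpreter is installed on the server; the function name delete_older_than is made up for illustration. It walks the folder once in a single process, so no shell is spawned per file, and it defaults to a dry run that only reports what would be removed.

```python
import os
import time


def delete_older_than(folder, days, now=None, dry_run=True):
    """Delete regular files in folder last modified more than `days` days ago.

    Returns the list of matching paths. With dry_run=True nothing is deleted.
    Subfolders are not visited.
    """
    if now is None:
        now = time.time()
    cutoff = now - days * 86400
    stale = []
    with os.scandir(folder) as entries:
        for entry in entries:
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(entry.path)
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale
```

Calling it first with the default dry_run=True and reviewing the returned list is a cheap safety check before running it again with dry_run=False.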

Jeff Atwood

Asked: 2009-07-26 03:19:19 +0800 CST

Recommended LogParser queries for IIS monitoring?

86

As Stack Overflow grows, we're starting to look closely at our IIS logs to identify problem HTTP clients -- things like rogue web spiders, users who have a large page set to refresh every second, poorly written one-off web scrapers, tricksy users who try to increment page count a zillion times, and so forth.

I've come up with a few LogParser queries that help us identify most of the oddities and abnormalities when pointed at an IIS log file.

Top bandwidth usage by URL

SELECT TOP 50 DISTINCT
SUBSTR(TO_LOWERCASE(cs-uri-stem), 0, 55) AS Url,
COUNT(*) AS Hits,
AVG(sc-bytes) AS AvgBytes,
SUM(sc-bytes) AS ServedBytes
FROM {filename}
GROUP BY Url
HAVING Hits >= 20
ORDER BY ServedBytes DESC

url                                                   hits  avgbyte  served
-------------------------------------------------     ----- -------  -------
/favicon.ico                                          16774 522      8756028
/content/img/search.png                               15342 446      6842532
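
For readers without LogParser at hand, the same aggregation can be sketched in Python. This assumes the standard W3C extended log layout that IIS writes: a "#Fields:" directive naming the columns, followed by space-separated rows. The function name top_bandwidth_by_url is hypothetical, and the average is computed with integer division, which may not match LogParser's AVG in every case.

```python
from collections import defaultdict


def top_bandwidth_by_url(lines, min_hits=20, top=50):
    """Return (url, hits, avg_bytes, served_bytes) rows, biggest total first."""
    fields = []
    stats = defaultdict(lambda: [0, 0])  # url -> [hits, total bytes]
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line.split()[1:]  # column names for the rows that follow
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split(" ")))
        size = row.get("sc-bytes", "")
        if "cs-uri-stem" not in row or not size.isdigit():
            continue  # column not logged, or a malformed row
        url = row["cs-uri-stem"].lower()[:55]
        stats[url][0] += 1
        stats[url][1] += int(size)
    rows = [
        (url, hits, total // hits, total)
        for url, (hits, total) in stats.items()
        if hits >= min_hits
    ]
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows[:top]


# Example, using the log path format mentioned in the question:
#   with open(r"c:\working\sologs\u_ex090708.log") as log:
#       for row in top_bandwidth_by_url(log):
#           print(row)
```

Note that sc-bytes is only present if that field is enabled in the IIS logging settings; rows without it are skipped here.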

Top hits by URL

SELECT TOP 100
cs-uri-stem AS Url,
COUNT(cs-uri-stem) AS Hits
FROM {filename}
GROUP BY cs-uri-stem
ORDER BY COUNT(cs-uri-stem) DESC

url                                                                    hits
-------------------------------------------------                      -----
/content/img/sf/vote-arrow-down.png                                    14076
/content/img/sf/vote-arrow-up.png                                      14018

Top bandwidth and hits by IP / User-Agent

SELECT TOP 30
c-ip AS Client,
SUBSTR(cs(User-Agent), 0, 70) AS Agent,
SUM(sc-bytes) AS TotalBytes,
COUNT(*) AS Hits
FROM {filename}
GROUP BY c-ip, cs(User-Agent)
ORDER BY TotalBytes DESC

client         user-agent                                      totbytes   hits
-------------  ---------------------------------------------   ---------  -----
66.249.68.47   Mozilla/5.0+(compatible;+Googlebot/2.1;         135131089  16640
194.90.190.41  omgilibot/0.3++omgili.com                       133805857  6447

Top bandwidth by hour by IP / User-Agent

SELECT TOP 30
TO_STRING(time, 'h') AS Hour,
c-ip AS Client,
SUBSTR(cs(User-Agent), 0, 70) AS Agent,
SUM(sc-bytes) AS TotalBytes,
COUNT(*) AS Hits
FROM {filename}
GROUP BY c-ip, cs(User-Agent), Hour
ORDER BY SUM(sc-bytes) DESC

hr   client        user-agent                                  totbytes   hits
--   ------------- -----------------------------------------   --------   ----
9    194.90.190.41 omgilibot/0.3++omgili.com                   30634860   1549
10   194.90.190.41 omgilibot/0.3++omgili.com                   29070370   1503

Top hits by hour by IP / User-Agent

SELECT TOP 30
TO_STRING(time, 'h') AS Hour,
c-ip AS Client,
SUBSTR(cs(User-Agent), 0, 70) AS Agent,
COUNT(*) AS Hits,
SUM(sc-bytes) AS TotalBytes
FROM {filename}
GROUP BY c-ip, cs(User-Agent), Hour
ORDER BY Hits DESC

hr   client         user-agent                                  hits  totbytes
--   -------------  -----------------------------------------   ----  --------
10   194.90.190.41  omgilibot/0.3++omgili.com                   1503  29070370
12   66.249.68.47   Mozilla/5.0+(compatible;+Googlebot/2.1      1363  13186302

The {filename} of course would be a path to an IIS logfile, such as

c:\working\sologs\u_ex090708.log

I did a lot of web searches for good IIS LogParser queries and found precious little. These 5, above, have helped us tremendously in identifying serious problem clients. But I'm wondering -- what are we missing?

What other ways are there to slice and dice the IIS logs (preferably with LogParser queries) to mine them for statistical anomalies? Do you have any good IIS LogParser queries you run on your servers?

Jeff Atwood

Asked: 2009-07-20 02:30:32 +0800 CST

DNS failing to propagate worldwide

66

I haven't changed anything related to the DNS entry for serverfault.com, but some users were reporting today that the serverfault.com DNS fails to resolve for them.

I ran a justping query and I can sort of confirm this -- serverfault.com dns appears to be failing to resolve in a handful of countries, for no particular reason that I can discern. (also confirmed via What's My DNS which does some worldwide pings in a similar fashion, so it's confirmed as an issue by two different sources.)

Why would this be happening, if I haven't touched the DNS for serverfault.com?
Our registrar is (gag) GoDaddy, and I use default DNS settings for the most part without incident. Am I doing something wrong? Have the gods of DNS forsaken me?
Is there anything I can do to fix this? Any way to goose the DNS along, or force the DNS to propagate correctly worldwide?

Update: as of Monday at 3:30 am PST, everything looks correct.. JustPing reports site is reachable from all locations. Thank you for the many very informative responses, I learned a lot and will refer to this Q the next time this happens..

Jeff Atwood

Asked: 2009-07-03 05:42:24 +0800 CST

Why is CPU use so asymmetric on our 8-cpu SQL Server box?

7

I've noticed that the CPU usage on our 8-CPU database server, running SQL Server 2008, is not balanced at all.

Here are the 1 day averages for a random day a while back, which is typical and consistently asymmetric:

9, 15, 10, 21, 18, 21, 14, 9


Compared with our 4-CPU web servers, which are all almost exactly and perfectly balanced all the time, that struck me as odd.

Now, this is a dedicated server, so the only thing running on it is SQL Server 2008 (and the built-in full text indexing, which we use pretty heavily), so I'm not sure why the CPU usage would be so asymmetric. Thoughts?

Jeff Atwood

Asked: 2009-07-02 21:19:43 +0800 CST

In-place upgrade from Windows Server Standard to Enterprise or Datacenter

14

We ran into a little issue where Windows Server Standard only supports 32 GB of memory, which is depressing to find out when your server was recently upgraded to 48 awesome GB of memory.

Is it possible to do an in-place upgrade from Windows Server Standard (max 32 GB memory support) to Windows Server Enterprise or Datacenter (max 2TB memory support)?

I tried kicking off the Windows Server 2008 setup.exe from within the OS, which gives me an "upgrade" option.. but when I get to the "Where do you want to install Windows" part of the process, it tells me

The partition you selected might contain files from a previous Windows installation. If it does, these files and folders will be moved to a folder named Windows.old. You will be able to access the information in Windows.old, but you will not be able to use your previous versions of Windows

That's... not... what I had in mind when I clicked "upgrade"!

I really don't want to deal with a clean OS install on our production database server, and would much rather do an upgrade install if at all possible!

Jeff Atwood

Asked: 2009-06-06 01:50:16 +0800 CST

Dynamically blocking excessive HTTP bandwidth use?

24

We were a little surprised to see this on our Cacti graphs for June 4 web traffic:

We ran Log Parser on our IIS logs and it turns out this was a perfect storm of Yahoo and Google bots indexing us.. in that 3 hour period, we saw 287k hits from 3 different Google IPs, plus 104k from Yahoo. Ouch?
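
As a rough sense of scale, those totals can be turned into average request rates. This is plain arithmetic on the figures above, written as a tiny Python sketch; it assumes the hits were spread evenly over the 3 hours, which real crawler traffic rarely is.

```python
def hits_per_second(hits, hours):
    """Average request rate over the period."""
    return hits / (hours * 3600)


google = hits_per_second(287_000, 3)
yahoo = hits_per_second(104_000, 3)

print(f"Google: {google:.1f} hits/s")
print(f"Yahoo:  {yahoo:.1f} hits/s")
print(f"Total:  {google + yahoo:.1f} hits/s")
```

That works out to roughly 27 hits per second from Google and 10 from Yahoo, sustained for three hours, on top of normal traffic.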

While we don't want to block Google or Yahoo, this has come up before. We have access to a Cisco PIX 515E, and we're thinking about putting that in front so we can dynamically deal with bandwidth offenders without touching our web servers directly.

But is that the best solution? I'm wondering if there is any software or hardware that can help us identify and block excessive bandwidth use, ideally in real time? Perhaps some bit of hardware or open-source software we can put in front of our web servers?

We are mostly a Windows shop but we have some Linux skills as well; we're also open to buying hardware if the PIX 515E isn't sufficient. What would you recommend?

Jeff Atwood

Asked: 2009-05-01 10:16:30 +0800 CST

Windows Server unable to synchronize NTP time reliably

41

Why does Windows Server (2008, in this case, but I've seen the same problem in 2003) seem to have problems synchronizing time? I've seen this error in my System log across a variety of servers:

The time service has not synchronized the system time for 86400 seconds because none of the time service providers provided a usable time stamp. The time service will not update the local system time until it is able to synchronize with a time source. If the local system is configured to act as a time server for clients, it will stop advertising as a time source to clients. The time service will continue to retry and sync time with its time sources. Check system event log for other W32time events for more details. Run 'w32tm /resync' to force an instant time synchronization.

Under Control Panel, Date and Time, the Internet Time settings are set to synchronize with time-nw.nist.gov; the last successful sync was 2 days ago, indicating there's some kind of problem. But if I click the "update now" button on that dialog, it does indeed update the time!

So why can't Windows Server reliably time sync via NTP in the background without me manually intervening? What am I doing wrong?

Jeff Atwood

Asked: 2009-04-30 23:04:18 +0800 CST

Do RAID controllers commonly have SATA drive brand compatibility issues?

22

We've struggled with the RAID controller in our database server, a Lenovo ThinkServer RD120. It is a rebranded Adaptec that Lenovo / IBM dubs the ServeRAID 8k.

We have patched this ServeRAID 8k up to the very latest and greatest:

RAID bios version
RAID backplane bios version
Windows Server 2008 driver

This RAID controller has had multiple critical BIOS updates even in the short 4 month time we've owned it, and the change history is just.. well, scary.

We've tried both write-back and write-through strategies on the logical RAID drives. We still get intermittent I/O errors under heavy disk activity. They are not common, but serious when they happen, as they cause SQL Server 2008 I/O timeouts and sometimes failure of SQL connection pools.

We were at the end of our rope troubleshooting this problem. Short of hardcore stuff like replacing the entire server, or replacing the RAID hardware, we were getting desperate.

When I first got the server, I had a problem where drive bay #6 wasn't recognized. Switching out hard drives to a different brand, strangely, fixed this -- and updating the RAID BIOS (for the first of many times) fixed it permanently, so I was able to use the original "incompatible" drive in bay 6. On a hunch, I began to assume that the Western Digital SATA hard drives I chose were somehow incompatible with the ServeRAID 8k controller.

Buying 6 new hard drives was one of the cheaper options on the table, so I went for 6 Hitachi (aka IBM, aka Lenovo) hard drives under the theory that an IBM/Lenovo RAID controller is more likely to work with the drives it's typically sold with.

Looks like that hunch paid off -- we've been through three of our heaviest load days (mon,tue,wed) without a single I/O error of any kind. Prior to this we regularly had at least one I/O "event" in this time frame. It sure looks like switching brands of hard drive has fixed our intermittent RAID I/O problems!

While I understand that IBM/Lenovo probably tests their RAID controller exclusively with their own brand of hard drives, I'm disturbed that a RAID controller would have such subtle I/O problems with particular brands of hard drives.

So my question is, is this sort of SATA drive incompatibility common with RAID controllers? Are there some brands of drives that work better than others, or are "validated" against particular RAID controllers? I had sort of assumed that all commodity SATA hard drives were alike and would work reasonably well in any given RAID controller (of sufficient quality).

