I'm implementing a monitoring system for an existing, modest-sized data center deployment.
So far I've only gotten to the host/application side of the monitoring equation, but I'm noticing what I consider an alarming number of Ethernet errors on various hosts. To me, alarming is 3 or 4 per day per host (some have none). When I look at the SNMP counters for the switches, I again see lots of errors, but I'm not graphing those counters (yet).
In my prior environments, with many more ports, my error rate was approximately zero except on hosts that had actual problems, like duplex mismatches.
None of these interfaces are saturated; they're pushing approximately 40-50 megabytes / sec over gig links.
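To back up the "not saturated" claim, here's the quick arithmetic (the 40-50 MB/s figures are from my graphs; the rest is just unit conversion):

```python
# How close is 40-50 MB/s to filling a 1 Gbit/s link?
link_bps = 1_000_000_000  # gigabit link capacity, bits/sec

for mbytes_per_sec in (40, 50):
    bits_per_sec = mbytes_per_sec * 1_000_000 * 8  # bytes/sec -> bits/sec
    utilization = bits_per_sec / link_bps
    print(f"{mbytes_per_sec} MB/s = {utilization:.0%} of a gig link")
```

So the links are sitting around 30-40% utilization, well short of the point where drops from congestion would be expected.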
My gut feeling is that there shouldn't be any errors at all on any interface if everything is working properly, but I'm worried that if I pick a fight over resolving these problems I'll just alienate everyone else who believes "it works fine; it's been working that way for years."
Anyone have good stories/studies/statistics for when to be alarmed at Ethernet errors? Or something to indicate how even a small volume of errors would affect, say, an iSCSI volume?
Thanks!
TCP/IP handles errors quite well: a single dropped or mangled frame will be retransmitted, and everything will generally be hunky-dory.
A consistent 3-4 errors per day is worth noting because it points at a possible problem (bad cable, bad port, etc.), but at that volume it isn't an itch worth scratching. A single error could be the result of anything from electromagnetic interference to a very ill-positioned subatomic event. Either way, the impact on your network is negligible.
If it's going to become a political issue, just leave it be (but keep an eye on it). I'd only throw a fit if I started seeing errors MUCH more often, or at least a higher percentage of total packets. 0.1% may be a good threshold, but it's all a matter of how armored the neck you'll be sticking out is.
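To turn that 0.1% rule of thumb into something you can actually alert on, here's a minimal sketch. The counter names follow IF-MIB conventions (ifInErrors, packet counters), but the polling mechanism and the numbers below are made up for illustration:

```python
def error_rate(errors_delta: int, packets_delta: int) -> float:
    """Fraction of packets in error over one polling interval."""
    if packets_delta == 0:
        return 0.0  # idle interface: nothing to judge
    return errors_delta / packets_delta

THRESHOLD = 0.001  # 0.1% of packets, per the rule of thumb above

# Hypothetical deltas between two SNMP polls of a switch port:
errors, packets = 4, 2_000_000  # ~4 errors over ~2M packets in a day
rate = error_rate(errors, packets)
print(f"rate = {rate:.6f}, alarm = {rate > THRESHOLD}")
```

The point being: 4 errors against millions of packets is orders of magnitude below the threshold, which is why I'd watch the trend rather than chase individual errors.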