We're a small shop from an IT standpoint and buy almost all of our server and network hardware from gray market suppliers—typically via eBay. I'm interested in developing a more rigorous testing process prior to deploying this gray market hardware into production. What hardware stress tests, test suites, etc. are recommended for this scenario?
Note: For this question, I'm not interested in debating the merits of buying new hardware or gray market hardware. Given our size and budget, we believe that buying off the gray market—whether used or new hardware—gives us the best bang for our buck. Thanks.
Typical Server Configurations
- Servers: IBM x335, x345, and HS20 blades
- HDDs: SCSI running in RAID1 or RAID5 configurations
- OS: Ubuntu Server 8.04LTS or 9.10
I think it depends on what your uptime requirements are, and what level of "grey market" you're dealing with.
If your uptime requirements are high, then you want to rely on infrastructure redundancy, so that the loss of a single machine doesn't mean the loss of services to your customers. Buy double, build in redundancy, and monitor your hosts and network so that you know when you need to replace something.
If your uptime requirements aren't that high, but you just need working hardware, then evaluate the people you buy from. Make sure anything you buy is guaranteed not to be dead on arrival (DOA). If you can, buy from refurb shops with at least 90-day warranties (a year would be even better if you can afford it), and make sure you can get spare parts for whatever you're buying from another source in case the original seller closes up shop.
We buy a reasonable quantity of second-hand IBM equipment alongside the new stuff at $JOB. It's all HS/LS blades now, but we have had a lot of x3** pizza boxes in the past. As I'm sure you're aware, there's some great stuff to be had from other people's end-of-lease agreements and hardware refreshes, frequently even with some time left on the manufacturer's warranty.
Typically, any problems we have seen have arisen fairly quickly and become apparent in the BladeCenter or BIOS event logs. They can usually be teased out just by running the machine up for a short period of time and restarting it.
It's not that common to see S/H gear populated with drives. Whenever we do, the drives get thrown away. Spinning media is nearly always the weakest link in the hardware chain: you have no idea whether the drives have exhibited problems previously or have been dropped in transit, and drives are so cheap to buy new that keeping old ones just isn't worth the hassle.
As for the technical question of which tool to use: since you only seem to be dealing with IBM machines, you might as well use the handy and comprehensive diagnostics tool that IBM have already thrown in. Just hit F2 at boot.
memtestp and iozone are my two favourites.
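For iozone, a reasonable starting point (the scratch-file path and the 4 GB cap below are just example values, not anything specific to this setup) is its automatic mode:

    # run the full automatic test matrix, capping file size at 4 GB,
    # against a scratch file on the filesystem you want to exercise
    iozone -a -g 4g -f /mnt/scratch/iozone.tmp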
Another thought is to try to standardize your systems. Try to buy the same hardware, which can then be used for spares if needed. Actual testing will depend on the time available. I would create my own automated (and repeatable) test suite that stresses all the major components: CPU, memory, disk I/O, and network I/O. Run it a few times and use the results to set a baseline; every system that performs more than 10-20% below that baseline should be re-examined before being deployed into production.
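As a rough illustration only, a minimal version of such a suite might just time a few fixed workloads and record the numbers for comparison across machines; the sizes, paths, and the iperf server address below are placeholders rather than anything specific to this setup:

    #!/bin/bash
    # Crude baseline sketch: run the same fixed workloads on every machine
    # and compare the reported times/throughput against your baseline box.

    echo "== CPU: hash 4 GB of zeros =="
    time dd if=/dev/zero bs=1M count=4096 2>/dev/null | sha256sum

    echo "== Disk: write 2 GB bypassing the page cache =="
    time dd if=/dev/zero of=/tmp/stress.tmp bs=1M count=2048 oflag=direct
    rm -f /tmp/stress.tmp

    echo "== Network: 30-second throughput test =="
    # assumes 'iperf -s' is already running on a helper host at 192.168.0.10
    iperf -c 192.168.0.10 -t 30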
I usually boot the system under test from an external medium, e.g. a USB flash drive or a network PXE boot into a ramdisk. This allows me to test the drives in a destructive manner and makes for a good multipurpose test environment.
For drive testing I use badblocks' destructive four-pass test on the raw device, e.g. (substitute your actual device for /dev/sdb):
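    # -w = destructive write-mode test (four patterns), -s = show progress, -v = verbose
    badblocks -wsv /dev/sdb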
NOTE: this will wipe out all the data on the drive! If you have multiple drives, it may further stress the system to test them in parallel.
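A simple way to do that (the device names again being placeholders) is to background one badblocks run per drive and wait for all of them:

    # one destructive badblocks pass per drive, all running at once
    for dev in /dev/sdb /dev/sdc /dev/sdd; do
        badblocks -wsv "$dev" > "badblocks-${dev##*/}.log" 2>&1 &
    done
    wait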
Compiling the Linux kernel is considered a good overall system test. I run one compile loop per CPU core: configure a default kernel source tree, copy it once per instance, and then in each instance do something like (the directory name below is just a placeholder):
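    # sketch: this instance's copy of the already-configured source tree
    cd ~/kernel-copy-1
    make defconfig          # configure a default kernel, if not done before copying
    while true; do
        make clean
        make -j1 bzImage
    done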
While the kernel compiles are running, you might want to watch the CPU temperature with sensors, e.g.:
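    # re-run lm-sensors' 'sensors' output every 5 seconds to watch temperatures during the build
    watch -n 5 sensors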
Run this for 24 hours and you should have a good reliable system at the end of it.
I like to use memtest86 to test the memory subsystem. It will let you know if there are any bad memory modules in your system.
For CPU testing, I like to run the Distributed.net RC5-72 client; this will load your CPUs to 100% crunching 72-bit RC5 keys. If there is a problem with the CPUs or related components, I would think this would find it. I let it run for as long as I can; in addition to stressing the hell out of my CPUs, it also ups my DNETC stats :) For stress testing a system, though, I'd run it for at least 24 hours.
badblocks, as mentioned above, is a good way to stress test disk drives, should you wish to keep them (a separate discussion). An alternative to the destructive read/write test mentioned by VMBed is the non-destructive read/write test, which will leave data intact.
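For example (the device name is again a placeholder for the drive you are testing):

    # -n = non-destructive read-write mode: existing data is read, overwritten
    #      with test patterns, then restored; -s/-v = progress/verbose
    badblocks -nsv /dev/sdb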