Let me start off by explaining why I want to do this. Everything was running fine. I imported a snapshot of a MySQL DB from another server in prep for setting up master-master replication (this one will become the primary in the array once it's set up). I had turned on MySQL slave replication and it was catching up. I also had an rsync transfer going on via cygwin. I forgot something, so I issued a STOP SLAVE command to MySQL. This caused the entire server to literally hang. No reply on ping, nothing. After about 15 minutes in this state, the box was manually hard rebooted.
This raises the question in my mind of whether I can trust the server. STOP SLAVE isn't an intensive call at all. It's beyond me why that would cause MySQL to crash, let alone the whole operating system. So now I'm wondering if it's a hardware problem. We just got brand new RAM (32 GB) installed in the server, but they never ran memtest on it. Since I don't have physical access to the server (it's in a different country), they won't run memtest until Monday morning. I want to do as much testing over the weekend as I possibly can.
I had a similar issue in Linux a few years ago which was caused by a faulty BIOS, where under high I/O loads the box would just freeze. What I did then to reproduce it was have a few Python scripts generate a number of large (10 GB+) files, and then randomly seek to different positions among those files. This caused the machine to halt within minutes.
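For anyone wanting to reproduce that kind of load, here is a minimal sketch of the random-seek idea. It is not the original script: the function names `make_file` and `random_seek_io`, the 4 KB block size, and the read-then-rewrite pattern are all assumptions made for illustration.

```python
import os
import random


def make_file(path, size_bytes, chunk=1024 * 1024):
    """Create a file of size_bytes filled with random data (hypothetical helper)."""
    with open(path, "wb") as f:
        remaining = size_bytes
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(os.urandom(n))
            remaining -= n


def random_seek_io(paths, iterations, block=4096, seed=None):
    """Read and rewrite random blocks across the files; return bytes touched.

    The block size and the rewrite step are assumptions, not the original method.
    """
    rng = random.Random(seed)
    touched = 0
    handles = [open(p, "r+b") for p in paths]
    try:
        for _ in range(iterations):
            f = rng.choice(handles)
            size = os.fstat(f.fileno()).st_size
            # Pick an offset that leaves room for a full block when possible.
            pos = rng.randrange(0, max(1, size - block))
            f.seek(pos)
            data = f.read(block)
            # Seek back before writing so the file size never changes.
            f.seek(pos)
            f.write(data[::-1])
            touched += len(data)
        for f in handles:
            # Push the writes out of the OS cache and onto the disk.
            f.flush()
            os.fsync(f.fileno())
    finally:
        for f in handles:
            f.close()
    return touched
```

To approximate the load described above, create several files in the 10 GB+ range with `make_file` and run `random_seek_io` over them from a few separate processes (for example with `multiprocessing.Process`), using a large iteration count.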
So that got me thinking, why not do a similar thing? So I wrote a Python program to read and write to a series of files (running in 4 processes) to hopefully saturate the disks. Then I wrote another one to just try eating as much RAM as possible (it's at 32 GB now and climbing) while randomly reading and writing to positions in its list. It's been cranking for about an hour now, and it's still solid (the swapping is slowing things down, but it's still stable).
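The memory half of that can be sketched roughly as follows. Again, this is an illustration under assumptions rather than the actual script: `memory_stress` and its parameters are hypothetical names, and the verification step (counting bytes that read back differently from what was written) is an addition that the description above does not mention.

```python
import random


def memory_stress(n_blocks, block_size, iterations, seed=None):
    """Allocate blocks, then write and re-check bytes at random positions.

    Returns the number of bytes that did not read back as last written.
    A non-zero result on healthy hardware should never happen.
    """
    rng = random.Random(seed)
    # Each bytearray is a separate allocation of block_size bytes.
    blocks = [bytearray(block_size) for _ in range(n_blocks)]
    expected = {}  # (block index, offset) -> last value written there
    mismatches = 0
    for _ in range(iterations):
        b = rng.randrange(n_blocks)
        off = rng.randrange(block_size)
        key = (b, off)
        # If this position was written before, check it still holds that value.
        if key in expected and blocks[b][off] != expected[key]:
            mismatches += 1
        value = rng.randrange(256)
        blocks[b][off] = value
        expected[key] = value
    return mismatches
```

Raising `n_blocks` and `block_size` until the total exceeds physical RAM will push the box into swap, which exercises the disks as well. This is not a substitute for memtest, since the OS decides which physical pages back the allocations.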
So I come here to ask: are there any user-land ways of stress testing Windows Server 2008 (2k8) that aren't really application dependent? Once MySQL catches up, I'll write a script to randomly query it to increase the I/O and memory workout. But I'm looking to test the machine and OS more than the application. Until that point, I want to punish this machine for the halt.
Thanks
For testing the hard drive, run the full surface scan from the drive manufacturer, multiple times if you so wish.
For testing the CPU and memory, there are quite a few software packages out there. "Burn-in" tests would most likely be what you're looking for, but most benchmarking suites can be looped to stress a computer. I am a fan of the SiSoft Sandra package, though I haven't used it in years.
If you're looking for something a little closer to your Python scripts, try IOZone.
I might be stating the obvious here, but have you checked the event logs on the server to see if they can help identify what exactly caused the crash?
I'm not sure if it's a misleading superstition of mine, since I don't have a chart to prove it, but the majority of times I've seen an issue with a server, it's been a software/OS related error.