a couple of weeks ago, my linux server (kubuntu 10.04) started to give me trouble.
it freezes after a certain uptime, seemingly anywhere between a couple of minutes and a few hours - the GUI is unresponsive, there is no reaction to mouse or keyboard (not even REISUB), top in an ssh session stops updating and the session itself is aborted after a timeout:
Read from remote host 10.1.1.9: Operation timed out
Connection to 10.1.1.9 closed.
back then, I assumed a hardware issue, so i started replacing more and more hardware - graphics card, motherboard, cpu, ram, harddrives, psu. now i've replaced the entire machine and it still freezes.
i've checked /var/log/messages and some other logs - there is no clue in them at all. a hardware issue seems unlikely considering it's all been replaced, but is still possible.
i've stripped the machine down to the bare minimum. i boot a kubuntu live system from a usb stick, mount a couple of harddrives read-only and start diffing folders on them. this seems to produce the freeze somewhat reliably. so far, i haven't gotten beyond a few hours of uptime.
my server is down, this has been going on for weeks now. i am at the end of my wisdom and i am clutching at straws.
how can i reliably determine if this is a hardware or a software issue ? how would you approach a problem like that ?
Since you have replaced such a lot of hardware, I presume you have already made sure your problem isn't caused by temperature issues.
What if you try out some completely different distro instead of Kubuntu 10.04? Download some other live distribution, for example openSUSE or even some BSD flavour, and see if they reproduce the freeze as well. That way you can be sure this isn't some kind of bug in Kubuntu 10.04.
How much data do you have under the directory trees you are diffing? And more importantly, are there only a couple of large files or a huge number of small files?
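One way to answer the size question, and to see which file was being read when the box locks up, is to replace the diff with a small script that announces each file before reading it. This is only a rough sketch, assuming Python 3 is available on the live system; `scan_tree` is a name I made up for illustration. Run it over ssh so the last path printed is still on your terminal after the freeze.

```python
import os
import sys


def scan_tree(root, log=sys.stdout, chunk_size=1 << 20):
    """Read every regular file under root, announcing each path first.

    Returns (file_count, total_bytes). scan_tree is a hypothetical helper
    written for this post, not part of any standard tool.
    """
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, device nodes and the like.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            # Print and flush BEFORE touching the contents, so the last
            # line visible after a freeze names the file being read.
            print(path, file=log, flush=True)
            with open(path, "rb") as handle:
                while True:
                    chunk = handle.read(chunk_size)
                    if not chunk:
                        break
                    total_bytes += len(chunk)
            file_count += 1
    return file_count, total_bytes


if __name__ == "__main__" and len(sys.argv) > 1:
    count, size = scan_tree(sys.argv[1])
    print("{} files, {} bytes read".format(count, size), flush=True)
```

If the freeze always follows the same path, that points at the data or the filesystem. If the last path is different every time, load-related hardware or driver trouble looks more likely.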
When you replaced the hard drives, how did you copy the data from the old drive to the new one? dd_rescue or some imaging program? Just plain old cp? If you used some kind of imaging program or dd_rescue and the original filesystem somehow contained some strange corruption, perhaps the diff hits the corrupted area and causes a crash? Rare and unlikely, but certainly possible, just as it's possible that lightning strikes you out there.
You need to get a crash dump and take a look through it. Looking in the logs won't help, as nothing gets written to them in the event of a kernel panic/oops. If you have console access you may get to see whether there is a panic message. A crash dump will have the contents of the kernel ring buffer (what you see in dmesg), if it gets written to disk. If that doesn't help, you need to start doing a full analysis of the dump.
https://wiki.ubuntu.com/Kernel/CrashdumpRecipe?action=show&redirect=KernelTeam%2FCrashdumpRecipe appears to be a start for Ubuntu. Googling "redhat crash whitepaper" will also give you some pointers.
On the temperature suggestion, try running some sensor monitoring software and see what it shows at the moment of the freeze.
For KDE (as you use Kubuntu): http://kde-look.org/content/show.php/Sensors-Monitor
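If the GUI is frozen you can't read a widget any more, so it may help to log the temperatures as text in an ssh session instead. A minimal sketch, assuming Python 3 and that the kernel exposes thermal zones under /sys/class/thermal (values in millidegrees Celsius); on some boards the sensors only show up via lm-sensors, in which case this prints nothing useful. The function names are my own.

```python
import glob
import os
import sys
import time


def read_temperatures(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return {zone_name: degrees_celsius} for every readable thermal zone."""
    readings = {}
    for path in sorted(glob.glob(pattern)):
        zone = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as handle:
                # sysfs reports millidegrees Celsius as a plain integer.
                readings[zone] = int(handle.read().strip()) / 1000.0
        except (OSError, ValueError):
            # Unreadable or malformed zone: skip it rather than abort.
            continue
    return readings


def format_line(timestamp, readings):
    """Build one log line: local timestamp followed by zone=temperature pairs."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(timestamp))
    fields = ["{}={:.1f}C".format(z, t) for z, t in sorted(readings.items())]
    return stamp + " " + " ".join(fields)


def log_forever(interval=5.0, out=sys.stdout):
    """Print one line every `interval` seconds until the machine stops."""
    while True:
        out.write(format_line(time.time(), read_temperatures()) + "\n")
        out.flush()
        time.sleep(interval)


# Start the loop only when invoked as: python3 templog.py run
# (templog.py is just an example file name.)
if __name__ == "__main__" and sys.argv[1:] == ["run"]:
    log_forever()
```

The last line left on the ssh terminal gives both the time of the freeze and the temperatures just before it.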