I apologize in advance for not being a proper admin; I'm just a programmer with a server on which I installed Debian Etch plus MySQL, PHP, Apache and ISPConfig.
So, it had an uptime of more than 900 days without a single problem (there's no significant load on it, just a couple of our services), and then it started to behave badly: it suddenly freezes (only ping works, nothing else), and when I try to restart it via the ISP's interface, it freezes completely. Then I have to ask support for a manual restart. After that, it works fine for a couple of days, then the same thing happens again (it has happened three times so far).
I have now performed a network boot and run fsck (it found 1.1% non-contiguous), and I hope it will help.
My question is: has anyone had a similar experience, and what could be causing such a problem (when only ping works)?
Also, I looked in the system log, but found nothing that could indicate a problem. Is there some other log I should look into?
Thanks for all the answers!
Sorry, I haven't registered yet, so I have no option to vote up. But thanks!
First, to clarify: this is a housed server, and the ISP's support provides network boot, reset and manual reset functions.
It probably is an HDD issue: after the fsck everything seemed to work fine, until I looked deeper and realized that only the front page works, while the others don't (pages give a '403 Forbidden' error, just a blank page, or a MySQL error...).
SSH also seems to work, but it actually doesn't: I can try to log in and it will refuse a wrong password, but when I enter the correct one, the connection just closes.
I will try to access the files once again through network boot and back up as much as possible, then I will have to replace the disk...
Is it possible to clone a disk with errors on it? Is it worth trying, anyway?
UPDATE: Today (one day after I asked the question) it turned out that the HDD is definitely defective. Once again, thanks for your time and help!
Assuming this is a dedicated physical server:
The next time it freezes, you should have your hosting company plug in a "crash cart" and see what's on the screen (console), or go to the data center yourself. The next time it starts to act up, if you're able to log in, type "dmesg" and look for error messages; include them by editing your question and pasting them, or by using pastebin.
I've snapped photos with a digital camera or cellphone in the past for later reference or showing to someone remotely. Any serious kernel messages will most likely be on screen (it depends on how logging is configured); without this information, the answers you get will be essentially wild guesses.
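The dmesg check mentioned above could look roughly like this. It is only a sketch: the helper name kernel_errors and the pattern list are my own assumptions (a few message fragments that commonly accompany disk or memory trouble), and the path /var/log/kern.log assumes a default Debian syslog setup.

```shell
#!/bin/sh
# Hypothetical helper: keep only kernel log lines that often accompany
# disk or memory trouble. The pattern list is illustrative, not exhaustive.
kernel_errors() {
    grep -iE 'I/O error|medium error|EXT3-fs error|Out of memory|Oops|Kernel panic'
}

# Usage (as root, on the affected machine):
#   dmesg | kernel_errors
#   kernel_errors < /var/log/kern.log    # assumed default Debian location
```

An empty result does not prove the hardware is fine; a machine that freezes hard may never get the chance to write its last messages to disk, which is why the console is worth checking.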
My wild guess is hard drive failure; bring a bootable CD (Ubuntu is probably easiest) and run smartctl -a <hard drive device path>. You'll get a list of drive health parameters and, more importantly, a log of errors from the drive, if any.
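As a rough sketch of the SMART check described above: smartctl's -H, -A and -l error options print the health verdict, the attribute table and the drive's error log respectively. The function names and the device path /dev/sda are assumptions for illustration, and attribute names can differ between drive vendors, so treat the filter as a starting point only.

```shell
#!/bin/sh
# Print the health verdict, attribute table and error log for one drive.
# Needs root and the smartmontools package.
smart_report() {
    dev=$1
    smartctl -H "$dev"         # overall health self-assessment
    smartctl -A "$dev"         # vendor attribute table
    smartctl -l error "$dev"   # the drive's own error log
}

# Hypothetical filter: reads `smartctl -A` output on stdin and prints the
# name and raw value (10th column) of sector-related attributes whose raw
# value is non-zero.
flag_smart_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 { print $2, $10 }'
}

# Usage (device path is an assumption):
#   smart_report /dev/sda
#   smartctl -A /dev/sda | flag_smart_attrs
```

Non-zero raw values for reallocated or pending sectors are a common sign of a dying disk, but a clean table is not a guarantee of health.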
Also: ignore the person who suggested doing an OS upgrade. That is exceptionally dangerous advice.
Update: Yes, it's possible to clone a damaged drive, if you don't have good or recent backups. Look at GNU ddrescue. It's an advanced tool, though. If money is on the line, send it out for recovery, or at least hire a pro sysadmin who has experience with data recovery.
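For reference, a cautious ddrescue run is usually done in two passes, along the lines of the example in the GNU ddrescue manual. This is a sketch under assumptions: the helper names run and rescue_disk, the DRY_RUN switch, and the paths are all made up for illustration. Write the image to a healthy disk, never to the failing one.

```shell
#!/bin/sh
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Two-pass rescue of a failing disk into an image file.
# $1 = source device, $2 = image file on a healthy disk, $3 = map file.
rescue_disk() {
    src=$1
    img=$2
    map=$3
    # Pass 1: grab everything that reads easily; -n skips the slow final phase.
    run ddrescue -n "$src" "$img" "$map" || return 1
    # Pass 2: go back for the bad areas, with direct disc access (-d)
    # and up to 3 retry passes (-r3). The map file lets it resume.
    run ddrescue -d -r3 "$src" "$img" "$map"
}

# Usage (paths are assumptions):
#   DRY_RUN=1 rescue_disk /dev/sda /mnt/backup/sda.img /mnt/backup/sda.map
#   rescue_disk /dev/sda /mnt/backup/sda.img /mnt/backup/sda.map
```

Keeping the map file is the important design point: it records which areas were already copied, so an interrupted run can be continued without rereading the good parts of a fragile disk.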
It's possible this is a hardware issue: disk or memory errors, overheating (a clogged fan or air vents), or a network card that went bad. If there are no hardware errors, the first thing I would do is upgrade the system to lenny, then squeeze. It's possible that may automagically fix your problems.
I would also scan the system for bad blocks with badblocks (that's the command name). In mkfs.ext3 there exists the following option:
-c  Check the device for bad blocks before creating the file system.
So you will be able to avoid disk errors caused by bad blocks.
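A bad-block scan of the kind suggested above might be run like this. It is a minimal sketch: the function name scan_badblocks and the default list file name are assumptions, and the scan uses badblocks in its default read-only mode, ideally from a rescue boot with the file system unmounted.

```shell
#!/bin/sh
# Read-only surface scan of a block device; bad block numbers go to a file.
# $1 = device, $2 = output list (hypothetical default: badblocks.txt).
scan_badblocks() {
    dev=$1
    list=${2:-badblocks.txt}
    # Refuse anything that is not a block device, to catch typos early.
    if [ ! -b "$dev" ]; then
        echo "not a block device: $dev" >&2
        return 1
    fi
    # -s shows progress, -v is verbose, -o writes the list of bad blocks.
    badblocks -sv -o "$list" "$dev"
}

# Usage (device path is an assumption):
#   scan_badblocks /dev/sda1 sda1-bad.txt
```

On an existing ext3 file system, e2fsck -c can run the same kind of read-only scan and record the results in the file system's bad block list, so no reformat is needed just to mark them.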
Also consider running a memory test using memtest86 or memtest86+. If it finds errors and you feel adventurous, you can feed memtest's output to the kernel to map out any bad memory: http://rick.vanrein.org/linux/badram/
I know for a fact it works very well. I once had a bad DIMM which would predictably crash and burn the system at some point of memory allocation. After using memtest to find the bad memory area, I used the badram kernel parameter to map it out and the problem was solved.