I currently have a strange problem: my WIND BOX DE500-5123L Atom D510 320GB 2048MB DVD SM HD4330
sometimes crashes, and I'm not sure where to start looking.
The kernel is Linux 2.6.26-2-vserver-686 #1 SMP Thu May 13 01:30:39 UTC 2010 i686 GNU/Linux
(the stock Debian kernel, unmodified).
The kernel log doesn't show me anything suspicious:
02:30:01 CRON[15102]: pam_unix(cron:session): session opened for user root by (uid=0)
02:30:01 /USR/SBIN/CRON[15104]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
02:30:01 CRON[15102]: pam_unix(cron:session): session closed for user root
02:31:01 kernel: [ 1974.992964] vxW: [»ck-collect-sess«,15715:#400|400|400] did lookup hidden f70b449c[#0,5] »/dev/pts«.
02:31:32 kernel: [ 2028.565867] vxW: [»console-kit-dae«,6459:#400|400|400] did lookup hidden f70b449c[#0,5] »/dev/pts«.
02:34:27 sshd[6137]: syslogin_perform_logout: logout() returned an error
02:34:27 sshd[6137]: pam_unix(sshd:session): session closed for user user
02:35:01 CRON[15865]: pam_unix(cron:session): session opened for user root by (uid=0)
02:35:01 /USR/SBIN/CRON[15866]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
02:35:01 CRON[15865]: pam_unix(cron:session): session closed for user root
18:31:34 kernel: imklog 3.18.6, log source = /proc/kmsg started.
The machine is suddenly just dead; the log simply stops. When I came home, it was completely powered off.
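To pin down when the box actually went silent, the gap between the last entry before the crash and the first entry after the reboot can be found mechanically. Here is a minimal sketch, assuming log lines that start with a bare `HH:MM:SS` timestamp as in the excerpt above (a real syslog line also carries the date, so the pattern would need adjusting). The name `find_gaps` and the 30 minute default are just illustrative choices.

```python
import re
from datetime import timedelta

# Matches a leading "HH:MM:SS " as in the excerpt above (assumed format).
TIME_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2})\s")


def find_gaps(lines, min_gap=timedelta(minutes=30)):
    """Return (previous_line, next_line, gap) for each silence of at least min_gap."""
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = TIME_RE.match(line)
        if not match:
            continue  # skip lines without a leading timestamp
        hours, minutes, seconds = (int(part) for part in match.groups())
        now = timedelta(hours=hours, minutes=minutes, seconds=seconds)
        if prev_time is not None:
            gap = now - prev_time
            if gap < timedelta(0):
                # No date in the line, so assume the clock wrapped past midnight once.
                gap += timedelta(days=1)
            if gap >= min_gap:
                gaps.append((prev_line, line, gap))
        prev_time, prev_line = now, line
    return gaps
```

Fed the last cron line (02:35:01) and the imklog startup line (18:31:34) from above, this reports a single gap of 15:56:33, which only bounds the time of the crash: the machine died somewhere after 02:35:01 and was switched back on at 18:31.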
I have munin installed and checked the graphs, but nothing jumped out at me. The only thing I remember is that I had started a job to compile Ruby, which takes quite some time on this machine (that's why the load is so high).
Munin sensors:
Here's the load:
Disk usage is OK, with enough space everywhere. I'm running around 6 virtual machines with linux-vserver, which do things like DNS (internal), MTA/IMAP, virus scanning and some HTTP. Besides SMTP, nothing is publicly accessible (the Linux machine is behind a Netgear router and only selected ports are forwarded).
I'm happy to provide more information and will update the question.
It seems the culprit really was the temperature: I opened the case, removed all the dust (and there was plenty of it) and restarted. An immediate drop in the temperatures can be seen:
I had suspected the temperature before, but actually thought it was OK. I think I found the motherboard specs at http://www.intel.com/Assets/PDF/prodbrief/322518.pdf and it says:
The operating temperature limit would already have been exceeded, the storage temperature limit not. But I have no idea what the difference between the two is.
I now assume the system detected the overheating and simply powered off immediately (without giving the OS a chance to shut down properly). I found nothing in the BIOS indicating this; maybe the shutdown was forced regardless.
The system runs stable now, and I need to keep an eye on the temperature and the dust around the system.
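For keeping an eye on the temperature without staring at the munin graphs, a small check run from cron might help. This is only a sketch: it assumes the sensor driver exposes readings through the kernel's hwmon sysfs files (`temp*_input`, in millidegrees Celsius), that a Python 3 interpreter is available, and the 70 °C limit is a made-up placeholder that has to be replaced with a value suitable for the board. The exact sysfs path differs between kernel versions, so both common layouts are tried.

```python
import glob

# Assumed sysfs locations; which one exists depends on kernel version and driver.
PATTERNS = (
    "/sys/class/hwmon/hwmon*/temp*_input",
    "/sys/class/hwmon/hwmon*/device/temp*_input",
)


def read_temps(patterns=PATTERNS):
    """Return {path: degrees Celsius} for every readable hwmon temperature input."""
    temps = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as handle:
                    # hwmon reports millidegrees Celsius as a plain integer.
                    temps[path] = int(handle.read().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # sensor missing or returned garbage; skip it
    return temps


def too_hot(temps, limit_c):
    """Return only the sensors at or above limit_c."""
    return {path: temp for path, temp in temps.items() if temp >= limit_c}


if __name__ == "__main__":
    LIMIT_C = 70.0  # hypothetical threshold, not taken from any spec sheet
    for path, temp in sorted(too_hot(read_temps(), LIMIT_C).items()):
        # cron mails any output to root, so printing is enough to raise an alert
        print("WARNING: %s reads %.1f C (limit %.1f C)" % (path, temp, LIMIT_C))
```

Run every few minutes from cron, this stays silent while everything is below the limit and produces a mail as soon as one sensor crosses it. If no hwmon files show up at all, the sensor modules are probably not loaded, and the script simply finds nothing to report.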