Although this question involves an embedded motherboard, serverfault seemed to me the best stackexchange forum to post to. My colleagues and I have been investigating this weird Linux boot problem for a few months now, and we're kinda stuck. Any suggestions are appreciated.
We have a Portwell motherboard (single Atom core with hyperthreading), running CentOS 6.4. A customer has come back to us with a really weird boot problem that we've finally been able to reproduce.
Everything works fine if you do this:
- Boot up normally
- Halt the system
- Unplug power from the system for any period of time
- Plug in power
- Boot up
However, if you do the following, the usual fsck that is run as a normal part of boot will give us an error:
- Boot up normally
- Shut down normally
- DO NOT DISCONNECT POWER
- Wait 8 to 12 hours (shorter doesn't seem to cause the problem)
- Boot up again
The error we get is as in the following image:
We can press ctrl-D and reboot as many times as we want, and the error will keep coming back. But note that there's nothing wrong with the filesystem.
We can get the error to go away this way:
- Shut down
- Pull power for at least 10 minutes (much shorter, and the problem doesn't go away)
- Plug power back in and boot up.
Our hypothesis as of yesterday was that the hard drive might be spun down but not off, and over time, the disk cache might get degraded. However, that turns out to be false because the following procedure also makes the problem go away:
- Get the system to boot up so that it gives the error
- Do not disconnect power
- Go into the BIOS and change the system date to some time in the future
- Boot into Linux.
This motherboard does not have a battery, so when we first power on, it always comes up with the wrong date, some time in January of 2010. So in the usual case, the date is wrong, but the OS boots up normally. When the OS comes up, the date is set properly by NTP. If we leave it plugged in but off for 12 hours, the date gets reset again, but for some reason, fsck now cares about the date and wants to do a manual fsck because it considers that discrepancy to be a major problem. If we manually change the date to the future, it boots fine. If we change it back to the past, it errors again. But if we disconnect power for long enough and boot up, we get no error despite the fact that the date is wrong.
Can anyone help us reason through the various things that fsck might be looking at such that it decides sometimes to error out due to the date being wrong but never if the system date is in the future?
If we can program the BIOS to default to some date in the far future, that might solve this problem, but it's important to understand why it's happening, because we don't just want to stick on a bandaid and hope.
Thanks for any suggestions.
I found the answer to this here: https://unix.stackexchange.com/questions/8409/how-can-i-avoid-run-fsck-manually-messages-while-allowing-experimenting-with-s
Apparently, because the system clock is "broken," we have to put
    broken_system_clock = true
in the [options] section of /etc/e2fsck.conf.
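For anyone wondering why the date matters at all: as far as I understand it, e2fsck compares the last-mount and last-write times stored in the superblock against the current clock, and complains when those timestamps appear to be in the future. The Python below is only a rough sketch of that kind of check, not e2fsck's actual code. The function name, the parameter names, and the 24-hour tolerance are illustrative assumptions. It also does not model whatever lets the freshly powered-on case through.

```python
from datetime import datetime, timedelta


def timestamps_look_sane(last_mount, last_write, now,
                         fudge=timedelta(hours=24),
                         broken_system_clock=False):
    """Return True if the superblock times are plausible given the clock.

    Hypothetical model of a time-based fsck check: a filesystem that was
    mounted or written "in the future" relative to `now` looks suspicious.
    """
    if broken_system_clock:
        # The clock is declared untrustworthy, so skip time-based checks.
        return True
    latest_allowed = now + fudge
    return last_mount <= latest_allowed and last_write <= latest_allowed


# Filesystem last mounted after NTP set the real date (2013 here, as an
# example), then the battery-less RTC fell back to January 2010.
mounted = datetime(2013, 6, 1, 12, 0)
rtc_reset = datetime(2010, 1, 15, 0, 0)

print(timestamps_look_sane(mounted, mounted, rtc_reset))   # False
print(timestamps_look_sane(mounted, mounted, datetime(2020, 1, 1)))  # True
print(timestamps_look_sane(mounted, mounted, rtc_reset,
                           broken_system_clock=True))      # True
```

This matches what we saw when changing the BIOS date by hand: a clock set to the future makes every stored timestamp look like the past, so the check passes, while a clock set back to 2010 makes them look like the future and the check fails.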