Context
The company sells access to a sort of cash register web application. Access to the application is given through a VPN. The VPN entry point for the clients is a Soekris board running Voyage Linux (a trimmed-down version of Debian). For 3 years these boards have been running MySQL with replication and a RoR application stack.
The storage medium for these boards is a 4GB CompactFlash card.
The problem
We are getting regular errors and random application crashes on these boards. The most frequent errors are the following:
Aug 24 14:54:44 box45 puppetd[3669]: Could not run Puppet::Network::Client::Master: Stale NFS file handle - /var/lib/puppet/state/state.yaml
Aug 24 13:37:01 box76 kernel: [ 2091.575622] EXT2-fs error (device hda1): read_block_bitmap: Cannot read block bitmap - block_group = 30, block_bitmap = 983040
If these were HDD-based, I would run SMART monitoring tools to check for bad sectors and general disk health. But because they are CF cards, I am in the dark and have difficulty measuring how bad (or good!) the situation is.
What can I do to monitor the health of these cards and measure it? I stress "measure" because I need hard facts that will eventually justify replacing all the CF cards.
To make things a little more complex, I do not have physical access to the Soekris boards, so all of this needs to be done remotely.
The errors seem to point pretty solidly to a problem with a section of the CF card media. If a card has been running for some time without any problems and is now giving these errors, I'd think that it has started going bad. The easiest way to test is to send a tech out with a replacement card and swap it out, especially if you're seeing this on a limited number of the systems. All media have lifespans and failure rates; the more read/write cycles you have going to the cards, the sooner they'll die.
Another thing to look at: are the errors in reading near the same spot(s) each time? That would tell me it's probably a bad cell as well in a specific part of the card.
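To check whether the errors cluster in the same spot, one option is to tally the kernel messages per box, device, and block group. The following is only a minimal sketch: the regular expression is modelled on the single EXT2-fs line quoted in the question, so other error messages may need a different pattern, and the function name count_bitmap_errors is made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the one kernel line quoted in the question (assumption:
# other EXT2-fs messages may carry different fields and will not match).
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]+)\s(?P<host>\S+)\skernel:.*"
    r"EXT2-fs error \(device (?P<dev>\w+)\):.*"
    r"block_group = (?P<group>\d+), block_bitmap = (?P<bitmap>\d+)"
)


def count_bitmap_errors(lines):
    """Count matching errors per (host, device, block_group)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            key = (match.group("host"), match.group("dev"), int(match.group("group")))
            counts[key] += 1
    return counts
```

Feeding it the collected syslog files from all boxes (for example `count_bitmap_errors(open(path))` for each file) gives a per-box count. If the same block group keeps coming back on a given box, that points to a localized bad area on that card; errors scattered across many groups suggest more general wear.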
I don't know if fsck would work the same way on these cards or not. My first inclination seeing that error is to replace it.
Why in the world would you run things off of CF cards? Use solid-state media meant for the purpose if you need flash storage. CF cards are not built to standards that include health monitoring. The most you can do is run a disk check and scan for bad sectors.
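A bad-sector check of this kind can be run remotely as a read-only pass over the device. The usual tool is badblocks from e2fsprogs, which is read-only by default. If it is not installed, the idea can be sketched as below, assuming Python 3 is available on the box (it may not be on a trimmed-down image). The function name scan_read_only is hypothetical, and the device path would be something like /dev/hda, going by the hda1 in the kernel log. Reads may be served from the page cache, so a clean pass is weaker evidence than a failed one.

```python
import os


def scan_read_only(path, chunk_size=1024 * 1024):
    """Read `path` sequentially without writing to it.

    Returns (bytes_read_ok, failed_offsets), where failed_offsets lists the
    starting offset of every chunk whose read raised an I/O error.
    """
    failed_offsets = []
    bytes_read_ok = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # works for files and block devices
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError:
                # Unreadable chunk: record it and skip ahead.
                failed_offsets.append(offset)
                offset += want
                continue
            if not data:
                break  # unexpected end of device
            bytes_read_ok += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read_ok, failed_offsets
```

Run as root against the whole device, the number and position of failed chunks per box is a figure that can be compared across the fleet and over time. A smaller chunk_size narrows down the location of a failure at the cost of a slower scan.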