On cloud platforms you often hear that because of high load on neighboring VMs, disks over oversubscribed ethernet, backups, or live migration to other hardware, the virtual machine can 'freeze' for a moment.
I suspect this is happening to one of our Ubuntu virtual machines at a cloud provider I'd rather not publicly shame.
Every night it's unavailable to external monitoring services. The machine itself looks healthy in terms of load, traffic, etc. The provider suggests the network is fine though.
I would like to be able to (dis)prove that VM freezes are causing these pager alerts.
One idea I had was to write the date to a log every second and, after a short moment of unavailability, see if we skipped a 'beat'.
However, that seems flawed: what if the VM maintains its own clock and is allowed to drift from the host's hardware clock?
If our internal clock freezes along with the VM, we'd still have a nice sequence of seconds in that log file, and a clock that's now behind on the real time.
Is there a better way / tool I can use to determine that there are machine freezes?
I would guess that a difference between real time and our time would be a tell; then again, there are other causes for drifting clocks.
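For reference, the once-per-second log idea could be sketched roughly as below. This is only an illustration: the file name heartbeat.log, the helper names and the 0.5 second tolerance are all made up for the example. It logs both the wall clock and the monotonic clock, and it has exactly the weakness described above: if the guest's clocks are paused together with the VM, no gap will show up.

```python
import time
from datetime import datetime, timezone


def find_gaps(timestamps, interval=1.0, tolerance=0.5):
    """Return (previous, current, delta) for each pair of consecutive
    timestamps that lie more than interval + tolerance seconds apart."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if delta > interval + tolerance:
            gaps.append((prev, cur, delta))
    return gaps


def run(path="heartbeat.log", count=60, interval=1.0):
    """Append `count` heartbeat lines, one per interval.

    Each line holds the UTC wall-clock time and the monotonic clock
    reading, so the two can be compared afterwards.
    """
    with open(path, "a", buffering=1) as log:  # line-buffered text file
        for _ in range(count):
            wall = datetime.now(timezone.utc).isoformat()
            log.write(f"{wall} {time.monotonic():.3f}\n")
            time.sleep(interval)
```

Calling run() on the VM and later feeding the logged monotonic values to find_gaps() would list every interval in which a beat was skipped.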
I think you're on the right track with writing the time to a log file every second, but for the reasons you pointed out, that alone may not be reliable. In addition to writing the time to a local disk, why not have your cron job reach out to a known stable system over the network and have that system log the request to disk? Something as simple as wget could work, assuming you make an HTTP request to a system that logs its requests. Ideally the target system should be relatively "close" to the suspect system network-wise, but either way this should get you some debugging data.
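If you'd rather script it than call wget, a minimal sketch of the same idea in Python might look like this. The URL http://monitor.example.com/heartbeat is a placeholder and the function names are invented for the example. The point is that the remote web server's access log timestamps each request with its own clock, while the query string carries the VM's idea of the time, so the two can be compared afterwards.

```python
import time
import urllib.error
import urllib.parse
import urllib.request


def build_heartbeat_url(base_url, seq, local_time):
    """Encode a sequence number and the VM's local time in the query
    string so both end up in the remote server's access log."""
    query = urllib.parse.urlencode({"seq": seq, "t": f"{local_time:.3f}"})
    return f"{base_url}?{query}"


def send_heartbeat(url, timeout=2.0):
    """Return the HTTP status code, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except (urllib.error.URLError, OSError):
        return None


def run(base_url, count, interval=1.0):
    """Send `count` heartbeats, printing sequence number and result."""
    for seq in range(count):
        url = build_heartbeat_url(base_url, seq, time.time())
        print(f"{seq} {send_heartbeat(url)}", flush=True)
        time.sleep(interval)
```

For example, run("http://monitor.example.com/heartbeat", 3600) would send one request per second for about an hour. A jump in the remote log's timestamps with no missing seq values would point to a pause on the VM side, whereas failed requests would point more towards the network.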
You could use Nagios, an IT monitoring solution. With it, you can check the CPU load (and a lot of other things) and receive alerts by mail or in the web console. You have to install the server on your PC and the remote plugin executor (NRPE) on the virtual machine.
Here's a pretty cool tutorial: http://www.tecmint.com/how-to-add-linux-host-to-nagios-monitoring-server/
I would ping the VM from an external host. If the VM freezes, its network stack should freeze too, and that should show up as a gap in the sequence of logged pings.
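Something along these lines could do the logging, run on the external host. It is only a sketch: it assumes the Linux iputils ping (where -c is the packet count and -W the reply timeout in seconds), and the function names are made up for the example.

```python
import subprocess
import time


def ping_once(host, timeout_s=1):
    """Send one ICMP echo request; True if a reply arrived in time.

    Assumes Linux iputils ping: -c = count, -W = reply timeout (seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_outages(samples):
    """samples is a list of (timestamp, ok) tuples.

    Return (first_failure, last_failure, failure_count) for each run
    of consecutive failed pings.
    """
    outages = []
    start = last = None
    count = 0
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
                count = 0
            last = ts
            count += 1
        elif start is not None:
            outages.append((start, last, count))
            start = None
    if start is not None:  # still failing when the samples end
        outages.append((start, last, count))
    return outages


def monitor(host, count, interval=1.0):
    """Ping `host` `count` times and return the (timestamp, ok) samples."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), ping_once(host)))
        time.sleep(interval)
    return samples
```

Keep in mind that a run of failed pings only shows the VM was unreachable from that host. On its own it cannot tell a frozen VM from a network problem, so it is most useful when compared against what the VM itself logged during the same window.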