I have a CentOS 7 (1602) template from which I deployed roughly 200 VMs before I noticed the issue, so it would be ideal to fix these VMs rather than start from scratch.
The VMs 'randomly' fail, usually between 7PM and 11PM, sometimes two nights in a row, sometimes not for a week or two. When one VM fails, most of them also fail. They seem to lose disk access. Rebooting the VM immediately solves the issue and it does not recur for at least 24 hours. Even when we don't reboot them until the next day, they still reboot during this time period.
Some of the VMs have nothing installed on them and still have this issue. Root partition and boot partition are hardly used. Logs show no issues.
No other VMs are affected, only those built from this particular CentOS template. We are using VMware 4 (I know, I know), but we have never had any issues other than this, and new images have no issue. I see no spikes in CPU or disk use in VMware around the failure.
Here is a screenshot as it fails:
Here is a screenshot when trying to access the VM after a number of minutes has elapsed:
Example bootstrap script used on these servers: http://pastebin.com/gs3AzV5m
This is probably due to OS support or a resource issue. EL7 was not intended for use with vSphere 4. The VMware support matrix reinforces this.
I see you're using open-vm-tools, but it looks like you may have a deeper issue. See: https://access.redhat.com/solutions/21849
and: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009996
200 VMs is a large number, and vSphere 4 is an old release. I couldn't imagine starting a new rollout on such an old release of vSphere, and I'm sure you're no longer under VMware support.
Are you heavily overcommitted to the point where your system is killing itself?
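Since the guests appear to lose disk access, one thing that may be worth comparing between an affected VM and a healthy one is the SCSI disk timeout the guest kernel is using. The sketch below is only an illustration under my own assumptions: it assumes the guest disks show up as sd* devices and that the kernel exposes the value at /sys/block/<dev>/device/timeout. The function name read_disk_timeouts is hypothetical, and nothing here is confirmed as the cause of your failures.

```python
from pathlib import Path


def read_disk_timeouts(sysfs_block="/sys/block"):
    """Return {device_name: timeout_seconds} for sd* disks.

    Assumption: disks appear as sd* and expose device/timeout in sysfs.
    Devices without a readable integer timeout are skipped.
    """
    timeouts = {}
    for dev in sorted(Path(sysfs_block).glob("sd*")):
        timeout_file = dev / "device" / "timeout"
        try:
            timeouts[dev.name] = int(timeout_file.read_text().strip())
        except (OSError, ValueError):
            # Missing file, permission problem, or non-numeric content.
            continue
    return timeouts


if __name__ == "__main__":
    for name, seconds in read_disk_timeouts().items():
        print(f"{name}: {seconds}s")
```

Running this on a VM from the problem template and on one of the newer images that behaves would show whether the two differ; if they report the same values, this is probably not the place to look.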