We have a number of Compute Engine VMs on GCP with static external IP addresses. Every now and then, the external IP address of a VM becomes unreachable: HTTP, SSH and ICMP connections are not accepted. The outage is usually discovered (or possibly triggered?) by an SSH connection attempt. The server is still alive; I can connect via the serial console and verify that. Outbound connections from the VM still work (we have cron jobs that check files on the internet, and they run and complete gracefully during these outages), so this is not a VM NIC problem.
After some time (roughly 10 minutes) the external IP becomes reachable again by itself.
Any ideas on how to investigate the root cause further?
It turned out not to be a GCP problem at all. Our VMs run Ubuntu, which by default installs sshguard. Sshguard blocks an IP address if it detects a burst of connection failures from it.
The "outage" appeared each time I ran Ansible to update the VM configuration and forgot to add my private SSH key to the authentication agent. Ansible made multiple attempts to connect, failing each time. Sshguard did not like this and blocked my IP address on all ports and protocols.
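For anyone who wants to confirm the same cause on their own VM, one option is to look for sshguard block events in the system log (for example over the serial console while the "outage" is ongoing). Below is a minimal Python sketch of that idea. The log line shape is an assumption: sshguard's wording differs between versions, so the regular expression may need adjusting, and it only handles IPv4 addresses. The function name `blocked_ips` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: block events are logged by a process tagged "sshguard" and
# contain the word "Blocking" followed by an (optionally quoted) IPv4 address.
# Adjust this pattern to match what your sshguard version actually logs.
BLOCK_RE = re.compile(r'sshguard.*Blocking "?(\d{1,3}(?:\.\d{1,3}){3})')


def blocked_ips(lines):
    """Count how many times each IPv4 address appears in a block event."""
    counts = Counter()
    for line in lines:
        match = BLOCK_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, not taken from a real system.
SAMPLE = [
    'Jan 10 10:00:01 vm1 sshguard[812]: Attack from "203.0.113.7" on service SSH with danger 10.',
    'Jan 10 10:00:02 vm1 sshguard[812]: Blocking "203.0.113.7/32" for 120 secs (3 attacks in 1 secs, after 1 abuses over 1 secs.)',
    'Jan 10 10:05:00 vm1 sshd[900]: Accepted publickey for deploy from 198.51.100.4 port 50514 ssh2',
]

if __name__ == "__main__":
    for ip, times in blocked_ips(SAMPLE).items():
        print(f"{ip} was blocked {times} time(s)")
```

To use it on a real machine, pass the lines of whichever log file sshguard writes to on your system instead of `SAMPLE`. If your own client IP shows up right when the VM became unreachable, sshguard is the likely culprit.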