Currently running 28+ deployments on GKE, logging to Stackdriver via the automatic fluentd collection.
Master K8s Version: 1.10.6-gke.2
Node Version: 1.10.6-gke.2
Logging approximately 200 JSON strings per minute spread across multiple pods.
We are experiencing intermittent gaps in log ingestion in Stackdriver. When checking the logs with kubectl, we can clearly see the logs being produced in the correct structured logging format. Deleting pods / redeploying seems to temporarily resume ingestion.
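For reference, this is roughly how we verify the output on the node side (the pod and namespace names below are placeholders for our actual workloads):

```bash
# Tail an affected pod and confirm each line is a single structured JSON object
kubectl logs my-pod-7d9f6c5b4-abcde -n my-namespace --since=1h --timestamps | tail -n 20

# Optionally check that every recent line parses as valid JSON
kubectl logs my-pod-7d9f6c5b4-abcde -n my-namespace --since=1h | jq -c . > /dev/null
```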
I have checked the fluentd and autoscaler logs for errors, but both appear to be operating normally.
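For what it's worth, this is roughly how I checked them, assuming the default fluentd-gcp DaemonSet labels and the fluentd-gcp-scaler deployment in kube-system (container names can differ between GKE versions):

```bash
# One fluentd-gcp pod should be running per node
kubectl get pods -n kube-system -l k8s-app=fluentd-gcp -o wide

# Scan the fluentd containers for errors, retries, or dropped buffer chunks
kubectl logs -n kube-system -l k8s-app=fluentd-gcp -c fluentd-gcp --since=24h | grep -iE "error|retry|drop"

# And the autoscaler
kubectl logs -n kube-system deployment/fluentd-gcp-scaler --since=24h
```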
The Stackdriver API quotas page in the Cloud Console indicates that some requests to the Logging API have failed, although I have no idea how to surface the errors being produced by Stackdriver when using GKE's automatic log ingestion. The error rate reported on the quotas page is also nowhere near enough to explain the gaps, which sometimes exceed 24 hours.
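The closest I have gotten to probing this from the CLI is comparing what actually reached Stackdriver against the kubectl output for the same pod; a sketch below, assuming the legacy "container" resource type (newer Stackdriver Kubernetes Monitoring uses "k8s_container") and placeholder cluster/pod names:

```bash
# Show the timestamps of the most recent entries Stackdriver ingested for one pod;
# a stale latest timestamp here while `kubectl logs` shows fresh output confirms the gap
gcloud logging read \
  'resource.type="container"
   AND resource.labels.cluster_name="my-cluster"
   AND resource.labels.pod_id="my-pod-7d9f6c5b4-abcde"' \
  --freshness=2d --limit=5 --format='value(timestamp)'
```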
This is not expected behavior. If you have a support plan, please open a case for further investigation; otherwise, please open a bug report here.