I am trying to run a Snakemake pipeline on a Kubernetes cluster (GKE). The job is initiated from a GCE VM. Sometimes it works, but mostly it doesn't.
The steps I took were:
gcloud container clusters get-credentials snakemake-k8s-demo
kubectl delete pod $(kubectl get pods | grep snakejob|colprint 1)
snakemake --kubernetes --container-image eu.gcr.io/scailyte-is/snakemake-gsdk --use-conda --default-remote-provider GS --default-remote-prefix xxxxxx-snakemake-test-1 --jobs 2
This first try worked very well.
I then deleted the files created by the snakemake pipeline and ran the identical job again without changing anything.
The job failed with the following error message:
HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /storage/v1/b/xxxxxxx-snakemake-test-1/o?projection=noAcl (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f3ee35159d0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution'))
According to the Google Cloud Status Dashboard, there are no problems with Google Cloud Storage.
Subsequent attempts failed in the same way.
Any tips for a resolution gratefully accepted.
The problem was caused by a lack of firewall rules allowing intra-cluster communication. Once the appropriate ALLOW rules had been created, the immediate problem went away, because the pod was able to reach its DNS server.
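For anyone hitting the same error, one way to confirm that name resolution is what fails is to test it directly from inside a pod. This is a minimal sketch, assuming the container image ships `getent` (common on glibc-based images, but I have not checked it for the image used here). `check_dns` is a hypothetical helper name.

```shell
# Hypothetical helper: report whether a hostname resolves in the current environment.
check_dns() {
  host="$1"
  # getent uses the system resolver, so it follows the same DNS path as the application.
  if getent hosts "$host" >/dev/null 2>&1; then
    echo "resolved: $host"
  else
    echo "failed to resolve: $host"
    return 1
  fi
}
```

After opening a shell in a running pod (for example with `kubectl exec -it <pod> -- sh`), calling `check_dns storage.googleapis.com` should show whether the pod can reach its DNS server at all.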
Our policy is to forbid all communication (including VPC-internal) between endpoints unless it is explicitly allowed. So the new cluster, which sits in a new address space, needed additional rules before it could work properly.
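As an illustration of the kind of rule involved, here is a rough sketch that allows DNS traffic between the cluster's address ranges. It is not the exact rule set I created: the wrapper name `allow_cluster_dns`, the rule name, the network name and the CIDR ranges are all placeholders, and a real cluster will probably need more than port 53 opened.

```shell
# Hypothetical wrapper around `gcloud compute firewall-rules create`.
# Arguments: rule name, VPC network name, comma-separated source CIDR ranges.
allow_cluster_dns() {
  rule_name="$1"
  network="$2"
  source_ranges="$3"
  # Allow DNS (TCP and UDP port 53) from the given ranges.
  gcloud compute firewall-rules create "$rule_name" \
    --network "$network" \
    --direction INGRESS \
    --action ALLOW \
    --rules tcp:53,udp:53 \
    --source-ranges "$source_ranges"
}

# Example call with placeholder values (substitute the cluster's node and pod ranges):
# allow_cluster_dns allow-gke-dns my-vpc 10.0.0.0/20,10.4.0.0/14
```

Restricting the source ranges to the cluster's own node and pod ranges keeps the rule consistent with a deny-by-default policy.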
I figured this out by enabling logging on the final DENY rule that fires when no previous ALLOW rule was applicable.
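The logging step can be sketched roughly as follows. The function names and the rule name in the example call are placeholders, and the log filter is my assumption about how Firewall Rules Logging entries are labelled, so it may need adjusting for a given project.

```shell
# Hypothetical helper: turn on Firewall Rules Logging for an existing rule.
enable_rule_logging() {
  gcloud compute firewall-rules update "$1" --enable-logging
}

# Hypothetical helper: list recent connections that a logged rule denied.
# Optional argument: maximum number of entries to show (defaults to 20).
show_denied_connections() {
  gcloud logging read \
    'logName:"compute.googleapis.com%2Ffirewall" AND jsonPayload.disposition="DENIED"' \
    --limit "${1:-20}" \
    --format json
}

# Example calls with a placeholder rule name:
# enable_rule_logging deny-all-ingress
# show_denied_connections 50
```

The denied entries include source and destination addresses and ports, which is what points to the missing ALLOW rule.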
I haven't worked out why it worked the first time, but I suspect there were some differences between the runs that I didn't make a note of.