We are running two separate subdomains, each on a separate external IP address, and each matched to its own kubernetes nginx service. The configuration looks like this:
#--------------------
# config for administrative nginx ssl termination deployment and associated service
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: admin-nginx
  labels:
    name: admin-nginx
spec:
  replicas: 1
  template:
    metadata:
      name: admin-nginx
      labels:
        name: admin-nginx
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: currentNodePool
      containers:
      - name: admin-nginx
        image: path/to/nginx-ssl-image:1
        ports:
        - name: admin-http
          containerPort: 80
        - name: admin-https
          containerPort: 443
---
apiVersion: v1
kind: Service
metadata:
  name: admin-nginx
spec:
  ports:
  - name: https
    port: 443
    targetPort: admin-https
    protocol: TCP
  - name: http
    port: 80
    targetPort: admin-http
    protocol: TCP
  selector:
    name: admin-nginx
  type: LoadBalancer
#--------------------
# config for our api's nginx ssl termination deployment and associated service
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: public-nginx
  labels:
    name: public-nginx
spec:
  replicas: 2
  strategy:
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        name: public-nginx
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: currentNodePool
      containers:
      - name: public-nginx
        image: path/to/nginx-ssl-image:1
        ports:
        - name: public-http
          containerPort: 80
        - name: public-https
          containerPort: 443
---
apiVersion: v1
kind: Service
metadata:
  name: public-nginx
spec:
  ports:
  - name: https
    port: 443
    targetPort: public-https
    protocol: TCP
  - name: http
    port: 80
    targetPort: public-http
    protocol: TCP
  selector:
    name: public-nginx
  type: LoadBalancer
#--------------------
Inside our kubernetes cluster, each of the nginx deployments fronts a custom API router/gateway that we use internally. These routers each expose a /health endpoint, for, ah, health checks. This will be important in a second.
Some of the details above have been elided; there is also a bit of configuration that makes nginx aware of the target service's address and port.
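For illustration only, here is a minimal sketch of what that elided piece might look like, assuming the router is exposed as an ordinary ClusterIP service; the service name, namespace, port, and certificate paths below are placeholders, not our real ones:

apiVersion: v1
kind: ConfigMap
metadata:
  name: admin-nginx-conf
data:
  default.conf: |
    server {
      listen 443 ssl;
      server_name admin.ourdomain.com;
      # placeholder certificate paths
      ssl_certificate     /etc/nginx/ssl/tls.crt;
      ssl_certificate_key /etc/nginx/ssl/tls.key;
      location / {
        # kube-dns name of the internal API router service (placeholder name and port)
        proxy_pass http://admin-api-router.default.svc.cluster.local:8080;
        proxy_set_header Host $host;
      }
    }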
The configuration above creates two load balancers, sort of. I guess, technically, it creates two forwarding rules, each with an associated external IP, and a target pool consisting of all the instances in our k8s cluster. This, generally speaking, should work fine. Each k8s-generated forwarding rule has an annotation in its description field, like this:
{"kubernetes.io/service-name":"default/admin-nginx"}
An associated firewall entry is created as well, with a similar annotation in its description field:
{"kubernetes.io/service-name":"default/admin-nginx", "kubernetes.io/service-ip":"external.ip.goes.here"}
Each external IP is then wired up to its own subdomain via CloudFlare's DNS service.
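(For reference, the address that ends up in DNS should be the same one kubernetes reports back on the Service once GCE has provisioned the forwarding rule; here is a sketch of the relevant excerpt from the admin-nginx service, with the address as a placeholder:)

# excerpt of the admin-nginx service as reported by the cluster
status:
  loadBalancer:
    ingress:
    - ip: external.ip.goes.here  # same address as in the firewall annotation above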
Ideally, the way this should all work, and the way it had worked in the past, is as follows.
An incoming request to admin.ourdomain.com/health returns the health status page for everything handled by the API router deployment that deals with admin stuff (well, the service pointing to the pods that implement that deployment, anyway). It gets there by way of the nginx pod, which is reached through the nginx service, which is reached through the description annotation on the forwarding rule, which in turn is reached through the GCE external IP machinery and the firewall, though I'm less clear on the ordering of that last part.
Like this:
server                              status   lookupMicros
https://adminservice0:PORT/health   Ok       910
https://adminservice1:PORT/health   Ok       100
https://adminservice2:PORT/health   Ok       200
https://adminservice3:PORT/health   Ok       876
And so on.
Meanwhile, a request to public.ourdomain.com/health should return pretty much the same thing, except for public services.
Like this:
server                          status   lookupMicros
https://service0:PORT/health    Ok       910
https://service1:PORT/health    Ok       100
https://service2:PORT/health    Ok       200
https://service3:PORT/health    Ok       876
Etc.
Pretty reasonable, right?
As best I understand it, the whole thing hinges on making sure a request to the admin subdomain, by way of the external address linked to the admin-annotated forwarding rule, eventually makes its way through GCE's network apparatus and into the kubernetes cluster, somewhere. It shouldn't matter where in the cluster it ends up first, as all of the nodes are aware of what services exist and where they are; kube-proxy on whichever node receives the request should steer it to the right service based on the destination IP and port.
Except... That's not what I'm seeing now. Instead, what I'm seeing is this: every couple of refreshes, admin.ourdomain.com/health, which is definitely on a different IP address than the public subdomain, returns the health page for the public subdomain instead of its own. That's bad.
On the bright side, for some reason I'm not seeing requests destined for the public subdomain's /health come back with results from the admin side, but it's pretty disturbing all the same.
Whatever is going on, it might also be interesting to note that requests made to the wrong side, like admin.ourdomain.com/publicendpoint, are 404'd correctly. I'd imagine that's because /health is the only endpoint that inherently belongs to the API router itself, and it bolsters the case that whatever is happening lies somewhere in the path from the GCE forwarding rule to the correct kubernetes service.
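(For what it's worth, the manifests above leave the choice of external address entirely to GKE. If the two addresses were reserved as static IPs in GCE, the intended IP-to-service mapping could at least be stated explicitly on the kubernetes side, along these lines, with the address as a placeholder:)

apiVersion: v1
kind: Service
metadata:
  name: admin-nginx
spec:
  type: LoadBalancer
  # placeholder: a static external address reserved in GCE ahead of time
  loadBalancerIP: external.admin.ip.goes.here
  selector:
    name: admin-nginx
  ports:
  - name: https
    port: 443
    targetPort: admin-https
    protocol: TCP
  - name: http
    port: 80
    targetPort: admin-http
    protocol: TCP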
So, I guess we finally get to the part where I ask a question. Here goes:
Why are requests through an external IP associated with a forwarding rule targeting a particular kubernetes service being sent, intermittently, to the wrong kubernetes service?
Any assistance or information on this issue would be greatly appreciated.