I have two CoreOS stable v1122.2.0 machines, each with etcd2 configured with TLS.
I created the certificates using https://github.com/coreos/etcd/tree/master/hack/tls-setup.
Now I'm trying to configure calico-node to run on my CoreOS master node with rkt.
I have the following in my cloud-config:
write_files:
- path: "/etc/kubernetes/cni/net.d/10-calico.conf"
content: |
{
"name": "calico",
"type": "flannel",
"delegate": {
"type": "calico",
"etcd_endpoints": "https://10.79.218.2:2379,https://10.79.218.3:2379",
"log_level": "none",
"log_level_stderr": "info",
"hostname": "10.79.218.2",
"policy": {
"type": "k8s",
"k8s_api_root": "http://127.0.0.1:8080/api/v1/"
}
}
}
- path: "/etc/kubernetes/manifests/policy-controller.yaml"
content: |
apiVersion: v1
kind: Pod
metadata:
name: calico-policy-controller
namespace: calico-system
spec:
hostNetwork: true
containers:
# The Calico policy controller.
- name: k8s-policy-controller
image: calico/kube-policy-controller:v0.2.0
env:
- name: ETCD_ENDPOINTS
value: "https://10.79.218.2:2379,https://10.79.218.3:2379"
- name: K8S_API
value: "http://127.0.0.1:8080"
- name: LEADER_ELECTION
value: "true"
# Leader election container used by the policy controller.
- name: leader-elector
image: quay.io/calico/leader-elector:v0.1.0
imagePullPolicy: IfNotPresent
args:
- "--election=calico-policy-election"
- "--election-namespace=calico-system"
- "--http=127.0.0.1:4040"
...
units:
- name: calico-node.service
enable: true
command: start
content: |
[Unit]
Description=Calico per-host agent
Requires=network-online.target
After=network-online.target
[Service]
Slice=machine.slice
Environment=CALICO_DISABLE_FILE_LOGGING=true
Environment=HOSTNAME=10.79.218.2
Environment=IP=10.79.218.2
Environment=FELIX_FELIXHOSTNAME=10.79.218.2
Environment=CALICO_NETWORKING=false
Environment=NO_DEFAULT_POOLS=true
Environment=ETCD_ENDPOINTS=https://10.79.218.2:2379,https://10.79.218.3:2379
ExecStart=/usr/bin/rkt run --inherit-env --stage1-from-dir=stage1-fly.aci \
--volume=modules,kind=host,source=/lib/modules,readOnly=false \
--mount=volume=modules,target=/lib/modules \
--trust-keys-from-https quay.io/calico/node:v0.19.0
KillMode=mixed
Restart=always
TimeoutStartSec=0
[Install]
WantedBy=multi-user.target
Please ignore the indentation; I don't think I copied and pasted it properly :)
When I try to start the calico-node service I get the following error:
Sep 14 05:45:17 localhost systemd[1]: Started Calico per-host agent.
Sep 14 05:45:17 localhost rkt[1644]: image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
Sep 14 05:45:18 localhost rkt[1644]: image: using image from local store for image name quay.io/calico/node:v0.19.0
Sep 14 05:45:25 localhost rkt[1644]: Traceback (most recent call last):
Sep 14 05:45:25 localhost rkt[1644]: File "startup.py", line 292, in <module>
Sep 14 05:45:25 localhost rkt[1644]: client = IPAMClient()
Sep 14 05:45:25 localhost rkt[1644]: File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 228, in __init__
Sep 14 05:45:25 localhost rkt[1644]: "%s" % (ETCD_CA_CERT_FILE_ENV, etcd_ca))
Sep 14 05:45:25 localhost rkt[1644]: pycalico.datastore_errors.DataStoreError: Invalid ETCD_CA_CERT_FILE. Certificate Authority cert is required and m
Sep 14 05:45:25 localhost rkt[1644]: Calico node failed to start
Sep 14 05:45:25 localhost systemd[1]: calico-node.service: Main process exited, code=exited, status=1/FAILURE
Sep 14 05:45:25 localhost systemd[1]: calico-node.service: Unit entered failed state.
Sep 14 05:45:25 localhost systemd[1]: calico-node.service: Failed with result 'exit-code'.
Sep 14 05:45:25 localhost systemd[1]: calico-node.service: Service hold-off time over, scheduling restart.
Sep 14 05:45:25 localhost systemd[1]: Stopped Calico per-host agent.
Sep 14 05:45:25 localhost systemd[1]: Started Calico per-host agent.
Sep 14 05:45:25 localhost rkt[1714]: image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
Sep 14 05:45:26 localhost rkt[1714]: image: using image from local store for image name quay.io/calico/node:v0.19.0
Sep 14 05:45:28 localhost rkt[1714]: Traceback (most recent call last):
Sep 14 05:45:28 localhost rkt[1714]: File "startup.py", line 292, in <module>
Sep 14 05:45:28 localhost rkt[1714]: client = IPAMClient()
Sep 14 05:45:28 localhost rkt[1714]: File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 228, in __init__
Sep 14 05:45:28 localhost rkt[1714]: "%s" % (ETCD_CA_CERT_FILE_ENV, etcd_ca))
Sep 14 05:45:28 localhost rkt[1714]: pycalico.datastore_errors.DataStoreError: Invalid ETCD_CA_CERT_FILE. Certificate Authority cert is required and m
So I get Invalid ETCD_CA_CERT_FILE. I never told Calico which keys to use, so I guess I'm missing some configuration.
I have the following etcd-related keys in /etc/ssl/etcd:
8 -rw-------. 1 etcd etcd 1050 Sep 14 05:45 ca.pem
8 -rw-------. 1 etcd etcd 289 Sep 14 05:45 etcd1-key.pem
8 -rw-------. 1 etcd etcd 1058 Sep 14 05:45 etcd1.pem
8 -rw-------. 1 etcd etcd 227 Sep 12 03:49 server1-key.pem
8 -rw-------. 1 etcd etcd 822 Sep 12 03:49 server1.pem
I tried adding Environment=ETCD_CA_CERT_FILE=/etc/ssl/etcd/ca.pem to the calico-node systemd unit, but I get exactly the same result.
Any ideas?
update
So I tried running Calico manually rather than through systemd, after exporting all the environment variables that Calico requires:
export CALICO_DISABLE_FILE_LOGGING=true
export HOSTNAME=10.79.218.2
export IP=10.79.218.2
export FELIX_FELIXHOSTNAME=10.79.218.2
export CALICO_NETWORKING=false
export NO_DEFAULT_POOLS=true
export ETCD_ENDPOINTS=https://10.79.218.2:2379,https://10.79.218.3:2379
export ETCD_AUTHORITY=10.79.218.2:2379
export ETCD_SCHEME=https
export ETCD_CA_CERT_FILE=/etc/ssl/etcd/ca.pem
export ETCD_CERT_FILE=/etc/ssl/etcd/etcd1.pem
export ETCD_KEY_FILE=/etc/ssl/etcd/etcd1-key.pem
When I try to run the Calico container with:
/usr/bin/rkt run --inherit-env --stage1-from-dir=stage1-fly.aci \
--volume=modules,kind=host,source=/lib/modules,readOnly=false \
--mount=volume=modules,target=/lib/modules \
--trust-keys-from-https quay.io/calico/node:v0.19.0
I get
image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
image: using image from local store for image name quay.io/calico/node:v0.19.0
Traceback (most recent call last):
File "startup.py", line 292, in <module>
client = IPAMClient()
File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 221, in __init__
ETCD_CERT_FILE_ENV, etcd_cert))
pycalico.datastore_errors.DataStoreError: Cannot read ETCD_KEY_FILE and/or ETCD_CERT_FILE. Both must be readable file paths. Values provided: ETCD_KEY_FILE=/etc/ssl/etcd/etcd1-key.pem, ETCD_CERT_FILE=/etc/ssl/etcd/etcd1.pem
I changed the permissions of the certificate files to 666, but that doesn't resolve the issue. I know these certificates are valid because etcd TLS works properly. So what am I missing?
update 2
It appears I had not mounted the certificates directory into the Calico container.
So now I'm running the Calico container with:
/usr/bin/rkt run --volume etcd-ssl,kind=host,source=/etc/ssl/etcd/,readOnly=true --inherit-env --stage1-from-dir=stage1-fly.aci --volume=modules,kind=host,source=/lib/modules,readOnly=false --mount=volume=modules,target=/lib/modules --trust-keys-from-https quay.io/calico/node:v0.19.0 --mount volume=etcd-ssl,target=/etc/ssl/etcd
I get the following output:
image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
image: using image from local store for image name quay.io/calico/node:v0.19.0
Traceback (most recent call last):
File "startup.py", line 292, in <module>
client = IPAMClient()
File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 246, in __init__
allow_reconnect=True)
File "/usr/lib/python2.7/site-packages/etcd/client.py", line 204, in __init__
set(self.machines))
File "/usr/lib/python2.7/site-packages/etcd/client.py", line 299, in machines
return self.machines
File "/usr/lib/python2.7/site-packages/etcd/client.py", line 301, in machines
raise etcd.EtcdException("Could not get the list of servers, "
etcd.EtcdException: Could not get the list of servers, maybe you provided the wrong host(s) to connect to?
Calico node failed to start
I'm a bit closer.. but still no solution.
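For reference, the volume mount from update 2 could be carried back into the systemd unit with a drop-in along these lines. This is only a sketch under assumptions: the helper name write_calico_dropin and the file name 10-etcd-tls.conf are made up for illustration, and the certificate paths are the ones listed above.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in for calico-node.service that passes the
# etcd TLS settings and mounts the certificate directory into the container.
# write_calico_dropin and 10-etcd-tls.conf are hypothetical names.
write_calico_dropin() {
    dropin_dir="$1"   # e.g. /etc/systemd/system/calico-node.service.d
    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/10-etcd-tls.conf" <<'EOF'
[Service]
Environment=ETCD_SCHEME=https
Environment=ETCD_CA_CERT_FILE=/etc/ssl/etcd/ca.pem
Environment=ETCD_CERT_FILE=/etc/ssl/etcd/etcd1.pem
Environment=ETCD_KEY_FILE=/etc/ssl/etcd/etcd1-key.pem
# An empty ExecStart= clears the command inherited from the main unit.
ExecStart=
ExecStart=/usr/bin/rkt run --inherit-env --stage1-from-dir=stage1-fly.aci \
  --volume=modules,kind=host,source=/lib/modules,readOnly=false \
  --mount=volume=modules,target=/lib/modules \
  --volume=etcd-ssl,kind=host,source=/etc/ssl/etcd,readOnly=true \
  --mount=volume=etcd-ssl,target=/etc/ssl/etcd \
  --trust-keys-from-https quay.io/calico/node:v0.19.0
EOF
}
```

Calling write_calico_dropin /etc/systemd/system/calico-node.service.d, then running systemctl daemon-reload and restarting the unit, should make the service start the container the same way as the manual command. It only covers the missing mount and variables, not any remaining connection errors.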
update 3
I tried pointing ETCD_ENDPOINTS at only the etcd server on this CoreOS machine by running export ETCD_ENDPOINTS=https://10.79.218.2:2379, and now when I try to run the Calico rkt image I get:
image: using image from file /usr/lib64/rkt/stage1-images/stage1-fly.aci
image: using image from local store for image name quay.io/calico/node:v0.19.0
Traceback (most recent call last):
File "startup.py", line 295, in <module>
main()
File "startup.py", line 251, in main
warn_if_hostname_conflict(ip)
File "startup.py", line 192, in warn_if_hostname_conflict
current_ipv4, _ = client.get_host_bgp_ips(hostname)
File "/usr/lib/python2.7/site-packages/pycalico/datastore.py", line 132, in wrapped
"running?" % (fn.__name__, e.message))
pycalico.datastore_errors.DataStoreError: get_host_bgp_ips: Error accessing etcd (Connection to etcd failed due to SSLError(CertificateError("hostname '10.79.218.2' doesn't match u'etcd'",),)). Is etcd running?
Calico node failed to start
I also had this problem, and eventually found the source of the issue by looking at the code for the etcd connection logic and the libraries used, and some pointers from the Calico team in their Slack channel.
The problem is that the current version (at least up to 0.22.0) of Calico uses a Python etcd client that does not support IP SANs (Subject Alternative Names) in TLS certificates. As a result, the certificates you are using cannot be correctly associated with the etcd servers they are configured on.
This is described in this GitHub issue.
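To see which names a certificate actually carries, something like the following can help. It is a minimal sketch: show_cert_names is a hypothetical helper name, and it relies only on the standard openssl x509 options -subject and -text.

```shell
#!/bin/sh
# Sketch: print the identity fields a TLS client compares against the
# address it connected to. show_cert_names is a hypothetical helper name.
show_cert_names() {
    cert="$1"
    # Subject line; the CN lives here.
    openssl x509 -in "$cert" -noout -subject
    # SAN extension, if the certificate has one.
    openssl x509 -in "$cert" -noout -text \
        | grep -A1 'Subject Alternative Name' \
        || echo "no Subject Alternative Name extension found"
}
```

For example, show_cert_names /etc/ssl/etcd/etcd1.pem. A certificate whose node addresses appear only as IP Address entries, with a CN such as etcd, would be consistent with the "doesn't match u'etcd'" error in the question.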
To fix this, you can either wait until a new release of the urllib library is made, the etcd client picks it up and makes a new release of its own, and Calico is updated to use that new etcd client, or you can regenerate the certificates using FQDNs instead of IP addresses in the SAN fields. The second option means you will need to make sure your servers are reachable through those names, either using DNS or by setting /etc/hosts correctly. The OpenSSL configuration for generating the certificates should then list those hostnames as DNS entries in the subject alternative names.

The link describing how you generated the certificates uses CFSSL, so I suggest reading its documentation on how to switch to hostnames instead of IP addresses. I believe it may be as simple as putting the hostnames into the hosts list of the JSON configuration.
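A minimal sketch of such a CFSSL request file, written from a small shell helper. The helper name write_csr and the FQDNs are placeholders, and the CN and key settings are assumptions modelled on what the tls-setup example appears to use; adjust them to your own setup.

```shell
#!/bin/sh
# Sketch: write a CFSSL CSR file that lists hostnames rather than IP addresses.
# write_csr is a hypothetical helper; the FQDNs passed to it are placeholders.
write_csr() {
    out="$1"; shift
    hosts=""
    for h in "$@"; do
        # Build the JSON array body: "a", "b", ...
        hosts="${hosts:+$hosts, }\"$h\""
    done
    cat > "$out" <<EOF
{
  "CN": "etcd",
  "hosts": [$hosts],
  "key": { "algo": "ecdsa", "size": 384 }
}
EOF
}
```

For example, write_csr req-csr.json etcd1.example.internal etcd2.example.internal, followed by something like cfssl gencert -ca ca.pem -ca-key ca-key.pem -config ca-config.json req-csr.json | cfssljson -bare etcd1 (file names assumed). Each node would then need the names to resolve, for instance with an /etc/hosts line such as 10.79.218.2 etcd1.example.internal, and ETCD_ENDPOINTS would use the names instead of the IP addresses.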
I find that with this flaky library I can succeed if: the client opens a connection to an IP address; the server's cert asserts that IP address in the Subject; and the server's cert does NOT have any DNS type entries in the Subject Alternative Name list. Following is selected output from openssl x509 -text ... for an example server cert that works when the client opens a connection using the IP address 10.10.10.1 to identify the server:

Also, there are newer versions of the Calico images. I have heard only two bad things about calico/node:v0.23.0. One is from someone else: https://calicousers.slack.com/archives/kubernetes/p1478206011002345 . I have done some testing of that image myself and had only one problem, https://github.com/projectcalico/calico-containers/issues/1107 . There are v1.0.0 betas and an rc1 right now; I have not heard bad things about them.