I have a Kubernetes cluster in version 1.25.0 with some nodes (Ubuntu server machines). I use calico from https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml. Now I am adding a new node. The node is completely identical. The only exception is that it has a 2.5gbit network port instead of a 1gbit network port. On this node, both calico node and kube proxy crash permanently. On all other nodes it works fine. Calico Node reports the following as the reason for the crash:
Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused W0724 00:54:46.157624 73 feature_gate.go:241] Setting GA feature gate ServiceInternalTrafficPolicy=true. It will be removed in a future release.
Back-off restarting failed container
Kube proxy just crashes with back-off restarting failed container
.
The logs of all look good, no errors - not even warning. Here is a part of the logs from the calico node container:
2023-07-24 01:35:34.609 [INFO][115] felix/int_dataplane.go 1893: Received interface update msg=&intdataplane.ifaceStateUpdate{Name:"calico_tmp_B", State:"", Index:76}
2023-07-24 01:35:34.609 [INFO][115] felix/int_dataplane.go 1913: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"calico_tmp_B", Addrs:set.Set[string](nil)}
2023-07-24 01:35:34.609 [INFO][115] felix/hostip_mgr.go 84: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"calico_tmp_B", Addrs:set.Set[string](nil)}
2023-07-24 01:35:34.609 [INFO][115] felix/int_dataplane.go 1893: Received interface update msg=&intdataplane.ifaceStateUpdate{Name:"calico_tmp_A", State:"", Index:77}
2023-07-24 01:35:34.609 [INFO][115] felix/int_dataplane.go 1913: Received interface addresses update msg=&intdataplane.ifaceAddrsUpdate{Name:"calico_tmp_A", Addrs:set.Set[string](nil)}
2023-07-24 01:35:34.609 [INFO][115] felix/hostip_mgr.go 84: Interface addrs changed. update=&intdataplane.ifaceAddrsUpdate{Name:"calico_tmp_A", Addrs:set.Set[string](nil)}
2023-07-24 01:35:34.609 [INFO][115] felix/int_dataplane.go 1803: Dataplane updates throttled
2023-07-24 01:35:35.603 [INFO][115] felix/int_dataplane.go 1770: Dataplane updates no longer throttled
bird: device1: Initializing
bird: direct1: Initializing
bird: device1: Starting
bird: device1: Initializing
bird: direct1: Initializing
bird: Mesh_192_168_178_58: Initializing
bird: Mesh_192_168_178_25: Initializing
bird: Mesh_192_168_178_70: Initializing
bird: Mesh_192_168_178_38: Initializing
bird: Mesh_192_168_178_72: Initializing
bird: device1: Starting
bird: device1: Connected to table master
bird: bird: device1: Connected to table masterdevice1: State changed to feed
bird: device1: State changed to feed
bird: direct1: Starting
bird: direct1: Connected to table master
bird: direct1: State changed to feed
bird: direct1: Startingbird:
Graceful restart started
bird: bird: direct1: Connected to table masterGraceful restart done
bird: direct1: State changed to feedbird:
Startedbird:
Mesh_192_168_178_58: Starting
bird: bird: Mesh_192_168_178_58: State changed to startdevice1: State changed to up
bird: Mesh_192_168_178_25: Starting
bird: Mesh_192_168_178_25: State changed to start
bird: Mesh_192_168_178_70: Starting
bird: Mesh_192_168_178_70: State changed to start
bird: Mesh_192_168_178_38: Starting
bird: bird: direct1: State changed to upMesh_192_168_178_38: State changed to start
bird: Mesh_192_168_178_72: Starting
bird: Mesh_192_168_178_72: State changed to start
bird: Graceful restart started
bird: Started
bird: device1: State changed to up
bird: direct1: State changed to up
bird: Mesh_192_168_178_58: Connected to table master
bird: Mesh_192_168_178_58: State changed to wait
bird: Mesh_192_168_178_25: Connected to table master
bird: Mesh_192_168_178_25: State changed to wait
bird: Mesh_192_168_178_72: Connected to table master
bird: Mesh_192_168_178_72: State changed to wait
bird: Mesh_192_168_178_70: Connected to table master
bird: Mesh_192_168_178_70: State changed to wait
bird: Mesh_192_168_178_38: Connected to table master
bird: Mesh_192_168_178_38: State changed to wait
bird: Graceful restart done
bird: Mesh_192_168_178_58: State changed to feed
bird: Mesh_192_168_178_25: State changed to feed
bird: Mesh_192_168_178_70: State changed to feed
bird: Mesh_192_168_178_38: State changed to feed
bird: Mesh_192_168_178_72: State changed to feed
bird: Mesh_192_168_178_58: State changed to up
bird: Mesh_192_168_178_25: State changed to up
bird: Mesh_192_168_178_70: State changed to up
bird: Mesh_192_168_178_38: State changed to up
bird: Mesh_192_168_178_72: State changed to up
2023-07-24 01:35:41.982 [INFO][115] felix/health.go 336: Overall health status changed: live=true ready=true
+---------------------------+---------+----------------+-----------------+--------+
| COMPONENT | TIMEOUT | LIVENESS | READINESS | DETAIL |
+---------------------------+---------+----------------+-----------------+--------+
| CalculationGraph | 30s | reporting live | reporting ready | |
| FelixStartup | - | reporting live | reporting ready | |
| InternalDataplaneMainLoop | 1m30s | reporting live | reporting ready | |
+---------------------------+---------+----------------+-----------------+--------+
2023-07-24 01:36:27.256 [INFO][115] felix/int_dataplane.go 1836: Received *proto.HostMetadataV4V6Update update from calculation graph msg=hostname:"storage-controller" ipv4_addr:"192.168.178.72/24" labels:<key:"beta.kubernetes.io/arch" value:"amd64" > labels:<key:"beta.kubernetes.io/os" value:"linux" > labels:<key:"kubernetes.io/arch" value:"amd64" > labels:<key:"kubernetes.io/hostname" value:"storage-controller" > labels:<key:"kubernetes.io/os" value:"linux" > labels:<key:"specialServerType" value:"storage" >
2023-07-24 01:36:29.551 [INFO][115] felix/int_dataplane.go 1836: Received *proto.HostMetadataV4V6Update update from calculation graph msg=hostname:"node1" ipv4_addr:"192.168.178.25/24" labels:<key:"beta.kubernetes.io/arch" value:"amd64" > labels:<key:"beta.kubernetes.io/os" value:"linux" > labels:<key:"kubernetes.io/arch" value:"amd64" > labels:<key:"kubernetes.io/hostname" value:"node1" > labels:<key:"kubernetes.io/os" value:"linux" > labels:<key:"node-role.kubernetes.io/control-plane" value:"" > labels:<key:"node.kubernetes.io/exclude-from-external-load-balancers" value:"" >
2023-07-24 01:36:34.389 [INFO][117] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface enp11s0: 192.168.178.88/24
2023-07-24 01:36:37.850 [INFO][115] felix/summary.go 100: Summarising 20 dataplane reconciliation loops over 1m3.5s: avg=13ms longest=180ms (resync-filter-v4,resync-ipsets-v4,resync-mangle-v4,resync-nat-v4,resync-raw-v4,resync-routes-v4,resync-routes-v4,resync-rules-v4,update-filter-v4,update-ipsets-4,update-mangle-v4,update-nat-v4,update-raw-v4)
I absolutely don't understand this. I have already rebuilt the whole node, updated calico node and also tried other kubernetes versions (1.25.11). No firewall is installed. Can anyone help me here? Thanks
PS: I have already tried all autodetection methods. Right now I use IP_AUTODETECTION_METHOD=can-reach=8.8.8.8
Output of ifconfig on the node:
ifconfig -a
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
ether 02:42:50:db:52:31 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
enp11s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.178.88 netmask 255.255.255.0 broadcast 192.168.178.255
inet6 fe80::67c:16ff:fec8:53e6 prefixlen 64 scopeid 0x20<link>
inet6 2a02:908:523:bd80:67c:16ff:fec8:53e6 prefixlen 64 scopeid 0x0<global>
ether 04:7c:16:c8:53:e6 txqueuelen 1000 (Ethernet)
RX packets 8748 bytes 6634723 (6.6 MB)
RX errors 0 dropped 242 overruns 0 frame 0
TX packets 6342 bytes 944277 (944.2 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 390 bytes 47600 (47.6 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 390 bytes 47600 (47.6 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
tunl0: flags=193<UP,RUNNING,NOARP> mtu 1480
inet 192.168.53.192 netmask 255.255.255.255
tunnel txqueuelen 1000 (IPIP Tunnel)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 39 bytes 13183 (13.1 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
wlp12s0: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 60:e9:aa:5e:01:95 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Error: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl
The above error is caused mainly when you are having multiple interfaces and your calico plugin is not able to detect the correct interface, because of a misconfiguration in setting the interface or IP address. The common fixes are either assigning static IP addresses or to enable auto detection of interfaces.
Below are few additional references for the same:
https://github.com/projectcalico/calico/issues/2834#issuecomment-528400727
https://github.com/projectcalico/calico/issues/2042#issuecomment-419655823
I was able to find and fix the problem myself. I have always installed all nodes with this guide:
https://www.edureka.co/blog/install-kubernetes-on-ubuntu
This seems to be outdated and does not set important kernel parameters. After using this guide, Calico now starts normally:
https://www.linuxtechi.com/install-kubernetes-on-ubuntu-22-04/
Specifically, if I am not mistaken, it is about this command: