For integration testing purposes I'm creating a very simple single-node Kafka deployment:
- 1x Zookeeper
- 1x Kafka
- 1x Kafka client (e.g. AdminClient, creating topics)
(All cleanly deployed in fresh Docker containers.)
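For context, the client does little more than create a topic through the AdminClient. A minimal sketch of what it runs (the bootstrap address and topic name here are placeholders, not my exact code):

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Bootstrap server as resolved inside the Docker network
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Create a single-partition topic with replication factor 1 and
            // block until the broker confirms it (or the client times out)
            admin.createTopics(Collections.singleton(new NewTopic("test-topic", 1, (short) 1)))
                 .all()
                 .get();
        }
    }
}
```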
I'm seeing intermittent failures when the client connects to Kafka, yet whenever I retry the connection it goes through just fine. I've enabled debug logging on the Kafka client, and this is what I'm seeing on the first connection:
Client connects to its configured bootstrap server just fine:
[main] DEBUG o.a.k.c.a.i.AdminMetadataManager - [AdminClient clientId=adminclient-1] Setting bootstrap cluster metadata Cluster(id = null, nodes = [kafka:9092 (id: -1 rack: null)], partitions = [], controller = null).
[...]
[kafka-admin-client-thread | adminclient-1] DEBUG o.apache.kafka.clients.NetworkClient - [AdminClient clientId=adminclient-1] Completed connection to node -1. Fetching API versions.
Hundreds of lines showing that requesting the nodes in the cluster yields no entries:
[kafka-admin-client-thread | adminclient-1] DEBUG o.a.k.c.a.i.AdminMetadataManager - [AdminClient clientId=adminclient-1] Updating cluster metadata to Cluster(id = q7XgghZqQUW_o5W2-Nn5Qw, nodes = [], partitions = [], controller = null)
Note the `nodes = []` part here in particular. This goes on for at least a few seconds, sometimes even 30 seconds! I can't seem to understand why the Kafka server can't list itself as a node.
In an unlucky case, the timeout is reached and I see the infamous exception thrown:
Exception in thread "main" java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment.
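When that happens, the test run fails outright. As far as I can tell, the deadline for the call is taken from the client's request timeout (or a per-call timeout), so as a stopgap I can give it more headroom; a minimal sketch, assuming request.timeout.ms really is the knob in 2.1.0 and using an arbitrary 60-second value:

```java
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class AdminClientWithMoreHeadroom {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // Allow more time for a node assignment before the TimeoutException
        // above is thrown; 60000 ms is an arbitrary value for the test setup.
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, 60_000);

        try (AdminClient admin = AdminClient.create(props)) {
            // ... same createTopics() call as in the sketch above ...
        }
    }
}
```

That only hides the delay, of course; it doesn't explain it.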
In a lucky case, after a while, it lists itself and it can connect just fine:
[kafka-admin-client-thread | adminclient-1] DEBUG o.a.k.c.a.i.AdminMetadataManager - [AdminClient clientId=adminclient-1] Updating cluster metadata to Cluster(id = q7XgghZqQUW_o5W2-Nn5Qw, nodes = [kafka:9092 (id: 0 rack: null)], partitions = [], controller = kafka:9092 (id: 0 rack: null))
[kafka-admin-client-thread | adminclient-1] DEBUG o.apache.kafka.clients.NetworkClient - [AdminClient clientId=adminclient-1] Initiating connection to node kafka:9092 (id: 0 rack: null) using address kafka/xxxxx
Note the presence of `nodes = [kafka:9092 (id: 0 rack: null)]` here.
My problem is the huge variance in time between the bootstrap connection being established and the metadata update that finally lists a node. I fail to understand why a client is not given a node right away (the broker itself, in this case!). Not only does this slow down my automated testing, it also makes scrolling through the logs painful, as they fill up with timeout failures because the client regularly does not connect within the default timeout window. Any subsequent attempt, after waiting long enough, connects immediately without any of this oddness, though...
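The workaround I'm leaning towards for the automated tests is to poll for a non-empty node list before running the actual test logic, roughly like this (a sketch; the retry interval and attempt limit are made up):

```java
import java.util.Collection;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class WaitForBroker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Poll the cluster metadata until the broker lists itself as a node,
            // giving up after 30 attempts (an arbitrary limit for this sketch).
            for (int attempt = 0; attempt < 30; attempt++) {
                try {
                    Collection<Node> nodes = admin.describeCluster().nodes().get();
                    if (!nodes.isEmpty()) {
                        System.out.println("Broker is available: " + nodes);
                        return;
                    }
                } catch (ExecutionException e) {
                    // Treat timeouts and transient errors as "not ready yet"
                }
                Thread.sleep(1_000); // wait a second before retrying
            }
            throw new IllegalStateException("Broker never listed itself as a node");
        }
    }
}
```

But that feels like papering over the delay rather than fixing it, which is why I'd like to understand what is actually happening.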
Kafka client and server versions: 2.1.0, running on Kubernetes. The Kafka server logs don't indicate anything special: they are already silent seconds before the client makes its attempt, and stay silent until after the connection is established (if it is established at all).
What am I missing here? Am I looking at random back-offs under pressure, or do I have to tell Zookeeper/Kafka that it's fine to run on its own, without having to wait for other nodes?
I have read this infamous blog article about listener configurations, but that doesn't seem relevant in my case: my Kafka does not advertise any node for a while, despite having advertised listeners configured, and then suddenly it does. After that, it connects just fine, whereas people who struggle with advertised listeners can't connect at all.