One of our applications uses Elasticsearch (1.4.4) as an in-memory cache. The application is a Java webapp deployed on Tomcat 7 with Oracle Java 1.7. The Elasticsearch instance is a one-node setup deployed on the same server.
Since Elasticsearch 1.3.3 we have been experiencing around 40 Mbit/s in and out on the loopback interface between the application and the Elasticsearch node, even with an idle application.
That isn't much in itself, but it causes a noticeable load on an otherwise flatlining system. I don't have a production system with this application at hand, so I can't really say how it evolves in prod.
Grabbing the traffic via tcpdump and analyzing it in Wireshark shows that the Elasticsearch client in the application continuously asks the node for cluster/node/info, which yields a roughly 10 KB answer every time.
Maybe it's totally unrelated, but enabling server and client logging gives us the following.
Elasticsearch server log:
[2015-05-12 14:45:01,600][INFO ][node ] [Illyana Rasputin] initializing ...
[2015-05-12 14:45:01,608][INFO ][plugins ] [Illyana Rasputin] loaded [], sites []
[2015-05-12 14:45:06,666][INFO ][node ] [Illyana Rasputin] initialized
[2015-05-12 14:45:06,667][INFO ][node ] [Illyana Rasputin] starting ...
[2015-05-12 14:45:06,828][INFO ][transport ] [Illyana Rasputin] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.24.1.128:9300]}
[2015-05-12 14:45:06,851][INFO ][discovery ] [Illyana Rasputin] bkbo_index/TITPDFdtR6SXX5EeOXaidg
[2015-05-12 14:45:09,892][INFO ][cluster.service ] [Illyana Rasputin] new_master [Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[/10.24.1.128:9300]], reason: zen-disco-join (elected_as_master)
[2015-05-12 14:45:09,943][INFO ][http ] [Illyana Rasputin] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.24.1.128:9200]}
[2015-05-12 14:45:09,944][INFO ][node ] [Illyana Rasputin] started
[2015-05-12 14:45:11,283][INFO ][gateway ] [Illyana Rasputin] recovered [2] indices into cluster_state
Elasticsearch client:
2015-05-12 14:46:40,683 INFO [localhost-startStop-1] PluginsService:<init>:151 [Antiphon the Overseer] loaded [], sites []
2015-05-12 14:46:41,548 DEBUG [localhost-startStop-1] TransportClientNodesService:<init>:110 [Antiphon the Overseer] node_sampler_interval[5ms]
2015-05-12 14:46:41,594 DEBUG [localhost-startStop-1] TransportClientNodesService:addTransportAddresses:167 [Antiphon the Overseer] adding address [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,625 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,655 INFO [localhost-startStop-1] TransportClientNodesService$SimpleNodeSampler:doSample:371 [Antiphon the Overseer] failed to get node info for [#transport#-1][dev06][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [0] timed out after [6ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-05-12 14:46:41,658 DEBUG [localhost-startStop-1] NettyTransport:disconnectFromNode:882 [Antiphon the Overseer] disconnecting from [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] due to explicit disconnect call
2015-05-12 14:46:41,661 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,669 INFO [elasticsearch[Antiphon the Overseer][generic][T#1]] TransportClientNodesService$SimpleNodeSampler:doSample:371 [Antiphon the Overseer] failed to get node info for [#transport#-1][dev06][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [1] timed out after [5ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-05-12 14:46:41,670 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:disconnectFromNode:882 [Antiphon the Overseer] disconnecting from [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] due to explicit disconnect call
2015-05-12 14:46:41,676 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,677 WARN [elasticsearch[Antiphon the Overseer][transport_client_worker][T#2]{New I/O worker #2}] TransportService$Adapter:remove:280 [Antiphon the Overseer] Received response for a request that has timed out, sent [14ms] ago, timed out [9ms] ago, action [cluster:monitor/nodes/info], node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]], id [1]
2015-05-12 14:46:41,682 INFO [localhost-startStop-1] PluginsService:<init>:151 [Ricochet] loaded [], sites []
2015-05-12 14:46:41,722 DEBUG [localhost-startStop-1] TransportClientNodesService:<init>:110 [Ricochet] node_sampler_interval[5ms]
2015-05-12 14:46:41,733 DEBUG [localhost-startStop-1] TransportClientNodesService:addTransportAddresses:167 [Ricochet] adding address [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,734 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Ricochet] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,759 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,760 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Ricochet] connected to node [[Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[localhost/127.0.0.1:9300]]]
Yes, this application has two client connections, which should be OK (according to the devs). These disconnect/reconnect cycles happen about once a minute.
Any clues what's going on here? I have already disabled multicast discovery via discovery.zen.ping.multicast.enabled: false.
Your client appears to have joined the cluster, which is fine, although if you use Kibana 4 you may see complaints from Kibana about it (I'm not sure whether those complaints made it out of the Kibana 4 beta).
From your client log:
node_sampler_interval[5ms]
5 ms seems fairly aggressive for sampling the nodes in the cluster. I haven't looked up what the default is, but I'm guessing that something has been configured in milliseconds where seconds were expected.
At this point you need to look at the settings of the Client API, although it's possible that the client picks this setting up from the cluster (as it is becoming part of the cluster). Presumably you are using the elastic.co-provided Java API?
Do you perhaps have client.transport.nodes_sampler_interval configured anywhere? And are you using compatible client/server versions, as per the docs for the Java client API?
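For comparison, here is a minimal sketch of how that setting is usually supplied when constructing a 1.x TransportClient. The class name is just for illustration, and the cluster name and address are assumptions taken from your logs:

    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public final class EsClientFactory {
        public static TransportClient build() {
            // "5s" (seconds) is what the docs describe as the default; a naked "5"
            // would be read as 5 milliseconds, which matches the
            // node_sampler_interval[5ms] visible in your client log.
            Settings settings = ImmutableSettings.settingsBuilder()
                    .put("cluster.name", "bkbo_index") // guessed from your discovery log line
                    .put("client.transport.nodes_sampler_interval", "5s")
                    .build();
            return new TransportClient(settings)
                    .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
        }
    }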
I wouldn't be surprised if a unit of measure had changed between versions, although the documentation does say the default is 5s.
Check your elasticsearch.yml and your code for instances of node_sampler_interval. You may need to replace a naked 5 with 5s, perhaps?
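If it helps, here is a small sketch against the 1.x TimeValue class showing why a bare number ends up as milliseconds (the class name is just for illustration):

    import org.elasticsearch.common.unit.TimeValue;

    public final class TimeValueDemo {
        public static void main(String[] args) {
            TimeValue fallback = TimeValue.timeValueSeconds(5);
            // A bare number carries no unit and is treated as milliseconds:
            System.out.println(TimeValue.parseTimeValue("5", fallback));  // prints 5ms
            // An explicit unit gives the intended five seconds:
            System.out.println(TimeValue.parseTimeValue("5s", fallback)); // prints 5s
        }
    }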