One of our applications uses Elasticsearch (1.4.4) as an in-memory cache. The application is a Java webapp deployed on Tomcat 7 with Oracle Java 1.7. The Elasticsearch instance is a one-node setup deployed on the same server.
Since Elasticsearch 1.3.3 we have been experiencing around 40 Mbit/s in and out on the loopback interface between the application and the Elasticsearch node, even with an idle application.
That isn't much in itself, but it causes a noticeable load on an otherwise flatlining system. I don't have a production system with this application at hand, so I can't really say how it evolves in prod.
Grabbing the traffic via tcpdump and analyzing it in Wireshark shows that the Elasticsearch client in the application continuously asks the node for cluster/node/info, which yields a roughly 10 KB answer every time.
Maybe it's totally unrelated, but enabling server and client logging gives us the following.
Elasticsearch server log:
[2015-05-12 14:45:01,600][INFO ][node ] [Illyana Rasputin] initializing ...
[2015-05-12 14:45:01,608][INFO ][plugins ] [Illyana Rasputin] loaded [], sites []
[2015-05-12 14:45:06,666][INFO ][node ] [Illyana Rasputin] initialized
[2015-05-12 14:45:06,667][INFO ][node ] [Illyana Rasputin] starting ...
[2015-05-12 14:45:06,828][INFO ][transport ] [Illyana Rasputin] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.24.1.128:9300]}
[2015-05-12 14:45:06,851][INFO ][discovery ] [Illyana Rasputin] bkbo_index/TITPDFdtR6SXX5EeOXaidg
[2015-05-12 14:45:09,892][INFO ][cluster.service ] [Illyana Rasputin] new_master [Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[/10.24.1.128:9300]], reason: zen-disco-join (elected_as_master)
[2015-05-12 14:45:09,943][INFO ][http ] [Illyana Rasputin] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.24.1.128:9200]}
[2015-05-12 14:45:09,944][INFO ][node ] [Illyana Rasputin] started
[2015-05-12 14:45:11,283][INFO ][gateway ] [Illyana Rasputin] recovered [2] indices into cluster_state
Elasticsearch client:
2015-05-12 14:46:40,683 INFO [localhost-startStop-1] PluginsService:<init>:151 [Antiphon the Overseer] loaded [], sites []
2015-05-12 14:46:41,548 DEBUG [localhost-startStop-1] TransportClientNodesService:<init>:110 [Antiphon the Overseer] node_sampler_interval[5ms]
2015-05-12 14:46:41,594 DEBUG [localhost-startStop-1] TransportClientNodesService:addTransportAddresses:167 [Antiphon the Overseer] adding address [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,625 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,655 INFO [localhost-startStop-1] TransportClientNodesService$SimpleNodeSampler:doSample:371 [Antiphon the Overseer] failed to get node info for [#transport#-1][dev06][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [0] timed out after [6ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-05-12 14:46:41,658 DEBUG [localhost-startStop-1] NettyTransport:disconnectFromNode:882 [Antiphon the Overseer] disconnecting from [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] due to explicit disconnect call
2015-05-12 14:46:41,661 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,669 INFO [elasticsearch[Antiphon the Overseer][generic][T#1]] TransportClientNodesService$SimpleNodeSampler:doSample:371 [Antiphon the Overseer] failed to get node info for [#transport#-1][dev06][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [1] timed out after [5ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2015-05-12 14:46:41,670 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:disconnectFromNode:882 [Antiphon the Overseer] disconnecting from [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]] due to explicit disconnect call
2015-05-12 14:46:41,676 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,677 WARN [elasticsearch[Antiphon the Overseer][transport_client_worker][T#2]{New I/O worker #2}] TransportService$Adapter:remove:280 [Antiphon the Overseer] Received response for a request that has timed out, sent [14ms] ago, timed out [9ms] ago, action [cluster:monitor/nodes/info], node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]], id [1]
2015-05-12 14:46:41,682 INFO [localhost-startStop-1] PluginsService:<init>:151 [Ricochet] loaded [], sites []
2015-05-12 14:46:41,722 DEBUG [localhost-startStop-1] TransportClientNodesService:<init>:110 [Ricochet] node_sampler_interval[5ms]
2015-05-12 14:46:41,733 DEBUG [localhost-startStop-1] TransportClientNodesService:addTransportAddresses:167 [Ricochet] adding address [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,734 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Ricochet] connected to node [[#transport#-1][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,759 DEBUG [elasticsearch[Antiphon the Overseer][generic][T#1]] NettyTransport:connectToNode:751 [Antiphon the Overseer] connected to node [[Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[localhost/127.0.0.1:9300]]]
2015-05-12 14:46:41,760 DEBUG [localhost-startStop-1] NettyTransport:connectToNode:751 [Ricochet] connected to node [[Illyana Rasputin][TITPDFdtR6SXX5EeOXaidg][dev06][inet[localhost/127.0.0.1:9300]]]
Yes, this application has two client connections, which should be OK (according to the devs). These disconnect/reconnect cycles happen about once a minute.
Any clues what's going on here? I have already disabled multicast discovery via discovery.zen.ping.multicast.enabled: false.
Your client appears to have joined the cluster, which is fine, although if you use Kibana 4 you may see complaints from Kibana about it (I'm not sure whether those complaints made it out of the Kibana 4 beta).
From your client log:
node_sampler_interval[5ms]
5 ms seems fairly aggressive for sampling the nodes in the cluster. I haven't looked up what the default is, but I'm guessing that something has been configured in milliseconds where seconds were expected.
At this point you need to look at the settings of the Client API, although it's possible that the client picks this setting up from the cluster (as it is becoming part of the cluster). Presumably you are using the elastic.co-provided Java API?
Do you perhaps have client.transport.nodes_sampler_interval configured anywhere? And are you using compatible client/server versions, as per the docs for the Java client API?
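For comparison, here is a minimal sketch of how that setting is usually supplied when constructing a 1.x TransportClient. The class name is just for illustration, and the cluster name and address are assumptions taken from your logs:

    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public final class EsClientFactory {
        public static TransportClient build() {
            // "5s" (seconds) is what the docs describe as the default; a naked "5"
            // would be read as 5 milliseconds, which matches the
            // node_sampler_interval[5ms] visible in your client log.
            Settings settings = ImmutableSettings.settingsBuilder()
                    .put("cluster.name", "bkbo_index") // guessed from your discovery log line
                    .put("client.transport.nodes_sampler_interval", "5s")
                    .build();
            return new TransportClient(settings)
                    .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
        }
    }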
I wouldn't be surprised if a unit of measure had changed between versions, although the documentation does say the default is 5s.
Check your elasticsearch.yml and your code for instances of node_sampler_interval. You may need to replace a naked 5 with 5s, perhaps?
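If it helps, here is a small sketch against the 1.x TimeValue class showing why a bare number ends up as milliseconds (the class name is just for illustration):

    import org.elasticsearch.common.unit.TimeValue;

    public final class TimeValueDemo {
        public static void main(String[] args) {
            TimeValue fallback = TimeValue.timeValueSeconds(5);
            // A bare number carries no unit and is treated as milliseconds:
            System.out.println(TimeValue.parseTimeValue("5", fallback));  // prints 5ms
            // An explicit unit gives the intended five seconds:
            System.out.println(TimeValue.parseTimeValue("5s", fallback)); // prints 5s
        }
    }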