I have 2 ES servers being fed by 1 Logstash server, and I view the logs in Kibana. This is a POC to work out any issues before going into production. The system has run for ~1 month, and every few days Kibana stops showing logs at some random time in the middle of the night. Last night, the last log entry I received in Kibana was around 18:30. When I checked the ES servers, /sbin/service elasticsearch status showed the master running and the secondary not running, yet a curl against localhost on the secondary still returned information, so I'm not sure what's up with that. Anyway, when I check cluster health on the master node, I get this:
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
"cluster_name" : "gis-elasticsearch",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 6,
"number_of_data_nodes" : 2,
"active_primary_shards" : 186,
"active_shards" : 194,
"relocating_shards" : 0,
"initializing_shards" : 7,
"unassigned_shards" : 249
}
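Status red means at least one primary shard is unassigned. Listing the shards shows which indices those 249 unassigned ones belong to (this assumes ES 1.0+, where the _cat API is available):

curl -XGET 'http://localhost:9200/_cat/shards?v' | grep -v STARTED
# whatever remains is INITIALIZING, RELOCATING or UNASSIGNED, one line per shard with its index name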
When I view the indexes via "ls .../nodes/0/indices/", it shows all of them as modified today for some reason, and there are new files for today's date. So I think I'm starting to catch back up after restarting both servers, but I'm not sure why it failed in the first place. When I look at the logs on the master, I only see four warnings at 18:57 and then the secondary leaving the cluster. I don't see anything in the logs on the secondary (Pistol) about why it stopped working or what actually happened.
[2014-03-06 18:57:04,121][WARN ][transport ] [ElasticSearch Server1] Transport response handler not found of id [64147630]
[2014-03-06 18:57:04,124][WARN ][transport ] [ElasticSearch Server1] Transport response handler not found of id [64147717]
[2014-03-06 18:57:04,124][WARN ][transport ] [ElasticSearch Server1] Transport response handler not found of id [64147718]
[2014-03-06 18:57:04,124][WARN ][transport ] [ElasticSearch Server1] Transport response handler not found of id [64147721]
[2014-03-06 19:56:08,467][INFO ][cluster.service ] [ElasticSearch Server1] removed {[Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.1.1.10:9301]]{client=true, data=false},}, reason: zen-disco-node_failed([Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.13.3.46:9301]]{client=true, data=false}), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2014-03-06 19:56:12,304][INFO ][cluster.service ] [ElasticSearch Server1] added {[Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.1.1.10:9301]]{client=true, data=false},}, reason: zen-disco-receive(join from node[[Pistol][sIAMHNj6TMCmrMJGW7u97A][inet[/10.13.3.46:9301]]{client=true, data=false}])
Any ideas on additional logging or troubleshooting I can turn on to keep this from happening in the future? Since the shards are not caught up yet, right now I'm just seeing a lot of debug messages about failed-to-parse errors. I'm assuming those will clear up once we catch up.
[2014-03-07 10:06:52,235][DEBUG][action.search.type ] [ElasticSearch Server1] All shards failed for phase: [query]
[2014-03-07 10:06:52,223][DEBUG][action.search.type ] [ElasticSearch Server1] [windows-2014.03.07][3], node[W6aEFbimR5G712ddG_G5yQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@74ecbbc6] lastShard [true]
org.elasticsearch.search.SearchParseException: [windows-2014.03.07][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"10m"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"(ASA AND Deny)"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1394118412373,"to":"now"}}}]}}}}}}}},"size":0}]]
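I'm also thinking about turning up discovery logging so the next disconnect leaves more of a trail; ES lets you change logger levels at runtime through the cluster settings API (a sketch, assuming 1.x, where the logger.* prefix is accepted):

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "logger.discovery" : "DEBUG"
  }
}'
# "transient" is lost on a full cluster restart; use "persistent" to keep the setting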
The usual suspects for ES with Kibana are heap pressure and long GC pauses: a node that spends too long in garbage collection stops answering pings and gets dropped from the cluster, which is exactly the "failed to ping, tried [3] times" message in your log.
Also the "usual" setup for ES is 3 servers to allow better redundancy when one server is down. But YMMV.
You can also try the new G1 garbage collector, which in my case behaves much better than CMS on my Kibana ES cluster.
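On 1.x the collector flags live in the startup include script, so switching means swapping the CMS flags for G1 there. A sketch only; the path is the one used by the stock tarball/packages and may differ on your install:

# bin/elasticsearch.in.sh -- comment out the CMS flags:
#   -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
#   -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
# and add G1 instead (the JVM rejects conflicting collector flags):
JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"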
The GC duration problem is usually the one that strikes while you're looking somewhere else, and it will typically lead to data loss because ES stops responding.
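You can keep an eye on it with the nodes stats API (URL as in 1.x); old-generation collections that run into tens of seconds are exactly the pauses that make a node miss its pings:

curl -XGET 'http://localhost:9200/_nodes/stats/jvm?pretty'
# watch jvm.mem.heap_used_percent and jvm.gc.collectors.*.collection_time_in_millis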
Good luck with these :)