I've set up a new proof of concept logstash system
CentOS 6.6 (on VMware 5.5) - single-CPU VM with 12 GB RAM allocated
Elasticsearch and Logstash installed from RPMs …
# rpm -q elasticsearch logstash
elasticsearch-1.7.1-1.noarch
logstash-1.5.3-1.noarch
JVM: 1.8.0_51
Data I'm feeding in is simple records of the form …
M1234 z123 2015-01-31 23:28:09.417 8.55373
(fields are machine name, userid, date, time, time taken to logon - everything is simple US-ASCII)
Logstash config is below (this data comes from an MSSQL database; for the moment I'm exporting to a text file and transferring the file to the logstash server).
This worked fine for a day's worth of logs (11K records), but when I try to process the backlog from this calendar year it 'hangs'.
Symptoms of this are:
- elasticsearch is still responsive - searches and access to configuration still fine
- the number of documents in the indices stops going up
- the system becomes essentially idle - only background disk activity and minimal CPU usage
- if I try to stop the logstash process (which is still running), it will only die with kill -9.
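Since the process won't respond to a normal kill, one thing I may try next (just a diagnostic sketch, assuming the full JDK is installed so jstack is available) is a JVM thread dump to see where the pipeline threads are sitting:

LS_PID=$(pgrep -f logstash | head -1)            # assumes a single logstash JVM on the box
jstack "$LS_PID" > /tmp/logstash-threads.txt
grep -i -A 5 worker /tmp/logstash-threads.txt    # the pipeline filter workers are the interesting threads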
This seems to happen at around 200K documents. It isn't affected by the number of indices - I started off with daily indices and then changed to weekly - it still stops around 200K docs.
Because this is a proof of concept running on a single machine I've turned the replica count down to 0 and shards to 1 - I don't think that should make any difference to this problem.
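For reference, this is roughly how I applied those settings - an index template matching the weekly indices (the template name and the use of localhost are just how I have it on this single box):

curl -XPUT 'http://localhost:9200/_template/clusters' -d '{
  "template": "clusters-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}'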
I don't see any errors in the logstash or elasticsearch logs despite turning verbosity up on both.
I don't think the system is running out of memory, disk space, or file descriptors.
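The checks I've been doing for that are nothing exotic, along these lines:

LS_PID=$(pgrep -f logstash | head -1)
free -m                                    # memory and swap
df -h                                      # disk space
grep 'open files' /proc/$LS_PID/limits     # file descriptor limit for the logstash process
ls /proc/$LS_PID/fd | wc -l                # descriptors actually in use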
I'm not sure what else to look at. This feels like a trivially sized problem for ELK, and I don't see it on an existing ELK setup which handles our mail logs (though that is running earlier versions and has multiple elasticsearch storage nodes).
Although I'm confident that there are no odd byte sequences in the input files, I've explicitly declared the input as US-ASCII with charset => "US-ASCII" in the file input plugin stanza. I don't expect this to make any difference (that test is still running).
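For reference, the file stanza with that change looks something like this (I believe in 1.5 the charset setting actually belongs on the codec rather than directly on the input):

input {
  file {
    path => "/var/lib/clusters/*"
    type => "clusterF"
    start_position => "beginning"
    codec => plain { charset => "US-ASCII" }
  }
}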
Update: Although there was nothing interesting in the logs when the import stalled, the lines logged when logstash was asked to shut down were interesting …
{:timestamp=>"2015-08-03T10:17:39.104000+0100", :message=>["INFLIGHT_EVENTS_REPORT", "2015-08-03T10:17:39+01:00", {"input_to_filter"=>20, "filter_to_output"=>0, "outputs"=>[]}], :level=>:warn}
That implies to me that the problem is at the filtering stage and not the output to elasticsearch. I've confirmed that by first getting rid of the elasticsearch output and just having stdout: that showed the same behaviour, with the import stalling after a while. Putting the elasticsearch output back but clearing out everything in the filter section gave me a successful, complete data import.
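For the stdout test the output section was cut down to something like this (the dots codec is just my choice here to keep the console output manageable - one dot per event):

output {
  stdout { codec => dots }
}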
I've now got a fix for this - details in answer.
input {
  file {
    path => "/var/lib/clusters/*"
    type => "clusterF"
    start_position => "beginning"
  }
}

filter {
  mutate {
    remove_field => [ "path", "host" ]
  }
  # 13COMP014 nabcteam 2015-07-29 11:09:21.353 153.493
  if [type] == "clusterF" {
    grok {
      match => { "message" => "%{NOTSPACE:client} +%{WORD:userid} +%{TIMESTAMP_ISO8601:datestamp} +%{BASE10NUM:elapsed:float}" }
    }
  }
  if [elapsed] < 0 {
    drop {}
  }
  if [elapsed] > 1000.0 {
    drop {}
  }
  if [userid] =~ "[a-z][0-9]{7}" {
    mutate {
      add_field => [ "userClass", "student" ]
    }
  } else if [userid] =~ "n[a-z].*" {
    mutate {
      add_field => [ "userClass", "staff" ]
    }
  } else {
    mutate {
      add_field => [ "userClass", "other" ]
    }
  }
  date {
    match => [ "datestamp", "ISO8601" ]
  }
  mutate {
    remove_field => [ "message" ]
  }
}

output {
  elasticsearch {
    bind_host => "clog01.ncl.ac.uk"
    protocol => "http"
    cluster => "elasticsearch"
    flush_size => 10
    index => "clusters-%{+xxxx.ww}"
  }
}