Paul Haldane's questions -server

Paul Haldane

Asked: 2015-08-03 12:33:17 +0800 CST

Logstash/elasticsearch stops accepting new data

4

I've set up a new proof of concept logstash system

CentOS 6.6 (on Vmware 5.5) - single CPU VM with 12G RAM allocated

Elasticsearch and Logstash installed from RPMs …

# rpm -q elasticsearch logstash
elasticsearch-1.7.1-1.noarch
logstash-1.5.3-1.noarch

JVM: 1.8.0_51

Data I'm feeding in is simple records of the form …

M1234 z123 2015-01-31 23:28:09.417 8.55373

(fields are machine name, userid, date, time, time taken to logon - everything is simple US-ASCII)

Logstash config below (this data comes from a MSSQL database and for the moment I'm exporting to text file and transferring file to the logstash server).

This worked fine for a day's worth of logs (11K records) but when I try to process the backlog from this calendar year it 'hangs'.

Symptoms of this are

elasticsearch is still responsive - searches and access to configuration still fine
the number of documents in the indices stops going up
the system becomes essential idle - only background disk activity and minimal CPU usage
if I try to stop the logstash process (which is still running) it will only die with kill -9.

This seems to happen at around 200K documents. It isn't affected by the number of indices - I started off with daily indices and then changed to weekly - it still stops around 200K docs.

Because this is a proof of concept running on single machine I've turned replica count down to 0 and shards to 1 - I don't think that should make any difference to this problem.

I don't see any errors in the logstash or elasticsearch logs despite turning verbosity up on both.

I don't think the system is running out of memory, disk space, file descriptors.

I'm not sure what else to look at. This feels like a trivially sized problem (for ELK) and I don't see this problem on an existing ELK setup which handles our mail logs (though that is running earlier versions and has multiple elasticsearch storage nodes)

Although I'm confident that there are no odd byte sequences in the input files, I've explicitly declared the input as US_ASCII with charset => "US-ASCII" in the file input plugin stanza. I don't expect this to make any difference (test is still running).

Update: Although there was nothing interesting in the logs when the import stalled the lines logged when logstash was asked to shutdown were interesting …

{:timestamp=>"2015-08-03T10:17:39.104000+0100", :message=>["INFLIGHT_EVENTS_REPORT", "2015-08-03T10:17:39+01:00", {"input_to_filter"=>20, "filter_to_output"=>0, "outputs"=>[]}], :level=>:warn}

Implies to me that the problem is at the filtering stage and not the output to elasticsearch. I've confirmed that by first getting rid of the elasticsearch output and just having stdout. That showed the same behaviour - import stalled after a while.

Putting the elasticsearch output back but clearing out everything in the filter section gave me a successful, complete data import.

I've now got a fix for this - details in answer.

input {
        file {
                path => "/var/lib/clusters/*"
                type => "clusterF"
                start_position => "beginning"
        }
}

filter {
        mutate {
                remove_field => [ "path", "host" ]
        }
        # 13COMP014   nabcteam    2015-07-29 11:09:21.353 153.493
        if [type] == "clusterF" {
                grok {
                        match => { "message" => "%{NOTSPACE:client} +%{WORD:userid} +%{TIMESTAMP_ISO8601:datestamp} +%{BASE10NUM:elapsed:float}" }
                }
        }
        if [elapsed] < 0 {
                drop {}
        }
        if [elapsed] > 1000.0 {
                drop {}
        }
        if [userid] =~ "[a-z][0-9]{7}" {
                mutate {
                        add_field => [ "userClass", "student" ]
                }
        } else if [userid] =~ "n[a-z].*" {
                mutate {
                        add_field => [ "userClass", "staff" ]
                }
        } else {
                mutate {
                        add_field => [ "userClass", "other" ]
                }
        }
        date {
                match => [ "datestamp", "ISO8601" ]
        }
        mutate {
                remove_field => [ "message" ]
        }
}

output {
        elasticsearch {
                bind_host => "clog01.ncl.ac.uk"
                protocol => "http"
                cluster => "elasticsearch"
                flush_size => 10
                index => "clusters-%{+xxxx.ww}"
        }
}

Paul Haldane

Asked: 2015-01-06 08:03:03 +0800 CST

logstash timestamp on year rollover

3

We use logstash to store/search logs from our mail servers. I noticed today that we didn't have any indices from this year (2015). Quick investigation showed that current logs were being stored as 2014.01.05 (ie same day but last year) and these indices were being deleted by a cron job that looks for old indices.

Restarting logstash fixed things so I assume that logstash is filling in the year information based on the time it started.

We're running Logstash 1.4.1 with Elasticsearch 1.2.4. So not the latest version of Elasticsearch but I don't see anything relevant in the changelog for 1.4.2.

Log entries are sent to logstash using syslog - config below along with example of input line and parsed output.

Is there a better fix for this than just remembering to restart Logstash on New Year's day?

Example of input line

Jan  5 15:03:35 cheviot22 exim[15034]: 1Y89Bv-0003uU-DD <= [email protected] H=adudeviis.ncl.ac.uk (campus) [10.8.232.56] P=esmtp S=2548 [email protected]

{
  "_index": "logstash-2014.01.05",
  "_type": "mails",
  "_id": "HO0TQs66SA-1QkQBYd9Jag",
  "_score": null,
  "_source": {
    "@version": "1",
    "@timestamp": "2014-01-05T15:03:35.000Z",
    "type": "mails",
    "priority": 22,
    "timestamp": "Jan  5 15:03:35",
    "logsource": "cheviot22",
    "program": "exim",
    "pid": "15034",
    "severity": 6,
    "facility": 2,
    "facility_label": "mail",
    "severity_label": "Informational",
    "msg": "1Y89Bv-0003uU-DD <= [email protected] H=adudeviis.ncl.ac.uk (campus) [10.8.232.56] P=esmtp S=2548 [email protected]",
    "tags": [
      "grokked",
      "exim_grokked",
      "dated"
    ],
    "xid": "1Y89Bv-0003uU",
    "exim_rcpt_kv": "[email protected] H=adudeviis.ncl.ac.uk (campus) [10.8.232.56] P=esmtp S=2548 [email protected]",
    "H": "adudeviis.ncl.ac.uk",
    "P": "esmtp",
    "S": "2548",
    "id": "[email protected]"
  },
  "sort": [
    1388934215000,
    1388934215000
  ]
}

Logstash config (with irrelevant bits removed) ...

input {
    syslog {
        codec => "plain"
        debug => false
        port => 514
        type => "mails"
    }
}

filter {
    mutate {
        remove_field => [ "path", "host" ]
    }

    if [type] == "mails" {
        grok {
            patterns_dir => [ "/etc/logstash/patterns" ]
            match => [ "message",  "(?<msg>.*)" ]
            add_tag => [ "grokked" ]
            break_on_match => true
            remove_field => [ "message" ]
        }
    }

    date {
        match => [ "timestamp", "ISO8601", "MMM dd HH:mm:ss", "MMM  d HH:mm:ss"]
        add_tag => [ "dated" ]
    }
}

output {
        elasticsearch {
                cluster => "logstash"
        host => "iss-logstash01"
        flush_size => 1000
        index => "logstash-%{+YYYY.MM.dd}"
        }
}

Paul Haldane

Asked: 2014-03-10 05:52:46 +0800 CST

Windows DNS servers repeatedly requesting records in zone when they get SERVFAIL response

2

We're seeing high levels (over 2000 requests/second) of DNS queries from our caching DNS servers to external servers. This may have been happening for a long time - this came to light recently because of performance problems with our firewall. Talking to colleagues at other institutions it's clear that we're making more queries than they are.

My initial thought was that the problem was lack of caching of SERVFAIL responses. Having done more investigation it's clear that the problem is a high level of requests for the failing record from the Windows DNS servers. It seems that in our environment a single query to one of the Windows DNS servers for a record from a zone which returns SERVFAIL results in a stream of requests for that record from all of the Windows DNS servers. The stream of requests doesn't stop until I add a fake empty zone on one of the Bind servers.

My plan tomorrow is to verify the configuration of the Windows DNS servers - they should just be forwarding to the caching Bind servers. I figure we must have something wrong there as I can't believe that no-one else has hit this if it's not a misconfiguration. I'll update this question after that (possibly closing this one and opening a new, clearer one).

Our setup is a pair of caching servers running Bind 9.3.6 which are used either directly by clients or via our Windows domain controllers. The caching servers pass queries to our main DNS servers which are running 9.8.4-P2 - these servers are authoritative for our domains and pass queries for other domains to external servers.

Behaviour we're seeing is that queries like the one below aren't being cached. I've verified this by looking at network traffic from the DNS servers using tcpdump.

 [root@dns1 named]# dig ptr 119.49.194.173.in-addr.arpa.

 ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-20.P1.el5_8.6 <<>> ptr 119.49.194.173.in-addr.arpa.
 ;; global options:  printcmd
 ;; Got answer:
 ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 8680
 ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

 ;; QUESTION SECTION:
 ;119.49.194.173.in-addr.arpa.   IN      PTR

 ;; Query time: 950 msec
 ;; SERVER: 127.0.0.1#53(127.0.0.1)
 ;; WHEN: Sun Mar  9 13:34:20 2014
 ;; MSG SIZE  rcvd: 45

Querying google's server directly shows that we're getting a REFUSED response.

[root@dns1 named]# dig ptr 119.49.194.173.in-addr.arpa. @ns4.google.com.

; <<>> DiG 9.3.6-P1-RedHat-9.3.6-20.P1.el5_8.6 <<>> ptr 119.49.194.173.in-addr.arpa. @ns4.google.com.
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 38825
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;119.49.194.173.in-addr.arpa.   IN      PTR

;; Query time: 91 msec
;; SERVER: 216.239.38.10#53(216.239.38.10)
;; WHEN: Sun Mar  9 13:36:38 2014
;; MSG SIZE  rcvd: 45

This isn't just happening with google addresses or reverse lookups but a high proportion of the queries are for those ranges (I suspect because of a Sophos reporting feature).

Should our DNS servers be caching these negative responses? I read http://tools.ietf.org/rfcmarkup?doc=2308 but didn't see anything about REFUSED. We don't specify lame-ttl in config file so I'd expect that to default to 10 minutes.

I believe this (the lack of caching) is expected behaviour. I don't understand why the other sites I've talked to aren't seeing the same thing. I've tried a test server running the latest stable version of Bind and that shows the same behaviour. I also tried Unbound and that didn't cache SERVFAIL either. There's some discussion of doing this in djbdns here but conclusion is that the functionality has been removed.

Are there settings in the Bind config that we could change to influence this behaviour? lame-ttl didn't help (and we were running with default anyway).

As part of investigation I've added some fake empty zones on our caching DNS servers to cover the ranges leading to most requests. That's dropped the number of requests to external servers but isn't sustainable (and feels wrong as well). In parallel with this I've asked a colleague to get logs from the Windows DNS servers so that we can identify the clients making the original requests.

Paul Haldane

Asked: 2014-02-16 08:40:08 +0800 CST

Forwarding from rsyslog to syslog-ng over TCP not working (although packets are reaching server)

2

We use syslog-ng on our central syslog server (syslog-ng-2.1.4-9.el5 on CentOS 5.9). We were happily sending logs using syslogd and rsyslog from a mixture of Linux and Solaris hosts over UDP until yesterday when it finally became clear to me that we're losing a significant number of entries (yes, I should have heeded all the warnings).

I'm trying to change to using TCP. I'm keen (at the moment) to stick with syslog-ng at the centre and rsyslog on the senders and my understanding is that this should work. The central syslog server has multiple virtual interfaces which are used to segment sets of logs by function (which is why the udp() and tcp() statements specify the IP address to bind to).

I enabled TCP listeners at the syslog-ng end (see extract from config file below) - netstat -l shows listeners on port 514. As a test I changed the forwarding clause on one host (CentOS 6.4 with rsyslog-5.8.10-6.el6.x86_64) from @unixlog to @@unixlog. I see the packets arriving at the central server and packets going back (looking with tcpdump on unixlog) so I think I've eliminated issues with iptables however nothing appears in the output file. I just tried turning iptables off for a while to check this - same thing.

I haven't tried turning on debugging for syslog-ng because this is a busy server - my next step is likely to be setting up a test syslog-ng server and pointing a single host at it. Before I do that is there anything else I should be looking at? Do I need to change the format of forwarded messages? My reading of the Syslog-ng 2.x docs suggests that this should work without any changes. I've tried changing the compatibility level option that rsyslog is called with. Was initially set to 5, I've tried 0 .. 4 and removing the parameter completely - no difference in behaviour.

Rsyslog.conf on sender (with comments and local files removed) …

$ModLoad imuxsock
$ModLoad imklog
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
*.err;kern.debug;daemon.notice;mail.crit                @unixlog.ncl.ac.uk
local3.debug                                            @cmdloghost.ncl.ac.uk

Extract from rsyslog.conf on unixlog

options {
  long_hostnames(off);
  sync(0);
  create_dirs(yes);
};

destination d_syslog {
  file("/var/log/incoming/syslogs/$HOST/syslog.$YEAR$MONTH$DAY.log");
};

# unixlog is 10.8.232.202
source unixlog_net { udp(ip(10.8.232.202)); tcp(ip(10.8.232.202)); };

log {
  source(unixlog_net);
  destination(d_syslog);
};

Paul Haldane

Asked: 2012-10-29 02:31:42 +0800 CST

Why can't we reach some (but not all) external web service via VPN connection?

0

At work (UK university) we use a set of Windows servers running WS2008R2 and RRAS which offer VPN service to students in our accommodation. We do this to associate the network connections with individuals. Before they've connected to the VPN all they can talk to is the stuff thats needed to setup the VPN and a local web site with documentation on how to connect. Medium term we'll probably replace this but it's what we're using at the moment. VPN on the 2008 servers allocates client a private (10.x) address. Access to external sites is through NAT on the campus routers (same as any other directly connected client on a private address). Non-VPN connections aren't seeing this problem.

Older servers run WS 2003 and ISA2004. That setup works but has become unreliable under load. Big difference there was that we were allocating non-RFC1918 addresses to the clients (so no NAT required). Behaviour we're seeing is that once connected to the VPN, clients can reach local web sites (that is sites on the campus network) but only some external sites. It seems (but this may be chance) that the sites we can reach are Google ones (including YouTube). We certainly have trouble reaching Microsoft's Office 365 service (which is a pain because that's where mail for most of our students is).

One odd bit of behaviour is that clients can fetch (using wget on a Windows 7 client) http://www.oracle.com/ (which gets a 301 redirect) but hangs when asked to fetch http://www.oracle.com/index.html (which is what the first URL redirects to). Access works reliably if we configure clients to use our local web proxies (Squid).

My gut tells me that this is likely to be something in the chain dropping replies either based on HTTP inspection or the IP address in the reply. However I'm puzzled about why we're seeing this with the VPN clients.

Plan for tomorrow (when I'm back in the office) is to setup a web server on external connection so that we can monitor behaviour at both ends of the conversation (hoping that the problem manifests itself with our test server). Any suggestions for things we should be looking at?

Logstash/elasticsearch stops accepting new data

logstash timestamp on year rollover

Windows DNS servers repeatedly requesting records in zone when they get SERVFAIL response

Forwarding from rsyslog to syslog-ng over TCP not working (although packets are reaching server)

Why can't we reach some (but not all) external web service via VPN connection?

Can you pass user/pass for HTTP Basic Authentication in URL parameters?

Ping a Specific Port

Check if port is open or closed on a Linux server?

How to automate SSH login with password?

How do I tell Git for Windows where to find my private RSA key?

What's the default superuser username/password for postgres after a new install?

What port does SFTP use?

Command line to list users in a Windows Active Directory group?

What is a Pem file and how does it differ from other OpenSSL Generated Key File Formats?

How to determine if a bash variable is empty?