I'm currently evaluating whether Logstash and Elasticsearch are useful for our use case. What I have is a log file containing multiple entries, which is of the form:
<root>
<entry>
<fieldx>...</fieldx>
<fieldy>...</fieldy>
<fieldz>...</fieldz>
...
<fieldarray>
<fielda>...</fielda>
<fielda>...</fielda>
...
</fieldarray>
</entry>
<entry>
...
</entry>
...
</root>
Each entry element would contain one log event. (If you are interested, the file is actually a Tempo Timesheets (an Atlassian JIRA plug-in) work-log export.)
Is it possible to transform such a file into multiple log events without writing my own codec?
Alright, I found a solution that works for me. The biggest problem with it is that the XML plugin is ... not quite unstable, but either poorly documented and buggy, or poorly and incorrectly documented.
TL;DR
Bash command line:
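A minimal sketch, assuming a gzipped export and a config file named logstash-xml.conf (the file name is just a placeholder):

gzcat -d file.xml.gz | tr -d "\n\r" | xmllint --format - | logstash -f logstash-xml.conf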
Logstash config:
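A minimal sketch of the config, combining the stdin input, the multiline filter and the xml filter described in the Detailed section below; the target field name entry and the outputs are assumptions:

input {
    stdin {}
}

filter {
    # merge every line indented by more than two spaces (or the two-space-indented
    # </entry> closing tag) into the previous line -- see the Detailed section
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => "previous"
    }
    # parse each merged <entry> block; the bare <root> / </root> lines will fail
    # to parse and get tagged _xmlparsefailure, so they can be dropped afterwards
    xml {
        source => "message"
        target => "entry"
    }
}

output {
    elasticsearch {}
    stdout { codec => rubydebug }
}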
Detailed
My solution works because, at least down to the entry level, my XML input is very uniform and can thus be handled by some kind of pattern matching.

Since the export is basically one really long line of XML, and the Logstash xml plugin essentially works only with fields (read: columns in lines) that contain XML data, I had to change the data into a more useful format.
Shell: Preparing the file
gzcat -d file.xml.gz |

Decompress the export. It was just too much data uncompressed -- obviously you can skip that.

tr -d "\n\r" |

Remove line breaks inside XML elements: some of the elements can contain line breaks as character data. The next step requires that these are removed, or encoded in some way. Even though it is assumed that at this point you have all the XML code on one massive line, it does not matter if this command removes any whitespace between elements.

xmllint --format - |

Format the XML with xmllint (comes with libxml). Here the single huge spaghetti line of XML (<root><entry><fieldx>...</fieldx></entry></root>) is properly formatted:
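With the example structure from the question, the formatted result would look roughly like this (xmllint indents with two spaces by default):

<root>
  <entry>
    <fieldx>...</fieldx>
    <fieldy>...</fieldy>
    <fieldarray>
      <fielda>...</fielda>
      <fielda>...</fielda>
    </fieldarray>
  </entry>
  <entry>
    ...
  </entry>
</root>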
Logstash

(See the full content of the .conf file in the TL;DR section.)

Here, the multiline filter does the trick. It can merge multiple lines into a single log message. And this is why the formatting with xmllint was necessary:
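A sketch of the relevant stanza, with the pattern spelled out from the description that follows (see the TL;DR section for the full config):

filter {
    # append every line that is indented by more than two spaces, or that is the
    # two-space-indented </entry> closing tag, to the previous line
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => "previous"
    }
}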
This basically says that every line with indentation of more than two spaces (or that is </entry>; xmllint indents with two spaces by default) belongs to the previous line. This also means that character data must not contain newlines (stripped with tr in the shell) and that the XML must be normalised (xmllint).

I had a similar case. To parse this XML:
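Assume, as a stand-in, the same <root>/<entry> shape as in the question:

<root>
  <entry>
    <fieldx>value</fieldx>
    <fieldy>value</fieldy>
  </entry>
</root>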
I use this configuration with Logstash:
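A sketch only, using a file input with a multiline codec that starts a new event at each <entry> line and the xml filter to parse the merged block; the path and target field name are placeholders:

input {
    file {
        # placeholder path to the exported XML file
        path => "/path/to/export.xml"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        codec => multiline {
            # start a new event on every line that opens an <entry>;
            # all other lines are appended to the event before it
            pattern => "^\s*<entry>"
            negate => true
            what => "previous"
        }
    }
}

filter {
    # parse the merged <entry> block from the message field; the surrounding
    # <root> / </root> wrapper lines will not parse and can be dropped
    xml {
        source => "message"
        target => "entry"
    }
}

output {
    stdout { codec => rubydebug }
}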
I hope this can help someone. It took me a long time to get it working.