I'm currently evaluating whether Logstash and Elasticsearch are useful for our use case. What I have is a log file containing multiple entries, which is of the form:
<root>
<entry>
<fieldx>...</fieldx>
<fieldy>...</fieldy>
<fieldz>...</fieldz>
...
<fieldarray>
<fielda>...</fielda>
<fielda>...</fielda>
...
</fieldarray>
</entry>
<entry>
...
</entry>
...
</root>
Each entry element would contain one log event. (If you are interested, the file is actually a Tempo Timesheets (an Atlassian JIRA plug-in) work-log export.)
Is it possible to transform such a file into multiple log events without writing my own codec?
Alright, I found a solution that works for me. The biggest problem with it is that the XML plugin is ... not quite unstable, but either poorly documented and buggy, or poorly and incorrectly documented.
TL;DR
Bash command line:
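A minimal sketch, assuming a gzipped export and a config file named logstash-xml.conf (the file name is just a placeholder):

gzcat -d file.xml.gz | tr -d "\n\r" | xmllint --format - | logstash -f logstash-xml.conf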
Logstash config:
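A minimal sketch of the config, combining the stdin input, the multiline filter and the xml filter described in the Detailed section below; the target field name entry and the outputs are assumptions:

input {
    stdin {}
}

filter {
    # merge every line indented by more than two spaces (or the two-space-indented
    # </entry> closing tag) into the previous line -- see the Detailed section
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => "previous"
    }
    # parse each merged <entry> block; the bare <root> / </root> lines will fail
    # to parse and get tagged _xmlparsefailure, so they can be dropped afterwards
    xml {
        source => "message"
        target => "entry"
    }
}

output {
    elasticsearch {}
    stdout { codec => rubydebug }
}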
Detailed
My solution works because, at least down to the entry level, my XML input is very uniform and can thus be handled by some kind of pattern matching.

Since the export is basically one really long line of XML, and the Logstash xml plugin essentially works only with fields (read: columns in lines) that contain XML data, I had to change the data into a more useful format.
Shell: Preparing the file
gzcat -d file.xml.gz |

Decompress the export. It was just too much data uncompressed -- obviously you can skip that.

tr -d "\n\r" |

Remove line breaks inside XML elements: some of the elements can contain line breaks as character data. The next step requires that these are removed, or encoded in some way. Even though it is assumed that at this point you have all the XML code on one massive line, it does not matter if this command removes any whitespace between elements.

xmllint --format - |

Format the XML with xmllint (comes with libxml). Here the single huge spaghetti line of XML (<root><entry><fieldx>...</fieldx></entry></root>) is properly formatted:
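With the example structure from the question, the formatted result would look roughly like this (xmllint indents with two spaces by default):

<root>
  <entry>
    <fieldx>...</fieldx>
    <fieldy>...</fieldy>
    <fieldarray>
      <fielda>...</fielda>
      <fielda>...</fielda>
    </fieldarray>
  </entry>
  <entry>
    ...
  </entry>
</root>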
Logstash

(See the full content of the .conf file in the TL;DR section.)

Here, the multiline filter does the trick. It can merge multiple lines into a single log message. And this is why the formatting with xmllint was necessary:
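A sketch of the relevant stanza, with the pattern spelled out from the description that follows (see the TL;DR section for the full config):

filter {
    # append every line that is indented by more than two spaces, or that is the
    # two-space-indented </entry> closing tag, to the previous line
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => "previous"
    }
}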
This basically says that every line with indentation of more than two spaces (or that is </entry>; xmllint indents with two spaces by default) belongs to the previous line. This also means that character data must not contain newlines (stripped with tr in the shell) and that the XML must be normalised (xmllint).

I had a similar case. To parse this XML:
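Assume, as a stand-in, the same <root>/<entry> shape as in the question:

<root>
  <entry>
    <fieldx>value</fieldx>
    <fieldy>value</fieldy>
  </entry>
</root>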
I use this configuration with Logstash:
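A sketch only, using a file input with a multiline codec that starts a new event at each <entry> line and the xml filter to parse the merged block; the path and target field name are placeholders:

input {
    file {
        # placeholder path to the exported XML file
        path => "/path/to/export.xml"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        codec => multiline {
            # start a new event on every line that opens an <entry>;
            # all other lines are appended to the event before it
            pattern => "^\s*<entry>"
            negate => true
            what => "previous"
        }
    }
}

filter {
    # parse the merged <entry> block from the message field; the surrounding
    # <root> / </root> wrapper lines will not parse and can be dropped
    xml {
        source => "message"
        target => "entry"
    }
}

output {
    stdout { codec => rubydebug }
}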
I hope this can help someone. It took me a long time to get it working.