I have a CentOS 6.5 server on which I installed Elasticsearch 1.3.2.
My elasticsearch.yml configuration file is a minimal modification of the default one that ships with Elasticsearch. Stripped of all commented lines, it looks like this:
cluster.name: xxx-kibana
node:
  name: "xxx"
  master: true
  data: true
index.number_of_shards: 5
index.number_of_replicas: 1
path:
  logs: /log/elasticsearch/log
  data: /log/elasticsearch/data
transport.tcp.port: 9300
http.port: 9200
discovery.zen.ping.multicast.enabled: false
Elasticsearch should have compression on by default, and I have read various benchmarks putting the compression ratio anywhere from 50% to 95%. Unfortunately, in my case the ratio is roughly -400%; in other words, data stored in ES takes roughly 4 times as much disk space as the text files with the same content. See:
12K logstash-2014.10.07/2/translog
16K logstash-2014.10.07/2/_state
116M logstash-2014.10.07/2/index
116M logstash-2014.10.07/2
12K logstash-2014.10.07/4/translog
16K logstash-2014.10.07/4/_state
127M logstash-2014.10.07/4/index
127M logstash-2014.10.07/4
12K logstash-2014.10.07/0/translog
16K logstash-2014.10.07/0/_state
109M logstash-2014.10.07/0/index
109M logstash-2014.10.07/0
16K logstash-2014.10.07/_state
12K logstash-2014.10.07/1/translog
16K logstash-2014.10.07/1/_state
153M logstash-2014.10.07/1/index
153M logstash-2014.10.07/1
12K logstash-2014.10.07/3/translog
16K logstash-2014.10.07/3/_state
119M logstash-2014.10.07/3/index
119M logstash-2014.10.07/3
622M logstash-2014.10.07/ # <-- This is the total!
versus:
6,3M /var/log/td-agent/legacy_api.20141007_0.log
8,0M /var/log/td-agent/legacy_api.20141007_10.log
7,6M /var/log/td-agent/legacy_api.20141007_11.log
6,7M /var/log/td-agent/legacy_api.20141007_12.log
8,0M /var/log/td-agent/legacy_api.20141007_13.log
7,6M /var/log/td-agent/legacy_api.20141007_14.log
7,6M /var/log/td-agent/legacy_api.20141007_15.log
7,7M /var/log/td-agent/legacy_api.20141007_16.log
5,6M /var/log/td-agent/legacy_api.20141007_17.log
7,9M /var/log/td-agent/legacy_api.20141007_18.log
6,3M /var/log/td-agent/legacy_api.20141007_19.log
7,8M /var/log/td-agent/legacy_api.20141007_1.log
7,1M /var/log/td-agent/legacy_api.20141007_20.log
8,0M /var/log/td-agent/legacy_api.20141007_21.log
7,2M /var/log/td-agent/legacy_api.20141007_22.log
3,8M /var/log/td-agent/legacy_api.20141007_23.log
7,5M /var/log/td-agent/legacy_api.20141007_2.log
7,3M /var/log/td-agent/legacy_api.20141007_3.log
8,0M /var/log/td-agent/legacy_api.20141007_4.log
7,5M /var/log/td-agent/legacy_api.20141007_5.log
7,5M /var/log/td-agent/legacy_api.20141007_6.log
7,8M /var/log/td-agent/legacy_api.20141007_7.log
7,8M /var/log/td-agent/legacy_api.20141007_8.log
7,2M /var/log/td-agent/legacy_api.20141007_9.log
173M total
What am I doing wrong? Why is data not being compressed?
I have provisionally added index.store.compress.stored: 1 to my configuration file, as I found that setting in the Elasticsearch 0.19.5 release notes (that is when store compression first came out), but I am not yet able to tell whether it makes a difference, and in any case compression should be on by default nowadays...
Elasticsearch does not shrink your data automagically. This is true for any database. Besides the raw data, every database has to store metadata along with it. Conventional databases only keep an index (for faster search) on the columns the database administrator chose up front. Elasticsearch is different in that it indexes every field by default. This makes the index very large, but in return gives very good performance when retrieving data.
In a typical configuration you see the data grow to 4 to 6 times its raw size after indexing, although this depends heavily on the actual data. This is intended behavior.
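As a quick check of where the question's numbers fall relative to that range, here is a minimal sketch of the arithmetic. The two sizes are copied from the du totals in the question; the function name is made up for illustration.

```python
# Sizes copied from the du totals in the question, in megabytes.
index_size_mb = 622   # logstash-2014.10.07/ total
raw_size_mb = 173     # total of the td-agent log files


def expansion_factor(indexed_mb, raw_mb):
    """Return how many times larger the index is than the raw data."""
    return indexed_mb / raw_mb


factor = expansion_factor(index_size_mb, raw_size_mb)
print(f"index is {factor:.1f}x the raw data")  # prints "index is 3.6x the raw data"
```

So the index in the question is slightly below the low end of the range quoted above.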
So to decrease the database size, you have to go the opposite way from what you would do in an RDBMS: exclude the fields you do not need from being indexed or stored.
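What excluding fields could look like is sketched below as an index template body in the mapping format of Elasticsearch 1.x. This is only an illustration: the field names `status` and `raw_headers` are hypothetical, and which fields you can afford to leave unindexed depends entirely on your queries. The script only builds and prints the JSON; it does not talk to a server.

```python
import json

# Hypothetical field names; substitute the fields your log events really have.
template = {
    "template": "logstash-*",  # index name pattern the template applies to
    "mappings": {
        "_default_": {
            # Do not build the catch-all _all field, which indexes a
            # second copy of every value.
            "_all": {"enabled": False},
            "properties": {
                # Matched as an exact value only, so skip text analysis.
                "status": {"type": "string", "index": "not_analyzed"},
                # Still kept in _source for display, but not searchable.
                "raw_headers": {"type": "string", "index": "no"},
            },
        }
    },
}

body = json.dumps(template, indent=2, sort_keys=True)
print(body)
```

The printed body could then be sent with a PUT request to /_template/<some name> on the node's HTTP port (9200 in the question's configuration). Note that with `_all` disabled, queries that do not name a field will no longer match anything by default, so this is a trade-off rather than a free saving.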
Additionally, you could turn on compression, but this only helps when your "documents" are large, which is probably not the case for log file entries.
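The effect of document size on compression can be illustrated with a general-purpose compressor. The sketch below uses Python's zlib, which is not the codec Elasticsearch uses internally, and the log lines are made up; it only shows that a single short entry barely compresses while many similar entries compressed together do.

```python
import zlib


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


# Made-up log lines, loosely shaped like API access log entries.
lines = [
    f"2014-10-07T12:{i % 60:02d}:00 GET /api/v1/items/{i} 200 {i * 7 % 1000}ms\n".encode()
    for i in range(1000)
]

single_ratio = ratio(lines[0])        # one small "document" on its own
batch_ratio = ratio(b"".join(lines))  # many entries compressed together

print(f"single entry: {single_ratio:.2f}")
print(f"1000 entries: {batch_ratio:.2f}")
```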
There are some comparisons and useful tips here: https://github.com/jordansissel/experiments/tree/master/elasticsearch/disk
But remember: searching comes at a cost, and that cost is disk space. In return you gain flexibility. If you outgrow your storage, scale horizontally! This is where Elasticsearch wins.