ELK: ElasticDump and Python to create a data warehouse job

By nature, the amount of data collected in your ElasticSearch instance will continue to grow, and at some point you will need to prune or warehouse indexes so that your active indexes are prioritized.

ElasticDump can assist either in moving your indexes to a distinct ElasticSearch instance that is set up specifically for long-term data, or in exporting the data as JSON for later import into a warehouse like Hadoop.  ElasticDump does not have a special filter for time-based indexes (index-YYYY.MM.DD), so you must specify exact index names.
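Because exact names are required, a script has to enumerate them itself for time-based indexes.  As a quick sketch, here is one way to generate the last 14 days of names in Python (the base name "index" is just a placeholder; substitute your own):

from datetime import date, timedelta

# build the exact index names (index-YYYY.MM.DD) for the last 14 days,
# since elasticdump has no time-based filter of its own
base = "index"  # placeholder base name
for n in range(14):
    day = date.today() - timedelta(days=n)
    print("{0}-{1}".format(base, day.strftime("%Y.%m.%d")))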

In this article we will use Python to query a source ElasticSearch instance (an instance meant for near real-time querying, which keeps a minimal amount of data), and export any indexes from the last 14 days into a target ElasticSearch instance (an instance meant for data warehousing, which has more persistent storage and where users expect multi-second query times).

Installation

The first step is to install using apt, pip, and npm:

$ sudo apt-get install python python-pip nodejs npm nodejs-legacy -y
$ sudo pip install elasticsearch
$ sudo npm cache clean
$ sudo npm install elasticdump -g --no-bin-links

Then validate the installation using:

$ /usr/local/lib/node_modules/elasticdump/bin/elasticdump --version

Test export to JSON

You can test your connection to an ElasticSearch instance by exporting one of its indexes as a local JSON file:

$ /usr/local/lib/node_modules/elasticdump/bin/elasticdump --input=http://<ES>:9200/<indexname> --output=test.json

Test export between ES instances

As a manual test of our ultimate goal, you can copy a single index from one ElasticSearch instance to another:

$ /usr/local/lib/node_modules/elasticdump/bin/elasticdump --input=http://<ESsrc>:9200/<indexname> --output=http://<ESdest>:9200/<indexname>

Python Script for automated warehousing

As you can see from the above commands, elasticdump only takes a single index name.  It may be tempting to throw together a quick shell script that loops through the last 14 days of YYYY.MM.DD, but there are many expected errors that would be hard to differentiate from real ones.

For example, it is not necessarily an error if you don't have a source index for a single day.  Also, you may or may not want elasticdump to skip the copy if the destination index already exists.  This can all be controlled if you use a more intelligent script.
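These checks are easy to express with the elasticsearch Python client installed earlier.  A minimal illustration, using the example hosts from later in this article and a hypothetical index name:

from elasticsearch import Elasticsearch

# clients for the source and destination instances (example hosts)
src = Elasticsearch(["http://192.168.1.2:9200"])
dest = Elasticsearch(["http://192.168.1.3:9200"])

index = "myindex-2016.01.01"  # hypothetical daily index name
if not src.indices.exists(index=index):
    print("no source index for this day; expected, not an error")
elif dest.indices.exists(index=index):
    print("destination index already exists; skipping copy")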

For this reason, I've put together a Python script, migrateESIndexes.py, that acts more intelligently by using the ElasticSearch API to query the source and destination before running elasticdump on each index. Its usage is:

$ ./migrateESIndexes.py baseIndexName src dest ndays [--dry-run]

It is best to try a dry-run first, which will tell you what it wants to do (without actually calling elasticdump):

$ ./migrateESIndexes.py myindex http://192.168.1.2:9200/ http://192.168.1.3:9200/ 7 --dry-run

Then, removing the --dry-run flag will perform the operations against the destination ElasticSearch instance:

$ ./migrateESIndexes.py myindex http://192.168.1.2:9200/ http://192.168.1.3:9200/ 7
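For reference, here is a minimal sketch of what the core of such a script might look like.  This is not the actual migrateESIndexes.py, just an outline of the approach described above: check the source and destination with the ElasticSearch API, then shell out to elasticdump only for the indexes that actually need copying.  The elasticdump path and the index naming pattern (baseIndexName-YYYY.MM.DD) are assumptions based on the earlier examples:

#!/usr/bin/env python
# sketch of a migrateESIndexes.py-style job; not the author's script
import subprocess
import sys
from datetime import date, timedelta

from elasticsearch import Elasticsearch

# path from the installation step above
ELASTICDUMP = "/usr/local/lib/node_modules/elasticdump/bin/elasticdump"

def main():
    base, src_url, dest_url, ndays = sys.argv[1:5]
    dry_run = "--dry-run" in sys.argv[5:]

    src = Elasticsearch([src_url])
    dest = Elasticsearch([dest_url])

    for n in range(int(ndays)):
        day = date.today() - timedelta(days=n)
        index = "{0}-{1}".format(base, day.strftime("%Y.%m.%d"))

        # a missing source index for a given day is expected, not an error
        if not src.indices.exists(index=index):
            print("skip {0}: no source index".format(index))
            continue

        # skip indexes already copied over (change this policy if you
        # prefer to overwrite instead)
        if dest.indices.exists(index=index):
            print("skip {0}: destination already exists".format(index))
            continue

        # src_url and dest_url end in "/", per the usage examples above
        cmd = [ELASTICDUMP,
               "--input={0}{1}".format(src_url, index),
               "--output={0}{1}".format(dest_url, index)]
        if dry_run:
            print("would run: {0}".format(" ".join(cmd)))
        else:
            subprocess.check_call(cmd)

if __name__ == "__main__":
    main()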


REFERENCES

https://www.npmjs.com/package/elasticdump

https://github.com/taskrabbit/elasticsearch-dump

http://tech.taskrabbit.com/blog/2014/01/06/elasticsearch-dump/