A pythonic tool for batch loading data files (json, parquet, csv, tsv) into ElasticSearch
Main features:
Batch upload CSV (actually any *SV) files to Elasticsearch
Batch upload JSON files / JSON lines to Elasticsearch
Batch upload parquet files to Elasticsearch
Predefine custom mappings
Delete index before upload
Index documents with _id from the document itself
Load data directly from a URL
Supports ES 1.X, 2.X and 5.X
And more
Installation
Usage
(venv)/tmp $ elasticsearch_loader --help
Usage: elasticsearch_loader [OPTIONS] COMMAND [ARGS]...

Options:
  --bulk-size INTEGER             How many docs to collect before writing to ElasticSearch (default 500)
  --concurrency INTEGER           How much worker threads to start (default 10)
  --es-host TEXT                  Elasticsearch cluster entry point. (default http://localhost:9200)
  --verify-certs                  Make sure we verify SSL certificates (default false)
  --use-ssl                       Turn on SSL (default false)
  --ca-certs TEXT                 Provide a path to CA certs on disk
  --http-auth TEXT                Provide username and password for basic auth in the format of username:password
  --index TEXT                    Destination index name  [required]
  --delete                        Delete index before import? (default false)
  --type TEXT                     Docs type  [required]
  --id-field TEXT                 Specify field name that be used as document id
  --index-settings-file FILENAME  Specify path to json file containing index mapping and settings
  --help                          Show this message and exit.

Commands:
  csv
  json     FILES with the format of [{"a": "1"}, {"b":...
  parquet
Examples
Load CSV to elasticsearch
elasticsearch_loader --index incidents --type incident csv file1.csv
Load 2 CSV to elasticsearch
elasticsearch_loader --index incidents --type incident csv file1.csv file2.csv
Load JSON to elasticsearch
elasticsearch_loader --index incidents --type incident json *.json
Load all git commits into elasticsearch
git log --pretty=format:'{"sha":"%H","author_name":"%aN", "author_email": "%aE","date":"%ad","message":"%f"}' | elasticsearch_loader --type git --index git json --json-lines -
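The --json-lines flag in the command above switches the json command from a single JSON array (the [{"a": "1"}, {"b":... shape shown in the help text) to one JSON object per line. The difference between the two input shapes can be sketched in Python. This is only an illustration, not the tool's own parsing code, and the helper name iter_docs is hypothetical.

```python
import io
import json


def iter_docs(stream, json_lines=False):
    """Yield one dict per document from a text stream.

    Hypothetical helper: it mirrors the two input shapes only.
    """
    if json_lines:
        # One JSON object per non-empty line.
        for line in stream:
            line = line.strip()
            if line:
                yield json.loads(line)
    else:
        # A single JSON array holding all documents.
        for doc in json.load(stream):
            yield doc


array_input = io.StringIO('[{"a": "1"}, {"b": "2"}]')
lines_input = io.StringIO('{"sha": "abc"}\n{"sha": "def"}\n')

array_docs = list(iter_docs(array_input))
line_docs = list(iter_docs(lines_input, json_lines=True))
```

Both calls produce a list of two dicts; only the framing of the input differs.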
Load parquet to elasticsearch
elasticsearch_loader --index incidents --type incident parquet file1.parquet
Load JSON from a GitHub repo (actually any http/https URL is ok)
elasticsearch_loader --index data --type avg_height --id-field country json https://raw.githubusercontent.com/samayo/country-data/master/src/country-avg-male-height.json
Load data from stdin
generate_data | elasticsearch_loader --index data --type incident csv -
Read _id from incident_id field
elasticsearch_loader --id-field incident_id --index incidents --type incident csv file1.csv file2.csv
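What --id-field amounts to can be sketched as follows: each row becomes a document, and the value of the chosen field is reused as the document _id. The action dict shape below follows the convention used by the bulk helpers of the Python Elasticsearch client. Whether the tool builds exactly this structure is an assumption, and to_actions and the sample data are hypothetical.

```python
import csv
import io


def to_actions(rows, index, doc_type, id_field=None):
    """Turn row dicts into bulk-style action dicts (hypothetical helper)."""
    for row in rows:
        action = {"_index": index, "_type": doc_type, "_source": row}
        if id_field is not None:
            # Reuse a value from the document itself as its _id.
            action["_id"] = row[id_field]
        yield action


# Made-up CSV content standing in for file1.csv.
sample_csv = io.StringIO("incident_id,severity\n17,high\n18,low\n")
rows = csv.DictReader(sample_csv)
actions = list(to_actions(rows, "incidents", "incident", id_field="incident_id"))
```

Without an id field, the _id key is simply left out and Elasticsearch assigns one itself.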
Change bulk size
elasticsearch_loader --bulk-size 300 --index incidents --type incident csv file1.csv file2.csv
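As a rough picture of what --bulk-size controls, documents are collected into groups of at most that many before each write. A minimal sketch of such grouping, assuming nothing about how the tool implements it (the batches helper is hypothetical):

```python
from itertools import islice


def batches(docs, bulk_size=500):
    """Group an iterable of docs into lists of at most bulk_size items."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, bulk_size))
        if not batch:
            return
        yield batch


# 700 made-up documents split with a bulk size of 300.
docs = ({"n": i} for i in range(700))
sizes = [len(b) for b in batches(docs, bulk_size=300)]
```

Here sizes ends up as two full batches of 300 and a final partial batch of 100.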
Change index concurrency
elasticsearch_loader --concurrency 20 --index incidents --type incident csv file1.csv file2.csv
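The help text describes --concurrency as the number of worker threads. One way to picture this is a thread pool that sends batches in parallel. The sketch below uses a stand-in send_batch function instead of a real bulk request, so it shows only the general pattern and is not the tool's actual worker code.

```python
from concurrent.futures import ThreadPoolExecutor


def send_batch(batch):
    """Stand-in for a bulk request; returns how many docs it 'sent'."""
    return len(batch)


# Three made-up batches of 300 documents each.
all_batches = [
    [{"n": i} for i in range(start, start + 300)]
    for start in range(0, 900, 300)
]

# max_workers plays the role of --concurrency in this sketch.
with ThreadPoolExecutor(max_workers=20) as pool:
    sent = sum(pool.map(send_batch, all_batches))
```

Raising the worker count only helps while the cluster can absorb the extra parallel bulk requests.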
Load custom mappings
elasticsearch_loader --index-settings-file samples/mappings.json --index incidents --type incident csv file1.csv file2.csv
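The contents of samples/mappings.json are not shown here. As a sketch, and assuming the file follows the body format of the Elasticsearch create-index request, a settings file could be generated like this. The field names and settings values are hypothetical, and the keyword type applies to ES 5.X (older versions use string fields instead).

```python
import json
import os
import tempfile

# Hypothetical index settings and a mapping for the "incident" type.
index_settings = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "incident": {
            "properties": {
                "incident_id": {"type": "keyword"},
                "opened_at": {"type": "date"},
            }
        }
    },
}

# Write the file to a temporary directory, then read it back.
path = os.path.join(tempfile.mkdtemp(), "mappings.json")
with open(path, "w") as handle:
    json.dump(index_settings, handle, indent=2)

with open(path) as handle:
    reloaded = json.load(handle)
```

The resulting path could then be passed to --index-settings-file.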
Tests and sample data
Tests are located under test and can be run with tox. Sample input files can be found under samples.
TODO
[x] parquet support
[x] progress bar
[ ] DLQ-style output file for docs that failed to be indexed
[x] Python3 support
[x] pep8 test
Source Distribution
Hashes for elasticsearch-loader-0.2.0.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 4e5bd3ff666395d9e4feefb521b29928ae2ea76dd12c2fac041e0c2d507171ca
MD5 | 53a9085eec17a0c4529540b7fc3a2f60
BLAKE2b-256 | 0e5e2a706587e5dcd7df5f25e34a9f03d609b709777f5eae54daa1c338678087