
Bulk import a CSV or TSV into Elasticsearch

Project description


The csv2es project is an Apache 2.0 licensed commandline utility, written in Python, to load a CSV (or TSV) file into an Elasticsearch instance. That's pretty much it. That's all it does. The first row of the file should contain the field names to use for the Elasticsearch documents; otherwise things will get weird. There's a little trick documented below to add a header row in case the file is missing one.

Features

  • Minimal commandline interface

  • Load CSV’s or TSV’s

  • Customize the field delimiter (any single character)

  • Uses the Elasticsearch bulk API

  • Parallel bulk uploads

  • Retry on errors with exponential backoff
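
Under the hood, each chunk of rows is shipped to the Elasticsearch bulk endpoint. Roughly speaking, a chunk becomes a newline-delimited request along the lines of the hand-written sketch below (illustrative only, not actual csv2es output; the index, doc type, and fields are just the potato example used later on):

curl -s -XPOST 'http://127.0.0.1:9200/_bulk' --data-binary @- <<'EOF'
{"index": {"_index": "potatoes", "_type": "potato"}}
{"potato_id": "33", "potato_type": "sweet", "description": "kinda oval"}
{"index": {"_index": "potatoes", "_type": "potato"}}
{"potato_id": "17", "potato_type": "regular", "description": "bumpy"}
EOF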

Installation

To install csv2es, simply:

$ pip install csv2es
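
Once it's installed, a quick sanity check never hurts (both flags show up in the usage output below):

$ csv2es --version
$ csv2es --help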

Usage

Usage: csv2es [OPTIONS]

  Bulk import a delimited file into a target Elasticsearch instance. Common
  delimited files include things like CSV and TSV.

  Load a CSV file:
    csv2es --index-name potatoes --doc-type potato --import-file potatoes.csv

  For a TSV file, note the tab delimiter option
    csv2es --index-name tomatoes --doc-type tomato --import-file tomatoes.tsv --tab

  For a nifty pipe-delimited file (delimiters must be one character):
    csv2es --index-name pipes --doc-type pipe --import-file pipes.psv --delimiter '|'

Options:
  --index-name TEXT          Index name to load data into           [required]
  --doc-type TEXT            The document type (like user_records)  [required]
  --import-file TEXT         File to import (or '-' for stdin)      [required]
  --mapping-file TEXT        JSON mapping file for index
  --delimiter TEXT           The field delimiter to use, defaults to CSV
  --tab                      Assume tab-separated, overrides delimiter
  --host TEXT                The Elasticsearch host (http://127.0.0.1:9200/)
  --docs-per-chunk INTEGER   The documents per chunk to upload (5000)
  --bytes-per-chunk INTEGER  The bytes per chunk to upload (100000)
  --parallel INTEGER         Parallel uploads to send at once, defaults to 1
  --delete-index             Delete existing index if it exists
  --quiet                    Minimize console output
  --version                  Show the version and exit.
  --help                     Show this message and exit.
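
The chunking and parallelism options can be combined when a single upload stream isn't fast enough. For example, bigger chunks with four parallel uploads (the numbers here are arbitrary; tune them for your own data and cluster):

csv2es --index-name potatoes --doc-type potato --import-file potatoes.csv --docs-per-chunk 10000 --parallel 4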

Examples

Let’s say we’ve got a potatoes.csv file with a nice header that looks like this:

potato_id,potato_type,description
33,sweet,"kinda oval"
17,regular,bumpy
91,regular,"perfectly round"
18,sweet,delightful
42,fried,crispy
37,"extra special",crispy

Now we can stuff it into Elasticsearch:

csv2es --index-name potatoes --doc-type potato --import-file potatoes.csv
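
To check that the documents actually landed, the count API is a quick way to do it (assuming the default host of http://127.0.0.1:9200/):

curl 'http://127.0.0.1:9200/potatoes/_count?pretty'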

But what if it was tomatoes.tsv and separated by tabs? Well, we can do this:

csv2es --index-name tomatoes --doc-type tomato --import-file tomatoes.tsv --tab

Advanced Examples

What if we have a super cool pipe-delimited file and want to wipe out the existing “pipes” index every time we load it up? This ought to handle that case:

csv2es --index-name pipes --delete-index --doc-type pipe --import-file pipes.psv --delimiter '|'
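
To confirm that the old index really was dropped and rebuilt, the cat indices API is handy (again assuming the default host):

curl 'http://127.0.0.1:9200/_cat/indices?v'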

Elasticsearch is great, but it's doing something strange to our documents when we try to facet by certain fields. Let's create a custom mapping file, potatoes.mapping.json, to control how the fields from potatoes.csv are indexed:

{
    "dynamic": "true",
    "properties": {
        "potato_id": {"type": "long"},
        "potato_type": {"type": "string", "index": "not_analyzed"},
        "description": {"type": "string", "index": "not_analyzed"}
    }
}

Now let’s load the data with a custom mapping file:

csv2es --index-name potatoes --doc-type potato --mapping-file potatoes.mapping.json --import-file potatoes.csv
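
You can double-check that the mapping took effect by asking Elasticsearch for it directly (default host assumed):

curl 'http://127.0.0.1:9200/potatoes/_mapping?pretty'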

What if my file is missing the header row, and it’s super huge because there are so many potatoes in it, and everything is terrible? We can use sed to tack on a nice header with something like this:

sed -i 1i"potato_id,potato_type,description" potatoes.csv

As long as you have more disk space than the size of the file, this should be fine.
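
If rewriting a huge file in place makes you nervous, another option is to prepend the header on the fly and stream the result in over stdin, since --import-file accepts '-':

{ echo "potato_id,potato_type,description"; cat potatoes.csv; } | csv2es --index-name potatoes --doc-type potato --import-file -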

Contribute

  1. Check for open issues or open a fresh issue to start a discussion around a feature idea or a bug.

  2. Fork the repository on GitHub to start making your changes to the master branch (or branch off of it).

  3. Write a test which shows that the bug was fixed or that the feature works as expected.

  4. Send a pull request and bug the maintainer until it gets merged and published. :) Make sure to add yourself to AUTHORS.

History

1.0.1 (2015-06-02)

  • Add option to stream from stdin

1.0.0 (2015-04-23)

  • Add retrying support with exponential backoff per chunk for bulk uploads

  • Add parallel bulk uploading via joblib

  • Stable release

1.0.0.dev3 (2015-04-19)

  • Switch over to Click for handling executable

  • Fix --delete-index flag

  • Add --version option

1.0.0.dev2 (2015-04-19)

  • Fix import errors

1.0.0.dev1 (2015-04-18)

  • Tinkering with documentation and PyPI updates

1.0.0.dev0 (2015-04-18)

  • First dev version now exists

  • Apache 2.0 license applied

  • Finalize commandline interface

  • Clean up setup.py and how the test suite runs

  • Added Travis CI support

