
Project description

Simple Python CLI tool to easily extract a massive amount of Elasticsearch documents into a csv file, exploiting multiprocessing features and leveraging the underlying elasticsearch-py package.


Requirements

In order to install and use the elasticsearch-tocsv tool, you need to install the following packages first:


Installation

Install the elasticsearch-tocsv package with:

$ pip3 install elasticsearch-tocsv
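
Once installed, you can check that the command is available by printing its help message (all the possible arguments are described below):

$ elasticsearch_tocsv --help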

Python >= 3.8 is not strictly required (3.7 should also work but has not been fully tested), but it is highly recommended, as earlier versions might experience logging problems in multiprocessing mode.


Arguments description

Running elasticsearch_tocsv --help in the terminal shows all the arguments the command can be launched with. Some of them are mandatory, while others might depend on each other. Below is a complete list describing what each argument is, whether it is mandatory or optional, and its intended use. Some usage examples are reported further down.

MANDATORY

  • -f, --fields

    Elasticsearch fields, passed as a string with commas between fields and no whitespace (e.g. "field1,field2").
  • -i, --index

    Elasticsearch index pattern to query on. To use wildcards (*), it's better to wrap the index pattern in quotes (e.g. "my-indices*").


OPTIONAL

  • -af, --aggregation_fields [default: None]

    Set this option if you want to generate an additional file (the raw export file will still be generated) containing the info aggregated according to specific fields.
    Specify the fields to aggregate on as a string with commas between fields and no whitespace (e.g. "field1,field2"). An example is shown under Usage examples below.
  • -asi, --allow_short_interval [default: False]

    Set this option to True to allow the --load_balance_interval to go below 1 day.
    With this option enabled, the --load_balance_interval can be set down to 1 minute (1m).
  • -at, --aggregation_type [default: count]

    Aggregation function to use when generating the aggregated csv file.
    It can be one of the following: ['count', 'min', 'max', 'mean', 'sum'].
    This option requires --aggregation_fields to be set.
  • -b, --batch_size [default: 5000]

    Batch size for the scroll API. Max 10000.
    Increasing it might impact the ES instance heap memory. If you want to set a value greater than 10000, you must first set the max_result_window Elasticsearch index property accordingly (see the sketch right after this list).
    Please check out the Elasticsearch documentation before increasing this value.
  • -c, --cert_verification [default: False]

    Require ssl certificate verification. Set to True to enable.
    This option is ignored if --ssl is not set to True.
  • -cp, --certificate_path [default: '']

    Path to the certificate to verify the instance certificate against.
    This option is ignored if --ssl and --cert_verification are not set to True.
  • -dp, --disable_progressbar [default: False]

    Turn off the progressbar visualization.
    Set to True to simply be notified when processes have completed fetching data, without the loading progressbars.
    Might be useful in case you redirect the output to a file.
  • -e, --export_path [default: es_export.csv]

    Path to save the csv file to. Make sure the user who's launching the script is allowed to write to that path.
    WARNING: at the end of the process, unless --keep_partials is set to True, all the files whose names match "[--export_path]_process*.csv" will be removed. Make sure you set an --export_path that won't accidentally delete any files other than the ones created by this script.
  • -ed, --ending_date [default: now+1000y]

    Query ending date. Must be set in ISO 8601 format, without the timezone (e.g. "YYYY-MM-ddTHH:mm:ss").
    The timezone can be specified with the --timezone option.
    This option requires --time_field to be set.
  • -em, --enable_multiprocessing [default: False]

    Enable multiprocessing. Set to True to exploit multiprocessing.
    This option requires --time_field to be set.
  • -h, --help

    Show the help message, with all the available arguments, and exit.
  • -ho, --host [default: localhost]

    Elasticsearch host.
  • -k, --keep_partials [default: False]

    During processing, various partial csv files will be created before merging them into a single csv. Set this flag to True if you want to keep these partial files as well.
    Note that the partial files will be kept anyway if something goes wrong during the creation of the final file.
  • -lbi, --load_balance_interval [default: None]

    Set this option to build process intervals by event count rather than equally spaced over time. The shorter the interval, the better the events-to-process division, but the heavier the initial computation to build the intervals.
    Cannot go below 1d if --allow_short_interval is not set.
    Allowed values are a number followed by one of [m, h, d, M, y], like 1d for 1 day or 4M for 4 months.
    Multiprocessing must be enabled to set this option.
  • -mf, --metadata_fields [default: '']

    Elasticsearch metadata fields (_index, _type, _id, _score), passed as a string with commas between fields and no whitespace (e.g. "_id,_index").
  • -o, --scroll_timeout [default: 4m]

    Scroll window timeout. Defaults to 4 minutes.
  • -p, --port [default: 9200]

    Elasticsearch port.
  • -pcs, --partial_csv_size [default: 10000000]

    Max number of rows each partial csv can contain. The higher the number of fields to extract, the lower this number should be, so as not to keep too much data in memory.
    If set, it must be greater than --batch_size (default 5000).
  • -pn, --process_number [default: number of CPUs of the machine]

    Number of processes to run the script on.
  • -pw, --password [default: None]

    Elasticsearch password in clear text.
    If set, --secret_password will be ignored.
    If neither this nor --secret_password is set, you will be prompted for a password (leave it blank if not needed).
  • -q, --query_string [default: *]

    Elasticsearch query string. Put it between quotes and escape the internal quote characters (e.g. "one_field: foo AND another_field.keyword: \"bar\""). An example is shown under Usage examples below.
  • -rd, --remove_duplicates [default: False]

    Set to True to remove all duplicated events.
    WARNING: two events with the same values of the fields specified in --fields will be considered duplicates and unified, even though on ES they might not be equal because of other fields not included in --fields (e.g. _id).
    Check out the --metadata_fields option to include further info like the ES _id. An example is shown under Usage examples below.
  • -s, --ssl [default: False]

    Require ssl connection. Set to True to enable.
  • -sd, --starting_date [default: now-1000y]

    Query starting date. Must be set in ISO 8601 format, without the timezone (e.g. "YYYY-MM-ddTHH:mm:ss").
    The timezone can be specified with the --timezone option.
    This option requires --time_field to be set.
  • -spw, --secret_password [default: None]

    Name of the environment variable containing the Elasticsearch password. If neither this nor --password is set, you will be prompted for a password (leave it blank if not needed). An example is shown under Usage examples below.
  • -t, --time_field [default: None]

    Time field to query on. If it is not set and --starting_date or --ending_date are set, an exception will be raised.
  • -tz, --timezone [default: timezone of the machine]

    Timezone to set, following the IANA time zone naming convention (e.g. "America/New_York", "Europe/Paris" or "UTC").
  • -u, --user [default: '']

    Elasticsearch user, if any.

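As mentioned for the --batch_size option, batch sizes greater than 10000 require raising the max_result_window index property first. A minimal sketch of how this could look through the Elasticsearch REST API, assuming a hypothetical index my_sample_index and a target batch size of 20000:

$ curl -X PUT "localhost:9200/my_sample_index/_settings" -H "Content-Type: application/json" -d '{"index.max_result_window": 20000}'
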
Usage examples

  • Connection to localhost to export fields ["@timestamp", "field_1", "field_2"] of all the data of the my_sample_index index:

    $ elasticsearch_tocsv -i my_sample_index -f "@timestamp,field_1,field_2"
  • Connection to host 10.20.30.40 to export fields ["@timestamp", "field_1", "field_2"] of the January 2020 data of the my_sample_index index. Export to the file my_export_file.csv:

    $ elasticsearch_tocsv -ho 10.20.30.40 -i my_sample_index -f "@timestamp,field_1,field_2" -sd "2020-01-01T00:00:00" -ed "2020-02-01T00:00:00" -t "@timestamp" -e my_export_file.csv
  • Connection to localhost over SSL to export fields ["@timestamp", "field_1", "field_2"] of all the data of the my_sample_index index. Authentication is required:

    $ elasticsearch_tocsv -i my_sample_index -f "@timestamp,field_1,field_2" -s True -u my_user
  • Connection to localhost over SSL with certificate verification to export fields ["@timestamp", "field_1", "field_2"] of all the data of the my_sample_index index:

    $ elasticsearch_tocsv -i my_sample_index -f "@timestamp,field_1,field_2" -s True -c True -cp "path/to/certificate.pem"
  • Connection to localhost to export fields ["@timestamp", "field_1", "field_2"] of all the data of the my_sample_index index, exploiting multiprocessing:

    $ elasticsearch_tocsv -i my_sample_index -f "@timestamp,field_1,field_2" -em True -t "@timestamp"
  • Connection to localhost to export fields ["@timestamp", "field_1", "field_2"] of the January 2020 data of the my_sample_index index, exploiting multiprocessing and dividing processing intervals by event count with 1-day precision:

    $ elasticsearch_tocsv -i my_sample_index -f "@timestamp,field_1,field_2" -sd "2020-01-01T00:00:00" -ed "2020-02-01T00:00:00" -t "@timestamp" -em True -lbi 1d
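  • A sketch of the aggregation options (field names are hypothetical): connection to localhost to export fields ["@timestamp", "field_1", "field_2"] of all the data of the my_sample_index index, also generating an additional file with the events counted by ["field_1", "field_2"]:

    $ elasticsearch_tocsv -i my_sample_index -f "@timestamp,field_1,field_2" -af "field_1,field_2" -at count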
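  • A sketch of the --query_string option (field names and values are hypothetical): connection to localhost to export fields ["@timestamp", "field_1", "field_2"] of the my_sample_index documents matching a query string:

    $ elasticsearch_tocsv -i my_sample_index -f "@timestamp,field_1,field_2" -q "field_1: foo AND field_2.keyword: \"bar\""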
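  • A sketch of the --secret_password option (the ES_PWD variable name is hypothetical): store the Elasticsearch password in an environment variable and pass the variable name to the script, so the password never appears in the command line:

    $ export ES_PWD=my_secret_password
    $ elasticsearch_tocsv -i my_sample_index -f "@timestamp,field_1,field_2" -u my_user -spw ES_PWD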
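  • A sketch of the --remove_duplicates option (field names are hypothetical): connection to localhost to export fields ["@timestamp", "field_1"] of the my_sample_index index, unifying rows that share the same values for all the exported fields (add -mf "_id" to keep events distinct by their ES _id instead):

    $ elasticsearch_tocsv -i my_sample_index -f "@timestamp,field_1" -rd True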

Known bugs and required fixes

  1. Standard output multiprocessing printing when progress bars are not disabled

This is a known issue. When multiprocessing is enabled, progress bar printing might get a bit messy from time to time. This doesn't cause any real usage problems, but I know it might be a bit annoying. It'll hopefully be fixed as soon as possible.
