Pulse Data Extractor

pulse-data-extractor

An easy button for getting data out of Pulse. Downloads Pulse documents from Elasticsearch and saves them as .jsonl, .json, pickle (.pkl), or .csv files.

Installation

To install in an existing environment, run this command:

pip install ist-pulse-data-extractor

To define as a project requirement, add the following line to requirements.txt:

ist-pulse-data-extractor

Usage

download

Performs a multi-process sliced query for documents in a Pulse Elasticsearch index and saves the result in the format indicated by the output file's extension. Documents can optionally be flattened (use with caution).

from pulse.downloader import download
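
As a rough illustration of what a sliced query looks like at the Elasticsearch level, the sketch below splits one search body into per-worker bodies using Elasticsearch's `slice` parameter. The helper name `plan_slices` and the rule for choosing the number of slices (sample size divided by slice size, rounded up) are assumptions made for this example; how the package actually plans its slices is not documented here.

```python
import math


def plan_slices(query, sample_size, query_slice_size):
    """Return one search body per worker (hypothetical helper, not part of the package)."""
    # Assumption: the number of slices is the sample size divided by the slice size, rounded up.
    n_slices = max(1, math.ceil(sample_size / query_slice_size))
    if n_slices == 1:
        # Elasticsearch requires slice "max" to be greater than 1,
        # so a single worker sends the query unsliced.
        return [dict(query)]
    # Each worker gets the same query plus its own slice id.
    return [{**query, "slice": {"id": i, "max": n_slices}} for i in range(n_slices)]


slice_bodies = plan_slices(
    {"query": {"match_all": {}}}, sample_size=20000, query_slice_size=5000
)
```

Each body could then be run by a separate process, which is the general idea behind the query_slice_size and query_concurrency options listed below.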

Required Parameters

  • index: Elasticsearch index

  • query: Elasticsearch query

  • filepath: Output filepath. File extension should match the desired output format. Supported formats include:

    • .jsonl: Fastest to download, suited for large datasets. Lowest memory overhead in downstream processes.
    • .json: Standard format, faster to load than .jsonl, but not suitable for large datasets because the whole file must be loaded into memory at once.
    • .pkl: Fastest to load if using result in Python script
    • .csv: Best when consuming data with Excel or pandas. Fields are automatically flattened. If some fields contain data that can't be flattened automatically, a separate post-processing script is recommended.
  • es_hosts: A list of Elasticsearch hosts. Not required if the ES_URL environment variable is set. Each item should be a fully qualified URL, including credentials if applicable. Overrides any hosts that may exist in configuration. Example:

      https://elastic:password@node1.host.com:9200,https://elastic:password@node2.host.com:9200
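
The example above is written as a single comma-separated string, while es_hosts is described as a list. If the hosts are stored that way, a small helper can split them first. This is a minimal sketch: `split_hosts` is a hypothetical name, and it assumes no URL contains a comma (for instance inside a password).

```python
def split_hosts(value):
    """Turn 'url1,url2' into ['url1', 'url2'], ignoring blanks and stray spaces."""
    # Hypothetical helper, not part of the package.
    return [host.strip() for host in value.split(",") if host.strip()]


es_hosts = split_hosts(
    "https://elastic:password@node1.host.com:9200,"
    "https://elastic:password@node2.host.com:9200"
)
```

The resulting list has the same shape as the es_hosts argument used in the download example further down.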
    

Options

  • sample_size: Maximum number of results to return (default=20000).
  • fields: A list of fields to return from Elasticsearch. Limiting the number of fields reduces download time.
  • flatten_doc: Flatten documents. Useful when working with data frames, but has nuances. Use with caution.
  • delimiter: Delimiter to use when flattening fields
  • include_meta_attribs: Only applicable when flattening. When false, all meta.*.attribs fields are discarded.
  • no_flatten: A list of fields that should not be flattened
  • query_slice_size: Maximum number of documents per slice (worker)
  • query_concurrency: Maximum number of queries to run concurrently
  • auto_mkdir: Automatically create output directory if it doesn't exist
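
To make the flattening options more concrete, here is a minimal sketch of what flattening a nested document with a delimiter and a no-flatten list can look like. It is not the package's implementation: the `flatten` function, the sample document, and the choice to match no_flatten entries against delimiter-joined paths are all assumptions for illustration.

```python
def flatten(doc, delimiter=".", no_flatten=(), _prefix=""):
    """Collapse nested dicts into one level of delimiter-joined keys (illustrative only)."""
    flat = {}
    for key, value in doc.items():
        path = f"{_prefix}{delimiter}{key}" if _prefix else key
        if isinstance(value, dict) and value and path not in no_flatten:
            # Recurse into nested objects, carrying the path built so far.
            flat.update(flatten(value, delimiter, no_flatten, path))
        else:
            # Scalars, lists, empty dicts, and protected fields are kept as-is.
            flat[path] = value
    return flat


# Made-up document; real Pulse documents will have different fields.
doc = {
    "norm": {"body": "hello", "author": {"name": "a"}},
    "meta": {"lang": {"results": [{"value": "en"}]}},
}
flat = flatten(doc, delimiter="_", no_flatten=("meta",))
```

Here `flat` has the keys `norm_body`, `norm_author_name`, and `meta`, with `meta` left nested. Lists are not expanded in this sketch, which is one example of the kind of nuance that makes flattening worth checking before relying on it.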

Example

from pulse.downloader import download

download(
    filepath="data/rohingya.jsonl",
    sample_size=10000,
    query={
        "query": {
            "bool": {
                "filter": [{
                    "match_phrase": {
                        "norm.body": "Rohingya"
                    }
                }]
            }
        }
    },
    index="pulse-*",
    es_hosts=[
        "https://user:password@dag1.istresearch.com:9200",
        "https://user:password@dag2.istresearch.com:9200",
    ],
)
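
Once a .jsonl file has been written, it can be read back one document at a time, which is where the low memory overhead comes from. The sketch below assumes the usual JSON Lines layout (one JSON object per line). `iter_jsonl` is a hypothetical helper, and the stand-in file and its field values are made up so the example runs without a Pulse cluster.

```python
import json
import os
import tempfile


def iter_jsonl(path):
    """Yield one parsed document per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Stand-in for a downloaded file; the documents are invented for the example.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write('{"doc_id": 1, "norm": {"body": "first"}}\n')
    handle.write('{"doc_id": 2, "norm": {"body": "second"}}\n')

# Only one document is held in memory at a time while iterating.
texts = [doc["norm"]["body"] for doc in iter_jsonl(sample_path)]
```

For a real download, `sample_path` would be replaced with the filepath passed to download, such as "data/rohingya.jsonl".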

build_query

Builds an Elasticsearch query that can be passed to download as the query parameter.

from pulse.downloader import build_query

Options:

  • start_date: Date range start (e.g. 2020-06-14 or 2020-06-14T12:00:00.000Z)
  • end_date: Date range end
  • project_id: Project ID
  • campaign_id: Campaign ID
  • where_exists: A list or tuple containing fields that should exist in each document
  • where_not_exists: A list or tuple containing fields that should not exist in each document
  • include_match: A mapping of fields to match queries. Returns documents that match a provided text, number, date or boolean value. The provided text is analyzed before matching. The match query is the standard query for performing a full-text search, including options for fuzzy matching.
  • exclude_match: A mapping of fields to match queries. Excludes documents that match a provided text, number, date or boolean value.
  • include_terms: A mapping of fields to term queries. Returns documents that contain an exact term in a provided field.
  • exclude_terms: A mapping of fields to term queries. Excludes documents that contain an exact term in a provided field.
  • include_phrase: A mapping of fields to match_phrase queries. The match_phrase query analyzes the text and creates a phrase query out of the analyzed text.
  • exclude_phrase: A mapping of fields to match_phrase queries. Excludes matching documents.
  • doc_type: Pulse document type
  • timestamp_field: Timestamp field to use for start_date and end_date
  • query_string: A prepared query string
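
To give a feel for how options like these can map onto the Elasticsearch query DSL, the sketch below assembles a bool query from a few of them. It is an illustration only and not the output of build_query: the function `sketch_bool_query`, the default timestamp field name, the use of the filter clause, and the inclusive date bounds are all assumptions.

```python
def sketch_bool_query(include_phrase=None, exclude_terms=None, where_exists=(),
                      start_date=None, end_date=None, timestamp_field="timestamp"):
    """Assemble an Elasticsearch bool query from a few options (hypothetical helper)."""
    filters, must_not = [], []
    for field, text in (include_phrase or {}).items():
        filters.append({"match_phrase": {field: text}})
    for field in where_exists:
        filters.append({"exists": {"field": field}})
    for field, value in (exclude_terms or {}).items():
        must_not.append({"term": {field: value}})
    if start_date or end_date:
        # Assumption: both bounds are inclusive; "timestamp" is a placeholder field name.
        bounds = {}
        if start_date:
            bounds["gte"] = start_date
        if end_date:
            bounds["lte"] = end_date
        filters.append({"range": {timestamp_field: bounds}})
    bool_clause = {}
    if filters:
        bool_clause["filter"] = filters
    if must_not:
        bool_clause["must_not"] = must_not
    return {"query": {"bool": bool_clause}}


phrase_query = sketch_bool_query(include_phrase={"norm.body": "Rohingya"})
```

With only include_phrase set, this sketch produces the same hand-written query used in the download example earlier, which suggests the shape of query that download expects.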

Example

from pulse.downloader import build_query, download

query = build_query(
    include_phrase={
        "norm.body": "Rohingya"
    },
)
download(
    filepath="data/rohingya.jsonl",
    sample_size=10000,
    query=query,
    index="pulse-*",
    es_hosts=[
        "https://user:password@dag1.istresearch.com:9200",
        "https://user:password@dag2.istresearch.com:9200",
    ],
)

Development

To deploy a new version, follow the instructions in deploy.sh. Deployment requires access to the deployment credentials stored in LastPass.
