
Pulse Data Extractor


pulse-data-extractor

An easy-button to take data out of Pulse. Downloads Pulse documents from Elasticsearch, and saves them to .jsonl, .json, .pickle, or .csv format.

Installation

To install in an existing environment, run this command:

pip install ist-pulse-data-extractor

To define as a project requirement, add the following line to requirements.txt:

ist-pulse-data-extractor

Usage

download

Performs a multi-process sliced query for documents in a Pulse Elasticsearch index and saves the result in the format specified by the filename extension. Documents can optionally be flattened (use with caution).

from pulse.downloader import download

Required Parameters

  • index: Elasticsearch index

  • query: Elasticsearch query

  • filepath: Output filepath. File extension should match the desired output format. Supported formats include:

    • .jsonl: Fastest to download, suited for large datasets. Lowest memory overhead in downstream processes.
    • .json: Standard, faster to load than .jsonl, but the whole file must be loaded into memory at once, so it is not suitable for large datasets.
    • .pkl: Fastest to load if using result in Python script
    • .csv: Use when consuming data with Excel or pandas. Fields are automatically flattened. If some fields contain data that can't be flattened automatically, a separate post-processing script is recommended.
  • es_hosts: Not required if using ES_URL environment variable. A list of Elasticsearch hosts. Each item should be a fully-qualified URL with authentication if applicable. This overrides values that may exist in configuration. Example:

      https://elastic:password@node1.host.com:9200,https://elastic:password@node2.host.com:9200
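
The example above is written as one comma-separated string, while es_hosts expects a list. A minimal sketch of turning such a string into a list is shown below. The split_hosts helper is hypothetical and not part of the package, and whether your configuration stores hosts in this comma-separated form is an assumption.

```python
def split_hosts(raw):
    """Split a comma-separated string of host URLs into a list.

    Hypothetical helper for illustration; it is not part of
    ist-pulse-data-extractor.
    """
    # Drop surrounding whitespace and ignore empty entries (e.g. a trailing comma).
    return [host.strip() for host in raw.split(",") if host.strip()]


raw_hosts = (
    "https://elastic:password@node1.host.com:9200,"
    "https://elastic:password@node2.host.com:9200"
)
es_hosts = split_hosts(raw_hosts)
```

The resulting es_hosts list can then be passed to download.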
    

Options

  • sample_size: Maximum number of results to return (default=20000).
  • fields: A list of fields to return from Elasticsearch. Limiting the number of fields reduces download time.
  • flatten_doc: Flatten documents. Useful when working with data frames, but has nuances. Use with caution.
  • delimiter: Delimiter to use when flattening fields
  • include_meta_attribs: Only applicable when flattening. When false, all meta.*.attribs fields are discarded.
  • no_flatten: A list of fields that should not be flattened
  • query_slice_size: Maximum number of documents per slice (worker)
  • query_concurrency: Maximum number of queries to run concurrently
  • auto_mkdir: Automatically create output directory if it doesn't exist
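
To make the flattening options more concrete, here is a rough sketch of what flattening a nested document with a delimiter and a no_flatten list could look like. The flatten function below is a hypothetical illustration written for this README, not the package's implementation, and the real behavior (for example, how lists are handled) may differ.

```python
def flatten(doc, delimiter=".", no_flatten=(), _prefix=""):
    """Flatten nested dicts into a single-level dict with joined keys.

    Illustrative only: nested dicts are expanded, while lists and other
    values are kept as-is. Paths listed in no_flatten are left nested.
    """
    flat = {}
    for key, value in doc.items():
        path = f"{_prefix}{delimiter}{key}" if _prefix else key
        if isinstance(value, dict) and value and path not in no_flatten:
            # Recurse into nested dicts, carrying the joined path as the prefix.
            flat.update(flatten(value, delimiter, no_flatten, path))
        else:
            flat[path] = value
    return flat


doc = {
    "norm": {"body": "hello", "author": {"name": "a"}},
    "meta": {"lang": {"results": ["en"]}},
}
flat = flatten(doc, delimiter=".", no_flatten=["meta"])
# flat has the keys "norm.body", "norm.author.name", and an untouched "meta"
```

Values that are lists or mixed types do not map cleanly onto columns, which is one reason flattening should be used with caution.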

Example

from pulse.downloader import download

download(
    filepath="data/rohingya.jsonl",
    sample_size=10000,
    query={
        "query": {
            "bool": {
                "filter": [{
                    "match_phrase": {
                        "norm.body": "Rohingya"
                    }
                }]
            }
        }
    },
    index="pulse-*",
    es_hosts=[
        "https://user:password@dag1.istresearch.com:9200",
        "https://user:password@dag2.istresearch.com:9200",
    ],
)
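
Once a .jsonl file has been written, it can be read back one document at a time, which keeps memory use low. The sketch below uses only the standard library and assumes the file holds one JSON object per line (the usual JSON Lines layout). The iter_jsonl helper is hypothetical and not part of the package.

```python
import json


def iter_jsonl(path):
    """Yield one parsed document per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Usage (assumes the download above has already produced this file):
# for doc in iter_jsonl("data/rohingya.jsonl"):
#     print(doc)
```

Because iter_jsonl is a generator, only one document is held in memory at a time.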

build_query

Builds an Elasticsearch query from the options below. The result can be passed to download as the query parameter.

from pulse.downloader import build_query

Options:

  • start_date: Date range start (e.g. 2020-06-14 or 2020-06-14T12:00:00.000Z)
  • end_date: Date range end
  • project_id: Project ID
  • campaign_id: Campaign ID
  • where_exists: A list or tuple containing fields that should exist in each document
  • where_not_exists: A list or tuple containing fields that should not exist in each document
  • include_match: A mapping of fields to match queries. Returns documents that match a provided text, number, date or boolean value. The provided text is analyzed before matching. The match query is the standard query for performing a full-text search, including options for fuzzy matching.
  • exclude_match: A mapping of fields to match queries. Filters documents that match a provided text, number, date or boolean value.
  • include_terms: A mapping of fields to term queries. Returns documents that contain an exact term in a provided field.
  • exclude_terms: A mapping of fields to term queries. Filters documents that contain an exact term in a provided field.
  • include_phrase: A mapping of fields to match_phrase queries. The match_phrase query analyzes the text and creates a phrase query out of the analyzed text.
  • exclude_phrase: A mapping of fields to match_phrase queries. Excludes matching documents.
  • doc_type: Pulse document type
  • timestamp_field: Timestamp field to use for start_date and end_date
  • query_string: A prepared query string
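
For readers less familiar with the Elasticsearch query DSL, the sketch below shows how a few of these options could map onto a hand-written bool query. It is not the build_query implementation, and the query that build_query actually emits may be structured differently. The sketch_query function and the "timestamp" default field name are assumptions made for illustration.

```python
def sketch_query(start_date=None, end_date=None, timestamp_field="timestamp",
                 where_exists=(), exclude_terms=None):
    """Assemble a bool query by hand using standard Elasticsearch clauses.

    Hypothetical illustration only; not the package's build_query.
    """
    filters = []
    must_not = []

    # Date range on the chosen timestamp field (inclusive bounds assumed).
    date_range = {}
    if start_date:
        date_range["gte"] = start_date
    if end_date:
        date_range["lte"] = end_date
    if date_range:
        filters.append({"range": {timestamp_field: date_range}})

    # Each required field becomes an "exists" filter.
    for field in where_exists:
        filters.append({"exists": {"field": field}})

    # Each excluded exact term becomes a "term" clause under must_not.
    for field, value in (exclude_terms or {}).items():
        must_not.append({"term": {field: value}})

    bool_clause = {}
    if filters:
        bool_clause["filter"] = filters
    if must_not:
        bool_clause["must_not"] = must_not
    return {"query": {"bool": bool_clause}}


query = sketch_query(
    start_date="2020-06-14",
    where_exists=["norm.body"],
    exclude_terms={"doc_type": "example"},  # placeholder value
)
```

Clauses under filter and must_not do not affect relevance scoring, which suits bulk export where result ranking is irrelevant.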

Example

from pulse.downloader import build_query, download

query = build_query(
    include_phrase={
       "norm.body": "Rohingya"
    },
)
download(
    filepath="data/rohingya.jsonl",
    sample_size=10000,
    query=query,
    index="pulse-*",
    es_hosts=[
        "https://user:password@dag1.istresearch.com:9200",
        "https://user:password@dag2.istresearch.com:9200",
    ],
)

Development

To deploy a new version, follow the instructions in deploy.sh. Deployment requires access to the deployment credentials stored in LastPass.
