Pulse Data Extractor
pulse-data-extractor
An easy button to take data out of Pulse. Downloads Pulse documents from Elasticsearch and saves them in .jsonl, .json, .pickle, or .csv format.
Installation
To install in an existing environment, run this command:
pip install ist-pulse-data-extractor
To define as a project requirement, add the following line to requirements.txt:
ist-pulse-data-extractor
Usage
download
Performs a multi-process sliced query for documents in a Pulse Elasticsearch index and saves the result in the format specified by the filename extension. Documents can optionally be flattened (use with caution).
from pulse.downloader import download
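The "sliced query" described above presumably builds on Elasticsearch's sliced scroll feature, where each worker adds a slice clause with its own id and a shared max to the search body. The following is a minimal sketch of how such per-worker bodies could be built. The helper name make_sliced_queries and the way the slice count is derived from sample_size and query_slice_size are assumptions made for illustration, not the package's actual implementation.

```python
import copy
import math


def make_sliced_queries(query, sample_size, query_slice_size):
    """Return one search body per slice (hypothetical helper, not part of pulse.downloader)."""
    # Assumption: use enough slices that each covers at most query_slice_size documents.
    n_slices = max(1, math.ceil(sample_size / query_slice_size))
    if n_slices == 1:
        # Elasticsearch requires slice.max > 1, so a single worker runs the query unsliced.
        return [copy.deepcopy(query)]
    bodies = []
    for slice_id in range(n_slices):
        body = copy.deepcopy(query)
        body["slice"] = {"id": slice_id, "max": n_slices}
        bodies.append(body)
    return bodies


bodies = make_sliced_queries(
    {"query": {"match_all": {}}}, sample_size=10000, query_slice_size=2500
)
print(len(bodies))         # 4
print(bodies[0]["slice"])  # {'id': 0, 'max': 4}
```

Each body could then be handed to a separate worker process, which is likely where query_concurrency comes into play.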
Required Parameters
- index: Elasticsearch index
- query: Elasticsearch query
- filepath: Output filepath. The file extension should match the desired output format. Supported formats include:
  - .jsonl: Fastest to download and suited for large datasets. Lowest memory overhead in downstream processes.
  - .json: Standard format. Faster to load than .jsonl, but not suitable for large datasets because the whole file must be loaded into memory at once.
  - .pkl: Fastest to load if the result is used in a Python script.
  - .csv: For consuming data with Excel or Pandas. Fields are automatically flattened. If some fields contain data that can't be flattened automatically, a separate post-processing script is recommended.
- es_hosts: A list of Elasticsearch hosts. Not required if the ES_URL environment variable is set. Each item should be a fully qualified URL, with authentication if applicable. This overrides any values that exist in configuration. Example: https://elastic:password@node1.host.com:9200,https://elastic:password@node2.host.com:9200
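The memory trade-off between .jsonl and .json shows up when reading the files back. Below is a minimal sketch using only the standard library. It assumes the .jsonl output holds one JSON document per line and the .json output is a single JSON value; the helper names iter_jsonl and load_json are illustrative and not part of the package.

```python
import json


def iter_jsonl(path):
    """Yield one document at a time, so only a single document is held in memory."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def load_json(path):
    """Parse the whole file at once; memory use grows with file size."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `sum(1 for _ in iter_jsonl("data/rohingya.jsonl"))` would count documents without loading the full dataset.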
Options
- sample_size: Maximum number of results to return (default=20000).
- fields: A list of fields to return from Elasticsearch. Limiting the number of fields reduces download time.
- flatten_doc: Flatten documents. Useful when working with data frames, but has nuances. Use with caution.
- delimiter: Delimiter to use when flattening fields
- include_meta_attribs: Only applicable when flattening. When false, all meta.*.attribs fields are discarded.
- no_flatten: A list of fields that should not be flattened
- query_slice_size: Maximum number of documents per slice (worker)
- query_concurrency: Maximum number of queries to run concurrently
- auto_mkdir: Automatically create output directory if it doesn't exist
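To make the flattening options more concrete, here is a rough sketch of what flattening a nested document with a delimiter and a no_flatten list might look like. It is an illustration only and not the package's implementation: the function name flatten, the "." default delimiter, and the convention that no_flatten entries are written with the same delimiter are all assumptions.

```python
def flatten(doc, delimiter=".", no_flatten=(), _prefix=""):
    """Flatten nested dicts into a single-level dict (illustrative only)."""
    flat = {}
    for key, value in doc.items():
        path = f"{_prefix}{delimiter}{key}" if _prefix else key
        if isinstance(value, dict) and value and path not in no_flatten:
            # Recurse into nested objects, carrying the joined path along.
            flat.update(flatten(value, delimiter, no_flatten, path))
        else:
            # Scalars, lists, empty dicts and protected fields are kept as-is.
            flat[path] = value
    return flat


doc = {"norm": {"body": "hello", "author": {"name": "a"}}, "meta": {"tags": ["x", "y"]}}
print(flatten(doc))
# {'norm.body': 'hello', 'norm.author.name': 'a', 'meta.tags': ['x', 'y']}
print(flatten(doc, no_flatten=["norm.author"]))
# {'norm.body': 'hello', 'norm.author': {'name': 'a'}, 'meta.tags': ['x', 'y']}
```

Lists of nested objects are left untouched in this sketch, which is one example of the nuances that make flattening worth using with caution.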
Example
from pulse.downloader import download

download(
    filepath="data/rohingya.jsonl",
    sample_size=10000,
    query={
        "query": {
            "bool": {
                "filter": [{
                    "match_phrase": {
                        "norm.body": "Rohingya"
                    }
                }]
            }
        }
    },
    index='pulse-*',
    es_hosts=[
        "https://user:password@dag1.istresearch.com:9200",
        "https://user:password@dag2.istresearch.com:9200",
    ]
)
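Hardcoding credentials in es_hosts, as in the example above, is easy to get wrong once a script is shared. One possible approach is to assemble the host URLs from environment variables. The sketch below assumes a hypothetical helper build_es_hosts and hypothetical variable names PULSE_ES_USER and PULSE_ES_PASSWORD; none of these are provided by the package.

```python
import os
from urllib.parse import quote


def build_es_hosts(nodes, user=None, password=None, scheme="https", port=9200):
    """Build fully qualified host URLs (hypothetical helper, not part of the package)."""
    auth = ""
    if user and password:
        # Percent-encode credentials so characters like "@" or ":" do not break the URL.
        auth = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    return [f"{scheme}://{auth}{node}:{port}" for node in nodes]


es_hosts = build_es_hosts(
    ["dag1.istresearch.com", "dag2.istresearch.com"],
    user=os.environ.get("PULSE_ES_USER"),          # hypothetical variable name
    password=os.environ.get("PULSE_ES_PASSWORD"),  # hypothetical variable name
)
```

The resulting list can be passed as the es_hosts argument to download.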
build_query
Builds an Elasticsearch query
from pulse.downloader import build_query
Options:
- start_date: Date range start (e.g. 2020-06-14 or 2020-06-14T12:00:00.000Z)
- end_date: Date range end
- project_id: Project ID
- campaign_id: Campaign ID
- where_exists: A list or tuple containing fields that should exist in each document
- where_not_exists: A list or tuple containing fields that should not exist in each document
- include_match: A mapping of fields to match queries. Returns documents that match a provided text, number, date or boolean value. The provided text is analyzed before matching. The match query is the standard query for performing a full-text search, including options for fuzzy matching.
- exclude_match: A mapping of fields to match queries. Excludes documents that match a provided text, number, date or boolean value.
- include_terms: A mapping of fields to term queries. Returns documents that contain an exact term in a provided field.
- exclude_terms: A mapping of fields to term queries. Excludes documents that contain an exact term in a provided field.
- include_phrase: A mapping of fields to match_phrase queries. The match_phrase query analyzes the text and creates a phrase query out of the analyzed text.
- exclude_phrase: A mapping of fields to match_phrase queries. Excludes matching documents.
- doc_type: Pulse document type
- timestamp_field: Timestamp field to use for start_date and end_date
- query_string: A prepared query string
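For readers less familiar with the Elasticsearch query DSL, the sketch below shows how a few of these options could map onto a bool query written by hand. It is not the output of build_query: the function sketch_query, the placement of clauses under filter and must_not, and the "timestamp" default field are assumptions chosen for illustration.

```python
def sketch_query(start_date=None, end_date=None, timestamp_field="timestamp",
                 where_exists=(), where_not_exists=(),
                 include_phrase=None, exclude_terms=None):
    """Assemble a bool query by hand (illustrative; not the output of build_query)."""
    filters, must_not = [], []
    date_range = {}
    if start_date:
        date_range["gte"] = start_date
    if end_date:
        date_range["lte"] = end_date
    if date_range:
        filters.append({"range": {timestamp_field: date_range}})
    for field in where_exists:
        filters.append({"exists": {"field": field}})
    for field in where_not_exists:
        must_not.append({"exists": {"field": field}})
    for field, phrase in (include_phrase or {}).items():
        filters.append({"match_phrase": {field: phrase}})
    for field, term in (exclude_terms or {}).items():
        must_not.append({"term": {field: term}})
    return {"query": {"bool": {"filter": filters, "must_not": must_not}}}


query = sketch_query(
    start_date="2020-06-14",
    where_exists=["norm.body"],
    include_phrase={"norm.body": "Rohingya"},
)
```

A dictionary of this shape can be passed directly as the query argument of download, in the same way as the hand-written query in the download example.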
Example
from pulse.downloader import build_query, download

query = build_query(
    include_phrase={
        "norm.body": "Rohingya"
    },
)

download(
    filepath="data/rohingya.jsonl",
    sample_size=10000,
    query=query,
    index='pulse-*',
    es_hosts=[
        "https://user:password@dag1.istresearch.com:9200",
        "https://user:password@dag2.istresearch.com:9200",
    ]
)
Development
To deploy a new version, follow the instructions in deploy.sh. Requires access to deployment credentials in LastPass.