# Pulse Data Extractor
An easy button for taking data out of Pulse. Downloads Pulse documents from Elasticsearch and saves them in `.jsonl`, `.json`, `.pickle`, or `.csv` format.
## Installation

To install in an existing environment, run:

```shell
pip install ist-pulse-data-extractor
```

To declare it as a project requirement, add the following line to `requirements.txt`:

```
ist-pulse-data-extractor
```
## Usage

### `download`

Performs a multi-process sliced query for documents in a Pulse Elasticsearch index and saves the result in the format implied by the filename extension. Optional flattening of documents is available (use with caution).

```python
from pulse.downloader import download
```
#### Required parameters

- `index`: Elasticsearch index
- `query`: Elasticsearch query
- `filepath`: Output filepath. The file extension should match the desired output format. Supported formats:
  - `.jsonl`: Fastest to download; suited to large datasets, with the lowest memory overhead in downstream processes.
  - `.json`: Standard; faster to load than `.jsonl`, but not suitable for datasets too large to load into memory at once.
  - `.pkl`: Fastest to load when consuming the result from a Python script.
  - `.csv`: For consuming data with Excel or pandas. Fields are automatically flattened; a separate post-processing script is recommended if some fields contain data that can't be flattened automatically.
- `es_hosts`: Not required if the `ES_URL` environment variable is set. A list of Elasticsearch hosts; each item should be a fully qualified URL, with authentication if applicable. This overrides values that may exist in configuration. Example: `https://elastic:password@node1.host.com:9200,https://elastic:password@node2.host.com:9200`
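As an alternative to passing `es_hosts` explicitly, the hosts can be supplied through the `ES_URL` environment variable. A comma-separated list of fully qualified URLs, matching the example format above, is assumed here:

```shell
# Supply Elasticsearch hosts via the environment instead of the es_hosts argument.
# Assumed format: comma-separated fully-qualified URLs, as in the example above.
export ES_URL="https://elastic:password@node1.host.com:9200,https://elastic:password@node2.host.com:9200"
```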
#### Options

- `sample_size`: Maximum number of results to return (default: 20000).
- `fields`: A list of fields to return from Elasticsearch. Limiting the number of fields reduces download time.
- `flatten_doc`: Flatten documents. Useful when working with data frames, but has nuances; use with caution.
- `delimiter`: Delimiter to use when flattening fields.
- `include_meta_attribs`: Only applicable when flattening. When false, all `meta.*.attribs` fields are discarded.
- `no_flatten`: A list of fields that should not be flattened.
- `query_slice_size`: Maximum number of documents per slice (worker).
- `query_concurrency`: Maximum number of queries to run concurrently.
- `auto_mkdir`: Automatically create the output directory if it doesn't exist.
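To make the flattening options concrete, here is a minimal sketch of what delimiter-based flattening does to a nested document. This is a hypothetical illustration, not the library's internal implementation; it assumes `no_flatten` paths are left as-is:

```python
# Hypothetical sketch of delimiter-based document flattening.
# Nested dict keys are joined with the delimiter; paths listed in
# no_flatten are kept as nested objects.
def flatten(doc, delimiter=".", no_flatten=(), prefix=""):
    out = {}
    for key, value in doc.items():
        path = f"{prefix}{delimiter}{key}" if prefix else key
        if isinstance(value, dict) and path not in no_flatten:
            out.update(flatten(value, delimiter, no_flatten, path))
        else:
            out[path] = value
    return out

doc = {"norm": {"body": "Rohingya", "author": "user1"},
       "meta": {"rank": {"attribs": {"score": 3}}}}
print(flatten(doc, no_flatten=("meta.rank",)))
# {'norm.body': 'Rohingya', 'norm.author': 'user1', 'meta.rank': {'attribs': {'score': 3}}}
```

The nuance to watch for: lists and deeply nested objects may not flatten cleanly, which is why `flatten_doc` carries a "use with caution" note.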
#### Example

```python
from pulse.downloader import download

download(
    filepath="data/rohingya.jsonl",
    sample_size=10000,
    query={
        "query": {
            "bool": {
                "filter": [{
                    "match_phrase": {
                        "norm.body": "Rohingya"
                    }
                }]
            }
        }
    },
    index="pulse-*",
    es_hosts=[
        "https://user:password@dag1.istresearch.com:9200",
        "https://user:password@dag2.istresearch.com:9200",
    ]
)
```
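Once a download like the one above finishes, the `.jsonl` output can be consumed one document at a time, which is what keeps its memory overhead low. A minimal reader (a hypothetical helper, not part of this package):

```python
import json

def iter_jsonl(path):
    """Yield one parsed Pulse document per line of a .jsonl file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

# e.g. count documents without loading the whole file into memory:
# n = sum(1 for doc in iter_jsonl("data/rohingya.jsonl"))
```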
### `build_query`

Builds an Elasticsearch query.

```python
from pulse.downloader import build_query
```
#### Options

- `start_date`: Date range start (e.g. `2020-06-14` or `2020-06-14T12:00:00.000Z`)
- `end_date`: Date range end
- `project_id`: Project ID
- `campaign_id`: Campaign ID
- `where_exists`: A list or tuple of fields that must exist in each document
- `where_not_exists`: A list or tuple of fields that must not exist in each document
- `include_match`: A mapping of fields to `match` queries. Returns documents that match a provided text, number, date, or boolean value. The provided text is analyzed before matching. The `match` query is the standard query for full-text search, including options for fuzzy matching.
- `exclude_match`: A mapping of fields to `match` queries. Filters out documents that match a provided text, number, date, or boolean value.
- `include_terms`: A mapping of fields to `term` queries. Returns documents that contain an exact term in a provided field.
- `exclude_terms`: A mapping of fields to `term` queries. Filters out documents that contain an exact term in a provided field.
- `include_phrase`: A mapping of fields to `match_phrase` queries. The `match_phrase` query analyzes the text and creates a phrase query out of the analyzed text.
- `exclude_phrase`: A mapping of fields to `match_phrase` queries. Excludes matching documents.
- `doc_type`: Pulse document type
- `timestamp_field`: Timestamp field to use for `start_date` and `end_date`
- `query_string`: A prepared query string
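For orientation, the date-range and phrase options correspond roughly to a hand-assembled Elasticsearch bool query like the sketch below. This uses standard Elasticsearch query DSL; `build_query`'s exact output may differ, and `norm.timestamp` is an assumed field name here (controlled by `timestamp_field`):

```python
# Illustrative sketch of the kind of bool query these options assemble.
# "norm.timestamp" is an assumption; pass timestamp_field to control it.
query = {
    "query": {
        "bool": {
            "filter": [
                # start_date / end_date become a range filter on the timestamp field
                {"range": {"norm.timestamp": {"gte": "2020-06-01", "lte": "2020-06-14"}}},
                # include_phrase entries become match_phrase filters
                {"match_phrase": {"norm.body": "Rohingya"}},
            ]
        }
    }
}
```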
#### Example

```python
from pulse.downloader import build_query, download

query = build_query(
    include_phrase={
        "norm.body": "Rohingya"
    },
)

download(
    filepath="data/rohingya.jsonl",
    sample_size=10000,
    query=query,
    index="pulse-*",
    es_hosts=[
        "https://user:password@dag1.istresearch.com:9200",
        "https://user:password@dag2.istresearch.com:9200",
    ]
)
```