Python package that makes Elasticsearch aggregations and queries easy.
Project description
What is it?
pandagg is a Python package providing a simple interface to manipulate Elasticsearch queries and aggregations. Its goal is to make it as easy as possible to explore data indexed in an Elasticsearch cluster.
Some of its interactive features are inspired by the pandas library, hence the name pandagg, which aims to apply pandas to Elasticsearch aggregations.
pandagg is also greatly inspired by the official high-level Python client elasticsearch-dsl, and is intended to make it more convenient to deal with deeply nested queries and aggregations.
Why another library
pandagg provides the following features:
- interactive mode for cluster discovery
- richer aggregation syntax and aggregation parsing features
- declarative indices
- bulk ORM operations
- typing annotations
Documentation
Full documentation and a user guide are available on Read the Docs.
Installation
pip install pandagg
Dependencies
Hard dependency: lighttree
Soft dependency: pandas, used to parse aggregation results as a tabular dataframe
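Since pandas is optional, it can help to check whether it is installed before relying on dataframe output. The snippet below is a minimal sketch using only the standard library; `has_module` is a hypothetical helper name and is not part of pandagg.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


# pandas is only needed for tabular output, so check for it up front.
if has_module("pandas"):
    print("pandas found: aggregation results can be parsed as dataframes")
else:
    print("pandas not installed: run `pip install pandas` for dataframe output")
```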
Quick demo
Discover indices on cluster with matching pattern:
>>> from elasticsearch import Elasticsearch
>>> from pandagg.discovery import discover
>>> client = Elasticsearch(hosts=['localhost:9200'])
>>> indices = discover(client, "mov*")
>>> indices
<Indices> ['movies', 'movies_fake']
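To illustrate what a wildcard pattern such as "mov*" selects, here is a rough standalone sketch that filters a list of index names with shell-style matching. It only approximates Elasticsearch's own pattern resolution and does not show how `discover` is implemented; `match_indices` and the extra `books` index are assumptions made for the example.

```python
from fnmatch import fnmatchcase


def match_indices(names, pattern):
    """Return the index names matching a shell-style wildcard pattern."""
    return sorted(name for name in names if fnmatchcase(name, pattern))


# 'movies' and 'movies_fake' come from the demo above; 'books' is made up.
cluster_indices = ["movies", "movies_fake", "books"]
matched = match_indices(cluster_indices, "mov*")
print(matched)
```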
Explore index mappings:
>>> movies = indices.movies
>>> movies.imappings
<Mappings>
_
├── directors [Nested]
│ ├── director_id Keyword
│ ├── first_name Text
│ │ └── raw ~ Keyword
│ ├── full_name Text
│ │ └── raw ~ Keyword
│ ├── genres Keyword
│ └── last_name Text
│ └── raw ~ Keyword
├── genres Keyword
├── movie_id Keyword
├── name Text
...
>>> movies.imappings.roles
<Mappings subpart: roles>
roles [Nested]
├── actor_id Keyword
├── first_name Text
│ └── raw ~ Keyword
├── full_name Text
│ └── raw ~ Keyword
├── gender Keyword
├── last_name Text
│ └── raw ~ Keyword
└── role Keyword
Execute aggregation on field:
>>> movies.imappings.roles.gender.a.terms()
doc_count key
M 2296792 M
F 1135174 F
Build search request:
>>> search = (
...     movies.search()
...     .size(2)
...     .groupby('decade', 'histogram', interval=10, field='year')
...     .groupby('genres', size=3)
...     .agg('avg_rank', 'avg', field='rank')
...     .agg('avg_nb_roles', 'avg', field='nb_roles')
...     .filter('range', year={"gte": 1990})
... )
>>> search.to_dict()
{'aggs': {'decade': {u'aggs': {'genres': {u'aggs': {'avg_nb_roles': {u'avg': {'field': 'nb_roles'}},
'avg_rank': {u'avg': {'field': 'rank'}}},
'terms': {'field': 'genres', 'size': 3}}},
'histogram': {'field': 'year', 'interval': 10}}},
'query': {'bool': {u'filter': [{'range': {'year': {'gte': 1990}}}]}},
'size': 2}
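The request body above shows how each `groupby` call wraps the aggregations declared after it. The following plain-Python sketch reproduces that nesting to make the structure explicit. It is not pandagg's implementation; `build_request` and its parameters are hypothetical, and the `terms` type for `genres` is inferred from the output above.

```python
def build_request(groupbys, metrics, filters, size):
    """Nest bucket aggregations around leaf metrics, outermost groupby first."""
    # Leaf metrics sit under the innermost bucket aggregation.
    aggs = {name: {agg_type: body} for name, agg_type, body in metrics}
    # Wrap from the innermost groupby outwards.
    for name, agg_type, body in reversed(groupbys):
        aggs = {name: {agg_type: body, "aggs": aggs}}
    return {"size": size, "query": {"bool": {"filter": filters}}, "aggs": aggs}


request = build_request(
    groupbys=[
        ("decade", "histogram", {"field": "year", "interval": 10}),
        ("genres", "terms", {"field": "genres", "size": 3}),
    ],
    metrics=[
        ("avg_rank", "avg", {"field": "rank"}),
        ("avg_nb_roles", "avg", {"field": "nb_roles"}),
    ],
    filters=[{"range": {"year": {"gte": 1990}}}],
    size=2,
)
print(request)
```

The resulting dictionary is equal to the `search.to_dict()` output shown above.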
Execute it:
>>> response = search.execute()
>>> response
<Response> took 52ms, success: True, total result >=10000, contains 2 hits
Parse it in tabular format:
>>> response.aggregations.to_dataframe()
avg_nb_roles avg_rank doc_count
decade genres
1990.0 Documentary 3.778982 6.517093 8393
Drama 18.518067 5.981429 12232
Short 3.023284 6.311326 12197
2000.0 Documentary 5.581433 6.980898 8639
Drama 14.385391 6.269675 11500
Short 4.053082 6.836253 13451
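For readers curious about what the tabular parsing involves, the sketch below flattens a nested bucket response into one row per leaf bucket. It assumes the standard Elasticsearch aggregation response layout (`buckets`, `key`, `doc_count`, `value`) and uses a trimmed sample built from three of the rows above. `flatten_buckets` is a hypothetical helper, not the function pandagg uses.

```python
def flatten_buckets(aggs, group_names, metric_names, parent_keys=()):
    """Return (keys, values) pairs, one per innermost bucket."""
    name, remaining = group_names[0], group_names[1:]
    rows = []
    for bucket in aggs[name]["buckets"]:
        keys = parent_keys + (bucket["key"],)
        if remaining:
            # Descend into the next grouping level.
            rows.extend(flatten_buckets(bucket, remaining, metric_names, keys))
        else:
            values = {m: bucket[m]["value"] for m in metric_names}
            values["doc_count"] = bucket["doc_count"]
            rows.append((keys, values))
    return rows


# Trimmed sample: outer buckets omit doc_count, which this sketch does not read.
sample = {"decade": {"buckets": [
    {"key": 1990.0, "genres": {"buckets": [
        {"key": "Documentary", "doc_count": 8393,
         "avg_nb_roles": {"value": 3.778982}, "avg_rank": {"value": 6.517093}},
        {"key": "Drama", "doc_count": 12232,
         "avg_nb_roles": {"value": 18.518067}, "avg_rank": {"value": 5.981429}},
    ]}},
    {"key": 2000.0, "genres": {"buckets": [
        {"key": "Short", "doc_count": 13451,
         "avg_nb_roles": {"value": 4.053082}, "avg_rank": {"value": 6.836253}},
    ]}},
]}}

rows = flatten_buckets(sample, ["decade", "genres"], ["avg_nb_roles", "avg_rank"])
for keys, values in rows:
    print(keys, values)
```

With pandas installed, the key tuples could serve as a `MultiIndex` and the value dictionaries as the rows of a `DataFrame`, which gives a layout similar to the table above.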
Disclaimers
pandagg does not guarantee backward compatibility with earlier versions of Elasticsearch (it is intended to work with >=7). Tagging pandagg versions according to the Elasticsearch versions they support is part of the roadmap (i.e. v7.1.4 would work with Elasticsearch v7.X.X).
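Under the versioning scheme proposed on the roadmap, compatibility would come down to comparing major version numbers. The sketch below illustrates that idea only. The scheme is not in place yet, and `major` and `is_compatible` are hypothetical helpers, not pandagg functions.

```python
def major(version: str) -> int:
    """Extract the major number from a version string such as 'v7.1.4'."""
    return int(version.lstrip("v").split(".")[0])


def is_compatible(pandagg_version: str, es_version: str) -> bool:
    """Assumed rule: a package tagged 7.x.y targets Elasticsearch 7.X.X."""
    return major(pandagg_version) == major(es_version)


print(is_compatible("v7.1.4", "7.10.2"))
print(is_compatible("v7.1.4", "6.8.0"))
```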
Contributing
All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.
Roadmap priorities
- clean and proper documentation
- package versions for different Elasticsearch versions
- onboard new contributors