A Python package that makes Elasticsearch aggregations and queries easy.

What is it?

pandagg is a Python package providing a simple interface to manipulate Elasticsearch queries and aggregations. Its goal is to make exploring data indexed in an Elasticsearch cluster as easy as possible.

Some of its interactive features are inspired by the pandas library, hence the name pandagg, which aims to bring a pandas-like experience to Elasticsearch aggregations.

pandagg is also greatly inspired by the official high level python client elasticsearch-dsl, and is intended to make it more convenient to deal with deeply nested queries and aggregations.

Why another library

pandagg provides the following features:

  • interactive mode for cluster discovery
  • richer aggregations syntax, and aggregations parsing features
  • declarative indices
  • bulk ORM operations
  • typing annotations

Documentation

Full documentation and a user guide are available on Read the Docs.

Installation

pip install pandagg

Dependencies

Hard dependency: lighttree

Soft dependency: pandas, needed only to parse aggregation results into a tabular DataFrame
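
Because pandas is optional, calling code may want to check for it before asking for tabular output. The following is a minimal sketch using only the standard library; the helper name pandas_available is hypothetical and is not part of pandagg.

```python
import importlib.util


def pandas_available() -> bool:
    """Return True when pandas can be imported in the current environment."""
    # find_spec returns None when no importable module of that name exists.
    return importlib.util.find_spec("pandas") is not None


if pandas_available():
    print("pandas found: aggregation results can be parsed as DataFrames")
else:
    print("pandas missing: run `pip install pandas` to enable tabular output")
```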

Quick demo

Discover the indices on a cluster whose names match a pattern:

>>> from elasticsearch import Elasticsearch
>>> from pandagg.discovery import discover
>>> client = Elasticsearch(hosts=['localhost:9200'])


>>> indices = discover(client, "mov*")
>>> indices
<Indices> ['movies', 'movies_fake']
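
To illustrate what the "mov*" pattern selects, here is a local approximation of wildcard matching on index names. It is only a sketch: Elasticsearch resolves index patterns on the server, and this is not how discover is implemented. The helper match_indices and the index name "actors" are hypothetical. Note that fnmatch also accepts "?" and "[...]", which go beyond the "*" wildcard used here.

```python
from fnmatch import fnmatchcase


def match_indices(index_names, pattern):
    """Return the index names matching a shell-style wildcard pattern, sorted."""
    return sorted(name for name in index_names if fnmatchcase(name, pattern))


# "actors" is a made-up index name, added only to show a non-matching case.
names = ["movies_fake", "actors", "movies"]
print(match_indices(names, "mov*"))
```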

Explore index mappings:

>>> movies = indices.movies
>>> movies.imappings
<Mappings>
_
├── directors                                                [Nested]
│   ├── director_id                                           Keyword
│   ├── first_name                                            Text
│   │   └── raw                                             ~ Keyword
│   ├── full_name                                             Text
│   │   └── raw                                             ~ Keyword
│   ├── genres                                                Keyword
│   └── last_name                                             Text
│       └── raw                                             ~ Keyword
├── genres                                                    Keyword
├── movie_id                                                  Keyword
├── name                                                      Text
...
>>> movies.imappings.roles
<Mappings subpart: roles>
roles                                                        [Nested]
├── actor_id                                                  Keyword
├── first_name                                                Text
│   └── raw                                                 ~ Keyword
├── full_name                                                 Text
│   └── raw                                                 ~ Keyword
├── gender                                                    Keyword
├── last_name                                                 Text
│   └── raw                                                 ~ Keyword
└── role                                                      Keyword
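
The trees above are a rendering of the index mapping. As a rough sketch of how such a view can be derived, the function below walks a standard Elasticsearch mapping dictionary and lists every field with its dotted path. The mapping literal is reconstructed by hand from part of the roles tree, under the assumption that the entries marked "~" are multi-fields (stored under "fields" in the mapping). The helper iter_fields is hypothetical and not part of pandagg.

```python
def iter_fields(properties, prefix=""):
    """Yield (dotted_path, type) for every field and multi-field in a mapping."""
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        # Object fields may omit "type"; Elasticsearch treats them as "object".
        yield path, spec.get("type", "object")
        # Multi-fields are declared under the "fields" key.
        for sub_name, sub_spec in spec.get("fields", {}).items():
            yield f"{path}.{sub_name}", sub_spec.get("type", "object")
        # Nested and object fields hold their children under "properties".
        if "properties" in spec:
            yield from iter_fields(spec["properties"], prefix=f"{path}.")


# Hand-written excerpt mirroring part of the roles tree (an assumption).
roles_mapping = {
    "roles": {
        "type": "nested",
        "properties": {
            "actor_id": {"type": "keyword"},
            "first_name": {"type": "text", "fields": {"raw": {"type": "keyword"}}},
            "gender": {"type": "keyword"},
            "role": {"type": "keyword"},
        },
    }
}

fields = dict(iter_fields(roles_mapping))
for path, field_type in fields.items():
    print(f"{path:<24}{field_type}")
```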

Execute an aggregation on a field:

>>> movies.imappings.roles.gender.a.terms()
   doc_count key
M    2296792   M
F    1135174   F
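
For readers who want to see what such a call corresponds to in raw Elasticsearch terms, here is a sketch of a request body for a terms aggregation on a field inside a nested object, plus a helper that reads the buckets back. That pandagg sends exactly this body is an assumption; the helpers nested_terms_body and buckets_to_rows are hypothetical. The sample response is trimmed to the keys the helper reads and reuses the counts shown above.

```python
def nested_terms_body(path, field, size=10):
    """Build a request body for a terms aggregation under a nested path."""
    return {
        "size": 0,  # no hits needed, only aggregation results
        "aggs": {
            path: {
                "nested": {"path": path},
                "aggs": {
                    field: {"terms": {"field": f"{path}.{field}", "size": size}}
                },
            }
        },
    }


def buckets_to_rows(response, path, field):
    """Extract (key, doc_count) pairs from the aggregation response."""
    buckets = response["aggregations"][path][field]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


body = nested_terms_body("roles", "gender")

# Trimmed sample response using the counts from the output above.
sample_response = {
    "aggregations": {
        "roles": {
            "gender": {
                "buckets": [
                    {"key": "M", "doc_count": 2296792},
                    {"key": "F", "doc_count": 1135174},
                ]
            }
        }
    }
}
rows = buckets_to_rows(sample_response, "roles", "gender")
print(rows)
```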

Build search request:

>>> search = (
...     movies.search()
...     .size(2)
...     .groupby('decade', 'histogram', interval=10, field='year')
...     .groupby('genres', size=3)
...     .agg('avg_rank', 'avg', field='rank')
...     .agg('avg_nb_roles', 'avg', field='nb_roles')
...     .filter('range', year={"gte": 1990})
... )

>>> search.to_dict()
{'aggs': {'decade': {'aggs': {'genres': {'aggs': {'avg_nb_roles': {'avg': {'field': 'nb_roles'}},
                                                  'avg_rank': {'avg': {'field': 'rank'}}},
                                         'terms': {'field': 'genres', 'size': 3}}},
                     'histogram': {'field': 'year', 'interval': 10}}},
 'query': {'bool': {'filter': [{'range': {'year': {'gte': 1990}}}]}},
 'size': 2}
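
The nesting in that dictionary follows a simple rule: each groupby wraps everything declared after it, and the agg calls end up side by side at the deepest level. The sketch below reproduces the same body in plain Python to make that rule explicit. It illustrates the resulting structure only, not pandagg's internals, and the helper nest_groupby is hypothetical.

```python
def nest_groupby(groupbys, metrics):
    """Nest group-by aggregations and place metrics at the deepest level.

    Both arguments are lists of (name, aggregation_type, params) tuples.
    """
    # Metric aggregations sit side by side in the innermost "aggs" clause.
    node = {name: {agg_type: params} for name, agg_type, params in metrics}
    # Wrap from the innermost group-by outwards.
    for name, agg_type, params in reversed(groupbys):
        agg = {agg_type: params}
        if node:
            agg["aggs"] = node
        node = {name: agg}
    return node


body = {
    "size": 2,
    "query": {"bool": {"filter": [{"range": {"year": {"gte": 1990}}}]}},
    "aggs": nest_groupby(
        groupbys=[
            ("decade", "histogram", {"field": "year", "interval": 10}),
            ("genres", "terms", {"field": "genres", "size": 3}),
        ],
        metrics=[
            ("avg_rank", "avg", {"field": "rank"}),
            ("avg_nb_roles", "avg", {"field": "nb_roles"}),
        ],
    ),
}
print(body["aggs"])
```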

Execute it:

>>> response = search.execute()
>>> response
<Response> took 52ms, success: True, total result >=10000, contains 2 hits

Parse it in tabular format:

>>> response.aggregations.to_dataframe()
                    avg_nb_roles  avg_rank  doc_count
decade genres
1990.0 Documentary      3.778982  6.517093       8393
       Drama           18.518067  5.981429      12232
       Short            3.023284  6.311326      12197
2000.0 Documentary      5.581433  6.980898       8639
       Drama           14.385391  6.269675      11500
       Short            4.053082  6.836253      13451
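
Conceptually, the tabular view flattens the nested bucket structure of the response: one row per innermost bucket, keyed by the chain of group-by keys leading to it. The following sketch shows that flattening in plain Python, without pandas. It is an illustration under assumptions, not the to_dataframe implementation; the helper flatten_buckets is hypothetical, and the sample aggregations are trimmed to the keys the helper reads, with values copied from two rows of the table above.

```python
def flatten_buckets(aggs, groupby_names, metric_names, _keys=()):
    """Turn nested aggregation buckets into a flat list of row dictionaries."""
    name = groupby_names[len(_keys)]
    rows = []
    for bucket in aggs[name]["buckets"]:
        keys = _keys + (bucket["key"],)
        if len(keys) < len(groupby_names):
            # Descend into the next group-by level.
            rows.extend(flatten_buckets(bucket, groupby_names, metric_names, keys))
        else:
            # Innermost level: emit one row with keys, doc_count and metrics.
            row = dict(zip(groupby_names, keys))
            row["doc_count"] = bucket["doc_count"]
            for metric in metric_names:
                row[metric] = bucket[metric]["value"]
            rows.append(row)
    return rows


# Trimmed sample reusing the 1990 rows from the table above.
sample_aggs = {
    "decade": {
        "buckets": [
            {
                "key": 1990.0,
                "genres": {
                    "buckets": [
                        {
                            "key": "Documentary",
                            "doc_count": 8393,
                            "avg_nb_roles": {"value": 3.778982},
                            "avg_rank": {"value": 6.517093},
                        },
                        {
                            "key": "Drama",
                            "doc_count": 12232,
                            "avg_nb_roles": {"value": 18.518067},
                            "avg_rank": {"value": 5.981429},
                        },
                    ]
                },
            }
        ]
    }
}

rows = flatten_buckets(sample_aggs, ["decade", "genres"], ["avg_nb_roles", "avg_rank"])
for row in rows:
    print(row)
```

With pandas installed, a list of row dictionaries like this one can be passed directly to pandas.DataFrame.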

Disclaimers

pandagg does not guarantee backward compatibility with older versions of Elasticsearch; it is intended to work with Elasticsearch >= 7. It is part of the roadmap to tag pandagg versions according to the Elasticsearch versions they support (e.g. v7.1.4 would work with Elasticsearch v7.X.X).
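
If the planned versioning scheme is adopted, a compatibility check would reduce to comparing major versions. The sketch below shows that comparison. It describes the roadmap convention only, not current behavior; the helper same_major is hypothetical, and "7.10.2" is just an example Elasticsearch version string.

```python
def same_major(package_version: str, es_version: str) -> bool:
    """Return True when both version strings share the same major number."""

    def major(version: str) -> str:
        # Accept both "7.1.4" and "v7.1.4".
        return version.lstrip("v").split(".")[0]

    return major(package_version) == major(es_version)


print(same_major("v7.1.4", "7.10.2"))
print(same_major("0.2.4", "7.10.2"))
```

The second call returns False because the 0.2.4 release predates the planned scheme, so its version number says nothing about the Elasticsearch version it targets.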

Contributing

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.

Roadmap priorities

  • clean and proper documentation
  • package versions for different Elasticsearch versions
  • onboard new contributors
