Pallas – AWS Athena client

Pallas makes querying AWS Athena easy.

We found it valuable for analyses in Jupyter Notebook, but it is designed to be generic and usable in any application.

Features:

  • Friendly interface to AWS Athena.
  • Performance – Large results are downloaded directly from S3, which is much faster than using the Athena API.
  • Pandas integration – Results can be converted to a Pandas DataFrame with correct data types mapped automatically.
  • Local caching – Query results can be cached locally, so no data has to be downloaded again when a Jupyter notebook is restarted.
  • Remote caching – Query IDs can be cached in S3, so teammates can reproduce results without incurring additional costs.
  • Fixes malformed results that Athena returns for DDL queries (for example DESCRIBE).
  • Optional whitespace normalization for better caching.
  • Kills queries on KeyboardInterrupt.
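
To illustrate why whitespace normalization helps caching, here is a minimal sketch. It is not Pallas's actual implementation: `normalize_sql` and `cache_key` are hypothetical helpers, and this naive version also collapses whitespace inside string literals, which a real implementation would likely need to preserve.

```python
import hashlib
import re


def normalize_sql(sql: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends.

    Hypothetical helper: string literals are not treated specially,
    so whitespace inside quoted strings is collapsed as well.
    """
    return re.sub(r"\s+", " ", sql).strip()


def cache_key(sql: str) -> str:
    """Derive a stable cache key from the normalized query text."""
    return hashlib.sha256(normalize_sql(sql).encode("utf-8")).hexdigest()


# Two differently formatted versions of the same query.
a = "SELECT id,\n       name\nFROM   t"
b = "  SELECT id, name FROM t  "

print(normalize_sql(a))
print(cache_key(a) == cache_key(b))
```

Because both variants normalize to the same text, they map to the same key, so reformatting a query in a notebook cell would not force a new execution.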

Installation

Pallas requires Python 3.7 or newer. It can be installed using pip:

pip install --upgrade pallas

Quick start

An Athena client can be obtained using the pallas.setup() function. All arguments are optional.

import pallas
athena = pallas.setup(
    # Athena (AWS Glue) database. Can be overridden in queries.
    database=None,
    # Athena workgroup. Will use default workgroup if omitted.
    workgroup=None,
    # Athena output location. Will use the workgroup default location if omitted.
    output_location="s3://...",
    # AWS region, read from ~/.aws/config if not specified.
    region=None,
    # Query execution cache.
    cache_remote="s3://...",
    # Query result cache.
    cache_local="~/Notebooks/.cache/",
    # Normalize whitespace for better caching. Enabled by default.
    normalize=True,
    # Kill queries on KeyboardInterrupt. Enabled by default.
    kill_on_interrupt=True
)

To avoid hardcoded configuration values, Pallas can be set up using environment variables that correspond to the arguments in the previous example:

export PALLAS_DATABASE=
export PALLAS_WORKGROUP=
export PALLAS_OUTPUT_LOCATION=
export PALLAS_REGION=
export PALLAS_NORMALIZE=true
export PALLAS_KILL_ON_INTERRUPT=true
export PALLAS_CACHE_REMOTE=$PALLAS_OUTPUT_LOCATION
export PALLAS_CACHE_LOCAL=~/Notebooks/.cache/

The client is then created in Python:

athena = pallas.environ_setup()
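
The mapping from environment variables to setup arguments can be sketched as follows. This is an illustration only, not the code behind pallas.environ_setup(): the helper `settings_from_environ`, the treatment of empty values as unset, and the accepted boolean spellings are all assumptions.

```python
from typing import Any, Dict, Mapping

PREFIX = "PALLAS_"
# Option names taken from the pallas.setup() example above.
STRING_OPTIONS = (
    "database", "workgroup", "output_location",
    "region", "cache_remote", "cache_local",
)
BOOLEAN_OPTIONS = ("normalize", "kill_on_interrupt")


def settings_from_environ(environ: Mapping[str, str]) -> Dict[str, Any]:
    """Collect PALLAS_* variables into keyword arguments (hypothetical helper)."""
    settings: Dict[str, Any] = {}
    for option in STRING_OPTIONS + BOOLEAN_OPTIONS:
        value = environ.get(PREFIX + option.upper(), "")
        if value == "":
            # Assumption: an empty variable means "not configured".
            continue
        if option in BOOLEAN_OPTIONS:
            # Assumption: these spellings count as true, anything else as false.
            settings[option] = value.strip().lower() in ("1", "true", "yes")
        else:
            settings[option] = value
    return settings


example = {
    "PALLAS_DATABASE": "analytics",
    "PALLAS_REGION": "",
    "PALLAS_NORMALIZE": "true",
    "HOME": "/home/me",
}
print(settings_from_environ(example))
```

Passing the environment as a plain mapping keeps the helper easy to test; in a real session one would pass `os.environ`.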

Standard Python logging can be used for monitoring:

import logging
import sys
logging.basicConfig(level=logging.INFO, stream=sys.stdout)

Use the Athena.execute() method to execute queries:

sql = """
    SELECT * FROM (
        VALUES (1, 'foo', 3.14), (2, 'bar', NULL)
    ) AS t (id, name, value)
"""
results = athena.execute(sql)

If you rerun the same query, the results should be read from the cache.

Pallas also supports non-blocking query execution:
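
The idea behind a local result cache can be sketched in a few lines. This is not how Pallas stores results internally: the `LocalCache` class, the JSON file format, and the `fake_run` stand-in for a real query are assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable, List


class LocalCache:
    """Hypothetical file cache keyed by a hash of the query text."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory).expanduser()
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, sql: str) -> Path:
        digest = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get_or_run(self, sql: str, run: Callable[[str], List[list]]) -> List[list]:
        """Return cached rows if present, otherwise run the query and store them."""
        path = self._path(sql)
        if path.exists():
            return json.loads(path.read_text())
        rows = run(sql)
        path.write_text(json.dumps(rows))
        return rows


calls = []


def fake_run(sql: str) -> List[list]:
    # Stand-in for an expensive query; records each real execution.
    calls.append(sql)
    return [[1, "foo", 3.14], [2, "bar", None]]


with tempfile.TemporaryDirectory() as tmp:
    cache = LocalCache(tmp)
    first = cache.get_or_run("SELECT 1", fake_run)
    second = cache.get_or_run("SELECT 1", fake_run)

print(first == second, len(calls))
```

The second call is served from disk, so the stand-in query function runs only once.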

query = athena.submit(sql)  # Submit a query and return immediately.
query.join()  # Wait for query completion.
results = query.get_results()  # Retrieve results. Calls query.join() internally.
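
One use of non-blocking execution is to submit several queries before waiting for any of them. The sketch below relies only on the `submit()` and `get_results()` calls shown above; `run_all` is a hypothetical helper, and `StubAthena` and `StubQuery` are offline stand-ins so the example runs without AWS access.

```python
from typing import Any, Dict


def run_all(athena: Any, queries: Dict[str, str]) -> Dict[str, Any]:
    """Submit all queries first, then collect the results by name."""
    # Submitting everything up front lets the queries run concurrently.
    submitted = {name: athena.submit(sql) for name, sql in queries.items()}
    # get_results() waits for each query to complete before returning.
    return {name: query.get_results() for name, query in submitted.items()}


class StubQuery:
    """Stand-in for a submitted query (illustration only)."""

    def __init__(self, sql: str) -> None:
        self.sql = sql

    def get_results(self) -> list:
        return [("result of", self.sql)]


class StubAthena:
    """Stand-in for the Athena client (illustration only)."""

    def __init__(self) -> None:
        self.submitted = []

    def submit(self, sql: str) -> StubQuery:
        self.submitted.append(sql)
        return StubQuery(sql)


stub = StubAthena()
results_by_name = run_all(stub, {"a": "SELECT 1", "b": "SELECT 2"})
print(results_by_name["a"])
```

With a real client, the same helper would be called as `run_all(athena, {...})`.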

The result object provides a list-like interface and can be converted to a Pandas DataFrame:

df = results.to_df()

Development

Pallas can be installed with development dependencies using pip:

$ pip install -e .[dev]

Code is checked with flake8 and Mypy. Tests are run using pytest.

For integration tests to run, access to AWS resources has to be configured:

export PALLAS_TEST_REGION=            # AWS region, can also be specified in ~/.aws/config
export PALLAS_TEST_ATHENA_DATABASE=   # Name of Athena database
export PALLAS_TEST_ATHENA_WORKGROUP=  # Optional
export PALLAS_TEST_S3_TMP=            # s3:// URI
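
Before running the integration tests, it can help to verify this configuration. The following is a sketch, not part of Pallas: `check_test_settings` is a hypothetical helper, and treating only the database and the S3 location as required is an assumption based on the comments above.

```python
from typing import List, Mapping

# Assumption: the workgroup is optional and the region may come from ~/.aws/config.
REQUIRED = ("PALLAS_TEST_ATHENA_DATABASE", "PALLAS_TEST_S3_TMP")


def check_test_settings(environ: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"{name} is not set" for name in REQUIRED if not environ.get(name)]
    s3_tmp = environ.get("PALLAS_TEST_S3_TMP", "")
    if s3_tmp and not s3_tmp.startswith("s3://"):
        problems.append("PALLAS_TEST_S3_TMP must be an s3:// URI")
    return problems


print(check_test_settings({"PALLAS_TEST_ATHENA_DATABASE": "test_db"}))
```

In practice the function would be called with `os.environ`.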

Code checks and testing are automated using tox:

$ tox


