Skip to main content

AWS Athena client

Project description

Pallas – AWS Athena client

Pallas makes querying AWS Athena easy.

We found it valuable for analyses in Jupyter Notebook, but it is designed to be generic and usable in any application.

Features:

  • Friendly interface to AWS Athena.
  • Performance – Large results are downloaded directly from S3, which is much faster than using Athena API.
  • Pandas integration - Results can be converted to Pandas DataFrame with correct data types mapped automatically.
  • Local caching – Query results can be cached locally, so no data have to be downloaded when a Jupyter notebook is restarted.
  • Remote caching – Query IDs can be cached in S3, so team mates can reproduce results without incurring additional costs.
  • Fixes malformed results returned by Athena to DCL (for example DESCRIBE) queries.
  • Optional white space normalization for better caching.
  • Kills queries on KeyboardInterrupt.

Installation

Pallas requires Python 3.7 or newer. It can be installed using pip:

pip install --upgrade pallas

Quick start

Athena client can be obtained using the pallas.setup() method. All arguments are optional.

import pallas
athena = pallas.setup(
    # Athena (AWS Glue) database. Can be overridden in queries.
    database=None,
    # Athena workgroup. Will use default workgroup if omitted.
    workgroup=None,
    # Athena output location, will use workgroup default location if omitted.
    output_location="s3://...",
    # AWS region, read from ~/.aws/config if not specified.
    region=None,
    # Optional query execution cache.
    cache_remote="s3://...",
    # Optional query result cache.
    cache_local="~/Notebooks/.cache/",
    # Normalize white whitespace for better caching. Enabled by default.
    normalize=True,
    # Kill queries on KeybordInterrupt. Enabled by default.
    kill_on_interrupt=True
)

To avoid hardcoded configuration values, Pallas can be setup using environment variables, corresponding to arguments in the previous example:

export PALLAS_DATABASE=
export PALLAS_WORKGROUP=
export PALLAS_OUTPUT_LOCATION=
export PALLAS_REGION=
export PALLAS_NORMALIZE=true
export PALLAS_KILL_ON_INTERRUPT=true
export PALLAS_CACHE_REMOTE=$PALLAS_OUTPUT_LOCATION
export PALLAS_CACHE_LOCAL=~/Notebooks/.cache/
athena = pallas.environ_setup()

Python standard logging is available for monitoring:

import logging
import sys
logging.basicConfig(level=logging.INFO, stream=sys.stdout)

Use the Athena.execute() method to execute queries:

sql = """
    SELECT * FROM (
        VALUES (1, 'foo', 3.14), (2, 'bar', NULL)
    ) AS t (id, name, value)
"""
results = athena.execute(sql)

If you rerun same query, results should be read from cache.

Pallas also support non-blocking query execution:

query = athena.submit(sql)  # Submit a query and return
query.join()  # Wait for query completion.
results = query.get_results()  # Retrieve results. Calls query.join() internally.

The result objects provides a list-like interface and can be converted to a Pandas DataFrame:

df = results.to_df()

Alternatives

PyAthena

PyAthena is a Python DB API 2.0 (PEP 249) compliant client for Amazon Athena. It is integrated with Pandas and SQLAlchemy.

The main difference between Pallas and PyAthena are the interfaces of the libraries. Pallas does not implement the Python DB API. Instead, it adheres to the Athena REST API.

Pallas exposes an object representing a query execution. Thanks to that, it can get back to queries executed in the past and retrieve their results. One client natively supports both blocking and non-blocking execution.

PyAthena advantages:

  • PyAthena is older and more popular.
  • SQLAlchemy integration.
  • Standard Python DB API.
  • More configuration options.

Pallas advantages:

  • Pallas offers more powerful caching. It can cache results locally, and the cache is not limited to last N queries.
  • For better performance, Pallas downloads results directly from S3. PyAthena can also download results from S3, but it reads them using Pandas, failing to convert some data types.
  • Small helpers: smarter polling, query normalization, estimated price in logs, or kill on KeyboardInterrupt.
  • Nicer interface (from Pallas's author point of view).

boto3

boto3 is the official AWS SDK for Python. Pallas uses boto3 internally.

Querying Athena using boto3 directly is complicated and requires a lot of boilerplate code.

Development

Pallas can be installed with development dependencies using pip:

$ pip install -e .[dev]

Code is checked with flake8 and Mypy. Tests are run using pytest.

For integration test to run, access to AWS resources has to be configured:

export PALLAS_TEST_REGION=            # AWS region, can be also specified in ~/.aws/config
export PALLAS_TEST_ATHENA_DATABASE=   # Name of Athena database
export PALLAS_TEST_ATHENA_WORKGROUP=  # Optional
export PALLAS_TEST_S3_TMP=            # s3:// URI

Code checks and testing are automated using tox:

$ tox

Changelog

v0.3 (2020-06-18)

  • Athena and Query classes available from the top-level module (useful for type hints).
  • AthenaQueryError from the top-level module.
  • Fix: SELECT queries cached only when uppercase.
  • Fix: Queries not killed on KeyboardInterrupt.

v0.2 (2020-06-02)

  • Cache SELECT statements only (starting with SELECT or WITH).
  • Preserve empty lines in the middle of normalized queries.

v0.1 (2020-03-24)

  • Initial release.

License

Copyright 2020 Akamai Technologies, Inc

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Contributing

By submitting a contribution (the “Contribution”) to this project,
and for good and valuable consideration, the receipt and sufficiency of which
are hereby acknowledged, you (the “Assignor”) irrevocably convey, transfer,
and assign the Contribution to the owner of the repository (the “Assignee”),
and the Assignee hereby accepts, all of your right, title, and interest in and
to the Contribution along with all associated copyrights, copyright
registrations, and/or applications for registration and all issuances,
extensions and renewals thereof (collectively, the “Assigned Copyrights”).
You also assign all of your rights of any kind whatsoever accruing under
the Assigned Copyrights provided by applicable law of any jurisdiction,
by international treaties and conventions and otherwise throughout the world.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pallas-0.3.tar.gz (39.6 kB view hashes)

Uploaded Source

Built Distribution

pallas-0.3-py3-none-any.whl (40.5 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page