Athena User Defined Functions (UDFs) in Python made easy!

athena-python-udf

This library implements the Athena UDF protocol in Python, so you don't have to use Java, and you can use any Python library you wish, including numpy/pandas!

Installation

Install this library using pip:

pip install athena-python-udf

Usage

  • Install the package
  • Create a Lambda handler Python file containing a subclass of BaseAthenaUDF
  • Implement the handle_athena_record static method with your required functionality like this:
from typing import Any

from athena_udf import BaseAthenaUDF
from pyarrow import Schema


class SimpleVarcharUDF(BaseAthenaUDF):

    @staticmethod
    def handle_athena_record(input_schema: Schema, output_schema: Schema, arguments: list[Any]):
        varchar = arguments[0]
        return varchar.lower()


lambda_handler = SimpleVarcharUDF(use_threads=False).lambda_handler

This basic example takes a varchar input and returns it in lowercase.

  • varchar values are converted to Python strings on the way in and back to varchar on the way out.
  • input_schema contains a PyArrow schema representing the schema of the data being passed in.
  • output_schema contains a PyArrow schema representing the schema of what Athena expects to be returned.
  • arguments contains a list of the arguments given to the function. There can be more than one, and they can have different types.
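To illustrate the last point, here is a minimal sketch of a handler that takes two arguments of different types. The UDF name `repeat_text` and its signature are hypothetical, and the handler is written as a plain function so it runs without the library installed; in a real deployment it would be the `handle_athena_record` static method of a `BaseAthenaUDF` subclass. Treating a SQL NULL as Python `None` is an assumption, not something this README confirms.

```python
from typing import Any, Optional


def handle_athena_record(
    input_schema: Any, output_schema: Any, arguments: list[Any]
) -> Optional[str]:
    # Hypothetical UDF: repeat_text(text varchar, times integer) RETURNS varchar
    text, times = arguments

    # Assumption: a SQL NULL arrives as Python None, so pass it through.
    if text is None or times is None:
        return None

    return text * times


# Local check; the schemas are unused by this handler, so None is passed.
print(handle_athena_record(None, None, ["ab", 3]))  # prints ababab
```

Because the handler only sees plain Python values, it can be unit tested locally like this before being packaged for Lambda.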

Multithreading is enabled by default (the example above turns it off with use_threads=False). You can tune it with the following parameters:

  • chunk_size - forces the received record batch to be split into chunks of the given size, which are then processed one after another. This can be useful if your Lambda calls rate-limited external APIs.

  • max_workers - the standard ThreadPoolExecutor parameter. Leave it unset to keep the default behavior.
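The behavior described by these two parameters can be pictured roughly as follows. This is an illustrative sketch only, not the library's actual implementation: the helper names `chunked` and `process_rows` are made up for this example, and only the standard library is used.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterator, Optional


def chunked(rows: list[Any], chunk_size: int) -> Iterator[list[Any]]:
    # Yield consecutive slices of at most chunk_size rows.
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def process_rows(
    rows: list[Any],
    handler: Callable[[Any], Any],
    chunk_size: Optional[int] = None,
    max_workers: Optional[int] = None,
) -> list[Any]:
    # Without a chunk_size, treat the whole batch as a single chunk.
    size = chunk_size or len(rows) or 1
    results: list[Any] = []
    # Chunks run one after another; rows inside a chunk run on threads.
    for chunk in chunked(rows, size):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map returns results in input order.
            results.extend(pool.map(handler, chunk))
    return results


print(process_rows(["A", "B", "C", "D", "E"], str.lower, chunk_size=2))
# prints ['a', 'b', 'c', 'd', 'e']
```

In this picture, a small chunk_size caps how many rows are in flight at once, which is why it can help with rate-limited APIs.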

If you package the above into a zip with its dependencies and name your Lambda function athena-test, you can then run it from the Athena console like so:

USING EXTERNAL FUNCTION my_udf(col1 varchar) RETURNS varchar LAMBDA 'athena-test'

SELECT my_udf('FooBar');

This yields the result foobar.

See other examples in the examples folder of this repo.

Important information before using

Each Lambda instance will receive multiple requests for the same query, and each request can contain multiple rows. athena-udf handles this for you: your implementation receives a single row at a time.

Athena groups your data into chunks of around 1MB, one chunk per request. The maximum your function can return is 6MB per chunk.

This library uses PyArrow, which is a large library, so the zipped Lambda package will be around 50MB.

Timestamps seem to be truncated into Python date objects, losing the time component.
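One possible workaround, offered as an untested assumption rather than a documented feature, is to pass the timestamp to the UDF as a varchar in ISO 8601 form and parse it inside the handler. The sketch below is a plain function with the same signature as `handle_athena_record`, and the example UDF (returning the hour of a timestamp) is hypothetical.

```python
from datetime import datetime
from typing import Any


def handle_athena_record(
    input_schema: Any, output_schema: Any, arguments: list[Any]
) -> int:
    # Assumption: the query passes the timestamp as an ISO 8601 string,
    # for example "2024-01-02 03:04:05", instead of a timestamp value.
    ts = datetime.fromisoformat(arguments[0])
    return ts.hour


print(handle_athena_record(None, None, ["2024-01-02 03:04:05"]))  # prints 3
```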

Functions can return only one value. To return more complex data structures, consider returning a JSON payload and parsing it in Athena.
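A minimal sketch of the JSON approach, assuming the UDF is declared with RETURNS varchar. The handler is shown as a plain function so it runs standalone, and the fields in the payload are invented for illustration.

```python
import json
from typing import Any


def handle_athena_record(
    input_schema: Any, output_schema: Any, arguments: list[Any]
) -> str:
    text = arguments[0]
    # Bundle several derived values into one JSON string, since the
    # UDF can hand back only a single value per row.
    stats = {
        "length": len(text),
        "upper": text.upper(),
        "words": text.split(),
    }
    return json.dumps(stats)


payload = handle_athena_record(None, None, ["Foo Bar"])
print(json.loads(payload)["length"])  # prints 7
```

On the Athena side, a function such as json_extract_scalar could then be used to pull individual fields out of the returned string.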

Development

To contribute to this library, first check out the code. Then create a new virtual environment with all required dependencies and activate it:

poetry install
source .venv/bin/activate

To run the tests:

pytest
