Athena User Defined Functions (UDFs) made easy!

athena-udf

Athena User Defined Functions (UDFs) in Python made easy!

This library implements the Athena UDF protocol in Python, so you don't have to use Java and can use any Python library you wish, including numpy and pandas!

Installation

Install this library using pip:

pip install athena-udf

Usage

Simply install the package, create a Lambda handler Python file, subclass BaseAthenaUDF, and implement the handle_athena_record static method with your required functionality, like this:

import athena_udf


class SimpleVarcharUDF(athena_udf.BaseAthenaUDF):

    @staticmethod
    def handle_athena_record(input_schema, output_schema, arguments):
        varchar = arguments[0]
        return varchar.lower()


lambda_handler = SimpleVarcharUDF().lambda_handler

This very basic example takes a varchar input, and returns the lowercase version.

varchar is converted to a Python string on the way in and on the way out.

input_schema contains a PyArrow schema describing the data being passed in.

output_schema contains a PyArrow schema describing what Athena expects to be returned.

arguments contains the list of arguments given to the function. There can be more than one, and they can have different types.
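To illustrate a function that takes more than one argument, here is a minimal sketch of a UDF that repeats a varchar a given number of times. RepeatVarcharUDF is a hypothetical name, and the sketch assumes that arguments arrive in call order, that an integer column arrives as a Python int, and that a SQL NULL arrives as None; none of these is confirmed by this README. The try/except only provides a stand-in base class so the sketch runs where the package is not installed.

```python
try:
    import athena_udf
    BaseUDF = athena_udf.BaseAthenaUDF
except ImportError:
    # Stand-in so the sketch runs without the package installed.
    # It is not part of athena-udf.
    class BaseUDF:
        pass


class RepeatVarcharUDF(BaseUDF):
    """Hypothetical UDF: repeat a varchar a given number of times."""

    @staticmethod
    def handle_athena_record(input_schema, output_schema, arguments):
        # Assumption: arguments arrive in the order used in the SQL call.
        text, count = arguments
        # Assumption: a SQL NULL arrives as None; pass it straight through.
        if text is None or count is None:
            return None
        return text * int(count)
```

The handler would be exported the same way as in the example above, by assigning the lambda_handler of an instance to a module-level name.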

If you package the above into a zip with its dependencies and name your Lambda function athena-test, you can then run it from the Athena console like so:

USING EXTERNAL FUNCTION my_udf(col1 varchar) RETURNS varchar LAMBDA 'athena-test'

SELECT my_udf('FooBar');

This yields the result foobar.
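Before packaging and deploying, the per-row logic can be checked locally by calling the static method directly. This is only a sketch: it passes None for both schemas, which works here because the example ignores them, and it does not exercise the Athena protocol or the Lambda handler. The try/except provides a stand-in base class for machines where the package is not installed.

```python
try:
    import athena_udf
    BaseUDF = athena_udf.BaseAthenaUDF
except ImportError:
    # Stand-in so the sketch runs without the package installed.
    # It is not part of athena-udf.
    class BaseUDF:
        pass


class SimpleVarcharUDF(BaseUDF):

    @staticmethod
    def handle_athena_record(input_schema, output_schema, arguments):
        varchar = arguments[0]
        return varchar.lower()


# Call the row handler directly with the same input as the SQL example.
result = SimpleVarcharUDF.handle_athena_record(None, None, ["FooBar"])
print(result)  # foobar
```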

See other examples in the examples folder of this repo.

Important information before using.

Each Lambda instance will take multiple requests for the same query. Each request can contain multiple rows; athena-udf handles this for you, and your implementation receives a single row at a time.

Athena will group your data into chunks of around 1MB per request. The maximum your function can return is 6MB per chunk.

This library uses PyArrow. This is a large library so the Lambdas will be around 50MB zipped.

Timestamps seem to be truncated into Python date objects missing the time.

Functions can return one value only. To return more complex data structures, consider returning a JSON payload and parsing it in Athena.
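As a rough sketch of the JSON approach, the UDF below packs two fields into a single varchar. SplitNameUDF and its field names are hypothetical, and the function is assumed to be declared with RETURNS varchar. The try/except provides a stand-in base class for machines where the package is not installed.

```python
import json

try:
    import athena_udf
    BaseUDF = athena_udf.BaseAthenaUDF
except ImportError:
    # Stand-in so the sketch runs without the package installed.
    # It is not part of athena-udf.
    class BaseUDF:
        pass


class SplitNameUDF(BaseUDF):
    """Hypothetical UDF: return two fields packed into one JSON string."""

    @staticmethod
    def handle_athena_record(input_schema, output_schema, arguments):
        full_name = arguments[0]
        # Split on the first space; "last" is empty if there is no space.
        first, _, last = full_name.partition(" ")
        return json.dumps({"first": first, "last": last})
```

On the Athena side, a function such as json_extract_scalar could then pull individual fields out of the returned string, for example with the path '$.first'.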

Development

To contribute to this library, first check out the code. Then create a new virtual environment:

cd athena-udf
python -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest
