athena-udf
Athena User Defined Functions (UDFs) in Python made easy!
This library implements the Athena UDF protocol in Python, so you don't have to use Java and you can use any Python library you wish, including numpy/pandas!
Installation
Install this library using pip:
pip install athena-udf
Usage
Simply install the package, create a Lambda handler Python file, subclass BaseAthenaUDF and implement the handle_athena_record static method with your required functionality, like this:
import athena_udf

class SimpleVarcharUDF(athena_udf.BaseAthenaUDF):

    @staticmethod
    def handle_athena_record(input_schema, output_schema, arguments):
        varchar = arguments[0]
        return varchar.lower()

lambda_handler = SimpleVarcharUDF().lambda_handler
This very basic example takes a varchar input and returns the lowercase version.
- varchar is converted to a Python string on the way in and on the way out.
- input_schema contains a PyArrow schema representing the schema of the data being passed in.
- output_schema contains a PyArrow schema representing the schema of what Athena expects to be returned.
- arguments contains a list of the arguments given to the function. There can be more than one, with different types.
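Because arguments is a plain list, a UDF that takes more than one column can simply unpack it. The following is a minimal sketch, not part of the library: the two-argument function and its NULL handling are hypothetical, and it assumes that an Athena integer arrives as a Python int and that a SQL NULL arrives as None. It is written as a standalone function so it can be tried locally; in a deployment the same body would go in the handle_athena_record static method of a BaseAthenaUDF subclass.

```python
def handle_athena_record(input_schema, output_schema, arguments):
    """Hypothetical two-argument UDF: repeat a varchar `count` times."""
    # Arguments are assumed to arrive in the order declared in the query.
    text, count = arguments
    # Assumption: SQL NULLs arrive as None; pass them straight through.
    if text is None or count is None:
        return None
    return text * count


# Local check. The schemas are not used by this function, so None
# stands in for the PyArrow schemas the library would normally pass.
print(handle_athena_record(None, None, ["ab", 3]))  # ababab
```

Keeping the per-row logic free of Lambda-specific state like this makes it easy to unit test before packaging.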
If you package the above into a zip with its dependencies and name your Lambda function my-lambda, you can then run it from the Athena console like so:
USING EXTERNAL FUNCTION my_udf(col1 varchar) RETURNS varchar LAMBDA 'my-lambda'
SELECT my_udf('FooBar');
This will yield the result foobar.
See other examples in the examples folder of this repo.
Important information before using.
- Each Lambda instance will handle multiple requests for the same query. Each request can contain multiple rows; athena-udf handles this for you, and your implementation receives a single row at a time.
- Athena groups your data into chunks of around 1MB per request. The maximum your function can return is 6MB per chunk.
- This library uses PyArrow. This is a large library, so the Lambdas will be around 50MB zipped.
- Timestamps seem to be truncated into Python date objects, missing the time.
- Functions can return one value only. To return more complex data structures, consider returning a JSON payload and parsing it in Athena.
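As an illustration of the JSON approach, here is a minimal sketch of a handler that packs two values into a single varchar. The field names lower and length are hypothetical, and the function is standalone so it can be run locally; in a deployment the body would go in the handle_athena_record static method of a BaseAthenaUDF subclass.

```python
import json


def handle_athena_record(input_schema, output_schema, arguments):
    """Hypothetical UDF returning several values as one JSON varchar."""
    text = arguments[0]
    # Assumption: SQL NULLs arrive as None; return NULL for them.
    if text is None:
        return None
    result = {"lower": text.lower(), "length": len(text)}
    # sort_keys keeps the key order stable across invocations.
    return json.dumps(result, sort_keys=True)


# Local check. The schemas are not used here, so None stands in for them.
print(handle_athena_record(None, None, ["FooBar"]))
# {"length": 6, "lower": "foobar"}
```

On the Athena side, the individual fields could then be pulled out with the built-in JSON functions, for example json_extract_scalar(my_udf(col1), '$.lower'), assuming the UDF is declared as returning varchar.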
Development
To contribute to this library, first check out the code. Then create a new virtual environment:
cd athena-udf
python -m venv venv
source venv/bin/activate
Now install the dependencies and test dependencies:
pip install -e '.[test]'
To run the tests:
pytest