Databend UDF Server

Project description

Usage

1. Define your functions in a Python file

import time
from typing import List

from databend_udf import udf, UDFServer

# Define a function
@udf(input_types=["VARCHAR", "VARCHAR", "VARCHAR"], result_type="VARCHAR")
def split_and_join(s: str, split_s: str, join_s: str) -> str:
    return join_s.join(s.split(split_s))

# Define a function that accepts nullable values. With skip_null=True,
# the function is not called when any argument is NULL; the result is NULL.
@udf(
    input_types=["INT", "INT"],
    result_type="INT",
    skip_null=True,
)
def gcd(x: int, y: int) -> int:
    while y != 0:
        (x, y) = (y, x % y)
    return x

# Define a function that accepts nullable values. With skip_null=False,
# NULL values are passed in as None and handled inside the function.
@udf(
    input_types=["ARRAY(INT64 NULL)", "INT64"],
    result_type="INT NOT NULL",
    skip_null=False,
)
def array_index_of(array: List[int], item: int) -> int:
    if array is None:
        return 0

    try:
        return array.index(item) + 1
    except ValueError:
        return 0

# Define an I/O-bound function, and set io_threads so that calls
# within a batch can run concurrently.
@udf(input_types=["INT"], result_type="INT", io_threads=32)
def wait_concurrent(x):
    # assume an I/O operation costs 2s
    time.sleep(2)
    return x

# Define a table-valued function that emits multiple columns per row.
@udf(
    input_types=["INT", "INT"],
    result_type=[("left", "INT"), ("right", "INT"), ("sum", "INT")],
    batch_mode=True,
)
def expand_pairs(left: List[int], right: List[int]):
    if len(left) != len(right):
        raise ValueError("Inputs must have the same length")
    return [
        {"left": l, "right": r, "sum": l + r} for l, r in zip(left, right)
    ]

if __name__ == '__main__':
    # create a UDF server listening at '0.0.0.0:8815'
    server = UDFServer("0.0.0.0:8815")
    # add defined functions
    server.add_function(split_and_join)
    server.add_function(gcd)
    server.add_function(array_index_of)
    server.add_function(wait_concurrent)
    server.add_function(expand_pairs)
    # start the UDF server
    server.serve()
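
The io_threads option matters because I/O-bound calls can overlap. As an illustration only (this is not the library's internal implementation), the effect is similar to running each call through a thread pool:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def wait_concurrent(x):
    # stand-in for the 2s I/O operation, shortened for the demo
    time.sleep(0.1)
    return x

start = time.monotonic()
with ThreadPoolExecutor(max_workers=32) as pool:
    # 32 sleeps run concurrently instead of taking 32 * 0.1s sequentially
    results = list(pool.map(wait_concurrent, range(32)))
elapsed = time.monotonic() - start
```

With 32 workers the whole batch finishes in roughly one sleep's worth of wall-clock time rather than thirty-two.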

@udf is a decorator for creating a UDF. It supports the following parameters:

  • input_types: A list of strings or Arrow data types that specifies the input data types.
  • result_type: A string or an Arrow data type that specifies the return value type.
  • name: An optional string specifying the function name. If not provided, the original name will be used.
  • io_threads: Number of I/O threads used per data chunk for I/O bound functions.
  • skip_null: A boolean specifying whether to skip NULL values. If set to True, NULL values are not passed to the function, and the corresponding return value is set to NULL. Defaults to False.
  • batch_mode: A boolean specifying whether the function is called once per batch with whole columns passed as Python lists (as in the expand_pairs example above), rather than once per row.

2. Start the UDF Server

Start the UDF server by running:

python3 udf_server.py

3. Update Databend query node config

The UDF server is disabled by default in Databend. You can enable it by setting enable_udf_server = true in the query node config.

In addition, for security reasons, Databend can only access the addresses specified in the config. The list of allowed UDF server addresses is specified through the udf_server_allow_list setting in the query node config.

Here is an example config:

[query]
...
enable_udf_server = true
udf_server_allow_list = [ "http://0.0.0.0:8815", "http://example.com" ]

4. Add the functions to Databend

Use the CREATE FUNCTION command to register the functions you defined with Databend:

CREATE FUNCTION [IF NOT EXISTS] <udf_name> (<arg_type>, ...) RETURNS <return_type> LANGUAGE <language> HANDLER=<handler> ADDRESS=<udf_server_address>

The udf_name is the name of the UDF as declared in Databend. The handler is the name of the function you defined in the Python UDF server.

For example:

CREATE FUNCTION split_and_join (VARCHAR, VARCHAR, VARCHAR) RETURNS VARCHAR LANGUAGE python HANDLER = 'split_and_join' ADDRESS = 'http://0.0.0.0:8815';

NOTE: The udf_server_address you specify must appear in udf_server_allow_list explained in the previous step.

In step 2, when you start the UDF server, the corresponding SQL statement for each function is printed out. You can use these statements directly.

5. Use the functions in Databend

mysql> select split_and_join('3,5,7', ',', ':');
+-----------------------------------+
| split_and_join('3,5,7', ',', ':') |
+-----------------------------------+
| 3:5:7                             |
+-----------------------------------+

Data Types

The data types supported by the Python UDF API and their corresponding Python types are as follows:

SQL Type             Python Type
BOOLEAN              bool
TINYINT (UNSIGNED)   int
SMALLINT (UNSIGNED)  int
INT (UNSIGNED)       int
BIGINT (UNSIGNED)    int
FLOAT                float
DOUBLE               float
DECIMAL              decimal.Decimal
DATE                 datetime.date
TIMESTAMP            datetime.datetime
VARCHAR              str
BINARY               bytes
VARIANT              any
MAP(K,V)             dict
ARRAY(T)             list[T]
TUPLE(T...)          tuple(T...)

NULL in SQL is represented by None in Python.
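
For reference, here are example Python values corresponding to several of the SQL types in the table above (purely illustrative):

```python
import datetime
import decimal

# Example Python values for SQL types, per the mapping table above
examples = {
    "BOOLEAN": True,                             # bool
    "INT": 42,                                   # int
    "DOUBLE": 3.14,                              # float
    "DECIMAL": decimal.Decimal("1.23"),          # decimal.Decimal
    "DATE": datetime.date(2024, 1, 1),           # datetime.date
    "TIMESTAMP": datetime.datetime(2024, 1, 1),  # datetime.datetime
    "VARCHAR": "hello",                          # str
    "BINARY": b"\x01\x02",                       # bytes
    "MAP(INT, VARCHAR)": {1: "one"},             # dict
    "ARRAY(INT)": [1, 2, 3],                     # list
    "TUPLE(INT, VARCHAR)": (1, "a"),             # tuple
    "INT NULL": None,                            # SQL NULL maps to None
}
```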

Databend UDF Server Tests

# start the UDF server
python3 examples/server.py
# run the sqllogic tests against it
./target/debug/databend-sqllogictests --run_dir udf_server

Acknowledgement

Databend Python UDF Server API is inspired by RisingWave Python API.

Code Formatting

Use Ruff to keep the Python sources consistent:

python -m pip install ruff  # once
python -m ruff format python/databend_udf python/tests
