Skip to main content

A user-defined function framework for Apache Arrow

Project description

Arrow UDF Python Server

Installation

pip install arrow-udf

Usage

Define functions in a Python file:

# udf.py
from arrow_udf import udf, udtf, UdfServer
import struct
import socket

# Define a scalar function
@udf(input_types=['INT', 'INT'], result_type='INT')
def gcd(x, y):
    while y != 0:
        (x, y) = (y, x % y)
    return x

# Define a scalar function that returns multiple values (within a struct)
@udf(input_types=['BINARY'], result_type='STRUCT<src_addr: STRING, dst_addr: STRING, src_port: INT16, dst_port: INT16>')
def extract_tcp_info(tcp_packet: bytes):
    src_addr, dst_addr = struct.unpack('!4s4s', tcp_packet[12:20])
    src_port, dst_port = struct.unpack('!HH', tcp_packet[20:24])
    src_addr = socket.inet_ntoa(src_addr)
    dst_addr = socket.inet_ntoa(dst_addr)
    return {
        'src_addr': src_addr,
        'dst_addr': dst_addr,
        'src_port': src_port,
        'dst_port': dst_port,
    }

# Define a table function
@udtf(input_types='INT', result_types='INT')
def series(n):
    for i in range(n):
        yield i

# Start a UDF server
if __name__ == '__main__':
    server = UdfServer(location="0.0.0.0:8815")
    server.add_function(gcd)
    server.add_function(extract_tcp_info)
    server.add_function(series)
    server.serve()

Start the UDF server:

python3 udf.py

Data Types

Arrow Type Python Type
boolean bool
int8 int
int16 int
int32 int
int64 int
uint8 int
uint16 int
uint32 int
uint64 int
float32 float
float32 float
date32 datetime.date
time64 datetime.time
timestamp datetime.datetime
interval MonthDayNano / (int, int, int) (fields can be obtained by months(), days() and nanoseconds() from MonthDayNano)
string str
binary bytes
large_string str
large_binary bytes

Extension types:

Data type Metadata Python Type
decimal arrowudf.decimal decimal.Decimal
json arrowudf.json any

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arrow_udf-0.2.1.tar.gz (10.0 kB view hashes)

Uploaded Source

Built Distribution

arrow_udf-0.2.1-py3-none-any.whl (10.6 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page