datafusion

Build and run queries against data

These details have been verified by PyPI

Maintainers

alamb andygrove etareduce jdye64 jorgecarleitao kou kszucs wesm xhochy

Project description

Datafusion with Python

This is a Python library that binds to DataFusion, Apache Arrow's in-memory query engine written in Rust. It lets you build a logical plan through a DataFrame API against Parquet or CSV files and collect the result.

Because it is written in Rust, this code comes with strong guarantees about thread safety and memory safety.

We hold the GIL to convert the results back to pyarrow arrays and to run UDFs.

How to use it

Simple usage:

import datafusion
import pyarrow

# an alias
f = datafusion.functions

# create a context
ctx = datafusion.ExecutionContext()

# create a RecordBatch and a new DataFrame from it
batch = pyarrow.RecordBatch.from_arrays(
    [pyarrow.array([1, 2, 3]), pyarrow.array([4, 5, 6])],
    names=["a", "b"],
)
df = ctx.create_dataframe([[batch]])

# create a new statement
df = df.select(
    f.col("a") + f.col("b"),
    f.col("a") - f.col("b"),
)

# execute and collect the first (and only) batch
result = df.collect()[0]

assert result.column(0) == pyarrow.array([5, 7, 9])
assert result.column(1) == pyarrow.array([-3, -3, -3])

UDFs

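# `f` is the datafusion.functions alias defined above
def is_null(array: pyarrow.Array) -> pyarrow.Array:
    return array.is_null()

# wrap the function: argument types first, then the return type
udf = f.udf(is_null, [pyarrow.int64()], pyarrow.bool_())

# `df` must expose a column named "a"
df = df.select(udf(f.col("a")))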

UDAFs

from typing import List

import datafusion
import pyarrow
import pyarrow.compute

f = datafusion.functions


class Accumulator:
    """
    Interface of a user-defined accumulation.
    """
    def __init__(self):
        self._sum = pyarrow.scalar(0.0)

    def to_scalars(self) -> List[pyarrow.Scalar]:
        return [self._sum]

    def update(self, values: pyarrow.Array) -> None:
        # not nice since pyarrow scalars can't be summed yet. This breaks on `None`
        self._sum = pyarrow.scalar(self._sum.as_py() + pyarrow.compute.sum(values).as_py())

    def merge(self, states: pyarrow.Array) -> None:
        # not nice since pyarrow scalars can't be summed yet. This breaks on `None`
        self._sum = pyarrow.scalar(self._sum.as_py() + pyarrow.compute.sum(states).as_py())

    def evaluate(self) -> pyarrow.Scalar:
        return self._sum


df = ...

udaf = f.udaf(Accumulator, pyarrow.float64(), pyarrow.float64(), [pyarrow.float64()])

df = df.aggregate(
    [],
    [udaf(f.col("a"))]
)
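To make the roles of update, merge, to_scalars and evaluate easier to follow, here is a plain-Python sketch of how such an accumulator could be driven over several partitions. It is an illustration under assumptions, not DataFusion's actual execution path: SumAccumulator and run_partitioned are hypothetical names, lists stand in for pyarrow arrays, and None entries are skipped rather than breaking the sum.

```python
from typing import List, Optional, Sequence


class SumAccumulator:
    """Plain-Python stand-in for the accumulator above (no pyarrow)."""

    def __init__(self) -> None:
        self._sum = 0.0

    def to_scalars(self) -> List[float]:
        # intermediate state that would be passed between partitions
        return [self._sum]

    def update(self, values: Sequence[Optional[float]]) -> None:
        # fold raw input values into the state, skipping None entries
        self._sum += sum(v for v in values if v is not None)

    def merge(self, states: Sequence[Optional[float]]) -> None:
        # fold the partial states of other accumulators into this one
        self._sum += sum(s for s in states if s is not None)

    def evaluate(self) -> float:
        return self._sum


def run_partitioned(partitions: Sequence[Sequence[Optional[float]]]) -> float:
    """Hypothetical driver: one accumulator per partition, then a final merge."""
    partials = []
    for part in partitions:
        acc = SumAccumulator()
        acc.update(part)
        partials.append(acc.to_scalars())

    final = SumAccumulator()
    # the single state field of every partial accumulator is merged as a batch
    final.merge([state[0] for state in partials])
    return final.evaluate()


total = run_partitioned([[1.0, 2.0], [None, 3.0], []])
print(total)
```

In this sketch update only ever sees raw values and merge only ever sees states produced by to_scalars, which is why the two methods are kept separate even though both add to the running sum.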

How to install

pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple datafusion

We have not yet configured CI/CD to publish wheels to PyPI, so for now you can only install it in development mode. This requires cargo and Rust; see below.
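After installing, one quick way to confirm that the module is visible to the active interpreter is to ask importlib whether it can locate it. This is a minimal sketch using only the standard library; the helper name is_installed is illustrative and not part of datafusion.

```python
import importlib.util


def is_installed(name: str) -> bool:
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("datafusion importable:", is_installed("datafusion"))
```

The check only locates the module and does not import it, so it will not catch a build that is present but fails to load.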

How to develop

This assumes that you have Rust and cargo installed. We use the workflow recommended by pyo3 and maturin.

Bootstrap:

# fetch this repo
git clone git@github.com:jorgecarleitao/datafusion-python.git

cd datafusion-python

# prepare development environment (used to build wheel / install in development)
python -m venv venv
venv/bin/pip install maturin==0.8.2 toml==0.10.1

# used for testing
venv/bin/pip install pyarrow==1.0.0

Whenever the Rust code changes (through your own edits or a git pull):

venv/bin/maturin develop
venv/bin/python -m unittest discover tests

Project details

Release history

41.0.0

Sep 16, 2024

40.1.0

Aug 20, 2024

39.0.0

Jul 2, 2024

38.0.1

May 30, 2024

37.1.0

May 13, 2024

36.0.0

Mar 10, 2024

35.0.0

Feb 4, 2024

34.0.0

Jan 3, 2024

33.0.0

Nov 19, 2023

32.0.0

Oct 25, 2023

31.0.0

Sep 18, 2023

28.0.0

Aug 6, 2023

27.0.0

Jul 8, 2023

26.0.0

Jun 15, 2023

25.0.0

May 29, 2023

24.0.0

May 13, 2023

23.0.0

Apr 28, 2023

22.0.0

Apr 14, 2023

21.0.0

Apr 5, 2023

20.0.0

Mar 20, 2023

0.8.0

Feb 25, 2023

0.7.0

Nov 29, 2022

0.6.0

Jun 5, 2022

0.5.2

Apr 4, 2022

0.5.1

Mar 15, 2022

0.5.0

Mar 10, 2022

0.4.0

Nov 17, 2021

0.2.0

Dec 10, 2020

This version

0.1.2

Nov 18, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datafusion-0.1.2.tar.gz (20.2 kB)

Uploaded Nov 18, 2020 Source

Built Distributions

datafusion-0.1.2-cp38-cp38-macosx_10_7_x86_64.whl (3.3 MB)

Uploaded Nov 18, 2020 CPython 3.8 macOS 10.7+ x86-64

datafusion-0.1.2-cp37-cp37m-macosx_10_7_x86_64.whl (3.3 MB)

Uploaded Nov 18, 2020 CPython 3.7m macOS 10.7+ x86-64

Hashes for datafusion-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`5520180d8931edb24e2f98b6e00de159b39e247b1d131e7ff100502de4adab1d`
MD5	`e3f7f272a8099a4dde8a2e51f9472fb6`
BLAKE2b-256	`c655c1cf2b8ba59a48e26790bfecfa03e68ecb1231022fc139ede8183a31d9a7`

Hashes for datafusion-0.1.2-cp38-cp38-macosx_10_7_x86_64.whl
Algorithm	Hash digest
SHA256	`d38cc28167fc540bec5e8e7ef679e544820eaf42b20db422f0cd60343199d2be`
MD5	`33085a49e46736ada8d9c955c1200951`
BLAKE2b-256	`90e2971fe2b6f79189beee1253d16d370c89a21c461b2d4552e268b2f399ea10`

Hashes for datafusion-0.1.2-cp37-cp37m-macosx_10_7_x86_64.whl
Algorithm	Hash digest
SHA256	`c25886aade9f6973e6f9bb8dcdcb233e56b48f0007ff0b64c448d9fb8fb7f47c`
MD5	`e62706fcdffd2b582bf33ba8378901fa`
BLAKE2b-256	`16129e42c54c5f9071414efd3f86d35880d58abc3507be6c6ac2daedd51b1ad6`