DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux (x86-64, glibc 2.35+) and macOS 15+ (ARM64)
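
In automated environments it can help to check the interpreter version before attempting an install. The following is a minimal sketch of such a check; is_supported is a hypothetical helper written for this illustration and is not part of chalkdf.

```python
import sys


def is_supported(version=None):
    """Return True if the (major, minor) version falls in the 3.10-3.13 range.

    `version` is any tuple-like value such as sys.version_info; when omitted,
    the running interpreter is checked. Hypothetical helper, not a chalkdf API.
    """
    major, minor = tuple(version or sys.version_info)[:2]
    return (3, 10) <= (major, minor) <= (3, 13)


if __name__ == "__main__":
    print("supported" if is_supported() else "unsupported")
```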

Install from PyPI:

pip install chalkdf chalkpy

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
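
For reference, here is a plain-Python sketch of the values the two expressions above are expected to produce. It does not use chalkdf, and it assumes that F.if_then_else takes its arguments in the order (condition, value if true, value if false).

```python
# Plain-Python stand-in for the with_columns example; this is not chalkdf code.
a = [1, 2, 3]
b = [10, 20, 30]

# "sum": _.a + _.b  -> element-wise addition of the two columns
sum_col = [x + y for x, y in zip(a, b)]

# "flag": F.if_then_else(_.a < 2, "small", "big") -> per-row conditional
# (assumed argument order: condition, value-if-true, value-if-false)
flag_col = ["small" if x < 2 else "big" for x in a]

print(sum_col)   # [11, 22, 33]
print(flag_col)  # ['small', 'big', 'big']
```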

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
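
Because the order of chained operations matters, it can be useful to trace the pipeline by hand. The sketch below mirrors it with plain Python lists. It assumes that slice(1, 2) means "skip 1 row, then take 2 rows" and that filter preserves input order; neither assumption is confirmed here.

```python
# Plain-Python trace of filter -> slice -> order_by; this is not chalkdf code.
rows = list(zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # (a, b) pairs

# filter(_.a > 2): keep rows whose `a` is greater than 2
filtered = [row for row in rows if row[0] > 2]

# slice(1, 2): assumed to mean offset=1, length=2
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# order_by(("b", "descending")): sort by `b`, largest first
ordered = sorted(sliced, key=lambda row: row[1], reverse=True)

print(ordered)  # [(5, 50), (4, 40)]
```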

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # rows (1, 1) and (1, 2); the empty list for id=2 yields no rows
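
The explode step can be pictured as one output row per list element. Here is a plain-Python sketch of that behavior, consistent with the output noted in the example above (the row with an empty list produces nothing). It illustrates the semantics only and is not how chalkdf implements it.

```python
# Plain-Python illustration of explode; this is not chalkdf code.
rows = [
    {"id": 1, "values": [1, 2]},
    {"id": 2, "values": []},  # empty list: contributes no output rows
]

# Emit one row per element of the list column, repeating the other columns.
exploded = [
    {"id": row["id"], "values": value}
    for row in rows
    for value in row["values"]
]

print(exploded)  # [{'id': 1, 'values': 1}, {'id': 1, 'values': 2}]
```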

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
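
To check the aggregate against a known answer in a test, the per-group sums for this input can be computed by hand. This plain-Python sketch does so with a dictionary; since the row order of the chalkdf result may vary, comparing against a mapping like this one avoids depending on order.

```python
from collections import defaultdict

# Plain-Python computation of the expected v_sum per group; not chalkdf code.
g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

totals = defaultdict(int)
for key, value in zip(g, v):
    totals[key] += value  # accumulate the sum for each group key

v_sum = dict(totals)
print(v_sum)  # {1: 30, 2: 6}
```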

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

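# When column names differ, map each left column to its right counterpart
right_lookup = DataFrame.from_arrow(pa.table({"lookup_key": [2, 3, 4], "y": [200, 300, 400]}))
joined3 = left.join(right_lookup, on={"key": "lookup_key"}, how="left")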
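
The left-to-right key mapping can be illustrated with a small plain-Python hash join. This is a sketch of the general technique under simplifying assumptions (a single key pair, right key column dropped from the output, unmatched left rows padded with None). The toy join function is hypothetical and says nothing about how chalkdf names or orders its output columns.

```python
# Toy hash join over lists of dicts; this is not chalkdf code.
def join(left_rows, right_rows, on, how="inner"):
    """`on` maps a left column name to its right-hand counterpart."""
    left_col, right_col = next(iter(on.items()))  # single key pair for brevity
    right_value_cols = [c for c in right_rows[0] if c != right_col]

    # Index the right side by its join key.
    index = {}
    for rrow in right_rows:
        index.setdefault(rrow[right_col], []).append(rrow)

    out = []
    for lrow in left_rows:
        matches = index.get(lrow[left_col], [])
        for rrow in matches:
            out.append({**lrow, **{c: rrow[c] for c in right_value_cols}})
        if not matches and how == "left":
            # Left join keeps unmatched left rows, with nulls on the right.
            out.append({**lrow, **{c: None for c in right_value_cols}})
    return out


left = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right = [{"key": 2, "y": 200}, {"key": 3, "y": 300}, {"key": 4, "y": 400}]
right_lookup = [{"lookup_key": r["key"], "y": r["y"]} for r in right]

inner = join(left, right, on={"key": "key"}, how="inner")
left_joined = join(left, right_lookup, on={"key": "lookup_key"}, how="left")

print(inner)        # keys 2 and 3 match
print(left_joined)  # key 1 is kept with y=None
```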

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    "my_table",
    ["file:///path/to/data.parquet"],
)

print(df.run())  # materializes and prints a preview
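
If you start from local paths rather than URIs, the standard library can build the file:// form for you. The sketch below uses pathlib. The to_file_uri helper and the example paths are hypothetical, and the files do not need to exist for the conversion to work.

```python
from pathlib import Path


def to_file_uri(path):
    """Convert a local path (relative or absolute) to a file:// URI."""
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(path).resolve().as_uri()


# Hypothetical file names, used only for illustration.
uris = [to_file_uri(p) for p in ["data/part-0.parquet", "data/part-1.parquet"]]
print(uris)
```

The resulting list could then be passed as the second argument to DataFrame.scan in the example above.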

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame()
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For catalog‑backed tables, use LazyFrame.from_catalog_table("table_name") in your chain and provide a ChalkSqlCatalog when converting.
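
To make the record, serialize, and compare idea concrete, here is a toy stand-in written with only the standard library. It uses JSON rather than protobuf, and the ToyPlan class with its methods is hypothetical. It sketches the round-trip concept and does not describe LazyFrame's actual internals or wire format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPlan:
    """Records operations without executing them (toy model, not chalkdf)."""

    # Each step is (operation name, JSON-encoded arguments).
    steps: tuple = ()

    def _then(self, name, **args):
        encoded = json.dumps(args, sort_keys=True)
        return ToyPlan(self.steps + ((name, encoded),))

    def select(self, *columns):
        return self._then("select", columns=list(columns))

    def slice(self, offset, length):
        return self._then("slice", offset=offset, length=length)

    def to_bytes(self):
        # Serialize the recorded steps for transport or persistence.
        return json.dumps(self.steps).encode("utf-8")

    @classmethod
    def from_bytes(cls, data):
        # Rebuild the same tuple-of-tuples structure that was recorded.
        return cls(tuple((name, args) for name, args in json.loads(data)))


plan = ToyPlan().select("c1", "c2").slice(0, length=10)
restored = ToyPlan.from_bytes(plan.to_bytes())

print(plan == restored)  # True: structural equality of the recorded steps
```

Filter expressions are left out of the toy because they would need their own serializable representation.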

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed (tolerance kept so the float columns still match)
Testing.assert_frame_equal(left, right, atol=1e-5, check_row_order=False, check_column_order=False)
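
To see why atol=1e-5 accepts the values above, consider the common tolerance rule |actual - expected| <= atol + rtol * |expected|. The sketch below assumes that NumPy-style formula. The exact rule applied by chalkdf.Testing is not stated here, and within_tolerance is a hypothetical helper.

```python
def within_tolerance(actual, expected, atol=0.0, rtol=0.0):
    """NumPy-style closeness check (assumed formula, not taken from chalkdf)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# The difference is about 1e-6, which is inside an absolute tolerance of 1e-5.
print(within_tolerance(1.000001, 1.0, atol=1e-5, rtol=0.0))  # True

# With both tolerances at zero the comparison is exact and fails.
print(within_tolerance(1.000001, 1.0))  # False
```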

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • chalkdf-3.30.4-cp313-cp313-manylinux_2_35_x86_64.whl (149.7 MB): CPython 3.13, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.30.4-cp313-cp313-macosx_15_0_arm64.whl (118.3 MB): CPython 3.13, macOS 15.0+, ARM64
  • chalkdf-3.30.4-cp312-cp312-manylinux_2_35_x86_64.whl (149.7 MB): CPython 3.12, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.30.4-cp312-cp312-macosx_15_0_arm64.whl (118.3 MB): CPython 3.12, macOS 15.0+, ARM64
  • chalkdf-3.30.4-cp311-cp311-manylinux_2_35_x86_64.whl (149.6 MB): CPython 3.11, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.30.4-cp311-cp311-macosx_15_0_arm64.whl (118.3 MB): CPython 3.11, macOS 15.0+, ARM64
  • chalkdf-3.30.4-cp310-cp310-manylinux_2_35_x86_64.whl (149.6 MB): CPython 3.10, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.30.4-cp310-cp310-macosx_15_0_arm64.whl (118.3 MB): CPython 3.10, macOS 15.0+, ARM64