DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux (x86-64, glibc 2.36+) and macOS 15+ (ARM64); only prebuilt wheels are published, with no source distribution

Install from PyPI:

pip install chalkdf chalkpy

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
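
To make the expected column values concrete, the same row-wise results can be worked out in plain Python. This is only a sketch of what the two expressions compute, not how chalkdf evaluates them, and it assumes F.if_then_else(cond, x, y) yields x where cond is true and y otherwise.

```python
# Plain-Python sketch of the row-wise results (not chalkdf's implementation).
a = [1, 2, 3]
b = [10, 20, 30]

# "sum": _.a + _.b
sum_col = [x + y for x, y in zip(a, b)]

# "flag": assumed semantics of F.if_then_else(_.a < 2, "small", "big")
flag_col = ["small" if x < 2 else "big" for x in a]

print(sum_col)   # [11, 22, 33]
print(flag_col)  # ['small', 'big', 'big']
```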

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
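
The chained steps can be traced by hand in plain Python. The sketch below assumes slice(1, 2) means "skip 1 row, then take 2 rows" (offset, then length, as the slice(0, length=10) call in the lazy example suggests) and that each step applies in the order it is chained; it does not reflect how the engine executes the plan.

```python
# Plain-Python sketch of filter -> slice -> order_by (not chalkdf's implementation).
rows = list(zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # (a, b) pairs

filtered = [r for r in rows if r[0] > 2]        # keep rows where a > 2
offset, length = 1, 2                           # assumed meaning of slice(1, 2)
sliced = filtered[offset:offset + length]       # rows picked by position
ordered = sorted(sliced, key=lambda r: r[1], reverse=True)  # b descending

print(ordered)  # [(5, 50), (4, 40)]
```

Under these assumptions, slicing before ordering selects rows by their input position and then sorts only those rows; ordering first and slicing afterwards would generally pick different rows.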

Force caching of a reused subcomputation

DataFrame.replay() is an escape hatch for tests and carefully inspected plans where you need to force a subcomputation to be cached behind an explicit replay boundary. Most plans should let the optimizer place replay nodes automatically; use this only when you know a shared subplan is expensive enough to compute once and reuse.

from chalk import _
from chalkdf import DataFrame

df = DataFrame.from_dict({"id": [1, 2, 3], "amount": [10, 20, 30]})
shared = df.with_columns({"amount_cents": _.amount * 100}).replay("shared_amount_cents")

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # rows (1, 1) and (1, 2); the empty list for id=2 produces no rows
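
As a rough illustration of the result noted in the comment above, explode can be pictured as emitting one output row per list element. The plain-Python sketch below reproduces that result for the same data; it is not chalkdf's implementation.

```python
# Plain-Python sketch of explode: one output row per list element.
rows = [(1, [1, 2]), (2, [])]  # (id, values) pairs

exploded = [(id_, value) for id_, values in rows for value in values]

print(exploded)  # [(1, 1), (1, 2)]
```

The row with id 2 contributes nothing here because its list is empty.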

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
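
For a sense of the expected totals, the same grouping can be done by hand. This is a plain-Python sketch of a grouped sum over the data above, not a description of how chalkdf computes it.

```python
# Plain-Python sketch of a grouped sum (not chalkdf's implementation).
from collections import defaultdict

g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

v_sum = defaultdict(int)
for key, value in zip(g, v):
    v_sum[key] += value  # accumulate v within each group g

print(dict(v_sum))  # {1: 30, 2: 6}
```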

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

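# When column names differ, map each left column to its right counterpart
right_lookup = DataFrame.from_arrow(pa.table({"lookup_key": [2, 3, 4], "y": [200, 300, 400]}))
joined3 = left.join(right_lookup, on={"key": "lookup_key"}, how="left")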
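
To illustrate what a left-to-right key mapping expresses, here is a small plain-Python inner join over lists of dicts. The helper inner_join is hypothetical and exists only for this sketch; it says nothing about how chalkdf implements joins.

```python
# Plain-Python sketch of an inner join driven by a left->right key mapping.
left_rows = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right_rows = [{"key": 2, "y": 200}, {"key": 3, "y": 300}, {"key": 4, "y": 400}]


def inner_join(left, right, on):
    """Join two lists of dicts; `on` maps left column names to right column names."""
    # Index the right side by its key columns for constant-time lookups.
    index = {}
    for r in right:
        index.setdefault(tuple(r[rc] for rc in on.values()), []).append(r)

    out = []
    for l in left:
        key = tuple(l[lc] for lc in on)
        for r in index.get(key, []):
            merged = dict(l)
            # Add the right side's non-key columns.
            merged.update({k: val for k, val in r.items() if k not in on.values()})
            out.append(merged)
    return out


joined_rows = inner_join(left_rows, right_rows, on={"key": "key"})
print(joined_rows)
# [{'key': 2, 'x': 20, 'y': 200}, {'key': 3, 'x': 30, 'y': 300}]
```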

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
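
The round-trip idea can be pictured with a toy model: a plan is an ordered record of steps, serialization writes those steps out, and deserialization rebuilds an equal record. The sketch below uses a hypothetical ToyPlan class and JSON instead of protobuf; none of these names belong to chalkdf.

```python
# Toy model of a serializable, chainable plan (JSON stands in for protobuf).
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPlan:
    """Hypothetical stand-in for a recorded plan: an ordered tuple of (op, args)."""

    steps: tuple = ()

    def then(self, op, *args):
        # Each call returns a new plan with one more recorded step.
        return ToyPlan(self.steps + ((op, args),))

    def to_json(self):
        return json.dumps([[op, list(args)] for op, args in self.steps])

    @classmethod
    def from_json(cls, payload):
        return cls(tuple((op, tuple(args)) for op, args in json.loads(payload)))


plan = ToyPlan().then("select", "c1", "c2").then("slice", 0, 10)
restored = ToyPlan.from_json(plan.to_json())

print(plan == restored)  # True: structural equality of the recorded steps
```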

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed
Testing.assert_frame_equal(left, right, check_row_order=False, check_column_order=False)
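
For intuition about atol and rtol, a common convention (used by NumPy's isclose, for example) accepts two values when |left - right| <= atol + rtol * |right|. Whether chalkdf applies exactly this rule is an assumption here; the helper within_tolerance below is hypothetical and only illustrates the arithmetic.

```python
# Sketch of an absolute/relative tolerance check (assumed convention).
def within_tolerance(left, right, atol, rtol):
    """Return True when left and right differ by at most atol + rtol * |right|."""
    return abs(left - right) <= atol + rtol * abs(right)


# The values from the example above differ by about 1e-6.
print(within_tolerance(1.000001, 1.0, atol=1e-5, rtol=0.0))  # True
print(within_tolerance(1.000001, 1.0, atol=1e-7, rtol=0.0))  # False
```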

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.
