DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux (x86-64, glibc 2.36+) and macOS 15+ (ARM64)

Install from PyPI:

pip install chalkdf chalkpy

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
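
As a rough cross-check of what the two expressions above should produce, the same row-wise logic can be written in plain Python. This is only an illustration of the expected values, not how chalkdf evaluates expressions; the variable names sum_col and flag_col are made up for this sketch.

```python
# Plain-Python illustration of the two derived columns.
# This does not use chalkdf; it only mirrors the arithmetic row by row.
a = [1, 2, 3]
b = [10, 20, 30]

# Mirrors "sum": _.a + _.b
sum_col = [x + y for x, y in zip(a, b)]

# Mirrors "flag": F.if_then_else(_.a < 2, "small", "big")
flag_col = ["small" if x < 2 else "big" for x in a]

print(sum_col)   # [11, 22, 33]
print(flag_col)  # ['small', 'big', 'big']
```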

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
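
The order of the chained calls matters here: the slice applies to the filtered rows, and the sort applies to what the slice kept. The plain-Python trace below walks through that chain under the assumption that slice(1, 2) means "offset 1, length 2" (the LazyFrame example further down passes length as a keyword, which suggests this reading, but it is not confirmed by the text). It does not use chalkdf.

```python
# Plain-Python trace of filter -> slice -> order_by.
# Assumption: slice(offset, length), i.e. skip 1 row, then keep 2 rows.
rows = list(zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # (a, b) pairs

# filter(_.a > 2)
filtered = [r for r in rows if r[0] > 2]

# slice(1, 2)
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# order_by(("b", "descending"))
ordered = sorted(sliced, key=lambda r: r[1], reverse=True)

print(filtered)  # [(3, 30), (4, 40), (5, 50)]
print(sliced)    # [(4, 40), (5, 50)]
print(ordered)   # [(5, 50), (4, 40)]
```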

Force caching of a reused subcomputation

DataFrame.replay() is an escape hatch for tests and carefully inspected plans where you need to force a subcomputation to be cached behind an explicit replay boundary. Most plans should let the optimizer place replay nodes automatically; use this only when you know a shared subplan is expensive enough to compute once and reuse.

from chalk import _
from chalkdf import DataFrame

df = DataFrame.from_dict({"id": [1, 2, 3], "amount": [10, 20, 30]})
shared = df.with_columns({"amount_cents": _.amount * 100}).replay("shared_amount_cents")

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # rows (1, 1) and (1, 2); the empty list for id=2 yields no rows
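
The result noted above implies that a row whose list is empty contributes no output rows. A plain-Python sketch of that behavior, written without chalkdf and meant only to show the shape of the result:

```python
# Plain-Python illustration of exploding a list column.
# Each (id, values) row becomes one row per list element;
# an empty list therefore produces no rows at all.
rows = [(1, [1, 2]), (2, [])]

exploded = [(row_id, value) for row_id, values in rows for value in values]

print(exploded)  # [(1, 1), (1, 2)]
```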

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
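
For reference, the sums that the aggregation above should produce per group can be computed in plain Python. This sketch does not use chalkdf and says nothing about how the engine orders or executes the aggregation; it only shows the expected values.

```python
from collections import defaultdict

# Plain-Python illustration of grouping by "g" and summing "v".
g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

v_sum = defaultdict(int)
for key, value in zip(g, v):
    v_sum[key] += value

print(dict(v_sum))  # {1: 30, 2: 6}
```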

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

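# When column names differ, map each left column to its right counterpart.
# Here the right-hand key is renamed first so that the names actually differ.
right_lookup = right.rename({"key": "lookup_key"})
joined3 = left.join(right_lookup, on={"key": "lookup_key"}, how="left")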
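
To make the left-to-right key mapping concrete, here is a small plain-Python join over lists of dicts. It is a sketch of the general technique, not chalkdf's implementation: the join_rows helper is hypothetical, and details such as which key columns appear in the output or how unmatched rows are filled are assumptions.

```python
# Plain-Python hash join keyed by a {left_column: right_column} mapping.
# join_rows is a hypothetical helper written for this illustration only.
def join_rows(left, right, on, how="inner"):
    right_keys = list(on.values())
    # Non-key columns contributed by the right side.
    extra_cols = [c for c in right[0] if c not in right_keys] if right else []
    # Index right rows by their key tuple.
    index = {}
    for r in right:
        index.setdefault(tuple(r[c] for c in right_keys), []).append(r)
    out = []
    for l in left:
        matches = index.get(tuple(l[c] for c in on), [])
        for r in matches:
            out.append({**l, **{c: r[c] for c in extra_cols}})
        if not matches and how == "left":
            # Assumption: unmatched left rows get None for right columns.
            out.append({**l, **{c: None for c in extra_cols}})
    return out

left_rows = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right_rows = [
    {"lookup_key": 2, "y": 200},
    {"lookup_key": 3, "y": 300},
    {"lookup_key": 4, "y": 400},
]

inner = join_rows(left_rows, right_rows, on={"key": "lookup_key"})
left_join = join_rows(left_rows, right_rows, on={"key": "lookup_key"}, how="left")

print(inner)      # keys 2 and 3 match
print(left_join)  # key 1 is kept with y=None
```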

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview
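
The scan example uses a file:// URI with an absolute path. If you start from a relative path, the standard library can build the URI for you. This sketch uses only pathlib; the to_file_uri helper and the example path are hypothetical, and whether DataFrame.scan also accepts plain paths is not stated here.

```python
from pathlib import Path

def to_file_uri(path):
    """Turn a relative or absolute local path into a file:// URI."""
    # as_uri() requires an absolute path, so resolve first.
    return Path(path).resolve().as_uri()

uri = to_file_uri("data/events.parquet")  # hypothetical location
print(uri)
```

The resulting string can then be passed in the list given to DataFrame.scan, as in the example above.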

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
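
The idea of a plan that records operations, serializes, and compares structurally can be sketched in a few lines of plain Python. Everything below is a hypothetical stand-in: the Plan class is not chalkdf's LazyFrame, and JSON is used in place of protobuf purely to keep the example dependency-free.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    """Hypothetical stand-in for a recorded lazy plan (not chalkdf's class)."""
    steps: tuple = ()

    def then(self, op, **args):
        # Record one more step and return a new plan; nothing executes here.
        return Plan(self.steps + ((op, tuple(sorted(args.items()))),))

    def to_bytes(self):
        # JSON stands in for the protobuf encoding.
        return json.dumps(self.steps).encode()

    @classmethod
    def from_bytes(cls, data):
        raw = json.loads(data)
        return cls(tuple((op, tuple(tuple(kv) for kv in args)) for op, args in raw))

plan = (
    Plan()
    .then("select", columns=["c1", "c2"])
    .then("slice", offset=0, length=10)
    .then("filter", predicate="c1 > 0")
)

restored = Plan.from_bytes(plan.to_bytes())
print(restored == plan)  # True: same steps in the same order
```

Because equality is defined over the recorded steps rather than over any data, two plans compare equal without being executed, which is the property the assert in the LazyFrame example relies on.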

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed
Testing.assert_frame_equal(left, right, check_row_order=False, check_column_order=False)
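
To see why the first comparison above passes, it helps to write the tolerance check out. The sketch below assumes the common NumPy-style rule, |actual - expected| <= atol + rtol * |expected|; the exact rule used by assert_frame_equal is not stated here, so treat the formula and the within_tolerance helper as assumptions.

```python
# Assumed NumPy-style tolerance rule; within_tolerance is a hypothetical helper.
def within_tolerance(actual, expected, atol=0.0, rtol=0.0):
    return abs(actual - expected) <= atol + rtol * abs(expected)

# The difference is about 1e-6, which is inside atol=1e-5.
print(within_tolerance(1.000001, 1.0, atol=1e-5, rtol=0.0))  # True

# With both tolerances at zero, the same values no longer match.
print(within_tolerance(1.000001, 1.0))  # False
```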

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • chalkdf-3.34.20-cp313-cp313-manylinux_2_36_x86_64.whl (178.8 MB): CPython 3.13, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.34.20-cp313-cp313-macosx_15_0_arm64.whl (142.4 MB): CPython 3.13, macOS 15.0+, ARM64
  • chalkdf-3.34.20-cp312-cp312-manylinux_2_36_x86_64.whl (178.8 MB): CPython 3.12, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.34.20-cp312-cp312-macosx_15_0_arm64.whl (142.4 MB): CPython 3.12, macOS 15.0+, ARM64
  • chalkdf-3.34.20-cp311-cp311-manylinux_2_36_x86_64.whl (178.8 MB): CPython 3.11, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.34.20-cp311-cp311-macosx_15_0_arm64.whl (142.4 MB): CPython 3.11, macOS 15.0+, ARM64
  • chalkdf-3.34.20-cp310-cp310-manylinux_2_36_x86_64.whl (178.7 MB): CPython 3.10, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.34.20-cp310-cp310-macosx_15_0_arm64.whl (142.4 MB): CPython 3.10, macOS 15.0+, ARM64
