DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux (x86-64, glibc 2.36+) and macOS 15+ (ARM64), the platforms covered by the published wheels

Install from PyPI:

pip install chalkdf chalkpy

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
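The example above prints no output, so here is a plain-Python model of the two derived columns. It is a sketch of the expected values, not chalkdf code, and it assumes F.if_then_else(condition, then, otherwise) picks the second argument where the condition holds and the third elsewhere.

```python
# Plain-Python model of the derived columns; not chalkdf code.
a = [1, 2, 3]
b = [10, 20, 30]

# "sum": _.a + _.b, evaluated row by row
sum_col = [x + y for x, y in zip(a, b)]

# "flag": F.if_then_else(_.a < 2, "small", "big")
# Assumption: the first branch is taken where the condition is true.
flag_col = ["small" if x < 2 else "big" for x in a]

print(sum_col)   # [11, 22, 33]
print(flag_col)  # ['small', 'big', 'big']
```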

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
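To make the order of operations concrete, the following plain-Python sketch walks through the same chain on the same data. It assumes slice(1, 2) means offset 1 and length 2 (consistent with the slice(0, length=10) call in the LazyFrame example) and that filter preserves input row order; neither is confirmed here as a guarantee of the engine.

```python
# Plain-Python model of filter -> slice -> order_by; not chalkdf code.
rows = list(zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # (a, b) pairs

# filter(_.a > 2): keep rows whose a exceeds 2
filtered = [r for r in rows if r[0] > 2]

# slice(1, 2): assumed to mean offset=1, length=2
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# order_by(("b", "descending")): sort by b, largest first
ordered = sorted(sliced, key=lambda r: r[1], reverse=True)

print(ordered)  # [(5, 50), (4, 40)]
```

Because the slice runs before the sort, it selects rows by their position in the filtered input, not by rank in b.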

Force caching of a reused subcomputation

DataFrame.replay() is an escape hatch for tests and carefully inspected plans where you need to force a subcomputation to be cached behind an explicit replay boundary. Most plans should let the optimizer place replay nodes automatically; use this only when you know a shared subplan is expensive enough to compute once and reuse.

from chalk import _
from chalkdf import DataFrame

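df = DataFrame.from_dict({"id": [1, 2, 3], "amount": [10, 20, 30]})
shared = df.with_columns({"amount_cents": _.amount * 100}).replay("shared_amount_cents")

# Illustrative reuse: both branches build on the same replayed subplan
large = shared.filter(_.amount_cents > 1500)
total = shared.agg(["id"], _.amount_cents.sum().alias("total_cents"))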

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # (1,1), (1,2)
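The comment above implies that the row with an empty list (id 2) produces no output rows. A plain-Python sketch of that behavior, using dictionaries as stand-in rows rather than chalkdf objects:

```python
# Plain-Python model of rename + explode; not chalkdf code.
rows = [
    {"id": 1, "values": [1, 2]},
    {"id": 2, "values": []},  # empty list: contributes no rows
]

# One output row per list element, with the other columns repeated
exploded = [
    {"id": row["id"], "values": value}
    for row in rows
    for value in row["values"]
]

print(exploded)  # [{'id': 1, 'values': 1}, {'id': 1, 'values': 2}]
```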

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
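For reference, the per-group sums this aggregation should yield can be checked by hand. The sketch below is plain Python, not chalkdf code, and returns a dictionary, so it sidesteps the row-ordering caveat noted above.

```python
from collections import defaultdict

# Plain-Python model of agg(["g"], _.v.sum().alias("v_sum")); not chalkdf code.
g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

totals = defaultdict(int)
for key, value in zip(g, v):
    totals[key] += value  # accumulate v within each group g

v_sum = dict(totals)
print(v_sum)  # {1: 30, 2: 6}
```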

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

# When column names differ, map each left column to its right counterpart
right_renamed = right.rename({"key": "lookup_key"})
joined3 = left.join(right_renamed, on={"key": "lookup_key"}, how="left")
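The left-to-right key mapping can be modeled in a few lines of plain Python. This is a sketch of inner-join semantics only, not chalkdf's implementation; the helper name inner_join is hypothetical, and dropping the right-hand key column from the output is an assumption made for illustration.

```python
# Plain-Python model of an inner join with a left->right key mapping.
# inner_join is a hypothetical helper, not part of chalkdf.
def inner_join(left_rows, right_rows, on):
    right_key_cols = set(on.values())
    out = []
    for l in left_rows:
        for r in right_rows:
            # A pair matches when every mapped key column agrees
            if all(l[lk] == r[rk] for lk, rk in on.items()):
                merged = dict(l)
                # Assumption: right-hand key columns are not repeated in the output
                merged.update({k: val for k, val in r.items() if k not in right_key_cols})
                out.append(merged)
    return out

left_rows = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right_rows = [
    {"lookup_key": 2, "y": 200},
    {"lookup_key": 3, "y": 300},
    {"lookup_key": 4, "y": 400},
]

joined = inner_join(left_rows, right_rows, on={"key": "lookup_key"})
print(joined)  # [{'key': 2, 'x': 20, 'y': 200}, {'key': 3, 'x': 30, 'y': 300}]
```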

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview
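If your files are referenced by ordinary filesystem paths, the standard library can produce the file:// URIs for you. A minimal sketch follows; the helper name to_file_uris and the example paths are hypothetical, and the resulting list is what you would pass as the first argument to DataFrame.scan.

```python
from pathlib import Path

def to_file_uris(paths):
    """Convert local paths (relative or absolute) to file:// URIs."""
    # as_uri() requires an absolute path, so resolve() first
    return [Path(p).resolve().as_uri() for p in paths]

# Hypothetical paths; they do not need to exist for the conversion itself
uris = to_file_uris(["data/part-0.parquet", "data/part-1.parquet"])
print(uris)
```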

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
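The idea of recording operations and round-tripping them can be illustrated with a toy model. The sketch below is not how LazyFrame is implemented: ToyPlan is a hypothetical class, and JSON stands in for protobuf purely to keep the example dependency-free.

```python
import json
from dataclasses import dataclass, field

@dataclass
class ToyPlan:
    """Hypothetical stand-in for a lazy plan: an ordered list of recorded ops."""
    ops: list = field(default_factory=list)

    def _then(self, name, **args):
        # Return a new plan so earlier plans are never mutated
        return ToyPlan(self.ops + [[name, args]])

    def select(self, *columns):
        return self._then("select", columns=list(columns))

    def slice(self, offset, length):
        return self._then("slice", offset=offset, length=length)

    def to_json(self):
        # JSON is used here in place of protobuf
        return json.dumps(self.ops)

    @classmethod
    def from_json(cls, payload):
        return cls(json.loads(payload))

plan = ToyPlan().select("c1", "c2").slice(0, length=10)
restored = ToyPlan.from_json(plan.to_json())

# Structural equality: same operations, same arguments, same order
assert plan == restored
print(restored.ops)
# [['select', {'columns': ['c1', 'c2']}], ['slice', {'offset': 0, 'length': 10}]]
```

Nothing is computed while the chain is built; the plan is only data until something executes it, which is what makes serialization straightforward.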

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed
Testing.assert_frame_equal(left, right, check_row_order=False, check_column_order=False)
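To see why the atol=1e-5 comparison above passes, here is a sketch of a tolerance check. It assumes the common NumPy-style rule |actual - expected| <= atol + rtol * |expected|; whether assert_frame_equal uses exactly this formula is not stated here, and within_tolerance is a hypothetical helper.

```python
# Hypothetical helper modeling a NumPy-style tolerance rule; not chalkdf code.
def within_tolerance(actual, expected, atol=0.0, rtol=0.0):
    return abs(actual - expected) <= atol + rtol * abs(expected)

# The values from the example: the difference is about 1e-6
print(within_tolerance(1.000001, 1.0, atol=1e-5, rtol=0.0))  # True

# With both tolerances at zero the same pair does not match
print(within_tolerance(1.000001, 1.0))  # False

# A larger gap fails even with the absolute tolerance
print(within_tolerance(1.1, 1.0, atol=1e-5))  # False
```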

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.

Download files

No source distribution files are available for this release.

Built Distributions

  • chalkdf-3.34.10-cp313-cp313-manylinux_2_36_x86_64.whl (171.1 MB): CPython 3.13, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.34.10-cp313-cp313-macosx_15_0_arm64.whl (134.2 MB): CPython 3.13, macOS 15.0+, ARM64
  • chalkdf-3.34.10-cp312-cp312-manylinux_2_36_x86_64.whl (171.1 MB): CPython 3.12, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.34.10-cp312-cp312-macosx_15_0_arm64.whl (134.2 MB): CPython 3.12, macOS 15.0+, ARM64
  • chalkdf-3.34.10-cp311-cp311-manylinux_2_36_x86_64.whl (171.0 MB): CPython 3.11, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.34.10-cp311-cp311-macosx_15_0_arm64.whl (134.2 MB): CPython 3.11, macOS 15.0+, ARM64
  • chalkdf-3.34.10-cp310-cp310-manylinux_2_36_x86_64.whl (171.0 MB): CPython 3.10, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.34.10-cp310-cp310-macosx_15_0_arm64.whl (134.1 MB): CPython 3.10, macOS 15.0+, ARM64
