DataFrame utilities for Chalk AI, backed by libchalk
chalkdf
DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.
The API centers on two concepts:
- chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
- chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.
You build expressions with Chalk’s Python underscore DSL (from chalk import _)
and function registry (import chalk.functions as F). We also ship pragmatic
testing helpers (chalkdf.Testing) to make it easy to compare results in your
unit tests.
Installation
- Requires Python 3.10–3.13
- Works on Linux and macOS 15+
Install from PyPI:
pip install chalkdf chalkpy
Quickstart
Below are minimal, self‑contained examples that mirror the public API used in the tests.
Create from Arrow and select columns
import pyarrow as pa
from chalkdf import DataFrame
tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)
out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │ c │ a │
# ├─────┼─────┤
# │ 100 │ 1 │
# │ 200 │ 2 │
# │ 300 │ 3 │
# └─────┴─────┘
Add or replace columns with expressions
import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame
tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)
out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
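To make the expected result concrete, the plain-Python sketch below computes the same two columns by hand. It does not call chalkdf. The helper name with_columns_model is hypothetical, and the sketch assumes that F.if_then_else returns its second argument where the condition holds and its third argument otherwise.

```python
# Plain-Python model of the with_columns example above (not chalkdf code).
rows = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]


def with_columns_model(rows):
    """Hypothetical helper: return new rows with 'sum' and 'flag' added."""
    out = []
    for row in rows:
        new_row = dict(row)
        new_row["sum"] = row["a"] + row["b"]
        # Assumed semantics of F.if_then_else(condition, then, otherwise).
        new_row["flag"] = "small" if row["a"] < 2 else "big"
        out.append(new_row)
    return out


expected = with_columns_model(rows)
print(expected)
```

A reference computation like this can be handy as the expected side of a unit test.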
Filter, slice, and order
import pyarrow as pa
from chalk import _
from chalkdf import DataFrame
tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)
out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
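The order of the chained calls matters: the slice is taken before the sort. The plain-Python sketch below walks through the same steps without chalkdf. It assumes that slice(1, 2) means "skip 1 row, then take 2" (consistent with the length keyword used in the LazyFrame example) and that filtering preserves input row order.

```python
# Plain-Python model of filter -> slice -> order_by (not chalkdf code).
rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]  # (a, b) pairs

# Keep rows where a > 2.
filtered = [r for r in rows if r[0] > 2]

# Assumed meaning of slice(1, 2): offset 1, length 2.
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# Order the remaining rows by b, descending.
ordered = sorted(sliced, key=lambda r: r[1], reverse=True)

print(ordered)
```

Under these assumptions the row with a == 3 is dropped by the slice, not by the filter.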
Rename and explode
import pyarrow as pa
from chalkdf import DataFrame
tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)
renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run()) # (1,1), (1,2)
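The comment above shows that the row with an empty list disappears from the output. A plain-Python sketch of that behavior, without chalkdf, looks like this:

```python
# Plain-Python model of explode (not chalkdf code).
ids = [1, 2]
vals = [[1, 2], []]

# One output row per list element; an empty list contributes no rows.
exploded_rows = [(i, v) for i, vs in zip(ids, vals) for v in vs]

print(exploded_rows)
```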
Group by and aggregate
import pyarrow as pa
from chalk import _
from chalkdf import DataFrame
tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)
out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run()) # rows may be unordered
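Because grouped rows may come back in any order, it helps to compare against an expected result that is sorted first. The sketch below computes the per-group sums in plain Python, without chalkdf; the variable names are illustrative only.

```python
# Plain-Python reference for the group-by sum (not chalkdf code).
from collections import defaultdict

g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

sums = defaultdict(int)
for key, value in zip(g, v):
    sums[key] += value

# Sort before comparing, since grouped output may arrive in any order.
expected_rows = sorted(sums.items())

print(expected_rows)
```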
Joins
Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.
import pyarrow as pa
from chalkdf import DataFrame
left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))
joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())
# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")
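# When column names differ, map each left column to its right counterpart
right_lookup = DataFrame.from_arrow(pa.table({"lookup_key": [2, 3, 4], "y": [200, 300, 400]}))
joined3 = left.join(right_lookup, on={"key": "lookup_key"}, how="left")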
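The mapping form pairs each left column with the right column it must equal. The plain-Python sketch below models an inner join on such a mapping. It does not call chalkdf: inner_join_model is a hypothetical helper, and dropping the right-hand key column from the output is an assumption made for this sketch, not a statement about chalkdf's output schema.

```python
# Plain-Python model of an inner join on a {left_col: right_col} mapping.
def inner_join_model(left_rows, right_rows, on):
    """Hypothetical helper: nested-loop inner join over lists of dicts."""
    right_keys = set(on.values())
    joined = []
    for left_row in left_rows:
        for right_row in right_rows:
            if all(left_row[lk] == right_row[rk] for lk, rk in on.items()):
                merged = dict(left_row)
                # Assumption: right-hand key columns are not repeated in the output.
                merged.update(
                    {k: val for k, val in right_row.items() if k not in right_keys}
                )
                joined.append(merged)
    return joined


left_rows = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right_rows = [
    {"lookup_key": 2, "y": 200},
    {"lookup_key": 3, "y": 300},
    {"lookup_key": 4, "y": 400},
]

result = inner_join_model(left_rows, right_rows, on={"key": "lookup_key"})
print(result)
```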
Scan Parquet files
You can construct a DataFrame that scans one or more Parquet files without
loading them eagerly. Use local file:// URIs (or remote URIs when running in
an environment with appropriate access):
from chalkdf import DataFrame
df = DataFrame.scan(
    "my_table",
    ["file:///path/to/data.parquet"],
)
print(df.run()) # materializes and prints a preview
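The scan example passes file:// URIs rather than bare paths. One way to build such a URI from a local path, using only the standard library, is pathlib's Path.as_uri(). The sketch below creates an empty placeholder file purely for illustration; it does not write valid Parquet data and does not call chalkdf.

```python
# Build a file:// URI from a local path with the standard library.
import tempfile
from pathlib import Path

# Placeholder file so the path exists; a real pipeline would point at Parquet data.
tmp_dir = Path(tempfile.mkdtemp())
parquet_path = tmp_dir / "data.parquet"
parquet_path.touch()

# Path.as_uri() requires an absolute path and percent-encodes special characters.
uri = parquet_path.resolve().as_uri()

print(uri)
```

The resulting string can then be placed in the list of URIs passed to DataFrame.scan.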
Lazy plans (experimental)
LazyFrame records a chain of operations and can round‑trip to a protobuf for
transport or persistence. You can reconstruct the same lazy plan later and
convert it to a DataFrame to execute.
from chalk import _
from chalkdf import LazyFrame
lf = (
    LazyFrame()
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)
proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2 # structural equality of the recorded plan
df = lf2._convert_to_df() # convert to a DataFrame for execution
print(df.run())
Notes:
- LazyFrame._convert_to_df() is currently a private/experimental helper.
- For catalog‑backed tables, use LazyFrame.from_catalog_table("table_name") in your chain and provide a ChalkSqlCatalog when converting.
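The round trip works because a recorded plan is plain data: an ordered list of operations that can be serialized and compared structurally. The toy model below illustrates that idea with JSON standing in for protobuf. PlanModel and its methods are hypothetical and say nothing about chalkdf's actual wire format or internals.

```python
# Toy model of a serializable plan (JSON stands in for protobuf; not chalkdf code).
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanModel:
    """Hypothetical stand-in for a lazy plan: an ordered tuple of (op, args)."""

    steps: tuple = ()

    def add(self, op, *args):
        # Return a new plan so that recorded plans are never mutated in place.
        return PlanModel(self.steps + ((op, args),))

    def to_json(self):
        return json.dumps([[op, list(args)] for op, args in self.steps])

    @classmethod
    def from_json(cls, payload):
        return cls(tuple((op, tuple(args)) for op, args in json.loads(payload)))


plan = PlanModel().add("select", "c1", "c2").add("slice", 0, 10).add("filter", "c1 > 0")
restored = PlanModel.from_json(plan.to_json())

# Dataclass equality compares the recorded steps, not object identity.
print(restored == plan)
```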
Testing helpers
Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit
tests. It materializes both frames to Arrow and supports relaxed comparisons.
import pyarrow as pa
from chalkdf import Testing, DataFrame
left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))
Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)
# Ignore row or column order when needed
Testing.assert_frame_equal(left, right, check_row_order=False, check_column_order=False)
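The exact tolerance formula used by assert_frame_equal is not stated here. As a rough guide to how atol and rtol interact, the sketch below assumes the NumPy-style rule |actual - expected| <= atol + rtol * |expected|. The helper is_close is hypothetical and is not part of chalkdf.

```python
# Sketch of an absolute/relative tolerance check (assumed NumPy-style rule).
def is_close(actual, expected, atol=0.0, rtol=0.0):
    """Hypothetical helper: True when the values agree within tolerance."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# The values from the example above differ by about 1e-6.
within = is_close(1.000001, 1.0, atol=1e-5, rtol=0.0)
outside = is_close(1.000001, 1.0, atol=1e-7, rtol=0.0)

print(within, outside)
```

Under this rule, the comparison in the example passes with atol=1e-5 and would fail with a tighter atol such as 1e-7.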
Why chalkdf?
- Built on Apache Arrow memory formats for zero‑copy interop.
- Runs on Chalk’s native engine for performance and portability.
- Small, predictable API surface suited for services and batch jobs.