DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow, powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux and macOS 15+

Install from PyPI:

pip install chalkdf chalkpy

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
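
To sanity-check what the two expressions above should produce, here is a plain-Python reference over the same three rows. It is a sketch for reasoning about expected values, not chalkdf code; it assumes F.if_then_else(cond, a, b) yields a where the condition holds and b otherwise, and the helper name add_columns is hypothetical.

```python
# Plain-Python reference for the with_columns example (not chalkdf code).
rows = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]


def add_columns(row):
    # Mirrors "sum": _.a + _.b and "flag": "small" if a < 2, otherwise "big".
    return {
        **row,
        "sum": row["a"] + row["b"],
        "flag": "small" if row["a"] < 2 else "big",
    }


expected = [add_columns(r) for r in rows]
for r in expected:
    print(r)
```

A list like expected can serve as the hand-computed right-hand side when writing a unit test for the pipeline.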

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
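
The order of the chained calls matters: the slice applies to the filtered rows, and the sort applies to the sliced rows. The plain-Python trace below walks the same data through the same three steps. It assumes slice(1, 2) means offset 1 and length 2 (consistent with slice(0, length=10) in the lazy example) and that filtering preserves input order; it illustrates the semantics and is not chalkdf code.

```python
# Step-by-step trace of filter -> slice -> order_by on the example data.
rows = [{"a": a, "b": b} for a, b in zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])]

# filter(_.a > 2): keep rows where a is greater than 2.
filtered = [r for r in rows if r["a"] > 2]

# slice(1, 2): skip 1 row, then take 2 rows (assumed offset/length meaning).
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# order_by(("b", "descending")): sort the remaining rows by b, largest first.
ordered = sorted(sliced, key=lambda r: r["b"], reverse=True)

print(ordered)  # [{'a': 5, 'b': 50}, {'a': 4, 'b': 40}]
```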

Force caching of a reused subcomputation

DataFrame.replay() is an escape hatch for tests and carefully inspected plans where you need to force a subcomputation to be cached behind an explicit replay boundary. Most plans should let the optimizer place replay nodes automatically; use this only when you know a shared subplan is expensive enough to compute once and reuse.

from chalk import _
from chalkdf import DataFrame

df = DataFrame.from_dict({"id": [1, 2, 3], "amount": [10, 20, 30]})
shared = df.with_columns({"amount_cents": _.amount * 100}).replay("shared_amount_cents")
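
The idea of computing a shared subplan once and reusing it is the same idea as memoization. The sketch below is only an analogy in plain Python using functools.lru_cache; it says nothing about how libchalk implements replay nodes, and the function name shared_amount_cents is borrowed from the example purely for illustration.

```python
from functools import lru_cache

# Counts how many times the "expensive" shared step actually runs.
calls = {"count": 0}


@lru_cache(maxsize=None)
def shared_amount_cents():
    # Stand-in for an expensive shared subcomputation.
    calls["count"] += 1
    amounts = (10, 20, 30)
    return tuple(a * 100 for a in amounts)


# Two downstream consumers reuse the same cached result.
total = sum(shared_amount_cents())
largest = max(shared_amount_cents())

print(total, largest, calls["count"])  # 6000 3000 1
```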

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # (1,1), (1,2)
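
The comment above implies that a row whose list is empty contributes no output rows. A plain-Python reference for that behavior (a sketch of the semantics, not chalkdf code) looks like this:

```python
# One output row per list element; rows with empty lists produce nothing.
rows = [{"id": 1, "values": [1, 2]}, {"id": 2, "values": []}]

exploded_rows = [
    {"id": row["id"], "values": value}
    for row in rows
    for value in row["values"]
]

print(exploded_rows)  # [{'id': 1, 'values': 1}, {'id': 1, 'values': 2}]
```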

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
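
Because aggregated rows may come back in any order, a test should sort (or otherwise normalize) before comparing. The sketch below computes the expected per-group sums in plain Python and sorts them by key; it is a reference for the expected values, not chalkdf code.

```python
from collections import defaultdict

g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

# Accumulate the sum of v for each distinct value of g.
sums = defaultdict(int)
for key, value in zip(g, v):
    sums[key] += value

# Sort by group key so the comparison does not depend on row order.
expected = sorted(sums.items())
print(expected)  # [(1, 30), (2, 6)]
```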

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

# When column names differ, map each left column to its right counterpart
right_alt = DataFrame.from_arrow(pa.table({"lookup_key": [2, 3, 4], "y": [200, 300, 400]}))
joined3 = left.join(right_alt, on={"key": "lookup_key"}, how="left")
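
For reference, the rows an inner and a left join should return on the example data can be worked out in plain Python. This sketch assumes standard SQL-style join semantics and unique keys on the right side; it uses None for unmatched rows, whereas how chalkdf represents nulls is not stated here.

```python
left_rows = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right_rows = [{"key": 2, "y": 200}, {"key": 3, "y": 300}, {"key": 4, "y": 400}]

# Index the right side by key (assumes keys are unique on the right).
right_by_key = {r["key"]: r for r in right_rows}

# Inner join: keep only left rows that have a match on the right.
inner = [
    {"key": l["key"], "x": l["x"], "y": right_by_key[l["key"]]["y"]}
    for l in left_rows
    if l["key"] in right_by_key
]

# Left join: keep every left row; unmatched rows get None for y.
left_join = [
    {"key": l["key"], "x": l["x"], "y": right_by_key.get(l["key"], {}).get("y")}
    for l in left_rows
]

print(inner)
print(left_join)
```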

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
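
The round-trip idea can be illustrated without chalkdf: record each operation as data, serialize the list, rebuild it, and compare structurally. The toy below uses JSON rather than protobuf, and ToyPlan and Step are hypothetical names invented for this sketch; it is not how LazyFrame is implemented.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    op: str       # operation name, e.g. "select"
    args: tuple   # positional arguments recorded for the operation


@dataclass
class ToyPlan:
    steps: list = field(default_factory=list)

    def then(self, op, *args):
        # Return a new plan with one more recorded step (no execution).
        return ToyPlan(self.steps + [Step(op, tuple(args))])

    def to_json(self):
        return json.dumps([{"op": s.op, "args": list(s.args)} for s in self.steps])

    @classmethod
    def from_json(cls, payload):
        return cls([Step(d["op"], tuple(d["args"])) for d in json.loads(payload)])


plan = (
    ToyPlan()
    .then("select", "c1", "c2")
    .then("slice", 0, 10)
    .then("filter", "c1 > 0")
)
restored = ToyPlan.from_json(plan.to_json())
assert plan == restored  # dataclass equality compares the recorded steps
```

Equality here is purely structural: two plans are equal when they recorded the same operations with the same arguments in the same order, regardless of whether either has been executed.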

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed
Testing.assert_frame_equal(left, right, check_row_order=False, check_column_order=False)
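
To see why atol=1e-5 accepts 1.000001 against 1.0, it helps to write the tolerance check out. The sketch below uses the common NumPy-style rule |actual - expected| <= atol + rtol * |expected|; whether assert_frame_equal uses exactly this formula is an assumption, and values_close is a hypothetical helper.

```python
def values_close(actual, expected, atol=0.0, rtol=0.0):
    # NumPy-style tolerance check (assumed, not confirmed for chalkdf).
    return abs(actual - expected) <= atol + rtol * abs(expected)


# The difference is about 1e-6, which is within an absolute tolerance of 1e-5.
print(values_close(1.000001, 1.0, atol=1e-5, rtol=0.0))  # True

# A tighter absolute tolerance rejects the same pair.
print(values_close(1.000001, 1.0, atol=1e-7, rtol=0.0))  # False
```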

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

chalkdf-3.34.9-cp313-cp313-manylinux_2_36_x86_64.whl (163.9 MB)

Uploaded: CPython 3.13, manylinux (glibc 2.36+), x86-64

chalkdf-3.34.9-cp313-cp313-macosx_15_0_arm64.whl (128.9 MB)

Uploaded: CPython 3.13, macOS 15.0+, ARM64

chalkdf-3.34.9-cp312-cp312-manylinux_2_36_x86_64.whl (163.9 MB)

Uploaded: CPython 3.12, manylinux (glibc 2.36+), x86-64

chalkdf-3.34.9-cp312-cp312-macosx_15_0_arm64.whl (128.9 MB)

Uploaded: CPython 3.12, macOS 15.0+, ARM64

chalkdf-3.34.9-cp311-cp311-manylinux_2_36_x86_64.whl (163.9 MB)

Uploaded: CPython 3.11, manylinux (glibc 2.36+), x86-64

chalkdf-3.34.9-cp311-cp311-macosx_15_0_arm64.whl (128.9 MB)

Uploaded: CPython 3.11, macOS 15.0+, ARM64

chalkdf-3.34.9-cp310-cp310-manylinux_2_36_x86_64.whl (163.9 MB)

Uploaded: CPython 3.10, manylinux (glibc 2.36+), x86-64

chalkdf-3.34.9-cp310-cp310-macosx_15_0_arm64.whl (128.9 MB)

Uploaded: CPython 3.10, macOS 15.0+, ARM64

File details

Details for the file chalkdf-3.34.9-cp313-cp313-manylinux_2_36_x86_64.whl.

File hashes

Hashes for chalkdf-3.34.9-cp313-cp313-manylinux_2_36_x86_64.whl
Algorithm Hash digest
SHA256 2b7c41a2c7a2b4fb1db405a1a4437dd886e026b4ca69a35fb527de929f0d7f3f
MD5 dcdd6cf1cd285710b262f5c2f44227c6
BLAKE2b-256 2661efcd526e5c536453441f6eb29868b0032cb73bf3073a0bcc2af7919ee609
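
A downloaded wheel can be checked against a published SHA256 digest with the standard library. The sketch below is one way to do it; the helper names sha256_of and matches_published_hash are hypothetical, and the wheel path in the usage comment is a placeholder for wherever the file was saved.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    # Stream the file in chunks so large wheels are not loaded into memory at once.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, published_sha256):
    # Compare case-insensitively and ignore surrounding whitespace.
    return sha256_of(path) == published_sha256.strip().lower()


# Usage (placeholder path; the digest is the SHA256 value listed for the file):
# ok = matches_published_hash("path/to/downloaded.whl", "<published sha256>")
```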

Provenance

The following attestation bundles were made for chalkdf-3.34.9-cp313-cp313-manylinux_2_36_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file chalkdf-3.34.9-cp313-cp313-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.34.9-cp313-cp313-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 b88d431c405367be28cfed868c630b264f93841c3dec2796148574b6eb2907b8
MD5 938ffeb4740cce5712907d12052a3b91
BLAKE2b-256 665d3683057a989fc4954e6d33d1ada111c1964c965c5eec8909d3bfa527a941

Provenance

The following attestation bundles were made for chalkdf-3.34.9-cp313-cp313-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.34.9-cp312-cp312-manylinux_2_36_x86_64.whl.

File hashes

Hashes for chalkdf-3.34.9-cp312-cp312-manylinux_2_36_x86_64.whl
Algorithm Hash digest
SHA256 c8bc629ba8d78b231446823d52faef1f541c3311a56cae7d6f0a9e34f5cc18e0
MD5 8eb018aee8f3ed9ab5159c3e3fefc13b
BLAKE2b-256 ce4ba0435330a59b93c9ff8331de9929a672935ca05db44be894641c44857590

Provenance

The following attestation bundles were made for chalkdf-3.34.9-cp312-cp312-manylinux_2_36_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.34.9-cp312-cp312-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.34.9-cp312-cp312-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 354f906bb80487ede223c979a03ae70f4e099ab128d9ba7f12e8aad4830c7866
MD5 9f2a8ea2452739d6200cea092914fe6a
BLAKE2b-256 d0f80ebcd24a8af36a53b979ba8e19298977940e125e9c9e61fe2b50b5548a14

Provenance

The following attestation bundles were made for chalkdf-3.34.9-cp312-cp312-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.34.9-cp311-cp311-manylinux_2_36_x86_64.whl.

File hashes

Hashes for chalkdf-3.34.9-cp311-cp311-manylinux_2_36_x86_64.whl
Algorithm Hash digest
SHA256 d8e01bf180e9af2410fb3f6675d89c525b6e021e392e81d12e0c76d669b2cec7
MD5 8c09cb4548260cb665605acbfdada0cd
BLAKE2b-256 98f81513a08b57298e9274a6970b1787787450561c841bd702fdceda95643dda

Provenance

The following attestation bundles were made for chalkdf-3.34.9-cp311-cp311-manylinux_2_36_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.34.9-cp311-cp311-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.34.9-cp311-cp311-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 510f4c5a868f7a46f4154d5a678d6ae8bb893b415e4bc29ea767bd460b87515a
MD5 648bc2cde590c4e836cd8aa3b9219dbb
BLAKE2b-256 297a014ed4413645d83ea26f18757ff50e2269d3006cb605be8be19d876dca8c

Provenance

The following attestation bundles were made for chalkdf-3.34.9-cp311-cp311-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.34.9-cp310-cp310-manylinux_2_36_x86_64.whl.

File hashes

Hashes for chalkdf-3.34.9-cp310-cp310-manylinux_2_36_x86_64.whl
Algorithm Hash digest
SHA256 76ea5db00efa01808c7c8b06bef787a3437aa62eefa67bf1efcd8f092c26ee83
MD5 31bf68d421f29e970f5d592f9e93d55b
BLAKE2b-256 f188db02a6aea629c6494191ae04baf14b75fc8ccc87b8345ff77a5b915f8004

Provenance

The following attestation bundles were made for chalkdf-3.34.9-cp310-cp310-manylinux_2_36_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.34.9-cp310-cp310-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.34.9-cp310-cp310-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 601559dc1028cc1fd913b3194def8e561a802ec12e95d4edd9ea48cc1a19bc37
MD5 e38019de74899d9b6abb65ec97ba36c1
BLAKE2b-256 05423f672e4f531d2ea6d74f418d5b93ba683f9ec85ec0d01f8c80e270b3a8d3

Provenance

The following attestation bundles were made for chalkdf-3.34.9-cp310-cp310-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private
