
DataFrame utilities for Chalk AI, backed by libchalk


chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight plan that you build by chaining operations and then materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux (x86-64, glibc 2.36+) and macOS 15+ (ARM64)

Install from PyPI:

pip install chalkdf chalkpy

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
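
The order of the calls matters here. As a rough illustration, the plain-Python sketch below models the same chain on the same data. It does not call chalkdf; it assumes that slice(1, 2) means "skip one row, keep two" (consistent with the slice(0, length=10) call used later in this README) and that filtering preserves input order.

```python
# Plain-Python model of filter -> slice -> order_by; this is not chalkdf itself.
rows = [{"a": a, "b": b} for a, b in zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])]

# filter(_.a > 2): keep rows whose "a" exceeds 2.
filtered = [r for r in rows if r["a"] > 2]

# slice(1, 2): assumed to mean offset=1, length=2.
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# order_by(("b", "descending")): sort by "b", largest first.
ordered = sorted(sliced, key=lambda r: r["b"], reverse=True)

print(ordered)
```

Because the slice runs before the sort, only the two surviving rows are reordered; sorting first and slicing afterwards would pick different rows.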

Force caching of a reused subcomputation

DataFrame.replay() is an escape hatch for tests and carefully inspected plans where you need to force a subcomputation to be cached behind an explicit replay boundary. Most plans should let the optimizer place replay nodes automatically; use this only when you know a shared subplan is expensive enough to compute once and reuse.

from chalk import _
from chalkdf import DataFrame

df = DataFrame.from_dict({"id": [1, 2, 3], "amount": [10, 20, 30]})
shared = df.with_columns({"amount_cents": _.amount * 100}).replay("shared_amount_cents")

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # (1,1), (1,2)
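
The comment above shows that the row with the empty list produces no output. A minimal plain-Python sketch of that behaviour follows; the explode helper here is a hypothetical stand-in written for illustration, not the chalkdf implementation.

```python
# Plain-Python model of explode on a list column; this is not chalkdf itself.
rows = [{"id": 1, "values": [1, 2]}, {"id": 2, "values": []}]


def explode(rows, column):
    """Emit one output row per list element; empty lists emit nothing."""
    out = []
    for row in rows:
        for item in row[column]:
            out.append({**row, column: item})
    return out


exploded = explode(rows, "values")
print(exploded)
```

Note that id 2 disappears from the result, so a row count taken after explode can be smaller than the number of input rows.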

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
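
For reference, the grouped sums for this data can be checked by hand. The sketch below computes them in plain Python (no chalkdf involved) and stores them in a dict, which sidesteps the row-order question.

```python
from collections import defaultdict

# Same data as the example above.
g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

# Accumulate the sum of "v" for each distinct value of "g".
sums = defaultdict(int)
for key, value in zip(g, v):
    sums[key] += value

v_sum = dict(sums)
print(v_sum)
```

Since the output rows may arrive in any order, a test against the real result would either sort by g first or compare with check_row_order=False, as described under Testing helpers.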

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

# When column names differ, map each left column to its right counterpart
lookup = DataFrame.from_arrow(pa.table({"lookup_key": [2, 3, 4], "y": [200, 300, 400]}))
joined3 = left.join(lookup, on={"key": "lookup_key"}, how="left")
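
To show what a left-to-right key mapping means, here is a small plain-Python model of inner and left joins over lists of dicts. The join function and the sample rows are hypothetical and written for illustration only. How chalkdf names or keeps the right-hand key columns in its output is not covered by this README, so the model simply drops them.

```python
# Plain-Python model of a keyed join; this is not chalkdf itself.
def join(left, right, on, how="inner"):
    """Join lists of dicts; `on` maps left column names to right column names."""
    right_key_cols = set(on.values())
    # Non-key columns of the right side (assumption: right keys are dropped).
    right_extra_cols = [c for c in (right[0] if right else {}) if c not in right_key_cols]
    out = []
    for l_row in left:
        matches = [r for r in right if all(l_row[lc] == r[rc] for lc, rc in on.items())]
        if matches:
            for r_row in matches:
                out.append({**l_row, **{c: r_row[c] for c in right_extra_cols}})
        elif how == "left":
            # Unmatched left rows are kept, with nulls for the right columns.
            out.append({**l_row, **{c: None for c in right_extra_cols}})
    return out


left_rows = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
lookup_rows = [{"lookup_key": 2, "y": 200}, {"lookup_key": 3, "y": 300}, {"lookup_key": 4, "y": 400}]

inner = join(left_rows, lookup_rows, on={"key": "lookup_key"}, how="inner")
left_joined = join(left_rows, lookup_rows, on={"key": "lookup_key"}, how="left")
print(inner)
print(left_joined)
```

The inner join keeps only keys 2 and 3, while the left join also keeps key 1 with a null y.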

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
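
The idea of recording operations as data, serializing them, and comparing plans structurally can be sketched in a few lines. The example below is a conceptual model only: Plan and Step are hypothetical classes, JSON stands in for protobuf, and a plain string stands in for the filter expression. It says nothing about how LazyFrame is implemented internally.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    op: str       # operation name, e.g. "select"
    args: tuple   # positional arguments recorded for the operation


@dataclass(frozen=True)
class Plan:
    steps: tuple = ()

    def then(self, op, *args):
        """Return a new plan with one more recorded step (plans are immutable)."""
        return Plan(self.steps + (Step(op, args),))

    def to_json(self):
        return json.dumps([{"op": s.op, "args": list(s.args)} for s in self.steps])

    @staticmethod
    def from_json(payload):
        return Plan(tuple(Step(d["op"], tuple(d["args"])) for d in json.loads(payload)))


plan = Plan().then("select", "c1", "c2").then("slice", 0, 10).then("filter", "c1 > 0")
restored = Plan.from_json(plan.to_json())

# Dataclass equality compares the recorded steps, not object identity.
print(plan == restored)
```

Nothing is executed while the plan is being recorded, which is what makes it cheap to ship a plan elsewhere and run it there.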

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed (the tolerance is still passed explicitly)
Testing.assert_frame_equal(left, right, atol=1e-5, check_row_order=False, check_column_order=False)
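
To see why atol=1e-5 lets 1.000001 compare equal to 1.0, here is a minimal sketch of a tolerance check. The formula is the convention used by NumPy's isclose; whether chalkdf combines atol and rtol in exactly this way is an assumption, and is_close is a hypothetical helper.

```python
def is_close(actual, expected, atol=0.0, rtol=0.0):
    """Assumed rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# The difference is about 1e-6, so it passes with atol=1e-5 ...
loose = is_close(1.000001, 1.0, atol=1e-5, rtol=0.0)
# ... and fails with a tighter atol=1e-7.
tight = is_close(1.000001, 1.0, atol=1e-7, rtol=0.0)

print(loose, tight)
```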

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.

