DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux (x86-64, glibc 2.36+) and macOS 15+ (ARM64)

Install from PyPI:

pip install chalkdf chalkpy
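
Before installing, it can help to confirm that the interpreter falls inside the supported range. The following is a minimal sketch; the helper name `is_supported_python` is hypothetical and is not part of chalkdf.

```python
import sys


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True when (major, minor) is within the 3.10 to 3.13 range stated above."""
    major_minor = tuple(version_info[:2])
    return (3, 10) <= major_minor <= (3, 13)


print("supported" if is_supported_python() else "unsupported")
```

Passing an explicit tuple such as `(3, 9, 18)` lets you check a target interpreter other than the one currently running.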

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
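
For reference, the two expressions above should compute the following values per row. This plain-Python sketch does not use chalkdf; it only mirrors the arithmetic and the conditional, and it assumes that `F.if_then_else(cond, x, y)` yields `x` where `cond` holds and `y` otherwise.

```python
rows = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]

# Add "sum" and "flag" to every row while keeping the original columns.
expected = [
    {**r, "sum": r["a"] + r["b"], "flag": "small" if r["a"] < 2 else "big"}
    for r in rows
]

for r in expected:
    print(r)
```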

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
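
The order of the chained calls matters, so it may help to trace the chain by hand. The sketch below is plain Python, not chalkdf. It assumes that `slice(1, 2)` means offset 1 and length 2 (consistent with the `slice(0, length=10)` call in the lazy-plan example) and that filtering preserves input row order.

```python
rows = [{"a": a, "b": b} for a, b in zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])]

# Step 1: filter(_.a > 2) keeps a = 3, 4, 5.
filtered = [r for r in rows if r["a"] > 2]

# Step 2: slice(1, 2) skips one row and keeps the next two, so a = 4, 5.
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# Step 3: order_by(("b", "descending")) sorts the remaining rows by b, largest first.
ordered = sorted(sliced, key=lambda r: r["b"], reverse=True)

print(ordered)
```

Under these assumptions, slicing before ordering selects rows by their input position, not by rank in `b`.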

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # (1,1), (1,2)
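
The comment above suggests that rows whose list is empty produce no output. The sketch below models that behavior in plain Python. The `explode` function is a hypothetical stand-in, and dropping empty lists is an assumption inferred from that comment, not a documented guarantee.

```python
rows = [{"id": 1, "values": [1, 2]}, {"id": 2, "values": []}]


def explode(rows, column):
    """Emit one output row per list element; a row with an empty list emits nothing."""
    out = []
    for row in rows:
        for item in row[column]:
            out.append({**row, column: item})
    return out


exploded = explode(rows, "values")
print(exploded)
```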

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
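
Because the output rows may arrive in any order, it is often easiest to sort by the group key before comparing against expected values. The following plain-Python sketch (no chalkdf involved) computes the per-group sums for the same data and sorts them.

```python
from collections import defaultdict

rows = [(1, 10), (1, 20), (2, 1), (2, 2), (2, 3)]  # (g, v) pairs

# Accumulate the sum of v for each distinct g.
totals = defaultdict(int)
for g, v in rows:
    totals[g] += v

# Sort by the group key so comparisons do not depend on output order.
v_sum = sorted(totals.items())
print(v_sum)
```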

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

# When column names differ, map each left column to its right counterpart
right_alt = DataFrame.from_arrow(pa.table({"lookup_key": [2, 3, 4], "y": [200, 300, 400]}))
joined3 = left.join(right_alt, on={"key": "lookup_key"}, how="left")
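
To make the key mapping concrete, here is a plain-Python model of a join in which `on` maps left column names to right column names. The `join` function is hypothetical. Its output layout (right key columns dropped, unmatched left rows filled with `None`) follows common SQL-style conventions and is an assumption; chalkdf may differ.

```python
def join(left, right, on, how="inner"):
    """Join two lists of dicts; `on` maps left column names to right column names."""
    # Index right rows by their key tuple for constant-time lookup.
    index = {}
    for rrow in right:
        index.setdefault(tuple(rrow[rk] for rk in on.values()), []).append(rrow)
    # Non-key columns contributed by the right side.
    right_only = [c for c in (right[0] if right else {}) if c not in on.values()]
    out = []
    for lrow in left:
        matches = index.get(tuple(lrow[lk] for lk in on), [])
        for rrow in matches:
            out.append({**lrow, **{c: rrow[c] for c in right_only}})
        if not matches and how == "left":
            # A left join keeps unmatched left rows with nulls on the right.
            out.append({**lrow, **{c: None for c in right_only}})
    return out


left = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right = [
    {"lookup_key": 2, "y": 200},
    {"lookup_key": 3, "y": 300},
    {"lookup_key": 4, "y": 400},
]

inner = join(left, right, on={"key": "lookup_key"}, how="inner")
left_join = join(left, right, on={"key": "lookup_key"}, how="left")
print(inner)
print(left_join)
```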

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
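
The idea of recording operations and comparing plans structurally can be illustrated with a toy model. This sketch is not how chalkdf is implemented: the `Plan` class is hypothetical, and JSON stands in for protobuf purely to keep the example standard-library only.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A toy recorded plan: an immutable tuple of (operation, arguments) steps."""

    steps: tuple = ()

    def _then(self, op, *args):
        # Each call returns a new plan, so earlier plans stay unchanged.
        return Plan(self.steps + ((op, args),))

    def select(self, *columns):
        return self._then("select", *columns)

    def slice(self, offset, length):
        return self._then("slice", offset, length)

    def to_json(self):
        return json.dumps([[op, list(args)] for op, args in self.steps])

    @classmethod
    def from_json(cls, payload):
        return cls(tuple((op, tuple(args)) for op, args in json.loads(payload)))


plan = Plan().select("c1", "c2").slice(0, 10)
restored = Plan.from_json(plan.to_json())
assert plan == restored  # same steps in the same order
```

Nothing executes while the plan is being built. Equality here compares the recorded steps only, which is the sense of "structural equality" used in the example above.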

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed (tolerance kept so the floats still match)
Testing.assert_frame_equal(left, right, atol=1e-5, check_row_order=False, check_column_order=False)
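
To show what a relaxed comparison might mean, the sketch below uses the NumPy-style rule `|actual - expected| <= atol + rtol * |expected|`. Whether chalkdf applies exactly this formula, and what its default tolerances are, is an assumption. The helpers `values_close` and `rows_equal` are hypothetical and use plain Python only.

```python
def values_close(actual, expected, atol=0.0, rtol=0.0):
    """NumPy-style closeness: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def rows_equal(left, right, atol=0.0, rtol=0.0, check_row_order=True):
    """Compare two lists of numeric tuples, optionally ignoring row order."""
    if len(left) != len(right):
        return False
    if not check_row_order:
        # Sorting is a simple way to ignore order; it can misalign rows whose
        # values differ only within the tolerance.
        left, right = sorted(left), sorted(right)
    return all(
        len(lrow) == len(rrow)
        and all(values_close(a, b, atol, rtol) for a, b in zip(lrow, rrow))
        for lrow, rrow in zip(left, right)
    )


print(rows_equal([(1.000001, 2.0)], [(1.0, 2.0)], atol=1e-5))  # within tolerance
print(rows_equal([(1.000001, 2.0)], [(1.0, 2.0)]))             # exact comparison
```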

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.


Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • chalkdf-3.31.38-cp313-cp313-manylinux_2_36_x86_64.whl (157.9 MB): CPython 3.13, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.38-cp313-cp313-macosx_15_0_arm64.whl (125.9 MB): CPython 3.13, macOS 15.0+, ARM64
  • chalkdf-3.31.38-cp312-cp312-manylinux_2_36_x86_64.whl (157.9 MB): CPython 3.12, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.38-cp312-cp312-macosx_15_0_arm64.whl (125.9 MB): CPython 3.12, macOS 15.0+, ARM64
  • chalkdf-3.31.38-cp311-cp311-manylinux_2_36_x86_64.whl (157.9 MB): CPython 3.11, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.38-cp311-cp311-macosx_15_0_arm64.whl (125.8 MB): CPython 3.11, macOS 15.0+, ARM64
  • chalkdf-3.31.38-cp310-cp310-manylinux_2_36_x86_64.whl (157.9 MB): CPython 3.10, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.38-cp310-cp310-macosx_15_0_arm64.whl (125.9 MB): CPython 3.10, macOS 15.0+, ARM64

File details

Hashes for chalkdf-3.31.38-cp313-cp313-manylinux_2_36_x86_64.whl
Algorithm    Hash digest
SHA256       afcd5e5ff910e100ccf4b22b62a741f6e2fe1a0b832f912102d839a7e7696ba5
MD5          4037b46e0959b66040d3c9305564bd10
BLAKE2b-256  6976c008b0aa8541e00d6e49a64970d892b882c4af4bb3a1c4a30f53bdc7c0a1

Hashes for chalkdf-3.31.38-cp313-cp313-macosx_15_0_arm64.whl
Algorithm    Hash digest
SHA256       2579f12464c749be16be95a47bc6575b9732c98d1c5c484692de68953f2e9695
MD5          bb81e9d2a53c8a751c13bed0d860d683
BLAKE2b-256  4adabc99bce3743dabbf8aaaf5f29c4a1ff4e69376f948ccc468826af29c6dff

Hashes for chalkdf-3.31.38-cp312-cp312-manylinux_2_36_x86_64.whl
Algorithm    Hash digest
SHA256       23b3485d5b140252bd6d4a7ecf471f42095296dfb4ed3979d7e2ce5a9d5dc251
MD5          b8c9b2fd6d5d84cac2afc6c8df6145c5
BLAKE2b-256  13a7002cb15726c07a0d2fe4753619030d679b2d8813bd674c9465e474a46ca0

Hashes for chalkdf-3.31.38-cp312-cp312-macosx_15_0_arm64.whl
Algorithm    Hash digest
SHA256       9f38c807a6961d3ab253c3bc23727a90fad705523294ae86a80d93dd512b02c9
MD5          871ab8cf0e8b24b5a2d1fff7cef8d52e
BLAKE2b-256  987fb4da0fc2deb83778f32b2000a76e52b200d0ad6bca91ae0a283dd485522a

Hashes for chalkdf-3.31.38-cp311-cp311-manylinux_2_36_x86_64.whl
Algorithm    Hash digest
SHA256       2a6a78a8f6bb1fc919eb0b7bdf3c0445ffefe691da42a63377650140a1d98314
MD5          886be90439fd03cd37f4046340fe8672
BLAKE2b-256  417270860e0fd86e6ea0b6f4f69b4dc243ac8a2c85a0e2fa95891fc31c24a8d8

Hashes for chalkdf-3.31.38-cp311-cp311-macosx_15_0_arm64.whl
Algorithm    Hash digest
SHA256       2d5c7a57b25852c1b98a795cec584c6f9cc24e6b02d25cfb76297b8af36bb908
MD5          57d75e56a988a9e526f266babd0c958e
BLAKE2b-256  f84a8059d2beda4a51830c39e2a2ee84756cad7723737954d63c8be8186fdd93

Hashes for chalkdf-3.31.38-cp310-cp310-manylinux_2_36_x86_64.whl
Algorithm    Hash digest
SHA256       df21fce3c46ad19f034107eda83c07fccc61ee04c6e6764a4e1d2fc28dd8d040
MD5          ac7b1b3208823a090d6ddc2eac94080c
BLAKE2b-256  c173066df53b8f50554744c327d5f5f4486d3108e14da4a2a569039c6710a957

Hashes for chalkdf-3.31.38-cp310-cp310-macosx_15_0_arm64.whl
Algorithm    Hash digest
SHA256       182123bf1e5c0a3e2b5438301ea96f62d06a9fa380a9fa9102b096c533814bb4
MD5          2ff4b7ce889f62fc833d8df0ad2843ba
BLAKE2b-256  05e4e55a8c6696accca8b13cfdd825b653bf66ffda7ae6f368b5fdcf15a558cc

Provenance

Attestation bundles for each of the files above were made by the publisher build-dataframe.yml on chalk-ai/chalk-private. Values shown here reflect the state when the release was signed and may no longer be current.
