DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight plan that you build by chaining operations and then materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux (x86-64, glibc 2.35+) and macOS 15+ (ARM64)

Install from PyPI:

pip install chalkdf chalkpy
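
To confirm that an interpreter falls in the supported range before installing, a check along these lines may help. The helper name is_supported is hypothetical; the bounds come from the requirements listed above, and the sketch does not inspect the macOS version or the CPU architecture.

```python
import platform
import sys


def is_supported(version=None, system=None):
    """Return True if the Python version and OS match the stated requirements."""
    # Default to the running interpreter and operating system.
    version = tuple(version or sys.version_info[:2])
    system = system or platform.system()
    # Python 3.10 through 3.13, on Linux or macOS ("Darwin").
    return (3, 10) <= version <= (3, 13) and system in {"Linux", "Darwin"}


print(is_supported())
```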

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
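
The chain above prints its result without showing it. The plain-Python model below works through the same steps by hand. It assumes that slice takes an offset followed by a length, that operations apply in the order they are chained, and that filter preserves input order; none of these is confirmed here for chalkdf itself.

```python
rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]  # (a, b) pairs

# filter(_.a > 2): keep rows whose first field exceeds 2
filtered = [r for r in rows if r[0] > 2]

# slice(1, 2): skip one row, then take two (offset/length is an assumption)
sliced = filtered[1:1 + 2]

# order_by(("b", "descending")): sort by the second field, largest first
ordered = sorted(sliced, key=lambda r: r[1], reverse=True)

print(ordered)
```

Because each step consumes the previous step's output, moving order_by before slice would select different rows.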

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # (1,1), (1,2)

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
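
As a cross-check for unit tests, the per-group sums for the sample data can be computed by hand. This is a standard-library sketch, not chalkdf code; it uses a dictionary keyed by group precisely because the row order of the aggregated result is not guaranteed.

```python
from collections import defaultdict

g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

# Accumulate the sum of v for each distinct value of g.
v_sum = defaultdict(int)
for key, value in zip(g, v):
    v_sum[key] += value

print(dict(v_sum))
```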

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

# When column names differ, map each left column to its right counterpart
right_alt = DataFrame.from_arrow(pa.table({"lookup_key": [2, 3, 4], "y": [200, 300, 400]}))
joined3 = left.join(right_alt, on={"key": "lookup_key"}, how="left")
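
The two forms of on described above can be pictured as one left-to-right key mapping, where a list of shared names is shorthand for mapping each name to itself. The helpers normalize_on and inner_join below are hypothetical and only model inner-join matching over lists of dicts; which columns chalkdf keeps in its output is not specified here.

```python
def normalize_on(on):
    """Turn a list of shared key names into a left->right mapping."""
    return dict(on) if isinstance(on, dict) else {name: name for name in on}


def inner_join(left_rows, right_rows, on):
    """Nested-loop inner join over lists of dicts (illustration only)."""
    mapping = normalize_on(on)
    right_keys = set(mapping.values())
    out = []
    for lrow in left_rows:
        for rrow in right_rows:
            if all(lrow[lk] == rrow[rk] for lk, rk in mapping.items()):
                # Keep left columns, then add right columns that are not join keys.
                extra = {k: val for k, val in rrow.items() if k not in right_keys}
                out.append({**lrow, **extra})
    return out


left_rows = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right_rows = [
    {"lookup_key": 2, "y": 200},
    {"lookup_key": 3, "y": 300},
    {"lookup_key": 4, "y": 400},
]

joined = inner_join(left_rows, right_rows, on={"key": "lookup_key"})
print(joined)
```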

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    "my_table",
    ["file:///path/to/data.parquet"],
)

print(df.run())  # materializes and prints a preview
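
Since the scan example expects file:// URIs rather than bare paths, a small helper can build them from local paths. The name to_file_uris is hypothetical, and the sketch uses only the standard library's pathlib.

```python
from pathlib import Path


def to_file_uris(paths):
    """Convert local paths (absolute or relative) to absolute file:// URIs."""
    # resolve() makes the path absolute, which as_uri() requires.
    return [Path(p).expanduser().resolve().as_uri() for p in paths]


uris = to_file_uris(["/path/to/data.parquet", "data.parquet"])
print(uris)
```

The resulting list can then be passed as the second argument to DataFrame.scan, in place of the hard-coded URI shown earlier.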

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame()
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For catalog‑backed tables, use LazyFrame.from_catalog_table("table_name") in your chain and provide a ChalkSqlCatalog when converting.

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed (the tolerance is kept because the values in "a" differ slightly)
Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0, check_row_order=False, check_column_order=False)
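
For intuition about atol and rtol, one common convention (the one used by NumPy's isclose) accepts two values when |actual - expected| <= atol + rtol * |expected|. Whether assert_frame_equal applies exactly this formula is an assumption; the function within_tolerance below is a hypothetical sketch of the convention, not chalkdf's code.

```python
def within_tolerance(actual, expected, atol=0.0, rtol=0.0):
    """Check closeness using an absolute plus relative tolerance."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# The values from the example above differ by about 1e-6.
print(within_tolerance(1.000001, 1.0, atol=1e-5))  # inside the absolute tolerance
print(within_tolerance(1.000001, 1.0))             # exact comparison
```

Under this convention, the first check accepts the pair and the second, with both tolerances at zero, rejects it.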

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.
