DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Prebuilt wheels are published for Linux (x86-64, glibc 2.36+) and macOS 15+ (ARM64)

Install from PyPI:

pip install chalkdf chalkpy

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
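For reference, the values this example should produce can be worked out in plain Python. The sketch below does not call chalkdf. It assumes that F.if_then_else(cond, x, y) yields x where cond is true and y otherwise, and it does not try to reproduce the formatting of run().

```python
# Plain-Python rendering of the two expressions above (not chalkdf code).
a = [1, 2, 3]
b = [10, 20, 30]

# "sum": _.a + _.b is an element-wise addition.
sum_col = [x + y for x, y in zip(a, b)]

# "flag": assumed to be a row-wise if/then/else on the condition a < 2.
flag_col = ["small" if x < 2 else "big" for x in a]

print(sum_col)   # [11, 22, 33]
print(flag_col)  # ['small', 'big', 'big']
```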

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
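The order of the chained calls matters: the slice is applied to the filtered rows, and the sort is applied to the sliced rows. A plain-Python sketch of the same steps is shown here. It assumes slice(1, 2) means "offset 1, length 2" (consistent with slice(0, length=10) in the lazy example) and that filter preserves input order; neither is confirmed by this README.

```python
# Plain-Python walk-through of filter -> slice -> order_by (not chalkdf code).
rows = [{"a": a, "b": b} for a, b in zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])]

# filter(_.a > 2): keeps the rows where a is 3, 4, 5
filtered = [r for r in rows if r["a"] > 2]

# slice(1, 2): assumed to skip 1 row, then take 2 rows
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# order_by(("b", "descending")): sort the remaining rows by b, largest first
ordered = sorted(sliced, key=lambda r: r["b"], reverse=True)

print(ordered)  # [{'a': 5, 'b': 50}, {'a': 4, 'b': 40}]
```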

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # (1,1), (1,2)
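The comment in the example shows only two output rows, which implies that the row with an empty list contributes nothing. A plain-Python sketch of that behavior (an illustration of the semantics, not chalkdf's implementation):

```python
# Plain-Python rendering of rename + explode (not chalkdf code).
rows = [{"id": 1, "values": [1, 2]}, {"id": 2, "values": []}]

# Each list element becomes its own row; an empty list yields no rows.
exploded_rows = [
    {"id": r["id"], "values": v}
    for r in rows
    for v in r["values"]
]

print(exploded_rows)  # [{'id': 1, 'values': 1}, {'id': 1, 'values': 2}]
```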

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
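Because the output rows may be unordered, it helps to know the expected totals per group independently of row order. A plain-Python sketch of the same grouped sum (not chalkdf code):

```python
from collections import defaultdict

g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

# Accumulate v per distinct value of g, as a grouped sum would.
totals = defaultdict(int)
for key, value in zip(g, v):
    totals[key] += value

v_sum = dict(totals)
print(sorted(v_sum.items()))  # [(1, 30), (2, 6)]
```

Sorting the result before comparing, as done here, is one way to make assertions independent of row order.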

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

# When column names differ, map each left column to its right counterpart
right_renamed = right.rename({"key": "lookup_key"})
joined3 = left.join(right_renamed, on={"key": "lookup_key"}, how="left")
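To make the left-to-right key mapping concrete, here is a small plain-Python sketch of inner and left joins over lists of dicts. The helper join_rows and the lookup_rows data are hypothetical, written for illustration only. The sketch assumes standard join semantics (unmatched left rows get None in a left join) and drops the right-hand key columns from the output; chalkdf's actual output columns may differ.

```python
def join_rows(left_rows, right_rows, on, how="inner"):
    """Join two lists of dicts; `on` maps left column names to right ones."""
    right_keys = set(on.values())
    first = right_rows[0] if right_rows else {}
    extra = [c for c in first if c not in right_keys]  # non-key right columns

    # Index right rows by their key tuple for fast lookup.
    index = {}
    for r in right_rows:
        index.setdefault(tuple(r[c] for c in on.values()), []).append(r)

    out = []
    for l in left_rows:
        matches = index.get(tuple(l[c] for c in on), [])
        for r in matches:
            out.append({**l, **{c: r[c] for c in extra}})
        if not matches and how == "left":
            out.append({**l, **{c: None for c in extra}})
    return out


left_rows = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right_rows = [{"key": 2, "y": 200}, {"key": 3, "y": 300}, {"key": 4, "y": 400}]
lookup_rows = [{"lookup_key": r["key"], "y": r["y"]} for r in right_rows]

inner = join_rows(left_rows, right_rows, {"key": "key"})
left_out = join_rows(left_rows, lookup_rows, {"key": "lookup_key"}, how="left")

print(inner)     # [{'key': 2, 'x': 20, 'y': 200}, {'key': 3, 'x': 30, 'y': 300}]
print(left_out)  # first row is {'key': 1, 'x': 10, 'y': None}
```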

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview
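If your paths are relative or built at runtime, the standard library can produce well-formed file:// URIs. The helper to_file_uri and the file names below are hypothetical; the resulting list could then be passed as the first argument to DataFrame.scan.

```python
from pathlib import Path


def to_file_uri(path: str) -> str:
    """Return an absolute file:// URI for a local path."""
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(path).resolve().as_uri()


# Hypothetical file names, for illustration only.
uris = [to_file_uri(p) for p in ["data/part-0.parquet", "data/part-1.parquet"]]
print(uris)
```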

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
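The idea of recording operations, serializing them, and comparing plans structurally can be illustrated with a toy class. This is only a conceptual sketch: ToyPlan is hypothetical, JSON stands in for protobuf, and nothing here reflects how LazyFrame is actually implemented.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPlan:
    """A toy stand-in for a lazy plan: an ordered tuple of (op, args) steps."""

    steps: tuple = ()

    def _add(self, op, *args):
        # Each call returns a new plan; nothing is executed.
        return ToyPlan(self.steps + ((op, args),))

    def select(self, *cols):
        return self._add("select", *cols)

    def slice(self, offset, length):
        return self._add("slice", offset, length)

    def to_json(self):
        return json.dumps([[op, list(args)] for op, args in self.steps])

    @classmethod
    def from_json(cls, payload):
        return cls(tuple((op, tuple(args)) for op, args in json.loads(payload)))


plan = ToyPlan().select("c1", "c2").slice(0, 10)
restored = ToyPlan.from_json(plan.to_json())
print(plan == restored)  # True: same steps in the same order
```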

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed (tolerance kept, since the frames differ by 1e-6)
Testing.assert_frame_equal(left, right, atol=1e-5, check_row_order=False, check_column_order=False)
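This README does not state how atol and rtol are combined. A common convention, used for example by NumPy's isclose, treats two values as equal when |actual - expected| <= atol + rtol * |expected|. The sketch below assumes that convention purely for illustration; within_tolerance is a hypothetical helper, not part of chalkdf.

```python
def within_tolerance(actual, expected, atol=0.0, rtol=0.0):
    """Assumed convention: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# The values from the example above differ by about 1e-6.
print(within_tolerance(1.000001, 1.0, atol=1e-5))  # True
print(within_tolerance(1.000001, 1.0, atol=1e-7))  # False
```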

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • chalkdf-3.31.87-cp313-cp313-manylinux_2_36_x86_64.whl (162.2 MB): CPython 3.13, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.87-cp313-cp313-macosx_15_0_arm64.whl (127.7 MB): CPython 3.13, macOS 15.0+, ARM64
  • chalkdf-3.31.87-cp312-cp312-manylinux_2_36_x86_64.whl (162.2 MB): CPython 3.12, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.87-cp312-cp312-macosx_15_0_arm64.whl (127.7 MB): CPython 3.12, macOS 15.0+, ARM64
  • chalkdf-3.31.87-cp311-cp311-manylinux_2_36_x86_64.whl (162.2 MB): CPython 3.11, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.87-cp311-cp311-macosx_15_0_arm64.whl (127.7 MB): CPython 3.11, macOS 15.0+, ARM64
  • chalkdf-3.31.87-cp310-cp310-manylinux_2_36_x86_64.whl (162.1 MB): CPython 3.10, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.87-cp310-cp310-macosx_15_0_arm64.whl (127.7 MB): CPython 3.10, macOS 15.0+, ARM64
