DataFrame utilities for Chalk AI, backed by libchalk

Project description

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux (x86-64, glibc 2.36+) and macOS 15+ (ARM64)

Install from PyPI:

pip install chalkdf chalkpy
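
Before installing, it can help to confirm the interpreter falls inside the supported range. The following is a minimal sketch using only the standard library; the helper names `python_supported` and `describe_platform` are illustrative and are not part of chalkdf.

```python
import platform
import sys


def python_supported(version_info=sys.version_info):
    """Return True when the (major, minor) version is within 3.10 to 3.13."""
    return (3, 10) <= tuple(version_info[:2]) <= (3, 13)


def describe_platform():
    """Return the OS name and CPU architecture, e.g. for a bug report."""
    return f"{platform.system()} {platform.machine()}"


if __name__ == "__main__":
    print("Python supported:", python_supported())
    print("Platform:", describe_platform())
```

The check only covers the Python version; the operating system and architecture are printed so you can compare them against the requirements yourself.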

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
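
To sanity-check the new columns, the same values can be computed in plain Python. This sketch assumes `F.if_then_else` takes its arguments in the order (condition, value if true, value if false); it models the expected values only and does not call chalkdf.

```python
# Same input columns as the example above.
a = [1, 2, 3]
b = [10, 20, 30]

# "sum" is the element-wise a + b.
expected_sum = [x + y for x, y in zip(a, b)]

# "flag" is "small" where a < 2 and "big" otherwise (assumed argument order).
expected_flag = ["small" if x < 2 else "big" for x in a]

print(expected_sum)
print(expected_flag)
```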

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
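
The order of the chained calls matters, so it is worth tracing the pipeline by hand. The sketch below models it with plain Python lists, assuming `slice(1, 2)` means "offset 1, length 2" (suggested by the `slice(0, length=10)` call in the lazy example) and that filtering preserves input order. It is not chalkdf code.

```python
rows = [{"a": a, "b": b} for a, b in zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])]

# filter(_.a > 2): keep rows where a is greater than 2.
filtered = [r for r in rows if r["a"] > 2]

# slice(1, 2): skip one row, then take two (assumed offset/length semantics).
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# order_by(("b", "descending")): sort the remaining rows by b, largest first.
ordered = sorted(sliced, key=lambda r: r["b"], reverse=True)

print(ordered)
```

Moving the sort ahead of the slice would select different rows, which is why the slice here is applied to the filtered rows in their original order.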

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # (1,1), (1,2)
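
The comment in the example implies that a row whose list is empty produces no output rows. A plain-Python model of that behavior, written as a sketch rather than as chalkdf's implementation (the `explode` helper here is hypothetical):

```python
def explode(rows, column):
    """Emit one output row per list element; empty lists emit nothing."""
    out = []
    for row in rows:
        for item in row[column]:
            out.append({**row, column: item})
    return out


rows = [{"id": 1, "values": [1, 2]}, {"id": 2, "values": []}]
exploded_rows = explode(rows, "values")
print(exploded_rows)
```

Some other DataFrame libraries emit a null row for an empty list instead, so check the actual output if your pipeline depends on keeping every `id`.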

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
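
Because the output rows may come back in any order, sort them before comparing against expected values. The sketch below computes the expected per-group sums in plain Python for the same input; it does not use chalkdf.

```python
from collections import defaultdict

g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

# Accumulate the sum of v for each distinct value of g.
sums = defaultdict(int)
for key, value in zip(g, v):
    sums[key] += value

# Sort by group key so the comparison does not depend on row order.
expected = sorted(sums.items())
print(expected)
```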

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

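# When column names differ, map each left column to its right counterpart
right_lookup = DataFrame.from_arrow(pa.table({"lookup_key": [2, 3, 4], "y": [200, 300, 400]}))
joined3 = left.join(right_lookup, on={"key": "lookup_key"}, how="left")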
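
To make the key mapping concrete, here is a small plain-Python model of inner and left joins over lists of dicts. It is a sketch of the general technique, not chalkdf's implementation: `join_rows` is a hypothetical helper, it drops the right-hand key columns from the output, and it ignores name collisions between non-key columns.

```python
def join_rows(left, right, on, how="inner"):
    """Join lists of dicts; `on` maps left column names to right column names."""
    # Index the right side by its key columns for O(1) lookups.
    index = {}
    for r in right:
        key = tuple(r[rc] for rc in on.values())
        index.setdefault(key, []).append(r)

    right_cols = [c for c in (right[0] if right else {}) if c not in on.values()]
    out = []
    for l in left:
        key = tuple(l[lc] for lc in on)
        matches = index.get(key, [])
        for r in matches:
            out.append({**l, **{c: r[c] for c in right_cols}})
        # A left join keeps unmatched left rows, filling right columns with None.
        if not matches and how == "left":
            out.append({**l, **{c: None for c in right_cols}})
    return out


left_rows = [{"key": k, "x": x} for k, x in zip([1, 2, 3], [10, 20, 30])]
right_rows = [{"lookup_key": k, "y": y} for k, y in zip([2, 3, 4], [200, 300, 400])]

inner = join_rows(left_rows, right_rows, on={"key": "lookup_key"}, how="inner")
left_join = join_rows(left_rows, right_rows, on={"key": "lookup_key"}, how="left")
print(inner)
print(left_join)
```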

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview
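
If you start from an ordinary filesystem path, the standard library can build the `file://` URI for you. This is a minimal sketch; `to_file_uri` is an illustrative helper, not a chalkdf function.

```python
from pathlib import Path


def to_file_uri(path):
    """Convert a local path (relative or absolute) into a file:// URI."""
    # as_uri() requires an absolute path, so resolve it first.
    return Path(path).expanduser().resolve().as_uri()


print(to_file_uri("data.parquet"))
```

The resulting string can then be placed in the list passed to `DataFrame.scan`, as in the example above.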

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
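
The record-then-round-trip idea can be illustrated with a toy plan that stores operations as data and serializes them to JSON. This is only a conceptual model: `ToyPlan` is hypothetical, it uses JSON rather than protobuf, and it says nothing about how LazyFrame is implemented.

```python
import json
from dataclasses import dataclass


@dataclass
class ToyPlan:
    """A recorded chain of (operation name, parameters) pairs."""
    ops: tuple = ()

    def then(self, name, **params):
        # Return a new plan with the operation appended; nothing is executed.
        return ToyPlan(self.ops + ((name, tuple(sorted(params.items()))),))

    def to_json(self):
        return json.dumps([[name, dict(params)] for name, params in self.ops])

    @classmethod
    def from_json(cls, payload):
        return cls(tuple(
            (name, tuple(sorted(params.items())))
            for name, params in json.loads(payload)
        ))


plan = (
    ToyPlan()
    .then("select", columns=["c1", "c2"])
    .then("slice", offset=0, length=10)
)
restored = ToyPlan.from_json(plan.to_json())
print(plan == restored)  # structural equality of the recorded operations
```

Because the plan is plain data, equality is structural: two plans are equal when they record the same operations with the same parameters, in the same order.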

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed
Testing.assert_frame_equal(left, right, check_row_order=False, check_column_order=False)
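
For intuition about `atol` and `rtol`, the sketch below uses the common NumPy-style rule `|actual - expected| <= atol + rtol * |expected|`. Whether `assert_frame_equal` combines the two tolerances in exactly this way is an assumption; `within_tolerance` is an illustrative helper, not part of chalkdf.

```python
def within_tolerance(actual, expected, atol=0.0, rtol=0.0):
    """NumPy-style closeness check (assumed, not confirmed for chalkdf)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# The values from the example above differ by about 1e-6.
print(within_tolerance(1.000001, 1.0, atol=1e-5, rtol=0.0))
print(within_tolerance(1.000001, 1.0))
```

With `atol=1e-5` the difference is accepted; with both tolerances at zero the same pair is rejected.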

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

  • chalkdf-3.31.71-cp313-cp313-manylinux_2_36_x86_64.whl (159.7 MB): CPython 3.13, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.71-cp313-cp313-macosx_15_0_arm64.whl (128.0 MB): CPython 3.13, macOS 15.0+, ARM64
  • chalkdf-3.31.71-cp312-cp312-manylinux_2_36_x86_64.whl (159.7 MB): CPython 3.12, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.71-cp312-cp312-macosx_15_0_arm64.whl (128.0 MB): CPython 3.12, macOS 15.0+, ARM64
  • chalkdf-3.31.71-cp311-cp311-manylinux_2_36_x86_64.whl (159.7 MB): CPython 3.11, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.71-cp311-cp311-macosx_15_0_arm64.whl (128.0 MB): CPython 3.11, macOS 15.0+, ARM64
  • chalkdf-3.31.71-cp310-cp310-manylinux_2_36_x86_64.whl (159.6 MB): CPython 3.10, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.71-cp310-cp310-macosx_15_0_arm64.whl (128.0 MB): CPython 3.10, macOS 15.0+, ARM64

File details

Details for the file chalkdf-3.31.71-cp313-cp313-manylinux_2_36_x86_64.whl.

File hashes

Hashes for chalkdf-3.31.71-cp313-cp313-manylinux_2_36_x86_64.whl
Algorithm Hash digest
SHA256 23721dd7ce67147c512d0c5e7d81f6bba331cd7553eb0e17977cc563c7bc3260
MD5 bbb8b5961c351efcf72a2ff06fe56f13
BLAKE2b-256 50382aca9df24810ee385c84500a726cace435e27cb93608ca206675e23fbd1e
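
To check a downloaded wheel against a SHA256 digest listed here, you can hash the file locally and compare. The sketch below uses only `hashlib`; the helper names `sha256_of` and `matches_published` are illustrative, and the file path is whatever location you downloaded the wheel to.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large wheels do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare the local digest with a published hex string, ignoring case."""
    return sha256_of(path) == published_hex.strip().lower()
```

Call `matches_published(path_to_wheel, digest)` with the SHA256 value for the matching file name; a `False` result means the download should not be installed.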

Provenance

The following attestation bundles were made for chalkdf-3.31.71-cp313-cp313-manylinux_2_36_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file chalkdf-3.31.71-cp313-cp313-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.31.71-cp313-cp313-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 8a0f7b32a448ba41eb87847a563c68a738302e30e15a82c8d23806cf3516117d
MD5 c7613e5d298fe5935995640ce42d4202
BLAKE2b-256 37ec790152edaff38ab826bec3faa5f9cf7ee9ad0e0cb602f2ac42a1480d6912

Provenance

The following attestation bundles were made for chalkdf-3.31.71-cp313-cp313-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.31.71-cp312-cp312-manylinux_2_36_x86_64.whl.

File hashes

Hashes for chalkdf-3.31.71-cp312-cp312-manylinux_2_36_x86_64.whl
Algorithm Hash digest
SHA256 972725992209ee24e1228463e17ab88bc590cdea0423ac152131ed055b0a5c4d
MD5 1e63a3908b5c3a04ed097e8eb3690fbb
BLAKE2b-256 9ddb3ec8475e7b85d156c298ac02419d87d39bf04be9fc4b328e9ae44708627f

Provenance

The following attestation bundles were made for chalkdf-3.31.71-cp312-cp312-manylinux_2_36_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.31.71-cp312-cp312-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.31.71-cp312-cp312-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 ff9cfed1c9e94d917a99c15667aaf3337d561e6afe78fcc114efa1a180263ccd
MD5 5f711f6c7b90fe86fa30df3ac8632cbe
BLAKE2b-256 46aa72237d053658fd52fbd9210db8c6cb755a6734ef713150f941aea0fb8e7e

Provenance

The following attestation bundles were made for chalkdf-3.31.71-cp312-cp312-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.31.71-cp311-cp311-manylinux_2_36_x86_64.whl.

File hashes

Hashes for chalkdf-3.31.71-cp311-cp311-manylinux_2_36_x86_64.whl
Algorithm Hash digest
SHA256 1f88d6826e199b3fc04138059175c6592f0192a491ad7768e42f272ea97364b0
MD5 bd37b461851cced215387a307d5cb3cd
BLAKE2b-256 71c3f008c12d5c66211dd534d51e6058d2c55dce6c602cbc9499c2652b0de280

Provenance

The following attestation bundles were made for chalkdf-3.31.71-cp311-cp311-manylinux_2_36_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.31.71-cp311-cp311-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.31.71-cp311-cp311-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 232bcda23ebd0795b078238823e40f9ed507e829d85f50ef328d011e35524e60
MD5 3df1775823e9d3b4b503cd006ffcfc15
BLAKE2b-256 67b99c200e39c49cf2e337bd73310b19dc9d6a739b7c1b86d117a8a82b44cece

Provenance

The following attestation bundles were made for chalkdf-3.31.71-cp311-cp311-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.31.71-cp310-cp310-manylinux_2_36_x86_64.whl.

File hashes

Hashes for chalkdf-3.31.71-cp310-cp310-manylinux_2_36_x86_64.whl
Algorithm Hash digest
SHA256 a53fbddadabd6660fd5241084f37efc77b5589bae9f91dc5228dc1eee10cc59a
MD5 1d5e49ae3ee74bf2470d8df86bdb3f95
BLAKE2b-256 ab188563efad8bbdcfb9aa9353677545ac62605fd66b28817d5e454444243dc9

Provenance

The following attestation bundles were made for chalkdf-3.31.71-cp310-cp310-manylinux_2_36_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.31.71-cp310-cp310-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.31.71-cp310-cp310-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 531a502627cbc2158a24cad5b3a0321f6695495cbb8bdaa745f5e46dfbcb310c
MD5 845dfa727efeea10dc48ee70461c037e
BLAKE2b-256 f1e0d972cdb84644a90ad9c527607e318e5828ed80a79b8afae396f90a0ee515

Provenance

The following attestation bundles were made for chalkdf-3.31.71-cp310-cp310-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private
