DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux (x86-64, glibc 2.36+) and macOS 15+ (ARM64)

Install from PyPI:

pip install chalkdf chalkpy
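
Before installing, it can help to confirm that the interpreter falls in the supported range. The following is a minimal preflight sketch using only the standard library; the helper name is_supported_python is hypothetical and is not part of chalkdf.

```python
import sys

# Supported range taken from the requirements above: Python 3.10 through 3.13.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 13)


def is_supported_python(version=None):
    """Return True if the (major, minor) version falls within the supported range."""
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION


if __name__ == "__main__":
    # Checks the interpreter that is running this script.
    print(is_supported_python())
```

Passing an explicit tuple such as (3, 9) lets you check a version other than the running interpreter.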

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
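
For reference, the values this example is expected to produce can be worked out in plain Python. This is only a model of the arithmetic, assuming F.if_then_else(cond, x, y) yields x where cond holds and y otherwise; it does not call chalkdf.

```python
a = [1, 2, 3]
b = [10, 20, 30]

# "sum" mirrors _.a + _.b, evaluated row by row.
sum_col = [x + y for x, y in zip(a, b)]

# "flag" mirrors F.if_then_else(_.a < 2, "small", "big").
flag_col = ["small" if x < 2 else "big" for x in a]

print(sum_col)   # [11, 22, 33]
print(flag_col)  # ['small', 'big', 'big']
```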

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
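
The order of the chained calls matters. The plain-Python sketch below models the pipeline step by step, assuming slice(1, 2) means "skip 1 row, then take 2" (consistent with slice(0, length=10) in the lazy example) and that each step applies to the output of the previous one. It is a model of the expected result, not chalkdf's implementation.

```python
rows = [{"a": a, "b": b} for a, b in zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])]

# filter(_.a > 2): keeps a = 3, 4, 5
filtered = [r for r in rows if r["a"] > 2]

# slice(1, 2): skip one row, keep the next two (a = 4, 5)
offset, length = 1, 2
sliced = filtered[offset : offset + length]

# order_by(("b", "descending")): largest b first
ordered = sorted(sliced, key=lambda r: r["b"], reverse=True)

print(ordered)  # [{'a': 5, 'b': 50}, {'a': 4, 'b': 40}]
```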

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # (1,1), (1,2)
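
The comment above shows that the row with an empty list produces no output rows. A plain-Python model of the same rename-then-explode step, written without chalkdf, makes that behavior explicit:

```python
rows = [{"id": 1, "vals": [1, 2]}, {"id": 2, "vals": []}]

# Rename "vals" to "values", then emit one output row per list element.
# A row whose list is empty contributes nothing to the result.
exploded_rows = [
    {"id": r["id"], "values": v}
    for r in rows
    for v in r["vals"]
]

print(exploded_rows)  # [{'id': 1, 'values': 1}, {'id': 1, 'values': 2}]
```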

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
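
As a cross-check for unit tests, the expected sums per group can be computed in plain Python. This sketch only models the aggregation; because row order is not guaranteed, it compares results as a dictionary keyed by group.

```python
from collections import defaultdict

g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

# Accumulate the sum of v for each distinct value of g.
v_sum = defaultdict(int)
for key, value in zip(g, v):
    v_sum[key] += value

print(dict(v_sum))  # {1: 30, 2: 6}
```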

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

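# When column names differ, map each left column to its right counterpart.
# Here the right-hand key is renamed to "lookup_key" so the mapping has a column to match.
right_lookup = right.rename({"key": "lookup_key"})
joined3 = left.join(right_lookup, on={"key": "lookup_key"}, how="left")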
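
To illustrate what a left-to-right key mapping means, here is a small hash-join sketch over lists of dicts. The join function below is a hypothetical stand-in, not chalkdf's implementation; it assumes that unmatched left rows are kept with None values under how="left" and that the right-hand key columns are dropped from the output, which may differ from what chalkdf returns.

```python
left = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right = [
    {"lookup_key": 2, "y": 200},
    {"lookup_key": 3, "y": 300},
    {"lookup_key": 4, "y": 400},
]


def join(left_rows, right_rows, on, how="inner"):
    """Join two lists of dicts; `on` maps left column names to right column names."""
    # Index the right side by its key columns.
    index = {}
    for rrow in right_rows:
        index.setdefault(tuple(rrow[rc] for rc in on.values()), []).append(rrow)

    # Non-key columns carried over from the right side.
    right_cols = [c for c in right_rows[0] if c not in on.values()] if right_rows else []

    out = []
    for lrow in left_rows:
        matches = index.get(tuple(lrow[lc] for lc in on), [])
        for rrow in matches:
            out.append({**lrow, **{c: rrow[c] for c in right_cols}})
        if not matches and how == "left":
            # Keep the unmatched left row and fill right-hand columns with None.
            out.append({**lrow, **{c: None for c in right_cols}})
    return out


print(join(left, right, on={"key": "lookup_key"}))
# [{'key': 2, 'x': 20, 'y': 200}, {'key': 3, 'x': 30, 'y': 300}]
```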

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview
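
If you start from ordinary filesystem paths, the standard library can build the file:// URIs for you. The helper to_file_uri and the example paths below are hypothetical; the sketch only prepares the list of URIs and does not call chalkdf.

```python
from pathlib import Path


def to_file_uri(path):
    """Convert a local path (relative or absolute) to an absolute file:// URI."""
    return Path(path).resolve().as_uri()


# Hypothetical file names; the files do not need to exist to build the URIs.
uris = [to_file_uri(p) for p in ["data/part-0.parquet", "data/part-1.parquet"]]
print(uris)
```

The resulting list can then be passed as the first argument to DataFrame.scan, as in the example above.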

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
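
The idea of recording a plan and round-tripping it can be sketched with a toy class. This is not how LazyFrame is implemented: the RecordedPlan class is hypothetical, and JSON stands in for protobuf purely to keep the example dependency-free. It shows why structural equality survives serialization when a plan is just an immutable sequence of steps.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedPlan:
    """Toy stand-in for a lazy plan: an immutable tuple of (operation, arguments) steps."""

    steps: tuple = ()

    def _then(self, op, **args):
        # Each call returns a new plan with one more recorded step.
        return RecordedPlan(self.steps + ((op, tuple(sorted(args.items()))),))

    def select(self, *columns):
        return self._then("select", columns=columns)

    def slice(self, offset, length):
        return self._then("slice", offset=offset, length=length)

    def to_json(self):
        return json.dumps([[op, dict(args)] for op, args in self.steps])

    @classmethod
    def from_json(cls, payload):
        plan = cls()
        for op, args in json.loads(payload):
            # JSON turns tuples into lists; convert back so equality holds.
            args = {k: tuple(v) if isinstance(v, list) else v for k, v in args.items()}
            plan = plan._then(op, **args)
        return plan


plan = RecordedPlan().select("c1", "c2").slice(0, length=10)
restored = RecordedPlan.from_json(plan.to_json())
print(plan == restored)  # True
```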

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed (atol is kept so the 1e-6 difference in "a" still passes)
Testing.assert_frame_equal(left, right, atol=1e-5, check_row_order=False, check_column_order=False)
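
To make the tolerance arguments concrete, the sketch below uses the common NumPy-style rule |actual - expected| <= atol + rtol * |expected|. Whether chalkdf applies exactly this formula is an assumption; both helper functions are hypothetical and written in plain Python.

```python
def values_close(actual, expected, atol=0.0, rtol=0.0):
    """Return True when |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def rows_equal_ignoring_order(left_rows, right_rows):
    """Compare two lists of row tuples as multisets by sorting them first."""
    return sorted(left_rows) == sorted(right_rows)


print(values_close(1.000001, 1.0, atol=1e-5))  # True: the difference is about 1e-6
print(values_close(1.000001, 1.0))             # False: no tolerance allowed
print(rows_equal_ignoring_order([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))  # True
```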

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • chalkdf-3.31.72-cp313-cp313-manylinux_2_36_x86_64.whl (160.1 MB): CPython 3.13, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.72-cp313-cp313-macosx_15_0_arm64.whl (128.3 MB): CPython 3.13, macOS 15.0+, ARM64
  • chalkdf-3.31.72-cp312-cp312-manylinux_2_36_x86_64.whl (160.1 MB): CPython 3.12, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.72-cp312-cp312-macosx_15_0_arm64.whl (128.3 MB): CPython 3.12, macOS 15.0+, ARM64
  • chalkdf-3.31.72-cp311-cp311-manylinux_2_36_x86_64.whl (160.1 MB): CPython 3.11, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.72-cp311-cp311-macosx_15_0_arm64.whl (128.3 MB): CPython 3.11, macOS 15.0+, ARM64
  • chalkdf-3.31.72-cp310-cp310-manylinux_2_36_x86_64.whl (160.1 MB): CPython 3.10, manylinux (glibc 2.36+), x86-64
  • chalkdf-3.31.72-cp310-cp310-macosx_15_0_arm64.whl (128.3 MB): CPython 3.10, macOS 15.0+, ARM64

File details

Every file in this release has an attestation bundle published by build-dataframe.yml on chalk-ai/chalk-private. Attestation values reflect the state when the release was signed and may no longer be current.

chalkdf-3.31.72-cp313-cp313-manylinux_2_36_x86_64.whl

  • SHA256: 79c9b24eb9799297246c223f8e454716da71c7d22bfe793d6f29db1c14d6c30f
  • MD5: 8ca8c9971bf5f1a60c25ca3feed1f6c1
  • BLAKE2b-256: 9924808b1a30dab5b2c688d60ab15a17590857b87cffa54886051ec483abf572

chalkdf-3.31.72-cp313-cp313-macosx_15_0_arm64.whl

  • SHA256: 02050f20514647a31593e68188582edd5ad2a0f7ea71e8e97dd3dae05a1076a0
  • MD5: b1a99e0513391ab11006a5173787fb1c
  • BLAKE2b-256: caf081dac9cbb43ebced99543b1f9042501c89bc24650d36a322c7c8bd8786d3

chalkdf-3.31.72-cp312-cp312-manylinux_2_36_x86_64.whl

  • SHA256: 2a32c27db0b892f363cf1b86e42823ec0f398cf6abb9019353f4b74411c0c229
  • MD5: ae2965291279469133eabc35e29419d5
  • BLAKE2b-256: c4c788cc37858eb942e13354262e8359a1e97a9de3a54edafa7edb99275920b9

chalkdf-3.31.72-cp312-cp312-macosx_15_0_arm64.whl

  • SHA256: 72395229c90449c1a0121f5d8da41cb289995c822a8ebb0887f26a0720f79087
  • MD5: 981b0f7104c5d3abf85b730a1c59bd9d
  • BLAKE2b-256: d1a1a1f9c5ed91f98850cb2fd9f684e8c45a2b9b4d8a96d41fce0483a3a47450

chalkdf-3.31.72-cp311-cp311-manylinux_2_36_x86_64.whl

  • SHA256: 88a3ed6d146d6bc56fdca00c020948eb1c3fd3eb227afb3e28a5a7e9cff8aa8f
  • MD5: b5127f9be055cd70bb5faea7d59153c0
  • BLAKE2b-256: 3a04135ce5bf50830b11725c96f809ecb94a10d015d584a05c2b071341045fbb

chalkdf-3.31.72-cp311-cp311-macosx_15_0_arm64.whl

  • SHA256: 558baa65fe92dcdec5ddb07e13d13fed6aa2b0bb8fb0b217f953685800d372d9
  • MD5: 77fd899c23febc218e1f86e18a48655d
  • BLAKE2b-256: 5fe2fa4a17ba37cec99c046c6f56c902d97517265537d44ce13fa7e0b5fd7e02

chalkdf-3.31.72-cp310-cp310-manylinux_2_36_x86_64.whl

  • SHA256: 45aa37b9f4428ac5a43af6ddaae4e0df40d8a08506e9ea7e4a55b171acb6316b
  • MD5: de71efe6d509cbea3048cc85539bd843
  • BLAKE2b-256: 8f22a3eb627552420bd8efa52304266d61d174bdb40163e1299542d03e4f1c63

chalkdf-3.31.72-cp310-cp310-macosx_15_0_arm64.whl

  • SHA256: fac48bda5fae7da427a070377d846625edf8b7c2bd10d5c654f3aceecd613241
  • MD5: 971285b58fcf05f30fe2215199b701ee
  • BLAKE2b-256: 279cd5b1d941ddcf1437b6b8cdb965fbc5669962fff14fe58918cd2f8c924a9f
