DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow, powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux (x86-64, glibc 2.35+) and macOS 15+ (ARM64), the platforms for which wheels are published

Install from PyPI:

pip install chalkdf chalkpy
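
If you want to fail fast in a setup script, the supported interpreter range above can be checked before installing. This is a minimal sketch; the helper name `python_supported` is hypothetical and not part of chalkdf.

```python
import sys

# Inclusive (major, minor) range taken from the requirement listed above.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 13)

def python_supported(version=sys.version_info):
    """Return True when the given interpreter version is within the supported range."""
    return MIN_VERSION <= tuple(version[:2]) <= MAX_VERSION

print(python_supported())
```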

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
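
To make the expected result concrete, here is a plain-Python sketch that computes the same two columns row by row. It does not call chalkdf. It assumes that `_.a + _.b` adds element-wise and that `F.if_then_else(cond, x, y)` yields `x` where the condition holds and `y` otherwise, as the example suggests.

```python
# Plain-Python illustration of the with_columns example (no chalkdf needed).
rows = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]

def add_columns(row):
    # Assumed semantics: "sum" is a + b; "flag" is "small" when a < 2, else "big".
    return {
        **row,
        "sum": row["a"] + row["b"],
        "flag": "small" if row["a"] < 2 else "big",
    }

expected = [add_columns(r) for r in rows]
for r in expected:
    print(r)
```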

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
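
The order of the chained calls matters: the slice is applied to the filtered rows, and only the sliced rows are sorted. The plain-Python walk-through below shows this under the assumption that `slice(1, 2)` means "skip 1 row, keep 2 rows", which matches the `slice(0, length=10)` spelling used elsewhere in this README but is not confirmed here.

```python
# Plain-Python walk-through of filter -> slice -> order_by (no chalkdf needed).
rows = [{"a": a, "b": b} for a, b in zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])]

# Step 1: keep rows where a > 2, leaving a = 3, 4, 5.
filtered = [r for r in rows if r["a"] > 2]

# Step 2: assumed meaning of slice(1, 2) is offset=1, length=2.
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# Step 3: sort the remaining rows by b in descending order.
ordered = sorted(sliced, key=lambda r: r["b"], reverse=True)

print(ordered)
```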

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # rows (id, values): (1, 1), (1, 2); the empty list for id=2 yields no rows
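
The explode step can be pictured as one output row per list element. The sketch below is a plain-Python illustration, not chalkdf's implementation; it assumes, based on the output noted in the example, that a row whose list is empty produces no output rows.

```python
def explode(rows, column):
    """Yield one output row per element of row[column]; empty lists yield nothing."""
    for row in rows:
        for item in row[column]:
            yield {**row, column: item}

rows = [{"id": 1, "values": [1, 2]}, {"id": 2, "values": []}]
exploded_rows = list(explode(rows, "values"))
print(exploded_rows)
```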

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
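
For reference, the same grouped sum can be computed in plain Python. This sketch only illustrates the expected values; because grouped output may come back in any order, it sorts the result before printing or comparing.

```python
from collections import defaultdict

g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

# Accumulate the sum of v for each distinct value of g.
v_sum = defaultdict(int)
for key, value in zip(g, v):
    v_sum[key] += value

# Sort by group key so the comparison does not depend on output order.
result = sorted(v_sum.items())
print(result)
```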

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

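# When column names differ, map each left column to its right counterpart
right_lookup = DataFrame.from_arrow(pa.table({"lookup_key": [2, 3, 4], "y": [200, 300, 400]}))
joined3 = left.join(right_lookup, on={"key": "lookup_key"}, how="left")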
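
The relationship between the two `on` forms can be sketched in plain Python: a list of shared names is equivalent to a mapping from each name to itself. The helpers `normalize_on` and `join_rows` below are hypothetical and exist only for illustration. They assume standard inner and left join semantics, and they drop the right-hand key columns from the output, which may differ from what chalkdf returns.

```python
def normalize_on(on):
    """Turn a list of shared key names into the equivalent left->right mapping."""
    return dict(on) if isinstance(on, dict) else {name: name for name in on}

def join_rows(left, right, on, how="inner"):
    """Nested-loop join over lists of dicts; supports how='inner' and how='left'."""
    mapping = normalize_on(on)
    right_keys = set(mapping.values())
    # Non-key columns of the right side, taken from its first row.
    right_cols = [c for c in (right[0] if right else {}) if c not in right_keys]
    out = []
    for lrow in left:
        matches = [
            rrow for rrow in right
            if all(lrow[lk] == rrow[rk] for lk, rk in mapping.items())
        ]
        for rrow in matches:
            out.append({**lrow, **{c: rrow[c] for c in right_cols}})
        if not matches and how == "left":
            # Unmatched left rows are kept, with nulls for the right columns.
            out.append({**lrow, **{c: None for c in right_cols}})
    return out

left_rows = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right_rows = [{"key": 2, "y": 200}, {"key": 3, "y": 300}, {"key": 4, "y": 400}]
lookup_rows = [{"lookup_key": r["key"], "y": r["y"]} for r in right_rows]

inner = join_rows(left_rows, right_rows, on=["key"])
left_joined = join_rows(left_rows, lookup_rows, on={"key": "lookup_key"}, how="left")
print(inner)
print(left_joined)
```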

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview
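
If you start from relative or local paths, the standard library can build the file:// URIs for you. A small sketch follows; the paths are placeholders and the helper name `to_file_uri` is hypothetical.

```python
from pathlib import Path

def to_file_uri(path):
    """Return an absolute file:// URI for a local path (the file need not exist)."""
    return Path(path).resolve().as_uri()

# Placeholder paths; substitute your own Parquet files.
uris = [to_file_uri(p) for p in ["data/part-0.parquet", "data/part-1.parquet"]]
print(uris)
# The resulting strings can then be passed as the list argument to DataFrame.scan.
```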

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
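
The idea of recording a plan and round-tripping it can be illustrated without chalkdf. The sketch below uses a hypothetical `TinyPlan` class and JSON as a stand-in for the protobuf encoding; it shows only the general pattern (immutable recorded steps, serialization, structural equality), not how LazyFrame is implemented.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class TinyPlan:
    """Hypothetical stand-in for a lazy plan: a tuple of (operation, arguments) steps."""
    steps: tuple = ()

    def _then(self, op, **args):
        # Each call returns a new plan; nothing is executed here.
        return TinyPlan(self.steps + ((op, json.dumps(args, sort_keys=True)),))

    def select(self, *columns):
        return self._then("select", columns=list(columns))

    def slice(self, offset, length):
        return self._then("slice", offset=offset, length=length)

    def to_bytes(self):
        # JSON stands in for the protobuf encoding used by the real library.
        return json.dumps(self.steps).encode("utf-8")

    @classmethod
    def from_bytes(cls, data):
        decoded = json.loads(data.decode("utf-8"))
        return cls(tuple((op, args) for op, args in decoded))

plan = TinyPlan().select("c1", "c2").slice(0, length=10)
restored = TinyPlan.from_bytes(plan.to_bytes())
assert plan == restored  # structural equality of the recorded steps
print([op for op, _ in restored.steps])
```

Because the plan is plain data, equality is a comparison of the recorded steps, which is why a serialized and restored plan compares equal to the original.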

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed (the tolerance is kept so the values above still compare equal)
Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0, check_row_order=False, check_column_order=False)
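
The exact comparison rule used by assert_frame_equal is not spelled out here. As a rough mental model, the sketch below assumes the NumPy-style test `|actual - expected| <= atol + rtol * |expected|` and shows how order-insensitive comparison can be done by sorting rows first. The helper names are hypothetical.

```python
def values_close(actual, expected, atol=0.0, rtol=0.0):
    """Assumed closeness rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)

def rows_equal(left, right, atol=0.0, rtol=0.0, check_row_order=True):
    """Compare lists of numeric row tuples, optionally ignoring row order."""
    if len(left) != len(right):
        return False
    if not check_row_order:
        # Sorting gives both sides a canonical order before pairwise comparison.
        left, right = sorted(left), sorted(right)
    return all(
        len(lrow) == len(rrow)
        and all(values_close(a, b, atol, rtol) for a, b in zip(lrow, rrow))
        for lrow, rrow in zip(left, right)
    )

print(rows_equal([(1.000001, 2.0)], [(1.0, 2.0)], atol=1e-5))
print(rows_equal([(1.0,), (2.0,)], [(2.0,), (1.0,)], check_row_order=False))
```

With `atol=1e-5` the difference of about 1e-6 between 1.000001 and 1.0 is accepted; with both tolerances at zero it is not.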

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.

Built Distributions

No source distribution is available for this release. Wheels are published for the following platforms:

  • chalkdf-3.30.24-cp313-cp313-manylinux_2_35_x86_64.whl (145.2 MB): CPython 3.13, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.30.24-cp313-cp313-macosx_15_0_arm64.whl (119.0 MB): CPython 3.13, macOS 15.0+, ARM64
  • chalkdf-3.30.24-cp312-cp312-manylinux_2_35_x86_64.whl (145.2 MB): CPython 3.12, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.30.24-cp312-cp312-macosx_15_0_arm64.whl (119.0 MB): CPython 3.12, macOS 15.0+, ARM64
  • chalkdf-3.30.24-cp311-cp311-manylinux_2_35_x86_64.whl (145.1 MB): CPython 3.11, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.30.24-cp311-cp311-macosx_15_0_arm64.whl (119.0 MB): CPython 3.11, macOS 15.0+, ARM64
  • chalkdf-3.30.24-cp310-cp310-manylinux_2_35_x86_64.whl (145.1 MB): CPython 3.10, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.30.24-cp310-cp310-macosx_15_0_arm64.whl (119.0 MB): CPython 3.10, macOS 15.0+, ARM64
