DataFrame utilities for Chalk AI, backed by libchalk

Project description

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Prebuilt wheels cover Linux x86-64 (glibc 2.35+) and macOS 15+ on ARM64

Install from PyPI:

pip install chalkdf chalkpy
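
Before installing, it can help to confirm that the interpreter and operating system fall inside the ranges listed above. The following is a minimal sketch using only the standard library; the helper name check_environment is hypothetical and is not part of chalkdf.

```python
import platform
import sys

# Supported range stated in the requirements above: Python 3.10 through 3.13.
MIN_PY, MAX_PY = (3, 10), (3, 13)


def check_environment(py_version=None, system=None, mac_release=None):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration, not part of chalkdf.
    """
    py_version = tuple(sys.version_info[:2]) if py_version is None else py_version
    system = platform.system() if system is None else system
    problems = []
    if not (MIN_PY <= py_version <= MAX_PY):
        problems.append(f"Python {py_version[0]}.{py_version[1]} is outside 3.10-3.13")
    if system == "Darwin":
        release = platform.mac_ver()[0] if mac_release is None else mac_release
        major = int(release.split(".")[0]) if release else 0
        if major < 15:
            problems.append(f"macOS {release or 'unknown'} is older than 15")
    elif system != "Linux":
        problems.append(f"{system} is not listed as supported")
    return problems


issues = check_environment()
print("OK" if not issues else "\n".join(issues))
```

The optional arguments exist only so the check can be exercised with values other than those of the current machine.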

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
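
To make the order of operations concrete, here is a plain-Python mirror of the chain above. It is only a sketch of the expected semantics, not how chalkdf executes the plan, and it assumes that slice(1, 2) means "skip 1 row, then take 2" and that filter preserves input row order.

```python
# Plain-Python mirror of: df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
rows = [{"a": a, "b": b} for a, b in zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])]

# filter(_.a > 2): keeps the rows with a = 3, 4, 5
filtered = [r for r in rows if r["a"] > 2]

# slice(1, 2): assumed to mean offset 1, length 2, so a = 4, 5 remain
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# order_by(("b", "descending")): sort the remaining rows by b, largest first
ordered = sorted(sliced, key=lambda r: r["b"], reverse=True)

print(ordered)
# [{'a': 5, 'b': 50}, {'a': 4, 'b': 40}]
```

Because the slice runs before the sort, reordering the chain would generally select different rows.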

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # rows (1, 1) and (1, 2); the empty list for id=2 yields no rows
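
The expected result noted above implies that a row whose list is empty contributes no output rows. A small plain-Python sketch of that behavior follows; the explode function here is a hypothetical stand-in for illustration and says nothing about chalkdf's implementation.

```python
def explode(rows, column):
    """Emit one output row per element of the list stored in `column`."""
    out = []
    for row in rows:
        for item in row[column]:
            # Copy the row, replacing the list with a single element.
            out.append({**row, column: item})
    return out


rows = [{"id": 1, "values": [1, 2]}, {"id": 2, "values": []}]
print(explode(rows, "values"))
# [{'id': 1, 'values': 1}, {'id': 1, 'values': 2}]
```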

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
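
For reference, the sums this aggregation should produce can be worked out in plain Python. The sketch below only mirrors the arithmetic; it stores results in a dictionary, so it sidesteps the row-ordering caveat mentioned in the comment above.

```python
from collections import defaultdict

g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

# Accumulate v per group key g, like _.v.sum() grouped by "g".
v_sum = defaultdict(int)
for key, value in zip(g, v):
    v_sum[key] += value

print(dict(v_sum))
# {1: 30, 2: 6}
```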

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

# When column names differ, map each left column to its right counterpart
right_alt = DataFrame.from_arrow(pa.table({"lookup_key": [2, 3, 4], "y": [200, 300, 400]}))
joined3 = left.join(right_alt, on={"key": "lookup_key"}, how="left")
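
The left-to-right key mapping can be illustrated without chalkdf. The sketch below joins lists of dictionaries; join_rows is a hypothetical helper, and both the choice of output columns and the use of None for unmatched rows are assumptions rather than documented chalkdf behavior.

```python
def join_rows(left_rows, right_rows, on, how="inner"):
    """Join lists of dicts; `on` maps each left column to its right counterpart."""
    right_keys = list(on.values())
    # Index right-hand rows by their join-key values for fast lookup.
    index = {}
    for rrow in right_rows:
        index.setdefault(tuple(rrow[c] for c in right_keys), []).append(rrow)
    # Right-hand columns that are not join keys get copied into the output.
    extra = [c for c in (right_rows[0] if right_rows else {}) if c not in right_keys]
    out = []
    for lrow in left_rows:
        matches = index.get(tuple(lrow[c] for c in on), [])
        for rrow in matches:
            out.append({**lrow, **{c: rrow[c] for c in extra}})
        if not matches and how == "left":
            # Assumed: unmatched left rows keep their values, right columns are null.
            out.append({**lrow, **{c: None for c in extra}})
    return out


left = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right = [{"lookup_key": 2, "y": 200}, {"lookup_key": 3, "y": 300}, {"lookup_key": 4, "y": 400}]

inner = join_rows(left, right, on={"key": "lookup_key"}, how="inner")
print(inner)
# [{'key': 2, 'x': 20, 'y': 200}, {'key': 3, 'x': 30, 'y': 300}]

left_join = join_rows(left, right, on={"key": "lookup_key"}, how="left")
print(left_join[0])
# {'key': 1, 'x': 10, 'y': None}
```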

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview
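
If your files are referenced by relative paths, the standard library can build the file:// URIs for you. This is a minimal sketch; the helper name to_file_uri and the file names are hypothetical, and the resulting list is what you would pass as the first argument to DataFrame.scan.

```python
from pathlib import Path


def to_file_uri(path):
    """Convert a local path to an absolute file:// URI."""
    # as_uri() requires an absolute path, so resolve relative paths first.
    return Path(path).expanduser().resolve().as_uri()


# Hypothetical file names, used only for illustration.
uris = [to_file_uri(p) for p in ["data/part-0.parquet", "data/part-1.parquet"]]
print(uris)
```

Path.resolve() does not require the file to exist, so the URIs can be built before the data is written.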

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed
Testing.assert_frame_equal(left, right, check_row_order=False, check_column_order=False)
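
To see why the first comparison above passes, consider how absolute and relative tolerances usually combine. The sketch below assumes a NumPy-style rule, |left - right| <= atol + rtol * |right|; this page does not state the exact formula chalkdf uses, so treat it as an illustration only.

```python
def within_tolerance(left, right, atol=0.0, rtol=0.0):
    # Assumed NumPy-style rule; chalkdf's exact formula is not documented here.
    return abs(left - right) <= atol + rtol * abs(right)


# The values differ by about 1e-6, which is inside atol=1e-5.
print(within_tolerance(1.000001, 1.0, atol=1e-5, rtol=0.0))  # True

# A tighter absolute tolerance rejects the same pair.
print(within_tolerance(1.000001, 1.0, atol=1e-7, rtol=0.0))  # False
```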

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.

Download files

No source distribution is available for this release.

Built Distributions

  • chalkdf-3.31.9-cp313-cp313-manylinux_2_35_x86_64.whl (155.2 MB): CPython 3.13, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.31.9-cp313-cp313-macosx_15_0_arm64.whl (121.3 MB): CPython 3.13, macOS 15.0+, ARM64
  • chalkdf-3.31.9-cp312-cp312-manylinux_2_35_x86_64.whl (155.2 MB): CPython 3.12, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.31.9-cp312-cp312-macosx_15_0_arm64.whl (121.3 MB): CPython 3.12, macOS 15.0+, ARM64
  • chalkdf-3.31.9-cp311-cp311-manylinux_2_35_x86_64.whl (155.2 MB): CPython 3.11, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.31.9-cp311-cp311-macosx_15_0_arm64.whl (121.3 MB): CPython 3.11, macOS 15.0+, ARM64
  • chalkdf-3.31.9-cp310-cp310-manylinux_2_35_x86_64.whl (155.2 MB): CPython 3.10, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.31.9-cp310-cp310-macosx_15_0_arm64.whl (121.3 MB): CPython 3.10, macOS 15.0+, ARM64
