DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
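  • Works on Linux (x86-64, glibc 2.36+) and macOS 15+ (ARM64); wheels are published for these platforms only, with no source distribution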

Install from PyPI:

pip install chalkdf chalkpy
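
Because only binary wheels are published for the Python versions listed above, it can help to check the interpreter before installing. The following is a minimal sketch using only the standard library; the `python_supported` helper and the two constants are illustrative names, not part of chalkdf.

```python
import sys

# Version bounds taken from the requirements listed above (inclusive).
SUPPORTED_MIN = (3, 10)
SUPPORTED_MAX = (3, 13)


def python_supported(version=None):
    """Return True if the (major, minor) version falls in the supported range.

    `version` defaults to the running interpreter; pass a tuple to test others.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


if not python_supported():
    print("This interpreter is outside the 3.10-3.13 range listed for chalkdf.")
```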

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
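
As a cross-check of what the two expressions above should produce, the same row-wise logic can be written in plain Python. This is only a sketch of the intended semantics, not how chalkdf evaluates expressions; `with_columns_py` is a hypothetical helper introduced for illustration.

```python
# Plain-Python sketch of the column expressions above (not chalkdf code).
# `with_columns_py` is a hypothetical helper used only for illustration.

def with_columns_py(columns, new_columns):
    """Return a copy of `columns` with computed columns added or replaced.

    `columns` maps names to equal-length lists; `new_columns` maps names to
    functions that take one row (a dict) and return a value.
    """
    n_rows = len(next(iter(columns.values())))
    rows = [{name: values[i] for name, values in columns.items()} for i in range(n_rows)]
    out = {name: list(values) for name, values in columns.items()}
    for name, fn in new_columns.items():
        out[name] = [fn(row) for row in rows]
    return out


table = {"a": [1, 2, 3], "b": [10, 20, 30]}
result = with_columns_py(
    table,
    {
        "sum": lambda r: r["a"] + r["b"],                     # mirrors _.a + _.b
        "flag": lambda r: "small" if r["a"] < 2 else "big",   # mirrors F.if_then_else
    },
)
print(result["sum"])   # [11, 22, 33]
print(result["flag"])  # ['small', 'big', 'big']
```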

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
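
The order of the chained calls matters: the slice applies to the filtered rows, and the sort applies to the slice. The plain-Python sketch below walks through the same steps, assuming `slice(1, 2)` means offset 1 and length 2 (consistent with the `slice(0, length=10)` call in the lazy example further down). It is not chalkdf code.

```python
# Plain-Python walk-through of filter -> slice -> order_by on the same data.
rows = [{"a": a, "b": b} for a, b in zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])]

# Step 1: keep rows where a > 2, leaving a = 3, 4, 5.
filtered = [r for r in rows if r["a"] > 2]

# Step 2: skip `offset` rows, then take `length` rows (assumed slice semantics).
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# Step 3: sort the remaining rows by b, largest first.
ordered = sorted(sliced, key=lambda r: r["b"], reverse=True)

print(ordered)  # [{'a': 5, 'b': 50}, {'a': 4, 'b': 40}]
```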

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # rows (id, values): (1, 1), (1, 2); id 2 has an empty list and yields no rows
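
The explode step can be pictured as emitting one output row per list element. The sketch below reproduces the result noted in the comment above in plain Python; that rows with an empty list are dropped is inferred from that comment, and `explode_py` is a hypothetical helper rather than a chalkdf function.

```python
# Plain-Python sketch of explode: one output row per element of the list column.
# `explode_py` is a hypothetical helper, not part of chalkdf.

def explode_py(rows, column):
    out = []
    for row in rows:
        # An empty list contributes no rows at all.
        for item in row[column]:
            out.append({**row, column: item})
    return out


rows = [{"id": 1, "values": [1, 2]}, {"id": 2, "values": []}]
exploded_rows = explode_py(rows, "values")
print(exploded_rows)  # [{'id': 1, 'values': 1}, {'id': 1, 'values': 2}]
```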

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
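
For reference, the grouped sum above should give 30 for group 1 and 6 for group 2. The plain-Python sketch below computes the same totals with the standard library; because row order is not guaranteed, it sorts the result before printing. It is not chalkdf code.

```python
from collections import defaultdict

# Same data as the example above: group keys and the values to sum.
g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

sums = defaultdict(int)
for key, value in zip(g, v):
    sums[key] += value

# Sort by group key so the output is stable regardless of insertion order.
v_sum = sorted(sums.items())
print(v_sum)  # [(1, 30), (2, 6)]
```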

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

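# When column names differ, map each left column to its right counterpart
right_lookup = right.rename({"key": "lookup_key"})  # the right side now exposes "lookup_key"
joined3 = left.join(right_lookup, on={"key": "lookup_key"}, how="left")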
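
To make the left-to-right key mapping concrete, here is a plain-Python sketch of an inner and a left join where the key columns have different names. The `join_py` helper, its output column layout, and the use of `None` for unmatched rows are assumptions for illustration; chalkdf's actual output naming and null handling are not described here.

```python
# Plain-Python sketch of a join driven by a left->right key mapping.
# `join_py` is a hypothetical helper, not part of chalkdf.

def join_py(left_rows, right_rows, on, how="inner"):
    """Join two lists of row dicts; `on` maps left column -> right column."""
    right_keys = list(on.values())
    # Index the right side by its key tuple for quick lookup.
    index = {}
    for r in right_rows:
        index.setdefault(tuple(r[c] for c in right_keys), []).append(r)
    # Columns carried over from the right side (everything except its keys).
    right_only = [c for c in (right_rows[0] if right_rows else {}) if c not in right_keys]

    out = []
    for l in left_rows:
        matches = index.get(tuple(l[c] for c in on), [])
        for r in matches:
            out.append({**l, **{c: r[c] for c in right_only}})
        if not matches and how == "left":
            # Left join keeps unmatched left rows, filling right columns with None.
            out.append({**l, **{c: None for c in right_only}})
    return out


left_rows = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right_rows = [
    {"lookup_key": 2, "y": 200},
    {"lookup_key": 3, "y": 300},
    {"lookup_key": 4, "y": 400},
]

inner = join_py(left_rows, right_rows, on={"key": "lookup_key"})
left_joined = join_py(left_rows, right_rows, on={"key": "lookup_key"}, how="left")
print(inner)        # keys 2 and 3 match
print(left_joined)  # key 1 is kept with y=None
```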

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview
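
Since the scan takes file:// URIs rather than bare paths, a small helper for building them from absolute paths may be useful. The sketch below uses only the standard library and assumes POSIX-style paths; `to_file_uri` is an illustrative name, and the second path is a made-up example showing percent-encoding.

```python
from urllib.parse import quote


def to_file_uri(path: str) -> str:
    """Build a file:// URI from an absolute POSIX path, percent-encoding as needed."""
    if not path.startswith("/"):
        raise ValueError(f"expected an absolute path, got {path!r}")
    # quote() leaves "/" intact by default and encodes spaces and other characters.
    return "file://" + quote(path)


uris = [to_file_uri(p) for p in ["/path/to/data.parquet", "/data/my file.parquet"]]
print(uris)  # ['file:///path/to/data.parquet', 'file:///data/my%20file.parquet']
```

The resulting list could then be passed as the first argument to DataFrame.scan, as in the example above.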

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
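
The record-then-round-trip idea can be illustrated without chalkdf. The sketch below records each operation as plain data, serializes the list, and checks structural equality after reloading. It uses JSON in place of protobuf, and `PlanSketch` with its methods is entirely hypothetical; it is not how LazyFrame is implemented.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PlanSketch:
    """Hypothetical stand-in for a lazy plan: an ordered list of recorded steps."""

    steps: list = field(default_factory=list)

    def _then(self, op, **args):
        # Each call returns a new plan, so earlier plans are never mutated.
        return PlanSketch(self.steps + [{"op": op, **args}])

    def select(self, *columns):
        return self._then("select", columns=list(columns))

    def slice(self, offset, length):
        return self._then("slice", offset=offset, length=length)

    def filter_gt(self, column, value):
        # Expressions are reduced to a single "column > value" form for brevity.
        return self._then("filter_gt", column=column, value=value)

    def to_json(self):
        return json.dumps(self.steps)

    @classmethod
    def from_json(cls, payload):
        return cls(json.loads(payload))


plan = PlanSketch().select("c1", "c2").slice(0, length=10).filter_gt("c1", 0)
restored = PlanSketch.from_json(plan.to_json())
print(plan == restored)  # True: the recorded steps match after the round trip
```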

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed (the tolerance is kept because the values differ by 1e-6)
Testing.assert_frame_equal(left, right, atol=1e-5, check_row_order=False, check_column_order=False)
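
To see why the tolerance matters for the values above, the sketch below applies the common `|actual - expected| <= atol + rtol * |expected|` rule in plain Python. Whether chalkdf uses exactly this formula is an assumption; the helper names are illustrative only.

```python
# Plain-Python sketch of tolerance-based and order-insensitive comparison.
# The formula follows the widely used NumPy-style convention (an assumption here).

def values_close(actual, expected, atol=0.0, rtol=0.0):
    return abs(actual - expected) <= atol + rtol * abs(expected)


def rows_equal_unordered(rows_a, rows_b):
    """Compare two lists of row tuples while ignoring row order."""
    return sorted(rows_a) == sorted(rows_b)


print(values_close(1.000001, 1.0, atol=1e-5))  # True: the gap is about 1e-6
print(values_close(1.000001, 1.0))             # False: exact comparison fails
print(rows_equal_unordered([(1, 30), (2, 6)], [(2, 6), (1, 30)]))  # True
```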

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

chalkdf-3.31.89-cp313-cp313-manylinux_2_36_x86_64.whl (162.2 MB)

Uploaded: CPython 3.13, manylinux (glibc 2.36+), x86-64

chalkdf-3.31.89-cp313-cp313-macosx_15_0_arm64.whl (127.7 MB)

Uploaded: CPython 3.13, macOS 15.0+, ARM64

chalkdf-3.31.89-cp312-cp312-manylinux_2_36_x86_64.whl (162.2 MB)

Uploaded: CPython 3.12, manylinux (glibc 2.36+), x86-64

chalkdf-3.31.89-cp312-cp312-macosx_15_0_arm64.whl (127.7 MB)

Uploaded: CPython 3.12, macOS 15.0+, ARM64

chalkdf-3.31.89-cp311-cp311-manylinux_2_36_x86_64.whl (162.2 MB)

Uploaded: CPython 3.11, manylinux (glibc 2.36+), x86-64

chalkdf-3.31.89-cp311-cp311-macosx_15_0_arm64.whl (127.7 MB)

Uploaded: CPython 3.11, macOS 15.0+, ARM64

chalkdf-3.31.89-cp310-cp310-manylinux_2_36_x86_64.whl (162.1 MB)

Uploaded: CPython 3.10, manylinux (glibc 2.36+), x86-64

chalkdf-3.31.89-cp310-cp310-macosx_15_0_arm64.whl (127.6 MB)

Uploaded: CPython 3.10, macOS 15.0+, ARM64

File details

Details for the file chalkdf-3.31.89-cp313-cp313-manylinux_2_36_x86_64.whl.

File hashes

Hashes for chalkdf-3.31.89-cp313-cp313-manylinux_2_36_x86_64.whl
Algorithm Hash digest
SHA256 4b372770a9529e8a2ee7fcb45c5a2ac5ac9d07ceb6fdcbb61137075dd16d4e9f
MD5 9785d3fd376dd7e0e45a5020c154c743
BLAKE2b-256 b152f78f63fea0a737284f4a6c3aa4b5d5d886a6e8df6203a4316aad61706f91

Provenance

The following attestation bundles were made for chalkdf-3.31.89-cp313-cp313-manylinux_2_36_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file chalkdf-3.31.89-cp313-cp313-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.31.89-cp313-cp313-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 8f96e48d8ff9e03f1115ed23defec0e9a6b54d2a7d4a4dea928e4aec2ea13bb4
MD5 81b3891167e7b2db43b14e5f977bd48b
BLAKE2b-256 04c4a9057c2a5619f6bf7044b75755c3e10a3ce3cf0714620e5d9dc3f0ddeaea

Provenance

The following attestation bundles were made for chalkdf-3.31.89-cp313-cp313-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.31.89-cp312-cp312-manylinux_2_36_x86_64.whl.

File hashes

Hashes for chalkdf-3.31.89-cp312-cp312-manylinux_2_36_x86_64.whl
Algorithm Hash digest
SHA256 8b540a0465661ed799fc927e4ca010bbd8605fe6f231473ad35517c02bd54ff4
MD5 60d141bb2e3ca698509f49a38c5c7778
BLAKE2b-256 c8841d2e6e46d18f26424380c8a05b67ee0981e0b0c8e72d15f7496bb00a86fb

Provenance

The following attestation bundles were made for chalkdf-3.31.89-cp312-cp312-manylinux_2_36_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.31.89-cp312-cp312-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.31.89-cp312-cp312-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 7d7082187acde5a479b1c04cb6e7ade026d3fb10e427b90db364cecc902d602e
MD5 34e1d286ea8dbb2ae826d487af25b98f
BLAKE2b-256 871df0493315c1e7feab74e191da8a09d9a66b299b4337d582d71b7b3f757a5f

Provenance

The following attestation bundles were made for chalkdf-3.31.89-cp312-cp312-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.31.89-cp311-cp311-manylinux_2_36_x86_64.whl.

File hashes

Hashes for chalkdf-3.31.89-cp311-cp311-manylinux_2_36_x86_64.whl
Algorithm Hash digest
SHA256 8525514aafa799d7d1c6dc2b7311df2417ea11d81ff3c5a14804e8f2891bb771
MD5 48305dc968992b024bef863542de4354
BLAKE2b-256 f08743d701f7dc9af294862008a1e5bc7febeac3013feea09127afa88a43f15c

Provenance

The following attestation bundles were made for chalkdf-3.31.89-cp311-cp311-manylinux_2_36_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.31.89-cp311-cp311-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.31.89-cp311-cp311-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 1261e6fac1c2dcf1dcf07fef6698cca2825532f7f5dc56f6680144dcdeb50c88
MD5 c4164a38eb0c5cf1f2e551099b46fd68
BLAKE2b-256 bdbc4b59b2cf3b646612b4218aa001ebaa04863da8675cfc7edb2888652915e9

Provenance

The following attestation bundles were made for chalkdf-3.31.89-cp311-cp311-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.31.89-cp310-cp310-manylinux_2_36_x86_64.whl.

File hashes

Hashes for chalkdf-3.31.89-cp310-cp310-manylinux_2_36_x86_64.whl
Algorithm Hash digest
SHA256 90dd6df6a882697a535f6b05f17b5fb8c4edc4773b19c432ba8ed0bb753344bf
MD5 d274725d83ad8275b1dd67a4c7512cb8
BLAKE2b-256 695c5083c7ac2a429b8d317292e51b61a0dd469618a8ee737a45c1668a133a83

Provenance

The following attestation bundles were made for chalkdf-3.31.89-cp310-cp310-manylinux_2_36_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.31.89-cp310-cp310-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.31.89-cp310-cp310-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 6469eae2d4d3dce1b3c5b8cc25fb14e3bfc6ba36111d223988fed99c6e048c7f
MD5 ff2d951739e42155549ae763daaeb1fc
BLAKE2b-256 f93b1635f1d94ab6ef106738539868e8549c9ffc21b23f37e81a08fb7ac02f57

Provenance

The following attestation bundles were made for chalkdf-3.31.89-cp310-cp310-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private
