DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux (x86-64, glibc 2.35+) and macOS 15+ (ARM64); wheels are published only for these platforms

Install from PyPI:

pip install chalkdf chalkpy
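
The requirements above can be checked from a setup or bootstrap script. The following is a minimal sketch using only the standard library. The helper name is_supported is hypothetical, and it checks only the interpreter version and the operating system family, not the macOS or glibc minimums.

```python
import sys

# Bounds taken from the requirement list above (Python 3.10 through 3.13).
MIN_PYTHON = (3, 10)
MAX_PYTHON = (3, 13)
SUPPORTED_PLATFORMS = ("linux", "darwin")  # sys.platform values for Linux and macOS


def is_supported(version=None, platform=None):
    """Return True if the interpreter version and OS family match the stated requirements."""
    major_minor = tuple((version or sys.version_info)[:2])
    platform = platform or sys.platform
    version_ok = MIN_PYTHON <= major_minor <= MAX_PYTHON
    platform_ok = platform.startswith(SUPPORTED_PLATFORMS)
    return version_ok and platform_ok


if not is_supported():
    print("chalkdf wheels may not be available for this interpreter or platform")
```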

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
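
For reference, the values these two expressions should produce can be worked out in plain Python. This sketch does not call chalkdf. It assumes that F.if_then_else(cond, x, y) yields x where cond holds and y otherwise.

```python
# Input columns from the example above.
a = [1, 2, 3]
b = [10, 20, 30]

# "sum": element-wise a + b.
sum_col = [x + y for x, y in zip(a, b)]

# "flag": "small" where a < 2, otherwise "big" (assumed if_then_else semantics).
flag_col = ["small" if x < 2 else "big" for x in a]

print(sum_col)
print(flag_col)
```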

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
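
The result of this chain depends on the order of the steps. As a rough plain-Python sketch of what it is expected to compute (not chalkdf code), assume that slice takes an offset followed by a length, which is consistent with the slice(0, length=10) call in the lazy example, and that filtering preserves input order:

```python
rows = [{"a": a, "b": b} for a, b in zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])]

# filter(_.a > 2): keep rows whose "a" exceeds 2.
filtered = [r for r in rows if r["a"] > 2]

# slice(1, 2): skip 1 row, then take 2 rows (assumed offset/length semantics).
offset, length = 1, 2
sliced = filtered[offset : offset + length]

# order_by(("b", "descending")): sort by "b", largest first.
ordered = sorted(sliced, key=lambda r: r["b"], reverse=True)

print(ordered)
```

Moving the slice before the filter would select different rows, so the order of the chained calls matters.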

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # (1, 1), (1, 2); the empty list for id=2 yields no rows
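
The explode step can be pictured with a small plain-Python model. This is a sketch of the semantics implied by the expected output above, not chalkdf's implementation; the helper explode here is a hypothetical stand-in.

```python
rows = [{"id": 1, "values": [1, 2]}, {"id": 2, "values": []}]


def explode(rows, column):
    """Emit one output row per list element; rows with an empty list yield nothing."""
    out = []
    for row in rows:
        for item in row[column]:
            out.append({**row, column: item})
    return out


exploded = explode(rows, "values")
print(exploded)
```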

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
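
Because the output rows may be unordered, it helps to know the expected sums before writing an assertion. A plain-Python sketch of the same aggregation (no chalkdf calls) is:

```python
from collections import defaultdict

g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

# Accumulate the sum of v for each group key in g.
v_sum = defaultdict(int)
for key, value in zip(g, v):
    v_sum[key] += value

# Sort for a stable display, since group order is not guaranteed.
result = sorted(v_sum.items())
print(result)
```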

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

# When column names differ, map each left column to its right counterpart
lookup = DataFrame.from_arrow(pa.table({"lookup_key": [2, 3, 4], "y": [200, 300, 400]}))
joined3 = left.join(lookup, on={"key": "lookup_key"}, how="left")
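
To make the left-to-right key mapping concrete, here is a plain-Python sketch of inner and left joins over lists of row dictionaries. The helper join_rows is hypothetical and is not part of chalkdf. It drops the right-hand key columns and fills unmatched right-hand columns with None, which are assumptions about the output shape rather than documented chalkdf behavior.

```python
def join_rows(left, right, on, how="inner"):
    """Join lists of dict rows; `on` maps left column names to right column names."""
    right_keys = list(on.values())
    # Non-key columns contributed by the right side (taken from its first row).
    right_cols = [c for c in (right[0] if right else {}) if c not in right_keys]

    # Index right rows by their key tuple for fast lookup.
    index = {}
    for rrow in right:
        index.setdefault(tuple(rrow[c] for c in right_keys), []).append(rrow)

    out = []
    for lrow in left:
        matches = index.get(tuple(lrow[c] for c in on), [])
        for rrow in matches:
            out.append({**lrow, **{c: rrow[c] for c in right_cols}})
        if not matches and how == "left":
            # Unmatched left row: keep it and fill right-hand columns with None.
            out.append({**lrow, **{c: None for c in right_cols}})
    return out


left = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right = [{"lookup_key": 2, "y": 200}, {"lookup_key": 3, "y": 300}, {"lookup_key": 4, "y": 400}]

inner = join_rows(left, right, on={"key": "lookup_key"}, how="inner")
left_join = join_rows(left, right, on={"key": "lookup_key"}, how="left")
print(inner)
print(left_join)
```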

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview
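
A local path can be turned into a file:// URI with the standard library rather than by string concatenation. This is a small sketch; the helper name to_file_uri is hypothetical, and whether DataFrame.scan also accepts plain paths is not stated here.

```python
from pathlib import Path


def to_file_uri(path):
    """Convert a local path to an absolute file:// URI."""
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(path).resolve().as_uri()


uri = to_file_uri("data.parquet")
print(uri)
```

The resulting string can then be placed in the list passed to DataFrame.scan.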

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
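
To illustrate what "structural equality of the recorded plan" means, here is a toy model of a recorded plan that survives a serialization round trip. It is not chalkdf's implementation: the class ToyPlan is hypothetical, and JSON stands in for the protobuf encoding.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyPlan:
    """A recorded chain of (operation, arguments) steps; nothing executes here."""

    steps: tuple = ()

    def _then(self, op, *args):
        # Each call returns a new plan with one more recorded step.
        return ToyPlan(self.steps + ((op, args),))

    def select(self, *columns):
        return self._then("select", *columns)

    def slice(self, offset, length):
        return self._then("slice", offset, length)

    def to_json(self):
        return json.dumps([[op, list(args)] for op, args in self.steps])

    @classmethod
    def from_json(cls, payload):
        return cls(tuple((op, tuple(args)) for op, args in json.loads(payload)))


plan = ToyPlan().select("c1", "c2").slice(0, 10)
restored = ToyPlan.from_json(plan.to_json())

# Dataclass equality compares the recorded steps, not object identity.
assert plan == restored
assert plan is not restored
```

Two plans compare equal when they record the same steps in the same order, regardless of whether they were built directly or restored from a serialized form.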

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed (the tolerance is kept because the frames differ by 1e-6)
Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0, check_row_order=False, check_column_order=False)
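
The relaxed comparisons can be understood through a small plain-Python model. The tolerance rule below is the convention used by NumPy-style comparisons; whether chalkdf applies exactly this formula is an assumption, and both helper names are hypothetical.

```python
def values_close(actual, expected, atol=0.0, rtol=0.0):
    """Common tolerance rule: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def rows_equal_unordered(left_rows, right_rows):
    """Compare two lists of row tuples while ignoring row order."""
    return sorted(left_rows) == sorted(right_rows)


print(values_close(1.000001, 1.0, atol=1e-5))  # difference of about 1e-6 is within atol
print(values_close(1.000001, 1.0))             # zero tolerance requires exact equality
print(rows_equal_unordered([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))
```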

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • chalkdf-3.30.20-cp313-cp313-manylinux_2_35_x86_64.whl (144.6 MB): CPython 3.13, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.30.20-cp313-cp313-macosx_15_0_arm64.whl (118.5 MB): CPython 3.13, macOS 15.0+, ARM64
  • chalkdf-3.30.20-cp312-cp312-manylinux_2_35_x86_64.whl (144.6 MB): CPython 3.12, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.30.20-cp312-cp312-macosx_15_0_arm64.whl (118.5 MB): CPython 3.12, macOS 15.0+, ARM64
  • chalkdf-3.30.20-cp311-cp311-manylinux_2_35_x86_64.whl (144.6 MB): CPython 3.11, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.30.20-cp311-cp311-macosx_15_0_arm64.whl (118.6 MB): CPython 3.11, macOS 15.0+, ARM64
  • chalkdf-3.30.20-cp310-cp310-manylinux_2_35_x86_64.whl (144.5 MB): CPython 3.10, manylinux (glibc 2.35+), x86-64
  • chalkdf-3.30.20-cp310-cp310-macosx_15_0_arm64.whl (118.5 MB): CPython 3.10, macOS 15.0+, ARM64
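
The wheel file names follow the standard layout distribution-version-pythontag-abitag-platformtag.whl. As a minimal sketch, the fields can be split with the standard library; the names WheelTags and parse_wheel_filename are hypothetical, and the optional build tag that the wheel format allows is not handled.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename):
    """Split a wheel filename into its tag fields (no build tag handled)."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel filename layout: {filename}")
    return WheelTags(*parts)


tags = parse_wheel_filename("chalkdf-3.30.20-cp313-cp313-manylinux_2_35_x86_64.whl")
print(tags.python_tag, tags.platform_tag)
```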

File details

Details for the file chalkdf-3.30.20-cp313-cp313-manylinux_2_35_x86_64.whl.

File hashes

Hashes for chalkdf-3.30.20-cp313-cp313-manylinux_2_35_x86_64.whl
Algorithm Hash digest
SHA256 5323a6d4532a8b9eca63acf755a7bc1344fe75e0e069141f7e4e8c59f94f9d23
MD5 b17d154761687974c67411b8bc4b93dd
BLAKE2b-256 b0e8c76f6f4de456cf010a70d36b150cca055f970572f57dfa0b3a9dbaae83d9
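
A downloaded wheel can be checked against the published SHA256 digest with the standard library. The sketch below streams the file so that large wheels are not loaded into memory at once. The helper names sha256_of and verify are hypothetical, and the short demo at the end uses a temporary file rather than a real wheel.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True if the file's SHA-256 digest matches the published value."""
    return sha256_of(path) == expected_hex.lower()


# Demo on a throwaway file; for a real wheel, pass its path and the SHA256 value listed here.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_ok = verify(tmp.name, hashlib.sha256(b"abc").hexdigest())
demo_bad = verify(tmp.name, "0" * 64)
os.remove(tmp.name)
print(demo_ok, demo_bad)
```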

Provenance

The following attestation bundles were made for chalkdf-3.30.20-cp313-cp313-manylinux_2_35_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file chalkdf-3.30.20-cp313-cp313-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.30.20-cp313-cp313-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 cd99f32c6229d720fc01c1958d6107a649497704efe78cf99c25732ad0039727
MD5 937b0e21e68b8c606da1dd0e4c014339
BLAKE2b-256 fe84401aaa3b8817ea246a37047a6eebe6af22a51c7bb3438c9a1d9a3fccc2f7

Provenance

The following attestation bundles were made for chalkdf-3.30.20-cp313-cp313-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.30.20-cp312-cp312-manylinux_2_35_x86_64.whl.

File hashes

Hashes for chalkdf-3.30.20-cp312-cp312-manylinux_2_35_x86_64.whl
Algorithm Hash digest
SHA256 f9d920bf7d2ba755a6dc0854610a08fc39b00b2df63c2e6f07d43a150883ae09
MD5 b20ff36b676c07e9238c3ef4ec00b910
BLAKE2b-256 9805892811fe96bf58a8fbf504a54cfd5813bd7c1dbfdd464381ef7bfef15455

Provenance

The following attestation bundles were made for chalkdf-3.30.20-cp312-cp312-manylinux_2_35_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.30.20-cp312-cp312-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.30.20-cp312-cp312-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 1a5bd1dab0c3f4f3c6d2100ba535804a30e6a7b1157b3a0a5de3f640b325cb94
MD5 19d1e6214a41cba3acce251708246c14
BLAKE2b-256 7564993826f62e6e314b659e4eec121d608af0f8e1826bd16ac2eba7845742ce

Provenance

The following attestation bundles were made for chalkdf-3.30.20-cp312-cp312-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.30.20-cp311-cp311-manylinux_2_35_x86_64.whl.

File hashes

Hashes for chalkdf-3.30.20-cp311-cp311-manylinux_2_35_x86_64.whl
Algorithm Hash digest
SHA256 67f9cedee83b905b68526c53ca74442915792a371396080906839fab2149a895
MD5 903ac71f43ddcff141d21cc559385774
BLAKE2b-256 fd0236384ac276f5b08a9afa31db469d06365a6084a41986c578d961240254f0

Provenance

The following attestation bundles were made for chalkdf-3.30.20-cp311-cp311-manylinux_2_35_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.30.20-cp311-cp311-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.30.20-cp311-cp311-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 13eec32a4cebfafb2edc081ab2e01c785bcd80ad9f306383f7a23b0457305e1e
MD5 2f567798ca6c5fb763133312a40f3eac
BLAKE2b-256 a891d725a7b2e3cb4cd222304c66f0dd38689044049d642362bb860ba8beeafc

Provenance

The following attestation bundles were made for chalkdf-3.30.20-cp311-cp311-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.30.20-cp310-cp310-manylinux_2_35_x86_64.whl.

File hashes

Hashes for chalkdf-3.30.20-cp310-cp310-manylinux_2_35_x86_64.whl
Algorithm Hash digest
SHA256 079ef79a3fe670ed226fdda61272f1bac852cf689531ceb5d23be8a5b8b625bc
MD5 6d0488f86a05d3fb535299dee2ee3f01
BLAKE2b-256 3ca4f301f4b238405b9dc939dcb94083e66ac0070a005d7ab8d4b5dda6c0ad22

Provenance

The following attestation bundles were made for chalkdf-3.30.20-cp310-cp310-manylinux_2_35_x86_64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private

File details

Details for the file chalkdf-3.30.20-cp310-cp310-macosx_15_0_arm64.whl.

File hashes

Hashes for chalkdf-3.30.20-cp310-cp310-macosx_15_0_arm64.whl
Algorithm Hash digest
SHA256 d16d7df637dfac4bb4ba353bee67271db520470e8af29782d5f42c6cf7a12a2e
MD5 f24dc983223d0054efb66e85b88784a9
BLAKE2b-256 95e64e00f80f544624ee60a7c143a73f883c0706f3059599f58c9e91699bb279

Provenance

The following attestation bundles were made for chalkdf-3.30.20-cp310-cp310-macosx_15_0_arm64.whl:

Publisher: build-dataframe.yml on chalk-ai/chalk-private
