DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux (x86-64, glibc 2.36+) and macOS 15+ (ARM64)

Install from PyPI:

pip install chalkdf chalkpy

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
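
To make the expected result concrete, the sketch below recomputes the two new columns in plain Python. It is not chalkdf code. It assumes that _.a + _.b adds row by row and that F.if_then_else(cond, x, y) yields x where the condition holds and y otherwise, which is what the names suggest but is not spelled out here.

```python
# Plain-Python reference for the two columns added above.
# Assumption: this mirrors the expression semantics; it is not chalkdf's implementation.
a = [1, 2, 3]
b = [10, 20, 30]

# "sum": row-wise a + b
sum_col = [x + y for x, y in zip(a, b)]

# "flag": "small" where a < 2, otherwise "big"
flag_col = ["small" if x < 2 else "big" for x in a]

print(sum_col)   # [11, 22, 33]
print(flag_col)  # ['small', 'big', 'big']
```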

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
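
The order of the chained calls matters, so it can help to trace the pipeline by hand. The plain-Python sketch below assumes that slice(1, 2) means "skip 1 row, then take 2" (consistent with the slice(0, length=10) call in the lazy example) and that each step runs on the output of the previous one. It is a reference for the expected rows, not chalkdf's implementation.

```python
# Trace of filter -> slice -> order_by on the same data, using plain lists of dicts.
rows = [{"a": a, "b": b} for a, b in zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])]

# filter(_.a > 2): keeps a = 3, 4, 5
filtered = [r for r in rows if r["a"] > 2]

# slice(1, 2): assumed to mean offset 1, length 2, so a = 4, 5 remain
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# order_by(("b", "descending")): sort the remaining rows by b, largest first
ordered = sorted(sliced, key=lambda r: r["b"], reverse=True)

print(ordered)  # [{'a': 5, 'b': 50}, {'a': 4, 'b': 40}]
```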

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # rows (1, 1) and (1, 2); id 2 has an empty list and produces no rows
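
The row semantics implied by the comment above can be sketched in plain Python: each list element becomes its own row, and a row whose list is empty contributes nothing. This is an illustration of the expected output only, not chalkdf's implementation.

```python
# Plain-Python sketch of explode: one output row per list element.
rows = [(1, [1, 2]), (2, [])]

# The inner loop never runs for an empty list, so id 2 disappears from the output.
exploded_rows = [(row_id, value) for row_id, values in rows for value in values]

print(exploded_rows)  # [(1, 1), (1, 2)]
```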

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
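
Because the output rows may come back in any order, a test usually needs to sort them (or compare without regard to order) before asserting. The plain-Python sketch below computes the per-group sums for the same data and sorts them by key; it is a reference for the expected values, not chalkdf code.

```python
from collections import defaultdict

# Same data as above: group key g and value v.
g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

# Accumulate the sum of v for each distinct g.
sums = defaultdict(int)
for key, value in zip(g, v):
    sums[key] += value

# Sort by group key so the comparison does not depend on output order.
v_sum = sorted(sums.items())

print(v_sum)  # [(1, 30), (2, 6)]
```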

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

# When column names differ, map each left column to its right counterpart.
# `right` has no "lookup_key" column, so rename its key first.
right_lookup = right.rename({"key": "lookup_key"})
joined3 = left.join(right_lookup, on={"key": "lookup_key"}, how="left")
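
The difference between how="inner" and how="left" with a left-to-right key mapping can be sketched with a small nested-loop join in plain Python. The helper join_rows is hypothetical and exists only for this illustration. It drops the right-hand key columns from the output, which is an assumption; how chalkdf names or keeps those columns is not stated here.

```python
def join_rows(left_rows, right_rows, on, how="inner"):
    """Nested-loop join; `on` maps left column names to right column names."""
    right_keys = set(on.values())
    # Right-hand columns carried into the output (key columns are dropped here).
    right_extra = [c for c in right_rows[0] if c not in right_keys]
    out = []
    for lrow in left_rows:
        matches = [
            r for r in right_rows
            if all(lrow[lk] == r[rk] for lk, rk in on.items())
        ]
        for r in matches:
            out.append({**lrow, **{c: r[c] for c in right_extra}})
        # A left join keeps unmatched left rows, padded with None.
        if not matches and how == "left":
            out.append({**lrow, **{c: None for c in right_extra}})
    return out


left_rows = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right_rows = [
    {"lookup_key": 2, "y": 200},
    {"lookup_key": 3, "y": 300},
    {"lookup_key": 4, "y": 400},
]

inner = join_rows(left_rows, right_rows, on={"key": "lookup_key"}, how="inner")
left_join = join_rows(left_rows, right_rows, on={"key": "lookup_key"}, how="left")

print(inner)      # keys 2 and 3 only
print(left_join)  # keys 1, 2, 3; key 1 has y = None
```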

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview
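
If you start from a relative path, the standard library can build the file:// URI for you. The sketch below uses pathlib only; the path data/events.parquet is a hypothetical placeholder, and the scan call is left as a comment because it needs a real file.

```python
from pathlib import Path

# Hypothetical relative path; replace with your own Parquet file.
local_path = Path("data") / "events.parquet"

# as_uri() requires an absolute path, so resolve() first.
# resolve() does not require the file to exist.
uri = local_path.resolve().as_uri()

print(uri)  # a file:/// URI ending in data/events.parquet

# The URI can then be passed to the scan call shown above:
# df = DataFrame.scan([uri], name="my_table")
```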

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
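
The idea of a recorded plan that survives serialization can be sketched with a toy class. Everything below is hypothetical: the Plan class and its methods are not part of chalkdf, and JSON stands in for protobuf so the example runs with the standard library alone. It shows only the general pattern of storing operations as data, serializing them, and checking structural equality afterwards.

```python
import json
from dataclasses import dataclass


@dataclass
class Plan:
    """Toy recorded plan: an ordered tuple of (operation, sorted arguments) steps."""
    steps: tuple = ()

    def then(self, op, **args):
        # Return a new plan with one more step; the original plan is unchanged.
        return Plan(self.steps + ((op, tuple(sorted(args.items()))),))

    def to_json(self):
        return json.dumps([[op, dict(args)] for op, args in self.steps])

    @classmethod
    def from_json(cls, payload):
        return cls(tuple(
            (op, tuple(sorted(args.items()))) for op, args in json.loads(payload)
        ))


plan = (
    Plan()
    .then("select", columns=["c1", "c2"])
    .then("slice", offset=0, length=10)
)

restored = Plan.from_json(plan.to_json())
print(restored == plan)  # True: dataclass equality compares the recorded steps
```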

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed (the tolerances still apply to the values)
Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0, check_row_order=False, check_column_order=False)
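
The exact tolerance formula used by assert_frame_equal is not stated here. A common convention, used for example by NumPy's isclose, is |actual - expected| <= atol + rtol * |expected|. The sketch below assumes that convention to show why the values 1.000001 and 1.0 pass with atol=1e-5 and would fail under a strict comparison. The helper within_tolerance is hypothetical.

```python
def within_tolerance(actual, expected, atol=0.0, rtol=0.0):
    """Assumed convention: absolute difference bounded by atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Difference is about 1e-6, which is inside an absolute tolerance of 1e-5.
print(within_tolerance(1.000001, 1.0, atol=1e-5, rtol=0.0))  # True

# With both tolerances at zero, the same pair is treated as different.
print(within_tolerance(1.000001, 1.0))  # False
```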

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.


Built Distributions

chalkdf 3.31.88 is published as prebuilt wheels only; no source distribution is available for this release.

  • CPython 3.13, manylinux (glibc 2.36+), x86-64: chalkdf-3.31.88-cp313-cp313-manylinux_2_36_x86_64.whl (162.2 MB)
  • CPython 3.13, macOS 15.0+, ARM64: chalkdf-3.31.88-cp313-cp313-macosx_15_0_arm64.whl (127.7 MB)
  • CPython 3.12, manylinux (glibc 2.36+), x86-64: chalkdf-3.31.88-cp312-cp312-manylinux_2_36_x86_64.whl (162.2 MB)
  • CPython 3.12, macOS 15.0+, ARM64: chalkdf-3.31.88-cp312-cp312-macosx_15_0_arm64.whl (127.7 MB)
  • CPython 3.11, manylinux (glibc 2.36+), x86-64: chalkdf-3.31.88-cp311-cp311-manylinux_2_36_x86_64.whl (162.2 MB)
  • CPython 3.11, macOS 15.0+, ARM64: chalkdf-3.31.88-cp311-cp311-macosx_15_0_arm64.whl (127.7 MB)
  • CPython 3.10, manylinux (glibc 2.36+), x86-64: chalkdf-3.31.88-cp310-cp310-manylinux_2_36_x86_64.whl (162.1 MB)
  • CPython 3.10, macOS 15.0+, ARM64: chalkdf-3.31.88-cp310-cp310-macosx_15_0_arm64.whl (127.7 MB)
