DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow, powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
  • Works on Linux and macOS 15+

Install from PyPI:

pip install chalkdf chalkpy

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
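
The example above does not show its output. As a rough guide, the plain-Python sketch below computes the values the two new columns would be expected to hold, assuming `_.a + _.b` adds row by row and `F.if_then_else(cond, x, y)` yields `x` where the condition is true and `y` otherwise. It does not call chalkdf, and `expected_with_columns` is a hypothetical helper used only for illustration.

```python
def expected_with_columns(a, b):
    """Row-wise model of the 'sum' and 'flag' expressions (illustration only)."""
    rows = []
    for x, y in zip(a, b):
        rows.append(
            {
                "a": x,
                "b": y,
                "sum": x + y,                         # models _.a + _.b
                "flag": "small" if x < 2 else "big",  # models F.if_then_else(_.a < 2, ...)
            }
        )
    return rows


rows = expected_with_columns([1, 2, 3], [10, 20, 30])
for row in rows:
    print(row)
```

Under these assumptions only the first row (a = 1) is flagged "small".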

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
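
Because the operations apply in the order they are chained, the slice here happens before the sort. The plain-Python sketch below traces the expected rows step by step. It assumes `slice(1, 2)` means offset 1 and length 2 (consistent with the `slice(0, length=10)` call used elsewhere in this README) and that filtering preserves input order; it does not call chalkdf.

```python
rows = [{"a": a, "b": b} for a, b in zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50])]

# filter(_.a > 2): keep rows where a > 2
filtered = [r for r in rows if r["a"] > 2]

# slice(1, 2): assumed to mean offset=1, length=2
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# order_by(("b", "descending")): sort on b, largest first
ordered = sorted(sliced, key=lambda r: r["b"], reverse=True)
print(ordered)
```

Under these assumptions the result holds the rows with a = 5 and a = 4, in that order.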

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # rows: (1, 1), (1, 2); id 2 has an empty list and yields no rows
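
To make the explode semantics concrete, here is a small plain-Python model of the operation: each element of the list column becomes its own row, and a row whose list is empty contributes nothing. This matches the output noted in the example above, but the `explode` function here is a hypothetical stand-in, not chalkdf's implementation.

```python
def explode(rows, column):
    """Emit one output row per element of the list stored in `column`."""
    out = []
    for row in rows:
        for item in row[column]:
            # Copy the other fields and replace the list with a single element.
            out.append({**row, column: item})
    return out


rows = [{"id": 1, "values": [1, 2]}, {"id": 2, "values": []}]
exploded_rows = explode(rows, "values")
print(exploded_rows)
```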

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
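
Since the aggregated rows may come back in any order, it helps to know the expected totals independently of order. The sketch below computes them in plain Python for the same input; it is only a reference calculation and does not use chalkdf.

```python
from collections import defaultdict

g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

# Accumulate the sum of v for each distinct value of g.
v_sum = defaultdict(int)
for key, value in zip(g, v):
    v_sum[key] += value

expected = dict(v_sum)
print(expected)  # {1: 30, 2: 6}
```

Comparing against a mapping like this (or sorting by the group key first) avoids depending on row order.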

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

# When column names differ, map each left column to its right counterpart
right_alt = DataFrame.from_arrow(pa.table({"lookup_key": [2, 3, 4], "y": [200, 300, 400]}))
joined3 = left.join(right_alt, on={"key": "lookup_key"}, how="left")
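
The two forms of `on` can be thought of as the same thing: a set of (left column, right column) pairs. The plain-Python sketch below shows that idea with a toy inner join. `normalize_join_keys` and `inner_join` are hypothetical helpers written for this illustration, and dropping the right-hand key columns from the output is an assumption, not documented chalkdf behavior.

```python
def normalize_join_keys(on):
    """Return (left_column, right_column) pairs from a list or a mapping."""
    if isinstance(on, dict):
        return list(on.items())
    return [(name, name) for name in on]


def inner_join(left_rows, right_rows, on):
    """Toy nested-loop inner join over lists of dict rows."""
    pairs = normalize_join_keys(on)
    right_keys = {rk for _, rk in pairs}
    out = []
    for lrow in left_rows:
        for rrow in right_rows:
            if all(lrow[lk] == rrow[rk] for lk, rk in pairs):
                # Keep the left row plus the non-key columns of the right row.
                extra = {k: val for k, val in rrow.items() if k not in right_keys}
                out.append({**lrow, **extra})
    return out


left_rows = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right_rows = [{"lookup_key": 2, "y": 200}, {"lookup_key": 3, "y": 300}, {"lookup_key": 4, "y": 400}]

print(normalize_join_keys(["key"]))
print(inner_join(left_rows, right_rows, on={"key": "lookup_key"}))
```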

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview
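
If you start from a local path rather than a URI, the standard library can build the `file://` form for you. A minimal sketch, where `to_file_uri` is a hypothetical helper and `data/events.parquet` is a placeholder path:

```python
from pathlib import Path


def to_file_uri(path):
    """Convert a local path (relative or absolute) to an absolute file:// URI."""
    return Path(path).expanduser().resolve().as_uri()


uri = to_file_uri("data/events.parquet")  # placeholder path, need not exist
print(uri)
```

The resulting string can then be placed in the list passed to `DataFrame.scan`.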

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
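
To illustrate the general idea of a recorded, serializable plan, here is a toy model in plain Python. It is not how LazyFrame is implemented: `RecordedPlan` is a hypothetical class, and JSON stands in for the protobuf encoding. The point is only that each chained call appends a step, and that a plan restored from its serialized form compares equal to the original.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordedPlan:
    """Toy stand-in for a lazy plan: an immutable tuple of (op, args) steps."""

    steps: tuple = ()

    def _then(self, op, **args):
        # Store args as a canonical JSON string so steps compare structurally.
        step = (op, json.dumps(args, sort_keys=True))
        return RecordedPlan(self.steps + (step,))

    def select(self, *columns):
        return self._then("select", columns=list(columns))

    def slice(self, offset, length):
        return self._then("slice", offset=offset, length=length)

    def to_json(self):
        return json.dumps([list(step) for step in self.steps])

    @classmethod
    def from_json(cls, payload):
        return cls(tuple(tuple(step) for step in json.loads(payload)))


plan = RecordedPlan().select("c1", "c2").slice(0, length=10)
restored = RecordedPlan.from_json(plan.to_json())
print(plan == restored)  # True
```

Nothing is executed while the plan is being built; execution would happen only once the recorded steps are handed to an engine.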

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed
Testing.assert_frame_equal(left, right, check_row_order=False, check_column_order=False)
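
For intuition about what relaxed comparisons typically mean, the sketch below models absolute and relative tolerance and order-insensitive row comparison in plain Python. It assumes the NumPy-style rule `|actual - expected| <= atol + rtol * |expected|`; this README does not state the exact formula chalkdf uses, so treat it as an approximation. `values_close` and `rows_equal_unordered` are hypothetical helpers.

```python
def values_close(actual, expected, atol=0.0, rtol=0.0):
    """NumPy-style closeness check (an assumed convention, see above)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def rows_equal_unordered(left_rows, right_rows):
    """Compare two lists of row tuples while ignoring row order."""
    return sorted(left_rows) == sorted(right_rows)


print(values_close(1.000001, 1.0, atol=1e-5))  # True: difference is about 1e-6
print(values_close(1.000001, 1.0))             # False: zero tolerance
print(rows_equal_unordered([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))  # True
```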

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.
