DataFrame utilities for Chalk AI, backed by libchalk

chalkdf

DataFrame utilities for building fast, portable data pipelines on top of Apache Arrow — powered by Chalk’s libchalk execution engine.

The API centers on two concepts:

  • chalkdf.DataFrame: a lightweight, eager plan that you can materialize to Arrow.
  • chalkdf.LazyFrame: a serializable, chainable description of a plan that can be round-tripped to/from a protobuf and converted to a DataFrame for execution.

You build expressions with Chalk’s Python underscore DSL (from chalk import _) and function registry (import chalk.functions as F). We also ship pragmatic testing helpers (chalkdf.Testing) to make it easy to compare results in your unit tests.

Installation

  • Requires Python 3.10–3.13
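  • Prebuilt wheels are published for Linux (x86-64, glibc 2.36+) and macOS 15+ (ARM64); no source distribution is available for this release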

Install from PyPI:

pip install chalkdf chalkpy

Quickstart

Below are minimal, self‑contained examples that mirror the public API used in the tests.

Create from Arrow and select columns

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})
df = DataFrame.from_arrow(tbl)

out = df.select("c", "a")
print(out.run())
# ┌─────┬─────┐
# │  c  │  a  │
# ├─────┼─────┤
# │ 100 │  1  │
# │ 200 │  2  │
# │ 300 │  3  │
# └─────┴─────┘

Add or replace columns with expressions

import pyarrow as pa
import chalk.functions as F
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3], pa.int64()), "b": pa.array([10, 20, 30], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.with_columns(
    {
        "sum": _.a + _.b,
        "flag": F.if_then_else(_.a < 2, "small", "big"),
    }
)
print(out.run())
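
The example above does not show its output. As a rough cross-check, the same two expressions can be evaluated by hand in plain Python. This is only a model of the expected values, not chalkdf code, and it says nothing about the column order or display format of the real frame.

```python
# Plain-Python model of the two expressions above; not chalkdf code.
a = [1, 2, 3]
b = [10, 20, 30]

# "sum": _.a + _.b, evaluated row by row
sum_col = [x + y for x, y in zip(a, b)]

# "flag": F.if_then_else(_.a < 2, "small", "big")
flag_col = ["small" if x < 2 else "big" for x in a]

print(sum_col)   # [11, 22, 33]
print(flag_col)  # ['small', 'big', 'big']
```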

Filter, slice, and order

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"a": pa.array([1, 2, 3, 4, 5], pa.int64()), "b": pa.array([10, 20, 30, 40, 50], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.filter(_.a > 2).slice(1, 2).order_by(("b", "descending"))
print(out.run())
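
Because the operations apply in the order they are chained, it can help to trace the pipeline by hand. The sketch below does so in plain Python. It assumes that slice(1, 2) means "skip 1 row, then take 2" (offset, length) and that filtering preserves input order; this README does not confirm either assumption.

```python
# Plain-Python trace of filter -> slice -> order_by; not chalkdf code.
rows = list(zip([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # (a, b) pairs

# filter(_.a > 2)
filtered = [r for r in rows if r[0] > 2]

# slice(1, 2), assumed to mean offset=1, length=2
offset, length = 1, 2
sliced = filtered[offset:offset + length]

# order_by(("b", "descending"))
ordered = sorted(sliced, key=lambda r: r[1], reverse=True)

print(ordered)  # [(5, 50), (4, 40)]
```

Reordering the chain changes the result: slicing before filtering would select different rows, so the slice position matters.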

Rename and explode

import pyarrow as pa
from chalkdf import DataFrame

tbl = pa.table({"id": [1, 2], "vals": pa.array([[1, 2], []], type=pa.list_(pa.int64()))})
df = DataFrame.from_arrow(tbl)

renamed = df.rename({"vals": "values"})
exploded = renamed.explode("values")
print(exploded.run())  # rows (1, 1) and (1, 2); the empty list for id=2 yields no rows
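
To make the explode semantics concrete, here is a plain-Python model of the example above. It is an illustration of the documented result (one output row per list element), not chalkdf's implementation.

```python
# Plain-Python model of explode; not chalkdf code.
rows = [{"id": 1, "values": [1, 2]}, {"id": 2, "values": []}]

exploded = [
    {"id": row["id"], "values": v}
    for row in rows
    for v in row["values"]  # an empty list contributes no iterations
]

print(exploded)  # [{'id': 1, 'values': 1}, {'id': 1, 'values': 2}]
```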

Group by and aggregate

import pyarrow as pa
from chalk import _
from chalkdf import DataFrame

tbl = pa.table({"g": [1, 1, 2, 2, 2], "v": pa.array([10, 20, 1, 2, 3], pa.int64())})
df = DataFrame.from_arrow(tbl)

out = df.agg(["g"], _.v.sum().alias("v_sum"))
print(out.run())  # rows may be unordered
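
Since the grouped rows may come back in any order, a test that compares them should normalize the order first. The following plain-Python sketch computes the expected sums for the data above and sorts by group key before comparing. It models the aggregation only and does not call chalkdf.

```python
from collections import defaultdict

# Plain-Python model of agg(["g"], _.v.sum()); not chalkdf code.
g = [1, 1, 2, 2, 2]
v = [10, 20, 1, 2, 3]

totals = defaultdict(int)
for key, value in zip(g, v):
    totals[key] += value

# Sort by group key so the comparison does not depend on output order
result = sorted(totals.items())

print(result)  # [(1, 30), (2, 6)]
```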

Joins

Use a list of join keys when both sides share column names. When names differ, pass a mapping of left columns to their right-hand counterparts.

import pyarrow as pa
from chalkdf import DataFrame

left = DataFrame.from_arrow(pa.table({"key": [1, 2, 3], "x": [10, 20, 30]}))
right = DataFrame.from_arrow(pa.table({"key": [2, 3, 4], "y": [200, 300, 400]}))

joined = left.join(right, on=["key"], how="inner").select("key", "x", "y")
print(joined.run())

# Or with an explicit mapping of left->right keys
joined2 = left.join(right, on={"key": "key"}, how="inner")

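# When column names differ, map each left column to its right counterpart
right_lookup = DataFrame.from_arrow(pa.table({"lookup_key": [2, 3, 4], "y": [200, 300, 400]}))
joined3 = left.join(right_lookup, on={"key": "lookup_key"}, how="left")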
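
As an illustration of what the left-to-right key mapping means, here is a small plain-Python model of inner and left joins. join_rows is a hypothetical helper written for this sketch. It assumes the right-hand key column is dropped from the output and that unmatched left rows get None for right-hand values; chalkdf's actual output columns and null handling may differ.

```python
# Plain-Python model of a keyed join; join_rows is a hypothetical helper.
left = [{"key": 1, "x": 10}, {"key": 2, "x": 20}, {"key": 3, "x": 30}]
right = [
    {"lookup_key": 2, "y": 200},
    {"lookup_key": 3, "y": 300},
    {"lookup_key": 4, "y": 400},
]

def join_rows(left_rows, right_rows, on, how="inner"):
    """`on` maps left column names to right column names."""
    right_key_cols = set(on.values())
    value_cols = [c for c in right_rows[0] if c not in right_key_cols]
    out = []
    for lrow in left_rows:
        matches = [
            rrow for rrow in right_rows
            if all(lrow[lc] == rrow[rc] for lc, rc in on.items())
        ]
        if matches:
            for rrow in matches:
                out.append({**lrow, **{c: rrow[c] for c in value_cols}})
        elif how == "left":
            # Keep the unmatched left row, filling right-hand values with None
            out.append({**lrow, **{c: None for c in value_cols}})
    return out

inner = join_rows(left, right, on={"key": "lookup_key"}, how="inner")
left_join = join_rows(left, right, on={"key": "lookup_key"}, how="left")

print(inner)         # [{'key': 2, 'x': 20, 'y': 200}, {'key': 3, 'x': 30, 'y': 300}]
print(left_join[0])  # {'key': 1, 'x': 10, 'y': None}
```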

Scan Parquet files

You can construct a DataFrame that scans one or more Parquet files without loading them eagerly. Use local file:// URIs (or remote URIs when running in an environment with appropriate access):

from chalkdf import DataFrame

df = DataFrame.scan(
    ["file:///path/to/data.parquet"],
    name="my_table",
)

print(df.run())  # materializes and prints a preview
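
The scan example passes file:// URIs rather than bare paths. One way to build such a URI from a local path is the standard library's pathlib, sketched below. The file name is a placeholder, and the resulting string is simply what you would place in the list passed to DataFrame.scan.

```python
from pathlib import Path

# Placeholder file name; as_uri() requires an absolute path, hence resolve()
parquet_path = Path("data.parquet").resolve()
uri = parquet_path.as_uri()

print(uri)  # e.g. file:///home/user/project/data.parquet
```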

Lazy plans (experimental)

LazyFrame records a chain of operations and can round‑trip to a protobuf for transport or persistence. You can reconstruct the same lazy plan later and convert it to a DataFrame to execute.

from chalk import _
from chalkdf import LazyFrame

lf = (
    LazyFrame.from_dict({"c1": [1, 2, 3], "c2": [4, 5, 6]})
    .select("c1", "c2")
    .slice(0, length=10)
    .filter(_.c1 > 0)
)

proto = lf.to_proto()
lf2 = LazyFrame.from_proto(proto)
assert lf == lf2  # structural equality of the recorded plan

df = lf2._convert_to_df()  # convert to a DataFrame for execution
print(df.run())

Notes:

  • LazyFrame._convert_to_df() is currently a private/experimental helper.
  • For named tables, use LazyFrame.named_table("table_name", schema) as a root.
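
The idea of a recorded plan that survives a serialization round trip can be sketched without chalkdf. The Plan class below is a hypothetical stand-in that stores an ordered sequence of steps and serializes to JSON. LazyFrame uses protobuf instead, so this is an analogy for the concept, not a description of its wire format or internals.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    """Hypothetical stand-in for a recorded plan: an ordered tuple of (op, args)."""
    steps: tuple = ()

    def then(self, op, *args):
        # Return a new plan with one more step; the original is unchanged
        return Plan(self.steps + ((op, args),))

    def to_json(self):
        return json.dumps([[op, list(args)] for op, args in self.steps])

    @classmethod
    def from_json(cls, payload):
        return cls(tuple((op, tuple(args)) for op, args in json.loads(payload)))

plan = Plan().then("select", "c1", "c2").then("slice", 0, 10).then("filter", "c1 > 0")
restored = Plan.from_json(plan.to_json())

# Equality compares the recorded steps, not any computed results
print(plan == restored)  # True
```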

Testing helpers

Use chalkdf.Testing.assert_frame_equal to compare DataFrame results in unit tests. It materializes both frames to Arrow and supports relaxed comparisons.

import pyarrow as pa
from chalkdf import Testing, DataFrame

left = DataFrame.from_arrow(pa.table({"a": [1.000001], "b": [2.0]}))
right = DataFrame.from_arrow(pa.table({"a": [1.0], "b": [2.0]}))

Testing.assert_frame_equal(left, right, atol=1e-5, rtol=0.0)

# Ignore row or column order when needed (tolerance kept so the 1e-6 difference in "a" still passes)
Testing.assert_frame_equal(left, right, atol=1e-5, check_row_order=False, check_column_order=False)
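
To show roughly what a relaxed comparison involves, here is a plain-Python sketch built on math.isclose. rows_close is a hypothetical helper, and the tolerance formula chalkdf applies may differ from math.isclose, so treat this as a model of the concept only.

```python
import math

def rows_close(left_rows, right_rows, atol=0.0, rtol=0.0, check_row_order=True):
    """Hypothetical helper: compare rows of floats within a tolerance."""
    if len(left_rows) != len(right_rows):
        return False
    if not check_row_order:
        # Normalize order on both sides before pairing rows up
        left_rows, right_rows = sorted(left_rows), sorted(right_rows)
    return all(
        len(lrow) == len(rrow)
        and all(
            math.isclose(x, y, rel_tol=rtol, abs_tol=atol)
            for x, y in zip(lrow, rrow)
        )
        for lrow, rrow in zip(left_rows, right_rows)
    )

print(rows_close([(1.000001, 2.0)], [(1.0, 2.0)], atol=1e-5))  # True
print(rows_close([(1.000001, 2.0)], [(1.0, 2.0)]))             # False
print(rows_close([(1.0, 2.0), (3.0, 4.0)], [(3.0, 4.0), (1.0, 2.0)],
                 check_row_order=False))                        # True
```

Sorting before pairing is a simplification: with values that differ only within the tolerance, sorted order can pair rows differently on each side, so a production comparison would need a more careful matching strategy.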

Why chalkdf?

  • Built on Apache Arrow memory formats for zero‑copy interop.
  • Runs on Chalk’s native engine for performance and portability.
  • Small, predictable API surface suited for services and batch jobs.
