Value-level data diff for the lakehouse — compare Parquet, Iceberg, and Delta down to the cell, no Spark or warehouse

lake-sift

Value-level data diff for the lakehouse era.

lake-sift compares two datasets down to the individual cell, on a single node, with no Spark, no warehouse, and no framework lock-in. It diffs Parquet files, Iceberg snapshots, and Delta versions today, and lets you mix them freely through pluggable source adapters (see the roadmap).

$ lake-sift a.parquet b.parquet --key id
+1 added  -1 removed  ~1 changed row (1 cell)
- id=1, v='a'
+ id=4, v='d'
~ [id=3] v: 'c' → 'C'

It is a library with a thin CLI on top, so the same diff powers both an interactive review and a CI gate (via exit codes).
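To make the sample output above concrete, here is a minimal pure-Python sketch of a keyed row and cell diff. It is not lake-sift's implementation (the engine runs as DuckDB SQL); the function name `keyed_diff`, the row-dict representation, and the unchanged `id=2` row are assumptions made for illustration.

```python
from typing import Any

Row = dict[str, Any]


def keyed_diff(left: list[Row], right: list[Row], key: list[str]):
    """Return (added, removed, changed_cells) for two lists of row dicts."""

    def index(rows: list[Row]) -> dict[tuple, Row]:
        # Map each key tuple to its row, rejecting duplicate keys.
        out: dict[tuple, Row] = {}
        for row in rows:
            k = tuple(row[c] for c in key)
            if k in out:
                raise ValueError(f"duplicate key {k}")
            out[k] = row
        return out

    l_idx, r_idx = index(left), index(right)
    added = [r_idx[k] for k in r_idx if k not in l_idx]    # only on the right
    removed = [l_idx[k] for k in l_idx if k not in r_idx]  # only on the left
    changed = []
    for k, old_row in l_idx.items():
        new_row = r_idx.get(k)
        if new_row is None:
            continue
        for col, old in old_row.items():
            if col in key or col not in new_row:
                continue
            # Python's None == None is True, so two NULLs count as equal.
            if old != new_row[col]:
                changed.append((k, col, old, new_row[col]))
    return added, removed, changed


# Hypothetical contents of a.parquet and b.parquet from the sample run.
a = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
b = [{"id": 2, "v": "b"}, {"id": 3, "v": "C"}, {"id": 4, "v": "d"}]
added, removed, changed = keyed_diff(a, b, key=["id"])
```

With these inputs, `added` holds the `id=4` row, `removed` holds the `id=1` row, and `changed` holds a single entry for column `v` of `id=3`, matching the three lines of the sample output.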

Why lake-sift

Most data-diff tools are bound to something heavier — a warehouse, a transform framework, a JVM cluster, or a catalog. lake-sift deliberately stays small and unbound:

Existing tool	Bound to
Datafold / Recce	a warehouse + dbt workflow
SQLMesh `table_diff`	the SQLMesh framework
lakeFS `refs_data_diff`	lakeFS + Spark + a JAR
Iceberg changelog / Delta CDF	Spark/JVM, change tracking enabled up front
reladiff	DB connections (not files/snapshots)

lake-sift's niche: engine-neutral · single-node · framework-free · format-native · review-oriented output.

Features

Schema diff — added / removed columns, type changes.
Row diff — keys present only on one side (added / removed).
Cell diff — for shared keys, per-column old → new changes.
Single & composite keys, with duplicate-key detection.
NULL == NULL treated as equal (unlike default SQL semantics).
Column scoping — --columns (only these) / --exclude (skip these, e.g. updated_at).
Three output modes — human-readable color, machine-readable JSON, summary-only.
CI-friendly exit codes — 0 equal, 1 differences, 2 error.
Single-node engine — heavy comparison runs as DuckDB SQL; Python is a thin orchestrator.
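The cell-diff and NULL-handling features above can be expressed in SQL. The sketch below generates a query of that shape; it is an assumption about what such SQL could look like, not the query lake-sift actually emits. The helper names (`quote_ident`, `changed_cells_sql`) and the output column names are hypothetical. `IS DISTINCT FROM` is the standard NULL-safe inequality, and `read_parquet` is DuckDB's Parquet table function.

```python
def quote_ident(name: str) -> str:
    """Quote a column name as a SQL identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def changed_cells_sql(
    left_rel: str, right_rel: str, key: list[str], columns: list[str]
) -> str:
    """Build one SELECT per compared column and glue them with UNION ALL.

    Each SELECT joins the two relations on the key and keeps only rows
    where the column differs. IS DISTINCT FROM treats NULL vs NULL as
    "not different", unlike the plain <> operator.
    """
    join_on = " AND ".join(
        f"l.{quote_ident(k)} = r.{quote_ident(k)}" for k in key
    )
    key_cols = ", ".join(f"l.{quote_ident(k)}" for k in key)
    selects = []
    for col in columns:
        c = quote_ident(col)
        literal = "'" + col.replace("'", "''") + "'"
        selects.append(
            f"SELECT {key_cols}, {literal} AS column_name, "
            f"CAST(l.{c} AS VARCHAR) AS old_value, "
            f"CAST(r.{c} AS VARCHAR) AS new_value "
            f"FROM {left_rel} AS l JOIN {right_rel} AS r ON {join_on} "
            f"WHERE l.{c} IS DISTINCT FROM r.{c}"
        )
    return "\nUNION ALL\n".join(selects)


sql = changed_cells_sql(
    "read_parquet('a.parquet')", "read_parquet('b.parquet')",
    key=["id"], columns=["v"],
)
print(sql)
```

Casting both sides to `VARCHAR` is only there so that columns of different types can share one result set. A real engine would more likely keep typed values.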

Installation

pip install lake-sift             # core package (Parquet diffing)
pip install "lake-sift[iceberg]"  # with the Iceberg source (PyIceberg)
pip install "lake-sift[delta]"    # with the Delta source (delta-rs)

To work on lake-sift itself, install from source:

git clone https://github.com/JeonDaehong/lake-sift.git
cd lake-sift
pip install -e ".[dev]"

Requires Python 3.10+. The Iceberg and Delta sources are optional extras — Parquet diffing needs no extra dependencies.

Usage

Command line

# Compare two files by key
lake-sift a.parquet b.parquet --key id

# Composite key, exclude a volatile column, machine-readable output
lake-sift a.parquet b.parquet -k order_id,line_no -x updated_at --json

# As a CI gate: non-zero exit blocks the change when data differs
lake-sift prod.parquet pr.parquet -k id || echo "data change detected!"

Flags: --key/-k, --exclude/-x, --columns/-c, --json, --summary, --allow-duplicates, --tolerance/-t, --ignore-case/-i, --sample/-n, --top.
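The README does not spell out how `--tolerance` and `--ignore-case` change the comparison. One plausible reading, sketched below, is an absolute numeric tolerance plus a case-insensitive string comparison. The function `values_equal` and its parameters are hypothetical, and lake-sift may define tolerance differently (for example, as a relative value).

```python
from typing import Any


def values_equal(
    old: Any, new: Any, *, tolerance: float = 0.0, ignore_case: bool = False
) -> bool:
    """Compare two cell values under assumed tolerance / ignore-case rules."""
    # NULL handling: two NULLs are equal, NULL vs a value is a change.
    if old is None or new is None:
        return old is None and new is None
    # bool is a subclass of int in Python; compare booleans exactly.
    if isinstance(old, bool) or isinstance(new, bool):
        return old == new
    if isinstance(old, (int, float)) and isinstance(new, (int, float)):
        # Exact match first (covers infinities), then absolute tolerance.
        return old == new or abs(old - new) <= tolerance
    if ignore_case and isinstance(old, str) and isinstance(new, str):
        return old.casefold() == new.casefold()
    return old == new


print(values_equal(1.0, 1.0005, tolerance=0.001))
print(values_equal("c", "C", ignore_case=True))
```

Under these assumptions, the `'c'` to `'C'` change from the sample output would no longer be reported when case is ignored.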

Column projection (pushdown). When you narrow the comparison with --columns or --exclude, lake-sift reads only the key plus the compared columns from each source. The projection is pushed down to the scan, so Iceberg, Delta, and Parquet sources never materialize columns you don't compare. One consequence is that added and removed rows then show only those columns. Schema changes are still detected across the full schema (read from metadata), so a dropped or retyped column is reported even when it isn't compared. Without these flags, full rows are read and shown.
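The projection rule described above can be sketched as a small function that decides which columns to request from a source. This is an illustration rather than lake-sift's code: `projected_columns` is a hypothetical name, and applying `--columns` first and `--exclude` second is an assumption about how the two flags combine.

```python
from typing import Optional


def projected_columns(
    schema: list[str],
    key: list[str],
    columns: Optional[list[str]] = None,
    exclude: Optional[list[str]] = None,
) -> list[str]:
    """Return the columns to read: the key plus the compared columns."""
    requested = [*key, *(columns or []), *(exclude or [])]
    unknown = [c for c in requested if c not in schema]
    if unknown:
        raise KeyError(f"unknown columns: {unknown}")

    # Start from the explicit list if given, otherwise from everything.
    wanted = set(columns) if columns else set(schema)
    wanted -= set(exclude or [])
    wanted |= set(key)  # the key is always needed to align rows
    # Preserve the source's column order.
    return [c for c in schema if c in wanted]


schema = ["id", "v", "amount", "updated_at"]
print(projected_columns(schema, ["id"], exclude=["updated_at"]))
print(projected_columns(schema, ["id"], columns=["amount"]))
```

The resulting list is what would be handed to the scan, so excluded columns are never read from storage.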

Iceberg snapshots

Either operand may be an Iceberg table instead of a file, using the form iceberg:<catalog>/<namespace>.<table>[@<snapshot_id>]. Catalog connection details are read from PyIceberg's standard config (~/.pyiceberg.yaml or PYICEBERG_* environment variables) — lake-sift only references a catalog by name.

# Diff two snapshots of the same Iceberg table (audit a change)
lake-sift "iceberg:prod/sales.orders@1001" "iceberg:prod/sales.orders@1042" -k order_id

# Mix sources freely: validate a Parquet export against the live table
lake-sift export.parquet "iceberg:prod/sales.orders" -k order_id

Requires the iceberg extra (pip install "lake-sift[iceberg]"). For finer control (row filters, field projection, an already-loaded table) use IcebergSource from the Python API.
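For readers scripting around the CLI, the operand form `iceberg:<catalog>/<namespace>.<table>[@<snapshot_id>]` can be split into its parts as sketched below. The parser is hypothetical (`IcebergRef` and `parse_iceberg_operand` are not part of lake-sift's documented API), and it assumes snapshot IDs are written as decimal integers.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IcebergRef:
    catalog: str
    identifier: str              # "<namespace>.<table>"
    snapshot_id: Optional[int]   # None means the current snapshot


_OPERAND = re.compile(
    r"^iceberg:(?P<catalog>[^/@]+)/(?P<identifier>[^@]+?)(?:@(?P<snapshot>\d+))?$"
)


def parse_iceberg_operand(operand: str) -> IcebergRef:
    """Split an iceberg: operand into catalog, identifier, and snapshot."""
    m = _OPERAND.match(operand)
    if m is None or "." not in m["identifier"]:
        raise ValueError(f"not a valid iceberg operand: {operand!r}")
    snapshot = m["snapshot"]
    return IcebergRef(
        catalog=m["catalog"],
        identifier=m["identifier"],
        snapshot_id=int(snapshot) if snapshot is not None else None,
    )


print(parse_iceberg_operand("iceberg:prod/sales.orders@1001"))
print(parse_iceberg_operand("iceberg:prod/sales.orders"))
```

The three fields line up with the arguments of `IcebergSource.from_catalog("prod", "sales.orders", snapshot_id=1001)` shown in the Python API section.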

Delta tables

Either operand may be a Delta Lake table, using the form delta:<path-or-uri>[@<version>]. The path is a local directory or any URI delta-rs understands (s3://, abfs://, …); @<version> pins a table version for time travel.

# Diff two versions of the same Delta table (audit a change)
lake-sift "delta:/data/sales@11" "delta:/data/sales@12" -k order_id

# Mix sources freely: validate a Parquet export against a cloud Delta table
lake-sift export.parquet "delta:s3://lake/sales" -k order_id

Requires the delta extra (pip install "lake-sift[delta]"). For finer control (column projection, predicate filters, storage credentials, an already-loaded table) use DeltaSource from the Python API.
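Parsing `delta:<path-or-uri>[@<version>]` has one subtlety: some URIs legitimately contain `@` (an `abfs://container@account...` URI, for instance), so only a trailing `@<digits>` should be read as a version. The sketch below shows one way to do that. `parse_delta_operand` is a hypothetical helper, not lake-sift's parser, and a directory whose name really ends in `@<digits>` would be misread by it.

```python
import re
from typing import Optional

_VERSION_SUFFIX = re.compile(r"@(\d+)$")


def parse_delta_operand(operand: str) -> tuple[str, Optional[int]]:
    """Split a delta: operand into (path_or_uri, version or None)."""
    prefix = "delta:"
    if not operand.startswith(prefix):
        raise ValueError(f"not a delta operand: {operand!r}")
    rest = operand[len(prefix):]
    m = _VERSION_SUFFIX.search(rest)
    if m is not None:
        # Only the final @<digits> is treated as the version pin.
        path, version = rest[: m.start()], int(m.group(1))
    else:
        path, version = rest, None
    if not path:
        raise ValueError(f"missing path in delta operand: {operand!r}")
    return path, version


print(parse_delta_operand("delta:/data/sales@12"))
print(parse_delta_operand("delta:s3://lake/sales"))
```

The returned pair corresponds to the positional path and the `version=` argument of `DeltaSource` as used in the Python API section.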

Python API

The CLI is a thin wrapper over the library — both share the same core.

from lakesift import diff, ParquetSource

result = diff(
    left=ParquetSource("a.parquet"),
    right=ParquetSource("b.parquet"),
    key=["id"],
    exclude=["updated_at"],
)

result.is_empty()      # True when there is no difference (the common CI check)
result.summary()       # {"added": 1, "removed": 1, "changed": 1, ...}
result.schema_changes  # [SchemaChange(...), ...]
result.added           # rows only on the right
result.removed         # rows only on the left
result.changed_cells   # [CellChange(key=..., column=..., old=..., new=...), ...]
result.to_json()

IcebergSource reads a snapshot through PyIceberg and accepts a loaded table directly, or loads one from a catalog — with optional snapshot pinning, row filter, and field projection pushed down to the scan:

from lakesift import diff, IcebergSource

left = IcebergSource.from_catalog("prod", "sales.orders", snapshot_id=1001)
right = IcebergSource.from_catalog(
    "prod", "sales.orders", snapshot_id=1042,
    row_filter="region = 'EU'",          # narrow the scan before diffing
)

with diff(left, right, key=["order_id"]) as result:
    print(result.summary())

DeltaSource reads a table through delta-rs and accepts a path/URI or an already-loaded DeltaTable, with optional version time travel, column projection, predicate filters, and storage credentials:

from lakesift import diff, DeltaSource

left = DeltaSource("/data/sales", version=11)
right = DeltaSource(
    "/data/sales", version=12,
    columns=["order_id", "amount", "status"],  # project before diffing
)

with diff(left, right, key=["order_id"]) as result:
    print(result.summary())

Exit codes

Code	Meaning
`0`	Identical — no differences
`1`	Differences found
`2`	Error — comparison not possible (missing key, unreadable input, duplicate keys, …)
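A CI job usually wants to tell "data differs" (1) apart from "comparison failed" (2), which a plain `||` in the shell does not do. The sketch below wraps a command and maps its return code onto the table above. `DiffExit` and `run_gate` are hypothetical names, and the demo uses a stand-in Python command so that it runs without lake-sift installed.

```python
import subprocess
import sys
from enum import IntEnum


class DiffExit(IntEnum):
    EQUAL = 0
    DIFFERENT = 1
    ERROR = 2


def run_gate(cmd: list[str]) -> DiffExit:
    """Run a diff command and classify its exit code."""
    completed = subprocess.run(cmd, capture_output=True, text=True)
    try:
        return DiffExit(completed.returncode)
    except ValueError:
        # Any code outside 0/1/2 (e.g. killed by a signal) counts as an error.
        return DiffExit.ERROR


# Stand-in for: ["lake-sift", "prod.parquet", "pr.parquet", "-k", "id"]
stand_in = [sys.executable, "-c", "import sys; sys.exit(1)"]
outcome = run_gate(stand_in)
print(outcome.name)
```

A pipeline could then block the merge on `DiffExit.DIFFERENT` but retry or alert on `DiffExit.ERROR`.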

Roadmap

Version	Scope
v0.1	Parquet MVP — schema/row/cell diff, CLI, exit codes
v0.2	numeric tolerance, ignore-case, `--sample`, top-k changed columns
v0.3	Iceberg snapshot source (PyIceberg) — same core, new adapter
v0.4	Delta version source (delta-rs) — same core, new adapter (current)
v0.5	HTML report, GitHub Action
v1.0	stable API + documentation site

Non-goals

lake-sift does one thing — diff. It is intentionally not a catalog or version-control system (lakeFS, Nessie), a table-maintenance/optimization tool, a transformation framework (dbt, SQLMesh), or a monitoring/observability platform.

Project layout

lake-sift/
├── src/lakesift/
│   ├── core.py          # diff engine (DuckDB SQL generation/execution)
│   ├── result.py        # DiffResult, CellChange, SchemaChange
│   ├── sources/         # input adapters (parquet, iceberg, delta)
│   ├── render/          # human (color) and json renderers
│   └── cli.py           # typer CLI
└── tests/

Contributing

Issues and pull requests are welcome. To set up a development environment:

pip install -e ".[dev]"
pytest

Please keep changes within the documented v0 scope unless a roadmap item is being implemented.

License

MIT © JeonDaehong

This version

0.4.0

Jun 21, 2026
