# polars-ducklake
A native DuckLake reader for Polars — with no DuckDB runtime dependency.
DuckLake stores all metadata in a SQL catalog database and all data in Parquet files. `polars-ducklake` exploits that design directly: `scan_ducklake` is registered as a Polars IO plugin (`polars.io.plugins.register_io_source`) and returns a single `IO_SOURCE` LazyFrame node. Polars hands the deserialized predicate, projected columns, and any `n_rows` cap back to the reader, which prunes whole files using the catalog's min/max stats and then forwards what remains into per-file `pl.scan_parquet` calls — so projection pushdown, predicate pushdown, and streaming all still reach the Parquet engine.

Requires `polars >= 1.20`.
## Install

```bash
pip install polars-ducklake
```

Optional extras for non-SQLite catalogs:

```bash
pip install "polars-ducklake[postgres]"        # Postgres catalog
pip install "polars-ducklake[mysql]"           # MySQL catalog (read-only; see below)
pip install "polars-ducklake[duckdb-catalog]"  # DuckDB-as-catalog
```
## Catalog backends

Each backend is verified by an end-to-end integration test: the canonical DuckDB `ducklake`-extension writer creates a real lake against the backend, then this package reads it back and checks the results.
| Backend | Status | Notes |
|---|---|---|
| SQLite | Verified | No extra dependency; stdlib only. |
| PostgreSQL | Verified | Requires `[postgres]`; tested against PG 17. |
| DuckDB-as-catalog | Verified | Requires `[duckdb-catalog]` (`duckdb-engine`). |
| MySQL | Reader verified | `[mysql]` installs `pymysql`. Every read behavior (round-trip, time travel, deletes, schema evolution, partial files) is verified against MySQL 8.0 in the cross-backend test matrix. The end-to-end writer round-trip via the DuckDB `ducklake`+`mysql` extension is currently unstable upstream — until that's resolved, populate MySQL-backed catalogs with another writer. |
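
For reference, hedged SQLAlchemy URL shapes for each backend (illustrative only; the driver segments match what the extras install, but adjust hosts, ports, and credentials to your environment):

```python
"sqlite:///metadata.db"                    # SQLite (stdlib)
"postgresql+psycopg://user@host:5432/db"   # PostgreSQL ([postgres])
"mysql+pymysql://user@host:3306/db"        # MySQL ([mysql]; read path only)
"duckdb:///catalog.duckdb"                 # DuckDB-as-catalog ([duckdb-catalog])
```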
## Quickstart

```python
import polars as pl
import polars_ducklake as pdl

# `scan_ducklake(metadata_catalog, table, ...)` — the first argument identifies
# the metadata catalog (the SQL database hosting DuckLake's bookkeeping
# tables), the second names the table.
lf = pdl.scan_ducklake(
    "sqlite:///metadata.db",
    "sales",
)
print(lf.filter(pl.col("region") == "us").select("amount").collect())
```
Tables can be addressed by an unqualified name (the default schema `main` is assumed) or a fully-qualified `schema.table` form:

```python
lf = pdl.scan_ducklake("sqlite:///metadata.db", "analytics.events")
# equivalent to:
lf = pdl.scan_ducklake("sqlite:///metadata.db", "events", schema="analytics")
```
The DuckLake-native connection string form is also supported:

```python
lf = pdl.scan_ducklake("ducklake:sqlite:metadata.db", "sales")
```
You can also pass a pre-built SQLAlchemy `Engine` — useful when you need connection-pool options, SSL settings, or a custom driver:

```python
import sqlalchemy

engine = sqlalchemy.create_engine(
    "postgresql+psycopg://user@host/db?application_name=my-app",
    pool_size=10,
)
lf = pdl.scan_ducklake(engine, "analytics.sales")
```
## Time travel

```python
from datetime import datetime, timedelta

# By snapshot id
lf = pdl.scan_ducklake(engine, table="sales", snapshot_id=42)

# By timestamp (resolves to the latest snapshot at or before the given time)
lf = pdl.scan_ducklake(
    engine,
    table="sales",
    as_of=datetime.now() - timedelta(days=7),
)
```
## Object storage

`storage_options` is passed through to `pl.scan_parquet`:

```python
lf = pdl.scan_ducklake(
    "postgresql+psycopg://user@host/db",
    table="sales",
    storage_options={"aws_region": "us-east-1"},
)
```
## What works

- Append-only reads, with multi-snapshot time travel by `snapshot_id` or `as_of` timestamp.
- Tables that have seen `DELETE` or `UPDATE` (DuckLake represents `UPDATE` as delete + insert).
- Schema evolution: `ADD COLUMN` (older files null-fill at the target's dtype), `DROP COLUMN` (the column disappears), and `RENAME COLUMN` (older files' physical names are translated through the catalog's stable `column_id`).
- Partitioned tables — identity transforms (the partition column is written into each Parquet file) and non-identity transforms (`year`, `month`, `bucket(N)`, etc., where the source column is in the Parquet file) both read correctly.
- Compacted lakes — files merged across snapshots (`partial_max IS NOT NULL`) get a per-row snapshot filter applied for time-travel reads, so pinning to an older snapshot returns only rows whose origin was at or before that snapshot.
- Catalog-stats file pruning — supported predicate shapes (a flat AND of `eq`/`ne`/`lt`/`le`/`gt`/`ge` against a literal, plus `is_null`/`is_not_null`) are matched against `ducklake_file_column_stats`, so files whose per-column min/max provably can't satisfy the filter are dropped before any Parquet I/O. Pruning is conservative: unrecognized predicate shapes (`OR`, `is_in`, arithmetic on a column, NaN-tainted floats, missing stats) keep the file, and Polars handles the filter post-yield (see the example after this list).
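
A quick illustration of the pruning contract (column names as in the Quickstart):

```python
import polars as pl
import polars_ducklake as pdl

lf = pdl.scan_ducklake("sqlite:///metadata.db", "sales")

# A flat AND of comparisons against literals matches the supported shape,
# so files whose catalog min/max rule out either condition are never opened.
prunable = lf.filter((pl.col("amount") > 1000) & (pl.col("region") == "us"))

# An OR is not a recognized shape: every file is kept, and Polars applies
# the full predicate after each file is yielded. Correct, just less pruned.
not_pruned = lf.filter((pl.col("amount") > 1000) | (pl.col("region") == "eu"))
```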
The returned LazyFrame is a single `IO_SOURCE` node — `.explain()` shows `PYTHON SCAN` rather than a Parquet `SCAN`. `lf.filter(...)` and `lf.select(...)` work exactly as they do on any other Polars source; the predicate and projection are deserialized and pushed back into the reader, which forwards them to per-file `pl.scan_parquet` calls.
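
For example (the plan text shown is illustrative and varies by Polars version):

```python
lf = pdl.scan_ducklake("sqlite:///metadata.db", "sales")
print(lf.filter(pl.col("region") == "us").select("amount").explain())
# A single Python-scan node rather than a native Parquet scan, roughly:
#   PYTHON SCAN []
#   PROJECT 1/2 COLUMNS
```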
## Current limitations

The following are not yet supported and will raise `NotImplementedError` rather than silently produce wrong results:

- No writes. Read-only.
- No inlined data. Tables with rows stored directly in the catalog DB (small writes under DuckLake's inlining threshold — default 10 rows, on by default in the DuckDB writer) are detected and refused. `scan_ducklake` is designed for the large-dataset path; for tiny interactive lakes, either set `DATA_INLINING_ROW_LIMIT=0` on the catalog or run the writer's inline-flush command.
Known upstream issue: Polars' native Parquet reader currently panics on the Arrow MAP logical type produced by DuckDB's writer ("MapArray expects DataType::Struct as its inner logical type"). Our catalog mapping for MAP columns is correct (assembled as `List(Struct{key, value})`), and reads work when the on-disk Parquet uses that shape directly. Until the upstream Polars fix lands, a workaround for DuckDB-MAP Parquet is to read those columns via PyArrow.
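
A hedged sketch of that workaround (the path is illustrative, and exact conversion behavior depends on your PyArrow/Polars versions):

```python
import pyarrow.parquet as pq
import polars as pl

# PyArrow reads DuckDB's MAP encoding directly; converting the table hands
# Polars Arrow maps, which it materializes as List(Struct{key, value}).
table = pq.read_table("part-000.parquet")
df = pl.from_arrow(table)
```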
## How it works

`scan_ducklake` is split into eager identity resolution and a deferred plan that Polars drives through the IO-plugin contract.

Eagerly, at call time:

- Normalize the catalog argument to a SQLAlchemy `Engine`.
- Resolve the target snapshot (latest, by `snapshot_id`, or by `as_of` — see the sketch after this list).
- Look up the `schema_id`/`table_id` from `ducklake_schema`/`ducklake_table` using DuckLake's MVCC visibility filter, and refuse inlined-data tables. These raise `LookupError`/`NotImplementedError` directly from the call rather than being wrapped in a `ComputeError` at `.collect()` time.
- Register an IO source via `polars.io.plugins.register_io_source(..., is_pure=True)` and return its `LazyFrame`. Heavier catalog work is deferred to a memoized `_PlanContext`.
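
As a sketch, `as_of` resolution reduces to a single catalog query of this shape (against the spec's `ducklake_snapshot` table; the code is illustrative, not the package's internals):

```python
import sqlalchemy as sa

# Latest snapshot at or before the requested timestamp; `engine` is a
# SQLAlchemy Engine and `as_of` a datetime, as in the examples above.
stmt = sa.text(
    "SELECT MAX(snapshot_id) FROM ducklake_snapshot "
    "WHERE snapshot_time <= :as_of"
)
with engine.connect() as conn:
    snapshot_id = conn.execute(stmt, {"as_of": as_of}).scalar_one()
```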
Lazily, when Polars drives the read:

- Read the active column set at the target snapshot (renames, adds, drops applied), the data files, the positional-delete files, and the lake-wide `data_path`.
- Take the pushed predicate from Polars and walk its serialized form to extract a flat AND of supported leaves; match those against `ducklake_file_column_stats` and drop files whose per-column min/max can't satisfy them. Anything unrecognized falls through and keeps the file (false negatives, never false positives).
- Plan each surviving data file individually:
  - Schema evolution: query `ducklake_column` at the file's `begin_snapshot` and translate physical names through stable `column_id`s (`pl.col(old_name).alias(new_name)`); null-fill columns added after the file was written.
  - Deletes: per-file anti-join on the delete file's `pos`.
  - Compacted files: when the target is older than `partial_max`, filter rows by the writer-emitted `_ducklake_internal_snapshot_id`.
- Yield each file's `pl.scan_parquet` LazyFrame to Polars, with the pushed `with_columns`/predicate/`n_rows` applied (see the sketch after this list).
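
The IO-plugin contract that drives all of this looks roughly like the following. This is a minimal, self-contained sketch: the hard-coded file list and schema stand in for the catalog-driven planning described above, and only the pushdown plumbing is real:

```python
from typing import Iterator

import polars as pl
from polars.io.plugins import register_io_source


def _source(
    with_columns: list[str] | None,
    predicate: pl.Expr | None,
    n_rows: int | None,
    batch_size: int | None,
) -> Iterator[pl.DataFrame]:
    # Real code would plan per-file renames, null-fills, delete anti-joins,
    # and snapshot filters here before applying the pushed-down pieces.
    for path in ["part-000.parquet"]:
        lf = pl.scan_parquet(path)
        if with_columns is not None:
            lf = lf.select(with_columns)
        if predicate is not None:
            lf = lf.filter(predicate)
        if n_rows is not None:
            lf = lf.limit(n_rows)
        yield lf.collect()


schema = {"region": pl.String, "amount": pl.Float64}  # from the catalog
lf = register_io_source(_source, schema=schema, is_pure=True)
```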
All catalog queries use parameterized SQL (no f-string interpolation) and the spec's MVCC clause:

```sql
WHERE :snapshot_id >= begin_snapshot
  AND (end_snapshot IS NULL OR :snapshot_id < end_snapshot)
```
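
In SQLAlchemy terms, each such lookup is a parameterized statement of this shape (table and column names from the DuckLake spec; `engine` as above):

```python
import sqlalchemy as sa

stmt = sa.text(
    "SELECT table_id, table_name FROM ducklake_table "
    "WHERE :snapshot_id >= begin_snapshot "
    "AND (end_snapshot IS NULL OR :snapshot_id < end_snapshot)"
)
with engine.connect() as conn:
    visible_tables = conn.execute(stmt, {"snapshot_id": 42}).fetchall()
```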
## References

- DuckLake specification: https://ducklake.select/docs/stable/specification/introduction
- DuckLake queries (canonical SQL pattern): https://ducklake.select/docs/stable/specification/queries
- DuckLake table specs: https://ducklake.select/docs/stable/specification/tables/overview
- Polars `scan_parquet`: https://docs.pola.rs/api/python/stable/reference/api/polars.scan_parquet.html
## Development

```bash
# Set up the venv with all dev / catalog dependencies pinned by uv.lock.
uv sync --group dev

# Bring up the project-local docker stack (MinIO + Postgres + MySQL).
# Required only for the integration suite — unit tests run without it.
docker compose up -d

# Run everything.
uv run pytest

# Run only the fast unit tests (no docker needed).
uv run pytest --ignore=tests/integration

# Run only the integration tests.
uv run pytest tests/integration -m integration

# Tear the docker stack down (and wipe its volumes).
docker compose down -v
```

The docker-compose stack uses non-default ports (MinIO 19000/19001, Postgres 15432, MySQL 13306) and a project-prefixed volume namespace, so it is fully isolated from anything else you have running locally.
## License

MIT.