polars-ducklake
A native DuckLake reader for Polars — with no DuckDB runtime dependency.
DuckLake stores all metadata in a SQL catalog database and all data in Parquet
files. polars-ducklake exploits that design directly: it issues a few SELECT
queries against the catalog to resolve a list of Parquet paths, then hands them
to pl.scan_parquet so Polars can do all of its native optimization
(predicate pushdown, projection pushdown, streaming).
Install
```
pip install polars-ducklake
```
Optional extras for non-SQLite catalogs:
```
pip install "polars-ducklake[postgres]"        # Postgres catalog
pip install "polars-ducklake[mysql]"           # MySQL catalog (read-only; see below)
pip install "polars-ducklake[duckdb-catalog]"  # DuckDB-as-catalog
```
Catalog backends
Each backend is verified by an end-to-end integration test: the canonical DuckDB ducklake-extension writer creates a real lake against the backend, then this package reads it back and checks the results.
| Backend | Status | Notes |
|---|---|---|
| SQLite | Verified | No extra dependency; stdlib only. |
| PostgreSQL | Verified | Requires [postgres]; tested against PG 17. |
| DuckDB-as-catalog | Verified | Requires [duckdb-catalog] (duckdb-engine). |
| MySQL | Reader verified | [mysql] installs pymysql. Every read behavior (round-trip, time travel, deletes, schema evolution, partial files) is verified against MySQL 8.0 in the cross-backend test matrix. End-to-end writer round-trip via the DuckDB ducklake+mysql extension is currently unstable upstream — until that's resolved, populate MySQL-backed catalogs with another writer. |
| Anything else SQLAlchemy supports | Best-effort | Pass a pre-built Engine or a SQLAlchemy URL directly. |
Quickstart
```python
import polars as pl
import polars_ducklake as pdl

# scan_ducklake(metadata_catalog, table, ...): the first argument identifies
# the metadata catalog (the SQL database hosting DuckLake's bookkeeping
# tables); the second names the table.
lf = pdl.scan_ducklake(
    "sqlite:///metadata.db",
    "sales",
)

print(lf.filter(pl.col("region") == "us").select("amount").collect())
```
Tables can be addressed by an unqualified name (the default schema `main` is assumed) or a fully qualified `schema.table` form:

```python
lf = pdl.scan_ducklake("sqlite:///metadata.db", "analytics.events")
# equivalent to:
lf = pdl.scan_ducklake("sqlite:///metadata.db", "events", schema="analytics")
```
The DuckLake-native connection-string form is also supported:

```python
lf = pdl.scan_ducklake("ducklake:sqlite:metadata.db", "sales")
```
You can also pass a pre-built SQLAlchemy `Engine`, which is useful when you need connection-pool options, SSL settings, or a custom driver:

```python
import sqlalchemy

engine = sqlalchemy.create_engine(
    "postgresql+psycopg://user@host/db?application_name=my-app",
    pool_size=10,
)
lf = pdl.scan_ducklake(engine, "analytics.sales")
```
Time travel
```python
from datetime import datetime, timedelta

# By snapshot id
lf = pdl.scan_ducklake(engine, table="sales", snapshot_id=42)

# By timestamp (resolves to the latest snapshot at or before the given time)
lf = pdl.scan_ducklake(
    engine,
    table="sales",
    as_of=datetime.now() - timedelta(days=7),
)
```
Object storage
`storage_options` is passed through to `pl.scan_parquet`:

```python
lf = pdl.scan_ducklake(
    "postgresql+psycopg://user@host/db",
    table="sales",
    storage_options={"aws_region": "us-east-1"},
)
```
What works
- Append-only reads, with multi-snapshot time travel by `snapshot_id` or `as_of` timestamp.
- Tables that have seen `DELETE` or `UPDATE` (DuckLake represents `UPDATE` as delete + insert).
- Schema evolution: `ADD COLUMN` (older files null-fill at the target's dtype), `DROP COLUMN` (the column disappears), and `RENAME COLUMN` (older files' physical names are translated through the catalog's stable `column_id`).
- Partitioned tables: identity transforms (the partition column is written into each Parquet file) and non-identity transforms (`year`, `month`, `bucket(N)`, etc., where the source column is in the Parquet file) both read correctly.
- Compacted lakes: files merged across snapshots (`partial_max IS NOT NULL`) get a per-row snapshot filter applied for time-travel reads, so pinning to an older snapshot returns only rows whose origin was at or before that snapshot.
The returned LazyFrame is just a `pl.scan_parquet` (or a `pl.concat` of several when per-file work is needed). All predicate and projection pushdown is handled by the Polars query engine: `lf.filter(...)` and `lf.select(...)` work exactly as they do on any other Polars source.
Current limitations
The following are not yet supported and raise `NotImplementedError` rather than silently producing wrong results:

- No writes. Read-only.
- No inlined data. Tables with rows stored directly in the catalog DB (small writes under DuckLake's inlining threshold, 10 rows by default, enabled by default in the DuckDB writer) are detected and refused. `scan_ducklake` is designed for the large-dataset path; for tiny interactive lakes, either set `DATA_INLINING_ROW_LIMIT=0` on the catalog or run the writer's inline-flush command.
Known upstream issue: Polars' native Parquet reader currently panics on the Arrow MAP logical type produced by DuckDB's writer ("MapArray expects DataType::Struct as its inner logical type"). Our catalog mapping for MAP columns is correct (assembled as `List(Struct{key, value})`), and reads work when the on-disk Parquet uses that shape directly. A workaround for DuckDB MAP Parquet is to read those columns via PyArrow until the upstream Polars fix lands.
How it works
`scan_ducklake` runs roughly the following steps:

- Normalize the catalog argument to a SQLAlchemy `Engine`.
- Resolve the target snapshot (latest, by `snapshot_id`, or by `as_of`).
- Look up the `schema_id` / `table_id` from `ducklake_schema` / `ducklake_table` using DuckLake's MVCC visibility filter.
- Verify the table has no nested columns or inlined data at the target snapshot.
- Read the active column set at the target snapshot (renames, adds, and drops applied), the data files, the positional-delete files, and the lake-wide `data_path`.
- Plan each data file individually:
  - Schema evolution: query `ducklake_column` at the file's `begin_snapshot` and translate physical names through stable `column_id`s (`pl.col(old_name).alias(new_name)`); null-fill columns added after the file was written.
  - Deletes: per-file anti-join on the delete file's `pos`.
  - Compacted files: when the target is older than `partial_max`, filter rows by the writer-emitted `_ducklake_internal_snapshot_id`.
- Fast path: if every file's schema matches the target exactly and no file needs deletes or partial filtering, do a single `pl.scan_parquet(paths)` so Polars can optimize across files. Otherwise, concatenate the per-file `LazyFrame`s.
All catalog queries use parameterized SQL (no f-string interpolation) and the spec's MVCC clause:
```sql
WHERE :snapshot_id >= begin_snapshot
  AND (end_snapshot IS NULL OR :snapshot_id < end_snapshot)
```
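For illustration, the same clause run as a parameterized query through SQLAlchemy against a throwaway in-memory SQLite catalog (the table contents here are hypothetical):

```python
import sqlalchemy as sa

engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(sa.text(
        "CREATE TABLE ducklake_table (table_id INT, begin_snapshot INT, end_snapshot INT)"
    ))
    # Two versions of table 1: one retired at snapshot 5, one still live.
    conn.execute(sa.text("INSERT INTO ducklake_table VALUES (1, 0, 5), (1, 5, NULL)"))

    # Bound parameters, never f-string interpolation.
    rows = conn.execute(
        sa.text(
            "SELECT begin_snapshot FROM ducklake_table "
            "WHERE :snapshot_id >= begin_snapshot "
            "AND (end_snapshot IS NULL OR :snapshot_id < end_snapshot)"
        ),
        {"snapshot_id": 3},
    ).fetchall()
```

At snapshot 3, only the version retired at snapshot 5 is visible.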
References
- DuckLake specification: https://ducklake.select/docs/stable/specification/introduction
- DuckLake queries (canonical SQL pattern): https://ducklake.select/docs/stable/specification/queries
- DuckLake table specs: https://ducklake.select/docs/stable/specification/tables/overview
- Polars `scan_parquet`: https://docs.pola.rs/api/python/stable/reference/api/polars.scan_parquet.html
Development
```
# Set up the venv with all dev / catalog dependencies pinned by uv.lock
uv sync --group dev

# Bring up the project-local docker stack (MinIO + Postgres + MySQL).
# Required only for the integration suite; unit tests run without it.
docker compose up -d

# Run everything.
uv run pytest

# Run only the fast unit tests (no docker needed).
uv run pytest --ignore=tests/integration

# Run only the integration tests.
uv run pytest tests/integration -m integration

# Tear the docker stack down (and wipe its volumes).
docker compose down -v
```
The docker-compose stack uses non-default ports (MinIO 19000/19001, Postgres 15432, MySQL 13306) and a project-prefixed volume namespace, so it is fully isolated from anything else you have running locally.
License
MIT.