dplyr for Python: tidy piped verbs over polars and duckdb, with real autocompletion and dplyr-verified semantics.
dpyr
dplyr for Python. The tidyverse's verbs — filter, mutate, group_by,
summarize, joins, across, tidyselect — as Python method chains, executing
on polars or duckdb, with real IDE
autocompletion and semantics verified against dplyr itself.
pip install dpyr # or: uv add dpyr
from dpyr import read, col, n, desc
starwars = read("starwars.parquet") # read() takes anything tabular:
# .parquet/.csv/.arrow/.db paths, dicts,
# polars/pandas frames, arrow tables,
# Hugging Face datasets, numpy/torch/jax
(
    starwars
    .filter(col.height > 180, col.mass < 100)
    .mutate(bmi = col.mass / (col.height / 100) ** 2)
    .group_by(col.species)
    .summarize(
        n = n(),
        mean_bmi = col.bmi.mean(),
    )
    .arrange(desc(col.mean_bmi))
)
Evaluate that in a notebook and you see rows immediately. Mistype a column name
and you get the error on that line, with a did-you-mean suggestion. Wrap
the same code in a pipeline, call .collect() only at the end, and the whole
chain runs as one fused query with predicate pushdown. That combination
(schema-eager, data-lazy, display-eager) is the core design.
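To make the three properties concrete, here is a minimal standard-library sketch of the idea. It is not dpyr's implementation: `LazyFrame`, `ColumnError`, and `filter_gt` are hypothetical names invented for this illustration, and the did-you-mean hint is assumed to work like `difflib.get_close_matches`.

```python
import difflib


class ColumnError(Exception):
    """Raised when a verb references a column the schema does not have."""


class LazyFrame:
    """Toy frame: column names are checked eagerly, data work is deferred."""

    def __init__(self, rows, columns, steps=()):
        self._rows = rows              # list of dicts, untouched until collect()
        self.columns = list(columns)   # the schema is known up front
        self._steps = tuple(steps)     # deferred operations

    def _check(self, name):
        if name not in self.columns:
            hint = difflib.get_close_matches(name, self.columns, n=1)
            suffix = f" Did you mean {hint[0]!r}?" if hint else ""
            raise ColumnError(f"Unknown column {name!r}.{suffix}")

    def filter_gt(self, name, threshold):
        self._check(name)  # schema-eager: fails here, on the offending line

        def step(rows):
            return [r for r in rows if r[name] > threshold]

        return LazyFrame(self._rows, self.columns, self._steps + (step,))

    def collect(self):
        rows = self._rows
        for step in self._steps:  # data-lazy: the work happens only now
            rows = step(rows)
        return rows

    def __repr__(self):
        # display-eager: showing the frame in a REPL triggers evaluation
        return repr(self.collect())


frame = LazyFrame([{"height": 170}, {"height": 190}], ["height"])
tall = frame.filter_gt("height", 180)   # nothing is computed yet
```

Calling `frame.filter_gt("hieght", 180)` in this sketch raises `ColumnError` straight away with a suggestion of `'height'`, while `tall` does no row work until it is displayed or collected.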
Two backends, one semantics
import duckdb
from dpyr import read
df = read({"x": [1, 2, 3], "g": ["a", "a", "b"]}) # polars engine
con = duckdb.connect("warehouse.db")
tbl = read(con, "events") # SQL pushdown
Identical chains produce identical results on both engines — enforced by a
Hypothesis fuzzer that runs random verb chains on both and compares
bit-for-bit, and by differential tests against real dplyr: every spec in
tests/specs/ is executed by dplyr (via oracle/run_specs.R) to produce a
committed golden parquet, then replayed through dpyr on both backends. Where
R and the engines genuinely disagree, the decision is documented in
docs/SEMANTICS.md, not left to chance.
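The backend-agreement check can be pictured with a small standard-library sketch. It is a stand-in rather than the project's test suite: `sqlite3` plays the SQL engine, plain Python plays the in-memory engine, and a seeded `random.Random` replaces Hypothesis. The function names are hypothetical.

```python
import random
import sqlite3

ROWS = [(1, "a"), (2, "a"), (3, "b"), (4, "b"), (5, "c")]


def run_python(threshold):
    """In-memory engine: keep x > threshold, then sum x per group."""
    totals = {}
    for x, g in ROWS:
        if x > threshold:
            totals[g] = totals.get(g, 0) + x
    return sorted(totals.items())


def run_sql(threshold):
    """SQL engine (sqlite3 as a stand-in): the same query expressed as SQL."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (x INTEGER, g TEXT)")
    con.executemany("INSERT INTO t VALUES (?, ?)", ROWS)
    query = "SELECT g, SUM(x) FROM t WHERE x > ? GROUP BY g ORDER BY g"
    result = con.execute(query, (threshold,)).fetchall()
    con.close()
    return result


def check_agreement(trials=50, seed=0):
    """Feed random inputs to both engines and require identical output."""
    rng = random.Random(seed)
    for _ in range(trials):
        threshold = rng.randint(-1, 6)
        assert run_python(threshold) == run_sql(threshold), threshold
    return trials
```

A real differential suite would randomize the chain of verbs as well as the inputs; this sketch varies only one parameter to keep the comparison loop visible.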
The dplyr you know
| dplyr | dpyr |
|---|---|
| filter(df, height > 180) | df.filter(col.height > 180) |
| mutate(df, bmi = mass / h^2) | df.mutate(bmi = col.mass / col.h ** 2) |
| summarise(df, n = n(), m = mean(x, na.rm = TRUE)) | df.summarize(n = n(), m = col.x.mean()) |
| arrange(df, desc(mass)) | df.arrange(desc(col.mass)) |
| select(df, name, starts_with("h")) | df.select(col.name, starts_with("h")) |
| select(df, -mass) | df.select(-col.mass) |
| across(where(is.numeric), mean) | across(where(is_numeric), "mean") |
| left_join(a, b, by = "k") | a.left_join(b, on = col.k) |
| pivot_longer(df, x:y) | df.pivot_longer([col.x, col.y]) |
| if_else(), case_when(), n_distinct() | if_else(), case_when(), .n_unique() |
| lag(), lead(), row_number(), min_rank() | lag(), lead(), row_number(), min_rank() |
| cumsum(), dense_rank(), percent_rank() | cum_sum(), dense_rank(), percent_rank() |
| slice_min(x, n), slice_max(x, n) (ties kept) | slice_min(col.x, n), slice_max(col.x, n) |
| separate(), unite(), relocate() | separate(), unite(), relocate() |
| coalesce(), replace_na() | coalesce(), replace_na() |
Grouped mutate/filter are windowed per group, summarize peels one
grouping level, joins use .x/.y suffixes and match NAs by default —
the dplyr behaviors, deliberately.
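"Peels one grouping level" refers to dplyr's default of dropping the last grouping variable after a summary. The plain-Python sketch below illustrates that rule only; `summarize_sum` is a hypothetical helper, not part of dpyr.

```python
from collections import defaultdict


def summarize_sum(rows, groups, value):
    """Sum `value` per group and return (rows, remaining_groups).

    The last grouping variable is dropped from the result's grouping,
    so a second summary aggregates one level further up.
    """
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[g] for g in groups)
        totals[key] += row[value]
    out = []
    for key, total in totals.items():
        new_row = dict(zip(groups, key))
        new_row[value] = total
        out.append(new_row)
    return out, groups[:-1]


sales = [
    {"year": 2023, "region": "east", "amount": 10},
    {"year": 2023, "region": "east", "amount": 5},
    {"year": 2023, "region": "west", "amount": 7},
    {"year": 2024, "region": "east", "amount": 1},
]

# Grouped by (year, region): the result stays grouped by year only.
by_region, remaining = summarize_sum(sales, ["year", "region"], "amount")

# Summarizing again peels the last level, leaving an ungrouped result.
by_year, rest = summarize_sum(by_region, remaining, "amount")
```

Here `remaining` is `["year"]` after the first call and `rest` is empty after the second, which is the behavior the paragraph above describes.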
Autocompletion that actually works
- df.c.height is a frame-bound proxy: column names complete from the live schema, and the returned expression is typed (.mean() on numerics, .str_detect() on strings; calling .mean() on a string column raises immediately, at build time).
- df.filter(lambda c: c.height > 180) is the lambda style, for the same effect.
- dpyr stubgen data/*.parquet -o schemas.py generates typed schema modules so completion and type-checking work statically in any IDE.
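One common way to get schema-driven completion in Python is to override `__getattr__` and `__dir__`. The sketch below shows that general technique under assumed names (`ColumnProxy`, `Expr`); it is not taken from dpyr's source, and the dtype strings are invented for the example.

```python
class Expr:
    """A column expression that remembers its column's dtype."""

    def __init__(self, name, dtype):
        self.name = name
        self.dtype = dtype

    def mean(self):
        if self.dtype not in ("int", "float"):
            # Fails while the expression is built, before any data is read.
            raise TypeError(
                f"mean() needs a numeric column, got {self.name!r} ({self.dtype})"
            )
        return f"mean({self.name})"


class ColumnProxy:
    """Exposes columns as attributes, driven by a live schema."""

    def __init__(self, schema):
        self._schema = dict(schema)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        schema = self.__dict__.get("_schema", {})
        if name in schema:
            return Expr(name, schema[name])
        raise AttributeError(f"no column named {name!r}")

    def __dir__(self):
        # REPLs and notebooks call dir() to build their completion lists.
        return sorted(set(super().__dir__()) | set(self._schema))


c = ColumnProxy({"height": "int", "name": "str"})
height_mean = c.height.mean()   # allowed: numeric column
```

In this sketch `c.name.mean()` raises `TypeError` immediately, since the proxy already knows the column is a string.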
The database is a destination, not just a source
db = read("warehouse.db") # catalog object: db.tables, db.orders
gold = db.orders.group_by(col.region).summarize(rev = col.amount.sum())
gold.to_table("gold_revenue") # CREATE TABLE AS <sql>, fully in-engine
gold.to_view("gold_live") # the lazy plan as a named view
gold.write("gold.parquet") # in-engine COPY (extension dispatch)
mem = read({"region": ["east"], "target": [1000.0]})
gold.inner_join(mem, on = col.region) # in-memory frames bridge into duckdb
# automatically (arrow, zero-copy)
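The difference between writing a table and writing a view can be seen with plain SQL. The sketch below uses the standard library's `sqlite3` as a stand-in for duckdb and a hand-written query in place of a compiled plan, so the table and column names simply mirror the snippet above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 70.0)],
)

# The "lazy plan" is just SQL text until something materializes it.
plan_sql = "SELECT region, SUM(amount) AS rev FROM orders GROUP BY region"

# Materialize inside the engine: no rows pass through Python.
con.execute(f"CREATE TABLE gold_revenue AS {plan_sql}")

# Keep the plan itself as a named view; it re-runs on every read.
con.execute(f"CREATE VIEW gold_live AS {plan_sql}")

# New data arrives after both objects were created.
con.execute("INSERT INTO orders VALUES ('west', 30.0)")

snapshot = con.execute("SELECT * FROM gold_revenue ORDER BY region").fetchall()
live = con.execute("SELECT * FROM gold_live ORDER BY region").fetchall()
```

The table keeps the totals from the moment it was created, while the view reflects the later insert.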
Interactive by default, lazy when you need it
df.persist() # checkpoint: materialize now (duckdb: temp table)
df.lazy() # this frame never executes implicitly
dpyr.options.interactive = False # global opt-out for production pipelines
Results are cached by plan hash, so re-displaying a frame in a notebook never recomputes it.
Documentation
Full guides at maximerivest.github.io/dpyr — get started, grouped data, joins, window functions, column-wise operations, reshaping, expressions & autocompletion, and the backends guide (connecting and operating polars and duckdb).
Project documents
| Doc | What it pins down |
|---|---|
| docs/DESIGN.md | API design, the materialization model, autocompletion strategy, architecture |
| docs/SEMANTICS.md | Every deliberate decision where R, polars and duckdb disagree |
| docs/TESTING.md | dplyr-as-oracle goldens, backend-agreement fuzzing, Hypothesis properties |
| docs/ROADMAP.md | What shipped in 1.0 and what's next |
License
MIT © Maxime Rivest
File details
Details for the file dpyr-1.7.0.tar.gz.
File metadata
- Download URL: dpyr-1.7.0.tar.gz
- Upload date:
- Size: 134.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.26 {"installer":{"name":"uv","version":"0.9.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 06bde744eb586d8e5bcb4733c1dc007ee3b51a66443275e6a27ae90ec9d3f603 |
| MD5 | 31b85703cb9cb95a5e0532c65c9d0122 |
| BLAKE2b-256 | 4c48994dcfc55bfe66ad5aefe0bd05f98c4e11fe2ea051c5730d3131a8cae827 |
File details
Details for the file dpyr-1.7.0-py3-none-any.whl.
File metadata
- Download URL: dpyr-1.7.0-py3-none-any.whl
- Upload date:
- Size: 54.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.26 {"installer":{"name":"uv","version":"0.9.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"22.04","id":"jammy","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 10f2c75098384e3ba061d48d0a818019eb2860e7bbd75c3493092b4f65980b54 |
| MD5 | ec4e508e5971ab5656f7c59c0900c441 |
| BLAKE2b-256 | 9843f98838304291691e81de6f1d38f52e60e5b3d2de4fbacc6589bfbc8185b1 |