
dplyr for Python: tidy piped verbs over polars and duckdb, with real autocompletion and dplyr-verified semantics.


dpyr

dplyr for Python. The tidyverse's verbs — filter, mutate, group_by, summarize, joins, across, tidyselect — as Python method chains, executing on polars or duckdb, with real IDE autocompletion and semantics verified against dplyr itself.

pip install dpyr        # or: uv add dpyr

from dpyr import read_parquet, col, n, desc

starwars = read_parquet("starwars.parquet")

(
    starwars
    .filter(col.height > 180, col.mass < 100)
    .mutate(bmi = col.mass / (col.height / 100) ** 2)
    .group_by(col.species)
    .summarize(
        n = n(),
        mean_bmi = col.bmi.mean(),
    )
    .arrange(desc(col.mean_bmi))
)

Evaluate that in a notebook and you see rows immediately. Mistype a column name and you get the error on that line, with a did-you-mean suggestion. Wrap the same code in a pipeline, call .collect() only at the end, and the whole chain runs as one fused query with predicate pushdown. That combination (schema-eager, data-lazy, display-eager) is the core design.
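To make the three properties concrete, here is a minimal pure-Python sketch. ToyFrame and its methods are hypothetical and are not dpyr's implementation; the sketch only assumes that column names are known before any data is read.

```python
import difflib


class ToyFrame:
    """Hypothetical stand-in for a frame: schema known up front, rows computed on demand."""

    def __init__(self, rows, schema, plan=()):
        self._rows = rows      # source data: a list of dicts
        self.schema = schema   # column names, available without touching the data
        self._plan = plan      # tuple of pending (column, predicate) filters

    def filter(self, column, predicate):
        # Schema-eager: validate the column name now, on the line that used it.
        if column not in self.schema:
            hint = difflib.get_close_matches(column, self.schema, n=1)
            suffix = f" Did you mean {hint[0]!r}?" if hint else ""
            raise KeyError(f"Unknown column {column!r}.{suffix}")
        # Data-lazy: record the step instead of running it.
        return ToyFrame(self._rows, self.schema, self._plan + ((column, predicate),))

    def collect(self):
        # A single pass over the data applies every recorded filter.
        return [
            row for row in self._rows
            if all(pred(row[col_name]) for col_name, pred in self._plan)
        ]

    def __repr__(self):
        # Display-eager: showing the frame is what triggers execution.
        return f"ToyFrame({self.collect()})"
```

In this sketch, ToyFrame(rows, ["name", "height"]).filter("hieght", ...) fails on that call with a suggestion for "height", while a valid filter does no work until collect() or the notebook display runs it.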

Two backends, one semantics

import duckdb
from dpyr import from_duckdb, from_polars, from_dict

df  = from_dict({"x": [1, 2, 3], "g": ["a", "a", "b"]})   # polars engine
con = duckdb.connect("warehouse.db")
tbl = from_duckdb(con, "events")                          # SQL pushdown

Identical chains produce identical results on both engines, and two mechanisms enforce this. A Hypothesis fuzzer runs random verb chains on both backends and compares the results bit-for-bit. Differential tests check against real dplyr: every spec in tests/specs/ is executed by dplyr (via oracle/run_specs.R) to produce a committed golden parquet, then replayed through dpyr on both backends. Where R and the engines genuinely disagree, the decision is documented in docs/SEMANTICS.md, not left to chance.
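The backend-agreement idea can be sketched with the standard library alone. This is an illustration of differential testing, not the project's test suite: it swaps in plain Python and SQLite as the two engines and Python's random module in place of Hypothesis, and every function name below is hypothetical.

```python
import random
import sqlite3
from collections import defaultdict


def sum_by_group_python(rows):
    """First engine: group-wise sum in plain Python."""
    totals = defaultdict(int)
    for group, value in rows:
        totals[group] += value
    return sorted(totals.items())


def sum_by_group_sql(rows):
    """Second engine: the same query pushed down to SQLite."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (g TEXT, x INTEGER)")
    con.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = con.execute("SELECT g, SUM(x) FROM t GROUP BY g ORDER BY g").fetchall()
    con.close()
    return result


def random_rows(rng, max_rows=20):
    """Generate a small random table of (group, value) pairs."""
    return [
        (rng.choice("abc"), rng.randint(-100, 100))
        for _ in range(rng.randint(0, max_rows))
    ]


def check_agreement(trials=200, seed=0):
    """Run both engines on random inputs and fail on the first disagreement."""
    rng = random.Random(seed)
    for _ in range(trials):
        rows = random_rows(rng)
        assert sum_by_group_python(rows) == sum_by_group_sql(rows), rows
    return trials
```

A property-based tool such as Hypothesis adds input shrinking on top of this loop, so a failing case is reduced to a minimal counterexample.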

The dplyr you know

| dplyr | dpyr |
|---|---|
| filter(df, height > 180) | df.filter(col.height > 180) |
| mutate(df, bmi = mass / h^2) | df.mutate(bmi = col.mass / col.h ** 2) |
| summarise(df, n = n(), m = mean(x, na.rm = TRUE)) | df.summarize(n = n(), m = col.x.mean()) |
| arrange(df, desc(mass)) | df.arrange(desc(col.mass)) |
| select(df, name, starts_with("h")) | df.select(col.name, starts_with("h")) |
| select(df, -mass) | df.select(-col.mass) |
| across(where(is.numeric), mean) | across(where(is_numeric), "mean") |
| left_join(a, b, by = "k") | a.left_join(b, on = col.k) |
| pivot_longer(df, x:y) | df.pivot_longer([col.x, col.y]) |
| if_else(), case_when(), n_distinct() | if_else(), case_when(), .n_unique() |
| lag(), lead(), row_number(), min_rank() | lag(), lead(), row_number(), min_rank() |
| cumsum(), dense_rank(), percent_rank() | cum_sum(), dense_rank(), percent_rank() |
| slice_min(x, n), slice_max(x, n) (ties kept) | slice_min(col.x, n), slice_max(col.x, n) |
| separate(), unite(), relocate() | separate(), unite(), relocate() |
| coalesce(), replace_na() | coalesce(), replace_na() |

Grouped mutate/filter are windowed per group, summarize peels one grouping level, joins use .x/.y suffixes and match NAs by default — the dplyr behaviors, deliberately.
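For readers new to dplyr, two of these behaviors are easy to misread, so here is a small pure-Python sketch of them. The helper names are hypothetical and the code does not use dpyr. It shows a summarize that drops the last grouping column after aggregating, and a grouped mutate that computes a per-group value while keeping every row.

```python
from itertools import groupby


def _group_key(groups):
    """Build a key function returning the tuple of grouping values for a row."""
    def key(row):
        return tuple(row[g] for g in groups)
    return key


def summarize_sum(rows, groups, value):
    """Sum `value` within each group, then peel off the last grouping column."""
    key = _group_key(groups)
    out = []
    for group_values, members in groupby(sorted(rows, key=key), key=key):
        total = sum(m[value] for m in members)
        out.append({**dict(zip(groups, group_values)), value: total})
    return out, groups[:-1]   # the result stays grouped by the remaining columns


def grouped_mutate_share(rows, groups, value):
    """Add `share` = value / group total: a per-group window, row count unchanged."""
    key = _group_key(groups)
    totals = {}
    for row in rows:
        totals[key(row)] = totals.get(key(row), 0) + row[value]
    return [{**row, "share": row[value] / totals[key(row)]} for row in rows]
```

Calling summarize_sum on rows grouped by ["year", "region"] returns one row per year and region plus the remaining grouping ["year"]; feeding that result back in yields one row per year and an empty grouping.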

Autocompletion that actually works

  • df.c.height — frame-bound proxy: column names complete from the live schema, and the returned expression is typed (.mean() on numerics, .str_detect() on strings; calling .mean() on a string column raises immediately, at build time).
  • df.filter(lambda c: c.height > 180) — lambda style for the same effect.
  • dpyr stubgen data/*.parquet -o schemas.py — generates typed schema modules so completion and type-checking work statically in any IDE.
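The frame-bound proxy in the first bullet can be approximated with two standard Python hooks, __dir__ and __getattr__. The sketch below is an assumption about how such a proxy might be built, not dpyr's code; ColumnProxy, NumericExpr and StringExpr are hypothetical names, and the expression methods return strings only to keep the example short.

```python
class NumericExpr:
    """Expression over a numeric column; only numeric methods exist on it."""

    def __init__(self, column):
        self.column = column

    def mean(self):
        return f"mean({self.column})"


class StringExpr:
    """Expression over a string column; it deliberately has no mean()."""

    def __init__(self, column):
        self.column = column

    def str_detect(self, pattern):
        return f"str_detect({self.column}, {pattern!r})"


class ColumnProxy:
    """Hypothetical proxy whose attributes are the columns of one live schema."""

    def __init__(self, schema):
        self._schema = dict(schema)   # column name -> "numeric" or "string"

    def __dir__(self):
        # REPLs and notebooks call dir() to build their completion lists.
        return list(self._schema)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_") or name not in self._schema:
            raise AttributeError(f"no column named {name!r}")
        kind = self._schema[name]
        return NumericExpr(name) if kind == "numeric" else StringExpr(name)
```

With c = ColumnProxy({"height": "numeric", "species": "string"}), c.height.mean() builds an expression, while c.species.mean() raises AttributeError while the expression is being built, before any data is touched.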

Interactive by default, lazy when you need it

df.persist()           # checkpoint: materialize now (duckdb: temp table)
df.lazy()              # this frame never executes implicitly
dpyr.options.interactive = False   # global opt-out for production pipelines

Results are cached by plan hash, so re-displaying a frame in a notebook never recomputes it.
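A plan-hash cache can be sketched as follows. This is a guess at the general technique, not dpyr's implementation: it assumes a plan can be serialized to JSON, and plan_hash and PlanCache are hypothetical names.

```python
import hashlib
import json


def plan_hash(plan):
    """Stable digest of a plan made of JSON-serializable steps."""
    encoded = json.dumps(plan, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class PlanCache:
    """Run a plan at most once per distinct hash."""

    def __init__(self):
        self._results = {}
        self.executions = 0   # counts real executions, for illustration

    def run(self, plan, compute):
        key = plan_hash(plan)
        if key not in self._results:
            self.executions += 1
            self._results[key] = compute(plan)
        return self._results[key]
```

Two structurally equal plans share a digest, so displaying the same frame twice computes it once. A real cache would also need to fold the identity or version of the source data into the key, which this sketch leaves out.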

Project documents

| Doc | What it pins down |
|---|---|
| docs/DESIGN.md | API design, the materialization model, autocompletion strategy, architecture |
| docs/SEMANTICS.md | Every deliberate decision where R, polars and duckdb disagree |
| docs/TESTING.md | dplyr-as-oracle goldens, backend-agreement fuzzing, Hypothesis properties |
| docs/ROADMAP.md | What shipped in 1.0 and what's next |

License

MIT © Maxime Rivest

