dr-dasci
Automatic diagnostics for pandas, Polars, NumPy, Arrow, and distributed data pipelines.
dr-dasci combines:
- Dataframe diagnostics for pandas-like and Polars-like objects.
- Array diagnostics for NumPy memory layout, dtype, and copy risks.
- Operation preflight checks for joins, groupbys, pivots, conversions, and Parquet reads.
- Optional engine diagnostics for Dask, DuckDB, Spark, and Arrow Dataset metadata.
- Runtime instrumentation for measuring actual Python allocation peaks.
- Configurable thresholds for laptop, CI, and server memory budgets.
- Suppression config for accepted finding codes.
- Machine-readable reports with stable finding codes, metadata, and JSON export.
- Notebook rendering through HTML and optional ipywidgets.
- Safe execution plans for large tabular transformations.
- Optional dependencies so the base package stays lightweight.
Why Diagnostics for Data Pipelines?
pandas, Polars, NumPy, and Arrow are powerful, but many expensive operations look cheap at the call site. dr-dasci inspects an input in stages before you commit to running them:
Input object or file
-> Detect runtime and shape
-> Inspect dtypes, indexes, memory, cardinality, and layout
-> Estimate operation-specific peak memory
-> Report findings with stable codes
-> Suggest safer dtypes and execution plans
-> Export text or JSON for notebooks, CI, and logs
dr-dasci is designed to catch common problems before they become production
failures:
- Hidden copies from pandas object blocks, index alignment, and NumPy views.
- Memory blowups in joins, groupbys, pivots, unstack, fillna, and conversions.
- String dtype traps where object, high-cardinality text, or repeated labels need different treatment.
- Parquet-to-pandas expansion when encoded Arrow data becomes pandas blocks.
- Join cardinality surprises from duplicate keys, null keys, and many-to-many merges.
Architecture
DataFrame / ndarray / file path
|
v
Adapter detection
- pandas DataFrame
- Polars DataFrame / LazyFrame
- NumPy ndarray
- dataframe-like fallback
- Parquet metadata reader
- Dask / DuckDB / Spark / Arrow Dataset metadata adapters
|
v
Diagnostics
- shape and memory estimates
- dtype and cardinality checks
- pandas index/copy-risk checks
- NumPy layout checks
- join/groupby/pivot/conversion preflight
- runtime peak allocation instrumentation
|
v
DoctorReport
- human-readable show()
- suggestions via suggest()
- safe_execution_plan()
- machine-readable to_dict() / to_json()
Install
pip install dr-dasci
For pandas support:
pip install "dr-dasci[pandas]"
For Polars support:
pip install "dr-dasci[polars]"
For Dask, DuckDB, Spark, or notebook support:
pip install "dr-dasci[dask]"
pip install "dr-dasci[duckdb]"
pip install "dr-dasci[spark]"
pip install "dr-dasci[notebook]"
For all optional dataframe, array, and Parquet support:
pip install "dr-dasci[all]"
For development:
pip install -e ".[dev,all]"
pytest -q
python -m build
twine check dist/*
Quick Start
Basic Diagnosis
from dr_dasci import diagnose
report = diagnose(df, name="orders")
report.show()
print(report.suggest())
Machine-Readable Output
from dr_dasci import diagnose
report = diagnose(df)
payload = report.to_dict()
json_text = report.to_json()
print(payload["findings"][0]["code"])
print(json_text)
Safe Execution Plan
report = diagnose(df, name="events")
for step in report.safe_execution_plan():
    print(step)
Configurable Thresholds
from dr_dasci import DoctorConfig, diagnose
config = DoctorConfig(
    available_memory_bytes=8_000_000_000,
    large_memory_bytes=1_500_000_000,
    expensive_column_bytes=150_000_000,
)
report = diagnose(df, config=config)
Suppress accepted finding codes:
from dr_dasci import DoctorConfig, diagnose
config = DoctorConfig(suppress_codes=("EXPENSIVE_OBJECT_COLUMN",))
report = diagnose(df, config=config)
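One way to keep separate budgets for laptop, CI, and server runs is to hold the keyword arguments in plain dictionaries and build the config from the one that matches the environment. This is a sketch only: the field names come from the examples above, the laptop values repeat the example above, the CI and server values are made up for illustration, and `PRESETS` and `preset_for` are hypothetical names that are not part of dr-dasci.

```python
GB = 1_000_000_000

# Hypothetical presets. Only the field names come from the README;
# the "ci" and "server" numbers are illustrative assumptions.
PRESETS = {
    "laptop": {
        "available_memory_bytes": 8 * GB,
        "large_memory_bytes": 1_500_000_000,
        "expensive_column_bytes": 150_000_000,
    },
    "ci": {
        "available_memory_bytes": 4 * GB,
        "large_memory_bytes": 500_000_000,
        "expensive_column_bytes": 50_000_000,
    },
    "server": {
        "available_memory_bytes": 64 * GB,
        "large_memory_bytes": 8 * GB,
        "expensive_column_bytes": 1 * GB,
    },
}


def preset_for(environment, suppress_codes=()):
    """Return keyword arguments for one environment as a fresh dict."""
    kwargs = dict(PRESETS[environment])
    kwargs["suppress_codes"] = tuple(suppress_codes)
    return kwargs


kwargs = preset_for("ci", suppress_codes=["EXPENSIVE_OBJECT_COLUMN"])
print(kwargs)
```

With dr-dasci installed, the result could then be passed on as `DoctorConfig(**kwargs)`.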
CLI
Inspect a local data file:
dr-dasci inspect data.parquet
Emit JSON:
dr-dasci inspect data.parquet --json
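In CI, the JSON output could be used to fail a build when severe findings appear. The sketch below is an illustration and not part of dr-dasci. It assumes the JSON has a top-level `findings` list whose entries carry `code` and `severity` keys; the `findings` and `code` keys appear in the Quick Start, while the `severity` key and the `low`/`medium` level names are assumptions (only `high` appears in this README). `blocking_findings` is a hypothetical helper.

```python
import json

# Assumed ordering of severity levels; only "high" appears in the README.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2}


def blocking_findings(json_text, minimum="high"):
    """Return the codes of findings at or above `minimum` severity."""
    payload = json.loads(json_text)
    threshold = SEVERITY_RANK[minimum]
    return [
        finding["code"]
        for finding in payload.get("findings", [])
        if SEVERITY_RANK.get(finding.get("severity"), 0) >= threshold
    ]


# Hand-written sample payload, not real dr-dasci output.
sample = json.dumps({
    "findings": [
        {"code": "EXPENSIVE_OBJECT_COLUMN", "severity": "high"},
        {"code": "NON_MONOTONIC_INDEX", "severity": "low"},
    ]
})

codes = blocking_findings(sample)
if codes:
    print("blocking findings:", ", ".join(codes))
```

A CI step would read the real JSON from the CLI output instead of `sample` and exit non-zero when the returned list is not empty.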
Main Features
1. Dataframe Diagnosis
Detect expensive object columns, large shapes, numeric downcast candidates, nullable dtype candidates, and pandas index risks:
from dr_dasci import diagnose
report = diagnose(df)
report.show()
Common finding codes include:
- EXPENSIVE_OBJECT_COLUMN
- DOWNSIZE_NUMERIC_CANDIDATE
- DUPLICATE_INDEX
- NON_MONOTONIC_INDEX
- PANDAS_OBJECT_BLOCK_COPY_RISK
- PANDAS_ALIGNMENT_COPY_RISK
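To see why object columns are flagged as expensive, it helps to compare a column of Python strings with a fixed-width column of the same length. The sketch below uses only the standard library and is not how dr-dasci computes its estimates, which this README does not show. `object_column_bytes` and `cardinality_ratio` are hypothetical helpers, and the 0.2 cutoff mirrors the `low_cardinality_ratio` value shown under Configuration.

```python
import sys


def object_column_bytes(values):
    """Rough cost of an object column: one 8-byte pointer per row
    plus the size of each Python object, counted per row."""
    return 8 * len(values) + sum(sys.getsizeof(v) for v in values)


def cardinality_ratio(values):
    """Distinct values divided by row count (0.0 for an empty column)."""
    return len(set(values)) / len(values) if values else 0.0


labels = ["paid", "refunded", "pending"] * 1000

obj_bytes = object_column_bytes(labels)
fixed_bytes = 8 * len(labels)  # 8 bytes per row, as for an int64 column
ratio = cardinality_ratio(labels)

print(obj_bytes, fixed_bytes, ratio)
if ratio <= 0.2:
    print("repeated labels: candidate for a categorical or dictionary encoding")
```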
2. Join Preflight
Estimate join cardinality, null-key risk, many-to-many risk, and peak memory:
from dr_dasci import diagnose_join
report = diagnose_join(left, right, on="customer_id", how="left")
report.show()
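The arithmetic behind join cardinality is small enough to sketch by hand: for each key, an equi-join emits (left count) x (right count) rows, so duplicate keys on both sides multiply. The code below is a standalone illustration, not dr-dasci's implementation; `estimate_join_rows` and `is_many_to_many` are hypothetical helpers. It treats `None` keys as never matching, which is SQL behavior; pandas matches null keys to each other, so the null handling would need adjusting for a pandas merge.

```python
from collections import Counter


def estimate_join_rows(left_keys, right_keys, how="inner"):
    """Output row count of a single-key equi-join, from key counts."""
    left = Counter(k for k in left_keys if k is not None)
    right = Counter(k for k in right_keys if k is not None)
    # Each matching key contributes left_count * right_count rows.
    rows = sum(n * right[k] for k, n in left.items() if k in right)
    if how == "left":
        # Unmatched and null-keyed left rows are kept once each.
        rows += sum(n for k, n in left.items() if k not in right)
        rows += sum(1 for k in left_keys if k is None)
    elif how != "inner":
        raise ValueError(f"unsupported join type: {how!r}")
    return rows


def is_many_to_many(left_keys, right_keys):
    """True when some non-null key is duplicated on both sides."""
    left = Counter(k for k in left_keys if k is not None)
    right = Counter(k for k in right_keys if k is not None)
    return any(n > 1 and right.get(k, 0) > 1 for k, n in left.items())


left_keys = ["a", "a", "b", None]
right_keys = ["a", "a", "a", "c"]

print(estimate_join_rows(left_keys, right_keys, how="inner"))  # 2 * 3 = 6
print(estimate_join_rows(left_keys, right_keys, how="left"))   # 6 + 1 + 1 = 8
print(is_many_to_many(left_keys, right_keys))                  # True
```

Four left rows become eight output rows here, which is the kind of growth a preflight check is meant to surface before the merge runs.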
3. Groupby Preflight
Check high-cardinality grouping keys and aggregation memory pressure:
from dr_dasci import diagnose_groupby
report = diagnose_groupby(events, by=["account_id", "event_day"])
print(report.risky_operations(minimum="high"))
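Grouping cost tracks the number of distinct key combinations: when nearly every row forms its own group, an aggregation shrinks very little and the group table itself becomes large. The sketch below counts groups over plain dictionaries as an illustration; it is not dr-dasci's implementation, `groupby_cardinality` is a hypothetical helper, and the 0.8 cutoff mirrors the `high_cardinality_ratio` value shown under Configuration.

```python
def groupby_cardinality(rows, by):
    """Return (number of groups, groups / rows) for the given key columns."""
    keys = {tuple(row[col] for col in by) for row in rows}
    n_groups = len(keys)
    ratio = n_groups / len(rows) if rows else 0.0
    return n_groups, ratio


# Hand-written sample rows for illustration.
events = [
    {"account_id": 1, "event_day": "2026-01-01"},
    {"account_id": 1, "event_day": "2026-01-01"},
    {"account_id": 2, "event_day": "2026-01-01"},
    {"account_id": 2, "event_day": "2026-01-02"},
]

n_groups, ratio = groupby_cardinality(events, by=["account_id", "event_day"])
print(n_groups, ratio)
if ratio >= 0.8:
    print("high-cardinality grouping: the result is almost as large as the input")
```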
4. Pivot and Unstack Preflight
Estimate dense expansion before reshaping:
from dr_dasci import diagnose_pivot
report = diagnose_pivot(df, index="user_id", columns="event_type")
report.show()
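Dense expansion can be estimated from two distinct counts: a pivot allocates one cell for every (index value, column value) combination, whether or not that combination occurs in the data. The following is a minimal sketch of that estimate, assuming 8 bytes per cell; `estimate_pivot` is a hypothetical helper and not dr-dasci's implementation.

```python
def estimate_pivot(pairs, bytes_per_cell=8):
    """Estimate the dense size of a pivot from (index, column) pairs."""
    pairs = list(pairs)
    index_values = {i for i, _ in pairs}
    column_values = {c for _, c in pairs}
    cells = len(index_values) * len(column_values)
    filled = len(set(pairs))  # combinations that actually occur
    return {
        "rows": len(index_values),
        "columns": len(column_values),
        "cells": cells,
        "dense_bytes": cells * bytes_per_cell,
        "fill_ratio": filled / cells if cells else 0.0,
    }


# 1000 users who all clicked; only two users did anything else.
pairs = [(user, "click") for user in range(1000)]
pairs += [(0, "purchase"), (1, "refund")]

estimate = estimate_pivot(pairs)
print(estimate)
```

In this sample about a third of the cells hold data and the rest would be filled with missing values, so a long format or a sparse representation may be the safer plan.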
5. Conversion Diagnostics
Preflight conversion costs between pandas, Polars, NumPy, and Arrow-backed data:
from dr_dasci import diagnose_conversion
report = diagnose_conversion(df, target="pandas")
print(report.to_json())
6. Parquet Metadata Diagnostics
Inspect Parquet row groups, column counts, compression, encodings, and pandas conversion risk without loading the full dataset:
from dr_dasci import diagnose_parquet
report = diagnose_parquet("events.parquet")
report.show()
7. NumPy Copy-Risk Checks
Catch object arrays and non-contiguous views:
from dr_dasci import diagnose
report = diagnose(array)
report.show()
8. Optional Engine Metadata Diagnostics
Inspect Dask, DuckDB, Spark, and Arrow Dataset objects from metadata without triggering computation or collecting rows:
from dr_dasci import diagnose
report = diagnose(lazy_or_distributed_frame)
print(report.safe_execution_plan())
9. Runtime Memory Instrumentation
Measure actual Python allocation peak for a callable when you intentionally want to run it:
from dr_dasci import diagnose_runtime
result, report = diagnose_runtime(lambda: df.assign(total=df["a"] + df["b"]), name="assign_total")
print(report.estimates[0].metadata["peak_bytes"])
Preflight helpers do not execute transformations; diagnose_runtime is the
explicit execution-time measurement API.
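For readers who want to see what measuring a Python allocation peak involves, the standard library's `tracemalloc` can do it in a few lines. This README does not say how `diagnose_runtime` is implemented, so the sketch below is only an analogous technique under that assumption; `measure_peak` is a hypothetical helper, and `tracemalloc.reset_peak` requires Python 3.9 or later.

```python
import tracemalloc


def measure_peak(func):
    """Run func() and return (result, peak traced bytes above the baseline)."""
    already_tracing = tracemalloc.is_tracing()
    if not already_tracing:
        tracemalloc.start()
    try:
        tracemalloc.reset_peak()  # start the peak from the current usage
        baseline, _ = tracemalloc.get_traced_memory()
        result = func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Leave tracing on if someone else had enabled it.
        if not already_tracing:
            tracemalloc.stop()
    return result, peak - baseline


# A list of one million references allocates several megabytes.
result, peak_bytes = measure_peak(lambda: [0] * 1_000_000)
print(len(result), peak_bytes)
```

`tracemalloc` only sees allocations made through Python's memory allocators, so memory obtained directly by native code may not be counted.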
10. Notebook Reports
from dr_dasci import diagnose
diagnose(df).to_notebook()
11. Stable Finding Codes
Every finding includes a stable code, severity, suggestion, optional
column, documentation URL, and metadata:
for finding in report.findings:
    print(finding.code, finding.severity, finding.metadata)
See docs/FINDINGS.md for the finding catalog.
Configuration
Tune behavior via DoctorConfig:
from dr_dasci import DoctorConfig
config = DoctorConfig(
    large_memory_bytes=1_000_000_000,
    expensive_column_bytes=100_000_000,
    large_cell_count=50_000_000,
    large_rows=1_000_000,
    very_large_rows=5_000_000,
    pivot_row_warning=250_000,
    pivot_width_warning=25,
    join_high_memory_bytes=500_000_000,
    low_cardinality_ratio=0.2,
    low_cardinality_max_unique=50_000,
    high_cardinality_ratio=0.8,
    index_warning_rows=100_000,
    available_memory_bytes=None,
)
Examples
from dr_dasci import diagnose, diagnose_join
orders_report = diagnose(orders, name="orders")
customers_join_report = diagnose_join(orders, customers, on="customer_id")
orders_report.show()
customers_join_report.show()
dr-dasci inspect warehouse/orders.parquet --json
Project Structure
src/dr_dasci/
__init__.py # Public API
config.py # DoctorConfig thresholds
core.py # Diagnostics and operation preflight helpers
report.py # DoctorReport, findings, estimates, JSON export
cli.py # Command-line interface
py.typed # Typing marker
docs/
FINDINGS.md # Stable finding-code catalog
CHANGELOG.md # Release history
tests/
test_*.py # Unit and optional integration tests
.github/
workflows/
ci.yml # Lint, type check, tests, build, twine check
publish.yml # PyPI publishing workflow
pyproject.toml # Project metadata and dependencies
drdasci.png # Project logo
Development
# Install with dev extras
pip install -e ".[dev,all]"
# Lint
ruff check .
# Type check
mypy src
# Run tests
pytest -q
# Build package
python -m build
# Check distributions
twine check dist/*
License
MIT
Contributing
Contributions are welcome. Open an issue with a reproducible dataframe shape, dtypes, operation, and observed memory or runtime behavior.
Citation
If you use dr-dasci in research, please cite:
@software{drdasci2026,
  title={dr-dasci: Automatic Diagnostics for Data Science Pipelines},
  author={Arkay92},
  url={https://github.com/Arkay92/dr-dasci},
  year={2026},
  version={0.2.0},
}
Acknowledgments
- pandas for dataframe analytics.
- Polars for high-performance dataframe execution.
- NumPy for array computing.
- Apache Arrow for columnar memory and Parquet tooling.