data-prep-engine

A unified data ingestion and sanitization engine for ML workflows.
A lightweight, modular, production-friendly Python library for data ingestion, diagnostics, sanitization, visualization, and end-to-end ML data preparation.
Built to provide a single, unified, reproducible pipeline that works across CSV, JSON, Parquet, images, and more — without depending on massive profiling libraries.
🌟 Key Features
🔌 1. Ingestion Engine (Loader)
Load CSV, JSON, Parquet, Images, and more into a unified StandardTable.
🩺 2. Diagnostics Engine (DataDoctor)
Column-level summaries, warnings, null counts, duplicates, outliers, cardinality & constant-column detection.
🧼 3. Sanitization Engine (The Surgeon)
- Missing value imputation
- Duplicate removal
- IQR-based outlier capping
- Fully extensible sanitization steps
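These steps correspond to standard techniques. As a rough sketch of what IQR-based capping means (the classic 1.5×IQR rule in plain pandas, not the library's actual implementation):

```python
import pandas as pd

def iqr_cap(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] (standard IQR rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

s = pd.Series([1, 2, 3, 4, 100])
print(iqr_cap(s).tolist())  # the outlier 100 is capped to the upper fence
```

Capping (rather than dropping) keeps the row count stable, which matters when rows carry other usable columns.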
🎨 4. Visualization Engine (The Artist)
Smart single-flag plots (numeric histograms, categorical counts) guided by the diagnostics report.
🚀 5. AutoPrep — End-to-End Unified Pipeline
One line to prepare any dataset:
```python
from data_prep_engine import AutoPrep

prep = AutoPrep.default()
result = prep.run_from_uri("data.csv")
result.cleaned_table.to_pandas().head()
```
⚙️ Installation
```bash
pip install data-prep-engine
```

(Coming soon to PyPI; currently, install locally using:)

```bash
pip install -e .
```
🏁 Quickstart
```python
from data_prep_engine import AutoPrep

prep = AutoPrep.default()

# Load, diagnose, clean, visualize — all in one step
result = prep.run_from_uri("data.csv")

print(result.sanitization_logs)
result.cleaned_table.to_pandas().head()
```
📥 Ingestion Examples
```python
from data_prep_engine.ingestion import Loader

loader = Loader()
table_csv = loader.load("data.csv")
table_json = loader.load("data.json")
table_parquet = loader.load("data.parquet")
table_image = loader.load("image.jpg")  # stored as array metadata
```
All ingestion results are returned as a StandardTable, guaranteeing uniform structure.
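The StandardTable API itself isn't documented here, so as a mental model only (a hypothetical stand-in, not the library's definition), you can picture it as a thin wrapper that normalizes every source into named columns plus metadata:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class StandardTableSketch:
    # Hypothetical stand-in for data_prep_engine's StandardTable:
    # every format adapter would produce this one shape.
    columns: dict[str, list[Any]]
    meta: dict[str, Any] = field(default_factory=dict)

    def to_rows(self) -> list[dict[str, Any]]:
        """Materialize column storage as a list of row dicts."""
        names = list(self.columns)
        return [dict(zip(names, vals)) for vals in zip(*self.columns.values())]

t = StandardTableSketch({"id": [1, 2], "label": ["a", "b"]}, meta={"source": "data.csv"})
print(t.to_rows())
```

The point of such a shape is that diagnostics and sanitization only ever see one structure, regardless of whether the bytes came from CSV, JSON, Parquet, or an image.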
🩺 Diagnostics Examples
```python
from data_prep_engine.diagnostics import DataDoctor

doctor = DataDoctor()
report = doctor.diagnose(table)
print(report.summary_table())
print(report.warnings)
```
Common warnings include:
- High missing values
- High-cardinality categorical columns
- Outlier-heavy numeric columns
- Constant features
- Duplicate rows
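The checks behind warnings like these are simple to state. A plain-pandas sketch of the idea (the thresholds here are arbitrary illustration choices, not the library's defaults):

```python
import pandas as pd

def quick_warnings(df: pd.DataFrame, null_frac: float = 0.5) -> list[str]:
    """Flag common data-quality issues column by column."""
    w = []
    for col in df.columns:
        s = df[col]
        if s.isna().mean() > null_frac:          # mostly-missing column
            w.append(f"{col}: high missing values")
        if s.nunique(dropna=True) <= 1:          # carries no information
            w.append(f"{col}: constant feature")
    if df.duplicated().any():                    # exact repeated rows
        w.append("duplicate rows present")
    return w

df = pd.DataFrame({"a": [1, 1, 1], "b": [None, None, 3.0]})
print(quick_warnings(df))
```

Each warning is cheap to compute in one pass, which is why a diagnostics report can run before any cleaning is committed.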
🧼 Sanitization Examples
```python
from data_prep_engine.sanitization.pipeline import SanitizationPipeline
from data_prep_engine.sanitization.steps import (
    MissingValueHandler,
    DuplicateHandler,
    OutlierHandler,
)

pipeline = SanitizationPipeline([
    MissingValueHandler(),
    DuplicateHandler(),
    OutlierHandler(),
])

result = pipeline.run(table)
clean_table = result.cleaned_table
print(result.logs)
```
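The pipeline pattern is straightforward to extend with custom steps. The library's exact step contract isn't shown in this README, so the following is a hypothetical sketch of the pattern using plain callables on a pandas DataFrame:

```python
import pandas as pd

# Hypothetical contract: a step takes a DataFrame and returns
# (cleaned DataFrame, log message). The real step interface may differ.
def drop_empty_columns(df: pd.DataFrame):
    kept = df.dropna(axis=1, how="all")
    return kept, f"dropped {df.shape[1] - kept.shape[1]} empty column(s)"

def run_steps(df, steps):
    """Thread a DataFrame through each step, collecting logs."""
    logs = []
    for step in steps:
        df, msg = step(df)
        logs.append(msg)
    return df, logs

df = pd.DataFrame({"a": [1, 2], "b": [None, None]})
clean, logs = run_steps(df, [drop_empty_columns])
print(clean.columns.tolist(), logs)
```

Because every step has the same signature, new steps slot in anywhere in the list without touching the runner.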
🎨 Visualization Examples
```python
from data_prep_engine.visualization import Artist

fig = Artist.plot(clean_table, doctor.diagnose(clean_table))
fig.show()
```

Or save as PNG:

```python
Artist.to_png(fig, "preview.png")
```
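The "smart single-flag" behavior boils down to choosing one chart type per column based on its diagnosed type. A minimal dtype-driven sketch of that dispatch (hypothetical, not the Artist's actual logic):

```python
import pandas as pd

def chart_for(s: pd.Series) -> str:
    # Numeric columns get histograms; everything else gets category counts.
    return "histogram" if pd.api.types.is_numeric_dtype(s) else "bar_counts"

df = pd.DataFrame({"age": [21, 34, 45], "city": ["NY", "LA", "NY"]})
print({col: chart_for(df[col]) for col in df.columns})
```

Driving the choice from the diagnostics report (rather than asking the caller) is what makes a single plotting flag sufficient.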
🚀 Full AutoPrep Pipeline
```python
from data_prep_engine import AutoPrep

prep = AutoPrep.default()
result = prep.run_from_uri("data.csv")

print("Warnings before:", result.diagnostics_before.warnings)
print("Warnings after:", result.diagnostics_after.warnings)

fig = prep.plot(result)
fig.show()
```
📐 Project Architecture
```text
data_prep_engine/
│
├── ingestion/       # Loaders + format adapters
├── diagnostics/     # DataDoctor + reports
├── sanitization/    # Sanitization steps + pipeline
├── visualization/   # Artist plotting engine
├── core/            # StandardTable + utilities
└── autoprep.py      # Full unified pipeline
```
Each block is independent, testable, and extendable.
🧪 Running Tests
```bash
pytest -q
```
The test suite covers:
- Ingestion adapters
- Diagnostics summaries
- Sanitization steps
- Visualization engine
- AutoPrep unified pipeline
🤝 Contributing
1. Fork the repo
2. Create a feature branch
3. Add tests for your change
4. Submit a PR
All PRs must pass GitHub Actions (unit tests, linting, and security scanners).
📜 License
MIT License © 2025
File details

Details for the file data_prep_engine-0.1.0.tar.gz.

File metadata
- Download URL: data_prep_engine-0.1.0.tar.gz
- Upload date:
- Size: 16.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `3ca3ebcae4b6e053c5f4009e36cb9a1b495762de6150424867ba88ba29ca8c71` |
| MD5 | `dc0caa2546cb4b8fb219bf1e014fed69` |
| BLAKE2b-256 | `bd07ab9e3171a2cf5c3539413288b8a63d233aa3d70296608916ee6c8c74e574` |
File details

Details for the file data_prep_engine-0.1.0-py3-none-any.whl.

File metadata
- Download URL: data_prep_engine-0.1.0-py3-none-any.whl
- Upload date:
- Size: 18.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9

File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `9770ad896f4b08fb29aed079d9fbd4635a6ac9160f6395313595fd7f4bad28c6` |
| MD5 | `02be56c96d43536bce7e29d7cdcf5e55` |
| BLAKE2b-256 | `b1e721b1baa5fa4255aabb19911039e4f712015b1cdb71d29d7618806ab4ab00` |