PrePro Auto
AI-assisted tabular data preprocessing with human-in-the-loop control.
Profile, clean, transform, and export any tabular dataset — from a Jupyter notebook or a local web UI — with every step undoable, auditable, and reproducible. The same engine drives both interfaces, so results are identical wherever you call it from.
pip install prepro-auto
Author: Shivanshu Pandey · Source: github.com/Chilliflex/prepro_auto
Quickstart — Notebook (no upload)
import pandas as pd
import prepro_auto
df = pd.read_csv("your_data.csv")
session = prepro_auto.launch(df) # opens the local workbench, NO upload
# -> click the printed http://127.0.0.1:8721/workbench?job=... link
# clean visually in the browser, then back in the notebook:
cleaned = session.current() # the UI-edited DataFrame
session.update(cleaned) # push notebook edits back to the UI
That's the whole loop. Your DataFrame is loaded directly from the notebook's memory — no file upload, no context switch. df (your original) never changes; session.current() always returns the latest cleaned version.
Quickstart — Web UI
prepro_auto # starts the workbench at http://127.0.0.1:8000
Then open http://127.0.0.1:8000/workbench and upload a file.
What it does
- Profile: per-column type inference, missing rates, 0–100 quality score
- Clean (guided): missing values, outliers, scaling, correlation/leakage, encoding; each issue becomes a reviewable decision with a recommended action and alternatives
- Transform (manual): 17 preset ops, sandboxed expressions, multi-column batches
- AI assistant (optional): describe a change in plain English; the assistant confirms your intent and shows a real preview before anything is applied
- Visualize & dashboard: histograms, bar charts, and scatter plots, plus a before/after dashboard with KPI tiles and per-column comparison
- Data drift: compare two datasets to detect distribution shifts (PSI + KS)
- Undo/redo: every change is a version
- Export: clean data (CSV/Parquet), audit PDF, and a runnable Python pipeline script
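For readers unfamiliar with the PSI metric named in the drift bullet, here is a minimal pure-Python sketch of how a Population Stability Index is commonly computed. It is not PrePro Auto's implementation: the function name `psi`, the equal-width binning, the `eps` floor, and the 0.25 cut-off (a widely quoted rule of thumb) are all assumptions made for illustration.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges are equal-width over the range of `expected`; `eps` keeps
    empty bins from producing log(0) or a division by zero.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant column

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the baseline range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline = [i % 10 for i in range(1000)]       # evenly spread over 0..9
shifted = [(i % 10) + 3 for i in range(1000)]  # same shape, moved up by 3

print(psi(baseline, baseline))        # identical samples: 0.0
print(psi(baseline, shifted) > 0.25)  # a large shift clears the rule-of-thumb cut-off
```

The severity bands PrePro Auto actually reports may use different binning and thresholds; treat the numbers here as a way to build intuition only.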
What you get out of PrePro Auto
Five concrete outputs you can take away after a session. Each one is designed to plug straight into a real-world workflow:
| Output | What it is | Where to use it |
|---|---|---|
| Cleaned DataFrame | The in-memory DataFrame after all your cleaning + transforms, returned by session.current() in the notebook | Feed straight into model.fit(X, y) for scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow. No file I/O needed. |
| Cleaned dataset file | A CSV or Parquet file via GET /datasets/{job_id}/export/data?format=csv (or format=parquet) | Share with teammates, upload to a feature store, load into BI tools (Tableau, Power BI, Looker), commit to a versioned data repo, or feed into downstream ETL jobs. Parquet is smaller and faster for large datasets. |
| Audit PDF | A multi-page PDF via GET /datasets/{job_id}/export/audit listing every transformation with its parameters, before/after stats, and who approved it | Compliance trail for regulated industries (finance, healthcare, insurance); attach to a model-card or experiment-tracking entry; hand to a reviewer or data-governance team to prove the cleaning is reproducible and reasoned, not arbitrary. |
| Runnable Python pipeline | A standalone .py script via GET /datasets/{job_id}/export/pipeline that reproduces the exact cleaning with pandas + scikit-learn, with no PrePro Auto dependency | Drop into a production training pipeline, an Airflow/Prefect/Dagster DAG, a CI job, or a coworker's machine. They run python pipeline.py raw.csv clean.csv and get the same result you produced visually. |
| Drift report | A per-column JSON verdict (PSI, KS test, severity bands) via POST /drift/compare between two datasets | Monitor a deployed model: compare last month's input distribution to this month's. Catch silent data shifts (a new product category, a sensor recalibration, a market regime change) before they degrade model performance. Plug into a monitoring dashboard or alert on overall_verdict == "significant_drift". |
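The alerting idea in the drift report row can be sketched as a small check on the JSON response. Only the `overall_verdict` field and its `"significant_drift"` value come from the description above; the helper name `needs_alert` and the sample payloads are hypothetical, and the real response likely carries more fields than shown.

```python
import json


def needs_alert(report_json: str) -> bool:
    """Return True when a drift report's overall verdict is significant drift."""
    report = json.loads(report_json)
    # .get() tolerates a missing field instead of raising KeyError.
    return report.get("overall_verdict") == "significant_drift"


# Hypothetical minimal payloads, standing in for a POST /drift/compare response.
drifted = json.dumps({"overall_verdict": "significant_drift"})
other = json.dumps({"overall_verdict": "placeholder_for_any_other_verdict"})

print(needs_alert(drifted))  # True
print(needs_alert(other))    # False
```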
Two common workflows:
# Workflow 1: notebook to model, all in-process (zero file I/O):
session = prepro_auto.launch(df)
# ...clean visually in the browser...
cleaned = session.current()           # fetch once so X and y come from the same version
X = cleaned.drop(columns=["target"])
y = cleaned["target"]
model.fit(X, y)                       # model: any estimator with a fit(X, y) method
# Workflow 2: clean once, productionize with the exported pipeline:
# 1) export pipeline.py from the workbench
# 2) commit pipeline.py to your model repo
# 3) in production: subprocess.run(["python", "pipeline.py", "incoming.csv", "ready.csv"], check=True)
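Step 3 of the second workflow could be wrapped as below. This is a sketch under assumptions: the helper names `pipeline_command` and `run_pipeline` are hypothetical, and the only thing taken from this README is the calling convention `python pipeline.py <input.csv> <output.csv>`.

```python
import subprocess
import sys
from pathlib import Path


def pipeline_command(script, raw_csv, clean_csv):
    """Build the argv for an exported pipeline script."""
    # sys.executable reuses the interpreter (and virtualenv) of the caller.
    return [sys.executable, str(script), str(raw_csv), str(clean_csv)]


def run_pipeline(script, raw_csv, clean_csv):
    """Run the script; raise if it exits non-zero or writes no output file."""
    subprocess.run(pipeline_command(script, raw_csv, clean_csv), check=True)
    output = Path(clean_csv)
    if not output.exists():
        raise FileNotFoundError(f"pipeline produced no output at {output}")
    return output


# Show the command that would run, without executing anything.
print(pipeline_command("pipeline.py", "incoming.csv", "ready.csv")[1:])
```

Using `check=True` makes a failed cleaning step stop the job instead of letting downstream training read a stale or missing file.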
Methods
Field-standard methods throughout: MICE / KNN / median imputation, IQR + MAD + Isolation Forest for outliers, normality-driven scaling (Standard / Robust / Box-Cox / Yeo-Johnson), label / ordinal / one-hot / frequency / target encoding. No accuracy compromises — the same algorithms a data scientist would write by hand.
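To make two of the outlier rules named above concrete, here is a standard-library sketch of the IQR fence and the MAD-based modified z-score. It illustrates the textbook definitions, not PrePro Auto's code: the function names, the 1.5 fence multiplier, and the 3.5 threshold are conventional defaults chosen here as assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]


def mad_outliers(values, threshold=3.5):
    """Values whose modified z-score (based on the median absolute deviation) exceeds threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # more than half the values are identical; the score is undefined
    # 0.6745 rescales MAD so the score is comparable to a z-score under normality.
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]


data = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(data))  # [95]
print(mad_outliers(data))  # [95]
```

Both rules rely on medians and quartiles, so a single extreme value such as 95 does not distort the thresholds the way a mean and standard deviation would.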
AI providers (optional)
AI features are optional. Everything works offline without a key. PrePro Auto supports five providers:
| Provider | ID | Install | Get a key |
|---|---|---|---|
| Groq (free tier, fast) | groq | pip install prepro-auto[groq] | https://console.groq.com |
| OpenAI / GPT | openai | pip install prepro-auto[openai] | https://platform.openai.com |
| Anthropic Claude | anthropic | pip install prepro-auto[anthropic] | https://console.anthropic.com |
| Google Gemini | gemini | pip install prepro-auto[gemini] | https://aistudio.google.com/app/apikey |
| Mistral | mistral | pip install prepro-auto[mistral] | https://console.mistral.ai |
Or install all five at once: pip install prepro-auto[ai].
Three ways to give PrePro Auto your API key
1. From the notebook (in-memory, session-only — safest):
import prepro_auto
prepro_auto.set_api_key("openai", "sk-...") # any of the 5 provider IDs
session = prepro_auto.launch(df)
The key lives only in the running process and is lost on restart, so you re-enter it in the next session. PrePro Auto makes a small test call before returning, so you know immediately whether the key works.
2. From the web UI (in-memory by default, optional .env persistence):
In the workbench, click "AI settings (API key)" in the side rail. Pick a provider, paste the key, click Test & apply. PrePro Auto verifies the key with a live test call before accepting it. Tick "Also save to .env" if you want it to survive restarts (local convenience only — leave unchecked on any shared/hosted machine).
3. From a .env file (persists across restarts):
Add to .env in the project root:
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
Each provider has its own env-key name: GROQ_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, MISTRAL_API_KEY.
Honest security note: the .env file is plain text. Fine for a personal machine; never use the persist option on a hosted/shared deployment until proper per-user auth is in place.
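Before launching, it can help to confirm that the provider and key variables described above are actually set. The sketch below is a standalone check, not how PrePro Auto itself reads its configuration: the helper `resolve_key` and the `ENV_KEY_NAMES` mapping are hypothetical, built only from the variable names listed in this section.

```python
import os

# Provider ID -> environment variable, as listed in this section.
ENV_KEY_NAMES = {
    "groq": "GROQ_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def resolve_key(environ=os.environ):
    """Return (provider, key) if LLM_PROVIDER and its matching key are set, else None."""
    provider = environ.get("LLM_PROVIDER", "").strip().lower()
    name = ENV_KEY_NAMES.get(provider)
    key = environ.get(name, "").strip() if name else ""
    return (provider, key) if key else None


# Pass a plain dict to check a configuration without touching the real environment.
print(resolve_key({"LLM_PROVIDER": "openai", "OPENAI_API_KEY": "sk-test"}))
print(resolve_key({"LLM_PROVIDER": "openai"}))  # key missing
```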
Notebook API reference
After import prepro_auto, these are the top-level functions:
| Function | What it does |
|---|---|
| prepro_auto.launch(df, domain="general", port=None, open_browser=False) | Registers an in-memory DataFrame as a job (no upload), starts the local workbench server, returns a Session. Prints a clickable URL. |
| prepro_auto.set_api_key(provider, api_key, model=None) | Sets the AI provider and key at runtime (in-memory). Returns {ok, provider, model, verified, reason} after a live test call. |
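The result dictionary returned by set_api_key can be turned into a readable status line. This is a sketch with assumptions: the field names come from the table above, but their types (booleans for `ok` and `verified`, strings for the rest) and the helper `key_check_message` are guesses made for illustration, and the sample dictionaries are made up.

```python
def key_check_message(result: dict) -> str:
    """Summarize a set_api_key-style result dict in one line."""
    provider = result.get("provider")
    if result.get("ok") and result.get("verified"):
        return f"{provider} ready (model: {result.get('model')})"
    return f"{provider} key not usable: {result.get('reason')}"


# Hypothetical results shaped like {ok, provider, model, verified, reason}.
good = {"ok": True, "provider": "openai", "model": "example-model",
        "verified": True, "reason": None}
bad = {"ok": False, "provider": "openai", "model": None,
       "verified": False, "reason": "invalid key"}

print(key_check_message(good))  # openai ready (model: example-model)
print(key_check_message(bad))   # openai key not usable: invalid key
```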
After session = prepro_auto.launch(df):
| Method / property | What it does |
|---|---|
| session.current() | Returns the current (active-version) DataFrame as it stands in the UI right now. |
| session.update(df) | Pushes a notebook-edited DataFrame to the UI as a new undoable version. |
| session.url | The workbench URL for this session. |
| session.job_id | The internal job ID for this session. |
| session.port | The local port the workbench server is running on. |
Typical sync cycle:
cur = session.current() # pull current state from UI
cur["price_per_sqft"] = cur["price"] / cur["sqft"] # your own code
session.update(cur) # push back, refresh UI to see it
REST API reference
The web app and SDK both call the same endpoints, all under /api/v1. Once the server is running, the live interactive docs are at http://localhost:8000/docs.
| Endpoint | Purpose |
|---|---|
| POST /datasets/upload | Upload a dataset |
| GET /datasets/{job_id}/preview | First rows + shape |
| POST /datasets/{job_id}/profile | Per-column profile + quality score |
| GET /datasets/{job_id}/view | The current (active-version) data |
| GET /datasets/{job_id}/comparison | Raw vs current summary |
| POST /datasets/{job_id}/stages/{stage} | Run a cleaning stage (missing_values, outliers, scaling, correlation, encoding) |
| POST /datasets/{job_id}/stages/{stage}/execute | Apply approved decisions, commit a snapshot |
| GET /datasets/{job_id}/decisions | List decision cards (filter by ?stage=) |
| POST /decisions/{id}/approve · /override · /skip · /drop-column | Resolve a card |
| GET /datasets/{job_id}/queue | Decision summary across stages |
| GET /datasets/{job_id}/history · POST /undo · POST /redo | Version history & navigation |
| GET /datasets/{job_id}/snapshots | List committed versions |
| GET /datasets/{job_id}/transform/operations | List available preset ops |
| POST /datasets/{job_id}/transform/preset | Apply a preset op (rename, drop, cast, fillna, filter, …) |
| POST /datasets/{job_id}/transform/expression | Run a sandboxed pandas expression |
| POST /datasets/{job_id}/transform/batch | Apply one op to many columns as one undoable step |
| POST /datasets/{job_id}/transform/ai-propose · ai-advise · assistant · chat | AI helpers (needs a key) |
| POST /datasets/{job_id}/viz/chart · metric · compare · ask | Charts and condition counts |
| GET /datasets/{job_id}/viz/dashboard | Before/after KPI dashboard |
| POST /drift/compare | Drift detection between two uploaded datasets |
| GET /datasets/{job_id}/export/data · audit · pipeline | Clean data, audit PDF, reproducible script |
| GET /api/v1/system/limits | Live RAM-aware upload limits |
| GET /api/v1/system/llm | List available providers + active one |
| POST /api/v1/system/llm/configure | Set provider + key at runtime |
Open http://localhost:8000/docs after prepro_auto is running for the interactive Swagger UI with full request/response schemas.
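When scripting against these endpoints, a small URL helper avoids hand-assembling paths. The sketch below only builds URLs and makes no network calls. It assumes the web UI default of http://127.0.0.1:8000 and the /api/v1 prefix stated above; the helper name `endpoint` and the `JOB_ID` placeholder are hypothetical, and a notebook session would use the port from session.port instead.

```python
from urllib.parse import urlencode

BASE = "http://127.0.0.1:8000/api/v1"  # assumed default; adjust host and port as needed


def endpoint(path, base=BASE, **query):
    """Join a relative endpoint path onto the API base, adding a query string if given."""
    url = f"{base.rstrip('/')}/{path.lstrip('/')}"
    return f"{url}?{urlencode(query)}" if query else url


job_id = "JOB_ID"  # placeholder; use session.job_id or the ID returned by upload

print(endpoint(f"datasets/{job_id}/export/data", format="parquet"))
print(endpoint(f"datasets/{job_id}/export/audit"))
print(endpoint("drift/compare"))
```

The request and response bodies for each endpoint are best taken from the interactive docs rather than guessed.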
License
MIT