AI-assisted, human-in-the-loop tabular data preprocessing — profile, clean, transform, and export any dataset with a reproducible pipeline, from a notebook or the web.
PrePro Auto
AI-assisted tabular data preprocessing with human-in-the-loop control.
Profile, clean, transform, and export any tabular dataset — from a Jupyter notebook or a local web UI — with every step undoable, auditable, and reproducible. The same engine drives both interfaces, so results are identical wherever you call it from.
pip install prepro-auto
Author: Shivanshu Pandey · Source: github.com/Chilliflex/prepro_auto
Contents
- Quickstart — get going in 30 seconds
- Ways to give PrePro Auto your data — 4 from notebook, 3 from web UI
- 1. Input functions (notebook) — how to load data into a session
- 2. Preprocessing functions — clean, transform, visualize
- 3. Output functions — DataFrames, files, audit PDFs, pipelines
- AI providers — optional, 5 providers supported
- REST API reference
- Documentation
Quickstart
Notebook (recommended for data scientists):
import prepro_auto
# Easiest: point at a file, auto-detects encoding (handles Latin-1, cp1252, BOM)
session = prepro_auto.launch_file(r"C:\path\to\your_data.csv")
# Click the printed http://127.0.0.1:8721/workbench?job=... link
Web UI (recommended for analysts): open Command Prompt (not Jupyter) and run:
prepro_auto
Then open http://127.0.0.1:8000/workbench and drag-drop a file.
Note:
prepro_auto typed inside a Jupyter cell just prints the module object; it doesn't start a server. The CLI command runs from a terminal only. Inside a notebook, use prepro_auto.launch_file(path) or prepro_auto.launch(df) instead.
Ways to give PrePro Auto your data
There are 4 input methods in the notebook and 3 in the web UI. Pick whichever fits your workflow.
From a Jupyter notebook (4 ways)
| # | Method | When to use it |
|---|---|---|
| 1 | prepro_auto.launch_file(path) | You have a file on disk (CSV, Excel, JSON, Parquet, etc.). Easiest path. Auto-detects encoding and delimiter. No pd.read_csv() needed. |
| 2 | prepro_auto.launch(df) | You already have a pandas DataFrame in memory (from a database query, API response, generated data, or a tricky read you handled yourself). |
| 3 | session.update(df) | You already have a session and want to push a new DataFrame to it (e.g. after notebook-side edits). Commits a new undoable version. |
| 4 | Web upload, then notebook reads | Start the server with the CLI, upload via browser, then in the notebook do prepro_auto.Session(job_id, port).current() to pull the data back into Python. Rare but valid. |
From the web UI (3 ways)
| # | Method | When to use it |
|---|---|---|
| 1 | Drag-and-drop upload on Step 1 of the workbench | Standard. Drop a CSV/Parquet/Excel/JSON file into the upload box. The engine auto-detects encoding and delimiter. |
| 2 | File picker on Step 1 | Same as drag-and-drop, just clicked. Useful when dragging is awkward (split screens, touchpads). |
| 3 | URL parameter ?job=<id> | When the notebook launched the session, the printed URL already includes ?job=..., so no upload is needed: the workbench adopts the existing job. |
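Going the other way (web upload first, notebook second), the job ID and port needed by prepro_auto.Session(job_id, port) can be read off the workbench URL. The sketch below uses only the Python standard library; the helper name parse_workbench_url and the job ID "abc123" are made up for illustration and are not part of PrePro Auto.

```python
from urllib.parse import parse_qs, urlparse


def parse_workbench_url(url: str) -> tuple[str, int]:
    """Return (job_id, port) extracted from a workbench URL."""
    parts = urlparse(url)
    job_values = parse_qs(parts.query).get("job")
    if not job_values or parts.port is None:
        raise ValueError(f"expected a URL with a port and ?job=<id>, got {url!r}")
    return job_values[0], parts.port


# "abc123" is a hypothetical job ID; a real one comes from the printed URL.
job_id, port = parse_workbench_url("http://127.0.0.1:8721/workbench?job=abc123")
print(job_id, port)
```

The resulting pair can then be passed to prepro_auto.Session(job_id, port) to attach the notebook to that job.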
Supported file formats
| Format | Extensions | Notes |
|---|---|---|
| CSV | .csv, .tsv, .txt | Auto-detects encoding (utf-8 / utf-8-sig / latin-1 / cp1252) and delimiter (comma, tab, semicolon, pipe) |
| Excel | .xlsx, .xls, .xlsm | First sheet by default; multi-sheet handling via the upload form |
| Parquet | .parquet, .pq | Fastest format for large datasets, preserves dtypes |
| JSON | .json, .jsonl, .ndjson | JSON-records and JSON-lines both supported |
| Feather | .feather | Apache Arrow's native columnar format |
Not supported: PDF, DOCX, HTML, images. PrePro Auto is a tabular-data tool — these formats need a dedicated extraction step first (Camelot or pdfplumber for PDFs, BeautifulSoup for HTML).
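To make the CSV auto-detection idea concrete, here is a minimal standard-library sketch of one way such detection can work: try a short list of encodings in order, then pick the candidate delimiter that appears most often in the first line. This is an illustration only and is not taken from PrePro Auto's source; the function name sniff_text_format and the fallback order are assumptions. Note that utf-8-sig also decodes plain UTF-8, so it stands in for both.

```python
# Order matters: latin-1 accepts every byte sequence, so it must come last.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")
CANDIDATE_DELIMITERS = (",", "\t", ";", "|")


def sniff_text_format(raw: bytes) -> tuple[str, str]:
    """Guess (encoding, delimiter) for the raw bytes of a delimited text file."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(encoding)
            break  # first encoding that decodes cleanly wins
        except UnicodeDecodeError:
            continue
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    # Ties (including "no delimiter found") fall back to the comma.
    delimiter = max(CANDIDATE_DELIMITERS, key=first_line.count)
    return encoding, delimiter


sample = "name;city\nAna;Lyon\n".encode("utf-8-sig")  # BOM-prefixed, semicolon-separated
print(sniff_text_format(sample))
```

Counting delimiters on the first line is deliberately crude; quoted fields containing the delimiter would need a real CSV parser.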
1. Input functions (notebook)
The functions you call before preprocessing starts: they get data into a session.
| Function | Parameters | Returns | What it does |
|---|---|---|---|
| prepro_auto.launch_file(file_path, domain="general", port=None, open_browser=False) | file_path: str or Path | Session | Reads a file from disk with auto-encoding-detection, starts the local workbench, returns a session. Handles all supported formats. Prints the workbench URL. |
| prepro_auto.launch(df, domain="general", port=None, open_browser=False) | df: pandas DataFrame | Session | Registers an in-memory DataFrame as a job (no upload, no file I/O), starts the workbench, returns a session. Use when you already have a DataFrame. |
| prepro_auto.Session(job_id, port) | job_id: str, port: int | Session | Reconnect to an existing session by ID. Use when the notebook restarted but the server is still running, or to attach to a job created from the web UI. |
| prepro_auto.set_api_key(provider, api_key, model=None) | provider: one of "groq" / "openai" / "anthropic" / "gemini" / "mistral" | dict with ok, verified, provider, model, reason | Configures the AI provider at runtime (in-memory only, not written to disk). Makes a tiny test call to verify the key works. Call before launch() if you want AI features active for the session. |
Example — most common pattern:
import prepro_auto
# Optional: enable AI features for this session
prepro_auto.set_api_key("openai", "sk-...")
# Load a file (auto-encoding-detection)
session = prepro_auto.launch_file(r"C:\Users\me\data\sales.csv")
2. Preprocessing functions
The work itself — clean, transform, version. These are called on the session object that input functions returned, or via REST endpoints under /api/v1/.
Profile and clean
| Function / Endpoint | What it does |
|---|---|
| POST /datasets/{job_id}/profile | Per-column type inference, missing rates, 0–100 quality score. Run once after upload. |
| POST /datasets/{job_id}/stages/missing_values | Detect missingness mechanism (MCAR / MAR / MNAR), recommend fill strategy per column. Creates decision cards. |
| POST /datasets/{job_id}/stages/outliers | IQR + modified Z-score + Isolation Forest. Classifies findings as data errors vs rare events. |
| POST /datasets/{job_id}/stages/scaling | Normality-driven scaler choice: Standard / Robust / Box-Cox / Yeo-Johnson / MinMax / log1p. |
| POST /datasets/{job_id}/stages/correlation | Find correlated pairs, detect constant / ID-like / target-leaking columns. |
| POST /datasets/{job_id}/stages/encoding | Categorical encoding routed by cardinality: label / ordinal / one-hot / frequency / target. |
| POST /datasets/{job_id}/stages/{stage_name}/execute | Apply your approved decisions, commit a new version. stage_name is one of the five above. |
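For readers unfamiliar with two of the outlier detectors named in the outliers stage, the sketch below shows textbook versions of the IQR fence rule and the MAD-based modified Z-score using only the standard library. It is an illustration of the techniques, not PrePro Auto's implementation; the 1.5 fence multiplier and the 3.5 threshold are the conventional defaults and are assumptions here, and Isolation Forest is left out because it needs scikit-learn.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


def modified_z_outliers(values, threshold=3.5):
    """Indices whose modified Z-score, 0.6745 * |x - median| / MAD, exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # more than half the values are identical; the score is undefined
    return [i for i, v in enumerate(values) if 0.6745 * abs(v - median) / mad > threshold]


readings = [10, 11, 12, 11, 10, 12, 11, 100]
print(iqr_outliers(readings), modified_z_outliers(readings))
```

Both rules flag only the last reading (index 7) in this sample. Neither can tell a data error from a rare but genuine event, which is the kind of judgment the decision cards leave to a human.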
Decision cards (the human-in-the-loop)
| Endpoint | What it does |
|---|---|
| GET /datasets/{job_id}/decisions?stage=<stage> | List decision cards for a stage |
| POST /decisions/{decision_id}/approve | Use the recommended action |
| POST /decisions/{decision_id}/override | Use an alternative action (body: {"action": "...", "reason": "..."}) |
| POST /decisions/{decision_id}/skip | Don't change this column |
| POST /decisions/{decision_id}/drop-column | Drop the column entirely |
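As a rough illustration of calling a decision endpoint from a script, the sketch below builds (but does not send) an override request with the standard library. Several details are assumptions rather than documented facts: that decision endpoints sit under the /api/v1 prefix on the default CLI port 8000, that the body is sent as JSON, and the decision ID "d-001" and action "median" are invented placeholders. Check the Swagger UI for the real schema.

```python
import json
from urllib.request import Request

# Assumed base URL: default CLI port plus the /api/v1 prefix.
BASE_URL = "http://127.0.0.1:8000/api/v1"


def override_request(decision_id: str, action: str, reason: str) -> Request:
    """Build a POST request that overrides a decision card's recommended action."""
    body = json.dumps({"action": action, "reason": reason}).encode("utf-8")
    return Request(
        f"{BASE_URL}/decisions/{decision_id}/override",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "d-001" and "median" are hypothetical values for illustration.
req = override_request("d-001", "median", "distribution is heavily skewed")
print(req.get_method(), req.full_url)
# With a server running, urllib.request.urlopen(req) would send it.
```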
Manual transforms (when you need more control)
| Endpoint | What it does |
|---|---|
| GET /datasets/{job_id}/transform/operations | List all 17 preset operations and their parameters |
| POST /datasets/{job_id}/transform/preset | Apply one preset op (rename, drop, cast, fillna, filter, merge, math, map, string ops, regex, sort, dedup, extract-number) |
| POST /datasets/{job_id}/transform/expression | Run a sandboxed pandas expression (e.g. df["profit"] = df["revenue"] - df["cost"]) |
| POST /datasets/{job_id}/transform/batch | Apply one operation across many columns as a single undoable step |
AI-assisted transforms (optional, needs an API key)
| Endpoint | What it does |
|---|---|
| POST /datasets/{job_id}/transform/ai-propose | Describe a change in plain English; AI proposes a concrete transform with preview |
| POST /datasets/{job_id}/transform/ai-advise | Ask the AI for advice on a column without changing anything |
| POST /datasets/{job_id}/transform/assistant | One-shot assistant call (full message) |
| POST /datasets/{job_id}/transform/chat | Multi-turn conversation preserving history |
Versioning and history
| Endpoint | What it does |
|---|---|
| GET /datasets/{job_id}/view | Current (active-version) data with shape, dtypes, sample rows |
| GET /datasets/{job_id}/history | Full version history with labels |
| POST /datasets/{job_id}/undo | Move active pointer back one version |
| POST /datasets/{job_id}/redo | Move active pointer forward one version |
| GET /datasets/{job_id}/snapshots | List all committed snapshots |
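The "active pointer" wording suggests a linear list of versions with an index into it. The toy class below sketches that mental model; it is not PrePro Auto's code, and one behaviour in particular is an assumption: here, committing after an undo discards the versions that were ahead of the pointer, which is the usual editor convention but may not be what the engine does.

```python
class VersionHistory:
    """Toy model of a version list with an active pointer (illustrative only)."""

    def __init__(self, initial):
        self._versions = [initial]
        self._active = 0

    @property
    def current(self):
        return self._versions[self._active]

    def commit(self, snapshot):
        # Assumption: a new commit drops any versions ahead of the pointer.
        del self._versions[self._active + 1:]
        self._versions.append(snapshot)
        self._active += 1

    def undo(self):
        if self._active > 0:
            self._active -= 1
        return self.current

    def redo(self):
        if self._active < len(self._versions) - 1:
            self._active += 1
        return self.current


history = VersionHistory("raw")
history.commit("missing values filled")
history.commit("scaled")
print(history.undo())  # back to the previous version
print(history.redo())  # forward again
```

Undo and redo only move the pointer, so no data is recomputed; that is why every committed step can be revisited cheaply.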
Visualization and monitoring
| Endpoint | What it does |
|---|---|
| POST /datasets/{job_id}/viz/chart | Build a histogram, bar, or scatter chart |
| POST /datasets/{job_id}/viz/metric | Compute a condition-based metric (e.g. "rows where price > 1000") |
| POST /datasets/{job_id}/viz/compare | Compare one column's distribution raw vs current |
| GET /datasets/{job_id}/viz/dashboard | Power-BI-style before/after dashboard (KPI tiles + per-column comparison) |
| POST /drift/compare | Compare two uploaded datasets for distribution drift (PSI + KS) |
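For context on the PSI half of the drift check, the Population Stability Index over a set of bins is the sum of (actual share - expected share) * ln(actual share / expected share). The sketch below computes it with equal-width bins taken from the baseline sample. It is a textbook illustration, not the engine's code; the bin count, the epsilon used for empty bins, and the function name psi are assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [float(v) for v in range(100)]
shifted = [v + 50 for v in baseline]
print(psi(baseline, baseline), psi(baseline, shifted))
```

An identical sample scores exactly zero, and the score grows as mass moves between bins. The KS test complements this by comparing the empirical cumulative distributions directly, without binning.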
3. Output functions
The artifacts you take away from a session. Notebook methods return Python objects; REST endpoints return downloadable files.
From the notebook (Python objects)
| Method | Returns | Where to use it |
|---|---|---|
| session.current() | pandas DataFrame | The current (active-version) DataFrame as it stands in the UI. Drop straight into model.fit(X, y). |
| session.url | str | The workbench URL for this session, useful for re-opening after closing the tab. |
| session.job_id | str | The internal job ID; use it for raw REST API calls. |
| session.port | int | The local port the server is running on. |
From the REST API or web UI (downloadable files)
| Endpoint | File | Where to use it |
|---|---|---|
| GET /datasets/{job_id}/export/data?format=csv | Cleaned CSV | Share with teammates, load into BI tools (Tableau, Power BI, Looker), commit to a versioned data repo. |
| GET /datasets/{job_id}/export/data?format=parquet | Cleaned Parquet | Faster and smaller than CSV for large datasets; preserves dtypes exactly. |
| GET /datasets/{job_id}/export/audit | Audit PDF | Compliance trail listing every transformation with parameters, before/after stats, who approved. Attach to a model-card or hand to a data-governance reviewer. |
| GET /datasets/{job_id}/export/pipeline | Runnable .py script | Reproduces the exact cleaning with pandas + scikit-learn, no PrePro Auto dependency. Drop into Airflow / Prefect / GitHub Actions. Run with python pipeline.py raw.csv ready.csv. |
| POST /drift/compare (returns JSON) | Drift report | Per-column PSI / KS verdicts with severity bands. Plug into a monitoring dashboard, alert on overall_verdict == "significant_drift". |
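The alerting rule mentioned for the drift report can be sketched as a small check on the returned JSON. Only the overall_verdict field and the "significant_drift" value come from the description above; the rest of the payload shape is unknown here, so the sample strings below are stripped-down placeholders and the helper name needs_alert is invented.

```python
import json


def needs_alert(report_json: str) -> bool:
    """True when a drift report's overall verdict is 'significant_drift'."""
    report = json.loads(report_json)
    return report.get("overall_verdict") == "significant_drift"


# Placeholder payloads: a real report carries per-column details as well.
print(needs_alert('{"overall_verdict": "significant_drift"}'))
print(needs_alert('{"overall_verdict": "some_other_verdict"}'))
```

Using .get() means a report without the field is treated as "no alert" rather than raising, which may or may not be the behaviour you want in a monitoring job.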
Two typical workflows end-to-end
# Workflow 1 — notebook to model, no file I/O:
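import prepro_auto
session = prepro_auto.launch_file(r"C:\data\sales.csv")
# ...clean visually in the browser, then:
df = session.current()  # fetch the active version once
X = df.drop(columns=["target"])
y = df["target"]
model.fit(X, y)  # model: any scikit-learn-style estimator you have already created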
# Workflow 2 — clean once, productionize the pipeline:
# 1) Download pipeline.py from the workbench's Export step
# 2) Commit it to your model repo
# 3) In production:
# subprocess.run(["python", "pipeline.py", "incoming.csv", "ready.csv"])
What it does
- Profile — per-column type inference, missing rates, 0–100 quality score
- Clean (guided) — five HITL stages: missing values, outliers, scaling, correlation/leakage, encoding
- Transform (manual) — 17 preset ops, sandboxed expressions, multi-column batches
- AI assistant — optional; describe a change in plain English; preview before applying
- Visualize & dashboard — histograms, bar, scatter; before/after dashboard with KPI tiles
- Data drift — PSI + KS test between two datasets
- Undo/redo — every change is a version
- Export — cleaned data (CSV/Parquet), audit PDF, runnable Python pipeline
Methods
Field-standard methods throughout: MICE / KNN / median imputation; IQR + MAD (modified Z-score) + Isolation Forest for outliers; normality-driven scaling (Standard / Robust / Box-Cox / Yeo-Johnson); and label / ordinal / one-hot / frequency / target encoding. These are the same algorithms a data scientist would write by hand, so the guided workflow does not trade accuracy for convenience.
AI providers (optional)
AI features are optional. Everything works offline without a key. PrePro Auto supports five providers:
| Provider | ID | Install | Get a key |
|---|---|---|---|
| Groq (free tier, fast) | groq | pip install prepro-auto[groq] | https://console.groq.com |
| OpenAI / GPT | openai | pip install prepro-auto[openai] | https://platform.openai.com |
| Anthropic Claude | anthropic | pip install prepro-auto[anthropic] | https://console.anthropic.com |
| Google Gemini | gemini | pip install prepro-auto[gemini] | https://aistudio.google.com/app/apikey |
| Mistral | mistral | pip install prepro-auto[mistral] | https://console.mistral.ai |
Or install all five at once: pip install prepro-auto[ai].
Three ways to give PrePro Auto your API key
1. Notebook (in-memory, session-only — safest):
prepro_auto.set_api_key("openai", "sk-...")
2. Web UI: click AI Provider → Configure API key in the side rail, paste key, click Test & apply.
3. .env file (survives restarts):
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
Security note: the .env file is plain text. Fine for a personal machine; never enable disk-persistence on a shared or hosted deployment.
REST API reference
The web app and SDK both call the same endpoints under /api/v1. Once the server is running, the interactive Swagger UI is at http://localhost:8000/docs.
For the full table organized by category, see Section 2 — Preprocessing functions and Section 3 — Output functions above. System endpoints:
| Endpoint | Purpose |
|---|---|
| GET /api/v1/health | Liveness check |
| GET /api/v1/system/limits | Live RAM-aware upload limits |
| GET /api/v1/system/llm | List providers + active one |
| POST /api/v1/system/llm/configure | Set provider + key at runtime |
Documentation
- Project Overview (PDF) — what it is, who it's for, design philosophy
- Developer Guide (PDF) — every function, endpoint, UI element
- Update Guide (PDF) — for maintainers shipping new versions
- Interactive Swagger at http://localhost:8000/docs (once running)
License
MIT