AI-assisted, human-in-the-loop tabular data preprocessing — profile, clean, transform, and export any dataset with a reproducible pipeline, from a notebook or the web.

PrePro Auto

AI-assisted tabular data preprocessing with human-in-the-loop control.

Profile, clean, transform, and export any tabular dataset — from a Jupyter notebook or a local web UI — with every step undoable, auditable, and reproducible. The same engine drives both interfaces, so results are identical wherever you call it from.

pip install prepro-auto

Author: Shivanshu Pandey · Source: github.com/Chilliflex/prepro_auto

Quickstart — Notebook (no upload)

import pandas as pd
import prepro_auto

df = pd.read_csv("your_data.csv")
session = prepro_auto.launch(df)        # opens the local workbench, NO upload
# -> click the printed http://127.0.0.1:8721/workbench?job=... link

# clean visually in the browser, then back in the notebook:
cleaned = session.current()             # the UI-edited DataFrame
session.update(cleaned)                 # push notebook edits back to the UI

That's the whole loop. Your DataFrame is loaded directly from the notebook's memory — no file upload, no context switch. df (your original) never changes; session.current() always returns the latest cleaned version.

Quickstart — Web UI

prepro_auto                             # starts the workbench at http://127.0.0.1:8000

Then open http://127.0.0.1:8000/workbench and upload a file.

What it does

Profile — per-column type inference, missing rates, 0–100 quality score
Clean (guided) — missing values, outliers, scaling, correlation/leakage, encoding; each issue becomes a reviewable decision with a recommended action and alternatives
Transform (manual) — 17 preset ops, sandboxed expressions, multi-column batches
AI assistant — optional; describe a change in plain English; confirms intent and shows a real preview before applying
Visualize & dashboard — histograms, bar, scatter charts, plus a before/after dashboard with KPI tiles and per-column comparison
Data drift — compare two datasets to detect distribution shifts (PSI + KS)
Undo/redo — every change is a version
Export — clean data (CSV/Parquet), audit PDF, and a runnable Python pipeline script
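To make the Profile step concrete, here is a minimal pandas sketch of per-column type and missing-rate profiling. It is an illustration only, not PrePro Auto's implementation: the function name `profile_columns` is hypothetical, and the 0–100 quality score is left out because its formula is not described here.

```python
import pandas as pd


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing rate, and distinct count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing_rate": df.isna().mean(),       # share of missing cells per column
            "n_unique": df.nunique(dropna=True),    # distinct non-missing values
        }
    )


sample = pd.DataFrame(
    {
        "age": [34, None, 51, 29],
        "city": ["Pune", "Delhi", None, None],
    }
)
report = profile_columns(sample)
print(report)
```

A report like this is the kind of input a guided cleaning step could use to decide which columns need a missing-value decision.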

What you get out of PrePro Auto

Five concrete outputs you can take away after a session. Each one is designed to plug straight into a real-world workflow:

Output	What it is	Where to use it
Cleaned DataFrame	The in-memory DataFrame after all your cleaning and transforms, returned by `session.current()` in the notebook	Feed straight into model training, for example `model.fit(X, y)` with scikit-learn, XGBoost, or LightGBM, or convert to tensors for PyTorch or TensorFlow. No file I/O needed.
Cleaned dataset file	A CSV or Parquet file via `GET /datasets/{job_id}/export/data?format=csv` (or `format=parquet`)	Share with teammates, upload to a feature store, load into BI tools (Tableau, Power BI, Looker), commit to a versioned data repo, or feed into downstream ETL jobs. Parquet is smaller and faster for large datasets.
Audit PDF	A multi-page PDF via `GET /datasets/{job_id}/export/audit` listing every transformation with its parameters, before/after stats, and who approved it	Compliance trail for regulated industries (finance, healthcare, insurance); attach to a model-card or experiment-tracking entry; hand to a reviewer or data-governance team to prove the cleaning is reproducible and reasoned, not arbitrary.
Runnable Python pipeline	A standalone `.py` script via `GET /datasets/{job_id}/export/pipeline` that reproduces the exact cleaning with pandas + scikit-learn — no PrePro Auto dependency	Drop into a production training pipeline, an Airflow/Prefect/Dagster DAG, a CI job, or a coworker's machine. They run `python pipeline.py raw.csv clean.csv` and get the same result you produced visually.
Drift report	A per-column JSON verdict (PSI, KS test, severity bands) via `POST /drift/compare` between two datasets	Monitor a deployed model — compare last month's input distribution to this month's. Catch silent data shifts (a new product category, a sensor recalibration, a market regime change) before they degrade model performance. Plug into a monitoring dashboard or alert on `overall_verdict == "significant_drift"`.
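For readers unfamiliar with PSI, the sketch below shows one common way to compute the Population Stability Index between a baseline and a new sample of one numeric column. It is not PrePro Auto's code: the equal-width binning, the `eps` floor for empty bins, and the function name `psi` are assumptions made for illustration, and no severity thresholds are implied.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the baseline range into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]        # spread evenly over [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]   # squeezed into [0.5, 1)

print(psi(baseline, baseline))  # identical samples give 0.0
print(psi(baseline, shifted) > psi(baseline, baseline))
```

Larger values mean the new sample's distribution has moved further from the baseline; a drift report would typically combine this with a KS test per column.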

Two common workflows:

# Workflow 1 — notebook to model, all in-process (zero file I/O):
session = prepro_auto.launch(df)
# ...clean visually in the browser...
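cleaned = session.current()             # fetch once so X and y come from the same version
X = cleaned.drop(columns=["target"])
y = cleaned["target"]
model.fit(X, y)                         # model: any estimator with a fit(X, y) method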

# Workflow 2 — clean once, productionize with the exported pipeline:
# 1) export pipeline.py from the workbench
# 2) commit pipeline.py to your model repo
# 3) in production: subprocess.run(["python", "pipeline.py", "incoming.csv", "ready.csv"])
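Step 3 of Workflow 2 can be sketched as a small wrapper that fails loudly when the exported script errors or writes nothing. The command-line shape (`pipeline.py <input> <output>`) comes from the description above; the helper name `run_pipeline` and the output-file check are assumptions for illustration.

```python
import subprocess
import sys
from pathlib import Path


def run_pipeline(script: Path, raw_csv: Path, clean_csv: Path) -> None:
    """Run an exported pipeline script and fail loudly if it errors."""
    subprocess.run(
        [sys.executable, str(script), str(raw_csv), str(clean_csv)],
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    if not clean_csv.exists():
        raise FileNotFoundError(f"pipeline did not write {clean_csv}")
```

Using `sys.executable` keeps the script on the same interpreter (and virtual environment) as the calling process, which avoids picking up a different `python` from `PATH`.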

Methods

Field-standard methods throughout: MICE / KNN / median imputation, IQR + MAD + Isolation Forest for outliers, normality-driven scaling (Standard / Robust / Box-Cox / Yeo-Johnson), label / ordinal / one-hot / frequency / target encoding. These are the same algorithms a data scientist would write by hand.
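As an illustration of two of the outlier methods named above, here is a standard-library sketch of the IQR rule and the MAD-based modified z-score. The multipliers (`k=1.5`, `threshold=3.5`, and the 0.6745 scaling constant) are the conventional textbook defaults, assumed here; the values PrePro Auto actually uses are not stated in this document.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so the score is undefined
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


data = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(data))  # [95]
print(mad_outliers(data))  # [95]
```

Both rules rely on quartiles or medians rather than the mean, so a single extreme value like 95 does not distort the bounds it is tested against.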

AI providers (optional)

AI features are optional. Everything works offline without a key. PrePro Auto supports five providers:

Provider	ID	Install	Get a key
Groq (free tier, fast)	`groq`	`pip install prepro-auto[groq]`	https://console.groq.com
OpenAI / GPT	`openai`	`pip install prepro-auto[openai]`	https://platform.openai.com
Anthropic Claude	`anthropic`	`pip install prepro-auto[anthropic]`	https://console.anthropic.com
Google Gemini	`gemini`	`pip install prepro-auto[gemini]`	https://aistudio.google.com/app/apikey
Mistral	`mistral`	`pip install prepro-auto[mistral]`	https://console.mistral.ai

Or install all five at once: pip install prepro-auto[ai].

Three ways to give PrePro Auto your API key

1. From the notebook (in-memory, session-only — safest):

import prepro_auto
prepro_auto.set_api_key("openai", "sk-...")  # any of the 5 provider IDs
session = prepro_auto.launch(df)

The key lives only in the running process and is lost on restart, so re-enter it in the next session. PrePro Auto makes a tiny test call before returning, so you know immediately whether the key works.

2. From the web UI (in-memory by default, optional .env persistence):

In the workbench, click "AI settings (API key)" in the side rail. Pick a provider, paste the key, click Test & apply. PrePro Auto verifies the key with a live test call before accepting it. Tick "Also save to .env" if you want it to survive restarts (local convenience only — leave unchecked on any shared/hosted machine).

3. From a .env file (persists across restarts):

Add to .env in the project root:

LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...

Each provider has its own env-key name: GROQ_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, MISTRAL_API_KEY.

Security note: the .env file is plain text. That is fine on a personal machine, but do not use the persist option on a hosted or shared deployment until proper per-user auth is in place.
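If you prefer to keep the key out of a file entirely, one option is to read it from the process environment and hand it to `set_api_key` at runtime. The sketch below only uses the variable names listed above; the helper name `key_from_env` and the lookup logic are assumptions for illustration, not part of PrePro Auto.

```python
import os

# Env-var names as listed above, keyed by provider ID.
ENV_KEY_BY_PROVIDER = {
    "groq": "GROQ_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def key_from_env(environ=os.environ):
    """Return (provider, key) for the provider named in LLM_PROVIDER, or None."""
    provider = environ.get("LLM_PROVIDER", "").strip().lower()
    env_name = ENV_KEY_BY_PROVIDER.get(provider)
    if env_name is None:
        return None  # LLM_PROVIDER is unset or not one of the five IDs
    key = environ.get(env_name, "").strip()
    return (provider, key) if key else None
```

In a notebook, the result could be passed on as `prepro_auto.set_api_key(provider, key)` when it is not `None`, which keeps the key in memory only.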

Notebook API reference

After import prepro_auto, these are the top-level functions:

Function	What it does
`prepro_auto.launch(df, domain="general", port=None, open_browser=False)`	Registers an in-memory DataFrame as a job (no upload), starts the local workbench server, returns a `Session`. Prints a clickable URL.
`prepro_auto.set_api_key(provider, api_key, model=None)`	Sets the AI provider and key at runtime (in-memory). Returns `{ok, provider, model, verified, reason}` after a live test call.

After session = prepro_auto.launch(df):

Method / property	What it does
`session.current()`	Returns the current (active-version) DataFrame as it stands in the UI right now.
`session.update(df)`	Pushes a notebook-edited DataFrame to the UI as a new undoable version.
`session.url`	The workbench URL for this session.
`session.job_id`	The internal job ID for this session.
`session.port`	The local port the workbench server is running on.

Typical sync cycle:

cur = session.current()                            # pull current state from UI
cur["price_per_sqft"] = cur["price"] / cur["sqft"] # your own code
session.update(cur)                                # push back, refresh UI to see it

REST API reference

The web app and SDK both call the same endpoints, all under /api/v1. Once the server is running, the live interactive docs are at http://localhost:8000/docs.

Endpoint	Purpose
`POST /datasets/upload`	Upload a dataset
`GET /datasets/{job_id}/preview`	First rows + shape
`POST /datasets/{job_id}/profile`	Per-column profile + quality score
`GET /datasets/{job_id}/view`	The current (active-version) data
`GET /datasets/{job_id}/comparison`	Raw vs current summary
`POST /datasets/{job_id}/stages/{stage}`	Run a cleaning stage (`missing_values`, `outliers`, `scaling`, `correlation`, `encoding`)
`POST /datasets/{job_id}/stages/{stage}/execute`	Apply approved decisions, commit a snapshot
`GET /datasets/{job_id}/decisions`	List decision cards (filter by `?stage=`)
`POST /decisions/{id}/approve` · `/override` · `/skip` · `/drop-column`	Resolve a card
`GET /datasets/{job_id}/queue`	Decision summary across stages
`GET /datasets/{job_id}/history` · `POST /undo` · `POST /redo`	Version history & navigation
`GET /datasets/{job_id}/snapshots`	List committed versions
`GET /datasets/{job_id}/transform/operations`	List available preset ops
`POST /datasets/{job_id}/transform/preset`	Apply a preset op (rename, drop, cast, fillna, filter, …)
`POST /datasets/{job_id}/transform/expression`	Run a sandboxed pandas expression
`POST /datasets/{job_id}/transform/batch`	Apply one op to many columns as one undoable step
`POST /datasets/{job_id}/transform/ai-propose` · `ai-advise` · `assistant` · `chat`	AI helpers (needs a key)
`POST /datasets/{job_id}/viz/chart` · `metric` · `compare` · `ask`	Charts and condition counts
`GET /datasets/{job_id}/viz/dashboard`	Before/after KPI dashboard
`POST /drift/compare`	Drift detection between two uploaded datasets
`GET /datasets/{job_id}/export/data` · `audit` · `pipeline`	Clean data, audit PDF, reproducible script
`GET /system/limits`	Live RAM-aware upload limits
`GET /system/llm`	List available providers + active one
`POST /system/llm/configure`	Set provider + key at runtime

Open http://localhost:8000/docs after prepro_auto is running for the interactive Swagger UI with full request/response schemas.
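Because the table lists paths relative to the /api/v1 prefix, a small helper for building full URLs can save typos when scripting against the API. This is a sketch under assumptions: the host and port are taken from the web UI quickstart, `endpoint` is a hypothetical helper, and `JOB_ID` is a placeholder rather than a real job.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000/api/v1"  # host/port from the web UI quickstart


def endpoint(path: str, **query) -> str:
    """Join a path from the table onto the /api/v1 prefix, with optional query."""
    url = f"{BASE_URL}/{path.lstrip('/')}"
    return f"{url}?{urlencode(query)}" if query else url


job_id = "JOB_ID"  # placeholder; in a notebook this would be session.job_id
print(endpoint(f"/datasets/{job_id}/export/data", format="parquet"))
print(endpoint(f"/datasets/{job_id}/decisions", stage="outliers"))
```

The resulting URLs can be fetched with any HTTP client; check the interactive docs for the exact request and response schemas before relying on a particular field.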

License

MIT

Release history

1.0.0b4 (pre-release): May 31, 2026
1.0.0b3 (pre-release): May 30, 2026
1.0.0b2 (pre-release): May 30, 2026
1.0.0b1 (pre-release, this version): May 29, 2026

Download files

Source Distribution

prepro_auto-1.0.0b1.tar.gz (133.2 kB), source, uploaded May 29, 2026

Built Distribution

prepro_auto-1.0.0b1-py3-none-any.whl (153.2 kB), Python 3, uploaded May 29, 2026

File details

Details for the file prepro_auto-1.0.0b1.tar.gz.

File metadata

Download URL: prepro_auto-1.0.0b1.tar.gz
Upload date: May 29, 2026
Size: 133.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for prepro_auto-1.0.0b1.tar.gz
Algorithm	Hash digest
SHA256	`ab621dd8ed9857b5a1f23419ac8944d11d02a200544c029dfe3e3e8460be1d28`
MD5	`5bef8e4428d6b4f67855c52d676b8c08`
BLAKE2b-256	`af510dd5f18a4cf292d2726a41b40b8415b3a69c9558c437a886d9723d0fe25c`

File details

Details for the file prepro_auto-1.0.0b1-py3-none-any.whl.

File metadata

Download URL: prepro_auto-1.0.0b1-py3-none-any.whl
Upload date: May 29, 2026
Size: 153.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for prepro_auto-1.0.0b1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8dd1c1ae543f75835b6f8cd0e0d31851458af5166b9ec871bb9523e1ed27fde9`
MD5	`64fbb64cce661d4f2d7bc554483bee60`
BLAKE2b-256	`67dce0868cf7b84d0c7ffe937340b6f61d9138a73829cfe593c9a09465b612bf`
