AI-assisted, human-in-the-loop tabular data preprocessing — profile, clean, transform, and export any dataset with a reproducible pipeline, from a notebook or the web.

PrePro Auto

AI-assisted tabular data preprocessing with human-in-the-loop control.

Profile, clean, transform, and export any tabular dataset — from a Jupyter notebook or a local web UI — with every step undoable, auditable, and reproducible. The same engine drives both interfaces, so results are identical wherever you call it from.

pip install prepro-auto

Author: Shivanshu Pandey · Source: github.com/Chilliflex/prepro_auto

Quickstart — get going in 30 seconds
Ways to give PrePro Auto your data — 4 from notebook, 3 from web UI
1. Input functions (notebook) — how to load data into a session
2. Preprocessing functions — clean, transform, visualize
3. Output functions — DataFrames, files, audit PDFs, pipelines
AI providers — optional, 5 providers supported
REST API reference
Documentation

Quickstart

Notebook (recommended for data scientists):

import prepro_auto

# Easiest: point at a file, auto-detects encoding (handles Latin-1, cp1252, BOM)
session = prepro_auto.launch_file(r"C:\path\to\your_data.csv")
# Click the printed http://127.0.0.1:8721/workbench?job=... link

Web UI (recommended for analysts): open a terminal such as Command Prompt (not a Jupyter cell) and run:

prepro_auto

Then open http://127.0.0.1:8000/workbench and drag-drop a file.

Note: prepro_auto typed inside a Jupyter cell just prints the module object — it doesn't start a server. The CLI command runs from a terminal only. Inside a notebook, use prepro_auto.launch_file(path) or prepro_auto.launch(df) instead.

Ways to give PrePro Auto your data

There are 4 input methods in the notebook and 3 in the web UI. Pick whichever fits your workflow.

From a Jupyter notebook (4 ways)

#	Method	When to use it
1	`prepro_auto.launch_file(path)`	You have a file on disk — CSV, Excel, JSON, Parquet, etc. Easiest path. Auto-detects encoding and delimiter. No `pd.read_csv()` needed.
2	`prepro_auto.launch(df)`	You already have a pandas DataFrame in memory (from a database query, API response, generated data, or a tricky read you handled yourself).
3	`session.update(df)`	You already have a session and want to push a new DataFrame to it (e.g. after notebook-side edits). Commits a new undoable version.
4	Web upload, then notebook reads	Start the server with the CLI, upload via browser, then in the notebook do `prepro_auto.Session(job_id, port).current()` to pull the data back into Python. Rare but valid.

From the web UI (3 ways)

#	Method	When to use it
1	Drag-and-drop upload on Step 1 of the workbench	Standard. Drop a CSV/Parquet/Excel/JSON file into the upload box. The engine auto-detects encoding and delimiter.
2	File picker on Step 1	Same as drag-and-drop, just clicked. Useful when dragging is awkward (split screens, touchpads).
3	URL parameter `?job=<id>`	When the notebook launched the session, the printed URL already includes `?job=...` — no upload needed, the workbench adopts the existing job.

Supported file formats

Format	Extensions	Notes
CSV	`.csv`, `.tsv`, `.txt`	Auto-detects encoding (utf-8 / utf-8-sig / latin-1 / cp1252) and delimiter (comma, tab, semicolon, pipe)
Excel	`.xlsx`, `.xls`, `.xlsm`	First sheet by default; multi-sheet handling via the upload form
Parquet	`.parquet`, `.pq`	Fastest format for large datasets, preserves dtypes
JSON	`.json`, `.jsonl`, `.ndjson`	JSON-records and JSON-lines both supported
Feather	`.feather`	Apache Arrow's native columnar format
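
The encoding and delimiter detection described for CSV files can be approximated with the standard library alone. The following is a minimal sketch under stated assumptions, not PrePro Auto's actual implementation: the function names `detect_encoding` and `detect_delimiter` are hypothetical, and the order in which encodings are tried is a guess based on the list in the table above.

```python
import codecs
import csv

# Encodings tried in order after the BOM check. latin-1 comes last because it
# accepts any byte sequence and therefore never fails.
FALLBACK_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def detect_encoding(raw: bytes) -> str:
    """Return the first candidate encoding that decodes `raw` without error."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # UTF-8 with a byte-order mark
    for encoding in FALLBACK_ENCODINGS:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return "latin-1"  # explicit default; the loop above already covers it


def detect_delimiter(sample: str, candidates: str = ",\t;|") -> str:
    """Guess the delimiter from a text sample, falling back to a comma."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return ","


# A semicolon-separated file saved as cp1252 (hypothetical sample data).
raw = "name;city\nZoë;Köln\nAna;Porto\n".encode("cp1252")
encoding = detect_encoding(raw)
delimiter = detect_delimiter(raw.decode(encoding))
print(encoding, repr(delimiter))
```

If detection picks the wrong encoding for a file, reading it yourself with `pd.read_csv(path, encoding=..., sep=...)` and passing the result to `prepro_auto.launch(df)` remains an option.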

Not supported: PDF, DOCX, HTML, images. PrePro Auto is a tabular-data tool — these formats need a dedicated extraction step first (Camelot or pdfplumber for PDFs, BeautifulSoup for HTML).

Have a PDF with a table?

Extract it to a DataFrame first, then hand it to PrePro Auto:

import pdfplumber, pandas as pd, prepro_auto

with pdfplumber.open("report.pdf") as pdf:
    rows = pdf.pages[0].extract_table()        # pick the right page
df = pd.DataFrame(rows[1:], columns=rows[0])    # first row is the header
session = prepro_auto.launch(df)                # now clean it like any DataFrame

For PDFs with merged cells or complex layouts, try camelot-py (better for bordered tables) or tabula-py (requires Java). PrePro Auto deliberately leaves PDF extraction to specialised tools because generic PDF-to-table conversion succeeds only ~30–70% of the time depending on the document — bundling it would mean silent extraction errors hidden under PrePro Auto's name.
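
Extracted tables are often messier than the snippet above assumes: pdfplumber's `extract_table()` can return `None` when it finds no table, and empty cells come back as `None`. The helper below is a hypothetical sketch (the name `table_to_records` is not part of PrePro Auto or pdfplumber) that normalises such output before building a DataFrame.

```python
from typing import Optional


def table_to_records(rows: Optional[list]) -> list:
    """Convert a list-of-rows table (header first) into a list of dicts."""
    if not rows or len(rows) < 2:
        return []  # no table detected, or a header without data rows
    # Replace missing or blank header cells with positional names.
    header = [
        (cell or "").strip() or f"column_{i}" for i, cell in enumerate(rows[0])
    ]
    records = []
    for row in rows[1:]:
        if all(cell is None or not str(cell).strip() for cell in row):
            continue  # skip fully empty rows
        records.append(dict(zip(header, row)))
    return records


# Hypothetical extraction result with a blank header cell and an empty row.
sample = [
    ["id", None, "amount"],
    ["1", "a", "10.5"],
    [None, None, None],
    ["2", None, "7"],
]
records = table_to_records(sample)
print(records)
```

The resulting list can be passed to `pd.DataFrame(records)` and then to `prepro_auto.launch(df)`.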

1. Input functions (notebook)

Everything you call before preprocessing starts: the functions that get data into a session.

Function	Parameters	Returns	What it does
`prepro_auto.launch_file(file_path, domain="general", port=None, open_browser=False)`	`file_path`: str or Path	`Session`	Reads a file from disk with automatic encoding detection, starts the local workbench, and returns a session. Handles all supported formats. Prints the workbench URL.
`prepro_auto.launch(df, domain="general", port=None, open_browser=False)`	`df`: pandas DataFrame	`Session`	Registers an in-memory DataFrame as a job (no upload, no file I/O), starts the workbench, returns a session. Use when you already have a DataFrame.
`prepro_auto.Session(job_id, port)`	`job_id`: str, `port`: int	`Session`	Reconnect to an existing session by ID. Use when the notebook restarted but the server is still running, or to attach to a job created from the web UI.
`prepro_auto.set_api_key(provider, api_key, model=None)`	`provider`: one of `"groq" / "openai" / "anthropic" / "gemini" / "mistral"`	`dict` with `ok`, `verified`, `provider`, `model`, `reason`	Configures the AI provider at runtime (in-memory only — not written to disk). Makes a tiny test call to verify the key works. Call before `launch()` if you want AI features active for the session.

Example — most common pattern:

import prepro_auto

# Optional: enable AI features for this session
prepro_auto.set_api_key("openai", "sk-...")

# Load a file (encoding is detected automatically)
session = prepro_auto.launch_file(r"C:\Users\me\data\sales.csv")

2. Preprocessing functions

The work itself — clean, transform, version. These are called on the session object that input functions returned, or via REST endpoints under /api/v1/.

Profile and clean

Function / Endpoint	What it does
`POST /datasets/{job_id}/profile`	Per-column type inference, missing rates, 0–100 quality score. Run once after upload.
`POST /datasets/{job_id}/stages/missing_values`	Detect missingness mechanism (MCAR / MAR / MNAR), recommend fill strategy per column. Creates decision cards.
`POST /datasets/{job_id}/stages/outliers`	IQR + modified Z-score + Isolation Forest. Classifies findings as data errors vs rare events.
`POST /datasets/{job_id}/stages/scaling`	Normality-driven scaler choice: Standard / Robust / Box-Cox / Yeo-Johnson / MinMax / log1p.
`POST /datasets/{job_id}/stages/correlation`	Find correlated pairs, detect constant / ID-like / target-leaking columns.
`POST /datasets/{job_id}/stages/encoding`	Categorical encoding routed by cardinality: label / ordinal / one-hot / frequency / target.
`POST /datasets/{job_id}/stages/{stage_name}/execute`	Apply your approved decisions, commit a new version. `stage_name` is one of the five above.
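
To make the outlier stage more concrete, here is a minimal standard-library sketch of two of the three detectors named above: the IQR fence and the MAD-based modified Z-score. It is an illustration under assumptions, not the engine's code. The function names are hypothetical, the multiplier 1.5 and threshold 3.5 are the conventional textbook defaults (the values PrePro Auto uses are not documented here), and Isolation Forest is omitted because it needs scikit-learn.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {i for i, v in enumerate(values) if v < low or v > high}


def modified_z_outliers(values, threshold=3.5):
    """Indices whose MAD-based modified Z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return set()  # score is undefined when the MAD is zero
    return {
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - median) / mad) > threshold
    }


# Hypothetical column with one suspicious value at the end.
prices = [10, 12, 11, 13, 12, 11, 10, 12, 13, 95]
flagged_by_both = iqr_outliers(prices) & modified_z_outliers(prices)
print(sorted(flagged_by_both))
```

Requiring agreement between detectors, as the intersection does here, is one possible way to reduce false positives; whether the engine combines its detectors this way is an assumption.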

Decision cards (the human-in-the-loop)

Endpoint	What it does
`GET /datasets/{job_id}/decisions?stage=<stage>`	List decision cards for a stage
`POST /decisions/{decision_id}/approve`	Use the recommended action
`POST /decisions/{decision_id}/override`	Use an alternative action (body: `{"action": "...", "reason": "..."}`)
`POST /decisions/{decision_id}/skip`	Don't change this column
`POST /decisions/{decision_id}/drop-column`	Drop the column entirely
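
The stage and decision endpoints combine into a run, review, execute loop. The sketch below only builds the requests so it can be read without a running server. It assumes the endpoints live under `http://127.0.0.1:8000/api/v1` (the default CLI port; a notebook session may use a different one, see `session.port`), that stage endpoints accept an empty body, and that `"median"` is a valid action name. None of these are confirmed by the tables above, and `build_request` is a hypothetical helper.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000/api/v1"  # assumption: default CLI port


def build_request(method, path, payload=None):
    """Build (but do not send) a JSON request for the local API."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )


job_id = "JOB_ID"            # placeholder: use session.job_id
decision_id = "DECISION_ID"  # placeholder: taken from the decisions listing

steps = [
    # 1) Run the stage so it creates decision cards.
    build_request("POST", f"/datasets/{job_id}/stages/missing_values"),
    # 2) List the cards for review.
    build_request("GET", f"/datasets/{job_id}/decisions?stage=missing_values"),
    # 3) Override one card. "median" is a hypothetical action name.
    build_request(
        "POST",
        f"/decisions/{decision_id}/override",
        {"action": "median", "reason": "skewed distribution"},
    ),
    # 4) Apply the approved decisions and commit a new version.
    build_request("POST", f"/datasets/{job_id}/stages/missing_values/execute"),
]

for request in steps:
    print(request.get_method(), request.full_url)
```

With the server running, each request can be sent with `urllib.request.urlopen(request)`.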

Manual transforms (when you need more control)

Endpoint	What it does
`GET /datasets/{job_id}/transform/operations`	List all 17 preset operations and their parameters
`POST /datasets/{job_id}/transform/preset`	Apply one preset op (including rename, drop, cast, fillna, filter, merge, math, map, string ops, regex, sort, dedup, extract-number)
`POST /datasets/{job_id}/transform/expression`	Run a sandboxed pandas expression (e.g. `df["profit"] = df["revenue"] - df["cost"]`)
`POST /datasets/{job_id}/transform/batch`	Apply one operation across many columns as a single undoable step

AI-assisted transforms (optional, needs an API key)

Endpoint	What it does
`POST /datasets/{job_id}/transform/ai-propose`	Describe a change in plain English; AI proposes a concrete transform with preview
`POST /datasets/{job_id}/transform/ai-advise`	Ask the AI for advice on a column without changing anything
`POST /datasets/{job_id}/transform/assistant`	One-shot assistant call (full message)
`POST /datasets/{job_id}/transform/chat`	Multi-turn conversation preserving history

Versioning and history

Endpoint	What it does
`GET /datasets/{job_id}/view`	Current (active-version) data with shape, dtypes, sample rows
`GET /datasets/{job_id}/history`	Full version history with labels
`POST /datasets/{job_id}/undo`	Move active pointer back one version
`POST /datasets/{job_id}/redo`	Move active pointer forward one version
`GET /datasets/{job_id}/snapshots`	List all committed snapshots
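
The "active pointer" wording suggests a linear list of snapshots with an index into it. The class below is a small sketch of that idea, not the engine's data structure: `VersionHistory` is a hypothetical name, and the rule that a commit after an undo discards the undone versions is an assumption.

```python
class VersionHistory:
    """Linear history of snapshots with an active pointer."""

    def __init__(self, initial, label="raw"):
        self._versions = [(label, initial)]
        self._active = 0

    def commit(self, snapshot, label):
        # Assumption: committing after an undo discards the redo tail.
        del self._versions[self._active + 1:]
        self._versions.append((label, snapshot))
        self._active += 1

    def undo(self):
        """Move the pointer back one version, stopping at the first."""
        if self._active > 0:
            self._active -= 1
        return self.current()

    def redo(self):
        """Move the pointer forward one version, stopping at the last."""
        if self._active < len(self._versions) - 1:
            self._active += 1
        return self.current()

    def current(self):
        return self._versions[self._active][1]

    def labels(self):
        return [label for label, _ in self._versions]


history = VersionHistory({"rows": 100})
history.commit({"rows": 98}, "dedup")
history.commit({"rows": 98, "scaled": True}, "scaling")
history.undo()                           # active version is now "dedup"
history.commit({"rows": 95}, "filter")   # replaces the undone "scaling" version
print(history.labels(), history.current())
```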

Visualization and monitoring

Endpoint	What it does
`POST /datasets/{job_id}/viz/chart`	Build a histogram, bar, or scatter chart
`POST /datasets/{job_id}/viz/metric`	Compute a condition-based metric (e.g. "rows where price > 1000")
`POST /datasets/{job_id}/viz/compare`	Compare one column's distribution raw vs current
`GET /datasets/{job_id}/viz/dashboard`	Power-BI-style before/after dashboard (KPI tiles + per-column comparison)
`POST /drift/compare`	Compare two uploaded datasets for distribution drift (PSI + KS)
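
For readers unfamiliar with PSI, the Population Stability Index sums `(actual - expected) * ln(actual / expected)` over the bins of a histogram built on the reference data. The sketch below is a plain-Python illustration, not the drift endpoint's implementation: the function name, the ten equal-width bins, and the `eps` floor for empty bins are all assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference column

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i % 100 for i in range(1000)]   # evenly spread over 0..99
same = list(reference)
shifted = [v + 30 for v in reference]        # same shape, moved right

print(psi(reference, same))
print(psi(reference, shifted) > psi(reference, same))
```

A widely quoted rule of thumb reads PSI below 0.1 as stable and above 0.25 as a significant shift; the severity bands PrePro Auto applies may differ.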

3. Output functions

The artifacts you take away from a session. Notebook methods return Python objects; REST endpoints return downloadable files.

From the notebook (Python objects)

Method	Returns	Where to use it
`session.current()`	pandas DataFrame	The current (active-version) DataFrame as it stands in the UI. Drop straight into `model.fit(X, y)`.
`session.url`	str	The workbench URL for this session — useful for re-opening after closing the tab.
`session.job_id`	str	The internal job ID — use it for raw REST API calls.
`session.port`	int	The local port the server is running on.

From the REST API or web UI (downloadable files)

Endpoint	File	Where to use it
`GET /datasets/{job_id}/export/data?format=csv`	Cleaned CSV	Share with teammates, load into BI tools (Tableau, Power BI, Looker), commit to a versioned data repo.
`GET /datasets/{job_id}/export/data?format=parquet`	Cleaned Parquet	Faster and smaller than CSV for large datasets; preserves dtypes exactly.
`GET /datasets/{job_id}/export/audit`	Audit PDF	Compliance trail listing every transformation with parameters, before/after stats, who approved. Attach to a model-card or hand to a data-governance reviewer.
`GET /datasets/{job_id}/export/pipeline`	Runnable `.py` script	Reproduces the exact cleaning with pandas + scikit-learn, no PrePro Auto dependency. Drop into Airflow / Prefect / GitHub Actions. Run with `python pipeline.py raw.csv ready.csv`.
`POST /drift/compare` (returns JSON)	Drift report	Per-column PSI / KS verdicts with severity bands. Plug into a monitoring dashboard, alert on `overall_verdict == "significant_drift"`.

Two typical workflows end-to-end

# Workflow 1: notebook to model, no file I/O
import prepro_auto

session = prepro_auto.launch_file(r"C:\data\sales.csv")
# ...clean visually in the browser, then:
df = session.current()              # fetch the active version once
X = df.drop(columns=["target"])     # features
y = df["target"]                    # label column
model.fit(X, y)                     # model: an estimator you created earlier

# Workflow 2: clean once, productionize the pipeline
# 1) Download pipeline.py from the workbench's Export step
# 2) Commit it to your model repo
# 3) In production:
#    subprocess.run(["python", "pipeline.py", "incoming.csv", "ready.csv"], check=True)

What it does

Profile — per-column type inference, missing rates, 0–100 quality score
Clean (guided) — five HITL stages: missing values, outliers, scaling, correlation/leakage, encoding
Transform (manual) — 17 preset ops, sandboxed expressions, multi-column batches
AI assistant — optional; describe a change in plain English; preview before applying
Visualize & dashboard — histograms, bar, scatter; before/after dashboard with KPI tiles
Data drift — PSI + KS test between two datasets
Undo/redo — every change is a version
Export — cleaned data (CSV/Parquet), audit PDF, runnable Python pipeline

Methods

Standard methods throughout: MICE / KNN / median imputation; IQR + MAD-based modified Z-score + Isolation Forest for outliers; normality-driven scaling (Standard / Robust / Box-Cox / Yeo-Johnson / MinMax / log1p); label / ordinal / one-hot / frequency / target encoding. These are the same algorithms a data scientist would write by hand.

AI providers (optional)

AI features are optional. Everything works offline without a key. PrePro Auto supports five providers:

Provider	ID	Install	Get a key
Groq (free tier, fast)	`groq`	`pip install prepro-auto[groq]`	https://console.groq.com
OpenAI / GPT	`openai`	`pip install prepro-auto[openai]`	https://platform.openai.com
Anthropic Claude	`anthropic`	`pip install prepro-auto[anthropic]`	https://console.anthropic.com
Google Gemini	`gemini`	`pip install prepro-auto[gemini]`	https://aistudio.google.com/app/apikey
Mistral	`mistral`	`pip install prepro-auto[mistral]`	https://console.mistral.ai

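Or install all five at once: pip install prepro-auto[ai]. In shells that treat square brackets as glob patterns (zsh, for example), quote the requirement: pip install "prepro-auto[ai]".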

Three ways to give PrePro Auto your API key

1. Notebook (in-memory, session-only — safest):

prepro_auto.set_api_key("openai", "sk-...")

2. Web UI: click AI Provider → Configure API key in the side rail, paste key, click Test & apply.

3. .env file (survives restarts):

LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...

Security note: the .env file is plain text. Fine for a personal machine; never enable disk-persistence on a shared or hosted deployment.
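
If you prefer to keep the key out of notebook source while still using the in-memory route, one option is to read it from environment variables and pass it to `set_api_key`. The helper below is a hypothetical sketch: only `LLM_PROVIDER` and `OPENAI_API_KEY` appear in this README, so the `<PROVIDER>_API_KEY` naming for the other four providers is an assumption.

```python
import os

PROVIDERS = ("groq", "openai", "anthropic", "gemini", "mistral")


def resolve_llm_config(env=os.environ):
    """Return (provider, api_key), or None when no usable key is configured."""
    provider = env.get("LLM_PROVIDER", "").strip().lower()
    if provider not in PROVIDERS:
        return None
    # Assumption: keys follow the <PROVIDER>_API_KEY pattern shown for OpenAI.
    api_key = env.get(f"{provider.upper()}_API_KEY", "").strip()
    return (provider, api_key) if api_key else None


# A plain dict stands in for os.environ here; the key is a dummy value.
config = resolve_llm_config(
    {"LLM_PROVIDER": "openai", "OPENAI_API_KEY": "sk-example"}
)
print(config[0] if config else "AI features disabled")
```

With a real environment, `prepro_auto.set_api_key(*config)` would then configure the provider for the session.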

REST API reference

The web app and SDK both call the same endpoints under /api/v1. Once the server is running, the interactive Swagger UI is at http://localhost:8000/docs.

For the full table organized by category, see Section 2 — Preprocessing functions and Section 3 — Output functions above. System endpoints:

Endpoint	Purpose
`GET /api/v1/health`	Liveness check
`GET /api/v1/system/limits`	Live RAM-aware upload limits
`GET /api/v1/system/llm`	List providers + active one
`POST /api/v1/system/llm/configure`	Set provider + key at runtime
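
The health endpoint is handy for scripts that start the server and need to wait until it is ready. The following is a small polling sketch under assumptions: it treats any HTTP 200 from `/api/v1/health` as "up" (the response body is not documented here), it uses the default CLI port 8000, and both function names are hypothetical.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://127.0.0.1:8000/api/v1/health"  # assumption: default CLI port


def http_ok(url=HEALTH_URL, timeout=2.0):
    """True when the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(check=http_ok, attempts=10, delay=0.5):
    """Poll `check` until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # pause between attempts, not after the last one
    return False
```

Calling `wait_until_healthy()` after launching `prepro_auto` from a script returns `True` once the server responds, or `False` after roughly five seconds with the defaults shown.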

Documentation

Project Overview (PDF) — what it is, who it's for, design philosophy
Developer Guide (PDF) — every function, endpoint, UI element
Update Guide (PDF) — for maintainers shipping new versions
Interactive Swagger at http://localhost:8000/docs (once running)

License

MIT

Release history

1.0.0b4 pre-release

May 31, 2026

This version

1.0.0b3 pre-release

May 30, 2026

1.0.0b2 pre-release

May 30, 2026

1.0.0b1 pre-release

May 29, 2026

Download files

Source Distribution

prepro_auto-1.0.0b3.tar.gz (140.6 kB)

Uploaded May 30, 2026 Source

Built Distribution

prepro_auto-1.0.0b3-py3-none-any.whl (158.3 kB)

Uploaded May 30, 2026 Python 3

File details

Details for the file prepro_auto-1.0.0b3.tar.gz.

File metadata

Download URL: prepro_auto-1.0.0b3.tar.gz
Upload date: May 30, 2026
Size: 140.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for prepro_auto-1.0.0b3.tar.gz
Algorithm	Hash digest
SHA256	`55a6ff1b56a85c7645e7ab74292f895a4ae4bef14b5bc75e7f343e54126b995d`
MD5	`b9aa4ea411dc97dcff2c9773109e6970`
BLAKE2b-256	`b19641be54a59aa3c8fd9e1fcaaf5946a26fb96d0b38ddba49348bc1e7f51c3e`

File details

Details for the file prepro_auto-1.0.0b3-py3-none-any.whl.

File metadata

Download URL: prepro_auto-1.0.0b3-py3-none-any.whl
Upload date: May 30, 2026
Size: 158.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.0

File hashes

Hashes for prepro_auto-1.0.0b3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6b11b954ab133c13c3e7349efd3858050c2acac30fee1b0c1d770aca6e42c6f8`
MD5	`432eb98fcc61dbc652900e56cd664b67`
BLAKE2b-256	`b75cf0288c5562797bba093485a46e789269be2521825827e7fe9b4e52b6becb`
