
AI-assisted, human-in-the-loop tabular data preprocessing — profile, clean, transform, and export any dataset with a reproducible pipeline, from a notebook or the web.


PrePro Auto


AI-assisted tabular data preprocessing with human-in-the-loop control.

Profile, clean, transform, and export any tabular dataset — from a Jupyter notebook or a local web UI — with every step undoable, auditable, and reproducible. The same engine drives both interfaces, so results are identical wherever you call it from.

pip install prepro-auto

Author: Shivanshu Pandey · Source: github.com/Chilliflex/prepro_auto




Quickstart

Notebook (recommended for data scientists):

import prepro_auto

# Easiest: point at a file, auto-detects encoding (handles Latin-1, cp1252, BOM)
session = prepro_auto.launch_file(r"C:\path\to\your_data.csv")
# Click the printed http://127.0.0.1:8721/workbench?job=... link

Web UI (recommended for analysts): open a terminal such as Command Prompt (not a Jupyter cell) and run:

prepro_auto

Then open http://127.0.0.1:8000/workbench and drag-drop a file.

Note: prepro_auto typed inside a Jupyter cell just prints the module object — it doesn't start a server. The CLI command runs from a terminal only. Inside a notebook, use prepro_auto.launch_file(path) or prepro_auto.launch(df) instead.


Screenshots

Step 1 — Upload

Drop any CSV, Parquet, Excel, or JSON file. Encoding and delimiter are auto-detected. The sidebar shows live RAM-aware upload limits for your machine.


Step 2 — Profile

Per-column semantic type inference, missing rates, cardinality, and a 0–100 dataset quality score — all in one pass, before any data is changed.
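As a rough illustration of two of the per-column statistics named above, the sketch below computes a missing rate and a cardinality for each column of a small CSV using only the standard library. It is not PrePro Auto's profiler: the function name `profile_columns`, the rule that an empty cell counts as missing, and the sample data are assumptions made for this example, and the quality score is not attempted because its formula is not documented here.

```python
import csv
import io


def profile_columns(csv_text: str) -> dict:
    """Return {column: {"missing_rate": float, "cardinality": int}}."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    profile = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        # Assumption for this sketch: an empty (or absent) cell is "missing".
        present = [v for v in values if v not in ("", None)]
        profile[column] = {
            "missing_rate": 1 - len(present) / len(values),
            "cardinality": len(set(present)),  # distinct non-missing values
        }
    return profile


# Hypothetical sample data: one missing city and one missing price.
sample = "city,price\nPune,100\nDelhi,\nPune,250\n,100\n"
report = profile_columns(sample)
```

Nothing in the input is modified, which mirrors the "before any data is changed" property described above.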


Step 3 — View data

Live table toggling between Original (raw) and Current (cleaned). Version label, row/column count, and quality score update after every operation.


Step 4 — Clean (human-in-the-loop)

Each stage (Missing Values, Outliers, Scaling, Correlation, Encoding) generates per-column decision cards. Approve the recommendation, Override to an alternative, Skip, or Drop — then Execute to commit.


Step 5a — Preset Operations

17 built-in transforms (rename, cast, filter, merge, math, string ops, regex, sort, dedup, and more). Every operation is a single undoable version.


Step 5b — AI Assistant

Describe a transform in plain English. The AI proposes the pandas code, shows a preview, and waits for your confirmation before touching the data.


Step 5c — Expression Editor

Write any sandboxed pandas expression directly: df['profit'] = df['revenue'] - df['cost']. Validated against the current schema before execution.
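One plausible way to validate an expression against a schema before running it is to parse it and collect the columns it reads. The sketch below does this with the standard `ast` module (Python 3.9+). It is an illustration only, not PrePro Auto's validator or sandbox: the helper names `referenced_columns` and `missing_columns` are hypothetical, and only the `df['name']` access pattern is recognised.

```python
import ast


def referenced_columns(expression: str) -> set[str]:
    """Collect column names that the expression reads via df['name']."""
    columns = set()
    for node in ast.walk(ast.parse(expression)):
        if (
            isinstance(node, ast.Subscript)
            and isinstance(node.ctx, ast.Load)          # reads only, not assignments
            and isinstance(node.value, ast.Name)
            and node.value.id == "df"
            and isinstance(node.slice, ast.Constant)
            and isinstance(node.slice.value, str)
        ):
            columns.add(node.slice.value)
    return columns


def missing_columns(expression: str, schema) -> list[str]:
    """Return the columns the expression reads that are not in the schema."""
    return sorted(referenced_columns(expression) - set(schema))


expression = "df['profit'] = df['revenue'] - df['cost']"
problems = missing_columns(expression, ["revenue", "price"])
```

Here `profit` is not reported because it is the assignment target (a new column), while `cost` is reported because it is read but absent from the schema.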


Step 5d — Visualization

Histograms, bar charts, scatter plots, and condition-based metrics — all rendered live against the current dataset version.


Step 5e — Before & After Dashboard

KPI tiles comparing raw upload to current cleaned version: quality score delta, per-column type changes, and data samples side-by-side.


Step 5f — Data Drift Detection

Upload a second dataset (e.g. last month's production data) and compare distributions. PSI + KS test per column with stable / moderate / significant severity bands.
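For readers unfamiliar with PSI (Population Stability Index), here is a minimal standard-library sketch for one numeric column. It is not PrePro Auto's implementation: the equal-width binning, the `eps` floor for empty bins, and the 0.1 / 0.25 cut-offs (a common rule of thumb) are assumptions, and the KS test is left out.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` relative to `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant columns

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def severity(value):
    """Map a PSI value to a band (assumed rule-of-thumb thresholds)."""
    if value < 0.1:
        return "stable"
    if value < 0.25:
        return "moderate"
    return "significant"


baseline = list(range(100))
shifted = [v + 50 for v in baseline]  # hypothetical drifted sample
verdict = severity(psi(baseline, shifted))
```

Bins are derived from the reference sample only, so the same edges are applied to both datasets.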


Step 6 — Results & Export

Quality score before vs after, column-type changes, and three downloads: cleaned data (CSV or Parquet), audit PDF, and a standalone pipeline script.



Ways to give PrePro Auto your data

There are 4 input methods in the notebook and 3 in the web UI. Pick whichever fits your workflow.

From a Jupyter notebook (4 ways)

| # | Method | When to use it |
|---|--------|----------------|
| 1 | prepro_auto.launch_file(path) | You have a file on disk (CSV, Excel, JSON, Parquet, etc.). Easiest path. Auto-detects encoding and delimiter. No pd.read_csv() needed. |
| 2 | prepro_auto.launch(df) | You already have a pandas DataFrame in memory (from a database query, API response, generated data, or a tricky read you handled yourself). |
| 3 | session.update(df) | You already have a session and want to push a new DataFrame to it (e.g. after notebook-side edits). Commits a new undoable version. |
| 4 | Web upload, then notebook reads | Start the server with the CLI, upload via browser, then in the notebook do prepro_auto.Session(job_id, port).current() to pull the data back into Python. Rare but valid. |

From the web UI (3 ways)

| # | Method | When to use it |
|---|--------|----------------|
| 1 | Drag-and-drop upload on Step 1 of the workbench | Standard. Drop a CSV/Parquet/Excel/JSON file into the upload box. The engine auto-detects encoding and delimiter. |
| 2 | File picker on Step 1 | Same as drag-and-drop, just clicked. Useful when dragging is awkward (split screens, touchpads). |
| 3 | URL parameter ?job=<id> | When the notebook launched the session, the printed URL already includes ?job=... so no upload is needed; the workbench adopts the existing job. |
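If you only have the printed workbench URL and want to reattach from a notebook, the job ID and port can be recovered from it. The sketch below uses only `urllib.parse`; the helper name `parse_workbench_url` and the job ID `demo123` are hypothetical, and the URL shape is assumed to match the one shown in the Quickstart.

```python
from urllib.parse import parse_qs, urlparse


def parse_workbench_url(url: str) -> tuple[str, int]:
    """Extract (job_id, port) from a workbench URL with a ?job= parameter."""
    parts = urlparse(url)
    job_ids = parse_qs(parts.query).get("job", [])
    if not job_ids or parts.port is None:
        raise ValueError(f"no job id or port found in {url!r}")
    return job_ids[0], parts.port


# "demo123" is a made-up job ID for illustration.
job_id, port = parse_workbench_url("http://127.0.0.1:8721/workbench?job=demo123")

# With the package installed, the pair can then be passed to the documented
# constructor:  session = prepro_auto.Session(job_id, port)
```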

Supported file formats

| Format | Extensions | Notes |
|--------|------------|-------|
| CSV | .csv, .tsv, .txt | Auto-detects encoding (utf-8 / utf-8-sig / latin-1 / cp1252) and delimiter (comma, tab, semicolon, pipe) |
| Excel | .xlsx, .xls, .xlsm | First sheet by default; multi-sheet handling via the upload form |
| Parquet | .parquet, .pq | Fastest format for large datasets, preserves dtypes |
| JSON | .json, .jsonl, .ndjson | JSON-records and JSON-lines both supported |
| Feather | .feather | Apache Arrow's native columnar format |

Not supported: PDF, DOCX, HTML, images. PrePro Auto is a tabular-data tool — these formats need a dedicated extraction step first (Camelot or pdfplumber for PDFs, BeautifulSoup for HTML).
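To give a feel for what CSV auto-detection involves, the sketch below guesses an encoding and a delimiter from raw bytes with the standard library. It is only an approximation of the idea, not PrePro Auto's detector: the function name `sniff_csv`, the order in which encodings are tried, and the comma fallback are assumptions.

```python
import codecs
import csv

FALLBACK_ENCODINGS = ("utf-8", "cp1252", "latin-1")  # latin-1 accepts any bytes
CANDIDATE_DELIMITERS = ",\t;|"                        # comma, tab, semicolon, pipe


def sniff_csv(raw: bytes) -> tuple[str, str]:
    """Guess (encoding, delimiter) from a sample of raw file bytes."""
    if raw.startswith(codecs.BOM_UTF8):
        encoding, text = "utf-8-sig", raw.decode("utf-8-sig")
    else:
        for encoding in FALLBACK_ENCODINGS:
            try:
                text = raw.decode(encoding)
                break  # first encoding that decodes cleanly wins
            except UnicodeDecodeError:
                continue
    try:
        dialect = csv.Sniffer().sniff(text, delimiters=CANDIDATE_DELIMITERS)
        delimiter = dialect.delimiter
    except csv.Error:
        delimiter = ","  # assumed fallback when the sample is ambiguous
    return encoding, delimiter


# A semicolon-delimited file saved as cp1252 (hypothetical sample).
sample = "name;city\nJosé;Málaga\nAna;Cádiz\n".encode("cp1252")
guess = sniff_csv(sample)
```

The result can be passed on to a reader, for example `pd.read_csv(path, encoding=guess[0], sep=guess[1])`, although `launch_file` is documented to handle this for you.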

Have a PDF with a table?

Extract it to a DataFrame first, then hand it to PrePro Auto:

import pandas as pd
import pdfplumber
import prepro_auto

with pdfplumber.open("report.pdf") as pdf:
    rows = pdf.pages[0].extract_table()        # pick the right page
if not rows:                                    # extract_table() returns None when no table is found
    raise ValueError("no table found on that page")
df = pd.DataFrame(rows[1:], columns=rows[0])    # first row is the header
session = prepro_auto.launch(df)                # now clean it like any DataFrame

For PDFs with merged cells or complex layouts, try camelot-py (better for bordered tables) or tabula-py (requires Java). PrePro Auto deliberately leaves PDF extraction to specialised tools because generic PDF-to-table conversion succeeds only ~30–70% of the time depending on the document — bundling it would mean silent extraction errors hidden under PrePro Auto's name.


1. Input functions (notebook)

These are the functions you call before preprocessing starts, mainly to get data into a session.

| Function | Parameters | Returns | What it does |
|----------|------------|---------|--------------|
| prepro_auto.launch_file(file_path, domain="general", port=None, open_browser=False) | file_path: str or Path | Session | Reads a file from disk with auto-encoding-detection, starts the local workbench, returns a session. Handles all supported formats. Prints the workbench URL. |
| prepro_auto.launch(df, domain="general", port=None, open_browser=False) | df: pandas DataFrame | Session | Registers an in-memory DataFrame as a job (no upload, no file I/O), starts the workbench, returns a session. Use when you already have a DataFrame. |
| prepro_auto.Session(job_id, port) | job_id: str, port: int | Session | Reconnect to an existing session by ID. Use when the notebook restarted but the server is still running, or to attach to a job created from the web UI. |
| prepro_auto.set_api_key(provider, api_key, model=None) | provider: one of "groq" / "openai" / "anthropic" / "gemini" / "mistral" | dict with ok, verified, provider, model, reason | Configures the AI provider at runtime (in-memory only, not written to disk). Makes a tiny test call to verify the key works. Call before launch() if you want AI features active for the session. |

Example — most common pattern:

import prepro_auto

# Optional: enable AI features for this session
prepro_auto.set_api_key("openai", "sk-...")

# Load a file (auto-encoding-detection)
session = prepro_auto.launch_file(r"C:\Users\me\data\sales.csv")

2. Preprocessing functions

The work itself: clean, transform, version. Call these on the session object returned by the input functions, or via REST endpoints under /api/v1/. Endpoint paths in the tables below are written relative to that prefix.

Profile and clean

| Function / Endpoint | What it does |
|---------------------|--------------|
| POST /datasets/{job_id}/profile | Per-column type inference, missing rates, 0–100 quality score. Run once after upload. |
| POST /datasets/{job_id}/stages/missing_values | Detect missingness mechanism (MCAR / MAR / MNAR), recommend fill strategy per column. Creates decision cards. |
| POST /datasets/{job_id}/stages/outliers | IQR + modified Z-score + Isolation Forest. Classifies findings as data errors vs rare events. |
| POST /datasets/{job_id}/stages/scaling | Normality-driven scaler choice: Standard / Robust / Box-Cox / Yeo-Johnson / MinMax / log1p. |
| POST /datasets/{job_id}/stages/correlation | Find correlated pairs, detect constant / ID-like / target-leaking columns. |
| POST /datasets/{job_id}/stages/encoding | Categorical encoding routed by cardinality: label / ordinal / one-hot / frequency / target. |
| POST /datasets/{job_id}/stages/{stage_name}/execute | Apply your approved decisions, commit a new version. stage_name is one of the five above. |
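Two of the outlier rules named for the outliers stage, the IQR fence and the modified Z-score, are simple enough to sketch with the standard library. This is an illustration under assumptions, not the stage's actual code: the multipliers 1.5 and 3.5 are the conventional defaults and are not confirmed as PrePro Auto's settings, the function names are hypothetical, and Isolation Forest is omitted because it needs scikit-learn.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


def modified_z_outliers(values, threshold=3.5):
    """Values whose MAD-based modified Z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # more than half the values are identical; the score is undefined
    return [v for v in values if abs(0.6745 * (v - median) / mad) > threshold]


# Hypothetical column with one obviously extreme reading.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
flagged_by_iqr = iqr_outliers(readings)
flagged_by_mad = modified_z_outliers(readings)
```

Both rules are robust to the extreme value itself because they rely on quartiles and medians rather than the mean and standard deviation.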

Decision cards (the human-in-the-loop)

| Endpoint | What it does |
|----------|--------------|
| GET /datasets/{job_id}/decisions?stage=<stage> | List decision cards for a stage |
| POST /decisions/{decision_id}/approve | Use the recommended action |
| POST /decisions/{decision_id}/override | Use an alternative action (body: {"action": "...", "reason": "..."}) |
| POST /decisions/{decision_id}/skip | Don't change this column |
| POST /decisions/{decision_id}/drop-column | Drop the column entirely |
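For scripted use, the decision endpoints can be called with any HTTP client. The sketch below only builds the requests with `urllib.request` and does not send them, so it runs without a server. Several details are assumptions: the base URL (host, port 8000, and the /api/v1 prefix are taken from elsewhere in this README), the decision IDs `dec-1` and `dec-2`, and the override values `"median"` and `"skewed distribution"`, which are made up for illustration. The response format is not documented here, so it is not handled.

```python
import json
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8000/api/v1"  # assumption: default CLI server address


def build_request(method: str, path: str, body=None) -> Request:
    """Build (but do not send) a request, JSON-encoding the body if given."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


# Accept the recommendation on one card (hypothetical decision ID).
approve = build_request("POST", "/decisions/dec-1/approve")

# Choose an alternative on another card; the body shape follows the table above,
# but the action and reason values are invented for this example.
override = build_request(
    "POST",
    "/decisions/dec-2/override",
    {"action": "median", "reason": "skewed distribution"},
)

# To actually send a request against a running server:
#   from urllib.request import urlopen
#   with urlopen(approve) as response:
#       print(response.status)
```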

Manual transforms (when you need more control)

| Endpoint | What it does |
|----------|--------------|
| GET /datasets/{job_id}/transform/operations | List all 17 preset operations and their parameters |
| POST /datasets/{job_id}/transform/preset | Apply one preset op (rename, drop, cast, fillna, filter, merge, math, map, string ops, regex, sort, dedup, extract-number) |
| POST /datasets/{job_id}/transform/expression | Run a sandboxed pandas expression (e.g. df["profit"] = df["revenue"] - df["cost"]) |
| POST /datasets/{job_id}/transform/batch | Apply one operation across many columns as a single undoable step |

AI-assisted transforms (optional, needs an API key)

| Endpoint | What it does |
|----------|--------------|
| POST /datasets/{job_id}/transform/ai-propose | Describe a change in plain English; AI proposes a concrete transform with preview |
| POST /datasets/{job_id}/transform/ai-advise | Ask the AI for advice on a column without changing anything |
| POST /datasets/{job_id}/transform/assistant | One-shot assistant call (full message) |
| POST /datasets/{job_id}/transform/chat | Multi-turn conversation preserving history |

Versioning and history

| Endpoint | What it does |
|----------|--------------|
| GET /datasets/{job_id}/view | Current (active-version) data with shape, dtypes, sample rows |
| GET /datasets/{job_id}/history | Full version history with labels |
| POST /datasets/{job_id}/undo | Move active pointer back one version |
| POST /datasets/{job_id}/redo | Move active pointer forward one version |
| GET /datasets/{job_id}/snapshots | List all committed snapshots |
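The "active pointer" wording suggests a linear list of versions with an index into it. The toy class below models that idea so the undo/redo semantics are easier to picture. It is a conceptual sketch, not PrePro Auto's storage layer: the class name `VersionHistory` is hypothetical, and the rule that committing after an undo discards the redo-able versions is an assumption.

```python
class VersionHistory:
    """A list of (label, data) versions plus a pointer to the active one."""

    def __init__(self, initial, label="upload"):
        self._versions = [(label, initial)]
        self._active = 0

    @property
    def current(self):
        return self._versions[self._active][1]

    def labels(self):
        return [label for label, _ in self._versions]

    def commit(self, data, label):
        # Assumption: a new commit after an undo drops the versions ahead of it.
        del self._versions[self._active + 1:]
        self._versions.append((label, data))
        self._active += 1

    def undo(self):
        if self._active > 0:
            self._active -= 1  # move the pointer back; nothing is deleted
        return self.current

    def redo(self):
        if self._active < len(self._versions) - 1:
            self._active += 1  # move the pointer forward again
        return self.current


history = VersionHistory([1, None, 3])
history.commit([1, 2, 3], "fill missing")
after_undo = history.undo()
after_redo = history.redo()
```

Because undo and redo only move the pointer, both are cheap and the original upload always remains available as the first version.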

Visualization and monitoring

| Endpoint | What it does |
|----------|--------------|
| POST /datasets/{job_id}/viz/chart | Build a histogram, bar, or scatter chart |
| POST /datasets/{job_id}/viz/metric | Compute a condition-based metric (e.g. "rows where price > 1000") |
| POST /datasets/{job_id}/viz/compare | Compare one column's distribution raw vs current |
| GET /datasets/{job_id}/viz/dashboard | Power-BI-style before/after dashboard (KPI tiles + per-column comparison) |
| POST /drift/compare | Compare two uploaded datasets for distribution drift (PSI + KS) |

3. Output functions

The artifacts you take away from a session. Notebook methods return Python objects; REST endpoints return downloadable files.

From the notebook (Python objects)

| Method | Returns | Where to use it |
|--------|---------|-----------------|
| session.current() | pandas DataFrame | The current (active-version) DataFrame as it stands in the UI. Drop straight into model.fit(X, y). |
| session.url | str | The workbench URL for this session, useful for re-opening after closing the tab. |
| session.job_id | str | The internal job ID. Use it for raw REST API calls. |
| session.port | int | The local port the server is running on. |

From the REST API or web UI (downloadable files)

| Endpoint | File | Where to use it |
|----------|------|-----------------|
| GET /datasets/{job_id}/export/data?format=csv | Cleaned CSV | Share with teammates, load into BI tools (Tableau, Power BI, Looker), commit to a versioned data repo. |
| GET /datasets/{job_id}/export/data?format=parquet | Cleaned Parquet | Faster and smaller than CSV for large datasets; preserves dtypes exactly. |
| GET /datasets/{job_id}/export/audit | Audit PDF | Compliance trail listing every transformation with parameters, before/after stats, who approved. Attach to a model-card or hand to a data-governance reviewer. |
| GET /datasets/{job_id}/export/pipeline | Runnable .py script | Reproduces the exact cleaning with pandas + scikit-learn, no PrePro Auto dependency. Drop into Airflow / Prefect / GitHub Actions. Run with python pipeline.py raw.csv ready.csv. |
| POST /drift/compare (returns JSON) | Drift report | Per-column PSI / KS verdicts with severity bands. Plug into a monitoring dashboard, alert on overall_verdict == "significant_drift". |

Two typical workflows end-to-end

# Workflow 1: notebook to model, no file I/O
import prepro_auto

session = prepro_auto.launch_file(r"C:\data\sales.csv")
# ...clean visually in the browser, then:
df = session.current()               # fetch the active version once
X = df.drop(columns=["target"])
y = df["target"]
model.fit(X, y)                      # model: any estimator you have already created

# Workflow 2 — clean once, productionize the pipeline:
# 1) Download pipeline.py from the workbench's Export step
# 2) Commit it to your model repo
# 3) In production:
#    subprocess.run(["python", "pipeline.py", "incoming.csv", "ready.csv"])

What it does

  • Profile — per-column type inference, missing rates, 0–100 quality score
  • Clean (guided) — five HITL stages: missing values, outliers, scaling, correlation/leakage, encoding
  • Transform (manual) — 17 preset ops, sandboxed expressions, multi-column batches
  • AI assistant — optional; describe a change in plain English; preview before applying
  • Visualize & dashboard — histograms, bar, scatter; before/after dashboard with KPI tiles
  • Data drift — PSI + KS test between two datasets
  • Undo/redo — every change is a version
  • Export — cleaned data (CSV/Parquet), audit PDF, runnable Python pipeline

Methods

Field-standard methods throughout: MICE / KNN / median imputation; IQR, MAD-based modified Z-score, and Isolation Forest for outliers; normality-driven scaling (Standard / Robust / Box-Cox / Yeo-Johnson / MinMax / log1p); label / ordinal / one-hot / frequency / target encoding. These are the same algorithms a data scientist would write by hand.
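Of the encodings listed, frequency encoding is the least widely known, so here is a minimal standard-library sketch of the idea: each category is replaced by the share of rows in which it occurs. The function name `frequency_encode` and the sample values are assumptions for illustration; how PrePro Auto handles unseen categories or ties is not documented here.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


colours = ["red", "blue", "red", "green"]  # hypothetical categorical column
encoded = frequency_encode(colours)
```

Unlike one-hot encoding, this keeps a single numeric column no matter how many distinct categories exist, which is why it suits high-cardinality columns.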


AI providers (optional)

AI features are optional. Everything works offline without a key. PrePro Auto supports five providers:

| Provider | ID | Install | Get a key |
|----------|----|---------|-----------|
| Groq (free tier, fast) | groq | pip install prepro-auto[groq] | https://console.groq.com |
| OpenAI / GPT | openai | pip install prepro-auto[openai] | https://platform.openai.com |
| Anthropic Claude | anthropic | pip install prepro-auto[anthropic] | https://console.anthropic.com |
| Google Gemini | gemini | pip install prepro-auto[gemini] | https://aistudio.google.com/app/apikey |
| Mistral | mistral | pip install prepro-auto[mistral] | https://console.mistral.ai |

Or install all five at once: pip install prepro-auto[ai].

Three ways to give PrePro Auto your API key

1. Notebook (in-memory, session-only — safest):

prepro_auto.set_api_key("openai", "sk-...")

2. Web UI: click AI Provider → Configure API key in the side rail, paste the key, then click Test & apply.

3. .env file (survives restarts):

LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...

Security note: the .env file is plain text. Fine for a personal machine; never enable disk-persistence on a shared or hosted deployment.


REST API reference

The web app and SDK both call the same endpoints under /api/v1. Once the server is running, the interactive Swagger UI is at http://localhost:8000/docs.

The full endpoint tables, organized by category, are in Section 2 (Preprocessing functions) and Section 3 (Output functions) above. The remaining system endpoints are:

| Endpoint | Purpose |
|----------|---------|
| GET /api/v1/health | Liveness check |
| GET /api/v1/system/limits | Live RAM-aware upload limits |
| GET /api/v1/system/llm | List providers + active one |
| POST /api/v1/system/llm/configure | Set provider + key at runtime |

Documentation

  • Complete Guide (PDF) — project overview, architecture, all ML/stats models used, accuracy benchmarks, full user guide for notebook and web UI
  • Interactive Swagger at http://localhost:8000/docs (once running)

Contributing

Contributions welcome — bugs, tests, docs, and features. See CONTRIBUTING.md to get started.

Roadmap

See ROADMAP.md for what's released, what's in progress, and what's planned.

License

MIT
