AI-assisted, human-in-the-loop tabular data preprocessing — profile, clean, transform, and export any dataset with a reproducible pipeline, from a notebook or the web.

PrePro Auto

AI-assisted tabular data preprocessing with human-in-the-loop control.

Profile, clean, transform, and export any tabular dataset — from a Jupyter notebook or a local web UI — with every step undoable, auditable, and reproducible. The same engine drives both interfaces, so results are identical wherever you call it from.

pip install prepro-auto

Author: Shivanshu Pandey · Source: github.com/Chilliflex/prepro_auto


Quickstart

Notebook (recommended for data scientists):

import prepro_auto

# Easiest: point at a file, auto-detects encoding (handles Latin-1, cp1252, BOM)
session = prepro_auto.launch_file(r"C:\path\to\your_data.csv")
# Click the printed http://127.0.0.1:8721/workbench?job=... link

Web UI (recommended for analysts): open a terminal (Command Prompt on Windows), not a Jupyter cell, and run:

prepro_auto

Then open http://127.0.0.1:8000/workbench and drag-drop a file.

Note: prepro_auto typed inside a Jupyter cell just prints the module object — it doesn't start a server. The CLI command runs from a terminal only. Inside a notebook, use prepro_auto.launch_file(path) or prepro_auto.launch(df) instead.


Ways to give PrePro Auto your data

There are 4 input methods in the notebook and 3 in the web UI. Pick whichever fits your workflow.

From a Jupyter notebook (4 ways)

| # | Method | When to use it |
| --- | --- | --- |
| 1 | prepro_auto.launch_file(path) | You have a file on disk (CSV, Excel, JSON, Parquet, etc.). Easiest path. Auto-detects encoding and delimiter. No pd.read_csv() needed. |
| 2 | prepro_auto.launch(df) | You already have a pandas DataFrame in memory (from a database query, API response, generated data, or a tricky read you handled yourself). |
| 3 | session.update(df) | You already have a session and want to push a new DataFrame to it (e.g. after notebook-side edits). Commits a new undoable version. |
| 4 | Web upload, then notebook reads | Start the server with the CLI, upload via browser, then in the notebook call prepro_auto.Session(job_id, port).current() to pull the data back into Python. Rare but valid. |

From the web UI (3 ways)

| # | Method | When to use it |
| --- | --- | --- |
| 1 | Drag-and-drop upload on Step 1 of the workbench | Standard. Drop a CSV/Parquet/Excel/JSON file into the upload box. The engine auto-detects encoding and delimiter. |
| 2 | File picker on Step 1 | Same as drag-and-drop, just clicked. Useful when dragging is awkward (split screens, touchpads). |
| 3 | URL parameter ?job=<id> | When the notebook launched the session, the printed URL already includes ?job=..., so no upload is needed: the workbench adopts the existing job. |
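
If all you have is the printed workbench URL, the job ID and port can be recovered from it with the standard library. This is a minimal sketch, not part of PrePro Auto: the helper name parse_workbench_url and the job ID abc123 are made up for illustration, and the URL shape is taken from the Quickstart above.

```python
from urllib.parse import parse_qs, urlsplit


def parse_workbench_url(url: str) -> tuple[str, int]:
    """Return (job_id, port) from a workbench URL like the one the notebook prints."""
    parts = urlsplit(url)
    job_ids = parse_qs(parts.query).get("job")
    if not job_ids:
        raise ValueError("URL has no ?job= parameter")
    if parts.port is None:
        raise ValueError("URL has no explicit port")
    return job_ids[0], parts.port


# "abc123" is a placeholder job ID, not a real one.
job_id, port = parse_workbench_url("http://127.0.0.1:8721/workbench?job=abc123")
print(job_id, port)
```

The two values are what prepro_auto.Session(job_id, port) expects when you reconnect from a notebook.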

Supported file formats

| Format | Extensions | Notes |
| --- | --- | --- |
| CSV | .csv, .tsv, .txt | Auto-detects encoding (utf-8 / utf-8-sig / latin-1 / cp1252) and delimiter (comma, tab, semicolon, pipe) |
| Excel | .xlsx, .xls, .xlsm | First sheet by default; multi-sheet handling via the upload form |
| Parquet | .parquet, .pq | Fastest format for large datasets, preserves dtypes |
| JSON | .json, .jsonl, .ndjson | JSON-records and JSON-lines both supported |
| Feather | .feather | Apache Arrow's native columnar format |
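
To give a feel for what encoding and delimiter auto-detection involves, here is a rough standard-library sketch. It is an illustration under stated assumptions, not PrePro Auto's actual detector: the candidate order, the helper names, and the "same field count on every line" heuristic are all choices made for this example.

```python
# Candidates from the table above. utf-8-sig also reads plain UTF-8, and
# latin-1 goes last because it accepts any byte sequence.
ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")
DELIMITERS = (",", "\t", ";", "|")


def detect_encoding(raw: bytes) -> str:
    """Return the first candidate encoding that decodes the bytes without error."""
    for encoding in ENCODINGS:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")  # not reached: latin-1 never fails


def detect_delimiter(text: str, max_lines: int = 20) -> str:
    """Pick the delimiter that appears the same non-zero number of times on each line."""
    lines = [line for line in text.splitlines()[:max_lines] if line.strip()]
    best, best_count = ",", 0
    for delimiter in DELIMITERS:
        counts = {line.count(delimiter) for line in lines}
        if len(counts) == 1:
            count = counts.pop()
            if count > best_count:
                best, best_count = delimiter, count
    return best


raw = "name;price\ncafé;3,50\n".encode("cp1252")
encoding = detect_encoding(raw)
print(encoding, repr(detect_delimiter(raw.decode(encoding))))
```

The delimiter heuristic ignores quoting, so a quoted field that contains the delimiter can mislead it; a production detector would need to handle that case.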

Not supported: PDF, DOCX, HTML, images. PrePro Auto is a tabular-data tool — these formats need a dedicated extraction step first (Camelot or pdfplumber for PDFs, BeautifulSoup for HTML).

Have a PDF with a table?

Extract it to a DataFrame first, then hand it to PrePro Auto:

import pandas as pd
import pdfplumber
import prepro_auto

with pdfplumber.open("report.pdf") as pdf:
    rows = pdf.pages[0].extract_table()        # pick the right page
if not rows:                                    # extract_table() returns None when no table is found
    raise ValueError("no table found on that page")
df = pd.DataFrame(rows[1:], columns=rows[0])    # first row is the header
session = prepro_auto.launch(df)                # now clean it like any DataFrame

For PDFs with merged cells or complex layouts, try camelot-py (better for bordered tables) or tabula-py (requires Java). PrePro Auto deliberately leaves PDF extraction to specialised tools because generic PDF-to-table conversion succeeds only ~30–70% of the time depending on the document — bundling it would mean silent extraction errors hidden under PrePro Auto's name.


1. Input functions (notebook)

Everything you call before preprocessing starts: the functions that get data into a session.

| Function | Parameters | Returns | What it does |
| --- | --- | --- | --- |
| prepro_auto.launch_file(file_path, domain="general", port=None, open_browser=False) | file_path: str or Path | Session | Reads a file from disk with auto-encoding-detection, starts the local workbench, returns a session. Handles all supported formats. Prints the workbench URL. |
| prepro_auto.launch(df, domain="general", port=None, open_browser=False) | df: pandas DataFrame | Session | Registers an in-memory DataFrame as a job (no upload, no file I/O), starts the workbench, returns a session. Use when you already have a DataFrame. |
| prepro_auto.Session(job_id, port) | job_id: str, port: int | Session | Reconnects to an existing session by ID. Use when the notebook restarted but the server is still running, or to attach to a job created from the web UI. |
| prepro_auto.set_api_key(provider, api_key, model=None) | provider: one of "groq" / "openai" / "anthropic" / "gemini" / "mistral" | dict with ok, verified, provider, model, reason | Configures the AI provider at runtime (in-memory only, not written to disk). Makes a tiny test call to verify the key works. Call before launch() if you want AI features active for the session. |

Example — most common pattern:

import prepro_auto

# Optional: enable AI features for this session
prepro_auto.set_api_key("openai", "sk-...")

# Load a file (auto-encoding-detection)
session = prepro_auto.launch_file(r"C:\Users\me\data\sales.csv")

2. Preprocessing functions

The work itself — clean, transform, version. These are called on the session object that input functions returned, or via REST endpoints under /api/v1/.

Profile and clean

| Function / Endpoint | What it does |
| --- | --- |
| POST /datasets/{job_id}/profile | Per-column type inference, missing rates, 0–100 quality score. Run once after upload. |
| POST /datasets/{job_id}/stages/missing_values | Detect missingness mechanism (MCAR / MAR / MNAR), recommend fill strategy per column. Creates decision cards. |
| POST /datasets/{job_id}/stages/outliers | IQR + modified Z-score + Isolation Forest. Classifies findings as data errors vs rare events. |
| POST /datasets/{job_id}/stages/scaling | Normality-driven scaler choice: Standard / Robust / Box-Cox / Yeo-Johnson / MinMax / log1p. |
| POST /datasets/{job_id}/stages/correlation | Find correlated pairs, detect constant / ID-like / target-leaking columns. |
| POST /datasets/{job_id}/stages/encoding | Categorical encoding routed by cardinality: label / ordinal / one-hot / frequency / target. |
| POST /datasets/{job_id}/stages/{stage_name}/execute | Apply your approved decisions, commit a new version. stage_name is one of the five stages above. |

Decision cards (the human-in-the-loop)

| Endpoint | What it does |
| --- | --- |
| GET /datasets/{job_id}/decisions?stage=<stage> | List decision cards for a stage |
| POST /decisions/{decision_id}/approve | Use the recommended action |
| POST /decisions/{decision_id}/override | Use an alternative action (body: {"action": "...", "reason": "..."}) |
| POST /decisions/{decision_id}/skip | Don't change this column |
| POST /decisions/{decision_id}/drop-column | Drop the column entirely |
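
Put together, one guided stage is a short sequence of calls: analyse the stage, list its decision cards, resolve each card, then execute. The sketch below only builds the requests with the standard library and does not send them. It assumes the default CLI port 8000 and that the decision endpoints sit under /api/v1 like the dataset ones. JOB_ID and DECISION_ID are placeholders, and the helper names are invented for this example.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8000/api/v1"  # assumption: default CLI port


def build_request(method: str, path: str, body: Optional[dict] = None) -> Request:
    """Build (but do not send) a JSON request against the local API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


def stage_requests(job_id: str, stage: str, decision_id: str) -> list:
    """Requests for one guided stage: analyse, list cards, override one, execute."""
    return [
        build_request("POST", f"/datasets/{job_id}/stages/{stage}"),
        build_request("GET", f"/datasets/{job_id}/decisions?stage={stage}"),
        build_request(
            "POST",
            f"/decisions/{decision_id}/override",
            {"action": "...", "reason": "..."},  # body shape as listed in the table
        ),
        build_request("POST", f"/datasets/{job_id}/stages/{stage}/execute"),
    ]


# Placeholders: substitute a real job ID and a decision ID from the GET response.
for request in stage_requests("JOB_ID", "missing_values", "DECISION_ID"):
    print(request.get_method(), request.full_url)
```

Each Request can be sent with urllib.request.urlopen(request) once the server is running. The response shapes are not documented here, so check the Swagger UI before relying on specific field names.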

Manual transforms (when you need more control)

| Endpoint | What it does |
| --- | --- |
| GET /datasets/{job_id}/transform/operations | List all 17 preset operations and their parameters |
| POST /datasets/{job_id}/transform/preset | Apply one preset op (e.g. rename, drop, cast, fillna, filter, merge, math, map, string ops, regex, sort, dedup, extract-number) |
| POST /datasets/{job_id}/transform/expression | Run a sandboxed pandas expression (e.g. df["profit"] = df["revenue"] - df["cost"]) |
| POST /datasets/{job_id}/transform/batch | Apply one operation across many columns as a single undoable step |

AI-assisted transforms (optional, needs an API key)

| Endpoint | What it does |
| --- | --- |
| POST /datasets/{job_id}/transform/ai-propose | Describe a change in plain English; AI proposes a concrete transform with preview |
| POST /datasets/{job_id}/transform/ai-advise | Ask the AI for advice on a column without changing anything |
| POST /datasets/{job_id}/transform/assistant | One-shot assistant call (full message) |
| POST /datasets/{job_id}/transform/chat | Multi-turn conversation preserving history |

Versioning and history

| Endpoint | What it does |
| --- | --- |
| GET /datasets/{job_id}/view | Current (active-version) data with shape, dtypes, sample rows |
| GET /datasets/{job_id}/history | Full version history with labels |
| POST /datasets/{job_id}/undo | Move active pointer back one version |
| POST /datasets/{job_id}/redo | Move active pointer forward one version |
| GET /datasets/{job_id}/snapshots | List all committed snapshots |
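
The "active pointer" wording suggests a list of snapshots plus an index into it. The toy model below illustrates those undo/redo semantics. It is not PrePro Auto's storage code, and the behaviour of a commit made after an undo (here it discards the versions ahead of the pointer) is an assumption made for the example.

```python
class VersionHistory:
    """Linear list of snapshots plus an index marking the active one."""

    def __init__(self, initial):
        self.versions = [initial]
        self.active = 0

    def commit(self, snapshot):
        # Assumption: committing after an undo drops the versions ahead of the pointer.
        self.versions = self.versions[: self.active + 1] + [snapshot]
        self.active += 1

    def undo(self):
        """Move the pointer back one version (stops at the first)."""
        self.active = max(self.active - 1, 0)
        return self.versions[self.active]

    def redo(self):
        """Move the pointer forward one version (stops at the last)."""
        self.active = min(self.active + 1, len(self.versions) - 1)
        return self.versions[self.active]


history = VersionHistory("raw")
history.commit("missing values filled")
history.commit("outliers handled")
print(history.undo())
print(history.redo())
```

Because undo only moves the pointer, nothing is deleted until a new commit happens, which is what makes redo possible.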

Visualization and monitoring

| Endpoint | What it does |
| --- | --- |
| POST /datasets/{job_id}/viz/chart | Build a histogram, bar, or scatter chart |
| POST /datasets/{job_id}/viz/metric | Compute a condition-based metric (e.g. "rows where price > 1000") |
| POST /datasets/{job_id}/viz/compare | Compare one column's distribution raw vs current |
| GET /datasets/{job_id}/viz/dashboard | Power-BI-style before/after dashboard (KPI tiles + per-column comparison) |
| POST /drift/compare | Compare two uploaded datasets for distribution drift (PSI + KS) |
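
For readers unfamiliar with PSI (Population Stability Index), the sketch below computes it for one numeric column with the standard library. It is a plain illustration of the formula, not PrePro Auto's implementation: equal-width bins, the bin count, and the epsilon used for empty bins are assumptions, and the 0.25 cut-off in the example is a commonly quoted rule of thumb that may not match the tool's severity bands.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against `expected` (equal-width bins)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # clamp out-of-range values
        # Replace empty bins with a small epsilon so the logarithm stays defined.
        return [max(count / len(values), eps) for count in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = list(range(100))
shifted = [value + 50 for value in baseline]
print(psi(baseline, baseline))          # identical distributions
print(psi(baseline, shifted) > 0.25)    # large shift
```

Bins are derived from the baseline only, so values in the second dataset that fall outside the baseline's range are counted in the first or last bin.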

3. Output functions

The artifacts you take away from a session. Notebook methods return Python objects; REST endpoints return downloadable files.

From the notebook (Python objects)

| Method | Returns | Where to use it |
| --- | --- | --- |
| session.current() | pandas DataFrame | The current (active-version) DataFrame as it stands in the UI. Drop straight into model.fit(X, y). |
| session.url | str | The workbench URL for this session; useful for re-opening after closing the tab. |
| session.job_id | str | The internal job ID; use it for raw REST API calls. |
| session.port | int | The local port the server is running on. |

From the REST API or web UI (downloadable files)

| Endpoint | File | Where to use it |
| --- | --- | --- |
| GET /datasets/{job_id}/export/data?format=csv | Cleaned CSV | Share with teammates, load into BI tools (Tableau, Power BI, Looker), commit to a versioned data repo. |
| GET /datasets/{job_id}/export/data?format=parquet | Cleaned Parquet | Faster and smaller than CSV for large datasets; preserves dtypes exactly. |
| GET /datasets/{job_id}/export/audit | Audit PDF | Compliance trail listing every transformation with parameters, before/after stats, who approved. Attach to a model-card or hand to a data-governance reviewer. |
| GET /datasets/{job_id}/export/pipeline | Runnable .py script | Reproduces the exact cleaning with pandas + scikit-learn, no PrePro Auto dependency. Drop into Airflow / Prefect / GitHub Actions. Run with python pipeline.py raw.csv ready.csv. |
| POST /drift/compare (returns JSON) | Drift report | Per-column PSI / KS verdicts with severity bands. Plug into a monitoring dashboard, alert on overall_verdict == "significant_drift". |

Two typical workflows end-to-end

# Workflow 1: notebook to model, no file I/O
import prepro_auto

session = prepro_auto.launch_file(r"C:\data\sales.csv")
# ...clean visually in the browser, then:
df = session.current()              # fetch the active version once
X = df.drop(columns=["target"])
y = df["target"]
model.fit(X, y)                     # model: any estimator you have already created

# Workflow 2 — clean once, productionize the pipeline:
# 1) Download pipeline.py from the workbench's Export step
# 2) Commit it to your model repo
# 3) In production:
#    subprocess.run(["python", "pipeline.py", "incoming.csv", "ready.csv"])
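
For step 3 of Workflow 2, a slightly more defensive wrapper than a bare subprocess.run call might look like the sketch below. The helper names are invented for this example. The only thing taken from the README is the command-line shape python pipeline.py <input> <output>, and the check that the output file exists assumes the script writes to its second argument.

```python
import subprocess
import sys
from pathlib import Path


def pipeline_command(script: str, raw_csv: str, ready_csv: str) -> list:
    """Command line for the exported script, run with the current interpreter."""
    return [sys.executable, script, raw_csv, ready_csv]


def run_pipeline(script: str, raw_csv: str, ready_csv: str) -> Path:
    """Run the exported pipeline and fail loudly if it exits non-zero."""
    subprocess.run(pipeline_command(script, raw_csv, ready_csv), check=True)
    output = Path(ready_csv)
    if not output.exists():
        raise FileNotFoundError(f"pipeline did not write {output}")
    return output


# Show the arguments that would be passed (the interpreter path varies by machine).
print(pipeline_command("pipeline.py", "incoming.csv", "ready.csv")[1:])
```

Using sys.executable instead of the string "python" keeps the pipeline on the same interpreter, and therefore the same pandas and scikit-learn versions, as the calling process. check=True turns a failed run into an exception instead of a silent non-zero exit code.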

What it does

  • Profile — per-column type inference, missing rates, 0–100 quality score
  • Clean (guided) — five HITL stages: missing values, outliers, scaling, correlation/leakage, encoding
  • Transform (manual) — 17 preset ops, sandboxed expressions, multi-column batches
  • AI assistant — optional; describe a change in plain English; preview before applying
  • Visualize & dashboard — histograms, bar, scatter; before/after dashboard with KPI tiles
  • Data drift — PSI + KS test between two datasets
  • Undo/redo — every change is a version
  • Export — cleaned data (CSV/Parquet), audit PDF, runnable Python pipeline

Methods

Field-standard methods throughout: MICE / KNN / median imputation; IQR + MAD-based modified Z-score + Isolation Forest for outliers; normality-driven scaling (Standard / Robust / Box-Cox / Yeo-Johnson / MinMax / log1p); label / ordinal / one-hot / frequency / target encoding. These are the same algorithms a data scientist would write by hand.
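
As an illustration of two of the outlier rules named above, here is a standard-library sketch of the IQR fence and the MAD-based modified Z-score. It is not PrePro Auto's code. The multiplier 1.5 and the cut-off 3.5 (with the 0.6745 scaling constant) are the conventional textbook defaults and are assumed here, and the sample readings are made up.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [i for i, v in enumerate(values) if v < low or v > high]


def modified_z_outliers(values, threshold=3.5):
    """Indices whose MAD-based modified Z-score exceeds the threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # MAD is zero, so the score is undefined
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - median) / mad) > threshold
    ]


readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]  # made-up sample data
print(iqr_outliers(readings), modified_z_outliers(readings))
```

Both rules rely on medians and quartiles instead of the mean and standard deviation, so a single extreme value such as 95 does not distort the threshold used to detect it.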


AI providers (optional)

AI features are optional. Everything works offline without a key. PrePro Auto supports five providers:

| Provider | ID | Install | Get a key |
| --- | --- | --- | --- |
| Groq (free tier, fast) | groq | pip install prepro-auto[groq] | https://console.groq.com |
| OpenAI / GPT | openai | pip install prepro-auto[openai] | https://platform.openai.com |
| Anthropic Claude | anthropic | pip install prepro-auto[anthropic] | https://console.anthropic.com |
| Google Gemini | gemini | pip install prepro-auto[gemini] | https://aistudio.google.com/app/apikey |
| Mistral | mistral | pip install prepro-auto[mistral] | https://console.mistral.ai |

Or install all five at once: pip install prepro-auto[ai].

Three ways to give PrePro Auto your API key

1. Notebook (in-memory, session-only — safest):

prepro_auto.set_api_key("openai", "sk-...")

2. Web UI: click AI Provider → Configure API key in the side rail, paste key, click Test & apply.

3. .env file (survives restarts):

LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...

Security note: the .env file is plain text. Fine for a personal machine; never enable disk-persistence on a shared or hosted deployment.
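
One way to keep the key out of notebook cells is to read it from an environment variable and pass it to set_api_key. The helper below is a hypothetical sketch. Only OPENAI_API_KEY appears in this README, so the <PROVIDER>_API_KEY naming for the other providers is an assumption.

```python
import os


def api_key_from_env(provider: str, environ=os.environ) -> str:
    """Look up the provider's key in the environment instead of hard-coding it."""
    # Assumption: other providers follow the same naming as OPENAI_API_KEY.
    name = f"{provider.upper()}_API_KEY"
    key = environ.get(name, "").strip()
    if not key:
        raise KeyError(f"{name} is not set")
    return key


# Demonstration with an explicit mapping so the example runs without a real key.
print(api_key_from_env("openai", {"OPENAI_API_KEY": "sk-..."}))
```

In a notebook you would then call prepro_auto.set_api_key("openai", api_key_from_env("openai")) and inspect the ok and verified fields of the returned dict before relying on AI features.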


REST API reference

The web app and SDK both call the same endpoints under /api/v1. Once the server is running, the interactive Swagger UI is at http://localhost:8000/docs.

The full endpoint tables, organized by category, are in Section 2 (Preprocessing functions) and Section 3 (Output functions) above. The remaining system endpoints are:

| Endpoint | Purpose |
| --- | --- |
| GET /api/v1/health | Liveness check |
| GET /api/v1/system/limits | Live RAM-aware upload limits |
| GET /api/v1/system/llm | List providers + active one |
| POST /api/v1/system/llm/configure | Set provider + key at runtime |
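
When scripting against the API, it can help to wait for the health endpoint before sending other requests. The following is a minimal sketch using only the standard library. It assumes the default CLI port 8000 and that a live server answers the health check with HTTP 200. The function names are invented for this example.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://127.0.0.1:8000/api/v1/health"  # assumption: default CLI port


def is_healthy(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(probe=is_healthy, attempts: int = 10, delay: float = 0.5) -> bool:
    """Call `probe` until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # pause between attempts, not after the last one
    return False
```

A script would call wait_until_healthy() once after starting the server and stop with a clear error if it returns False.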

License

MIT
