
AI-assisted, human-in-the-loop tabular data preprocessing — profile, clean, transform, and export any dataset with a reproducible pipeline, from a notebook or the web.

PrePro Auto

AI-assisted tabular data preprocessing with human-in-the-loop control.

Profile, clean, transform, and export any tabular dataset — from a Jupyter notebook or a local web UI — with every step undoable, auditable, and reproducible. The same engine drives both interfaces, so results are identical wherever you call it from.

pip install prepro-auto

Author: Shivanshu Pandey · Source: github.com/Chilliflex/prepro_auto


Quickstart — Notebook (no upload)

import pandas as pd
import prepro_auto

df = pd.read_csv("your_data.csv")
session = prepro_auto.launch(df)        # starts the local workbench server; nothing is uploaded
# -> click the printed http://127.0.0.1:8721/workbench?job=... link

# clean visually in the browser, then back in the notebook:
cleaned = session.current()             # the UI-edited DataFrame
session.update(cleaned)                 # push notebook edits back to the UI

That's the whole loop. Your DataFrame is loaded directly from the notebook's memory — no file upload, no context switch. df (your original) never changes; session.current() always returns the latest cleaned version.

Quickstart — Web UI

prepro_auto                             # starts the workbench at http://127.0.0.1:8000

Then open http://127.0.0.1:8000/workbench and upload a file.


What it does

  • Profile — per-column type inference, missing rates, 0–100 quality score
  • Clean (guided) — missing values, outliers, scaling, correlation/leakage, encoding; each issue becomes a reviewable decision with a recommended action and alternatives
  • Transform (manual) — 17 preset ops, sandboxed expressions, multi-column batches
  • AI assistant — optional; describe a change in plain English; confirms intent and shows a real preview before applying
  • Visualize & dashboard — histograms, bar, scatter charts, plus a before/after dashboard with KPI tiles and per-column comparison
  • Data drift — compare two datasets to detect distribution shifts (PSI + KS)
  • Undo/redo — every change is a version
  • Export — clean data (CSV/Parquet), audit PDF, and a runnable Python pipeline script

What you get out of PrePro Auto

Five concrete outputs you can take away after a session. Each one is designed to plug straight into a real-world workflow:

| Output | What it is | Where to use it |
| --- | --- | --- |
| Cleaned DataFrame | The in-memory DataFrame after all your cleaning + transforms, returned by session.current() in the notebook | Feed straight into model.fit(X, y) for scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow. No file I/O needed. |
| Cleaned dataset file | A CSV or Parquet file via GET /datasets/{job_id}/export/data?format=csv (or format=parquet) | Share with teammates, upload to a feature store, load into BI tools (Tableau, Power BI, Looker), commit to a versioned data repo, or feed into downstream ETL jobs. Parquet is smaller and faster for large datasets. |
| Audit PDF | A multi-page PDF via GET /datasets/{job_id}/export/audit listing every transformation with its parameters, before/after stats, and who approved it | Compliance trail for regulated industries (finance, healthcare, insurance); attach to a model-card or experiment-tracking entry; hand to a reviewer or data-governance team to prove the cleaning is reproducible and reasoned, not arbitrary. |
| Runnable Python pipeline | A standalone .py script via GET /datasets/{job_id}/export/pipeline that reproduces the exact cleaning using only pandas + scikit-learn (no PrePro Auto dependency) | Drop into a production training pipeline, an Airflow/Prefect/Dagster DAG, a CI job, or a coworker's machine. They run python pipeline.py raw.csv clean.csv and get the same result you produced visually. |
| Drift report | A per-column JSON verdict (PSI, KS test, severity bands) via POST /drift/compare between two datasets | Monitor a deployed model: compare last month's input distribution to this month's. Catch silent data shifts (a new product category, a sensor recalibration, a market regime change) before they degrade model performance. Plug into a monitoring dashboard or alert on overall_verdict == "significant_drift". |
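
To make the two drift metrics named above concrete, here is a minimal standard-library sketch of PSI and the two-sample KS statistic. It is an illustration only: the helper names (psi, ks_statistic), the equal-width binning, and the eps floor are assumptions, and PrePro Auto's own binning and severity bands are not described on this page.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges are equal-width over the range of `expected`; `eps` avoids
    log(0) when a bin is empty in one of the samples.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant column

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )
```

Both functions return 0 for identical samples and grow as the distributions move apart, which is the behaviour a per-column drift verdict builds on.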

Two common workflows:

# Workflow 1: notebook to model, all in-process (zero file I/O):
session = prepro_auto.launch(df)
# ...clean visually in the browser...
cleaned = session.current()             # fetch the cleaned DataFrame once
X = cleaned.drop(columns=["target"])
y = cleaned["target"]
model.fit(X, y)                         # model: any estimator you have already created

# Workflow 2 — clean once, productionize with the exported pipeline:
# 1) export pipeline.py from the workbench
# 2) commit pipeline.py to your model repo
# 3) in production: subprocess.run(["python", "pipeline.py", "incoming.csv", "ready.csv"])
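
Step 3 of Workflow 2 could be wrapped as follows. This is a sketch, assuming the exported script takes the input and output paths as its two positional arguments, as in the python pipeline.py raw.csv clean.csv example above; the helper names are hypothetical.

```python
import subprocess
import sys


def build_pipeline_command(script, raw_csv, clean_csv):
    """Command line for the exported script: python pipeline.py raw.csv clean.csv."""
    # sys.executable pins the interpreter to the one running this process.
    return [sys.executable, script, raw_csv, clean_csv]


def run_pipeline(script, raw_csv, clean_csv):
    """Run the exported pipeline; raises CalledProcessError on a non-zero exit."""
    cmd = build_pipeline_command(script, raw_csv, clean_csv)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing check=True makes a failed cleaning run stop the job, so it cannot silently hand a missing or stale ready.csv to the next step.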

Methods

PrePro Auto uses field-standard methods throughout: MICE, KNN, or median imputation for missing values; IQR, MAD, and Isolation Forest for outliers; normality-driven scaling and power transforms (Standard, Robust, Box-Cox, Yeo-Johnson); and label, ordinal, one-hot, frequency, or target encoding. These are the same algorithms a data scientist would otherwise write by hand.
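
As an illustration of two of the outlier rules listed above, the sketch below flags values with Tukey's IQR fences and with the MAD-based modified z-score, using only the standard library. The function names and the default thresholds (k=1.5, threshold=3.5) are common textbook choices assumed here, not values confirmed for PrePro Auto; Isolation Forest is omitted because it needs scikit-learn.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v < lo or v > hi for v in values]


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds `threshold`.

    Modified z-score = 0.6745 * (x - median) / MAD, where MAD is the median
    absolute deviation from the median.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)  # no spread: nothing can be scored
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]
```

Both rules rely on medians and quartiles, not the mean and standard deviation, so a single extreme value does not hide itself by inflating the spread estimate.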


AI providers (optional)

AI features are optional. Everything works offline without a key. PrePro Auto supports five providers:

| Provider | ID | Install | Get a key |
| --- | --- | --- | --- |
| Groq (free tier, fast) | groq | pip install prepro-auto[groq] | https://console.groq.com |
| OpenAI / GPT | openai | pip install prepro-auto[openai] | https://platform.openai.com |
| Anthropic Claude | anthropic | pip install prepro-auto[anthropic] | https://console.anthropic.com |
| Google Gemini | gemini | pip install prepro-auto[gemini] | https://aistudio.google.com/app/apikey |
| Mistral | mistral | pip install prepro-auto[mistral] | https://console.mistral.ai |

Or install all five at once: pip install prepro-auto[ai].

Three ways to give PrePro Auto your API key

1. From the notebook (in-memory, session-only — safest):

import prepro_auto
prepro_auto.set_api_key("openai", "sk-...")  # any of the 5 provider IDs
session = prepro_auto.launch(df)

The key lives only in the running process, so it is lost on restart and must be re-entered in the next session. PrePro Auto makes a small test call before returning, so you know immediately whether the key works.

2. From the web UI (in-memory by default, optional .env persistence):

In the workbench, click "AI settings (API key)" in the side rail. Pick a provider, paste the key, click Test & apply. PrePro Auto verifies the key with a live test call before accepting it. Tick "Also save to .env" if you want it to survive restarts (local convenience only — leave unchecked on any shared/hosted machine).

3. From a .env file (persists across restarts):

Add to .env in the project root:

LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...

Each provider has its own env-key name: GROQ_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, MISTRAL_API_KEY.

Security note: the .env file stores the key in plain text. That is fine on a personal machine, but never use the persist option on a hosted or shared deployment until proper per-user auth is in place.
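
If you prefer to keep the key out of .env altogether, one option is to read it from the process environment yourself and hand it to set_api_key at startup. The sketch below assumes the variable names listed above; resolve_provider_key is a hypothetical helper, and how PrePro Auto parses .env internally is not described here.

```python
import os

# Provider ID -> environment variable name, as listed above.
ENV_KEY_NAMES = {
    "groq": "GROQ_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def resolve_provider_key(environ=None):
    """Return (provider, key) from the environment, or (None, None) if unset."""
    environ = os.environ if environ is None else environ
    provider = environ.get("LLM_PROVIDER", "").strip().lower()
    if provider not in ENV_KEY_NAMES:
        return None, None
    key = environ.get(ENV_KEY_NAMES[provider], "").strip()
    return (provider, key) if key else (None, None)
```

In a notebook you would then call provider, key = resolve_provider_key() and, if both are set, prepro_auto.set_api_key(provider, key) before launching the session.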


Notebook API reference

After import prepro_auto, these are the top-level functions:

| Function | What it does |
| --- | --- |
| prepro_auto.launch(df, domain="general", port=None, open_browser=False) | Registers an in-memory DataFrame as a job (no upload), starts the local workbench server, returns a Session. Prints a clickable URL. |
| prepro_auto.set_api_key(provider, api_key, model=None) | Sets the AI provider and key at runtime (in-memory). Returns {ok, provider, model, verified, reason} after a live test call. |

After session = prepro_auto.launch(df):

| Method / property | What it does |
| --- | --- |
| session.current() | Returns the current (active-version) DataFrame as it stands in the UI right now. |
| session.update(df) | Pushes a notebook-edited DataFrame to the UI as a new undoable version. |
| session.url | The workbench URL for this session. |
| session.job_id | The internal job ID for this session. |
| session.port | The local port the workbench server is running on. |

Typical sync cycle:

cur = session.current()                            # pull current state from UI
cur["price_per_sqft"] = cur["price"] / cur["sqft"] # your own code
session.update(cur)                                # push back, refresh UI to see it

REST API reference

The web app and SDK both call the same endpoints, all under /api/v1. The paths below are relative to that prefix.

| Endpoint | Purpose |
| --- | --- |
| POST /datasets/upload | Upload a dataset |
| GET /datasets/{job_id}/preview | First rows + shape |
| POST /datasets/{job_id}/profile | Per-column profile + quality score |
| GET /datasets/{job_id}/view | The current (active-version) data |
| GET /datasets/{job_id}/comparison | Raw vs current summary |
| POST /datasets/{job_id}/stages/{stage} | Run a cleaning stage (missing_values, outliers, scaling, correlation, encoding) |
| POST /datasets/{job_id}/stages/{stage}/execute | Apply approved decisions, commit a snapshot |
| GET /datasets/{job_id}/decisions | List decision cards (filter by ?stage=) |
| POST /decisions/{id}/approve · /override · /skip · /drop-column | Resolve a card |
| GET /datasets/{job_id}/queue | Decision summary across stages |
| GET /datasets/{job_id}/history · POST /undo · POST /redo | Version history & navigation |
| GET /datasets/{job_id}/snapshots | List committed versions |
| GET /datasets/{job_id}/transform/operations | List available preset ops |
| POST /datasets/{job_id}/transform/preset | Apply a preset op (rename, drop, cast, fillna, filter, …) |
| POST /datasets/{job_id}/transform/expression | Run a sandboxed pandas expression |
| POST /datasets/{job_id}/transform/batch | Apply one op to many columns as one undoable step |
| POST /datasets/{job_id}/transform/ai-propose · ai-advise · assistant · chat | AI helpers (needs a key) |
| POST /datasets/{job_id}/viz/chart · metric · compare · ask | Charts and condition counts |
| GET /datasets/{job_id}/viz/dashboard | Before/after KPI dashboard |
| POST /drift/compare | Drift detection between two uploaded datasets |
| GET /datasets/{job_id}/export/data · audit · pipeline | Clean data, audit PDF, reproducible script |
| GET /system/limits | Live RAM-aware upload limits |
| GET /system/llm | List available providers + active one |
| POST /system/llm/configure | Set provider + key at runtime |

Open http://localhost:8000/docs after prepro_auto is running for the interactive Swagger UI with full request/response schemas.
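
For scripted use, the export endpoints can be reached with nothing but the standard library. The sketch below assumes the server is on its default http://localhost:8000 address and that the /api/v1 prefix applies as stated above; export_url and download are hypothetical helper names, and the job ID must come from your own session (for example session.job_id).

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000/api/v1"  # assumed default host and port


def export_url(job_id, kind, fmt=None, base_url=BASE_URL):
    """Build the URL for an export endpoint; kind is 'data', 'audit' or 'pipeline'."""
    if kind not in {"data", "audit", "pipeline"}:
        raise ValueError(f"unknown export kind: {kind!r}")
    url = f"{base_url}/datasets/{job_id}/export/{kind}"
    if fmt is not None:
        url += "?" + urlencode({"format": fmt})
    return url


def download(url, dest_path):
    """Fetch `url` with a GET request and write the response body to `dest_path`."""
    with urlopen(url) as resp, open(dest_path, "wb") as out:
        out.write(resp.read())
```

A call such as download(export_url(job_id, "data", "parquet"), "clean.parquet") would then save the cleaned dataset, provided the server is running and the job exists.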


License

MIT
