One-line Data Analyst + AutoML library for tabular datasets — clean, train, and predict with minimal code
mrpravin
One-line Data Analyst + AutoML for tabular datasets
Built by Pravin MR · Chennai, India
mrpravin000@gmail.com · LinkedIn · GitHub
What is mrpravin?
mrpravin is a Python library that automates the entire ML pipeline for tabular data — from raw CSV to trained, production-ready model — in as few as 3 lines of code.
```python
import mrpravin as mr

df = mr.pravinDA("data.csv")                    # clean, encode, ready
model = mr.pravinDS(df, target="loan_status")   # AutoML → best model
model.summary()                                 # full model card
```
No manual preprocessing. No encoder fitting. No scaler setup. No model selection loop.
Install
```bash
# Core (pandas, numpy, scikit-learn)
pip install mrpravin

# Full: adds XGBoost, LightGBM, encoding detection
pip install "mrpravin[full]"
```
The 3 Functions
| Function | What it does |
|---|---|
| `mr.pravinDA(source)` | Loads, cleans, encodes, and returns a ready DataFrame |
| `mr.pravinDS(df, target)` | Full AutoML: selects and tunes the best model |
| `mr.pravinML` | Production inference layer: predict, validate, explain, benchmark |
Quick Start
Data Analyst Mode
```python
import mrpravin as mr

# Works with CSV, Excel, JSON, or a DataFrame
df = mr.pravinDA("data.csv")
print(df.head())   # fully cleaned, encoded, human-readable
print(df.shape)
```
What happens automatically:
- Duplicate rows removed
- Missing values filled (median for numeric, mode for categorical)
- Outliers winsorized
- Boolean columns → `0` / `1`
- Categorical columns → one-hot encoded
- High cardinality columns → frequency encoded
- Datetime columns → year / month / day / dayofweek features
- ID columns → dropped
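The automatic steps listed above can be approximated in plain pandas. The following is a minimal sketch under stated assumptions, not mrpravin's actual implementation: the function name `basic_clean`, the `max_ohe_levels` threshold of 10, and the 1.5 × IQR fences are all chosen here for illustration. Datetime expansion and ID-column dropping are left out to keep it short.

```python
import pandas as pd


def basic_clean(df: pd.DataFrame, max_ohe_levels: int = 10) -> pd.DataFrame:
    """Illustrative cleaning pass (hypothetical helper, not part of mrpravin)."""
    out = df.drop_duplicates().reset_index(drop=True)

    for col in list(out.columns):
        s = out[col]
        if pd.api.types.is_bool_dtype(s):
            # Boolean columns become 0 / 1.
            out[col] = s.astype(int)
        elif pd.api.types.is_numeric_dtype(s):
            # Median imputation, then winsorize by clipping to the IQR fences.
            s = s.fillna(s.median())
            q1, q3 = s.quantile(0.25), s.quantile(0.75)
            iqr = q3 - q1
            out[col] = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
        else:
            # Mode imputation for categorical columns.
            s = s.fillna(s.mode().iloc[0])
            if s.nunique() > max_ohe_levels:
                # High cardinality: replace each level with its relative frequency.
                out[col] = s.map(s.value_counts(normalize=True))
            else:
                # Low cardinality: one-hot encode and drop the original column.
                dummies = pd.get_dummies(s, prefix=col, dtype=int)
                out = pd.concat([out.drop(columns=col), dummies], axis=1)
    return out


raw = pd.DataFrame({
    "age": [25.0, 30.0, None, 35.0, 400.0, 25.0],
    "owns_home": [True, False, True, False, True, True],
    "city": ["Chennai", "Pune", None, "Chennai", "Delhi", "Chennai"],
})
clean = basic_clean(raw)
print(clean)
```

In this toy frame the last row duplicates the first and is dropped, the missing age is filled with the median, and the value 400 is clipped to the upper IQR fence rather than removed.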
Data Scientist Mode — AutoML
```python
import mrpravin as mr

df = mr.pravinDA("data.csv")
model = mr.pravinDS(df, target="price")
model.summary()
```
What happens automatically:
- Train / test split with zero data leakage
- Runs: `LinearRegression` / `LogisticRegression`, `RandomForest`, `GradientBoosting` (+ XGBoost, LightGBM if installed)
- Cross-validated hyperparameter tuning
- Picks the best model
- Evaluates on held-out test set
- Returns a `pravinML` object ready for production
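The selection loop described above can be sketched with scikit-learn alone. This is an illustration of the general technique (split, tune each candidate with `RandomizedSearchCV`, keep the best cross-validated score, then score once on the held-out set), not the library's internal code. The synthetic dataset and the small parameter grids are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for a cleaned, encoded dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Candidate estimators with deliberately tiny search spaces (illustrative only).
candidates = {
    "LogisticRegression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1.0, 10.0]},
    ),
    "RandomForest": (
        RandomForestClassifier(random_state=42),
        {"n_estimators": [25, 50], "max_depth": [3, 5, None]},
    ),
    "GradientBoosting": (
        GradientBoostingClassifier(random_state=42),
        {"n_estimators": [25, 50], "learning_rate": [0.05, 0.1]},
    ),
}

# Tune every candidate using cross-validation on the training split only.
results = {}
for name, (estimator, grid) in candidates.items():
    search = RandomizedSearchCV(estimator, grid, n_iter=3, cv=3, random_state=42)
    search.fit(X_train, y_train)
    results[name] = search

# Pick the winner by cross-validated score, then evaluate once on the test set.
best_name = max(results, key=lambda n: results[n].best_score_)
best_model = results[best_name].best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(best_name, round(test_accuracy, 3))
```

Selecting on the cross-validated score and touching the test split only once keeps the reported test metric an honest estimate.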
Production Inference — pravinML
```python
import pandas as pd
import mrpravin as mr

# `model` is the pravinML object returned by mr.pravinDS() above.

# Predict on new raw data
new_data = pd.DataFrame({
    "feature_1": [8, 3, 6],
    "feature_2": [95, 55, 75],
    "category": ["Yes", "No", "Yes"],
})
predictions = model.predict(new_data)          # auto-cleans internally
probabilities = model.predict_proba(new_data)  # classification only

# Validate schema before predicting
report = model.validate(new_data)
print(report.summary())

# Feature importance
for feature, pct in model.explain().items():
    print(f"  {feature}: {pct:.1f}%")

# Benchmark inference speed
bench = model.benchmark(new_data, n_runs=100)
print(f"p50: {bench['p50_ms']} ms | throughput: {bench['throughput_rows_per_sec']} rows/s")

# Save and load
model.save("model.pkl")
model = mr.pravinML.load("model.pkl")
```
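To make the idea of validating a schema before predicting more concrete, here is a small pandas sketch. It assumes a schema is simply the set of training columns plus numeric ranges and known category levels; `build_schema` and `validate` are hypothetical helpers written for this example and are not the `ValidationReport` machinery that `model.validate()` returns.

```python
import pandas as pd


def build_schema(train: pd.DataFrame) -> dict:
    """Record each column's kind and its observed range or category levels."""
    schema = {}
    for col in train.columns:
        s = train[col]
        if pd.api.types.is_numeric_dtype(s):
            schema[col] = {"kind": "numeric", "min": float(s.min()), "max": float(s.max())}
        else:
            schema[col] = {"kind": "categorical", "levels": set(s.dropna().unique())}
    return schema


def validate(schema: dict, new: pd.DataFrame) -> list[str]:
    """Return readable issues; an empty list means the frame looks compatible."""
    issues = []
    for col, spec in schema.items():
        if col not in new.columns:
            issues.append(f"missing column: {col}")
            continue
        s = new[col].dropna()
        if spec["kind"] == "numeric":
            if not pd.api.types.is_numeric_dtype(s):
                issues.append(f"{col}: expected numeric values")
            elif ((s < spec["min"]) | (s > spec["max"])).any():
                issues.append(f"{col}: values outside training range")
        else:
            unseen = set(s.unique()) - spec["levels"]
            if unseen:
                issues.append(f"{col}: unseen categories {sorted(unseen)}")
    for col in new.columns:
        if col not in schema:
            issues.append(f"unexpected column: {col}")
    return issues


train = pd.DataFrame({"feature_1": [1, 5, 9], "category": ["Yes", "No", "Yes"]})
schema = build_schema(train)

ok = validate(schema, pd.DataFrame({"feature_1": [3], "category": ["No"]}))
bad = validate(schema, pd.DataFrame({"feature_1": [12], "category": ["Maybe"]}))
print(ok)
print(bad)
```

A range check like this is only a crude drift signal; it flags values the model never saw but says nothing about shifts inside the training range.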
Real Results
| Dataset | Rows | Problem | Best Model | Score |
|---|---|---|---|---|
| Student Performance | 10,000 | Regression | LinearRegression | R² = 0.988 |
| Loan Default Prediction | 45,000 | Classification | GradientBoosting | Accuracy = 93.4%, ROC-AUC = 97.8% |
Both results were obtained with the same three-line workflow shown above.
Configuration
```python
import mrpravin as mr
from mrpravin import MrPravinConfig

cfg = MrPravinConfig(
    random_seed=42,
    cv_folds=5,
    n_iter_search=20,        # hyperparameter search iterations
    use_xgboost=True,
    use_lightgbm=True,
    outlier_method="iqr",    # "iqr" | "zscore" | "mad"
    verbose=True,
)
model = mr.pravinDS("data.csv", target="label", cfg=cfg)
```
Save and reuse config:
```python
cfg.to_json("my_config.json")
cfg = MrPravinConfig.from_json("my_config.json")
```
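For readers curious how a JSON round trip like this is commonly built, the sketch below uses a standard-library dataclass. `SketchConfig` is a hypothetical stand-in with a subset of the fields shown above; it is not the real `MrPravinConfig`, whose internals are not documented here.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a config object; not mrpravin's class."""
    random_seed: int = 42
    cv_folds: int = 5
    n_iter_search: int = 20
    outlier_method: str = "iqr"

    def __post_init__(self) -> None:
        # Reject values outside the documented choices early.
        if self.outlier_method not in {"iqr", "zscore", "mad"}:
            raise ValueError(f"unknown outlier_method: {self.outlier_method!r}")

    def to_json(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(asdict(self), fh, indent=2)

    @classmethod
    def from_json(cls, path: str) -> "SketchConfig":
        with open(path, encoding="utf-8") as fh:
            return cls(**json.load(fh))


path = os.path.join(tempfile.mkdtemp(), "my_config.json")
cfg_sketch = SketchConfig(cv_folds=3)
cfg_sketch.to_json(path)
restored = SketchConfig.from_json(path)
print(restored)
```

Because dataclasses compare field by field, `restored == cfg_sketch` holds after the round trip, which makes saved configs easy to check in tests.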
Architecture
```text
mrpravin/
├── mrpravin/
│   ├── __init__.py            ← public API
│   ├── config.py              ← MrPravinConfig
│   ├── pipeline.py            ← pravinDA() and pravinDS()
│   ├── ml.py                  ← pravinML (inference layer)
│   ├── core/
│   │   ├── loader.py          ← CSV / Excel / JSON loading
│   │   ├── profiler.py        ← column type detection
│   │   ├── cleaner.py         ← dedup, imputation, outliers
│   │   ├── encoder.py         ← OHE, frequency, boolean encoding
│   │   ├── scaler.py          ← StandardScaler / RobustScaler
│   │   └── report.py          ← report builder + JSON/HTML export
│   └── automl/
│       ├── model_selector.py  ← problem detection + candidates
│       ├── tuner.py           ← RandomizedSearchCV
│       └── evaluator.py       ← metrics + feature importance
└── tests/
    └── test_mrpravin.py       ← 37 tests
```
Full pipeline flow
```text
CSV / Excel / JSON / DataFrame
        ↓
pravinDA()
├── load
├── detect column types (7 types)
├── clean (dedup, impute, outliers, text)
├── encode (OHE / frequency / boolean / datetime)
└── returns DataFrame   ← human-readable, ML-ready
        ↓
pravinDS()
├── dedup full dataset before split (prevents X/y misalignment)
├── train / test split (no leakage)
├── clean + encode + scale (fit on train only)
├── model selection + CV tuning
├── evaluate on test set
└── returns pravinML object
        ↓
pravinML.predict()
├── validate schema
├── clean + encode + scale (transform only, no re-fit)
└── predict
```
Zero data leakage by design. Encoders and scalers are always fit on training data only, then applied via .transform() at inference time.
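The fit-on-train, transform-at-inference discipline is the same one scikit-learn's `Pipeline` enforces. The sketch below illustrates the principle with a toy frame whose column names are invented for the example; it does not reproduce mrpravin's own encoder or scaler classes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; column names are illustrative only.
df = pd.DataFrame({
    "income": [30, 45, 60, 75, 90, 105, 120, 135],
    "region": ["N", "S", "N", "E", "S", "N", "E", "S"],
    "approved": [0, 0, 0, 1, 0, 1, 1, 1],
})
X, y = df[["income", "region"]], df["approved"]

# Split first, so no statistic is ever computed on the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])

# fit() learns the scaler mean and encoder categories from X_train only.
pipe.fit(X_train, y_train)

# predict() applies the stored transforms to unseen rows without re-fitting.
preds = pipe.predict(X_test)

# The scaler's mean is the training mean, not the full-data mean.
scaler_mean = pipe.named_steps["prep"].named_transformers_["num"].mean_[0]
print(scaler_mean, X_train["income"].mean())
```

`handle_unknown="ignore"` matters at inference time: a category that never appeared in training is encoded as all zeros instead of raising an error.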
pravinML — Full API Reference
```python
model.predict(X)                  # predict labels / values
model.predict_proba(X)            # predict class probabilities
model.validate(X)                 # schema + drift check → ValidationReport
model.evaluate(X, y)              # metrics on any labelled dataset
model.explain(top_n=20)           # feature importance as % contribution
model.summary()                   # full model card printout
model.benchmark(X, n_runs=100)    # inference latency p50/p95/p99
model.save("model.pkl")           # persist with integrity checksum
mr.pravinML.load("model.pkl")     # load with checksum verification

model.metrics                     # raw metrics dict
model.feature_names               # training feature list
model.problem_type                # 'regression' | 'binary_classification' | ...
model.schema                      # InputSchema: raw feature ranges
model.model_name                  # winning algorithm name
```
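The reference above mentions saving with an integrity checksum and verifying it on load. One common way to do that, shown here purely as an assumption about the general approach, is to store a SHA-256 digest next to the pickled payload. The file layout and the helper names `save_with_checksum` and `load_with_checksum` are hypothetical and do not describe mrpravin's on-disk format.

```python
import hashlib
import os
import pickle
import tempfile


def save_with_checksum(obj, path: str) -> None:
    """Write '<sha256 hex>\\n<pickle bytes>' to path (illustrative layout)."""
    payload = pickle.dumps(obj)
    digest = hashlib.sha256(payload).hexdigest().encode("ascii")
    with open(path, "wb") as fh:
        fh.write(digest + b"\n" + payload)


def load_with_checksum(path: str):
    """Verify the stored digest before unpickling; raise ValueError on mismatch."""
    with open(path, "rb") as fh:
        digest, payload = fh.read().split(b"\n", 1)
    if hashlib.sha256(payload).hexdigest().encode("ascii") != digest:
        raise ValueError("checksum mismatch: file is corrupted or was modified")
    return pickle.loads(payload)


model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_with_checksum({"weights": [0.1, 0.2, 0.3]}, model_path)
loaded = load_with_checksum(model_path)
print(loaded)
```

A checksum stored in the same file detects accidental corruption only. It does not make unpickling safe, so pickle files should still be loaded only from trusted sources.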
Supported Formats
| Format | Extension |
|---|---|
| CSV | .csv, .tsv, .txt |
| Excel | .xlsx, .xls, .xlsm |
| JSON | .json (records or lines) |
| DataFrame | Pass directly |
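A loader that supports the formats in this table usually dispatches on the file extension. The following sketch shows that pattern with pandas readers; `load_table` and the `READERS` mapping are names invented for this example, and the real `loader.py` may work differently. Note that `pd.read_json` needs `lines=True` for JSON Lines files, which this short version does not handle.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd

# Extension → reader mapping (an assumption for illustration).
READERS = {
    ".csv": pd.read_csv,
    ".txt": pd.read_csv,
    ".tsv": lambda src: pd.read_csv(src, sep="\t"),
    ".xlsx": pd.read_excel,
    ".xls": pd.read_excel,
    ".xlsm": pd.read_excel,
    ".json": pd.read_json,  # record-style JSON; JSON Lines would need lines=True
}


def load_table(source):
    """Return a DataFrame from a path or pass an existing DataFrame through."""
    if isinstance(source, pd.DataFrame):
        return source.copy()
    suffix = Path(source).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
    return reader(source)


# Demo: write a small TSV file and load it back.
tsv_path = os.path.join(tempfile.mkdtemp(), "sample.tsv")
with open(tsv_path, "w", encoding="utf-8") as fh:
    fh.write("a\tb\n1\t2\n3\t4\n")

table = load_table(tsv_path)
print(table)
```

Returning a copy when a DataFrame is passed in keeps later cleaning steps from mutating the caller's data.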
Requirements
- Python ≥ 3.9
- pandas ≥ 1.5
- numpy ≥ 1.23
- scikit-learn ≥ 1.3
- scipy ≥ 1.10
- openpyxl ≥ 3.1 (Excel support)
Optional:
- xgboost ≥ 1.7
- lightgbm ≥ 4.0
- chardet ≥ 5.0 (non-UTF-8 CSV encoding detection)
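Optional dependencies such as these are typically detected at runtime so the core package still works without them. The sketch below shows one standard-library way to do that; `available_boosters` and `candidate_names` are hypothetical helpers and not part of mrpravin's API.

```python
import importlib.util


def available_boosters() -> list[str]:
    """Return the optional boosting libraries that are importable here."""
    optional = ("xgboost", "lightgbm")
    return [name for name in optional if importlib.util.find_spec(name) is not None]


def candidate_names() -> list[str]:
    """Core scikit-learn candidates plus any optional boosters that are installed."""
    core = ["LinearModel", "RandomForest", "GradientBoosting"]
    return core + available_boosters()


print(candidate_names())
```

`importlib.util.find_spec` checks whether a top-level package can be found without importing it, which avoids paying the import cost for libraries that end up unused.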
Run Tests
```bash
cd mrpravin
pip install -e ".[dev]"
pytest tests/ -v
```
37 tests covering profiler, cleaner, encoder, scaler, pravinDA, pravinDS, pravinML (predict, validate, evaluate, explain, benchmark, save/load).
Roadmap
- Phase 1: `pravinDA` · `pravinDS` · `pravinML`
- Phase 2: `pravinAI`, a static pipeline compiler and anti-pattern detector
- Phase 3: `pravinDL` · `pravinNLP`, deep learning and NLP extensions
Author
Pravin MR — Data Engineer & ML Systems Builder, Chennai, India
- Website: mrpravin000.vercel.app
- Email: mrpravin000@gmail.com
- LinkedIn: linkedin.com/in/mr-pravin
- GitHub: github.com/mr-pravin
License
MIT © 2026 Pravin MR — see LICENSE for full text.