AutoPrepML

AI-Assisted Multi-Modal Data Preprocessing Pipeline for ML

Quick Start · Installation · Examples · Docs · Contributing


Automate data preprocessing for ANY data type — Tabular, Text, Time Series, Graphs, and Images.

A comprehensive Python library that automatically detects, cleans, and transforms data across multiple modalities. Built for real-world ML pipelines with one-line automation and detailed reporting.

┌─────────────┐      ┌──────────────┐      ┌─────────────────┐      ┌────────────┐
│  Raw Data   │      │  AutoPrepML  │      │  Cleaned Data   │      │   Report   │
│ (Any Type)  │ ───> │   Detects    │ ───> │   Transformed   │ ───> │ (HTML/JSON)│
└─────────────┘      │   Cleans     │      │    Features     │      └────────────┘
                     └──────────────┘      └─────────────────┘

🎯 Features

Core Features

  • Multi-Modal Support - Works with 5 different data types out of the box
  • 🔍 Automatic Issue Detection - Missing values, outliers, duplicates, anomalies
  • 📊 Visual Reports - HTML reports with embedded plots and statistics
  • ⚙️ Highly Configurable - YAML/JSON configuration for reproducibility
  • 🚀 CLI + Python API - Use from command line or Python scripts
  • 🧪 Production Ready - 227 tests passing, 95%+ code coverage, optimized CI/CD

Advanced Features (v1.3.0) 🆕

  • 📊 AutoEDA - Automated exploratory data analysis with insights generation
  • ⚙️ AutoFeatureEngine - Intelligent feature engineering with 8 creation methods
  • 📈 Interactive Dashboards - Plotly visualizations and Streamlit app generation
  • 🤖 Enhanced LLM Assistant - Column renaming, documentation, quality analysis

Previous Releases

  • 🤖 LLM Integration - AI-powered suggestions with GPT-4, Claude, Gemini, Ollama (v1.2.0)
  • 🖼️ Image Preprocessing - Automatic image cleaning, resizing, normalization (v1.2.0)
  • 🆕 Advanced Imputation - KNN and Iterative (MICE) imputation methods (v1.1.0)
  • 🎯 SMOTE Balancing - Synthetic minority oversampling for imbalanced data (v1.1.0)

📋 Quick Navigation

Section Description
📊 Supported Data Types Overview of Tabular, Text, Time Series, Graph, Image
📦 Installation Install from source or PyPI (v1.0.1+)
🚀 Quick Start 5-minute tutorial for each data type
🆕 v1.3.0 Features AutoEDA, Feature Engineering, Dashboards (NEW!)
🆕 Advanced Features KNN/Iterative Imputation, SMOTE (v1.1.0)
🤖 LLM Integration AI-powered suggestions with multiple providers (v1.2.0)
🎯 Dynamic LLM Config Use ANY model - no hardcoded values!
⚙️ CLI Configuration Manage API keys with autoprepml-config
💻 CLI Reference Command-line options and examples
🔧 Examples Working demo scripts with outputs
📚 Full API Comprehensive function documentation
⚙️ Configuration YAML/JSON config for reproducibility
🧪 Testing Run tests and check coverage
🛠️ Development Contributing guide

📊 Supported Data Types

Data Type Module Use Cases Status
Tabular AutoPrepML Classification, Regression, General ML ✅ Ready
Text/NLP TextPrepML Sentiment Analysis, Topic Modeling, Classification ✅ Ready
Time Series TimeSeriesPrepML Forecasting, Trend Analysis, Anomaly Detection ✅ Ready
Graph GraphPrepML Social Networks, Recommendation Systems, Link Prediction ✅ Ready
Image ImagePrepML Computer Vision, Image Classification, Object Detection ✅ Ready

📦 Installation

Prerequisites

  • Python 3.10 or higher
  • pip (Python package manager)

Option 1: Install from PyPI

# Basic installation
pip install autoprepml

# With LLM support (AI-powered suggestions)
pip install autoprepml[llm]

# With all optional dependencies
pip install autoprepml[all]

Option 2: Install from Source (Latest Development Version)

git clone https://github.com/mdshoaibuddinchanda/autoprepml.git
cd autoprepml
pip install -e .

# Or with LLM support
pip install -e ".[llm]"

Option 3: With Development Tools

pip install -e ".[dev]"  # Includes pytest, coverage, linting tools
pip install -e ".[all]"  # Everything (dev + llm + docs)

Configure LLM Support (Optional)

After installing with LLM support, configure your API keys:

# Interactive configuration wizard
autoprepml-config

# Or set a specific provider
autoprepml-config --set openai
autoprepml-config --set anthropic
autoprepml-config --set google

# Use Ollama for local LLM (no API key needed!)
# Just install Ollama from https://ollama.ai

See LLM Configuration Guide for detailed instructions.

Verify Installation

python -c "from autoprepml import AutoPrepML; print('✓ Installation successful!')"
autoprepml --help

🆕 v1.3.0 New Features

📊 AutoEDA - Automated Exploratory Data Analysis

Comprehensive automated EDA with insights generation:

from autoprepml import AutoEDA

# Initialize with your DataFrame
eda = AutoEDA(df)

# Run full analysis
results = eda.analyze(
    include_correlations=True,
    include_distributions=True,
    include_outliers=True,
    generate_insights=True
)

# Generate interactive HTML report
eda.generate_report('eda_report.html')

# Export results to JSON
eda.to_json('eda_results.json')

# Access specific analysis results
print(results['insights'])
print(results['correlations']['high_correlations'])
print(results['outliers']['iqr_outliers'])

Features:

  • Statistical summaries (mean, std, quartiles, skewness, kurtosis)
  • Missing value analysis with percentages
  • Correlation matrix with high correlation detection (>0.7)
  • Distribution analysis (skewness, kurtosis, quartiles)
  • Outlier detection (IQR and Z-score methods)
  • Categorical analysis (cardinality, mode, value counts)
  • Automated insights generation in natural language
  • Interactive HTML reports with visualizations
  • JSON export for programmatic access
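These checks can be reproduced directly in pandas. A minimal sketch of the high-correlation rule (|r| > 0.7) and the IQR outlier rule on toy data (illustrative only, not AutoEDA's internal code):

```python
import pandas as pd

# Toy data: 'b' is exactly 2*'a' (perfectly correlated), 'a' holds one
# extreme value, and 'c' is unrelated noise.
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 100],
    'b': [2, 4, 6, 8, 10, 200],
    'c': [5, 3, 8, 1, 9, 2],
})

# High-correlation pairs: scan the upper triangle of the correlation matrix
corr = df.corr().abs()
pairs = [
    (r, c, float(corr.loc[r, c]))
    for i, r in enumerate(corr.index)
    for c in corr.columns[i + 1:]
    if corr.loc[r, c] > 0.7
]
print(pairs)  # only the ('a', 'b') pair exceeds the 0.7 threshold

# IQR outliers: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df['a'].quantile(0.25), df['a'].quantile(0.75)
iqr = q3 - q1
outliers = df[(df['a'] < q1 - 1.5 * iqr) | (df['a'] > q3 + 1.5 * iqr)]
print(outliers.index.tolist())  # row 5 (the value 100) is flagged
```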

⚙️ AutoFeatureEngine - Intelligent Feature Engineering

Create powerful features automatically with 8 different methods:

from autoprepml import AutoFeatureEngine, auto_feature_engineering

# Initialize with your DataFrame
fe = AutoFeatureEngine(df, target_column='target')

# 1. Polynomial features (degree 2 or 3)
df_poly = fe.create_polynomial_features(columns=['age', 'income'], degree=2)

# 2. Interaction features (multiplication)
df_interact = fe.create_interactions(columns=['age', 'income', 'score'])

# 3. Ratio features (division-based)
df_ratio = fe.create_ratio_features(columns=['income', 'loan_amount'])

# 4. Binned features (discretization)
df_binned = fe.create_binned_features(columns=['age'], n_bins=5, strategy='quantile')

# 5. Aggregation features (sum, mean, std, min, max)
df_agg = fe.create_aggregation_features(columns=['col1', 'col2', 'col3'])

# 6. Datetime features (year, month, day, hour, quarter)
df_date = fe.create_datetime_features(columns=['date'], features=['year', 'month', 'day'])

# 7. Feature selection (keep best k features)
df_selected = fe.select_features(method='mutual_info', k=10, task='classification')

# 8. Feature importance ranking
importance = fe.get_feature_importance(task='classification')
print(importance)

# Quick auto feature engineering
df_enhanced = auto_feature_engineering(
    df,
    numeric_columns=['age', 'income', 'score'],
    target_column='target',
    select_top_k=15
)

Methods:

  • create_polynomial_features() - Polynomial & interaction terms
  • create_interactions() - Pairwise multiplications
  • create_ratio_features() - Division-based features
  • create_binned_features() - Discretization (uniform, quantile, kmeans)
  • create_aggregation_features() - Row-wise aggregations
  • create_datetime_features() - Extract temporal components
  • select_features() - Mutual info or F-test selection
  • get_feature_importance() - Rank features by importance
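For intuition, here is what three of these methods produce, sketched in plain pandas (an illustration of the transformations, not the library's implementation; the column names are invented):

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 32, 47, 51],
                   'income': [30000, 52000, 81000, 60000]})

# Interaction feature: pairwise multiplication of two numeric columns
df['age_x_income'] = df['age'] * df['income']

# Ratio feature: division, guarding against divide-by-zero
df['income_per_year'] = df['income'] / df['age'].replace(0, pd.NA)

# Binned feature: quantile discretization into 2 equal-frequency bins
df['age_bin'] = pd.qcut(df['age'], q=2, labels=False)

print(df[['age_x_income', 'income_per_year', 'age_bin']])
```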

📈 Interactive Dashboards - Visualization & Streamlit

Create interactive dashboards with Plotly and generate full Streamlit apps:

from autoprepml import InteractiveDashboard, create_plotly_dashboard, generate_streamlit_app

# Initialize dashboard
dashboard = InteractiveDashboard(df)

# Create comprehensive Plotly dashboard
dashboard.create_dashboard(
    title="My Data Dashboard",
    output_path="dashboard.html"
)

# Create correlation heatmap
dashboard.create_correlation_heatmap(output_path="correlation.html")

# Create missing data visualization
dashboard.create_missing_data_plot(output_path="missing_data.html")

# Generate full Streamlit app
dashboard.generate_streamlit_app(output_path="app.py")

# Run the generated Streamlit app
# streamlit run app.py

# Or use convenience functions
create_plotly_dashboard(df, title="Quick Dashboard", output_path="quick_dash.html")
generate_streamlit_app(df, output_path="my_app.py")

Features:

  • Multi-subplot Plotly dashboards (histograms, box plots, scatter, bar charts)
  • Interactive correlation heatmaps
  • Missing data visualizations
  • Full Streamlit app generation with:
    • File upload functionality
    • Overview tab (shape, dtypes, memory)
    • EDA tab (distributions, correlations, missing values)
    • Preprocessing tab (missing value handling, encoding)
    • Feature engineering tab (interactions, polynomial, binning)

🤖 Enhanced LLM Assistant - Intelligent Data Cleaning

Advanced AI-powered assistance for data preprocessing:

from autoprepml import LLMSuggestor, suggest_column_rename, generate_data_documentation

# Initialize LLM suggestor
suggestor = LLMSuggestor(provider='openai')  # or 'anthropic', 'google', 'ollama'

# 1. Suggest better column names
new_names = suggestor.suggest_all_column_renames(df)
df_renamed = df.rename(columns=new_names)

# 2. Get specific column rename suggestion
new_name = suggest_column_rename(df, column='col1')
print(f"Suggested name: {new_name}")

# 3. Explain data quality issues in natural language
explanation = suggestor.explain_data_quality_issues(df)
print(explanation)

# 4. Generate comprehensive data documentation
documentation = generate_data_documentation(df)
with open('data_docs.md', 'w') as f:
    f.write(documentation)

# 5. Get preprocessing pipeline recommendations
pipeline = suggestor.suggest_preprocessing_pipeline(df, task='classification')
print(pipeline)

# 6. Get specific fix suggestions
fix = suggestor.suggest_fix(df, column='age', issue_type='missing')
print(fix)

New LLM Capabilities:

  • suggest_column_rename() - AI-powered intelligent column naming
  • suggest_all_column_renames() - Batch rename all columns
  • explain_data_quality_issues() - Natural language quality explanations
  • generate_data_documentation() - Auto-generate Markdown documentation
  • suggest_preprocessing_pipeline() - Complete pipeline recommendations
  • Works with OpenAI (GPT-4), Anthropic (Claude), Google (Gemini), and Ollama (local)

📦 New Dependencies

v1.3.0 adds optional dependencies for visualization:

# Install with visualization support
pip install autoprepml[viz]

# Or install manually
pip install plotly streamlit

🚀 Quick Start Guide

Step 1: Import the Library

import pandas as pd
from autoprepml import AutoPrepML, TextPrepML, TimeSeriesPrepML, GraphPrepML

Step 2: Choose Your Data Type

📊 Tabular Data (CSV, Excel, JSON)

# Load your data
df = pd.read_csv('data.csv')

# Initialize and clean
prep = AutoPrepML(df)
clean_df, target = prep.clean(task='classification', target_col='label')

# Generate report
prep.save_report('report.html')

🤖 With AI-Powered Suggestions (v1.2.0+)

# Enable LLM support for AI suggestions
prep = AutoPrepML(df, enable_llm=True, llm_provider='openai')

# Get AI analysis of your dataset
analysis = prep.analyze_with_llm(task='classification', target_col='label')
print(analysis)

# Get suggestions for missing values
suggestions = prep.get_llm_suggestions(column='age', issue_type='missing')
print(suggestions)

# Get feature engineering ideas
features = prep.get_feature_suggestions(task='classification', target_col='label')
for feature in features:
    print(f"  • {feature}")

# Clean with advanced methods
clean_df, report = prep.clean(
    task='classification',
    target_col='label',
    use_advanced=True,
    imputation_method='knn',  # or 'iterative'
    balance_method='smote'     # Advanced class balancing
)
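For intuition on what imputation_method='knn' does: each missing entry is filled from the k rows most similar on the observed columns. A standalone scikit-learn sketch, independent of AutoPrepML's own wiring:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Row 2 is missing its second feature; its nearest neighbours by the first
# feature are rows 3 and 4, so the gap is filled with their mean.
X = np.array([
    [1.0, 2.0],
    [1.1, 2.1],
    [9.0, np.nan],
    [9.1, 8.9],
    [8.9, 9.2],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[2])  # the NaN becomes (8.9 + 9.2) / 2 = 9.05
```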

📝 Text/NLP Data (Reviews, Documents, Tweets)

# Load text data
df = pd.read_csv('reviews.csv')

# Initialize with text column
prep = TextPrepML(df, text_column='review_text')

# Clean text
prep.clean_text(lowercase=True, remove_urls=True, remove_html=True)
prep.remove_stopwords()
prep.extract_features()

# Get cleaned data
cleaned_df = prep.df

Time Series Data (Sales, Sensor Data, Logs)

# Load time series
df = pd.read_csv('sales.csv')

# Initialize with timestamp and value columns
prep = TimeSeriesPrepML(df, timestamp_column='date', value_column='sales')

# Fill gaps and add features
prep.fill_missing_timestamps(freq='D')
prep.interpolate_missing(method='linear')
prep.add_time_features()
prep.add_lag_features(lags=[1, 7, 30])

# Get enhanced data
enhanced_df = prep.df

🕸️ Graph Data (Social Networks, Relationships)

# Load nodes and edges
nodes_df = pd.read_csv('nodes.csv')
edges_df = pd.read_csv('edges.csv')

# Initialize graph
prep = GraphPrepML(nodes_df=nodes_df, edges_df=edges_df,
                   node_id_col='id', source_col='source', target_col='target')

# Validate and clean
prep.validate_node_ids()
prep.validate_edges(remove_self_loops=True, remove_dangling=True)
prep.add_node_features()

# Get cleaned graph
clean_nodes = prep.nodes_df
clean_edges = prep.edges_df

🖼️ Image Data (Computer Vision, ML Models)

from autoprepml import ImagePrepML

# Initialize with image directory
prep = ImagePrepML(
    image_dir='./images',
    target_size=(224, 224),
    color_mode='rgb',
    normalize=True
)

# Detect issues
issues = prep.detect()

# Clean and preprocess
processed_images = prep.clean(
    remove_corrupted=True,
    resize=True,
    convert_mode=True
)

# Split dataset
train, val, test = prep.split_dataset(
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15
)

# Save processed images
prep.save_processed('./output', format='png')

# Generate report
prep.save_report('image_report.html')

💻 Command Line Usage

Quick Reference

| Option          | Short | Description                         | Example             |
| --------------- | ----- | ----------------------------------- | ------------------- |
| `--input`       | `-i`  | Input CSV file                      | `-i data.csv`       |
| `--output`      | `-o`  | Output CSV file                     | `-o cleaned.csv`    |
| `--task`        | `-t`  | ML task (classification/regression) | `-t classification` |
| `--target`      |       | Target column name                  | `--target label`    |
| `--report`      | `-r`  | HTML report path                    | `-r report.html`    |
| `--config`      | `-c`  | Config file (YAML/JSON)             | `-c config.yaml`    |
| `--detect-only` |       | Only detect issues, no cleaning     | `--detect-only`     |
| `--verbose`     | `-v`  | Verbose output                      | `-v`                |


Common Workflows

# 1. Quick data inspection
autoprepml -i data.csv --detect-only -v

# 2. Clean and generate report
autoprepml -i raw.csv -o clean.csv -r report.html -t classification --target label

# 3. Use custom configuration
autoprepml -i data.csv -o cleaned.csv -c config.yaml

# 4. Classification task with balancing
autoprepml -i train.csv -o train_clean.csv -t classification --target Survived

# 5. Regression task with outlier removal
autoprepml -i housing.csv -o housing_clean.csv -t regression --target price -v

📖 Complete Feature Reference

1️⃣ Tabular Data (AutoPrepML)

Detection Capabilities:

  • ✅ Missing values (count, percentage by column)
  • ✅ Outliers (Isolation Forest, Z-score methods)
  • ✅ Class imbalance (for classification tasks)
  • ✅ Data type validation

Cleaning Operations:

  • ✅ Imputation (mean, median, mode, auto)
  • ✅ Scaling (StandardScaler, MinMaxScaler)
  • ✅ Encoding (Label, One-Hot)
  • ✅ Class balancing (Oversampling, Undersampling)
  • ✅ Outlier removal
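As a sketch of what the oversampling option does conceptually: minority-class rows are resampled with replacement until the class counts match (illustrative pandas code, not AutoPrepML's exact balancing step):

```python
import pandas as pd

# 8 majority rows (label 0) vs 2 minority rows (label 1)
df = pd.DataFrame({'x': range(10), 'label': [0] * 8 + [1] * 2})

majority_n = df['label'].value_counts().max()
parts = [
    grp if len(grp) == majority_n
    else grp.sample(n=majority_n, replace=True, random_state=42)
    for _, grp in df.groupby('label')
]
balanced = pd.concat(parts, ignore_index=True)
print(balanced['label'].value_counts())  # both classes now have 8 rows
```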

Example:

import pandas as pd
from autoprepml import AutoPrepML

df = pd.read_csv('titanic.csv')
prep = AutoPrepML(df)

# Detect issues
issues = prep.detect(target_col='Survived')
print(f"Missing values: {issues['missing_values']}")
print(f"Outliers: {issues['outliers']['outlier_count']}")

# Auto-clean
clean_df, target = prep.clean(task='classification', target_col='Survived', auto=True)

# Generate report
prep.save_report('titanic_report.html')

2️⃣ Text/NLP Data (TextPrepML)

Detection Capabilities:

  • ✅ Missing/empty text
  • ✅ Very short/long texts
  • ✅ URLs, emails, HTML tags
  • ✅ Average text length
  • ✅ Duplicates

Cleaning Operations:

  • ✅ Text cleaning (lowercase, remove URLs/HTML/emails)
  • ✅ Special character & number removal
  • ✅ Stopword removal (English + custom)
  • ✅ Tokenization (word/sentence)
  • ✅ Feature extraction (length, word count, etc.)
  • ✅ Language detection (heuristic)
  • ✅ Duplicate removal
  • ✅ Length filtering
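The URL, HTML, and email cleanup steps above can be sketched with standard-library regexes; this shows the idea, not TextPrepML's exact rules:

```python
import re

def basic_clean(text: str) -> str:
    """Lowercase and strip URLs, HTML tags, and email addresses."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)  # URLs
    text = re.sub(r'<[^>]+>', ' ', text)                # HTML tags
    text = re.sub(r'\S+@\S+\.\S+', ' ', text)           # emails
    return re.sub(r'\s+', ' ', text).strip()            # collapse whitespace

raw = 'Great product! <br>See https://example.com or mail me@site.com'
print(basic_clean(raw))  # 'great product! see or mail'
```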

Example:

import pandas as pd
from autoprepml import TextPrepML

df = pd.read_csv('reviews.csv')
prep = TextPrepML(df, text_column='review_text')

# Detect issues
issues = prep.detect_issues()
print(f"Contains URLs: {issues['contains_urls']}")
print(f"Contains HTML: {issues['contains_html']}")

# Clean text
prep.clean_text(lowercase=True, remove_urls=True, remove_html=True)
prep.remove_stopwords()
prep.filter_by_length(min_length=10, max_length=500)

# Extract features
prep.extract_features()
prep.tokenize(method='word')

# Get vocabulary
vocab = prep.get_vocabulary(top_n=50)

# Save
cleaned_df = prep.df
cleaned_df.to_csv('reviews_cleaned.csv', index=False)

3️⃣ Time Series Data (TimeSeriesPrepML)

Detection Capabilities:

  • ✅ Duplicate timestamps
  • ✅ Missing dates/gaps
  • ✅ Chronological order validation
  • ✅ Missing values in series
  • ✅ Negative/zero values

Cleaning Operations:

  • ✅ Sort by timestamp
  • ✅ Remove/aggregate duplicate timestamps
  • ✅ Fill missing timestamps (any frequency)
  • ✅ Interpolation (linear, forward-fill, back-fill)
  • ✅ Outlier detection (Z-score, IQR)
  • ✅ Time feature extraction (year, month, day, hour, day of week, quarter, weekend)
  • ✅ Lag features (1-day, 7-day, 30-day, custom)
  • ✅ Rolling window statistics (mean, std, min, max)
  • ✅ Resampling to different frequencies
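For intuition, lag and rolling-window features reduce to shift() and rolling() in pandas; TimeSeriesPrepML generates them for you, but the underlying transformation looks like this (toy daily data):

```python
import pandas as pd

s = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=6, freq='D'),
    'sales': [10, 12, 11, 15, 14, 13],
})

s['sales_lag_1'] = s['sales'].shift(1)                 # value one day earlier
s['sales_roll_mean_3'] = s['sales'].rolling(3).mean()  # 3-day moving average
print(s.tail(1))  # last row: lag_1 = 14, 3-day mean = (15+14+13)/3 = 14.0
```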

Example:

import pandas as pd
from autoprepml import TimeSeriesPrepML

df = pd.read_csv('sales.csv')
prep = TimeSeriesPrepML(df, timestamp_column='date', value_column='sales')

# Detect issues
issues = prep.detect_issues()
print(f"Detected gaps: {issues['detected_gaps']}")
print(f"Duplicate timestamps: {issues['duplicate_timestamps']}")

# Clean and enhance
prep.sort_by_time()
prep.remove_duplicate_timestamps(aggregate='mean')
prep.fill_missing_timestamps(freq='D')  # Daily frequency
prep.interpolate_missing(method='linear')

# Feature engineering for ML
prep.add_time_features()
prep.add_lag_features(lags=[1, 7, 30])
prep.add_rolling_features(windows=[7, 30], functions=['mean', 'std'])

# Optional: Detect outliers
prep.detect_outliers(method='zscore', threshold=3.0)

# Save enhanced data
enhanced_df = prep.df
enhanced_df.to_csv('sales_enhanced.csv', index=False)

4️⃣ Graph Data (GraphPrepML)

Detection Capabilities:

  • ✅ Duplicate node IDs
  • ✅ Missing node IDs
  • ✅ Duplicate edges
  • ✅ Self-loops
  • ✅ Dangling edges (edges to non-existent nodes)
  • ✅ Isolated nodes

Cleaning Operations:

  • ✅ Node ID validation
  • ✅ Edge validation (remove self-loops, dangling edges)
  • ✅ Duplicate removal (nodes and edges)
  • ✅ Node feature extraction (in/out/total degree)
  • ✅ Edge feature extraction
  • ✅ Connected component identification (BFS algorithm)
  • ✅ Isolated node filtering
  • ✅ Graph statistics (density, average degree)
  • ✅ Format conversion (edge list, adjacency dict)
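The connected-component step can be sketched on a plain edge list with a breadth-first search; GraphPrepML's identify_components() applies the same idea (toy graph, illustrative only):

```python
from collections import defaultdict, deque

# Two components: {1, 2, 3} linked by edges, and {4, 5}
edges = [(1, 2), (2, 3), (4, 5)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def components(nodes, adj):
    """Return connected components as a list of node sets (BFS)."""
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        queue, comp = deque([start]), set()
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.add(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        comps.append(comp)
    return comps

print(components([1, 2, 3, 4, 5], adj))  # [{1, 2, 3}, {4, 5}]
```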

Example:

import pandas as pd
from autoprepml import GraphPrepML

nodes = pd.read_csv('users.csv')
edges = pd.read_csv('friendships.csv')

prep = GraphPrepML(nodes_df=nodes, edges_df=edges,
                   node_id_col='user_id',
                   source_col='from_user',
                   target_col='to_user')

# Detect issues
issues = prep.detect_issues()
print(f"Duplicate nodes: {issues['nodes']['duplicate_node_ids']}")
print(f"Dangling edges: {issues['edges']['dangling_edges']}")

# Clean graph
prep.validate_node_ids()
prep.validate_edges(remove_self_loops=True, remove_dangling=True)
prep.remove_duplicate_edges()

# Feature extraction
prep.add_node_features()  # Adds degree centrality
prep.identify_components()  # Finds connected components

# Get statistics
stats = prep.get_graph_stats()
print(f"Graph density: {stats['density']:.4f}")
print(f"Average degree: {stats['avg_degree']:.2f}")

# Save cleaned data
prep.nodes_df.to_csv('users_cleaned.csv', index=False)
prep.edges_df.to_csv('friendships_cleaned.csv', index=False)

⚙️ Configuration

AutoPrepML supports YAML/JSON configuration files for reproducible workflows.

Create Configuration File

config.yaml:

cleaning:
  missing_strategy: auto  # auto, mean, median, mode, drop
  outlier_method: iforest  # iforest, zscore
  outlier_contamination: 0.1
  scale_method: standard  # standard, minmax
  encode_method: label  # label, onehot
  balance_method: oversample  # oversample, undersample
  remove_outliers: false

detection:
  outlier_method: iforest
  outlier_contamination: 0.1
  imbalance_threshold: 0.3

reporting:
  include_plots: true
  plot_dpi: 100

logging:
  level: INFO

Use Configuration

from autoprepml import AutoPrepML

# Load with config file
prep = AutoPrepML(df, config_path='config.yaml')
clean_df, target = prep.clean(task='classification', target_col='label')

# Or pass config dict directly
config = {
    'cleaning': {
        'missing_strategy': 'median',
        'scale_method': 'minmax'
    }
}
prep = AutoPrepML(df, config=config)

🔧 Examples Directory

The examples/ directory contains working demo scripts for all data types.

Available Demos

  • demo_script.py: Iris dataset (150 rows) → iris_cleaned.csv, iris_report.html. Shows tabular preprocessing, scaling, encoding, HTML reports.
  • demo_text.py: Customer reviews (100 texts) → reviews_cleaned.csv. Shows text cleaning, stopword removal, tokenization, feature extraction.
  • demo_timeseries.py: Sales data with gaps (365 days) → sales_cleaned.csv. Shows gap filling, interpolation, lag features, rolling statistics.
  • demo_graph.py: Social network (50 nodes, 100 edges) → social_network_nodes_cleaned.csv, social_network_edges_cleaned.csv. Shows graph validation, component detection, degree centrality.
  • demo_all.py: All 4 data types → console output. Shows multi-modal preprocessing in one script.

Run Demos

# Navigate to project directory
cd autoprepml

# Run individual demos
python examples/demo_script.py        # Tabular data (Iris)
python examples/demo_text.py          # Text/NLP (reviews)
python examples/demo_timeseries.py    # Time series (sales)
python examples/demo_graph.py         # Graph data (social network)
python examples/demo_all.py           # All data types

# Check generated files
ls *.csv *.html

Expected Output Files

After running demos, you'll find these files in your directory:

  • iris_cleaned.csv, iris_report.html
  • reviews_cleaned.csv
  • sales_cleaned.csv
  • social_network_nodes_cleaned.csv, social_network_edges_cleaned.csv

🧪 Testing

AutoPrepML has comprehensive test coverage with 103 tests.

Run All Tests

# Run all tests
pytest tests/ -v

# Run with coverage report
pytest tests/ --cov=autoprepml --cov-report=html

# Run specific test file
pytest tests/test_text.py -v

# Run tests for specific module
pytest tests/test_timeseries.py -v

Test Coverage

Module Tests Coverage
core.py 6 tests 95%
detection.py 8 tests 98%
cleaning.py 11 tests 96%
visualization.py 7 tests 92%
reports.py 3 tests 90%
text.py 18 tests 95%
timeseries.py 18 tests 95%
graph.py 26 tests 97%
Total 103 tests 95%

Quick Test Command

# Just see if everything passes
pytest tests/ -q

# Output: 103 passed, 7 warnings in 5.01s

🏗️ Project Structure

autoprepml/
├── autoprepml/              # Core library
│   ├── __init__.py         # Package initialization
│   ├── core.py             # AutoPrepML class (tabular data)
│   ├── text.py             # TextPrepML class (text/NLP)
│   ├── timeseries.py       # TimeSeriesPrepML class (time series)
│   ├── graph.py            # GraphPrepML class (graph data)
│   ├── detection.py        # Issue detection functions
│   ├── cleaning.py         # Data cleaning transformations
│   ├── visualization.py    # Plot generation
│   ├── reports.py          # JSON/HTML report generators
│   ├── config.py           # Configuration management
│   ├── llm_suggest.py      # AI suggestions (placeholder)
│   ├── cli.py              # Command-line interface
│   └── utils.py            # Helper utilities
├── tests/                   # Test suite (103 tests)
│   ├── test_core.py        # Tabular data tests (6)
│   ├── test_text.py        # Text preprocessing tests (18)
│   ├── test_timeseries.py  # Time series tests (18)
│   ├── test_graph.py       # Graph data tests (26)
│   ├── test_detection.py   # Detection tests (8)
│   ├── test_cleaning.py    # Cleaning tests (11)
│   ├── test_visualization.py # Visualization tests (7)
│   ├── test_reports.py     # Reporting tests (3)
│   └── test_llm_suggest.py # LLM tests (6)
├── examples/                # Demo scripts
│   ├── demo_script.py      # Tabular data demo
│   ├── demo_text.py        # Text/NLP demo
│   ├── demo_timeseries.py  # Time series demo
│   ├── demo_graph.py       # Graph data demo
│   ├── demo_all.py         # Multi-modal demo
│   └── demo_notebook.ipynb # Jupyter notebook demo
├── docs/                    # Documentation
│   ├── index.md            # Documentation home
│   ├── usage.md            # Usage guide
│   ├── api_reference.md    # API documentation
│   └── tutorials.md        # Detailed tutorials
├── scripts/                 # Utility scripts
│   ├── run_tests.sh        # Test runner
│   ├── build_docs.sh       # Documentation builder
│   └── release.sh          # Release automation
├── setup.py                # Package setup
├── pyproject.toml          # Modern Python packaging
├── requirements.txt        # Dependencies
├── README.md               # This file
├── LICENSE                 # MIT License
├── .gitignore              # Git ignore rules
└── autoprepml.yaml         # Sample configuration

🛠️ Development Setup

For Contributors

# 1. Fork and clone the repository
git clone https://github.com/mdshoaibuddinchanda/autoprepml.git
cd autoprepml

# 2. Create a virtual environment (recommended)
python -m venv venv

# Activate on Windows
venv\Scripts\activate

# Activate on macOS/Linux
source venv/bin/activate

# 3. Install in development mode with dev dependencies
pip install -e ".[dev]"

# 4. Run tests to verify setup
pytest tests/ -v

# 5. Make your changes and run tests again
pytest tests/ -v

Development Commands

# Run tests with coverage
pytest tests/ --cov=autoprepml --cov-report=html

# Run tests for specific module
pytest tests/test_text.py -v

# Run linting (if configured)
black autoprepml/ tests/
ruff check autoprepml/

# Build documentation
cd docs
mkdocs serve  # View at http://localhost:8000

# Create distribution packages
python -m build

📚 Documentation

Comprehensive documentation is available in the docs/ directory:

Build Documentation Locally

pip install mkdocs mkdocs-material
cd docs
mkdocs serve  # View at http://localhost:8000

🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Quick Start:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make changes and add tests
  4. Run tests: pytest tests/ -v
  5. Commit: git commit -m "Add amazing feature"
  6. Push and open a Pull Request

🐛 Troubleshooting

Issue Solution
Import Error pip install -e .
CLI not recognized Reinstall: pip uninstall autoprepml && pip install -e .
Tests failing Install dev dependencies: pip install -e ".[dev]"
Matplotlib backend issues Set backend: import matplotlib; matplotlib.use('Agg')
Memory issues Process in chunks: pd.read_csv('file.csv', chunksize=10000)

For more help, see GitHub Issues or Discussions.

📊 Performance

Benchmarks

Dataset Size Data Type Processing Time Memory Usage
1K rows Tabular <0.5s <50MB
10K rows Tabular <2s <100MB
100K rows Tabular <10s <500MB
1K texts Text/NLP <1s <100MB
10K texts Text/NLP <5s <300MB
1K timestamps Time Series <1s <80MB
10K nodes/edges Graph <2s <150MB

Benchmarks run on: Intel Core i5, 16GB RAM, Python 3.10

Optimization Tips

# 1. Use auto mode for faster processing
prep.clean(task='classification', target_col='label', auto=True)

# 2. Disable reporting for speed
prep = AutoPrepML(df, config={'reporting': {'include_plots': False}})

# 3. Process large files in chunks
chunks = []
for chunk in pd.read_csv('big.csv', chunksize=10000):
    prep = AutoPrepML(chunk)
    clean_chunk, target = prep.clean(task='classification', target_col='label')
    chunks.append(clean_chunk)
clean_df = pd.concat(chunks, ignore_index=True)


🗺️ Roadmap

✅ Version 1.0.0 (Released)

  • Tabular data preprocessing (AutoPrepML)
  • Text/NLP preprocessing (TextPrepML)
  • Time series preprocessing (TimeSeriesPrepML)
  • Graph data preprocessing (GraphPrepML)
  • JSON/HTML reports with visualizations
  • CLI support with comprehensive options
  • 103 unit tests with 95%+ coverage
  • YAML/JSON configuration system

✅ Version 1.1.0 (Released - Q1 2025)

  • Advanced imputation (KNN, iterative)
  • SMOTE for class balancing
  • Enhanced documentation website
  • PyPI package publication (In Progress)

✅ Version 1.2.0 (Released - Q1 2025)

  • LLM integration for smart suggestions (OpenAI, Anthropic, Google, Ollama)
  • Configuration manager for API keys
  • CLI configuration tool (autoprepml-config)
  • Complete LLM documentation and examples
  • Image data preprocessing module

📋 Next Release (Planned)

  • Audio/video metadata extraction
  • Distributed processing (Dask support)
  • Cloud storage integration (S3, GCS, Azure)

🌟 Version 2.0.0 (Q3-Q4 2025)

  • Real-time streaming support
  • MLOps integration (MLflow, W&B)
  • Docker containers and Kubernetes
  • Web UI for interactive preprocessing
  • Community plugin system

💡 Use Cases

By Industry

Industry Use Cases
E-Commerce Customer review sentiment (Text), Sales forecasting (Time Series), Product recommendations (Graph)
Finance Fraud detection (Tabular), Stock prediction (Time Series), Transaction networks (Graph)
Healthcare Patient data (Tabular), Medical reports (Text), Disease tracking (Time Series), Provider networks (Graph)
Social Media User behavior (Tabular), Content moderation (Text), Trend detection (Time Series), Social networks (Graph)

By Task

  • Machine Learning: Feature engineering, data quality assessment, automated preprocessing
  • Data Science: EDA, data cleaning for visualization, statistical analysis
  • Research: Dataset preparation, reproducible workflows, benchmark creation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with pandas, scikit-learn, matplotlib, and seaborn.

📧 Contact


⭐ Star this repo if AutoPrepML helped you!

Documentation · Examples · Changelog · Contributing
