Skip to main content

Intelligent data preprocessing library with advanced options

Project description

FlowPrep ML 🚀

PyPI version Python 3.8+ License: MIT Downloads

Intelligent data preprocessing library with advanced options for machine learning workflows.

FlowPrep ML is a powerful Python library that provides intelligent data preprocessing capabilities with minimal code. Perfect for data scientists and ML engineers who want to quickly preprocess their datasets with advanced options.

✨ Features

  • One-liner preprocessing: preprocess("data.csv") and you're done!
  • Multiple file formats: CSV, XLS, XLSX support
  • Advanced options: Missing value imputation, feature scaling, categorical encoding, outlier removal
  • Intelligent defaults: Works out of the box with sensible preprocessing choices
  • Flexible configuration: Customize every aspect of preprocessing
  • Train-test splitting: Automatic data splitting for ML workflows
  • Comprehensive logging: Track every preprocessing step

🚀 Quick Start

Installation

pip install flowprep-ml

Basic Usage

import flowprep_ml

# One-liner preprocessing
result = flowprep_ml.preprocess("data.csv")

# Access processed data
train_data = result['train_data']
test_data = result['test_data']
print(f"Processed {result['processed_shape'][0]} rows, {result['processed_shape'][1]} columns")

Advanced Usage

import flowprep_ml

# Custom preprocessing options
result = flowprep_ml.preprocess(
    "data.csv",
    imputation_method="median",      # Handle missing values
    scaling_method="standard",       # Scale features
    encoding_method="onehot",        # Encode categorical variables
    remove_outliers=True,            # Remove outliers
    outlier_method="iqr",            # Outlier detection method
    test_size=0.2,                   # 20% for testing
    random_state=42                  # Reproducible results
)

# Access results
print("Preprocessing log:")
for log_entry in result['preprocessing_log']:
    print(f"  - {log_entry}")

print(f"Output saved to: {result['output_path']}")

📊 Supported File Formats

  • CSV: .csv
  • Excel: .xls, .xlsx, .xlsm

⚙️ Preprocessing Options

Missing Value Handling

  • imputation_method: "mean", "median", "mode", "drop"

Feature Scaling

  • scaling_method: "minmax", "standard", "robust"

Categorical Encoding

  • encoding_method: "onehot", "label"

Outlier Removal

  • remove_outliers: True/False
  • outlier_method: "iqr", "zscore"

Data Splitting

  • test_size: Fraction for test set (0.0 to 1.0)
  • random_state: Random seed for reproducibility

Output Options

  • output_format: "csv", "excel"
  • save_processed: True/False
  • output_path: Custom output path

📖 Examples

Example 1: Basic Preprocessing

import flowprep_ml
import pandas as pd

# Create sample data
data = pd.DataFrame({
    'age': [25, 30, None, 45, 50],
    'income': [50000, 60000, 70000, 80000, 90000],
    'category': ['A', 'B', 'A', 'C', 'B'],
    'score': [85, 90, 78, 92, 88]
})
data.to_csv('sample_data.csv', index=False)

# Preprocess
result = flowprep_ml.preprocess('sample_data.csv')
print(result['preprocessing_log'])

Example 2: Advanced Preprocessing

import flowprep_ml

# Advanced preprocessing with custom options
result = flowprep_ml.preprocess(
    'data.csv',
    imputation_method='median',
    scaling_method='robust',
    encoding_method='onehot',
    remove_outliers=True,
    outlier_method='zscore',
    test_size=0.3,
    random_state=123
)

# Access processed data
train_data = result['train_data']
test_data = result['test_data']

print(f"Training set: {train_data.shape}")
print(f"Test set: {test_data.shape}")
print(f"Output file: {result['output_path']}")

Example 3: Using PreprocessingOptions Class

import flowprep_ml
from flowprep_ml import PreprocessingOptions

# Create options object
options = PreprocessingOptions(
    imputation_method='mean',
    scaling_method='standard',
    encoding_method='onehot',
    remove_outliers=True,
    outlier_method='iqr',
    test_size=0.2,
    random_state=42
)

# Use with preprocessing
result = flowprep_ml.preprocess('data.csv', **options.to_dict())

🔧 API Reference

Main Functions

preprocess(file_path, **kwargs)

Main preprocessing function.

Parameters:

  • file_path (str or Path): Path to input file
  • **kwargs: Preprocessing options

Returns:

  • dict: Preprocessing results containing:
    • success (bool): Whether preprocessing succeeded
    • original_shape (tuple): Original data shape
    • processed_shape (tuple): Processed data shape
    • train_shape (tuple): Training data shape
    • test_shape (tuple): Test data shape
    • output_path (str): Path to saved processed data
    • preprocessing_log (list): Log of preprocessing steps
    • options_used (dict): Options used for preprocessing
    • train_data (DataFrame): Processed training data
    • test_data (DataFrame): Processed test data

get_supported_formats()

Get list of supported file formats.

Returns:

  • list: List of supported file extensions

validate_file(file_path)

Validate if file exists and is supported format.

Parameters:

  • file_path (str or Path): Path to file

Returns:

  • bool: True if file is valid

Raises:

  • FileNotFoundError: If file doesn't exist
  • UnsupportedFileFormatError: If file format is not supported

Classes

PreprocessingOptions

Configuration class for preprocessing options.

Attributes:

  • imputation_method (str): Method for handling missing values
  • scaling_method (str): Method for scaling numerical features
  • encoding_method (str): Method for encoding categorical variables
  • remove_outliers (bool): Whether to remove outliers
  • outlier_method (str): Method for outlier detection
  • test_size (float): Fraction of data to use for testing
  • random_state (int): Random seed for reproducibility
  • output_format (str): Output file format
  • save_processed (bool): Whether to save processed data
  • output_path (str, optional): Custom output path

🛠️ Development

Installation for Development

git clone https://github.com/flowml/flowprep-ml.git
cd flowprep-ml
pip install -e .

Running Tests

pytest

Code Formatting

black flowprep_ml/
flake8 flowprep_ml/

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📞 Support

🙏 Acknowledgments

  • Built with pandas
  • Powered by scikit-learn
  • Inspired by the need for simple, powerful data preprocessing

Made with ❤️ by the Flow ML Team

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

flowprep_ml-1.0.0.tar.gz (16.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

flowprep_ml-1.0.0-py3-none-any.whl (12.0 kB view details)

Uploaded Python 3

File details

Details for the file flowprep_ml-1.0.0.tar.gz.

File metadata

  • Download URL: flowprep_ml-1.0.0.tar.gz
  • Upload date:
  • Size: 16.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for flowprep_ml-1.0.0.tar.gz
Algorithm Hash digest
SHA256 86d7757a3013b8615df3d31b25c3f2ef21fd08873c4ea917aef85f060551ad47
MD5 9745dca0f0a917953099cac7f09fccb8
BLAKE2b-256 969f186cb04949e45e72ac11fb48e9ae40cc807aa6467bb2bedb3553864ce827

See more details on using hashes here.

File details

Details for the file flowprep_ml-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: flowprep_ml-1.0.0-py3-none-any.whl
  • Upload date:
  • Size: 12.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for flowprep_ml-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fec15bc1c367f82b1e78b619af92b5db360c068f2e89911fe372c20358618046
MD5 36d132d7e7e933523fea1bc91b76afb3
BLAKE2b-256 c414e6d90a0dc4b8ad255ed86cacd4298e78fe9b7fa1d57733f9c360c3c867c0

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page