FlowPrep ML 🚀
Intelligent data preprocessing library with advanced options for machine learning workflows.
FlowPrep ML is a Python library that preprocesses tabular datasets with minimal code. It is aimed at data scientists and ML engineers who want sensible defaults for a quick start and configurable options when they need more control.
✨ Features
- One-liner preprocessing: preprocess("data.csv") and you're done!
- Multiple file formats: CSV, XLS, XLSX support
- Advanced options: Missing value imputation, feature scaling, categorical encoding, outlier removal
- Intelligent defaults: Works out of the box with sensible preprocessing choices
- Flexible configuration: Customize every aspect of preprocessing
- Train-test splitting: Automatic data splitting for ML workflows
- Comprehensive logging: Track every preprocessing step
🚀 Quick Start
Installation
pip install flowprep-ml
Basic Usage
import flowprep_ml
# One-liner preprocessing
result = flowprep_ml.preprocess("data.csv")
# Access processed data
train_data = result['train_data']
test_data = result['test_data']
print(f"Processed {result['processed_shape'][0]} rows, {result['processed_shape'][1]} columns")
Advanced Usage
import flowprep_ml
# Custom preprocessing options
result = flowprep_ml.preprocess(
    "data.csv",
    imputation_method="median",  # Handle missing values
    scaling_method="standard",   # Scale features
    encoding_method="onehot",    # Encode categorical variables
    remove_outliers=True,        # Remove outliers
    outlier_method="iqr",        # Outlier detection method
    test_size=0.2,               # 20% for testing
    random_state=42              # Reproducible results
)
# Access results
print("Preprocessing log:")
for log_entry in result['preprocessing_log']:
    print(f" - {log_entry}")
print(f"Output saved to: {result['output_path']}")
📊 Supported File Formats
- CSV: .csv
- Excel: .xls, .xlsx, .xlsm
⚙️ Preprocessing Options
Missing Value Handling
- imputation_method: "mean", "median", "mode", "drop"
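To make the four imputation choices concrete, here is a minimal standard-library sketch of what each name conventionally means on a single numeric column. It is an illustration only, not FlowPrep ML's implementation; the helper name `impute` is hypothetical, and "drop" here simply discards the missing entries (a real tabular tool would presumably drop whole rows).

```python
from statistics import mean, median, mode


def impute(values, method="median"):
    """Fill None entries in a list of numbers using the chosen strategy."""
    observed = [v for v in values if v is not None]
    if method == "drop":
        # Sketch-level behaviour: keep only the observed values.
        return observed
    strategies = {"mean": mean, "median": median, "mode": mode}
    if method not in strategies:
        raise ValueError(f"unknown imputation method: {method!r}")
    fill = strategies[method](observed)
    return [fill if v is None else v for v in values]


ages = [25, 30, None, 45, 50]  # the 'age' column used in Example 1
print(impute(ages, "mean"))    # [25, 30, 37.5, 45, 50]
print(impute(ages, "median"))  # [25, 30, 37.5, 45, 50]
print(impute(ages, "drop"))    # [25, 30, 45, 50]
```

Mean and median happen to agree on this column; they diverge once the observed values are skewed, which is the usual reason to prefer the median.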
Feature Scaling
- scaling_method: "minmax", "standard", "robust"
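The three scaling names correspond to well-known transformations. The sketch below shows the textbook formulas on one column using only the standard library; it assumes the conventional definitions (population standard deviation for "standard", median and interquartile range for "robust") and is not taken from FlowPrep ML's source. The function names are hypothetical.

```python
from statistics import mean, median, pstdev, quantiles


def minmax_scale(xs):
    """Rescale to the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standard_scale(xs):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def robust_scale(xs):
    """Centre on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    med = median(xs)
    return [(x - med) / (q3 - q1) for x in xs]


income = [50000, 60000, 70000, 80000, 90000]  # 'income' column from Example 1
print(minmax_scale(income))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(robust_scale(income))  # [-1.0, -0.5, 0.0, 0.5, 1.0]
print([round(z, 3) for z in standard_scale(income)])
```

All three helpers divide by a spread measure, so a constant column would raise ZeroDivisionError in this sketch; a production implementation needs to handle that case.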
Categorical Encoding
- encoding_method: "onehot", "label"
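The difference between the two encodings is easiest to see on a small column. This is a plain-Python sketch of the general techniques, assuming categories are ordered alphabetically; the library's actual column naming and category ordering are not documented here, and both function names are hypothetical.

```python
def label_encode(values):
    """Map each category to an integer code, in sorted category order."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def onehot_encode(values):
    """Return one 0/1 indicator column per category, keyed by category."""
    cats = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in cats}


category = ['A', 'B', 'A', 'C', 'B']  # 'category' column from Example 1
print(label_encode(category))   # [0, 1, 0, 2, 1]
print(onehot_encode(category))
# {'A': [1, 0, 1, 0, 0], 'B': [0, 1, 0, 0, 1], 'C': [0, 0, 0, 1, 0]}
```

Label encoding keeps a single column but implies an ordering between categories, while one-hot encoding avoids that at the cost of one extra column per category.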
Outlier Removal
- remove_outliers: True/False
- outlier_method: "iqr", "zscore"
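The two outlier methods can behave quite differently on the same data. The following sketch implements the conventional rules with the standard library: the IQR rule keeps values within 1.5 × IQR of the quartiles, and the z-score rule keeps values within a threshold of standard deviations from the mean. The multiplier 1.5 and the threshold 3.0 are common defaults assumed here, not values confirmed for FlowPrep ML, and the function names are hypothetical.

```python
from statistics import mean, pstdev, quantiles


def iqr_mask(xs, k=1.5):
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(xs, n=4, method="inclusive")
    spread = q3 - q1
    lo, hi = q1 - k * spread, q3 + k * spread
    return [lo <= x <= hi for x in xs]


def zscore_mask(xs, threshold=3.0):
    """True for values whose absolute z-score is at most the threshold."""
    mu, sigma = mean(xs), pstdev(xs)
    return [abs(x - mu) / sigma <= threshold for x in xs]


scores = [85, 90, 78, 92, 88, 300]  # 300 is an obvious outlier
print(iqr_mask(scores))                   # [True, True, True, True, True, False]
print(zscore_mask(scores))                # [True, True, True, True, True, True]
print(zscore_mask(scores, threshold=2.0)) # [True, True, True, True, True, False]
```

With only six values, no point can reach a z-score of 3 (using the population standard deviation, the maximum is the square root of n - 1, about 2.24 here), so the z-score rule at that threshold keeps everything. On small datasets the IQR rule is usually the more sensitive choice.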
Data Splitting
- test_size: Fraction for test set (0.0 to 1.0)
- random_state: Random seed for reproducibility
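To illustrate how a test fraction and a seed interact, here is a minimal shuffle-and-slice split written with the standard library. It is a sketch of the general idea only: the helper name `split_rows` is hypothetical, and the rounding rule for the test-set size is an assumption that may differ from what FlowPrep ML (or scikit-learn) does.

```python
import random


def split_rows(rows, test_size=0.2, random_state=None):
    """Shuffle a copy of rows with a seeded RNG and return (train, test)."""
    rng = random.Random(random_state)  # private RNG, global state untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train, test = split_rows(rows, test_size=0.2, random_state=42)
print(len(train), len(test))  # 8 2

# The same seed reproduces the same split.
assert split_rows(rows, 0.2, 42) == (train, test)
```

Fixing random_state matters whenever results need to be compared across runs; leaving it as None gives a different split each time.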
Output Options
- output_format: "csv", "excel"
- save_processed: True/False
- output_path: Custom output path
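Because options are passed as keyword arguments, a typo in a method name may only surface at run time. One way to catch this early is a small pre-flight check against the values listed in this section. The sketch below is a hypothetical helper, not part of FlowPrep ML; the allowed values are copied from the lists above.

```python
# Allowed values as documented in the Preprocessing Options section.
ALLOWED = {
    "imputation_method": {"mean", "median", "mode", "drop"},
    "scaling_method": {"minmax", "standard", "robust"},
    "encoding_method": {"onehot", "label"},
    "outlier_method": {"iqr", "zscore"},
    "output_format": {"csv", "excel"},
}


def check_options(**options):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for name, value in options.items():
        if name in ALLOWED and value not in ALLOWED[name]:
            problems.append(f"{name}={value!r} not in {sorted(ALLOWED[name])}")
    test_size = options.get("test_size")
    if test_size is not None and not 0.0 <= test_size <= 1.0:
        problems.append(f"test_size={test_size!r} must be between 0.0 and 1.0")
    return problems


print(check_options(imputation_method="median", test_size=0.2))  # []
print(check_options(scaling_method="log"))
# ["scaling_method='log' not in ['minmax', 'robust', 'standard']"]
```

The same keyword dictionary can then be passed on to preprocess() once the list of problems is empty.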
📖 Examples
Example 1: Basic Preprocessing
import flowprep_ml
import pandas as pd
# Create sample data
data = pd.DataFrame({
    'age': [25, 30, None, 45, 50],
    'income': [50000, 60000, 70000, 80000, 90000],
    'category': ['A', 'B', 'A', 'C', 'B'],
    'score': [85, 90, 78, 92, 88]
})
data.to_csv('sample_data.csv', index=False)
# Preprocess
result = flowprep_ml.preprocess('sample_data.csv')
print(result['preprocessing_log'])
Example 2: Advanced Preprocessing
import flowprep_ml
# Advanced preprocessing with custom options
result = flowprep_ml.preprocess(
    'data.csv',
    imputation_method='median',
    scaling_method='robust',
    encoding_method='onehot',
    remove_outliers=True,
    outlier_method='zscore',
    test_size=0.3,
    random_state=123
)
# Access processed data
train_data = result['train_data']
test_data = result['test_data']
print(f"Training set: {train_data.shape}")
print(f"Test set: {test_data.shape}")
print(f"Output file: {result['output_path']}")
Example 3: Using PreprocessingOptions Class
import flowprep_ml
from flowprep_ml import PreprocessingOptions
# Create options object
options = PreprocessingOptions(
    imputation_method='mean',
    scaling_method='standard',
    encoding_method='onehot',
    remove_outliers=True,
    outlier_method='iqr',
    test_size=0.2,
    random_state=42
)
# Use with preprocessing
result = flowprep_ml.preprocess('data.csv', **options.to_dict())
🔧 API Reference
Main Functions
preprocess(file_path, **kwargs)
Main preprocessing function.
Parameters:
- file_path (str or Path): Path to input file
- **kwargs: Preprocessing options
Returns:
dict: Preprocessing results containing:
- success (bool): Whether preprocessing succeeded
- original_shape (tuple): Original data shape
- processed_shape (tuple): Processed data shape
- train_shape (tuple): Training data shape
- test_shape (tuple): Test data shape
- output_path (str): Path to saved processed data
- preprocessing_log (list): Log of preprocessing steps
- options_used (dict): Options used for preprocessing
- train_data (DataFrame): Processed training data
- test_data (DataFrame): Processed test data
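As a quick illustration of how the returned keys fit together, the sketch below builds a short text summary from a result dictionary. The helper `summarize_result` is hypothetical, and the sample dictionary is hand-written with made-up values so the snippet runs without the library; only the key names come from the list above.

```python
def summarize_result(result):
    """Build a short text summary from a preprocess()-style result dict."""
    if not result.get("success", False):
        return "preprocessing failed"
    rows_in, cols_in = result["original_shape"]
    rows_out, cols_out = result["processed_shape"]
    lines = [
        f"rows: {rows_in} -> {rows_out}",
        f"columns: {cols_in} -> {cols_out}",
        f"train/test rows: {result['train_shape'][0]}/{result['test_shape'][0]}",
        f"steps logged: {len(result['preprocessing_log'])}",
    ]
    return "\n".join(lines)


# Hand-written stand-in for a real result; all values are illustrative only.
sample = {
    "success": True,
    "original_shape": (100, 4),
    "processed_shape": (95, 6),
    "train_shape": (76, 6),
    "test_shape": (19, 6),
    "preprocessing_log": ["imputed", "encoded", "scaled"],
}
print(summarize_result(sample))
```

Checking the success flag first avoids key errors if the other entries are absent on failure, which is an assumption about the failure case rather than documented behaviour.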
get_supported_formats()
Get list of supported file formats.
Returns:
list: List of supported file extensions
validate_file(file_path)
Check that the file exists and has a supported format.
Parameters:
- file_path (str or Path): Path to file
Returns:
bool: True if file is valid
Raises:
- FileNotFoundError: If file doesn't exist
- UnsupportedFileFormatError: If file format is not supported
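The contract described above (return True, or raise one of two exceptions) can be sketched with the standard library as follows. This is a stand-in written for illustration, not the library's code: `validate_file_sketch` and `UnsupportedFormat` are hypothetical names, the extension set is taken from the Supported File Formats section, and the order of the two checks is an assumption.

```python
from pathlib import Path

# Extensions listed under "Supported File Formats".
SUPPORTED = {".csv", ".xls", ".xlsx", ".xlsm"}


class UnsupportedFormat(ValueError):
    """Stand-in for the library's UnsupportedFileFormatError."""


def validate_file_sketch(file_path):
    """Return True if the file exists and has a supported extension."""
    path = Path(file_path)
    if not path.is_file():
        raise FileNotFoundError(f"no such file: {path}")
    if path.suffix.lower() not in SUPPORTED:
        raise UnsupportedFormat(f"unsupported format: {path.suffix!r}")
    return True


# Callers typically wrap the check so that a bad path fails early and clearly.
try:
    validate_file_sketch("does_not_exist.csv")
except FileNotFoundError as exc:
    print(f"rejected: {exc}")
```

Validating before calling preprocess() keeps path and format problems separate from errors raised during the preprocessing itself.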
Classes
PreprocessingOptions
Configuration class for preprocessing options.
Attributes:
- imputation_method (str): Method for handling missing values
- scaling_method (str): Method for scaling numerical features
- encoding_method (str): Method for encoding categorical variables
- remove_outliers (bool): Whether to remove outliers
- outlier_method (str): Method for outlier detection
- test_size (float): Fraction of data to use for testing
- random_state (int): Random seed for reproducibility
- output_format (str): Output file format
- save_processed (bool): Whether to save processed data
- output_path (str, optional): Custom output path
🛠️ Development
Installation for Development
git clone https://github.com/flowml/flowprep-ml.git
cd flowprep-ml
pip install -e .
Running Tests
pytest
Code Formatting
black flowprep_ml/
flake8 flowprep_ml/
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
📞 Support
- Documentation: https://flowprep-ml.readthedocs.io/
- Issues: https://github.com/flowml/flowprep-ml/issues
- Email: support@flowml.ai
🙏 Acknowledgments
- Built with pandas
- Powered by scikit-learn
- Inspired by the need for simple, powerful data preprocessing
Made with ❤️ by the Flow ML Team
File details
Details for the file flowprep_ml-1.0.0.tar.gz.
File metadata
- Download URL: flowprep_ml-1.0.0.tar.gz
- Upload date:
- Size: 16.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 86d7757a3013b8615df3d31b25c3f2ef21fd08873c4ea917aef85f060551ad47 |
| MD5 | 9745dca0f0a917953099cac7f09fccb8 |
| BLAKE2b-256 | 969f186cb04949e45e72ac11fb48e9ae40cc807aa6467bb2bedb3553864ce827 |
File details
Details for the file flowprep_ml-1.0.0-py3-none-any.whl.
File metadata
- Download URL: flowprep_ml-1.0.0-py3-none-any.whl
- Upload date:
- Size: 12.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fec15bc1c367f82b1e78b619af92b5db360c068f2e89911fe372c20358618046 |
| MD5 | 36d132d7e7e933523fea1bc91b76afb3 |
| BLAKE2b-256 | c414e6d90a0dc4b8ad255ed86cacd4298e78fe9b7fa1d57733f9c360c3c867c0 |