Dataset-agnostic ML classification library with visualization tools, Slack integration, and support for multiple pipelines.
Efficient Classifier
A comprehensive, dataset-agnostic machine learning framework for rapid development and deployment of classification pipelines on tabular data, with built-in DevOps tooling.
Table of Contents
- Overview
- Key Features
- Architecture
- Installation
- Quick Start
- Configuration
- Supported Models & Metrics
- Use Cases
- Performance
- Contributing
- Documentation
- License
Overview
Efficient Classifier is an enterprise-grade machine learning framework designed to accelerate the development lifecycle from data preprocessing to model deployment. Built with scalability and reproducibility in mind, it provides a unified interface for experimenting with multiple classification pipelines while maintaining rigorous tracking of experiments and results.
The framework supports both binary and multiclass classification tasks and has been validated on real-world datasets, including the CCCS-CIC-AndMal-2020 cybersecurity dataset, where it achieved a 92% F1-score.
Research & Validation
Our framework has been applied to cutting-edge cybersecurity research:
- Research Paper - CCCS-CIC-AndMal-2020 Analysis
- Complete Results - Plots, logs, and execution history
- Technical Report - Methodology and findings
- EDA Notebook - Exploratory data analysis
- Presentation - Project overview
Key Features
Rapid Pipeline Development
- Multi-pipeline orchestration with customizable architectures
- Zero-boilerplate configuration through YAML
- Automated hyperparameter optimization (Grid, Random, Bayesian)
- One-command execution from data to deployment
Advanced Analytics & Visualization
- Comprehensive residual analysis and confusion matrices
- LIME-based feature importance with permutation testing
- Model calibration with reliability diagrams
- Cross-validation with stratified sampling
- Real-time training progress monitoring
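Stratified sampling keeps the class proportions roughly equal in every cross-validation fold. The following is a minimal standard-library sketch of the idea, not the library's implementation; `stratified_folds` is a hypothetical helper name, and a real run would normally shuffle the indices with a fixed seed first.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds):
    """Assign sample indices to folds so each fold mirrors the class mix."""
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal each class's indices round-robin across the folds.
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


labels = [0, 0, 0, 0, 0, 0, 1, 1, 1]  # imbalanced toy labels (2:1)
for fold in stratified_folds(labels, n_folds=3):
    print(fold, [labels[i] for i in fold])
```

Each of the three folds receives two samples of class 0 and one of class 1, matching the 2:1 ratio of the full label list.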
Production-Ready DevOps
- Slack bot integration for real-time notifications
- Automated DAG visualization of pipeline architectures
- Model serialization with joblib/pickle support
- Comprehensive logging and experiment tracking
- Built-in testing framework integration
High-Performance Computing
- Multithreaded processing where parallelization is beneficial
- Memory-efficient data handling for large datasets
- Optimized feature selection algorithms (Boruta, L1 regularization)
- Smart caching mechanisms for repeated operations
Architecture
The framework follows a modular, stage-based architecture:
┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  Data Loading   │ -> │  Preprocessing   │ -> │ Feature Analysis │
└─────────────────┘    └──────────────────┘    └──────────────────┘
                                                        |
┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│     DevOps      │ <- │     Modeling     │ <- │    Evaluation    │
└─────────────────┘    └──────────────────┘    └──────────────────┘
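A stage-based design like this can be sketched as a chain of functions that each receive and return a shared state dictionary. The stage functions and the `run_stages` helper below are hypothetical names chosen for illustration; they are not the library's API.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Stage = Callable[[State], State]


def load_data(state: State) -> State:
    # Hypothetical toy data: (feature, label) pairs.
    state["rows"] = [(2.0, 0), (4.0, 1), (8.0, 1)]
    return state


def preprocess(state: State) -> State:
    # Scale the feature into [0, 1] by dividing by the largest value.
    largest = max(feature for feature, _ in state["rows"])
    state["rows"] = [(feature / largest, label) for feature, label in state["rows"]]
    return state


def run_stages(stages: List[Stage], state: State) -> State:
    """Run each stage in order and record the order of execution."""
    for stage in stages:
        state = stage(state)
        state.setdefault("history", []).append(stage.__name__)
    return state


result = run_stages([load_data, preprocess], {})
print(result["history"])  # ['load_data', 'preprocess']
print(result["rows"])     # [(0.25, 0), (0.5, 1), (1.0, 1)]
```

Keeping every stage behind the same signature is what makes it cheap to swap or reorder stages between pipelines.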
Stage-Specific Capabilities
| Stage | Capability | Description |
|---|---|---|
| Data Management | Smart Splitting | Adaptive train/validation/test splits with distribution analysis |
| Data Management | Distribution Validation | Statistical tests ensuring consistent feature distributions across splits |
| Preprocessing | Advanced Encoding | One-hot encoding with automatic categorical detection |
| Preprocessing | Intelligent Imputation | Multiple strategies for handling missing values |
| Preprocessing | Outlier Detection | IQR and percentile-based detection with configurable treatment |
| Preprocessing | Robust Scaling | StandardScaler, RobustScaler, and MinMaxScaler support |
| Preprocessing | Class Balancing | SMOTE and ADASYN implementations for imbalanced datasets |
| Feature Engineering | Automated Selection | Mutual information, variance filtering, and multicollinearity detection |
| Feature Engineering | Advanced Techniques | Boruta feature selection and L1 regularization |
| Feature Engineering | Custom Engineering | Dataset-specific feature creation hooks |
| Modeling | Ensemble Methods | Stacked generalization with configurable base learners |
| Modeling | Neural Networks | Feed-forward architectures with epoch-wise monitoring |
| Modeling | Model Comparison | Cross-model evaluation with statistical significance testing |
| DevOps | Real-time Monitoring | Slack integration for training progress and alerts |
| DevOps | Experiment Tracking | Comprehensive CSV logging with metadata |
| DevOps | Visualization | Automated DAG generation for pipeline architecture |
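As an illustration of the IQR-based outlier detection listed in the table, here is a minimal standard-library sketch. The helper names and the conventional multiplier `k=1.5` are assumptions; the framework's own thresholds and treatment options are set through its configuration and may differ.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return one boolean per value: True if it falls outside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [value < low or value > high for value in values]


feature = [10, 12, 11, 13, 12, 11, 95]
print(iqr_bounds(feature))     # (8.0, 16.0)
print(flag_outliers(feature))  # only the last value (95) is flagged
```

What to do with a flagged value (drop, clip to the fence, or keep) is a separate decision, which is why the sketch only returns a mask.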
Installation
PyPI Installation (Recommended)
pip install efficient-classifier
Development Installation
git clone https://github.com/javidsegura/efficient-classifier.git
cd efficient-classifier
pip install -r requirements.txt
Environment Setup
For Slack bot integration, create a .env file:
SLACK_BOT_TOKEN=your_bot_token
SLACK_SIGNING_SECRET=your_signing_secret
SLACK_APP_TOKEN=your_app_token
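Before starting a long run, it can help to confirm that the three Slack variables are actually set. The sketch below parses a simple `KEY=value` file with the standard library only; `read_env_file` and `missing_slack_keys` are hypothetical helpers, and how the library itself loads the `.env` file is not shown in this document.

```python
from pathlib import Path

REQUIRED_KEYS = ("SLACK_BOT_TOKEN", "SLACK_SIGNING_SECRET", "SLACK_APP_TOKEN")


def read_env_file(path):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_slack_keys(values):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if Path(".env").exists():
    missing = missing_slack_keys(read_env_file(".env"))
    print("Missing Slack settings:", missing or "none")
```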
Quick Start
Basic Usage
from efficient_classifier import PipelineManager
# Initialize with configuration
manager = PipelineManager('configurations.yaml')
# Execute complete pipeline
results = manager.run_all_pipelines()
# Access best model
best_model = results.get_best_model()
predictions = best_model.predict(X_test)
Custom Dataset Integration
- Configure dataset-specific cleaning in pipeline_runner.py:
def _clean_dataset_set_up_dataset_specific(self, df):
    # Your custom preprocessing logic goes here (df is assumed to be a pandas DataFrame)
    cleaned_df = df.copy()
    return cleaned_df
- Implement feature engineering in featureAnalysis_runner.py:
def _run_feature_engineering_dataset_specific(self, df):
    # Your custom feature engineering goes here (df is assumed to be a pandas DataFrame)
    engineered_df = df.copy()
    return engineered_df
- Update boundary conditions in bound_config.py for data validation.
Configuration
The framework uses a comprehensive YAML configuration system. Key configuration sections:
Pipeline Definition
general:
  pipelines_names: ["baseline", "advanced", "ensemble"]
  max_plots_per_function: 10  # Control visualization output
Data Processing
phase_runners:
  dataset_runners:
    split_df:
      p: [0.7, 0.8, 0.9]  # Split ratios to evaluate
      step: 0.05          # Granularity of split analysis
    encoding:
      y_column: "target"  # Target variable name
Model Configuration
modelling_runner:
  class_weights:
    weights: {0: 1.0, 1: 2.0}  # Handle class imbalance
  models_to_include:
    baseline: ["Random Forest", "Logistic Regression"]
    advanced: ["XGBoost", "Neural Network"]
  optimization:
    method: "bayesian"  # grid, random, or bayesian
    cv_folds: 5
For complete configuration options, see the detailed documentation.
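A quick consistency check on the parsed configuration can catch mismatches between sections before a run starts. The sketch below works on a plain dictionary that mirrors the snippets above; the exact nesting of `modelling_runner` in the real file is an assumption, and `pipelines_without_models` is a hypothetical helper.

```python
# Dictionary mirroring the YAML fragments above (nesting is assumed).
config = {
    "general": {"pipelines_names": ["baseline", "advanced", "ensemble"]},
    "modelling_runner": {
        "models_to_include": {
            "baseline": ["Random Forest", "Logistic Regression"],
            "advanced": ["XGBoost", "Neural Network"],
        }
    },
}


def pipelines_without_models(cfg):
    """Return declared pipeline names that have no models assigned."""
    declared = cfg["general"]["pipelines_names"]
    included = cfg["modelling_runner"]["models_to_include"]
    return [name for name in declared if not included.get(name)]


print(pipelines_without_models(config))  # ['ensemble']
```

With the example values, the "ensemble" pipeline is declared but has no entry under `models_to_include`, so a complete configuration would need one.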
Supported Models & Metrics
Machine Learning Models
Tree-Based Algorithms:
- Random Forest, Decision Trees, Gradient Boosting
- XGBoost, LightGBM, CatBoost
- AdaBoost with configurable base estimators
Linear Models:
- Logistic Regression, Ridge Classifier
- Linear/Non-linear SVM, SGD Classifier
- Elastic Net with L1/L2 regularization
Advanced Methods:
- Feed-Forward Neural Networks
- Ensemble Stacking (meta-learning)
- K-Nearest Neighbors, Gaussian Naive Bayes
Baseline Models:
- Majority Class Classifier for benchmarking
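The majority-class baseline is useful because it shows what accuracy can be reached without learning anything from the features. The class below is a minimal sketch with a scikit-learn style `fit`/`predict` interface; it is an illustration under that assumption, not the framework's own baseline class.

```python
from collections import Counter


class MajorityClassBaseline:
    """Always predicts the most frequent label seen during fit."""

    def fit(self, X, y):
        # most_common(1) returns [(label, count)] for the top label.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


X_train = [[0.1], [0.4], [0.35], [0.8]]
y_train = [0, 0, 0, 1]
baseline = MajorityClassBaseline().fit(X_train, y_train)
predictions = baseline.predict(X_train)
accuracy = sum(p == t for p, t in zip(predictions, y_train)) / len(y_train)
print(predictions, accuracy)  # [0, 0, 0, 0] 0.75
```

On imbalanced data this baseline already scores high on accuracy, which is one reason the metrics list below goes beyond plain accuracy.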
Evaluation Metrics
- Classification Accuracy - Overall correctness
- Precision, Recall, F1-Score - Class-specific performance
- Cohen's Kappa - Agreement between predictions and true labels, corrected for chance agreement
- Weighted Accuracy - Class-imbalance adjusted accuracy
- ROC-AUC - Area under the receiver operating characteristic curve
- Calibration Metrics - Reliability diagrams and Brier score
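Two of the less familiar metrics are short enough to write out. The sketch below computes Cohen's kappa as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance, and the Brier score as the mean squared difference between predicted probabilities and binary outcomes. These are standard definitions written in plain Python for illustration; the framework presumably relies on library implementations.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Chance-corrected agreement between two label sequences."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(
        (true_counts[label] / n) * (pred_counts[label] / n)
        for label in set(y_true) | set(y_pred)
    )
    if expected == 1.0:
        # Kappa is undefined here; returning 0.0 is a choice made for this sketch.
        return 0.0
    return (observed - expected) / (1 - expected)


def brier_score(y_true, probabilities):
    """Mean squared error of predicted positive-class probabilities."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probabilities)) / len(y_true)


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(cohen_kappa(y_true, y_pred))  # 0.5 (observed 0.75, expected 0.5)
print(brier_score([1, 0], [0.9, 0.2]))
```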
Adding Custom Models
Extend model support by modifying modelling_runner.py:
def _model_initializers(self):
    models = {
        # Existing models...
        "Custom Model": YourCustomClassifier(
            param1=self.config['custom_param']
        ),
    }
    return models
Use Cases
MANTIS: Cybersecurity Threat Detection
Our flagship application demonstrates the framework's capabilities in cybersecurity:
- Dataset: CCCS-CIC-AndMal-2020 (Android malware detection)
- Performance: 92% F1-score with Random Forest + Stacking ensemble
- Scale: 200,000+ samples with 464 features
- Deployment: Production-ready model with 15 ms inference time
Key Results:
- Outperformed baseline approaches by 23%
- Identified 847 critical features through automated selection
- Achieved 99.1% precision for malware detection
Benchmark Datasets
Titanic Survival Prediction: View Results
- 89.3% accuracy with ensemble methods
- Comprehensive feature engineering pipeline
Iris Classification: View Results
- 97.8% accuracy across all pipeline configurations
- Validation of multi-class capabilities
Performance
Benchmarks
| Dataset | Samples | Features | Best Model | F1-Score | Training Time |
|---|---|---|---|---|---|
| CCCS-CIC-AndMal-2020 | 200K+ | 464 | Random Forest | 92.0% | 45 min |
| Titanic | 891 | 12 | Stacking Ensemble | 89.3% | 2 min |
| Iris | 150 | 4 | Neural Network | 97.8% | 30 sec |
Optimization Features
- Memory Management: Efficient handling of datasets with 1M+ rows
- Parallel Processing: Multi-core utilization for independent operations
- Early Stopping: Automatic convergence detection for iterative algorithms
- Caching: Intelligent result caching for repeated experiments
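Early stopping is commonly implemented with a patience window: training stops once the validation loss has not improved for a set number of epochs. The function below is a generic sketch of that rule; the name `should_stop` and the `patience` and `min_delta` parameters are assumptions, not the framework's configuration keys.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """Return True when the last `patience` epochs brought no improvement.

    An improvement means beating the best earlier loss by more than min_delta.
    """
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best >= best_before - min_delta


plateau = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63]
improving = [0.90, 0.80, 0.70, 0.60, 0.50]
print(should_stop(plateau))    # True: no epoch after 0.60 improved on it
print(should_stop(improving))  # False: the loss is still falling
```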
Model Deployment
Serialization & Inference
# Save trained pipeline
manager.serialize_model(best_pipeline, 'production_model.pkl')
# Load for inference
loaded_model = manager.load_model('production_model.pkl')
# Production predictions
predictions = loaded_model.model_sklearn.predict(X_new)
probabilities = loaded_model.model_sklearn.predict_proba(X_new)
Production Integration
The serialized models contain:
- Trained sklearn estimator objects
- Complete preprocessing pipelines
- Feature engineering transformations
- Model metadata and performance metrics
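Bundling the estimator with its preprocessing and metadata in one file keeps inference consistent with training. The sketch below shows the round trip with the standard-library `pickle` module and a plain dictionary; the bundle keys and the placeholder model are hypothetical, and the layout of the files written by `serialize_model` may differ.

```python
import os
import pickle
import tempfile


def save_bundle(bundle, path):
    """Write the whole bundle to a single pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(bundle, handle)


def load_bundle(path):
    """Load a bundle. Only unpickle files from a source you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


bundle = {
    "model": {"kind": "placeholder", "majority_class": 0},  # stands in for an estimator
    "preprocessing": ["one_hot_encoding", "standard_scaling"],
    "metadata": {"f1_score": 0.92, "n_features": 464},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "production_model.pkl")
    save_bundle(bundle, model_path)
    restored = load_bundle(model_path)

print(restored == bundle)  # True
```

Unpickling executes code embedded in the file, so serialized models should only be loaded from trusted locations.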
Monitoring & Visualization
Real-Time Notifications
Slack bot provides real-time updates on training progress, model performance, and system alerts.
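For readers who want to send a similar notification from their own scripts, the sketch below builds a request for Slack's `chat.postMessage` Web API method using only the standard library. It is not the library's bot, whose internals are not shown in this document; the channel name and message text are hypothetical, and the token is assumed to come from `SLACK_BOT_TOKEN`.

```python
import json
import os
import urllib.request


def build_slack_request(text, channel, token):
    """Build (but do not send) a chat.postMessage request."""
    payload = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        "https://slack.com/api/chat.postMessage",
        data=payload,
        headers={
            "Content-Type": "application/json; charset=utf-8",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_slack_request(
    text="Pipeline 'baseline' finished training",
    channel="#ml-training",  # hypothetical channel name
    token=os.environ.get("SLACK_BOT_TOKEN", "xoxb-placeholder"),
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```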
Pipeline Visualization
Automatically generated DAG visualization showing pipeline architecture, data flow, and performance metrics.
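A pipeline DAG can be described in Graphviz DOT text, which many tools render as a diagram. The sketch below emits DOT for the stage order from the Architecture section; `to_dot` is a hypothetical helper, and the format the framework actually generates is not specified here.

```python
def to_dot(edges, name="pipeline"):
    """Render (source, target) pairs as a Graphviz DOT digraph."""
    lines = [f"digraph {name} {{"]
    for source, target in edges:
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


stage_edges = [
    ("Data Loading", "Preprocessing"),
    ("Preprocessing", "Feature Analysis"),
    ("Feature Analysis", "Evaluation"),
    ("Evaluation", "Modeling"),
    ("Modeling", "DevOps"),
]
dot_text = to_dot(stage_edges)
print(dot_text)
```

If Graphviz is installed, saving the output to a file and running `dot -Tpng` on it produces an image.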
Roadmap & Known Limitations
Upcoming Features
- Multi-label Classification Support
- Cyclical Feature Encoding for temporal data
- Cloud Deployment Integration (AWS, GCP, Azure)
- Docker Containerization for production deployment
- Advanced AutoML Capabilities with neural architecture search
Current Limitations
- Missing Value Handling: Assumes preprocessed data (manual handling required)
- Grid Search Configuration: Complex setup process for new parameter spaces
- Stacking Visualization: Not included in DAG visualization
- Per-Pipeline Feature Selection: Currently uses unified feature selection
Performance Considerations
Operations marked as 'MAJOR IMPACT IN PERFORMANCE' in configuration:
- Bayesian optimization with large parameter spaces
- Neural network training with extensive architectures
- Cross-validation with high fold counts
- Feature selection on high-dimensional datasets
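For grid search in particular, the cost is easy to estimate up front: the number of model fits is the product of the option counts for each parameter, multiplied by the number of cross-validation folds. The parameter grid below is a hypothetical example, not one shipped with the framework.

```python
from math import prod


def grid_search_fits(param_grid, cv_folds):
    """Number of model fits an exhaustive grid search performs."""
    combinations = prod(len(options) for options in param_grid.values())
    return combinations * cv_folds


param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}
print(grid_search_fits(param_grid, cv_folds=5))  # 3 * 3 * 2 * 5 = 90
```

Adding one more parameter with three options would triple the count, which is why large parameter spaces and high fold counts are flagged together.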
Contributing
We welcome contributions from the community! Please follow these guidelines:
Development Process
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Commit changes: git commit -m 'Add amazing feature'
- Push to branch: git push origin feature/amazing-feature
- Open a Pull Request
Code Standards
- Follow PEP 8 style guidelines
- Include comprehensive docstrings
- Add unit tests for new features
- Update documentation for API changes
Pull Request Template
Please include:
- Description of changes and motivation
- Testing performed and results
- Breaking Changes if applicable
- Documentation updates
Documentation
Comprehensive Guides
- Library Architecture - Design decisions and implementation details
- API Reference - Complete function and class documentation
- Configuration Guide - YAML parameter explanations
- Troubleshooting - Common issues and solutions
Research Publications
Access our peer-reviewed research and detailed technical reports through the links provided in the Research & Validation section.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use this framework in your research, please cite:
@software{efficient_classifier_2024,
  title={Efficient Classifier: A Dataset-Agnostic ML Framework},
  author={Javier D. and Caterina B. and Juan A. and Federica C. and Irina I. and Juliette J.},
  year={2025},
  url={https://github.com/javidsegura/efficient-classifier}
}
Acknowledgments
- Built with scikit-learn, XGBoost, and other open-source ML libraries
- Validated on datasets from the Canadian Centre for Cyber Security
- Community contributors and beta testers
Ready to accelerate your ML workflow? Install via pip install efficient-classifier and check out our Quick Start Guide.