Dataset-agnostic ML classification library with visualization tools, Slack integration, and support for multiple pipelines.

Efficient Classifier

PyPI version Python 3.7+ License: MIT

A comprehensive, dataset-agnostic machine learning framework for rapid development and deployment of classification pipelines on tabular data, with built-in DevOps tooling.

Overview

Efficient Classifier is an enterprise-grade machine learning framework designed to accelerate the development lifecycle from data preprocessing to model deployment. Built with scalability and reproducibility in mind, it provides a unified interface for experimenting with multiple classification pipelines while maintaining rigorous tracking of experiments and results.

The framework supports both binary and multiclass classification tasks and has been validated on real-world datasets, including the CCCS-CIC-AndMal-2020 cybersecurity dataset, where it achieved a 92% F1-score.

Research & Validation

Our framework has been applied to cutting-edge cybersecurity research:

Key Features

🚀 Rapid Pipeline Development

  • Multi-pipeline orchestration with customizable architectures
  • Zero-boilerplate configuration through YAML
  • Automated hyperparameter optimization (Grid, Random, Bayesian)
  • One-command execution from data to deployment

🔬 Advanced Analytics & Visualization

  • Comprehensive residual analysis and confusion matrices
  • LIME-based feature importance with permutation testing
  • Model calibration with reliability diagrams
  • Cross-validation with stratified sampling
  • Real-time training progress monitoring
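The reliability diagrams mentioned above are built by binning predicted probabilities and comparing each bin's mean prediction with the observed positive rate. The following is a minimal standard-library sketch of that binning for a binary task; the function name `reliability_bins` and the sample data are illustrative assumptions, not part of the framework's API.

```python
from typing import List, Tuple


def reliability_bins(
    y_true: List[int], y_prob: List[float], n_bins: int = 5
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted probability, observed positive rate, count)
    for every non-empty, equal-width probability bin."""
    # Per bin: [sum of probabilities, number of positives, sample count]
    sums = [[0.0, 0, 0] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        # A probability of exactly 1.0 is placed in the last bin
        idx = min(int(prob * n_bins), n_bins - 1)
        sums[idx][0] += prob
        sums[idx][1] += label
        sums[idx][2] += 1
    return [(s[0] / s[2], s[1] / s[2], s[2]) for s in sums if s[2] > 0]


# Illustrative data only (hypothetical labels and predicted probabilities)
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.3, 0.7, 0.9, 0.8, 0.6]
bins = reliability_bins(y_true, y_prob, n_bins=2)
for mean_prob, pos_rate, count in bins:
    print(f"predicted={mean_prob:.2f} observed={pos_rate:.2f} n={count}")
```

A well-calibrated model has bins whose observed rate is close to the mean predicted probability, so plotting the two against each other gives points near the diagonal.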

🛠 Production-Ready DevOps

  • Slack bot integration for real-time notifications
  • Automated DAG visualization of pipeline architectures
  • Model serialization with joblib/pickle support
  • Comprehensive logging and experiment tracking
  • Built-in testing framework integration

⚡ High-Performance Computing

  • Multithreaded processing where parallelization is beneficial
  • Memory-efficient data handling for large datasets
  • Optimized feature selection algorithms (Boruta, L1 regularization)
  • Smart caching mechanisms for repeated operations

Architecture

The framework follows a modular, stage-based architecture:

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Data Loading  │ -> │  Preprocessing   │ -> │ Feature Analysis│
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                 |
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│     DevOps      │ <- │    Modeling      │ <- │   Evaluation    │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Stage-Specific Capabilities

| Stage | Capability | Description |
|---|---|---|
| Data Management | Smart Splitting | Adaptive train/validation/test splits with distribution analysis |
| | Distribution Validation | Statistical tests ensuring consistent feature distributions across splits |
| Preprocessing | Advanced Encoding | One-hot encoding with automatic categorical detection |
| | Intelligent Imputation | Multiple strategies for handling missing values |
| | Outlier Detection | IQR and percentile-based detection with configurable treatment |
| | Robust Scaling | StandardScaler, RobustScaler, and MinMaxScaler support |
| | Class Balancing | SMOTE and ADASYN implementations for imbalanced datasets |
| Feature Engineering | Automated Selection | Mutual information, variance filtering, and multicollinearity detection |
| | Advanced Techniques | Boruta feature selection and L1 regularization |
| | Custom Engineering | Dataset-specific feature creation hooks |
| Modeling | Ensemble Methods | Stacked generalization with configurable base learners |
| | Neural Networks | Feed-forward architectures with epoch-wise monitoring |
| | Model Comparison | Cross-model evaluation with statistical significance testing |
| DevOps | Real-time Monitoring | Slack integration for training progress and alerts |
| | Experiment Tracking | Comprehensive CSV logging with metadata |
| | Visualization | Automated DAG generation for pipeline architecture |
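To make the IQR-based outlier detection listed under Preprocessing concrete, here is a small sketch using only the standard library (Python 3.8+). It flags values outside `[Q1 - k*IQR, Q3 + k*IQR]` and shows clipping as one possible treatment. The helper names, the `k = 1.5` default, and the sample data are assumptions for illustration; the framework's own implementation and defaults may differ.

```python
import statistics
from typing import List, Tuple


def iqr_bounds(values: List[float], k: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_outliers(values: List[float], k: float = 1.5) -> List[float]:
    """One possible treatment: clip values to the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [min(max(v, low), high) for v in values]


# Illustrative feature column with one extreme value
data = [10, 12, 11, 13, 12, 11, 95]
low, high = iqr_bounds(data)
outliers = [v for v in data if v < low or v > high]
print(f"fences=({low}, {high}) outliers={outliers}")
print(clip_outliers(data))
```

Clipping keeps the row count unchanged, which matters when the target column must stay aligned; dropping rows is the other common treatment.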

Installation

PyPI Installation (Recommended)

pip install efficient-classifier

Development Installation

git clone https://github.com/javidsegura/efficient-classifier.git
cd efficient-classifier
pip install -r requirements.txt

Environment Setup

For Slack bot integration, create a .env file:

SLACK_BOT_TOKEN=your_bot_token
SLACK_SIGNING_SECRET=your_signing_secret
SLACK_APP_TOKEN=your_app_token
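Before starting a run it can help to check that all three Slack variables are present. The sketch below parses `.env`-style text with the standard library and reports missing keys. The helpers `parse_env_text` and `missing_slack_keys` are hypothetical and not part of the package; how the framework itself loads the `.env` file is not shown in this document.

```python
from typing import Dict, List

REQUIRED_KEYS = ("SLACK_BOT_TOKEN", "SLACK_SIGNING_SECRET", "SLACK_APP_TOKEN")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_slack_keys(env: Dict[str, str]) -> List[str]:
    """Return the required Slack keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Example: a .env file that forgot the app token
sample = (
    "SLACK_BOT_TOKEN=your_bot_token\n"
    "SLACK_SIGNING_SECRET=your_signing_secret\n"
)
env = parse_env_text(sample)
print(missing_slack_keys(env))
```

In practice the text would come from reading the `.env` file in the project root rather than from an inline string.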

Quick Start

Basic Usage

from efficient_classifier import PipelineManager

# Initialize with configuration
manager = PipelineManager('configurations.yaml')

# Execute complete pipeline
results = manager.run_all_pipelines()

# Access best model
best_model = results.get_best_model()
# X_test is your own held-out feature matrix, prepared like the training data
predictions = best_model.predict(X_test)

Custom Dataset Integration

  1. Configure dataset-specific cleaning in pipeline_runner.py:
def _clean_dataset_set_up_dataset_specific(self, df):
    # Your custom preprocessing logic; work on a copy so the input is untouched
    cleaned_df = df.copy()
    return cleaned_df
  2. Implement feature engineering in featureAnalysis_runner.py:
def _run_feature_engineering_dataset_specific(self, df):
    # Your custom feature engineering, e.g. adding derived columns
    engineered_df = df.copy()
    return engineered_df
  3. Update boundary conditions in bound_config.py for data validation.

Configuration

The framework uses a comprehensive YAML configuration system. Key configuration sections:

Pipeline Definition

general:
  pipelines_names: ["baseline", "advanced", "ensemble"]
  max_plots_per_function: 10  # Control visualization output

Data Processing

phase_runners:
  dataset_runners:
    split_df:
      p: [0.7, 0.8, 0.9]  # Split ratios to evaluate
      step: 0.05          # Granularity of split analysis
    encoding:
      y_column: "target"  # Target variable name
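As a rough illustration of what evaluating several split ratios involves, the sketch below performs a stratified split for each value of `p`, under the assumption that `p` is the fraction of samples assigned to training. The function `stratified_split` and the toy labels are hypothetical; the framework's own splitting and distribution analysis are not reproduced here.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable], p: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_idx, test_idx) with roughly a fraction p of each class in train."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(p * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy imbalanced target: 80 negatives, 20 positives
labels = [0] * 80 + [1] * 20
for p in (0.7, 0.8, 0.9):
    train_idx, test_idx = stratified_split(labels, p)
    positive_rate = sum(labels[i] for i in train_idx) / len(train_idx)
    print(f"p={p}: train={len(train_idx)} test={len(test_idx)} "
          f"train positive rate={positive_rate:.2f}")
```

Because each class is split separately, the positive rate in the training set stays at the overall rate for every ratio, which is the property a distribution check across splits would look for.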

Model Configuration

modelling_runner:
  class_weights:
    weights: {0: 1.0, 1: 2.0}  # Handle class imbalance
  models_to_include:
    baseline: ["Random Forest", "Logistic Regression"]
    advanced: ["XGBoost", "Neural Network"]
  optimization:
    method: "bayesian"  # grid, random, or bayesian
    cv_folds: 5

For complete configuration options, see the detailed documentation.
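The `method` setting changes how candidate hyperparameters are chosen. The sketch below contrasts the first two options: grid search enumerates every combination, while random search samples a fixed number of them. Bayesian optimization, which picks candidates based on earlier results, is not shown. The parameter space and helper names are illustrative assumptions, not the framework's configuration schema.

```python
import itertools
import random
from typing import Any, Dict, List

# Hypothetical search space for a tree ensemble
param_space: Dict[str, List[Any]] = {
    "n_estimators": [100, 200, 400],
    "max_depth": [4, 8, None],
}


def grid_candidates(space: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Enumerate the full Cartesian product of the parameter values."""
    keys = list(space)
    combos = itertools.product(*(space[key] for key in keys))
    return [dict(zip(keys, combo)) for combo in combos]


def random_candidates(
    space: Dict[str, List[Any]], n_iter: int, seed: int = 0
) -> List[Dict[str, Any]]:
    """Sample n_iter candidates independently (duplicates are possible)."""
    rng = random.Random(seed)
    return [{key: rng.choice(values) for key, values in space.items()}
            for _ in range(n_iter)]


grid = grid_candidates(param_space)
sampled = random_candidates(param_space, n_iter=4)
cv_folds = 5
print(f"grid: {len(grid)} candidates, {len(grid) * cv_folds} model fits")
print(f"random: {len(sampled)} candidates, {len(sampled) * cv_folds} model fits")
```

The fit counts show why grid search becomes expensive quickly: every added parameter multiplies the number of candidates, and each candidate is trained once per fold.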

Supported Models & Metrics

Machine Learning Models

Tree-Based Algorithms:

  • Random Forest, Decision Trees, Gradient Boosting
  • XGBoost, LightGBM, CatBoost
  • AdaBoost with configurable base estimators

Linear Models:

  • Logistic Regression, Ridge Classifier
  • Linear/Non-linear SVM, SGD Classifier
  • Elastic Net with L1/L2 regularization

Advanced Methods:

  • Feed-Forward Neural Networks
  • Ensemble Stacking (meta-learning)
  • K-Nearest Neighbors, Gaussian Naive Bayes

Baseline Models:

  • Majority Class Classifier for benchmarking
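A majority-class baseline always predicts the most frequent training label, which gives the floor any real model should beat. The class below is a minimal sketch of that idea with a `fit`/`predict` interface; `MajorityClassBaseline` is a hypothetical name, and scikit-learn's `DummyClassifier(strategy="most_frequent")` provides the same behavior if you prefer a library implementation.

```python
from collections import Counter
from typing import Hashable, List, Sequence


class MajorityClassBaseline:
    """Predicts the most frequent label seen during fit, ignoring features."""

    def fit(self, X: Sequence, y: Sequence[Hashable]) -> "MajorityClassBaseline":
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X: Sequence) -> List[Hashable]:
        return [self.majority_] * len(X)


# Toy data: three samples of class 0, one of class 1
X_train = [[0.1], [0.4], [0.5], [0.9]]
y_train = [0, 0, 0, 1]

baseline = MajorityClassBaseline().fit(X_train, y_train)
preds = baseline.predict([[0.2], [0.8]])
accuracy = sum(
    p == t for p, t in zip(baseline.predict(X_train), y_train)
) / len(y_train)
print(preds, accuracy)
```

On imbalanced data this baseline can reach high accuracy while never detecting the minority class, which is why the metrics below go beyond plain accuracy.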

Evaluation Metrics

  • Classification Accuracy - Overall correctness
  • Precision, Recall, F1-Score - Class-specific performance
  • Cohen's Kappa - Agreement between predicted and true labels, corrected for chance
  • Weighted Accuracy - Class-imbalance adjusted accuracy
  • ROC-AUC - Area under receiver operating characteristic
  • Calibration Metrics - Reliability diagrams and Brier score
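Two of the metrics above are easy to compute by hand, which helps when checking reported numbers. The sketch below implements Cohen's kappa and the binary Brier score from their standard definitions using only the standard library; the function names and toy data are illustrative, and the framework's own metric code is not shown here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """kappa = (p_o - p_e) / (1 - p_e); undefined when p_e equals 1."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Agreement expected by chance from the two marginal label distributions
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Toy binary example
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
y_prob = [0.9, 0.4, 0.2, 0.1]

kappa = cohen_kappa(y_true, y_pred)
brier = brier_score(y_true, y_prob)
print(f"kappa={kappa:.3f} brier={brier:.3f}")
```

Here three of four predictions agree (observed agreement 0.75), but half of that agreement is expected by chance, so kappa is lower than accuracy. A lower Brier score indicates better-calibrated probabilities.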

Adding Custom Models

Extend model support by modifying modelling_runner.py:

def _model_initializers(self):
    models = {
        # Existing models...
        "Custom Model": YourCustomClassifier(
            param1=self.config['custom_param']
        )
    }
    return models

Use Cases

MANTIS: Cybersecurity Threat Detection

Our flagship application demonstrates the framework's capabilities in cybersecurity:

  • Dataset: CCCS-CIC-AndMal-2020 (Android malware detection)
  • Performance: 92% F1-score with Random Forest + Stacking ensemble
  • Scale: 200,000+ samples with 464 features
  • Deployment: Production-ready model with 15ms inference time

Key Results:

  • Outperformed baseline approaches by 23%
  • Identified 847 critical features through automated selection
  • Achieved 99.1% precision for malware detection

Benchmark Datasets

Titanic Survival Prediction: View Results

  • 89.3% accuracy with ensemble methods
  • Comprehensive feature engineering pipeline

Iris Classification: View Results

  • 97.8% accuracy across all pipeline configurations
  • Validation of multi-class capabilities

Performance

Benchmarks

| Dataset | Samples | Features | Best Model | F1-Score | Training Time |
|---|---|---|---|---|---|
| CCCS-CIC-AndMal-2020 | 200K+ | 464 | Random Forest | 92.0% | 45 min |
| Titanic | 891 | 12 | Stacking Ensemble | 89.3% | 2 min |
| Iris | 150 | 4 | Neural Network | 97.8% | 30 sec |

Optimization Features

  • Memory Management: Efficient handling of datasets up to 1M+ rows
  • Parallel Processing: Multi-core utilization for independent operations
  • Early Stopping: Automatic convergence detection for iterative algorithms
  • Caching: Intelligent result caching for repeated experiments
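Early stopping is usually implemented with a patience counter: training halts once the validation loss has failed to improve for a set number of consecutive epochs. The sketch below shows that rule in isolation. The function name, the `patience` and `min_delta` parameters, and the loss values are assumptions for illustration; the framework's actual convergence criteria are not documented here.

```python
from typing import Sequence


def early_stopping_epoch(
    val_losses: Sequence[float], patience: int = 2, min_delta: float = 0.0
) -> int:
    """Return the index of the epoch at which training would stop.

    Stops after `patience` consecutive epochs without an improvement of
    more than `min_delta`; otherwise returns the index of the last epoch.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical validation losses per epoch
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.60]
stop_at = early_stopping_epoch(losses, patience=2)
print(f"stopped at epoch index {stop_at}")
```

With `patience=2` the run stops at index 4 and never sees the later improvement at index 5, which illustrates the trade-off: a small patience saves time but can stop before a late gain.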

Model Deployment

Serialization & Inference

# Save trained pipeline
manager.serialize_model(best_pipeline, 'production_model.pkl')

# Load for inference
loaded_model = manager.load_model('production_model.pkl')

# Production predictions
predictions = loaded_model.model_sklearn.predict(X_new)
probabilities = loaded_model.model_sklearn.predict_proba(X_new)

Production Integration

The serialized models contain:

  • Trained sklearn estimator objects
  • Complete preprocessing pipelines
  • Feature engineering transformations
  • Model metadata and performance metrics
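The idea of shipping a model together with its metadata can be sketched with the standard `pickle` module. The bundle layout below (key names and placeholder values) is an assumption for illustration only and does not describe the actual structure of the objects written by `serialize_model`.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical bundle: plain Python values stand in for a trained estimator
bundle = {
    "model_params": {"weights": [0.4, -1.2], "bias": 0.1},  # placeholder model state
    "feature_names": ["f1", "f2"],                          # column order expected at inference
    "metrics": {"f1_score": None},                          # fill in with measured values
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "production_model.pkl"

    # Save the bundle to disk
    with path.open("wb") as handle:
        pickle.dump(bundle, handle)

    # Load it back, as an inference service would at startup
    with path.open("rb") as handle:
        restored = pickle.load(handle)

print(restored["feature_names"])
```

Storing the feature names next to the model lets the inference side verify column order before predicting. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.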

Monitoring & Visualization

Real-Time Notifications

SlackBot Integration

Slack bot provides real-time updates on training progress, model performance, and system alerts.
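For readers who want to see what a notification involves at the HTTP level, the sketch below builds a request for Slack's `chat.postMessage` Web API method using only the standard library. It is independent of the framework's own bot: the helper `build_slack_request`, the channel name `#ml-runs`, and the message text are hypothetical, and the request is constructed but not sent so the example runs offline.

```python
import json
import os
import urllib.request


def build_slack_request(token: str, channel: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a chat.postMessage request."""
    payload = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        "https://slack.com/api/chat.postMessage",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


# Falls back to a placeholder so the sketch runs without credentials
token = os.environ.get("SLACK_BOT_TOKEN", "xoxb-placeholder")
request = build_slack_request(
    token, "#ml-runs", "Pipeline 'baseline' finished training"
)
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request)
```

The bot token comes from the same `SLACK_BOT_TOKEN` variable described under Environment Setup; the bot must also be a member of the target channel for the message to be delivered.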

Pipeline Visualization

DAG Pipeline Visualizer

Automatically generated DAG visualization showing pipeline architecture, data flow, and performance metrics.

Roadmap & Known Limitations

Upcoming Features

  • Multi-label Classification Support
  • Cyclical Feature Encoding for temporal data
  • Cloud Deployment Integration (AWS, GCP, Azure)
  • Docker Containerization for production deployment
  • Advanced AutoML Capabilities with neural architecture search

Current Limitations

  • Missing Value Handling: Assumes preprocessed data (manual handling required)
  • Grid Search Configuration: Complex setup process for new parameter spaces
  • Stacking Visualization: Not included in DAG visualization
  • Per-Pipeline Feature Selection: Currently uses unified feature selection

Performance Considerations

Operations marked as 'MAJOR IMPACT IN PERFORMANCE' in configuration:

  • Bayesian optimization with large parameter spaces
  • Neural network training with extensive architectures
  • Cross-validation with high fold counts
  • Feature selection on high-dimensional datasets

Contributing

We welcome contributions from the community! Please follow these guidelines:

Development Process

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit changes: git commit -m 'Add amazing feature'
  4. Push to branch: git push origin feature/amazing-feature
  5. Open a Pull Request

Code Standards

  • Follow PEP 8 style guidelines
  • Include comprehensive docstrings
  • Add unit tests for new features
  • Update documentation for API changes

Pull Request Template

Please include:

  • Description of changes and motivation
  • Testing performed and results
  • Breaking Changes if applicable
  • Documentation updates

Documentation

Comprehensive Guides

Research Publications

Access our peer-reviewed research and detailed technical reports through the links provided in the Research & Validation section.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this framework in your research, please cite:

@software{efficient_classifier_2024,
  title={Efficient Classifier: A Dataset-Agnostic ML Framework},
  author={Javier D. and Caterina B. and Juan A. and Federica C. and Irina I. and Juliette J.},
  year={2025},
  url={https://github.com/javidsegura/efficient-classifier}
}

Acknowledgments

  • Built with scikit-learn, XGBoost, and other open-source ML libraries
  • Validated on datasets from the Canadian Centre for Cyber Security
  • Community contributors and beta testers

Ready to accelerate your ML workflow? Install via pip install efficient-classifier and check out our Quick Start Guide.
