Dataset-agnostic ML classification library. Visualization tools, Slack integration, support for multiple pipelines.

Efficient Classifier

A comprehensive, dataset-agnostic machine learning framework for rapid development and deployment of classification pipelines on tabular data, with DevOps tooling for monitoring and experiment tracking.

Overview
Key Features
Architecture
Installation
Quick Start
Configuration
Supported Models & Metrics
Use Cases
Performance
Contributing
Documentation
License

Overview

Efficient Classifier is an enterprise-grade machine learning framework designed to accelerate the development lifecycle from data preprocessing to model deployment. Built with scalability and reproducibility in mind, it provides a unified interface for experimenting with multiple classification pipelines while maintaining rigorous tracking of experiments and results.

The framework supports both binary and multiclass classification tasks and has been validated on real-world datasets, including the CCCS-CIC-AndMal-2020 cybersecurity dataset, where it achieved a 92% F1-score.

Research & Validation

Our framework has been applied to cutting-edge cybersecurity research:

Research Paper - CCCS-CIC-AndMal-2020 Analysis
Complete Results - Plots, logs, and execution history
Technical Report - Methodology and findings
EDA Notebook - Exploratory data analysis
Presentation - Project overview

Key Features

🚀 Rapid Pipeline Development

Multi-pipeline orchestration with customizable architectures
Zero-boilerplate configuration through YAML
Automated hyperparameter optimization (Grid, Random, Bayesian)
One-command execution from data to deployment

🔬 Advanced Analytics & Visualization

Comprehensive residual analysis and confusion matrices
LIME-based feature importance with permutation testing
Model calibration with reliability diagrams
Cross-validation with stratified sampling
Real-time training progress monitoring

🛠 Production-Ready DevOps

Slack bot integration for real-time notifications
Automated DAG visualization of pipeline architectures
Model serialization with joblib/pickle support
Comprehensive logging and experiment tracking
Built-in testing framework integration
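
The CSV-based experiment tracking mentioned above can be illustrated with a minimal sketch. The field names and the `log_result` helper are assumptions made for this example and are not the library's actual logging schema.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical column layout; the library's real log format may differ.
FIELDS = ["timestamp", "pipeline", "model", "metric", "value"]


def log_result(handle, pipeline, model, metric, value):
    """Append one experiment record to a seekable text stream as a CSV row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    if handle.tell() == 0:  # empty stream: write the header first
        writer.writeheader()
    writer.writerow({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "pipeline": pipeline,
        "model": model,
        "metric": metric,
        "value": value,
    })


# In-memory demo; a real run would pass an open file instead.
buffer = io.StringIO()
log_result(buffer, "baseline", "Random Forest", "f1", 0.92)
log_result(buffer, "advanced", "XGBoost", "f1", 0.90)
print(buffer.getvalue())
```

Keeping one row per (pipeline, model, metric) makes the log easy to filter and aggregate later with any tabular tool.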

⚡ High-Performance Computing

Multithreaded processing where parallelization is beneficial
Memory-efficient data handling for large datasets
Optimized feature selection algorithms (Boruta, L1 regularization)
Smart caching mechanisms for repeated operations

Architecture

The framework follows a modular, stage-based architecture:

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Data Loading  │ -> │  Preprocessing   │ -> │ Feature Analysis│
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                 |
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│     DevOps      │ <- │    Modeling      │ <- │   Evaluation    │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Stage-Specific Capabilities

Stage	Capability	Description
Data Management	Smart Splitting	Adaptive train/validation/test splits with distribution analysis
	Distribution Validation	Statistical tests ensuring consistent feature distributions across splits
Preprocessing	Advanced Encoding	One-hot encoding with automatic categorical detection
	Intelligent Imputation	Multiple strategies for handling missing values
	Outlier Detection	IQR and percentile-based detection with configurable treatment
	Robust Scaling	StandardScaler, RobustScaler, and MinMaxScaler support
	Class Balancing	SMOTE and ADASYN implementations for imbalanced datasets
Feature Engineering	Automated Selection	Mutual information, variance filtering, and multicollinearity detection
	Advanced Techniques	Boruta feature selection and L1 regularization
	Custom Engineering	Dataset-specific feature creation hooks
Modeling	Ensemble Methods	Stacked generalization with configurable base learners
	Neural Networks	Feed-forward architectures with epoch-wise monitoring
	Model Comparison	Cross-model evaluation with statistical significance testing
DevOps	Real-time Monitoring	Slack integration for training progress and alerts
	Experiment Tracking	Comprehensive CSV logging with metadata
	Visualization	Automated DAG generation for pipeline architecture
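
As an illustration of the IQR-based outlier detection listed in the table, here is a minimal standard-library sketch. The multiplier `k=1.5` and the function names are assumptions for this example; the framework's own thresholds and treatment options are set in its configuration.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR below Q1 and above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if v < low or v > high]


sample = [10, 12, 11, 13, 12, 11, 95]
print(iqr_bounds(sample))
print(flag_outliers(sample))
```

Whether flagged values are dropped, clipped, or kept is a separate treatment decision that this sketch leaves open.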

Installation

PyPI Installation (Recommended)

pip install efficient-classifier

Development Installation

git clone https://github.com/javidsegura/efficient-classifier.git
cd efficient-classifier
pip install -r requirements.txt

Environment Setup

For Slack bot integration, create a .env file:

SLACK_BOT_TOKEN=your_bot_token
SLACK_SIGNING_SECRET=your_signing_secret
SLACK_APP_TOKEN=your_app_token
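
Before starting a run, it can help to confirm that all three variables are present. The sketch below parses `.env`-style text with the standard library only; `parse_env` and `missing_slack_keys` are hypothetical helpers written for this example, and how the library itself loads the file is not shown in this README.

```python
REQUIRED_KEYS = ("SLACK_BOT_TOKEN", "SLACK_SIGNING_SECRET", "SLACK_APP_TOKEN")


def parse_env(text):
    """Parse KEY=value lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_slack_keys(env):
    """Return the required Slack keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


example = "SLACK_BOT_TOKEN=your_bot_token\n# signing secret not set yet\nSLACK_APP_TOKEN=\n"
print(missing_slack_keys(parse_env(example)))
```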

Quick Start

Basic Usage

from efficient_classifier import PipelineManager

# Initialize with configuration
manager = PipelineManager('configurations.yaml')

# Execute complete pipeline
results = manager.run_all_pipelines()

# Access best model
best_model = results.get_best_model()
predictions = best_model.predict(X_test)  # X_test: held-out feature matrix that you supply

Custom Dataset Integration

Configure dataset-specific cleaning in pipeline_runner.py:

def _clean_dataset_set_up_dataset_specific(self, df):
    # Your custom preprocessing logic goes here; work on a copy of the input
    cleaned_df = df.copy()
    return cleaned_df

Implement feature engineering in featureAnalysis_runner.py:

def _run_feature_engineering_dataset_specific(self, df):
    # Your custom feature engineering goes here; work on a copy of the input
    engineered_df = df.copy()
    return engineered_df

Update boundary conditions in bound_config.py for data validation.
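
The idea of boundary conditions for data validation can be sketched as follows. The `BOUNDS` layout, the column names, and `find_violations` are all hypothetical; the actual structure expected by bound_config.py is not documented here, so treat this only as an illustration of the concept.

```python
# Hypothetical bounds: column -> (minimum, maximum); None means unbounded.
BOUNDS = {
    "age": (0, 120),
    "fare": (0.0, None),
}


def find_violations(rows, bounds):
    """Return (row_index, column, value) for every out-of-range value."""
    violations = []
    for index, row in enumerate(rows):
        for column, (low, high) in bounds.items():
            value = row.get(column)
            if value is None:
                continue  # missing values are handled elsewhere
            too_low = low is not None and value < low
            too_high = high is not None and value > high
            if too_low or too_high:
                violations.append((index, column, value))
    return violations


rows = [
    {"age": 30, "fare": 7.25},
    {"age": -1, "fare": 10.0},
    {"age": 40, "fare": -5.0},
]
print(find_violations(rows, BOUNDS))
```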

Configuration

The framework uses a comprehensive YAML configuration system. Key configuration sections:

Pipeline Definition

general:
  pipelines_names: ["baseline", "advanced", "ensemble"]
  max_plots_per_function: 10  # Control visualization output

Data Processing

phase_runners:
  dataset_runners:
    split_df:
      p: [0.7, 0.8, 0.9]  # Split ratios to evaluate
      step: 0.05          # Granularity of split analysis
    encoding:
      y_column: "target"  # Target variable name
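
This README does not spell out how each value of `p` is applied. Assuming `p` is the fraction of rows assigned to the training set, a stratified split that preserves class proportions could look like the sketch below; `stratified_split` is a hypothetical helper, not part of the library.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction, seed=0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


labels = [0] * 80 + [1] * 20
for p in (0.7, 0.8, 0.9):
    train_idx, test_idx = stratified_split(labels, p)
    print(p, len(train_idx), len(test_idx))
```

Splitting each class separately keeps the minority class represented in both partitions, which matters for imbalanced datasets.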

Model Configuration

modelling_runner:
  class_weights:
    weights: {0: 1.0, 1: 2.0}  # Handle class imbalance
  models_to_include:
    baseline: ["Random Forest", "Logistic Regression"]
    advanced: ["XGBoost", "Neural Network"]
  optimization:
    method: "bayesian"  # grid, random, or bayesian
    cv_folds: 5
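
A quick consistency check over the loaded configuration can catch mismatches before a long run. The sketch below mirrors the YAML fragments above as a plain Python dict (the nesting of `modelling_runner` at the top level is an assumption) and `check_config` is a hypothetical helper; the library may validate its configuration differently, or not treat these cases as errors.

```python
# Plain-dict mirror of the YAML fragments shown above.
config = {
    "general": {"pipelines_names": ["baseline", "advanced", "ensemble"]},
    "modelling_runner": {
        "models_to_include": {
            "baseline": ["Random Forest", "Logistic Regression"],
            "advanced": ["XGBoost", "Neural Network"],
        },
        "optimization": {"method": "bayesian", "cv_folds": 5},
    },
}

ALLOWED_METHODS = {"grid", "random", "bayesian"}


def check_config(cfg):
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    declared = set(cfg["general"]["pipelines_names"])
    modelling = cfg["modelling_runner"]
    configured = set(modelling["models_to_include"])

    for name in sorted(declared - configured):
        problems.append(f"pipeline '{name}' has no models_to_include entry")
    for name in sorted(configured - declared):
        problems.append(f"models_to_include names unknown pipeline '{name}'")

    optimization = modelling["optimization"]
    if optimization["method"] not in ALLOWED_METHODS:
        problems.append(f"unknown optimization method '{optimization['method']}'")
    if optimization["cv_folds"] < 2:
        problems.append("cv_folds must be at least 2")
    return problems


for problem in check_config(config):
    print(problem)
```

With the fragments as shown, the check reports that `ensemble` is declared without a model list, which may simply mean the fragments are abbreviated.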

For complete configuration options, see the detailed documentation.

Supported Models & Metrics

Machine Learning Models

Tree-Based Algorithms:

Random Forest, Decision Trees, Gradient Boosting
XGBoost, LightGBM, CatBoost
AdaBoost with configurable base estimators

Linear Models:

Logistic Regression, Ridge Classifier
Linear/Non-linear SVM, SGD Classifier
Elastic Net with L1/L2 regularization

Advanced Methods:

Feed-Forward Neural Networks
Ensemble Stacking (meta-learning)
K-Nearest Neighbors, Gaussian Naive Bayes

Baseline Models:

Majority Class Classifier for benchmarking
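
To make the benchmarking role of the baseline concrete, here is a minimal sketch of a majority class classifier. It is an illustration written for this README, not the library's implementation, and the class name is hypothetical.

```python
from collections import Counter


class MajorityClassSketch:
    """Always predicts the most frequent label seen during fit."""

    def fit(self, X, y):
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


X_train = [[0], [1], [2], [3]]
y_train = [1, 0, 1, 1]
baseline = MajorityClassSketch().fit(X_train, y_train)
predictions = baseline.predict(X_train)
accuracy = sum(p == t for p, t in zip(predictions, y_train)) / len(y_train)
print(predictions, accuracy)
```

Any real model should beat this accuracy; on imbalanced data the baseline can look deceptively strong, which is why class-aware metrics are reported alongside accuracy.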

Evaluation Metrics

Classification Accuracy - Overall correctness
Precision, Recall, F1-Score - Class-specific performance
Cohen's Kappa - Agreement between predictions and true labels, corrected for chance
Weighted Accuracy - Class-imbalance adjusted accuracy
ROC-AUC - Area under receiver operating characteristic
Calibration Metrics - Reliability diagrams and Brier score
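
For readers who want the definitions behind these metrics, the following standard-library sketch computes precision, recall, F1, Cohen's kappa, and the Brier score for a small binary example. The function names are chosen for this illustration; the framework presumably relies on its underlying ML libraries for the real computations.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def cohen_kappa(y_true, y_pred):
    """(observed - expected agreement) / (1 - expected); undefined if expected == 1."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


def brier_score(y_true, probabilities):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probabilities)) / len(y_true)


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
probs = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]
print(precision_recall_f1(y_true, y_pred))
print(cohen_kappa(y_true, y_pred))
print(brier_score(y_true, probs))
```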

Adding Custom Models

Extend model support by modifying modelling_runner.py:

def _model_initializers(self):
    models = {
        # Existing models...
        "Custom Model": YourCustomClassifier(
            param1=self.config['custom_param']
        )
    }
    return models
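
The `YourCustomClassifier` placeholder above stands for any estimator exposing `fit` and `predict`. As a minimal sketch of that interface, the nearest-centroid classifier below uses only the standard library. Whether the framework additionally requires scikit-learn base classes or a `predict_proba` method is not stated here, so check that before plugging in a class like this one.

```python
import math
from collections import defaultdict


class NearestCentroidSketch:
    """Assigns each sample to the class whose mean feature vector is closest."""

    def fit(self, X, y):
        sums = {}
        counts = defaultdict(int)
        for row, label in zip(X, y):
            if label not in sums:
                sums[label] = [0.0] * len(row)
            sums[label] = [s + v for s, v in zip(sums[label], row)]
            counts[label] += 1
        self.centroids_ = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, X):
        return [
            min(self.centroids_, key=lambda label: math.dist(row, self.centroids_[label]))
            for row in X
        ]


X_train = [[0, 0], [0, 1], [5, 5], [6, 5]]
y_train = ["a", "a", "b", "b"]
model = NearestCentroidSketch().fit(X_train, y_train)
print(model.predict([[0.2, 0.4], [5.5, 4.0]]))
```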

Use Cases

MANTIS: Cybersecurity Threat Detection

Our flagship application demonstrates the framework's capabilities in cybersecurity:

Dataset: CCCS-CIC-AndMal-2020 (Android malware detection)
Performance: 92% F1-score with Random Forest + Stacking ensemble
Scale: 200,000+ samples with 464 features
Deployment: Production-ready model with 15ms inference time

Key Results:

Outperformed baseline approaches by 23%
Identified 847 critical features through automated selection
Achieved 99.1% precision for malware detection

Benchmark Datasets

Titanic Survival Prediction: View Results

89.3% accuracy with ensemble methods
Comprehensive feature engineering pipeline

Iris Classification: View Results

97.8% accuracy across all pipeline configurations
Validation of multi-class capabilities

Performance

Benchmarks

Dataset	Samples	Features	Best Model	F1-Score	Training Time
CCCS-CIC-AndMal-2020	200K+	464	Random Forest	92.0%	45 min
Titanic	891	12	Stacking Ensemble	89.3%	2 min
Iris	150	4	Neural Network	97.8%	30 sec

Optimization Features

Memory Management: Efficient handling of datasets up to 1M+ rows
Parallel Processing: Multi-core utilization for independent operations
Early Stopping: Automatic convergence detection for iterative algorithms
Caching: Intelligent result caching for repeated experiments
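
Early stopping for iterative algorithms is commonly implemented with a patience counter. The sketch below shows that general technique; the `EarlyStopping` class, its parameters, and the loss values are assumptions for illustration and do not describe the framework's internal convergence check.

```python
class EarlyStopping:
    """Signals a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: remember it and reset the counter
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1    # no improvement this epoch
        return self.stale_epochs >= self.patience


def epochs_run(losses, patience=3):
    """Return how many epochs execute before early stopping triggers."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(losses)


validation_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.59]
print(epochs_run(validation_losses))
```

A larger `patience` tolerates noisy validation curves at the cost of extra training time.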

Model Deployment

Serialization & Inference

# Save trained pipeline
manager.serialize_model(best_pipeline, 'production_model.pkl')

# Load for inference
loaded_model = manager.load_model('production_model.pkl')

# Production predictions
predictions = loaded_model.model_sklearn.predict(X_new)
probabilities = loaded_model.model_sklearn.predict_proba(X_new)

Production Integration

The serialized models contain:

Trained sklearn estimator objects
Complete preprocessing pipelines
Feature engineering transformations
Model metadata and performance metrics
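
The idea of bundling an estimator with its metadata can be sketched with `pickle` alone. The dict layout and the `save_bundle`/`load_bundle` helpers below are hypothetical and are not the format written by `manager.serialize_model`; the stand-in "model" is a plain dict so the example runs without any ML library.

```python
import os
import pickle
import tempfile


def save_bundle(model, feature_names, metrics, path):
    """Pickle the model together with the metadata needed at inference time."""
    bundle = {"model": model, "feature_names": feature_names, "metrics": metrics}
    with open(path, "wb") as handle:
        pickle.dump(bundle, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_bundle(path):
    """Load a bundle; only unpickle files from sources you trust."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "production_model.pkl")
    save_bundle({"weights": [0.1, 0.2]}, ["f1", "f2"], {"f1_score": 0.92}, target)
    restored = load_bundle(target)
    print(restored["feature_names"], restored["metrics"])
```

Storing the feature names next to the model lets the inference code verify that incoming data has the columns the model was trained on.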

Monitoring & Visualization

Real-Time Notifications

SlackBot Integration

Slack bot provides real-time updates on training progress, model performance, and system alerts.
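
For orientation, a training notification can be sent through Slack's `chat.postMessage` Web API method with a bot token. The sketch below only builds the HTTP request so it runs offline; the channel name, token placeholder, and `build_slack_request` helper are assumptions, and the library's own bot may use a different client or connection mode.

```python
import json
import urllib.request

SLACK_POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def build_slack_request(token, channel, text):
    """Build (but do not send) a chat.postMessage request."""
    payload = json.dumps({"channel": channel, "text": text}).encode("utf-8")
    return urllib.request.Request(
        SLACK_POST_MESSAGE_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


request = build_slack_request(
    "xoxb-placeholder",          # would come from SLACK_BOT_TOKEN
    "#ml-training",              # hypothetical channel
    "baseline pipeline finished",
)
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```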

Pipeline Visualization

DAG Pipeline Visualizer

Automatically generated DAG visualization showing pipeline architecture, data flow, and performance metrics.
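
A pipeline DAG of this kind can be described in Graphviz DOT text, which many tools can render. The sketch below generates DOT for the stages named in the Architecture section; the strictly linear stage order and the `to_dot` helper are assumptions for illustration, not the library's visualizer.

```python
def to_dot(edges, name="pipeline"):
    """Render a list of (source, target) edges as Graphviz DOT text."""
    lines = [f"digraph {name} {{", "  rankdir=LR;"]
    for source, target in edges:
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Assumed linear ordering of the stages from the Architecture section.
STAGES = [
    "Data Loading",
    "Preprocessing",
    "Feature Analysis",
    "Modeling",
    "Evaluation",
    "DevOps",
]
edges = list(zip(STAGES, STAGES[1:]))
print(to_dot(edges))
```

The output can be saved to a `.dot` file and rendered with the Graphviz `dot` command if it is installed.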

Roadmap & Known Limitations

Upcoming Features

Multi-label Classification Support
Cyclical Feature Encoding for temporal data
Cloud Deployment Integration (AWS, GCP, Azure)
Docker Containerization for production deployment
Advanced AutoML Capabilities with neural architecture search

Current Limitations

Missing Value Handling: Assumes preprocessed data (manual handling required)
Grid Search Configuration: Complex setup process for new parameter spaces
Stacking Visualization: Not included in DAG visualization
Per-Pipeline Feature Selection: Currently uses unified feature selection

Performance Considerations

Operations marked as 'MAJOR IMPACT IN PERFORMANCE' in configuration:

Bayesian optimization with large parameter spaces
Neural network training with extensive architectures
Cross-validation with high fold counts
Feature selection on high-dimensional datasets

Contributing

We welcome contributions from the community! Please follow these guidelines:

Development Process

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Commit changes: git commit -m 'Add amazing feature'
Push to branch: git push origin feature/amazing-feature
Open a Pull Request

Code Standards

Follow PEP 8 style guidelines
Include comprehensive docstrings
Add unit tests for new features
Update documentation for API changes

Pull Request Template

Please include:

Description of changes and motivation
Testing performed and results
Breaking Changes if applicable
Documentation updates

Documentation

Comprehensive Guides

Library Architecture - Design decisions and implementation details
API Reference - Complete function and class documentation
Configuration Guide - YAML parameter explanations
Troubleshooting - Common issues and solutions

Research Publications

Access our peer-reviewed research and detailed technical reports through the links provided in the Research & Validation section.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this framework in your research, please cite:

@software{efficient_classifier_2024,
  title={Efficient Classifier: A Dataset-Agnostic ML Framework},
  author={Javier D. and Caterina B. and Juan A. and Federica C. and Irina I. and Juliette J.},
  year={2025},
  url={https://github.com/javidsegura/efficient-classifier}
}

Acknowledgments

Built with scikit-learn, XGBoost, and other open-source ML libraries
Validated on datasets from the Canadian Centre for Cyber Security
Community contributors and beta testers

Ready to accelerate your ML workflow? Install via pip install efficient-classifier and check out our Quick Start Guide.

Release history

2.1.2 (this version) - Jun 2, 2025
2.1.1 - Jun 2, 2025
2.1.0 - Jun 2, 2025
2.0.1 - Jun 2, 2025
1.0.2 - May 24, 2025
1.0.1 - May 19, 2025
1.0.0 - May 19, 2025

Download files

Source Distribution: efficient_classifier-2.1.2.tar.gz (52.8 kB), uploaded Jun 2, 2025
Built Distribution: efficient_classifier-2.1.2-py3-none-any.whl (58.1 kB), uploaded Jun 2, 2025, Python 3

File details

Details for the file efficient_classifier-2.1.2.tar.gz.

File metadata

Download URL: efficient_classifier-2.1.2.tar.gz
Upload date: Jun 2, 2025
Size: 52.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for efficient_classifier-2.1.2.tar.gz
Algorithm	Hash digest
SHA256	`8de365fd7b7426789f87c20687c1902e47ab46dbe5fc1ba172b53d8f530eed69`
MD5	`a9b72b1160bc0a685deb12d2470b5137`
BLAKE2b-256	`2f29709b16f26d8cb445cf424b44112c313fb7db0ab002e696762790b45933a1`

File details

Details for the file efficient_classifier-2.1.2-py3-none-any.whl.

File metadata

Download URL: efficient_classifier-2.1.2-py3-none-any.whl
Upload date: Jun 2, 2025
Size: 58.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for efficient_classifier-2.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`36e3dcf18eda355af7fc512e2e40fa9dbee8de538502130960d9f8d7fd5ecdd1`
MD5	`f9f57107f72b003adba387f8a5622141`
BLAKE2b-256	`fbc3956f8dc75dd090c9c236032f249e87064822961c1663ccdf6108f537518b`
