
MLFastOpt


MLFastOpt is an ensemble optimization system for Bayesian hyperparameter optimization of LightGBM models, and LightGBM only. It provides automated machine learning capabilities with a focus on speed, accuracy, and ease of use.

Why LightGBM Only? We chose to focus exclusively on LightGBM because it offers similar performance to XGBoost but with significantly faster training times. Our objective is not to debate which gradient boosting framework is superior, but rather to provide the fastest possible hyperparameter optimization experience. By specializing in LightGBM, we can optimize the entire pipeline for maximum speed and efficiency.

Features

  • 🚀 Fast Optimization: Advanced Bayesian optimization algorithms
  • 🎯 LightGBM Ensembles: Automated ensemble model creation and tuning
  • 🌐 Web Interface: Interactive visualization and analysis tools
  • ⚙️ Flexible Configuration: Environment-based configuration system
  • 📊 Rich Analytics: Comprehensive performance analysis and visualization
  • 🔧 Easy CLI: Simple command-line interface for all operations

Installation

pip install mlfastopt

For development installation:

git clone https://github.com/your-repo/mlfastopt
cd mlfastopt
pip install -e .[dev]

Quick Start

MLFastOpt is a framework that requires you to provide your own configuration files. Here's how to get started:

1. Create Directory Structure

mkdir -p config/hyperparameters
mkdir -p data
# Note: Output directories (outputs/, outputs/runs/, etc.) are created automatically

2. Create Hyperparameter Space

Create a hyperparameter space file (e.g., config/hyperparameters/my_space.py):

This file must be referenced from your configuration file (e.g., my_config.json) via the HYPERPARAMETER_PATH key.

# config/hyperparameters/my_space.py
PARAMETERS = [
    {"name": "boosting_type", "type": "choice", "values": ["gbdt", "dart"], "value_type": "str"},
    {"name": "num_leaves", "type": "range", "bounds": [20, 200], "value_type": "int"},
    {"name": "learning_rate", "type": "range", "bounds": [0.01, 0.3], "value_type": "float", "log_scale": True},
    {"name": "n_estimators", "type": "range", "bounds": [100, 300], "value_type": "int"},
    {"name": "subsample", "type": "range", "bounds": [0.3, 1.0], "value_type": "float"},
    {"name": "colsample_bytree", "type": "range", "bounds": [0.3, 1.0], "value_type": "float"},
    {"name": "reg_alpha", "type": "range", "bounds": [1e-8, 0.5], "value_type": "float", "log_scale": True},
    {"name": "reg_lambda", "type": "range", "bounds": [1e-8, 0.5], "value_type": "float", "log_scale": True},
    {"name": "is_unbalance", "type": "choice", "values": [True, False], "value_type": "bool"},
]

def get_parameter_space():
    return PARAMETERS
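
Before launching a run, you can sanity-check the space file with a few lines of Python. This is an illustrative sketch, not part of MLFastOpt; the load path and assertions are assumptions:

# check_space.py -- quick structural check of a hyperparameter space file (illustrative)
import importlib.util

spec = importlib.util.spec_from_file_location("my_space", "config/hyperparameters/my_space.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

for p in module.get_parameter_space():
    assert p["type"] in ("choice", "range"), f"unknown type for {p['name']}"
    if p["type"] == "range":
        lo, hi = p["bounds"]
        assert lo < hi, f"bad bounds for {p['name']}"
print("parameter space looks well-formed")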

3. Create Configuration File

Create your optimization configuration file (e.g., my_config.json):

{
  "_description": "Example configuration for MLFastOpt optimization",
  "_hyperparameter_space": "Custom hyperparameter space for your use case",
  
  "DATA_PATH": "data/your_dataset.csv",
  "HYPERPARAMETER_PATH": "config/hyperparameters/my_space.py",
  "LABEL_COLUMN": "target",
  "FEATURES": ["feature1", "feature2", "feature3", "feature4", "feature5"],
  
  "CLASS_WEIGHT": {"0": 1, "1": 3},
  "UNDER_SAMPLE_MAJORITY_RATIO": 2,
  
  "N_ENSEMBLE_GROUP_NUMBER": 15,
  "AE_NUM_TRIALS": 30,
  "NUM_SOBOL_TRIALS": 10,
  "RANDOM_SEED": 42,
  "PARALLEL_TRAINING": true,
  "N_JOBS": -1,
  
  "OPTIMIZATION_METRICS": "soft_recall",
  "BEST_TRIAL_FILE_SUFFIX": "my_experiment",
  "SOFT_PREDICTION_THRESHOLD": 0.7,
  "F1_THRESHOLD": 0.7,
  "MIN_RECALL_THRESHOLD": 0.80,
  
  "ENABLE_DATA_IMPUTATION": false,
  "IMPUTE_TARGET_NULLS": true,
  
  "SAVE_THRESHOLD_ENABLED": false,
  "SAVE_THRESHOLD_METRIC": "soft_recall",
  "SAVE_THRESHOLD_VALUE": 0.85,
  "FALLBACK_TOP_K": 5,
  "FALLBACK_METRIC": "soft_recall"
}
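
Because the configuration is plain JSON, a short script can confirm it parses and that its paths exist before you hand it to the CLI. A minimal sketch (the required-key list mirrors the section below; this is not the built-in validator, so prefer the CLI's --validate flag from step 4 for the authoritative check):

# check_config.py -- structural config check (illustrative; prefer the CLI's --validate)
import json
from pathlib import Path

with open("my_config.json") as f:
    cfg = json.load(f)

for key in ("DATA_PATH", "HYPERPARAMETER_PATH", "LABEL_COLUMN", "FEATURES"):
    assert key in cfg, f"missing required key: {key}"
for key in ("DATA_PATH", "HYPERPARAMETER_PATH"):
    assert Path(cfg[key]).exists(), f"{key} points to a missing file: {cfg[key]}"
print("config looks structurally sound")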

Configuration Parameters Explained

Required Parameters

  • DATA_PATH: Path to your dataset (CSV, Parquet, etc.)
  • HYPERPARAMETER_PATH: Path to your hyperparameter space Python file
  • LABEL_COLUMN: Name of the label column in your dataset
  • FEATURES: List of feature column names to use for training
  • CLASS_WEIGHT: Dictionary mapping class labels to weights for imbalanced data (illustrated in the sketch after this list)
  • UNDER_SAMPLE_MAJORITY_RATIO: Ratio for undersampling majority class (1 = no undersampling)
  • N_ENSEMBLE_GROUP_NUMBER: Number of models in each ensemble (affects training time)
  • AE_NUM_TRIALS: Total number of optimization trials to run
  • NUM_SOBOL_TRIALS: Number of initial random exploration trials
  • RANDOM_SEED: Random seed for reproducibility
  • PARALLEL_TRAINING: Enable parallel model training
  • N_JOBS: Number of CPU cores to use (-1 = all available)
  • SOFT_PREDICTION_THRESHOLD: Threshold for converting probabilities to binary predictions
  • MIN_RECALL_THRESHOLD: Minimum recall threshold for trial validation

Optional Parameters

  • OPTIMIZATION_METRICS: Metric to optimize (default: "soft_recall")
  • F1_THRESHOLD: Target F1-score threshold (default: 0.7)
  • BEST_TRIAL_FILE_SUFFIX: Custom suffix for best trial filenames (default: auto-extracted from the dataset filename)
  • ENABLE_DATA_IMPUTATION: Enable feature imputation (default: false)
  • IMPUTE_TARGET_NULLS: Handle null values in target column (default: true)

Advanced Trial Selection (Optional)

  • SAVE_THRESHOLD_ENABLED: Enable threshold-based trial selection (default: false)
  • SAVE_THRESHOLD_METRIC: Metric to use for threshold selection (default: "soft_recall")
  • SAVE_THRESHOLD_VALUE: Minimum value to save trials (default: 0.85)
  • FALLBACK_TOP_K: Number of top trials if none meet threshold (default: 5)
  • FALLBACK_METRIC: Metric for fallback ranking (default: "soft_recall")
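
These five settings interact, which is easier to see as code. A sketch of the selection logic as described above (the trial["metrics"] structure is illustrative, not MLFastOpt's actual data model):

# trial_selection_sketch.py -- threshold selection with top-k fallback (illustrative)
def select_trials(trials, metric="soft_recall", threshold=0.85,
                  top_k=5, fallback_metric="soft_recall"):
    # Keep every trial whose chosen metric clears the threshold...
    kept = [t for t in trials if t["metrics"][metric] >= threshold]
    if kept:
        return kept
    # ...otherwise fall back to the top-k trials ranked by the fallback metric.
    ranked = sorted(trials, key=lambda t: t["metrics"][fallback_metric], reverse=True)
    return ranked[:top_k]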

4. Run Optimization

# Set the threading environment variable first (important!)
export OMP_NUM_THREADS=1

# Validate the configuration without running optimization
python -m mlfastopt.cli --validate --config my_config.json

# Run optimization (or prefix the command with OMP_NUM_THREADS=1 instead of exporting)
python -m mlfastopt.cli --config my_config.json

Architecture

MLFastOpt is organized into several key modules, all optimized specifically for LightGBM:

  • mlfastopt.core: Core optimization engine and configuration management for LightGBM ensembles
  • mlfastopt.cli: Command-line interface for LightGBM hyperparameter optimization
  • mlfastopt.web: Web-based visualization and analysis tools for LightGBM optimization results

Configuration System

MLFastOpt is a framework that requires user-provided configurations:

  1. Configuration files: JSON files defining optimization parameters and data paths
  2. Hyperparameter spaces: Python modules defining LightGBM parameter search spaces
  3. Data files: Your datasets in CSV, Parquet, or other pandas-compatible formats

All output directories are created automatically by the framework.

Hyperparameter Tuning

MLFastOpt requires you to define custom LightGBM hyperparameter spaces for your specific use case:

Creating Parameter Spaces

You must create your own hyperparameter space files. Here's the syntax:

Parameter Types

  • Choice: {"name": "param", "type": "choice", "values": ["a", "b"], "value_type": "str"}
  • Range (Int): {"name": "param", "type": "range", "bounds": [1, 100], "value_type": "int"}
  • Range (Float): {"name": "param", "type": "range", "bounds": [0.1, 1.0], "value_type": "float"}
  • Log Scale: Add "log_scale": True for logarithmic parameter exploration
  • Boolean: {"name": "param", "type": "choice", "values": [True, False], "value_type": "bool"}

Example Parameter Space

# config/hyperparameters/my_space.py
PARAMETERS = [
    # Boosting algorithm
    {"name": "boosting_type", "type": "choice", "values": ["gbdt", "dart"], "value_type": "str"},
    
    # Tree structure
    {"name": "num_leaves", "type": "range", "bounds": [20, 200], "value_type": "int"},
    {"name": "max_depth", "type": "range", "bounds": [-1, 30], "value_type": "int"},
    
    # Learning parameters
    {"name": "learning_rate", "type": "range", "bounds": [0.01, 0.3], "value_type": "float", "log_scale": True},
    {"name": "n_estimators", "type": "range", "bounds": [100, 500], "value_type": "int"},
    
    # Regularization
    {"name": "reg_alpha", "type": "range", "bounds": [1e-8, 1.0], "value_type": "float", "log_scale": True},
    {"name": "reg_lambda", "type": "range", "bounds": [1e-8, 1.0], "value_type": "float", "log_scale": True},
    
    # Sampling
    {"name": "subsample", "type": "range", "bounds": [0.3, 1.0], "value_type": "float"},
    {"name": "colsample_bytree", "type": "range", "bounds": [0.3, 1.0], "value_type": "float"},
    
    # Class balance
    {"name": "is_unbalance", "type": "choice", "values": [True, False], "value_type": "bool"},
]

def get_parameter_space():
    """Required function that returns the parameter list"""
    return PARAMETERS

Configuration

Reference your parameter space in the config file:

{
  "HYPERPARAMETER_PATH": "config/hyperparameters/my_space.py",
  "DATA_PATH": "data/your_dataset.csv",
  "LABEL_COLUMN": "target",
  "AE_NUM_TRIALS": 50
}

Optimization Metrics

MLFastOpt allows you to choose which metric to optimize during hyperparameter tuning. By default, it optimizes soft_recall, but you can configure it to optimize any of the available metrics.

Configurable Optimization Metric

Set the OPTIMIZATION_METRICS parameter in your configuration file, alongside your other settings:

{
  "OPTIMIZATION_METRICS": "soft_f1_score",
  "..."
}

Available Metrics

  • soft_recall (default): Recall from soft voting ensemble predictions
  • hard_recall: Recall from hard voting ensemble predictions
  • soft_f1_score: F1-score from soft voting ensemble predictions
  • hard_f1_score: F1-score from hard voting ensemble predictions
  • soft_precision: Precision from soft voting ensemble predictions
  • hard_precision: Precision from hard voting ensemble predictions

Soft vs Hard Voting

  • Soft Voting: Averages predicted probabilities from all ensemble models, then applies threshold
  • Hard Voting: Averages binary predictions from all ensemble models

Soft voting typically provides better calibrated predictions and is recommended for most use cases.
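
The difference is easy to see in a few lines of NumPy. This sketch assumes equal-weight averaging and a majority rule for hard voting, matching the descriptions above:

# voting_sketch.py -- soft vs. hard ensemble voting (illustrative)
import numpy as np

# Predicted probabilities from 3 ensemble members for 4 samples
probs = np.array([
    [0.60, 0.40, 0.90, 0.30],
    [0.55, 0.45, 0.85, 0.20],
    [0.40, 0.70, 0.95, 0.10],
])
threshold = 0.5

# Soft voting: average the probabilities first, then apply the threshold once
soft_pred = (probs.mean(axis=0) >= threshold).astype(int)            # -> [1 1 1 0]

# Hard voting: threshold each model first, then take the majority of the 0/1 votes
hard_pred = ((probs >= threshold).mean(axis=0) >= 0.5).astype(int)   # -> [1 0 1 0]

# The two schemes disagree on the second sample, whose probabilities straddle the threshold.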

Example Configurations

High Recall for Fraud Detection:

{
  "OPTIMIZATION_METRICS": "soft_recall",
  "SOFT_PREDICTION_THRESHOLD": 0.2,
  "MIN_RECALL_THRESHOLD": 0.95
}

Balanced Performance:

{
  "OPTIMIZATION_METRICS": "soft_f1_score", 
  "SOFT_PREDICTION_THRESHOLD": 0.5,
  "MIN_RECALL_THRESHOLD": 0.75
}

High Precision for Medical Diagnosis:

{
  "OPTIMIZATION_METRICS": "soft_precision",
  "SOFT_PREDICTION_THRESHOLD": 0.8,
  "MIN_RECALL_THRESHOLD": 0.70
}

Data Preprocessing Requirements

MLFastOpt expects preprocessed, numerical data only. You must handle all data preprocessing before running optimization.

Required Preprocessing Steps

  1. Categorical Features: Must be encoded before optimization

    • ✅ One-hot encoding: pd.get_dummies()
    • ✅ Label encoding: LabelEncoder()
    • ✅ Target encoding, ordinal encoding, etc.
    • ❌ Raw categorical strings/text
  2. Feature Engineering: Complete all feature engineering beforehand

    • Feature scaling and normalization (optional; tree-based LightGBM does not need them)
    • Feature selection and dimensionality reduction
    • Creating interaction features, polynomial features
  3. Missing Values: Handle according to your domain requirements

    • Set ENABLE_DATA_IMPUTATION: false to let LightGBM handle nulls
    • Set ENABLE_DATA_IMPUTATION: true for median/mode imputation

Example Preprocessing Pipeline

import pandas as pd

# Load raw data
df = pd.read_csv('raw_data.csv')

# 1. Encode categorical features (one-hot in this example)
categorical_cols = ['category_A', 'category_B']
df = pd.get_dummies(df, columns=categorical_cols, dtype=int)

# 2. Handle missing values (optional)
# df = df.fillna(df.median())  # or let LightGBM handle nulls natively

# 3. Save preprocessed data
df.to_parquet('preprocessed_data.parquet', index=False)

# 4. Update your config
config = {
    "DATA_PATH": "preprocessed_data.parquet",
    "FEATURES": [c for c in df.columns if c != "target"],  # all features, label excluded
    "LABEL_COLUMN": "target"
}

Why No Built-in Categorical Processing?

  • Performance: Preprocessing once vs. every optimization run
  • Flexibility: Full control over encoding strategies
  • Consistency: Same preprocessing for training and production
  • Domain Knowledge: Categorical encoding often requires domain expertise

Requirements

  • Python 3.8+
  • LightGBM 3.3.0+
  • Pandas, NumPy, Scikit-learn
  • Flask (for web interface)
  • Plotly, Matplotlib (for visualization)

Performance Considerations

  • Always set OMP_NUM_THREADS=1 for LightGBM to avoid thread conflicts
  • Parallel training is controlled via configuration parameters
  • Optimization algorithms benefit from multiple CPU cores
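
If you launch optimization from a Python script rather than the shell, the same variable can be set in-process, as long as it happens before LightGBM is imported. A small sketch of that pattern:

# run_opt.py -- set the threading variable before LightGBM loads (sketch)
import os
os.environ["OMP_NUM_THREADS"] = "1"  # must be set before lightgbm is imported

import lightgbm  # noqa: E402  -- deliberately imported after the env var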

Examples

Development Run (Fast)

# 15 trials, 10 models (~15-20 minutes)
OMP_NUM_THREADS=1 python -m mlfastopt.cli --environment development

Production Run

# Full optimization with more trials
OMP_NUM_THREADS=1 python -m mlfastopt.cli --environment production

Validation

# Validate configuration without running optimization
python -m mlfastopt.cli --config config/environments/development.json --validate

Data Requirements

  • Input data should be in Parquet, CSV, or another pandas-compatible format
  • The target column must be binary (0/1) for classification
  • All feature columns must be numeric; encode categorical features beforehand (see Data Preprocessing Requirements)
  • Null values can either be left for LightGBM to handle natively or imputed via ENABLE_DATA_IMPUTATION
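
A quick way to confirm a dataset meets these requirements before pointing DATA_PATH at it (an illustrative check, with an assumed file and label name):

# check_data.py -- verify the dataset matches the requirements above (illustrative)
import pandas as pd

df = pd.read_parquet("preprocessed_data.parquet")

# Target must be binary 0/1
assert set(df["target"].dropna().unique()) <= {0, 1}, "target must be binary 0/1"

# All feature columns must already be numeric
non_numeric = df.drop(columns=["target"]).select_dtypes(exclude="number").columns
assert len(non_numeric) == 0, f"encode these columns first: {list(non_numeric)}"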

Output Structure

All outputs are organized under outputs/:

  • outputs/runs/: Individual optimization run results
  • outputs/best_trials/: Best performing configurations
  • outputs/logs/: Execution logs
  • outputs/visualizations/: Generated plots and analysis

Best Trial File Naming

Best trial files are automatically named to distinguish different experiments:

With Custom BEST_TRIAL_FILE_SUFFIX

{
  "BEST_TRIAL_FILE_SUFFIX": "fraud_experiment_v2",
  "DATA_PATH": "data/fraud_data.csv"
}

Output files:

  • 2025-08-04_fraud_experiment_v2.json
  • 2025-08-04_fraud_experiment_v2_threshold_soft_recall_0.85.json

Auto-extracted from Dataset Name

{
  "BEST_TRIAL_FILE_SUFFIX": "",
  "DATA_PATH": "data/customer_churn/processed_data.csv"
}

Output files:

  • 2025-08-04_processed_data.json
  • 2025-08-04_processed_data_top_5_soft_recall.json

This naming prevents different experiments from overwriting each other and makes results easily identifiable.
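
Judging from the examples above, the fallback name appears to be the dataset file's stem plus the run date. A sketch of that inferred rule (this is a reading of the examples, not the library's actual code):

# naming_sketch.py -- best-trial filename rule inferred from the examples above
from datetime import date
from pathlib import Path

def best_trial_filename(data_path: str, suffix: str = "") -> str:
    stem = suffix or Path(data_path).stem   # empty suffix falls back to the dataset file stem
    return f"{date.today().isoformat()}_{stem}.json"

print(best_trial_filename("data/customer_churn/processed_data.csv"))
# e.g. 2025-08-04_processed_data.json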

CLI Commands

The package provides several command-line entry points:

  • mlfastopt-optimize: Main optimization CLI
  • mlfastopt-web: Web interface launcher
  • mlfastopt-analyze: Analysis tools

Contributing

We welcome contributions! Please see our contributing guidelines for details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use MLFastOpt in your research, please cite:

@software{mlfastopt,
  title={MLFastOpt: Fast Ensemble Optimization with Advanced Bayesian Methods},
  author={MLFastOpt Development Team},
  url={https://github.com/your-repo/mlfastopt},
  version={0.0.9a1},
  year={2025}
}
