Educational binary classification with XGBoost for signal/background separation

SignalSeeker: Educational Binary Classification with XGBoost

SignalSeeker is a comprehensive, production-quality Python package that teaches how to build machine learning pipelines for binary classification using Boosted Decision Trees (XGBoost) in a signal/background separation context, inspired by particle physics applications.

GitHub | Python 3.8+ | License: MIT | Code style: black

🎯 Overview

In particle physics experiments, signal refers to the desired particle interaction you're searching for, while background refers to all other processes that mimic the signal. The challenge is to build a classifier that:

  1. Identifies signal events with high efficiency (recall)
  2. Rejects background events so that the selected sample has high purity (precision)
  3. Provides interpretable probability scores (BDT scores) for decision-making

SignalSeeker teaches all these concepts through a production-quality, fully-featured Python package.
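The first two goals can be made concrete with a few lines of code. The following is a minimal sketch, not part of the SignalSeeker API: the helper name `efficiency_and_purity` and the toy labels and scores are assumptions made for illustration. It counts how many true signal events pass a cut (efficiency) and how many of the events passing the cut are truly signal (purity).

```python
from typing import Sequence, Tuple


def efficiency_and_purity(
    labels: Sequence[int], scores: Sequence[float], cut: float
) -> Tuple[float, float]:
    """Return (signal efficiency, purity) for events with score >= cut."""
    # Labels of the events that pass the cut.
    selected = [label for label, score in zip(labels, scores) if score >= cut]
    n_signal_total = sum(labels)
    n_signal_selected = sum(selected)
    efficiency = n_signal_selected / n_signal_total if n_signal_total else 0.0
    purity = n_signal_selected / len(selected) if selected else 0.0
    return efficiency, purity


# Toy example (illustrative values only): 1 = signal, 0 = background.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.1]

eff, pur = efficiency_and_purity(labels, scores, cut=0.5)
print(f"efficiency={eff:.2f}, purity={pur:.2f}")  # efficiency=0.75, purity=0.75
```

Raising the cut to 0.65 in this toy example keeps the efficiency at 0.75 but removes the one background event, so purity rises to 1.0.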

✨ Key Features

✅ Complete ML Pipeline: Data generation → Preprocessing → Training → Tuning → Evaluation → Visualization

✅ Realistic Imbalanced Data: Generates synthetic datasets with ~90% background, 10% signal (fully configurable)

✅ XGBoost Implementation: Industry-standard gradient boosted decision trees with full hyperparameter control

✅ Hyperparameter Tuning: Bayesian optimization using Optuna for automatic parameter discovery

✅ Comprehensive Metrics: Accuracy, Precision, Recall, F1, AUC-ROC, AUC-PR, Confusion Matrix

✅ Publication-Quality Visualizations:

  • Probability Score Distribution: Core visualization showing signal/background separation
  • ROC and Precision-Recall curves
  • Feature importance analysis
  • Confusion matrix heatmap
  • Summary dashboard

✅ GPU Acceleration: Full CUDA support for XGBoost (optional)

✅ Professional Package: Installable via pip, PyPI-ready, proper src/ layout

✅ Comprehensive Testing: Full pytest coverage with CI/CD

✅ Educational Comments: Detailed docstrings explaining why each step is performed

✅ Professional Code Quality: Type hints, OOP design, proper error handling, Black-formatted

📦 Installation

Option 1: Development Install (Recommended for Learning)

git clone https://github.com/yourusername/SignalSeeker.git
cd SignalSeeker
pip install -e .

This allows you to modify source files and see changes immediately.

Option 2: Regular Install from GitHub

pip install git+https://github.com/yourusername/SignalSeeker.git

Option 3: Install from PyPI (when published)

pip install signalseeker

Option 4: Install with Development Tools

pip install -e ".[dev]"

Includes testing, linting, and documentation tools.

Option 5: Install with GPU Support

pip install -e ".[gpu]"

Requires CUDA Toolkit to be installed. See GPU Setup Guide.

Option 6: Install from Requirements Files

# Core dependencies only
pip install -r requirements.txt

# With development tools
pip install -r requirements-dev.txt

# With GPU support
pip install -r requirements-gpu.txt

🚀 Quick Start

Run the Complete Pipeline

python -m signalseeker.main

Or after installation:

signal-seeker

This will:

  1. Generate synthetic imbalanced data (10,000 samples, 10% signal)
  2. Preprocess features (scaling, missing value handling)
  3. Build and train an XGBoost model
  4. Evaluate on validation and test sets
  5. Generate publication-quality visualizations
  6. Save all results to ./results/run_TIMESTAMP/
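Step 1 can be pictured with a dependency-free sketch. This is not the package's generator: the function name `make_toy_dataset`, the single Gaussian feature, and the class means of +1 and -1 are all assumptions chosen for illustration. Only the sample count and signal fraction come from the description above.

```python
import random
from typing import List, Tuple


def make_toy_dataset(
    n_samples: int = 10_000, signal_fraction: float = 0.10, seed: int = 42
) -> Tuple[List[float], List[int]]:
    """Generate one Gaussian feature with overlapping signal and background."""
    rng = random.Random(seed)
    n_signal = round(n_samples * signal_fraction)
    n_background = n_samples - n_signal
    # Signal is centred at +1 and background at -1; unit width gives overlap.
    features = [rng.gauss(1.0, 1.0) for _ in range(n_signal)]
    features += [rng.gauss(-1.0, 1.0) for _ in range(n_background)]
    labels = [1] * n_signal + [0] * n_background
    # Shuffle features and labels with the same permutation.
    order = list(range(n_samples))
    rng.shuffle(order)
    return [features[i] for i in order], [labels[i] for i in order]


X, y = make_toy_dataset()
print(len(X), sum(y))  # 10000 1000
```

Because the two Gaussians overlap, no cut on the feature separates the classes perfectly, which is what makes the classification task non-trivial.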

With Hyperparameter Tuning (Better Performance, Slower)

signal-seeker --tune

Uses Bayesian optimization to find optimal hyperparameters (takes ~2-5 minutes).

Custom Configuration

signal-seeker --n-samples 50000 --signal-fraction 0.15 --output-dir ./my_results

Available options:

  • --tune: Enable hyperparameter tuning
  • --output-dir: Output directory for results
  • --n-samples: Total number of samples to generate
  • --signal-fraction: Fraction of signal samples (0-1)
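For readers curious how such options are typically wired up, here is a hedged sketch using `argparse` from the standard library. It is not SignalSeeker's actual parser: `build_parser` and `fraction` are hypothetical names, and the defaults (10,000 samples, 10% signal, `./results`) are assumptions inferred from the Quick Start description.

```python
import argparse


def fraction(text: str) -> float:
    """Parse a value strictly between 0 and 1 (both classes must be present)."""
    value = float(text)
    if not 0.0 < value < 1.0:
        raise argparse.ArgumentTypeError(f"{text} is not strictly between 0 and 1")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the four options listed above."""
    parser = argparse.ArgumentParser(prog="signal-seeker")
    parser.add_argument("--tune", action="store_true",
                        help="Enable hyperparameter tuning")
    parser.add_argument("--output-dir", default="./results",
                        help="Output directory for results")
    parser.add_argument("--n-samples", type=int, default=10_000,
                        help="Total number of samples to generate")
    parser.add_argument("--signal-fraction", type=fraction, default=0.10,
                        help="Fraction of signal samples")
    return parser


# Parse the same arguments as the custom-configuration example above.
args = build_parser().parse_args(
    ["--n-samples", "50000", "--signal-fraction", "0.15", "--output-dir", "./my_results"]
)
print(args.n_samples, args.signal_fraction, args.output_dir, args.tune)
```

Validating `--signal-fraction` at parse time gives an immediate error message instead of a failure later in data generation.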

Python API Usage

from signalseeker import SignalSeekerPipeline, DEFAULT_CONFIG

# Use default configuration
pipeline = SignalSeekerPipeline(DEFAULT_CONFIG)
results = pipeline.run(use_tuning=False)

# Or customize a copy of the configuration, so DEFAULT_CONFIG itself stays unchanged
import copy

config = copy.deepcopy(DEFAULT_CONFIG)
config.data.n_samples = 50000
config.data.signal_fraction = 0.15
config.xgboost.max_depth = 8
config.xgboost.learning_rate = 0.05

pipeline = SignalSeekerPipeline(config)
results = pipeline.run(use_tuning=True)

# Access results
val_auc = results["validation_results"]["all_metrics"]["auc_roc"]
test_auc = results["test_results"]["all_metrics"]["auc_roc"]
model = results["model"]

print(f"Validation AUC: {val_auc:.4f}")
print(f"Test AUC: {test_auc:.4f}")

📊 Understanding the Output

The Probability Score Distribution (Most Important Plot)

This is the fundamental visualization in signal/background separation:

Signal Distribution:   ░░░░░░░░░░░░████████████████
Background Dist:       ████████████████░░░░░░░░░░░░
                       0.0         Cut          1.0
                               (threshold)
  • Signal: Concentrated near 1.0 (high probability of being signal)
  • Background: Concentrated near 0.0 (low probability of being signal)
  • Cut: The threshold above which we classify events as signal

Good Model: Minimal overlap, clear separation
Poor Model: Heavy overlap, hard to separate

This plot is analogous to the "BDT score" or "discriminant" in particle physics.
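The plot is built from two histograms of the model's scores, one per class. A minimal text-only sketch follows; the helper `score_histogram` and the toy scores are assumptions for illustration, and the package's own `plot_probability_score_distribution` draws the real figure. Scores are assumed to lie in [0, 1].

```python
from typing import List, Sequence


def score_histogram(scores: Sequence[float], n_bins: int = 10) -> List[int]:
    """Count scores in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for score in scores:
        # Clamp so that a score of exactly 1.0 lands in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Toy scores (illustrative only): 1 = signal, 0 = background.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.75, 0.35, 0.65, 0.45, 0.25, 0.15, 0.15, 0.05, 0.05, 0.05]

signal_hist = score_histogram([s for s, label in zip(scores, labels) if label == 1])
background_hist = score_histogram([s for s, label in zip(scores, labels) if label == 0])

# One row per score bin: background counts as '#', signal counts as '*'.
for i, (n_sig, n_bkg) in enumerate(zip(signal_hist, background_hist)):
    bar_bkg = "#" * n_bkg
    bar_sig = "*" * n_sig
    print(f"[{i / 10:.1f}, {(i + 1) / 10:.1f})  background {bar_bkg:<4} signal {bar_sig}")
```

In the printed rows, background piles up in the low-score bins and signal in the high-score bins. The bins where both appear are the overlap region that any cut has to trade off.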

Other Key Metrics

| Metric | Meaning | Ideal Value |
| --- | --- | --- |
| Accuracy | (TP + TN) / Total | 1.0 (but misleading for imbalanced data) |
| Precision | TP / (TP + FP): of predicted signals, how many are true? | 1.0 |
| Recall (TPR) | TP / (TP + FN): of true signals, how many did we find? | 1.0 |
| F1-Score | Harmonic mean of precision and recall | 1.0 |
| AUC-ROC | Area under ROC curve (threshold-independent) | 1.0 |
| AUC-PR | Area under Precision-Recall curve (better for imbalanced data) | 1.0 |
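The count-based metrics in the table can be checked by hand. The sketch below is illustrative only (the function name `classification_metrics` and the event counts are assumptions, not package output). It compares a classifier that labels everything as background with one that actually finds signal, on 1,000 events of which 100 are signal.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Predicts "background" for every event: high accuracy, but no signal found.
trivial = classification_metrics(tp=0, fp=0, tn=900, fn=100)
# Finds 80 of the 100 signal events at the cost of 40 false positives.
useful = classification_metrics(tp=80, fp=40, tn=860, fn=20)

for name, metrics in [("trivial", trivial), ("useful", useful)]:
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

The trivial classifier reaches 0.9 accuracy with zero recall, while the useful one is only slightly more accurate (0.94) yet has recall 0.8 and F1 of about 0.727. This is why accuracy alone is a poor guide on imbalanced data.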

📁 Package Structure

SignalSeeker/
├── src/signalseeker/          # Main package code
│   ├── __init__.py            # Package initialization & exports
│   ├── config.py              # Configuration management
│   ├── data_loader.py         # Synthetic data generation
│   ├── preprocessor.py        # Feature scaling & normalization
│   ├── model_builder.py       # XGBoost model initialization
│   ├── trainer.py             # Training with early stopping
│   ├── tuner.py               # Bayesian hyperparameter optimization
│   ├── metrics.py             # Evaluation metrics
│   ├── visualizer.py          # Publication-quality plots
│   ├── utils.py               # Logging and utilities
│   └── main.py                # Pipeline orchestrator
├── tests/                     # Pytest test suite
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_data_loader.py
│   ├── test_preprocessor.py
│   ├── test_model_builder.py
│   ├── test_trainer.py
│   ├── test_tuner.py
│   ├── test_metrics.py
│   └── test_visualizer.py
├── .github/workflows/         # CI/CD pipelines
│   ├── tests.yml              # Run tests on push/PR
│   └── publish.yml            # Auto-publish to PyPI on release
├── examples/                  # Example scripts and notebooks
│   ├── example_usage.py
│   └── custom_data_example.py
├── README.md                  # This file
├── CONTRIBUTING.md            # Contributing guidelines
├── LICENSE                    # MIT License
├── setup.py                   # Package installation script
├── pyproject.toml             # Modern Python packaging config
├── requirements.txt           # Core dependencies
├── requirements-dev.txt       # Development dependencies
├── requirements-gpu.txt       # GPU support dependencies
└── .pre-commit-config.yaml    # Code quality checks

🎓 Educational Concepts Covered

1. Imbalanced Classification

  • Why accuracy alone is misleading
  • Class weighting and scale_pos_weight
  • Precision vs Recall trade-offs
  • ROC and Precision-Recall curves
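As a concrete illustration of class weighting, XGBoost's `scale_pos_weight` parameter is commonly set to the ratio of negative to positive examples. The helper below is a minimal sketch with a hypothetical name; it is not claimed to be how SignalSeeker computes the value internally.

```python
from typing import Sequence


def scale_pos_weight(labels: Sequence[int]) -> float:
    """Return n_background / n_signal, a common choice for scale_pos_weight."""
    n_signal = sum(1 for label in labels if label == 1)
    n_background = len(labels) - n_signal
    if n_signal == 0:
        raise ValueError("labels contain no signal (positive) examples")
    return n_background / n_signal


# 10% signal, as in the default synthetic dataset.
labels = [1] * 1_000 + [0] * 9_000
weight = scale_pos_weight(labels)
print(weight)  # 9.0
```

With this weight, each signal event counts nine times as much as a background event in the training loss, so the minority class is not drowned out.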

2. Boosting & Decision Trees

  • How gradient boosting works
  • Why boosting is effective for this problem
  • Feature importance in tree ensembles
  • Overfitting and regularization

3. Cross-Validation & Early Stopping

  • Preventing overfitting
  • Validation-based model selection
  • Learning curves and training dynamics
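The early-stopping rule itself is simple enough to sketch without any ML library. The function name and the validation-loss values below are invented for illustration. The rule shown (stop once the validation loss has failed to improve for `patience` consecutive rounds) mirrors the usual meaning of an early-stopping-rounds setting.

```python
from typing import Sequence, Tuple


def early_stopping_round(val_losses: Sequence[float], patience: int) -> Tuple[int, int]:
    """Return (best_round, stopped_round) as zero-based round indices."""
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_loss = float("inf")
    best_round = -1
    for round_index, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened.
            best_loss = loss
            best_round = round_index
        elif round_index - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop training.
            return best_round, round_index
    # Ran out of rounds before the patience was exhausted.
    return best_round, len(val_losses) - 1


# Validation loss falls, then creeps up as the model starts to overfit.
losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.36, 0.39, 0.40]
best, stopped = early_stopping_round(losses, patience=3)
print(best, stopped)  # 3 6
```

Training halts at round 6, and the model from round 3 (the lowest validation loss) is the one to keep.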

4. Hyperparameter Tuning

  • Grid vs Random vs Bayesian search
  • Optuna for efficient optimization
  • Interpreting tuning results
  • Trade-offs between performance and training time
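To make the search strategies less abstract, here is a sketch of random search, the simplest sampling-based strategy of the three. It is deliberately not Optuna and not Bayesian. The objective `toy_objective` is a made-up stand-in for a validation score (in practice each evaluation would train a model), and the search ranges are assumptions.

```python
import math
import random
from typing import Dict, Tuple


def toy_objective(learning_rate: float, max_depth: int) -> float:
    """Made-up validation score that peaks at learning_rate=0.1, max_depth=6."""
    lr_penalty = 0.05 * (math.log10(learning_rate) + 1.0) ** 2
    depth_penalty = 0.002 * (max_depth - 6) ** 2
    return 1.0 - lr_penalty - depth_penalty


def random_search(n_trials: int, seed: int = 0) -> Tuple[Dict[str, float], float]:
    """Sample hyperparameters at random and keep the best-scoring set."""
    rng = random.Random(seed)
    best_params: Dict[str, float] = {}
    best_score = float("-inf")
    for _ in range(n_trials):
        params = {
            # Log-uniform in [0.001, 1]: learning rates span orders of magnitude.
            "learning_rate": 10 ** rng.uniform(-3.0, 0.0),
            "max_depth": rng.randint(3, 10),
        }
        score = toy_objective(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = random_search(n_trials=50)
print(best_params, round(best_score, 4))
```

A Bayesian optimizer differs in one respect: instead of drawing each trial independently, it uses the scores of earlier trials to decide where to sample next, which usually reaches a good region in fewer trials.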

5. Model Evaluation

  • Multiple metrics for imbalanced data
  • ROC curves and operating points
  • Precision-Recall analysis
  • Threshold optimization
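Threshold optimization can be sketched as a scan over candidate cuts. The example below picks the cut that maximizes F1. The function name and toy data are assumptions, and F1 is only one possible target (a physics analysis might optimize a different figure of merit).

```python
from typing import Sequence, Tuple


def best_f1_threshold(labels: Sequence[int], scores: Sequence[float]) -> Tuple[float, float]:
    """Scan every observed score as a cut and return (best_cut, best_f1)."""
    best_cut, best_f1 = 0.0, 0.0
    for cut in sorted(set(scores)):
        pairs = list(zip(labels, scores))
        tp = sum(1 for label, score in pairs if score >= cut and label == 1)
        fp = sum(1 for label, score in pairs if score >= cut and label == 0)
        fn = sum(1 for label, score in pairs if score < cut and label == 1)
        # F1 written in terms of counts: 2TP / (2TP + FP + FN).
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_cut, best_f1 = cut, f1
    return best_cut, best_f1


# Toy validation scores (illustrative only): 1 = signal, 0 = background.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.2, 0.1, 0.1]

cut, f1 = best_f1_threshold(labels, scores)
print(f"best cut={cut}, F1={f1:.3f}")  # best cut=0.7, F1=0.857
```

The cut should be chosen on validation data and then applied unchanged to the test set. Tuning it on the test set would make the reported metrics optimistic.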

6. Signal/Background Separation

  • The "cut" concept
  • Probability score interpretation
  • Acceptance vs Purity trade-off
  • Real-world applications in physics

🖥️ Advanced Usage

Custom Data

import numpy as np
from signalseeker import DataPreprocessingPipeline, DataSplitter
from signalseeker import XGBoostModelBuilder, ModelTrainer
from signalseeker.config import PreprocessConfig, XGBoostConfig

# Load your own data
X = np.load("features.npy")
y = np.load("labels.npy")

# Split the data
splitter = DataSplitter(PreprocessConfig())
X_train, X_val, X_test, y_train, y_val, y_test = splitter.split(X, y)

# Preprocess
preprocessor = DataPreprocessingPipeline(PreprocessConfig())
X_train, X_val, X_test = preprocessor.fit(X_train, X_val, X_test)

# Build and train model
builder = XGBoostModelBuilder(XGBoostConfig())
builder.build_with_class_weights(y_train)
trainer = ModelTrainer(builder)
model, results = trainer.train(X_train, y_train, X_val, y_val)

# Make predictions
predictions = model.predict_proba(X_test)[:, 1]

Hyperparameter Tuning Only

from signalseeker import HyperparameterTuner
from signalseeker.config import TunerConfig, XGBoostConfig

tuner = HyperparameterTuner(TunerConfig(), XGBoostConfig())
results = tuner.optimize(X_train, y_train, n_trials=50)

print(f"Best AUC: {results['best_score']:.4f}")
print(f"Best params: {results['best_params']}")

# Create model with best params
best_model_builder = tuner.create_model_from_best()

Visualization Only

from signalseeker import ModelVisualizer
from signalseeker.config import VisualizerConfig

viz = ModelVisualizer(VisualizerConfig())

# Probability distribution
viz.plot_probability_score_distribution(y_test, predictions)

# ROC curve
viz.plot_roc_curve(y_test, predictions)

# Feature importance
viz.plot_feature_importance(model, top_n=15)

📈 Performance Expectations

On the default 10,000 sample synthetic dataset:

| Metric | Without Tuning | With Tuning |
| --- | --- | --- |
| Test AUC-ROC | ~0.92 | ~0.95 |
| Test AUC-PR | ~0.88 | ~0.92 |
| Test F1-Score | ~0.80 | ~0.85 |
| Training Time | ~2 sec | ~3-5 min |

These figures are approximate. Tuning improves every metric at the cost of longer training time, and the gain can be larger on real-world datasets.

🔧 GPU Acceleration Setup

Prerequisites

  1. NVIDIA GPU with CUDA Compute Capability 3.5+
  2. CUDA Toolkit 10.2+ (check with nvcc --version)
  3. cuDNN is not needed: XGBoost does not use it

Installation

# 1. Install CUDA Toolkit (from NVIDIA website for your OS)

# 2. Verify CUDA installation
nvcc --version

# 3. Install signalseeker with GPU support
pip install -e ".[gpu]"

# OR manually (XGBoost has no "gpu" extra; the standard Linux and Windows wheels include GPU support):
pip install xgboost

Enabling GPU in SignalSeeker

from signalseeker import SignalSeekerPipeline, DEFAULT_CONFIG

config = DEFAULT_CONFIG
config.use_gpu = True  # Enable GPU acceleration
config.xgboost.tree_method = "gpu_hist"  # Use GPU for tree building

pipeline = SignalSeekerPipeline(config)
results = pipeline.run()

Configuration Options

# GPU-specific XGBoost parameters (XGBoost 1.x names).
# XGBoost 2.0+ deprecates these in favour of tree_method="hist" with device="cuda".
config.xgboost.tree_method = "gpu_hist"  # GPU tree building
config.xgboost.gpu_id = 0  # Which GPU to use (if multiple)
config.xgboost.predictor = "gpu_predictor"  # GPU prediction

Troubleshooting GPU

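# Check whether the installed XGBoost build includes CUDA support
python -c "import xgboost as xgb; print(xgb.build_info().get('USE_CUDA'))"

# Test GPU training on a tiny dataset (constructing a model alone never touches the GPU)
python -c "import numpy as np; from xgboost import XGBClassifier; X = np.random.rand(64, 4); y = np.arange(64) % 2; XGBClassifier(tree_method='gpu_hist', n_estimators=2).fit(X, y); print('GPU training succeeded')"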

🧪 Testing

Run the complete test suite:

# Install test dependencies
pip install -r requirements-dev.txt

# Run all tests
pytest

# Run with coverage report
pytest --cov=src/signalseeker --cov-report=html

# Run specific test file
pytest tests/test_model_builder.py

# Run specific test
pytest tests/test_model_builder.py::TestXGBoostModelBuilder::test_initialization

# Run only fast tests
pytest -m "not slow"

# Run in parallel (faster)
pytest -n auto

🔄 CI/CD

This project uses GitHub Actions for continuous integration:

  • Tests: Runs on Python 3.8-3.12 on every push/PR
  • Code Quality: Black formatting, isort, flake8, mypy
  • Coverage: Checks test coverage on push
  • Publishing: Auto-publishes to PyPI on version tag

See .github/workflows/ for configuration.

📚 Documentation

Docstrings

Every module, class, and function includes comprehensive docstrings explaining:

  • What the code does
  • Why each step is necessary
  • How to use it with examples

Read the docstrings in source files for detailed explanations.

Type Hints

All functions include type hints for better IDE support and code clarity.

def compute_class_weight(y: np.ndarray) -> float:
    """Compute scale_pos_weight for class imbalance."""

❓ FAQ

Q: Why is accuracy not a good metric for imbalanced data?
A: With 90% background, a model predicting "all background" gets 90% accuracy without detecting any signal!

Q: What's the difference between AUC-ROC and AUC-PR?
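A: AUC-ROC measures how well the model ranks signal above background, and its false positive rate is diluted by the large number of true negatives when background dominates. AUC-PR ignores true negatives and focuses on the minority class (signal), so it is more informative for imbalanced data.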

Q: Why use early stopping?
A: After some boosting rounds, the model starts overfitting to noise. Early stopping detects when validation loss stops improving and halts training.

Q: Can I use my own data?
A: Yes! Replace the DataLoader with code that loads your data, then use the rest of the pipeline unchanged.

Q: Is GPU acceleration supported?
A: Yes! Set config.use_gpu = True if you have a CUDA-capable GPU.

Q: How do I install this on Windows/Mac/Linux?
A: The installation process is identical across platforms. Use the standard pip commands.

Q: Can I train on larger datasets?
A: Yes! Increase n_samples in the config. For very large datasets (>1M samples), consider using GPU acceleration.

Q: How do I customize the configuration?
A: Edit config.py or pass a modified PipelineConfig object to SignalSeekerPipeline.
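The FAQ answer on AUC-ROC versus AUC-PR can be illustrated numerically. The sketch below uses textbook definitions written in plain Python: AUC-ROC as the probability that a random signal event outscores a random background event, and AUC-PR approximated by average precision. The function names and the toy ranking are assumptions, and tied scores are assumed absent in the average-precision calculation.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (signal, background) pairs ranked correctly; ties count half."""
    signal = [s for s, label in zip(scores, labels) if label == 1]
    background = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for sig_score in signal:
        for bkg_score in background:
            if sig_score > bkg_score:
                wins += 1.0
            elif sig_score == bkg_score:
                wins += 0.5
    return wins / (len(signal) * len(background))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision measured at the rank of each signal event."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` events
    return total / hits


# 20 events, 2 of them signal (10%), ranked 2nd and 4th by score.
labels = [0, 1, 0, 1] + [0] * 16
scores = [1.0 - 0.05 * i for i in range(20)]

print(f"AUC-ROC = {auc_roc(labels, scores):.3f}")            # AUC-ROC = 0.917
print(f"AUC-PR  = {average_precision(labels, scores):.3f}")  # AUC-PR  = 0.500
```

The same ranking scores 0.917 on AUC-ROC but only 0.5 on average precision: two background events near the top barely move the ROC curve, yet they make up half of the events selected along with the signal.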

🤝 Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

📄 License

This project is licensed under the MIT License - see LICENSE file for details.

👤 Author

Amid Nayerhoda
Email: amid.nayerhoda@gmail.com
GitHub: @yourusername

🙏 Acknowledgments

This package was designed as an educational resource to teach:

  • Machine learning best practices
  • Professional Python package development
  • Signal/background separation techniques used in particle physics
  • Real-world ML pipeline design

📞 Support

For issues, questions, or suggestions:

  1. Check the FAQ above
  2. Check existing GitHub Issues
  3. Create a new issue with clear description and example code
  4. Read the docstrings in the source code for detailed explanations

Made with ❤️ for science and education
