Educational binary classification with XGBoost for signal/background separation
SignalSeeker: Educational Binary Classification with XGBoost
SignalSeeker is a comprehensive, production-quality Python package that teaches how to build machine learning pipelines for binary classification using Boosted Decision Trees (XGBoost) in a signal/background separation context, inspired by particle physics applications.
Overview
In particle physics experiments, signal refers to the desired particle interaction you're searching for, while background refers to all other processes that mimic the signal. The challenge is to build a classifier that:
- Identifies signal events with high efficiency (recall)
- Rejects background events so that the selected sample has high purity (precision)
- Provides interpretable probability scores (BDT scores) for decision-making
SignalSeeker teaches all these concepts through a production-quality, fully-featured Python package.
Key Features
- Complete ML Pipeline: Data generation → Preprocessing → Training → Tuning → Evaluation → Visualization
- Realistic Imbalanced Data: Generates synthetic datasets with ~90% background, 10% signal (fully configurable)
- XGBoost Implementation: Industry-standard gradient boosted decision trees with full hyperparameter control
- Hyperparameter Tuning: Bayesian optimization using Optuna for automatic parameter discovery
- Comprehensive Metrics: Accuracy, Precision, Recall, F1, AUC-ROC, AUC-PR, Confusion Matrix
- Publication-Quality Visualizations:
  - Probability Score Distribution: Core visualization showing signal/background separation
  - ROC and Precision-Recall curves
  - Feature importance analysis
  - Confusion matrix heatmap
  - Summary dashboard
- GPU Acceleration: Full CUDA support for XGBoost (optional)
- Professional Package: Installable via pip, PyPI-ready, proper src/ layout
- Comprehensive Testing: Full pytest coverage with CI/CD
- Educational Comments: Detailed docstrings explaining why each step is performed
- Professional Code Quality: Type hints, OOP design, proper error handling, Black-formatted
Installation
Option 1: Development Install (Recommended for Learning)
git clone https://github.com/yourusername/SignalSeeker.git
cd SignalSeeker
pip install -e .
This allows you to modify source files and see changes immediately.
Option 2: Regular Install from GitHub
pip install git+https://github.com/yourusername/SignalSeeker.git
Option 3: Install from PyPI (when published)
pip install signalseeker
Option 4: Install with Development Tools
pip install -e ".[dev]"
Includes testing, linting, and documentation tools.
Option 5: Install with GPU Support
pip install -e ".[gpu]"
Requires CUDA Toolkit to be installed. See GPU Setup Guide.
Option 6: Install from Requirements Files
# Core dependencies only
pip install -r requirements.txt
# With development tools
pip install -r requirements-dev.txt
# With GPU support
pip install -r requirements-gpu.txt
Quick Start
Run the Complete Pipeline
python -m signalseeker.main
Or after installation:
signal-seeker
This will:
- Generate synthetic imbalanced data (10,000 samples, 10% signal)
- Preprocess features (scaling, missing value handling)
- Build and train an XGBoost model
- Evaluate on validation and test sets
- Generate publication-quality visualizations
- Save all results to ./results/run_TIMESTAMP/
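The data-generation step can be pictured with a minimal, standard-library-only sketch. The function name `make_toy_dataset`, the two Gaussian features, and the class means are illustrative assumptions; they are not SignalSeeker's actual generator.

```python
import random


def make_toy_dataset(n_samples=10_000, signal_fraction=0.10, seed=42):
    """Return (features, labels) with a fixed fraction of signal events.

    Hypothetical helper: signal and background are drawn from two
    overlapping Gaussians, so they are separable but not trivially so.
    """
    rng = random.Random(seed)
    n_signal = round(n_samples * signal_fraction)
    labels = [1] * n_signal + [0] * (n_samples - n_signal)
    rng.shuffle(labels)

    features = []
    for label in labels:
        mean = 1.0 if label == 1 else -1.0  # assumed class means
        features.append([rng.gauss(mean, 1.5), rng.gauss(mean, 1.5)])
    return features, labels


X, y = make_toy_dataset()
print(len(X), sum(y) / len(y))  # sample count and signal fraction
```

Fixing the seed keeps the toy dataset reproducible between runs, which makes later metric comparisons meaningful.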
With Hyperparameter Tuning (Better Performance, Slower)
signal-seeker --tune
Uses Bayesian optimization to find optimal hyperparameters (takes ~2-5 minutes).
Custom Configuration
signal-seeker --n-samples 50000 --signal-fraction 0.15 --output-dir ./my_results
Available options:
- --tune: Enable hyperparameter tuning
- --output-dir: Output directory for results
- --n-samples: Total number of samples to generate
- --signal-fraction: Fraction of signal samples (0-1)
Python API Usage
import copy
from signalseeker import SignalSeekerPipeline, DEFAULT_CONFIG
# Use default configuration
pipeline = SignalSeekerPipeline(DEFAULT_CONFIG)
results = pipeline.run(use_tuning=False)
# Or customize configuration (copy first so the shared DEFAULT_CONFIG is not modified)
config = copy.deepcopy(DEFAULT_CONFIG)
config.data.n_samples = 50000
config.data.signal_fraction = 0.15
config.xgboost.max_depth = 8
config.xgboost.learning_rate = 0.05
pipeline = SignalSeekerPipeline(config)
results = pipeline.run(use_tuning=True)
# Access results
val_auc = results["validation_results"]["all_metrics"]["auc_roc"]
test_auc = results["test_results"]["all_metrics"]["auc_roc"]
model = results["model"]
print(f"Validation AUC: {val_auc:.4f}")
print(f"Test AUC: {test_auc:.4f}")
Understanding the Output
The Probability Score Distribution (Most Important Plot)
This is the fundamental visualization in signal/background separation:
[Diagram: signal and background score distributions along an axis from 0.0 to 1.0, with the cut (threshold) marked between them]
- Signal: Concentrated near 1.0 (high probability of being signal)
- Background: Concentrated near 0.0 (low probability of being signal)
- Cut: The threshold above which we classify events as signal
Good Model: Minimal overlap, clear separation
Poor Model: Heavy overlap, hard to separate
This plot is analogous to the "BDT score" or "discriminant" in particle physics.
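The effect of moving the cut can be sketched in a few lines of plain Python. The helper name `apply_cut` and the toy scores are made up for illustration; "signal efficiency" is the fraction of true signal kept, and "background rejection" is the fraction of background removed.

```python
def apply_cut(scores, labels, cut=0.5):
    """Return (signal_efficiency, background_rejection) for a given cut.

    scores: predicted signal probabilities in [0, 1]
    labels: 1 for signal, 0 for background
    """
    sig = [s for s, y in zip(scores, labels) if y == 1]
    bkg = [s for s, y in zip(scores, labels) if y == 0]
    efficiency = sum(s >= cut for s in sig) / len(sig)  # signal kept
    rejection = sum(s < cut for s in bkg) / len(bkg)    # background removed
    return efficiency, rejection


# Toy scores, invented for this example
scores = [0.95, 0.80, 0.40, 0.70, 0.30, 0.10, 0.05, 0.20]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

for cut in (0.25, 0.50, 0.75):
    eff, rej = apply_cut(scores, labels, cut)
    print(f"cut={cut:.2f}  efficiency={eff:.2f}  rejection={rej:.2f}")
```

Raising the cut trades signal efficiency for background rejection, which is the same trade-off the score distribution plot shows visually.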
Other Key Metrics
| Metric | Meaning | Ideal Value |
|---|---|---|
| Accuracy | (TP + TN) / Total | 1.0 (but misleading for imbalanced data) |
| Precision | TP / (TP + FP) - Of predicted signals, how many are true? | 1.0 |
| Recall (TPR) | TP / (TP + FN) - Of true signals, how many did we find? | 1.0 |
| F1-Score | Harmonic mean of precision and recall | 1.0 |
| AUC-ROC | Area under ROC curve (threshold-independent) | 1.0 |
| AUC-PR | Area under Precision-Recall curve (better for imbalanced) | 1.0 |
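The count-based rows of the table can be checked by hand. The sketch below uses a hypothetical helper, `classification_metrics`, and two invented confusion matrices for 1,000 events with 10% signal; undefined ratios are reported as 0.0 by convention.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# A model that predicts "background" for every event
all_background = classification_metrics(tp=0, fp=0, tn=900, fn=100)
# A model that finds 80 of the 100 signal events with 40 false alarms
useful_model = classification_metrics(tp=80, fp=40, tn=860, fn=20)

print(all_background)
print(useful_model)
```

The first model reaches 90% accuracy while finding no signal at all, which is why precision, recall, and F1 are reported alongside accuracy.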
Package Structure
SignalSeeker/
├── src/signalseeker/            # Main package code
│   ├── __init__.py              # Package initialization & exports
│   ├── config.py                # Configuration management
│   ├── data_loader.py           # Synthetic data generation
│   ├── preprocessor.py          # Feature scaling & normalization
│   ├── model_builder.py         # XGBoost model initialization
│   ├── trainer.py               # Training with early stopping
│   ├── tuner.py                 # Bayesian hyperparameter optimization
│   ├── metrics.py               # Evaluation metrics
│   ├── visualizer.py            # Publication-quality plots
│   ├── utils.py                 # Logging and utilities
│   └── main.py                  # Pipeline orchestrator
├── tests/                       # Pytest test suite
│   ├── __init__.py
│   ├── test_config.py
│   ├── test_data_loader.py
│   ├── test_preprocessor.py
│   ├── test_model_builder.py
│   ├── test_trainer.py
│   ├── test_tuner.py
│   ├── test_metrics.py
│   └── test_visualizer.py
├── .github/workflows/           # CI/CD pipelines
│   ├── tests.yml                # Run tests on push/PR
│   └── publish.yml              # Auto-publish to PyPI on release
├── examples/                    # Example scripts and notebooks
│   ├── example_usage.py
│   └── custom_data_example.py
├── README.md                    # This file
├── CONTRIBUTING.md              # Contributing guidelines
├── LICENSE                      # MIT License
├── setup.py                     # Package installation script
├── pyproject.toml               # Modern Python packaging config
├── requirements.txt             # Core dependencies
├── requirements-dev.txt         # Development dependencies
├── requirements-gpu.txt         # GPU support dependencies
└── .pre-commit-config.yaml      # Code quality checks
Educational Concepts Covered
1. Imbalanced Classification
- Why accuracy alone is misleading
- Class weighting and scale_pos_weight
- Precision vs Recall trade-offs
- ROC and Precision-Recall curves
2. Boosting & Decision Trees
- How gradient boosting works
- Why boosting is effective for this problem
- Feature importance in tree ensembles
- Overfitting and regularization
3. Cross-Validation & Early Stopping
- Preventing overfitting
- Validation-based model selection
- Learning curves and training dynamics
4. Hyperparameter Tuning
- Grid vs Random vs Bayesian search
- Optuna for efficient optimization
- Interpreting tuning results
- Trade-offs between performance and training time
5. Model Evaluation
- Multiple metrics for imbalanced data
- ROC curves and operating points
- Precision-Recall analysis
- Threshold optimization
6. Signal/Background Separation
- The "cut" concept
- Probability score interpretation
- Acceptance vs Purity trade-off
- Real-world applications in physics
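Threshold optimization, listed under Model Evaluation above, can be illustrated with a simple scan. This is a sketch under assumptions: the helper names and toy scores are invented, F1 is used as the figure of merit, and in practice the threshold should be chosen on validation data rather than on the test set.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1-score when events with score >= threshold are called signal."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Try every observed score as a candidate cut and keep the best F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Toy validation scores, invented for this example
scores = [0.95, 0.80, 0.40, 0.70, 0.30, 0.10, 0.05, 0.20]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

cut = best_threshold(scores, labels)
print(cut, round(f1_at_threshold(scores, labels, cut), 3))
```

A physics analysis might maximize a different figure of merit, such as expected significance, but the scan itself has the same shape.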
Advanced Usage
Custom Data
import numpy as np
from signalseeker import DataPreprocessingPipeline, DataSplitter
from signalseeker import XGBoostModelBuilder, ModelTrainer
from signalseeker.config import PreprocessConfig, XGBoostConfig
# Load your own data
X = np.load("features.npy")
y = np.load("labels.npy")
# Split the data
splitter = DataSplitter(PreprocessConfig())
X_train, X_val, X_test, y_train, y_val, y_test = splitter.split(X, y)
# Preprocess
preprocessor = DataPreprocessingPipeline(PreprocessConfig())
X_train, X_val, X_test = preprocessor.fit(X_train, X_val, X_test)
# Build and train model
builder = XGBoostModelBuilder(XGBoostConfig())
builder.build_with_class_weights(y_train)
trainer = ModelTrainer(builder)
model, results = trainer.train(X_train, y_train, X_val, y_val)
# Make predictions
predictions = model.predict_proba(X_test)[:, 1]
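The class-weighting step in the example above relates to XGBoost's `scale_pos_weight` parameter. A conventional starting value is the ratio of background to signal counts, which the sketch below computes with the standard library. The helper name is hypothetical, and whether `build_with_class_weights` uses exactly this formula is an assumption.

```python
from collections import Counter


def scale_pos_weight_from_labels(labels):
    """Ratio of negative (background) to positive (signal) sample counts."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("no positive (signal) samples in labels")
    return counts[0] / counts[1]


labels = [0] * 900 + [1] * 100  # 90% background, 10% signal
print(scale_pos_weight_from_labels(labels))  # 9.0
```

With this weight, each signal event counts roughly as much in the loss as nine background events, which offsets the imbalance during training.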
Hyperparameter Tuning Only
from signalseeker import HyperparameterTuner
from signalseeker.config import TunerConfig, XGBoostConfig
tuner = HyperparameterTuner(TunerConfig(), XGBoostConfig())
results = tuner.optimize(X_train, y_train, n_trials=50)
print(f"Best AUC: {results['best_score']:.4f}")
print(f"Best params: {results['best_params']}")
# Create model with best params
best_model_builder = tuner.create_model_from_best()
Visualization Only
from signalseeker import ModelVisualizer
from signalseeker.config import VisualizerConfig
viz = ModelVisualizer(VisualizerConfig())
# Probability distribution
viz.plot_probability_score_distribution(y_test, predictions)
# ROC curve
viz.plot_roc_curve(y_test, predictions)
# Feature importance
viz.plot_feature_importance(model, top_n=15)
Performance Expectations
On the default 10,000 sample synthetic dataset:
| Metric | Without Tuning | With Tuning |
|---|---|---|
| Test AUC-ROC | ~0.92 | ~0.95 |
| Test AUC-PR | ~0.88 | ~0.92 |
| Test F1-Score | ~0.80 | ~0.85 |
| Training Time | ~2 sec | ~3-5 min |
Tuning improves all three metrics on this synthetic dataset at the cost of a much longer run. The gain on real-world datasets will vary and can be larger.
GPU Acceleration Setup
Prerequisites
- NVIDIA GPU with CUDA Compute Capability 3.5+
- CUDA Toolkit 10.2+ (check with nvcc --version)
- cuDNN (optional, improves performance)
Installation
# 1. Install CUDA Toolkit (from NVIDIA website for your OS)
# 2. Verify CUDA installation
nvcc --version
# 3. Install signalseeker with GPU support
pip install -e ".[gpu]"
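# OR manually (the standard xgboost wheels on PyPI for Linux and Windows already include CUDA support; xgboost has no separate "gpu" extra):
pip install xgboost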
Enabling GPU in SignalSeeker
from signalseeker import SignalSeekerPipeline, DEFAULT_CONFIG
config = DEFAULT_CONFIG
config.use_gpu = True # Enable GPU acceleration
config.xgboost.tree_method = "gpu_hist" # Use GPU for tree building
pipeline = SignalSeekerPipeline(config)
results = pipeline.run()
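When the same script has to run on machines with and without a GPU, a small fallback can choose the tree method. The helper `choose_tree_method` is hypothetical and not part of SignalSeeker; it follows this README's `"gpu_hist"` string and uses the presence of `nvidia-smi` as a rough heuristic only.

```python
import shutil


def choose_tree_method(prefer_gpu=True):
    """Pick a tree_method string, falling back to CPU when no NVIDIA tooling is found.

    Finding `nvidia-smi` on PATH is only a heuristic: it does not prove that
    the installed XGBoost build was compiled with CUDA support.
    """
    if prefer_gpu and shutil.which("nvidia-smi") is not None:
        return "gpu_hist"
    return "hist"


tree_method = choose_tree_method()
print(f"Using tree_method={tree_method!r}")
```

The returned string could then be assigned to `config.xgboost.tree_method` in place of a hard-coded value.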
Configuration Options
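# GPU-specific XGBoost parameters (note: XGBoost 2.0+ deprecates gpu_hist, gpu_id, and gpu_predictor in favor of tree_method="hist" with device="cuda")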
config.xgboost.tree_method = "gpu_hist" # GPU tree building
config.xgboost.gpu_id = 0 # Which GPU to use (if multiple)
config.xgboost.predictor = "gpu_predictor" # GPU prediction
Troubleshooting GPU
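# Check whether the installed XGBoost build was compiled with CUDA support (look for USE_CUDA)
python -c "import xgboost as xgb; print(xgb.build_info())"
# Test GPU acceleration by fitting a tiny model (constructing the classifier alone never touches the GPU)
python -c "import numpy as np; from xgboost import XGBClassifier; XGBClassifier(tree_method='gpu_hist', n_estimators=2).fit(np.random.rand(32, 4), np.arange(32) % 2); print('GPU training works')"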
Testing
Run the complete test suite:
# Install test dependencies
pip install -r requirements-dev.txt
# Run all tests
pytest
# Run with coverage report
pytest --cov=src/signalseeker --cov-report=html
# Run specific test file
pytest tests/test_model_builder.py
# Run specific test
pytest tests/test_model_builder.py::TestXGBoostModelBuilder::test_initialization
# Run only fast tests
pytest -m "not slow"
# Run in parallel (faster)
pytest -n auto
CI/CD
This project uses GitHub Actions for continuous integration:
- Tests: Runs on Python 3.8-3.12 on every push/PR
- Code Quality: Black formatting, isort, flake8, mypy
- Coverage: Checks test coverage on push
- Publishing: Auto-publishes to PyPI on version tag
See .github/workflows/ for configuration.
Documentation
Docstrings
Every module, class, and function includes comprehensive docstrings explaining:
- What the code does
- Why each step is necessary
- How to use it with examples
Read the docstrings in source files for detailed explanations.
Type Hints
All functions include type hints for better IDE support and code clarity.
def compute_class_weight(y: np.ndarray) -> float:
    """Compute scale_pos_weight for class imbalance."""
FAQ
Q: Why is accuracy not a good metric for imbalanced data?
A: With 90% background, a model predicting "all background" gets 90% accuracy without detecting any signal!
Q: What's the difference between AUC-ROC and AUC-PR?
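A: AUC-ROC summarizes the trade-off between true positive rate and false positive rate, while AUC-PR summarizes the trade-off between precision and recall. For imbalanced data, AUC-PR is more informative because it focuses on the minority class (signal).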
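A small experiment makes the difference concrete. The sketch below is plain Python with invented scores; it uses average precision as a common estimate of AUC-PR, and it assumes no signal event has exactly the same score as a background event.

```python
def roc_auc(scores, labels):
    """Probability that a random signal event outscores a random background event."""
    sig = [s for s, y in zip(scores, labels) if y == 1]
    bkg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((s > b) + 0.5 * (s == b) for s in sig for b in bkg)
    return wins / (len(sig) * len(bkg))


def average_precision(scores, labels):
    """Mean of the precision measured at each signal event, ranked by score.

    Requires at least one signal event.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


sig_scores = [0.9, 0.6]
bkg_scores = [0.7, 0.3, 0.2]

# Same score distributions, but the second pass has 10x more background
for n_copies in (1, 10):
    scores = sig_scores + bkg_scores * n_copies
    labels = [1] * len(sig_scores) + [0] * (len(bkg_scores) * n_copies)
    print(n_copies,
          round(roc_auc(scores, labels), 3),
          round(average_precision(scores, labels), 3))
```

Multiplying the background leaves AUC-ROC unchanged but lowers average precision, because precision depends on how many background events pass the cut in absolute terms.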
Q: Why use early stopping?
A: After some boosting rounds, the model starts overfitting to noise. Early stopping detects when validation loss stops improving and halts training.
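The stopping rule can be sketched on a list of per-round validation losses. The function name, the `patience` parameter, and the loss values are illustrative; XGBoost exposes a comparable setting as `early_stopping_rounds`.

```python
def early_stopping_round(val_losses, patience=3):
    """Return the index of the best round seen before stopping.

    Training stops once the validation loss has not improved for
    `patience` consecutive rounds.
    """
    best_round, best_loss = 0, float("inf")
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round


# Invented validation losses: improving, then drifting upward
val_losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.37, 0.39, 0.34, 0.33]
print(early_stopping_round(val_losses, patience=3))
```

With `patience=3` the loop stops before it ever sees the later improvement at the end of the list, which shows why the patience value is itself a trade-off between training time and final quality.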
Q: Can I use my own data?
A: Yes! Replace the DataLoader with code that loads your data, then use the rest of the pipeline unchanged.
Q: Is GPU acceleration supported?
A: Yes! Set config.use_gpu = True if you have a CUDA-capable GPU.
Q: How do I install this on Windows/Mac/Linux?
A: The installation process is identical across platforms. Use the standard pip commands.
Q: Can I train on larger datasets?
A: Yes! Increase n_samples in the config. For very large datasets (>1M samples), consider using GPU acceleration.
Q: How do I customize the configuration?
A: Edit config.py or pass a modified PipelineConfig object to SignalSeekerPipeline.
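When passing a modified configuration, it helps to work on a copy so that the shared default stays untouched. The sketch below uses stand-in dataclasses to show the pitfall; the class layout is an assumption for illustration and may differ from SignalSeeker's real `PipelineConfig`.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class DataConfig:  # hypothetical stand-in, not SignalSeeker's class
    n_samples: int = 10_000
    signal_fraction: float = 0.10


@dataclass
class PipelineConfig:  # hypothetical stand-in
    data: DataConfig = field(default_factory=DataConfig)


DEFAULT = PipelineConfig()

alias = DEFAULT                  # same object: edits leak into DEFAULT
alias.data.n_samples = 50_000
print(DEFAULT.data.n_samples)    # 50000

DEFAULT = PipelineConfig()       # reset for the second half of the demo
custom = copy.deepcopy(DEFAULT)  # independent copy
custom.data.n_samples = 50_000
print(DEFAULT.data.n_samples)    # 10000
```

A deep copy is used because the nested `data` object would still be shared after a shallow copy.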
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
License
This project is licensed under the MIT License - see LICENSE file for details.
Author
Amid Nayerhoda
Email: amid.nayerhoda@gmail.com
GitHub: @yourusername
Acknowledgments
This package was designed as an educational resource to teach:
- Machine learning best practices
- Professional Python package development
- Signal/background separation techniques used in particle physics
- Real-world ML pipeline design
Support
For issues, questions, or suggestions:
- Check the FAQ above
- Check existing GitHub Issues
- Create a new issue with clear description and example code
- Read the docstrings in the source code for detailed explanations
Made with ❤️ for science and education