Comprehensive diagnostic and evaluation framework for quantitative finance ML workflows

These details have not been verified by PyPI

Project links

Project description

ML4T Diagnostic: Comprehensive Diagnostics for Quantitative Finance

Statistical rigor meets actionable insights for ML trading strategies

What is ML4T Diagnostic?

ML4T Diagnostic is a comprehensive evaluation library for quantitative trading strategies, spanning the entire ML workflow from feature analysis to portfolio performance.

Key Improvements

Capability	What's New
Performance	Polars-powered for 10-100x faster analysis
Visualizations	Interactive Plotly charts
Insights	Auto-interpretation with warnings
Statistics	DSR, CPCV, RAS, PBO, FDR corrections
Exploratory	Stationarity, ACF, volatility, distribution tests
Signal Analysis	Multi-signal comparison and selection
Trade Diagnostics	SHAP-based error pattern discovery
Binary Metrics	Precision, recall, lift, coverage with Wilson intervals
Threshold Analysis	Threshold sweep, optimization, monotonicity checks

Quick Start

Installation

# Core library
pip install ml4t-diagnostic

# With ML dependencies (for SHAP, importance, interactions)
pip install ml4t-diagnostic[ml]

# With visualization (for interactive reports)
pip install ml4t-diagnostic[viz]

# Everything (ML + viz + dashboard)
pip install ml4t-diagnostic[all]

Example 1: Trade Diagnostics

Close the ML→Trading feedback loop: Understand why specific trades fail and get actionable improvement suggestions.

from ml4t.diagnostic.evaluation import TradeAnalysis, TradeShapAnalyzer
from ml4t.diagnostic.config import TradeShapConfig

# 1. Identify worst trades from backtest
analyzer = TradeAnalysis(trade_records)
worst_trades = analyzer.worst_trades(n=20)

# 2. Explain with SHAP
config = TradeShapConfig.for_quick_diagnostics()
shap_analyzer = TradeShapAnalyzer(
    model=trained_model,
    features_df=features_df,  # Features with timestamps
    shap_values=shap_values,   # Precomputed SHAP values
    config=config
)

# 3. Discover error patterns
result = shap_analyzer.explain_worst_trades(worst_trades)

# 4. Get actionable hypotheses
for pattern in result.error_patterns:
    print(f"Pattern {pattern.cluster_id}: {pattern.hypothesis}")
    print(f"  Actions: {pattern.actions}")
    print(f"  Confidence: {pattern.confidence:.2%}")
    print(f"  Potential savings: ${pattern.potential_impact:,.2f}")

Output example:

Pattern 1: High momentum + Low volatility → Reversals
  Actions: ['Add volatility regime filter', 'Shorten holding period in low vol']
  Confidence: 85%
  Potential savings: $12,450.00

Pattern 2: Low liquidity + Wide spreads → Poor execution
  Actions: ['Add minimum liquidity filter', 'Widen entry criteria']
  Confidence: 78%
  Potential savings: $8,230.00

See examples/trade_diagnostics_example.ipynb for complete end-to-end workflow.

Example 2: Feature Importance Analysis

import polars as pl
from ml4t.diagnostic.evaluation import analyze_ml_importance

# Your data
X = pl.read_parquet("features.parquet")
y = pl.read_parquet("labels.parquet")

# Analyze feature importance (combines MDI, PFI, MDA, SHAP)
results = analyze_ml_importance(model, X, y)

# Get consensus ranking
print(results.consensus_ranking)
# [('momentum', 1.2), ('volatility', 2.1), ...]

# Check warnings
print(results.warnings)
# ["High SHAP importance but low PFI for 'spread' - possible overfitting"]

# Get interpretation
print(results.interpretation)
# "Strong consensus across methods. Top 3 features: momentum, volatility, volume..."

Example 3: Feature Interactions

from ml4t.diagnostic.evaluation import analyze_interactions

# Detect feature interactions (Conditional IC, H-stat, SHAP)
results = analyze_interactions(model, X, y)

# Top interactions by consensus
print(results.top_interactions_consensus)
# [('momentum', 'volatility'), ('volume', 'spread'), ...]

# Method agreement
print(results.method_agreement)
# {('h_statistic', 'shap'): 0.85, ...}  # High agreement = robust finding

Example 4: Statistical Validation (DSR)

from ml4t.diagnostic.evaluation import stats

# Your backtest results
returns = strategy.compute_returns()

# Statistical validation with multiple testing correction
dsr_result = stats.compute_dsr(
    returns=returns,
    benchmark_sr=0.0,
    n_trials=100,  # Number of strategies tested
    expected_max_sharpe=1.5
)

print(f"Sharpe Ratio: {dsr_result['sr']:.2f}")
print(f"Deflated Sharpe: {dsr_result['dsr']:.2f}")  # Accounts for multiple testing
print(f"p-value: {dsr_result['pval']:.4f}")
print(f"Significant: {dsr_result['is_significant']}")

Example 5: Binary Classification Metrics

Evaluate discrete trading signals with proper statistical inference:

from ml4t.diagnostic.evaluation import (
    binary_classification_report,
    precision, recall, lift, coverage, f1_score,
    wilson_score_interval,
    binomial_test_precision,
)

# Your signals and outcomes
signals = momentum > threshold  # Binary signals
labels = forward_returns > 0     # Binary outcomes (profitable or not)

# Comprehensive report with confidence intervals
report = binary_classification_report(signals, labels)

print(f"Precision: {report['precision']:.2%} ± {report['precision_ci_width']:.2%}")
print(f"Lift: {report['lift']:.2f}x (vs random)")
print(f"Coverage: {report['coverage']:.1%} of observations")
print(f"Statistically significant: {report['binomial_pvalue'] < 0.05}")

Key metrics:

Precision: When you signal, how often are you right?
Lift: How much better than random selection?
Coverage: What fraction of time are you in a position?
Wilson interval: Accurate confidence bounds for proportions

Example 6: Threshold Optimization

Find optimal signal thresholds with train-only selection:

from ml4t.diagnostic.evaluation import (
    evaluate_threshold_sweep,
    find_optimal_threshold,
    check_monotonicity,
)

# Sweep thresholds and compute metrics at each
results = evaluate_threshold_sweep(
    indicator=momentum_values,
    label=future_profitable,
    thresholds=[0.1, 0.3, 0.5, 0.7, 0.9],
    direction='above'
)

# Find optimal with constraints
optimal = find_optimal_threshold(
    indicator=momentum_values,
    label=future_profitable,
    metric="f1_score",
    min_coverage=0.02,           # At least 2% signal frequency
    require_significant=True     # Must pass binomial test
)

print(f"Optimal threshold: {optimal['threshold']:.2f}")
print(f"F1 Score: {optimal['f1_score']:.2%}")

# Check if relationship is monotonic (good) or non-monotonic (investigate)
mono = check_monotonicity(results, metric="precision")
print(f"Monotonicity score: {mono['score']:.2f}")

Critical: Use train-only threshold selection in cross-validation to prevent leakage.

Library Overview

ML4T Diagnostic provides three complementary capabilities across four application domains:

Three Pillars of Analysis

Pillar	Purpose	Examples
Explore	Understand patterns before modeling	Stationarity tests, ACF/PACF, distribution analysis
Validate	Test significance and prevent overfitting	DSR, CPCV, RAS, FDR corrections
Visualize	Communicate findings effectively	Interactive Plotly charts, dashboards, reports

Four Application Domains

Domain	Stage	Key Classes
Features & Data	Pre-modeling	`FeatureDiagnostics`, `analyze_stationarity()`
Signals & Models	Modeling	`SignalAnalysis`, `MultiSignalAnalysis`
Trades & Backtest	Post-modeling	`TradeAnalysis`, `TradeShapAnalyzer`
Portfolio	Production	`PortfolioAnalysis`, rolling metrics

This architecture ensures you can explore, validate, and visualize at every stage of the ML workflow.

Architecture: Four-Tier Diagnostic Framework

ML4T Diagnostic covers four tiers of the quantitative workflow:

┌──────────────────────────────────────────────────────────────┐
│ Tier 1: Feature Analysis (Pre-Modeling)                     │
│ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ • Time series diagnostics (stationarity, ACF, volatility)    │
│ • Distribution analysis (moments, normality, tails)          │
│ • Feature-outcome predictiveness (IC, MI, quantiles)         │
│ • Feature importance (MDI, PFI, MDA, SHAP consensus)         │
│ • Feature interactions (Conditional IC, H-stat, SHAP)        │
│ • Drift detection (PSI, domain classifier)                   │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Tier 2: Signal Analysis (Model Outputs)                     │
│ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ • IC analysis (time series, histogram, heatmap)              │
│ • Quantile returns (bar, violin, cumulative)                 │
│ • Turnover analysis (top/bottom basket, autocorrelation)     │
│ • Multi-signal comparison and ranking                        │
│ • Signal selection framework                                 │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Tier 3: Backtest Analysis (Post-Modeling)                   │
│ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ • Trade analysis (win/loss, PnL, holding periods)            │
│ • Statistical validity (DSR, RAS, PBO, FDR corrections)      │
│ • Trade-SHAP diagnostics (error pattern discovery)           │
│ • Excursion analysis (TP/SL parameter optimization)          │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Tier 4: Portfolio Analysis (Production)                     │
│ ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ │
│ • Performance metrics (Sharpe, Sortino, Calmar, Omega)       │
│ • Drawdown analysis (underwater curve, top drawdowns)        │
│ • Rolling metrics (Sharpe, volatility, beta windows)         │
│ • Risk metrics (VaR, CVaR, tail ratio)                       │
│ • Monthly/annual returns visualization                       │
└──────────────────────────────────────────────────────────────┘

See docs/architecture.md for complete technical details.

Key Features

Trade-Level Diagnostics

Connect SHAP explanations to trade outcomes for systematic continuous improvement.

Core workflow:

Extract worst trades from backtest
Align SHAP values to trade entry timestamps
Cluster trades by SHAP similarity (hierarchical clustering)
Generate actionable hypotheses for improvement
Iterate: Adjust features/model based on insights

Benefits:

Systematic debugging: Understand exactly why trades fail
Pattern discovery: Find recurring error modes
Actionable insights: Get specific improvement suggestions
Continuous improvement: Close the ML→trading feedback loop

Performance (10-100x Faster)

Polars + Numba optimization for blazing fast analysis:

Operation	Dataset	Time
5-fold CV	1M rows	<10 seconds
Feature importance	100 features	<5 seconds
CPCV backtest	100K bars	<30 seconds
DSR calculation	252 returns	<50ms

Interactive Visualizations

Modern Plotly charts (not outdated matplotlib):

Hover for details
Zoom and pan
Responsive design
Publication-ready
Export to HTML/PDF

Auto-Interpretation

Human-readable insights, not just numbers:

results.warnings
# ["High Conditional IC but low H-statistic for (momentum, volatility)",
#  "Suggests regime-specific interaction - investigate market conditions"]

results.interpretation
# "Strong consensus across 3 methods. Top interaction: momentum × volatility.
#  High agreement (Spearman 0.85+) indicates robust finding."

Advanced Statistics

State-of-the-art methods from López de Prado and others:

DSR (Deflated Sharpe Ratio) - Corrects for multiple testing
CPCV (Combinatorial Purged Cross-Validation) - Leak-free validation
RAS (Rademacher Anti-Serum) - Backtest overfitting detection
PBO (Probability of Backtest Overfitting) - Overfitting probability
HAC-adjusted IC - Autocorrelation-robust information coefficient
FDR control (Benjamini-Hochberg) - Multiple comparisons
SHAP-based diagnostics - Trade-level error analysis

Time Series Diagnostics

Understand your data before making decisions:

from ml4t.diagnostic.evaluation import (
    analyze_stationarity,
    analyze_autocorrelation,
    analyze_volatility,
    analyze_distribution,
)

# Stationarity: ADF, KPSS, Phillips-Perron with consensus
result = analyze_stationarity(returns)
print(f"Consensus: {result.consensus}")  # 'stationary', 'non_stationary', 'inconclusive'
print(f"ADF p-value: {result.adf_result.pvalue:.4f}")

# Autocorrelation: ACF/PACF with significance bands
acf_result = analyze_autocorrelation(returns, nlags=20)
print(f"Significant lags: {acf_result.significant_lags}")

# Volatility: ARCH-LM test, GARCH(1,1) fitting
vol_result = analyze_volatility(returns)
print(f"ARCH effects: {vol_result.has_arch_effects}")

# Distribution: moments, normality, tail analysis
dist_result = analyze_distribution(returns)
print(f"Skewness: {dist_result.skewness:.3f}")
print(f"Jarque-Bera p-value: {dist_result.jb_pvalue:.4f}")

Signal Analysis

Full signal evaluation framework:

from ml4t.diagnostic.evaluation import SignalAnalysis, MultiSignalAnalysis

# Single signal analysis
signal_analyzer = SignalAnalysis(
    signal=factor_data,
    returns=forward_returns,
    periods=[1, 5, 21],  # 1D, 1W, 1M
)

# IC analysis with HAC adjustment
ic_result = signal_analyzer.compute_ic_analysis()
print(f"IC Mean: {ic_result.ic_mean:.4f}")
print(f"IC IR: {ic_result.ic_ir:.4f}")
print(f"HAC t-stat: {ic_result.hac_tstat:.2f}")

# Quantile returns
quantile_result = signal_analyzer.compute_quantile_analysis()
print(f"Q5-Q1 spread: {quantile_result.spread:.2%}")

# Turnover analysis
turnover = signal_analyzer.compute_turnover_analysis()

# Multi-signal comparison
multi_analyzer = MultiSignalAnalysis(signals_dict, returns)
ranking = multi_analyzer.rank_signals(metric='ic_ir')

Portfolio Analysis

Full portfolio tear sheet with modern visualizations:

from ml4t.diagnostic.evaluation import PortfolioAnalysis

# Initialize with returns
portfolio = PortfolioAnalysis(returns, benchmark=spy_returns)

# Summary statistics
metrics = portfolio.compute_summary_stats()
print(f"Sharpe: {metrics.sharpe_ratio:.2f}")
print(f"Sortino: {metrics.sortino_ratio:.2f}")
print(f"Calmar: {metrics.calmar_ratio:.2f}")
print(f"Omega: {metrics.omega_ratio:.2f}")
print(f"Max Drawdown: {metrics.max_drawdown:.2%}")

# Rolling metrics
rolling = portfolio.compute_rolling_metrics(window=252)
rolling_df = rolling.to_dataframe()  # rolling Sharpe, vol, beta

# Drawdown analysis
drawdowns = portfolio.compute_drawdown_analysis(top_n=5)
print(f"Worst drawdown: {drawdowns.max_drawdown:.2%}")
print(f"Recovery days: {drawdowns.max_duration}")

# Generate tear sheet
portfolio.generate_tear_sheet()  # Interactive Plotly dashboard

Seamless Integration

Works with your existing tools:

# Supports pandas, polars, numpy
X_pandas = pd.DataFrame(...)
X_polars = pl.DataFrame(...)
X_numpy = np.array(...)

# All work seamlessly
analyze_ml_importance(model, X_pandas, y)
analyze_ml_importance(model, X_polars, y)
analyze_ml_importance(model, X_numpy, y)

Integrates with popular backtesting engines:

ml4t-backtest (native support)
zipline-reloaded (via adapter)
VectorBT (via adapter)
Custom engines (implement TradeRecord schema)

Modular Design

Like AlphaLens, every function works standalone or composed:

# Use individual metrics
from ml4t.diagnostic.evaluation import compute_ic_series, compute_h_statistic

ic = compute_ic_series(features, returns)
h_stat = compute_h_statistic(model, X)

# Or use tear sheets (combines multiple metrics)
from ml4t.diagnostic.evaluation import analyze_ml_importance

importance = analyze_ml_importance(model, X, y)
# → Combines MDI, PFI, MDA, SHAP
# → Consensus ranking
# → Warnings and interpretation

# Or use full workflow
from ml4t.diagnostic.evaluation import TradeShapAnalyzer

analyzer = TradeShapAnalyzer(model, features_df, shap_values, config)
result = analyzer.explain_worst_trades(worst_trades)
# → Trade analysis + SHAP + clustering + hypotheses

Documentation

User Guides

Trade Diagnostics Example - Complete end-to-end tutorial
Architecture Guide - Technical deep dive
Visualization Strategy - Plotly + reporting
Data Schemas - Integration contracts
Installation Guide - Detailed installation options

Academic References

Academic References - Comprehensive citations for all implemented methods

Integration Guides

Book Integration - ML4T 3rd Edition alignment
Backtest Integration - Backtesting engine
Engineer Integration - Feature engineering

Technical Documentation

Optional Dependencies - ML libraries and graceful degradation
Dashboard Guide - Interactive Streamlit dashboard
Error Handling - Best practices
Logging - Structured logging with structlog

Migration

Migration Guide - Upgrade from ml4t.evaluation to ml4t.diagnostic

Optional Dependencies

ML4T Diagnostic is designed with minimal required dependencies. Optional ML libraries enhance functionality but are NOT required:

Available Features:

Core Analysis - Always available (IC, statistics, distributions, DSR, RAS)
ML Importance - Requires lightgbm or xgboost
SHAP Analysis - Requires shap (interpretability)
Deep Learning (v1.1+) - Requires tensorflow or pytorch
GPU Acceleration (v1.1+) - Requires cupy
Dashboards - Requires streamlit (interactive viz)

Quick Check:

from ml4t.diagnostic.utils import get_dependency_summary
print(get_dependency_summary())

Installation Options:

# Core library (no ML dependencies)
pip install ml4t-diagnostic

# Standard ML support (Tree, Linear, Kernel explainers)
pip install ml4t-diagnostic[ml]      # LightGBM, XGBoost, SHAP

# Neural network support (adds Deep explainer)
pip install ml4t-diagnostic[deep]    # + TensorFlow

# GPU acceleration (10-50x speedup for large datasets)
pip install ml4t-diagnostic[gpu]     # + cupy

# Visualization and dashboards
pip install ml4t-diagnostic[viz]     # + Plotly, Streamlit

# Everything (all explainers + GPU + viz)
pip install ml4t-diagnostic[all-ml]  # ml + deep + gpu
pip install ml4t-diagnostic[all]     # all-ml + viz

Explainer Availability (v1.1):

Explainer	Dependency Group	Required Packages
TreeExplainer	`[ml]`	shap, lightgbm/xgboost
LinearExplainer	`[ml]`	shap, scikit-learn
KernelExplainer	`[ml]`	shap, scikit-learn
DeepExplainer	`[deep]`	shap, tensorflow or pytorch
GPU Support	`[gpu]`	cupy

Graceful Degradation: Missing dependencies trigger clear warnings, not crashes. See docs/OPTIONAL_DEPENDENCIES.md for details.

API Stability

ML4T Diagnostic follows Semantic Versioning.

Version Type	API Changes	Examples
Patch (1.3.x)	Bug fixes only	Performance improvements, docs
Minor (1.x.0)	Backward compatible	New features, new config options
Major (x.0.0)	Breaking changes	Removed functions, renamed params

Public API: Everything in __all__ exports is considered stable. Internal modules (prefixed with _) may change without notice.

Current Stability: As of v1.3.0, the API is considered stable for production use.

Development Status

Current: v0.1.0a1

v1.3 - Module Decomposition & UX Improvements

Major Feature: Large monolithic modules decomposed into focused submodules for better maintainability.

Key improvements:

Module Decomposition: 5 large modules (~12,000 lines) split into focused submodules
- metrics.py (5,643 lines) → 13 modules in metrics/
- distribution.py, drift.py, stationarity.py → dedicated packages
ValidatedCrossValidation: One-step CPCV + DSR validation (20 lines → 5 lines)
Result.interpret(): Human-readable insights on all key result classes
Data Quality Integration: DataQualityReport contract with ml4t-data
Backward Compatible: All old imports still work via __init__.py exports
Type Stubs: Added py.typed marker for better IDE support

New Usage Pattern:

from ml4t.diagnostic import ValidatedCrossValidation

# One-step validated cross-validation (combines CPCV + DSR)
vcv = ValidatedCrossValidation(n_splits=10, embargo_pct=0.01)
result = vcv.fit_evaluate(X, y, model, times=times)

if result.is_significant:
    print(f"Strategy passes DSR at {result.significance_level:.0%} confidence")
    print(result.summary())
else:
    print("Strategy may be overfit - DSR test failed")
    for insight in result.interpretation:
        print(f"  • {insight}")

v1.2 - Configuration Consolidation

Major Feature: Reduced 61+ config classes to 10 primary configs with single-level nesting.

Key improvements:

Config Consolidation: DiagnosticConfig, StatisticalConfig, TradeConfig, etc.
Single-Level Nesting: config.stationarity.enabled (not deeply nested)
Presets Preserved: for_quick_analysis(), for_research(), for_production()
Backward Compatible: Old class names work as deprecated aliases

v1.1 - Model-Agnostic SHAP Support

Major Feature: SHAP importance now works with ANY sklearn-compatible model, not just tree models!

Key improvements:

Multi-Explainer Support: Auto-selects best explainer (Tree, Linear, Kernel, Deep)
Universal Compatibility: Works with SVM, KNN, neural networks, ANY model
Smart Performance: Automatic cascade (Tree → Linear → Kernel)
GPU Acceleration: Optional GPU support for large datasets
Backward Compatible: 100% compatible with v1.0 API

Explainer Comparison:

Explainer	Models	Speed	Quality	Installation
Tree	LightGBM, XGBoost, RF	<10ms/sample	Exact	`[ml]`
Linear	LogisticReg, Ridge, Lasso	<100ms/sample	Exact	`[ml]`
Deep	TensorFlow, PyTorch	<500ms/sample	Approx	`[deep]`
Kernel	ANY sklearn model	100-5000ms/sample	Approx	`[ml]`

Installation:

# Standard ML support (Tree, Linear, Kernel explainers)
pip install ml4t-diagnostic[ml]

# Neural network support (adds Deep explainer)
pip install ml4t-diagnostic[deep]

# GPU acceleration (10-50x speedup for large datasets)
pip install ml4t-diagnostic[gpu]

# Everything (all explainers + GPU)
pip install ml4t-diagnostic[all-ml]

Migration from v1.0:

Zero changes required - All v1.0 code works unchanged
Auto-selection - Tree models automatically use TreeExplainer
New models supported - Linear and other models now work automatically
Explicit control - Set explainer_type='kernel' to force specific explainer
Check explainer - Use result['explainer_type'] to see which was used

Example (New in v1.1):

from sklearn.svm import SVC
from ml4t.diagnostic.evaluation import compute_shap_importance

# Train ANY model (SVM example - not supported in v1.0!)
model = SVC(kernel='rbf', probability=True)
model.fit(X_train, y_train)

# Compute SHAP importance (auto-selects KernelExplainer)
result = compute_shap_importance(model, X_test, max_samples=100)
print(f"Explainer used: {result['explainer_type']}")  # 'kernel'

# Works with linear models too
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
result = compute_shap_importance(model, X_test)  # Auto-selects LinearExplainer

v1.0 - Trade Diagnostics Framework

Trade analysis framework (TradeAnalysis, TradeMetrics)
Trade-SHAP diagnostics (TradeShapAnalyzer)
Error pattern clustering (hierarchical clustering)
Hypothesis generation (rule-based templates)
Interactive dashboard (Streamlit)
Feature importance analysis (MDI, PFI, MDA, SHAP consensus)
Feature interactions (Conditional IC, H-statistic, SHAP)
Statistical framework (CPCV, DSR, RAS, FDR, HAC-adjusted IC)
Time-series cross-validation (purging, embargo)
Comprehensive example notebook

Roadmap

v0.1: Alpha release - Core diagnostics framework
v0.2: Event studies and barrier analysis
v1.0: Full book integration (ML4T 3rd Edition)

Performance Benchmarks

Rigorous time-series validation (After Numba optimization):

Operation	Dataset Size	Time	vs Pandas
Maximum Drawdown	10K points	2ms	6x faster
Block Bootstrap	100K samples	30ms	5x faster
Rolling Sharpe	50K window	8ms	12x faster
Information Coefficient	1M points	10ms	5x faster
DSR Calculation	252 returns	50ms	10x faster

Target achieved: 5-fold CV on 1M rows < 10 seconds

Leakage Prevention

Information leakage in validation causes inflated performance estimates. ML4T Diagnostic provides tools to prevent common validation pitfalls:

1. Cross-Validation Leakage

Wrong: Using standard k-fold on time-series data

# BAD - future data leaks into training
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
for train, test in kf.split(X):
    model.fit(X[train], y[train])  # WRONG: Train may contain future data

Right: Purged walk-forward or CPCV

# GOOD - proper temporal separation with purging
from ml4t.diagnostic.splitters import PurgedWalkForwardCV

cv = PurgedWalkForwardCV(
    n_splits=5,
    embargo_pct=0.01,  # Gap between train/test
    purge_pct=0.02     # Remove overlapping labels
)
for train, test in cv.split(X, y, times):
    model.fit(X[train], y[train])  # Strictly past data only

2. Threshold Selection Leakage

Wrong: Optimizing thresholds on full dataset

# BAD - uses test data to select threshold
from sklearn.metrics import f1_score
best_threshold = max(thresholds, key=lambda t: f1_score(y, pred > t))  # WRONG

Right: Train-only threshold optimization

# GOOD - optimize on training fold only
from ml4t.diagnostic.evaluation import find_optimal_threshold

for train_idx, test_idx in cv.split(X, y, times):
    # Find optimal threshold using ONLY training data
    optimal = find_optimal_threshold(
        indicator=predictions[train_idx],
        label=y[train_idx],
        metric="f1_score",
        min_coverage=0.02
    )
    # Apply to test set
    test_signals = predictions[test_idx] > optimal['threshold']  # OK

3. Multiple Testing Correction

Wrong: Ignoring number of strategies tested

# BAD - reports raw Sharpe without correction
sharpe = returns.mean() / returns.std() * np.sqrt(252)
print(f"Sharpe: {sharpe:.2f}")  # WRONG: May be spurious from many trials

Right: Deflated Sharpe Ratio accounts for trials

# GOOD - corrects for multiple testing
from ml4t.diagnostic.evaluation import comprehensive_sharpe_evaluation

results = comprehensive_sharpe_evaluation(
    returns=strategy_returns,
    SR_benchmark=0.0,
    K_trials=100,         # Account for all strategies tested
    variance_trials=0.1,  # Variance across trials
    alpha=0.05
)
print(f"Raw Sharpe: {results['SR_observed']:.2f}")
print(f"Deflated Sharpe: {results['DSR']:.2f}")  # Adjusted for trials
print(f"Significant: {results['is_significant']}")

Best Practice: Use CPCV for All Validation

The CombinatorialPurgedCV ensures leak-proof validation by construction:

from ml4t.diagnostic.splitters import CombinatorialPurgedCV

cv = CombinatorialPurgedCV(
    n_splits=10,
    embargo_pct=0.01,   # Gap after test period
    purge_pct=0.05      # Remove label overlap
)

# Each fold is leak-proof by design
for train_idx, test_idx in cv.split(X, y, timestamps):
    # Training data strictly precedes test data
    # Embargo prevents information bleeding
    # Purging handles overlapping label windows
    pass

For ML4T Book Readers

ML4T Diagnostic is the reference implementation for the ML4T 3rd Edition book.

Chapter mapping (ML4T 3rd Edition):

Chapter 6 (Alpha Factor Engineering) → FeatureDiagnostics, feature importance, interactions
Chapter 7 (Evaluating Alpha Factors) → SignalAnalysis, IC analysis, RAS
Chapter 9 (Backtesting) → TradeAnalysis, DSR, CPCV, TradeShapAnalyzer
Chapter 10 (Portfolio Construction) → PortfolioAnalysis, rolling metrics, drawdowns
Chapter 12 (Risk Management) → Risk metrics, VaR, stress tests

See docs/book_integration.md for complete mapping.

Contributing

We welcome contributions! See CLAUDE.md for:

Development setup
Code standards (ruff, mypy, pytest)
Architecture principles
How to add new metrics/tear sheets

Citation

If you use ML4T Diagnostic in your research, please cite:

@software{ml4t_diagnostic2025,
  author = {Stefan Jansen},
  title = {ML4T Diagnostic: Comprehensive Diagnostics for Quantitative Finance},
  year = {2025},
  version = {0.1.0a1},
  publisher = {GitHub},
  url = {https://github.com/stefan-jansen/ml4t-diagnostic}
}

For academic references to the statistical methods implemented in this library, see docs/REFERENCES.md.

License

MIT License - See LICENSE for details.

Related Projects

Part of the ML4T ecosystem:

ml4t-data - Market data infrastructure
ml4t-engineer - Feature engineering toolkit
ml4t-backtest - Event-driven backtest engine
ml4t-diagnostic - This library

Ready to get started? See Quick Start above or dive into the Trade Diagnostics Example.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.0b19 pre-release

May 5, 2026

0.1.0b18 pre-release

May 5, 2026

0.1.0b17 pre-release

May 5, 2026

0.1.0b16 pre-release

May 5, 2026

0.1.0b15 pre-release

Apr 30, 2026

0.1.0b14 pre-release

Apr 23, 2026

0.1.0b13 pre-release

Apr 22, 2026

0.1.0b12 pre-release

Apr 10, 2026

0.1.0b10 pre-release

Apr 2, 2026

0.1.0b8 pre-release

Apr 1, 2026

0.1.0b7 pre-release

Mar 28, 2026

0.1.0b6 pre-release

Mar 27, 2026

0.1.0b5 pre-release

Mar 27, 2026

0.1.0b4 pre-release

Mar 27, 2026

0.1.0b3 pre-release

Mar 24, 2026

0.1.0b2 pre-release

Mar 4, 2026

0.1.0b1 pre-release

Mar 3, 2026

0.1.0a11 pre-release

Mar 3, 2026

0.1.0a10 pre-release

Mar 2, 2026

0.1.0a9 pre-release

Mar 1, 2026

0.1.0a8 pre-release

Feb 27, 2026

0.1.0a7 pre-release

Feb 22, 2026

0.1.0a5 pre-release

Feb 10, 2026

0.1.0a4 pre-release

Feb 9, 2026

0.1.0a3 pre-release

Feb 4, 2026

0.1.0a2 pre-release

Feb 2, 2026

This version

0.1.0a1 pre-release

Jan 19, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ml4t_diagnostic-0.1.0a1.tar.gz (1.2 MB view details)

Uploaded Jan 19, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

ml4t_diagnostic-0.1.0a1-py3-none-any.whl (770.0 kB view details)

Uploaded Jan 19, 2026 Python 3

File details

Details for the file ml4t_diagnostic-0.1.0a1.tar.gz.

File metadata

Download URL: ml4t_diagnostic-0.1.0a1.tar.gz
Upload date: Jan 19, 2026
Size: 1.2 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.13

File hashes

Hashes for ml4t_diagnostic-0.1.0a1.tar.gz
Algorithm	Hash digest
SHA256	`9f75db87021a0e2fe37028cb3f046707e56d621b8d1671de426748774a05fdf6`
MD5	`b1e7ff9e295e99ca113569cf93cbb67c`
BLAKE2b-256	`635da7f25c33fc9e4b52a665af58070c47794acd447d02921255abda16c561f4`

See more details on using hashes here.

File details

Details for the file ml4t_diagnostic-0.1.0a1-py3-none-any.whl.

File metadata

Download URL: ml4t_diagnostic-0.1.0a1-py3-none-any.whl
Upload date: Jan 19, 2026
Size: 770.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.13

File hashes

Hashes for ml4t_diagnostic-0.1.0a1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9ecc49a94b536649fd88236b5185c2872903aca1f840bc32332ae7e7506ad2e4`
MD5	`e14b053ba97bfe73ed30e2b2a3a6aeb1`
BLAKE2b-256	`8d4bd52a92a2f2303f71eedfa835e4b94d1d7e6b3dfb83e7b01257d58e38e3ce`

See more details on using hashes here.

ml4t-diagnostic 0.1.0a1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

ML4T Diagnostic: Comprehensive Diagnostics for Quantitative Finance

What is ML4T Diagnostic?

Key Improvements

Quick Start

Installation

Example 1: Trade Diagnostics

Example 2: Feature Importance Analysis

Example 3: Feature Interactions

Example 4: Statistical Validation (DSR)

Example 5: Binary Classification Metrics

Example 6: Threshold Optimization

Library Overview

Three Pillars of Analysis

Four Application Domains

Architecture: Four-Tier Diagnostic Framework

Key Features

Trade-Level Diagnostics

Performance (10-100x Faster)

Interactive Visualizations

Auto-Interpretation

Advanced Statistics

Time Series Diagnostics

Signal Analysis

Portfolio Analysis

Seamless Integration

Modular Design

Documentation

User Guides

Academic References

Integration Guides

Technical Documentation

Migration

Optional Dependencies

API Stability

Development Status

v1.3 - Module Decomposition & UX Improvements

v1.2 - Configuration Consolidation

v1.1 - Model-Agnostic SHAP Support

v1.0 - Trade Diagnostics Framework

Roadmap

Performance Benchmarks

Leakage Prevention

1. Cross-Validation Leakage

2. Threshold Selection Leakage

3. Multiple Testing Correction

Best Practice: Use CPCV for All Validation

For ML4T Book Readers

Contributing

Citation

License

Related Projects

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes