scikit-learn-compatible cross-validation for time-series and financial machine learning: purging, embargoes, combinatorial purged CV, and deflated Sharpe ratios.
Purged cross validation
scikit-learn-compatible cross-validation for time-series machine learning: purging, embargoes, and combinatorial backtest paths.
Documentation and example notebooks: purge/embargo, walk-forward, and CPCV with PSR/DSR worked end to end on real ICU-mortality, turbofan-RUL, rainfall, and electricity-demand data.
Cite this software: see CITATION.cff and paper/paper.md (JOSS paper).
The problem
Standard k-fold cross-validation assumes the rows are independent. Time-series data is not. When a label resolves over the next few days, it overlaps the labels sitting right next to it, so an ordinary shuffle-split leaks tomorrow's answer back into training. The rows immediately after a test window leak too, because they are serially correlated with it. Both effects quietly inflate backtested Sharpe ratios and hand you strategies that look great on a chart and bleed money once they go live. This library removes both.
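To make the overlap concrete, the sketch below counts how many test rows share part of their label span with at least one training row. It is a toy setup, not taken from the library: 100 daily rows, a 3-day label span, and five folds are all assumptions, and the helper name `overlap_fraction` is hypothetical.

```python
import random

N_ROWS = 100   # toy daily observations (assumption)
HORIZON = 3    # each label spans the half-open interval [t, t + 3) (assumption)
N_FOLDS = 5


def overlap_fraction(folds):
    """Average share of test rows whose label span overlaps a training row's span."""
    fractions = []
    for test in folds:
        test_set = set(test)
        train = [i for i in range(N_ROWS) if i not in test_set]
        # Two spans [t, t+H) and [s, s+H) overlap exactly when |t - s| < H.
        leaked = sum(any(abs(t - s) < HORIZON for s in train) for t in test)
        fractions.append(leaked / len(test))
    return sum(fractions) / len(fractions)


fold_size = N_ROWS // N_FOLDS
rows = list(range(N_ROWS))

# Contiguous folds: only rows near a fold boundary touch training data.
contiguous = [rows[k * fold_size:(k + 1) * fold_size] for k in range(N_FOLDS)]

# Shuffled folds: nearly every test row has a close neighbour in training.
shuffled_rows = rows[:]
random.Random(0).shuffle(shuffled_rows)
shuffled = [shuffled_rows[k * fold_size:(k + 1) * fold_size] for k in range(N_FOLDS)]

print(f"contiguous folds: {overlap_fraction(contiguous):.2f}")  # 0.16
print(f"shuffled folds:   {overlap_fraction(shuffled):.2f}")
```

Even contiguous folds overlap at their edges in this toy setup; those boundary rows are the ones that purging and the embargo are meant to remove.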
Why write another one? People have asked scikit-learn, auto-sklearn, and mlpack for purging and embargo support and been turned down or left waiting for years. The one mature implementation, mlfinlab, went closed-source and paid. The free alternative has been unmaintained since 2018. That gap is the reason this exists.
Does it actually catch leakage?
A controlled check uses synthetic data whose target is built so that no feature can predict it, so an honest out-of-sample score must never be positive. Naive shuffled k-fold is run side by side with PurgedKFold (examples/synthetic_leakage_proof.ipynb, deterministic, no download):
| model | naive shuffled KFold R² | PurgedKFold R² |
|---|---|---|
| predict-the-mean (reference) | -0.01 | -0.13 |
| k-NN | 0.83 | -1.31 |
| RandomForest | 0.91 | -1.94 |
Train/test label overlap: 100% under naive → 0% under PurgedKFold.
Naive CV reports R² ≈ 0.83–0.91 on a target nothing can predict. That is pure leakage from the overlap. PurgedKFold removes the overlap and the fabricated skill collapses below a predict-the-mean baseline. The negative number is not the point; no positive skill is the correct answer, and only the purged split reports it. The library does not make models look better; it stops them looking better than they are.
Installation
```shell
pip install purgedcv

# Directly from the repository
pip install git+https://github.com/eslazarev/purged-cross-validation.git
```
Quickstart
1. The core primitive: purge
purge removes training observations that share data with the test set. Here a model uses a 5-day sliding feature window to predict the next day, so every observation occupies a 5-day span and the spans of neighbours overlap. Any training observation whose window reaches into the test period has already seen test data and must be dropped.
```python
import numpy as np
import pandas as pd
from purgedcv import purge

WINDOW = 5  # feature look-back in days

# 16 days of data; each observation uses a 5-day window to predict the next day
days = pd.date_range("2024-01-01", periods=16, freq="D")
predict_day = np.arange(WINDOW + 1, len(days) + 1)  # 11 observations
pred = pd.Series([days[d - WINDOW - 1] for d in predict_day])  # first feature day
evalu = pd.Series([days[d - 1] for d in predict_day])  # label day

train_idx = np.arange(0, 7)  # observations predicting days 6..12
test_idx = np.arange(7, 11)  # observations predicting days 13..16

# Drop training observations whose 5-day feature window overlaps the test window
kept_idx = purge(train_idx, test_idx, pred, evalu)
purged_idx = np.setdiff1d(train_idx, kept_idx)

print(f"Kept: {kept_idx.tolist()}")  # [0, 1, 2] -> predict days 6, 7, 8
print(f"Purged: {purged_idx.tolist()}")  # [3, 4, 5, 6] -> predict days 9, 10, 11, 12
```
In the accompanying figure, each bar is one observation's 5-day feature window. The four red bars cross into the test window (dashed line): their features overlap the test period, so purge drops them. The three green bars stay fully before it. The observation predicting day 8 only touches the boundary and is kept, because label horizons are half-open.
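The overlap rule described above can be sketched without the library. This is a minimal illustration, not purgedcv's implementation: the function name `purge_sketch` is hypothetical, plain day numbers stand in for timestamps, and the half-open comparison is an assumption based on the boundary behaviour described for day 8.

```python
def purge_sketch(train_idx, test_idx, start, end):
    """Keep training rows whose half-open span [start, end) misses every test span."""
    kept = []
    for i in train_idx:
        overlaps = any(
            start[i] < end[j] and start[j] < end[i]
            for j in test_idx
        )
        if not overlaps:
            kept.append(i)
    return kept


WINDOW = 5
# Day numbers instead of timestamps: observation i uses days i+1 .. i+5
# as features and is labelled on day i+6.
start = [i + 1 for i in range(11)]          # first feature day
end = [i + 1 + WINDOW for i in range(11)]   # label day (exclusive end of the span)

train_idx = list(range(0, 7))
test_idx = list(range(7, 11))

kept = purge_sketch(train_idx, test_idx, start, end)
purged = [i for i in train_idx if i not in kept]
print(kept)    # [0, 1, 2]
print(purged)  # [3, 4, 5, 6]
```

Observation 2 ends on day 8, exactly where the first test span begins, so the strict inequality keeps it; swapping `<` for `<=` would purge it as well.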
2. Splitters with scikit-learn: PurgedKFold inside cross_val_score
Drop-in replacement for KFold that applies purge and embargo automatically on every fold.
```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from purgedcv import PurgedKFold

rng = np.random.default_rng(0)
n = 200
pred = pd.Series(pd.date_range("2022-01-01", periods=n, freq="D"))
evalu = pred + pd.Timedelta(days=3)
X = rng.standard_normal((n, 5))
y = X @ rng.standard_normal(5) + rng.standard_normal(n) * 0.5

cv = PurgedKFold(
    n_splits=5,
    prediction_times=pred,
    evaluation_times=evalu,
    purge_horizon="3D",  # matches label horizon
    embargo="1D",  # 1-day post-test buffer
)

scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="r2")
print(f"R² per fold: {scores.round(3)}")
```
All four splitters (WalkForwardSplit, PurgedKFold, PurgedGroupKFold, CombinatorialPurgedCV) satisfy the sklearn splitter protocol and work inside GridSearchCV and Pipeline.
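The embargo that these splitters apply after a test fold can be sketched in the same spirit as purge. This is an assumption-laden illustration rather than the library's code: `embargo_sketch` is a hypothetical name, the test set is assumed to be one contiguous block, and the buffer is assumed to start at the last test evaluation time.

```python
def embargo_sketch(train_idx, test_idx, pred_time, eval_time, embargo):
    """Drop training rows predicted within `embargo` time units after the test block ends."""
    test_end = max(eval_time[j] for j in test_idx)
    return [
        i for i in train_idx
        if not (test_end <= pred_time[i] < test_end + embargo)
    ]


n = 20
pred_time = list(range(n))               # row i is predicted on day i
eval_time = [t + 3 for t in pred_time]   # and its label resolves 3 days later

test_idx = list(range(8, 12))
train_idx = [i for i in range(n) if i not in test_idx]

kept = embargo_sketch(train_idx, test_idx, pred_time, eval_time, embargo=2)
removed = [i for i in train_idx if i not in kept]
print(removed)  # [14, 15]
```

Rows 12 and 13 are left alone here because they start before the test block ends; removing them is the job of purge, while the embargo only covers the rows that follow.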
3. CPCV + path reconstruction + metrics: the full workflow
Combinatorial Purged CV produces C(N, K) folds that tile into multiple out-of-sample backtest paths. Use PSR and DSR to evaluate them with corrections for non-normality and selection bias.
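How C(N, K) folds tile into paths can be sketched with the standard library alone. Each group is held out in C(N-1, K-1) folds, so that many complete paths can be assembled. The assignment rule below (the p-th time a group is tested, it goes to path p) is one common convention and an assumption here; the library's reconstruct_paths may order paths differently.

```python
from itertools import combinations
from math import comb

N_GROUPS = 6       # N: number of contiguous groups
N_TEST_GROUPS = 2  # K: groups held out per fold

folds = list(combinations(range(N_GROUPS), N_TEST_GROUPS))
n_paths = comb(N_GROUPS - 1, N_TEST_GROUPS - 1)

# paths[p][g] = index of the fold that supplies group g's predictions on path p.
paths = [[None] * N_GROUPS for _ in range(n_paths)]
uses = [0] * N_GROUPS  # how many times each group has been a test group so far
for fold_id, test_groups in enumerate(folds):
    for g in test_groups:
        paths[uses[g]][g] = fold_id
        uses[g] += 1

print(len(folds), n_paths)  # 15 5
for p, path in enumerate(paths):
    print(p, path)
```

Every path covers all six groups exactly once, and every fold contributes to exactly K paths, which is why the path count equals C(N, K) · K / N.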
```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from purgedcv import (
    CombinatorialPurgedCV,
    probabilistic_sharpe_ratio,
    deflated_sharpe_ratio,
    min_track_record_length,
)

rng = np.random.default_rng(42)
n = 120
pred = pd.Series(pd.date_range("2023-01-01", periods=n, freq="D"))
evalu = pred + pd.Timedelta(days=2)
X = rng.standard_normal((n, 3))
y = X @ np.array([0.5, -0.3, 0.2]) + rng.standard_normal(n) * 0.1

# N=6, K=2 → C(6,2) = 15 folds → C(5,1) = 5 backtest paths
cv = CombinatorialPurgedCV(
    n_splits=6,
    n_test_groups=2,
    prediction_times=pred,
    evaluation_times=evalu,
)

# paths.shape == (n_paths, n_samples); NaN only if a fold could not be fit
paths = cv.backtest_paths(DummyRegressor(strategy="mean"), X, y)
print(f"Backtest paths: {paths.shape}")  # (5, 120)

# Derive a toy "return" series and compute per-path PSR
per_path_returns = paths - y[np.newaxis, :]
per_path_psr = [
    probabilistic_sharpe_ratio(row[np.isfinite(row)], benchmark_skill=0.0)
    for row in per_path_returns
]
print(f"PSR per path: {[round(p, 3) for p in per_path_psr]}")

# DSR corrects for testing 5 paths simultaneously
first = per_path_returns[0]
dsr = deflated_sharpe_ratio(first[np.isfinite(first)], n_trials=5, var_sharpe=0.01**2)
print(f"Deflated SR (first path): {dsr:.3f}")

# Minimum observations needed to prove SR=0.7 beats benchmark SR=0.5 at 95% confidence
n_min = min_track_record_length(
    observed_sharpe=0.7, target_sharpe=0.5, alpha=0.05, skew=0.0, kurtosis=3.0
)
print(f"MinTRL: {int(n_min)} observations")
```
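For readers who want to see what these metrics compute, the sketch below writes out the formulas from the Bailey and Lopez de Prado papers listed under Methodology references, using only the standard library. The function names are hypothetical and this is not the library's code; it assumes per-observation (non-annualized) Sharpe ratios and non-excess kurtosis (3.0 for a normal distribution), and the library's own results may differ in details such as rounding.

```python
from math import e, sqrt
from statistics import NormalDist

_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def sharpe_variance_term(sharpe, skew=0.0, kurtosis=3.0):
    """Non-normality adjustment: 1 - skew*SR + (kurtosis - 1)/4 * SR^2."""
    return 1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe**2


def psr_sketch(sharpe, n_obs, benchmark=0.0, skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe ratio exceeds `benchmark`."""
    spread = sqrt(sharpe_variance_term(sharpe, skew, kurtosis))
    return _NORMAL.cdf((sharpe - benchmark) * sqrt(n_obs - 1) / spread)


def min_trl_sketch(sharpe, benchmark, alpha=0.05, skew=0.0, kurtosis=3.0):
    """Track-record length at which psr_sketch reaches 1 - alpha."""
    z_alpha = _NORMAL.inv_cdf(1.0 - alpha)
    variance_term = sharpe_variance_term(sharpe, skew, kurtosis)
    return 1.0 + variance_term * (z_alpha / (sharpe - benchmark)) ** 2


def expected_max_sharpe_sketch(n_trials, var_sharpe):
    """Benchmark Sharpe used by DSR: expected maximum over `n_trials` unskilled trials."""
    a = _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * e))
    return sqrt(var_sharpe) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


n_min = min_trl_sketch(0.7, 0.5)
print(f"MinTRL sketch: {n_min:.1f}")  # about 85.2
print(f"PSR at that length: {psr_sketch(0.7, n_min, benchmark=0.5):.3f}")  # 0.950
print(f"DSR benchmark SR: {expected_max_sharpe_sketch(5, 0.01**2):.4f}")
```

Under these formulas, DSR is PSR evaluated against the expected-maximum benchmark instead of zero, which is how the number of trials enters the correction.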
API summary
| Symbol | Domain | Description |
|---|---|---|
| purge | D2 | Remove overlapping-horizon training rows |
| apply_embargo | D3 | Remove post-test buffer rows |
| WalkForwardSplit | D5.1 | Sliding / expanding walk-forward CV |
| PurgedKFold | D5.2 | Contiguous test folds with purge + embargo |
| PurgedGroupKFold | D5.3 | Group-aware purged k-fold |
| CombinatorialPurgedCV | D5.4 | C(N,K) combinatorial folds |
| reconstruct_paths | D6 | Assemble CPCV folds into backtest paths |
| probabilistic_sharpe_ratio | D7 | PSR: P(true SR > benchmark) |
| deflated_sharpe_ratio | D7 | DSR: PSR corrected for multiple testing |
| min_track_record_length | D7 | Minimum observations to establish SR |
| diagnostics.* | D8 | Leakage and embargo audit functions |
Methodology references
- Lopez de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapters 7 (purge/embargo) and 12 (CPCV).
- Bailey, D. H., & Lopez de Prado, M. (2012). The Sharpe Ratio Efficient Frontier. Journal of Risk, 15(2).
- Bailey, D. H., & Lopez de Prado, M. (2014). The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting and Non-Normality. Journal of Portfolio Management, 40(5).
License
MIT. See LICENSE.