Time based cross validation
timebasedcv is a Python codebase that provides a cross validation strategy based on time.
Documentation | Repository | Issue Tracker
Disclaimer ⚠️
This codebase is experimental and works for my use cases. There are very likely cases that are not covered and for which it could break (badly). If you find one, please open an issue on the issue page of the repo.
Description
The current implementation of scikit-learn's TimeSeriesSplit lacks the flexibility to handle multiple samples within the same time period (or time unit).
timebasedcv addresses this problem by providing a cross validation strategy based on a time unit rather than on the number of samples. This is useful when the data is time dependent and the split should keep together samples that fall within the same time window.
Temporal data leakage is a real risk. The splits are built so that models are trained on past data and tested on future data, independently of the number of observations present within a given time period.
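To make the difference concrete, here is a minimal standard-library sketch (toy data and variable names are assumptions for illustration, not part of timebasedcv). It contrasts a split defined by sample count with a split defined by a time cutoff:

```python
from datetime import date

# Toy data: one day label per sample, with an uneven number of samples per day.
sample_days = (
    [date(2023, 1, 1)] * 2
    + [date(2023, 1, 2)] * 5
    + [date(2023, 1, 3)] * 3
)

# Split by sample count: the first 4 samples train, the rest test.
n_train = 4
count_train_days = set(sample_days[:n_train])
count_test_days = set(sample_days[n_train:])
# Days that appear on both sides of the count-based split.
shared_days = count_train_days & count_test_days

# Split by time unit: every sample before the cutoff day trains, the rest test.
cutoff = date(2023, 1, 3)
time_train = [d for d in sample_days if d < cutoff]
time_test = [d for d in sample_days if d >= cutoff]

print("days shared by count-based split:", sorted(shared_days))
print("days shared by time-based split:", sorted(set(time_train) & set(time_test)))
print("train/test sizes with time-based split:", len(time_train), len(time_test))
```

In the count-based split, January 2nd lands in both train and test, while the time-based split keeps each day on one side at the cost of uneven sample counts.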
Features ✨
We introduce two main classes:
- TimeBasedSplit allows you to define a split based on time unit (frequency), train size, test size, gap, stride, window type and mode. Note that TimeBasedSplit is not compatible with scikit-learn CV Splitters. In fact, we have made the (opinionated) choice to:
  - Return the sliced arrays from .split(...), while scikit-learn CV Splitters return the train and test indices of the split.
  - Require the time series to be passed as input to the .split(...) method, while scikit-learn CV Splitters require only X, y, groups to be passed to .split(...).
  - Use that time series to generate the boolean masks with which we slice the original arrays into train and test for each split.
- Considering the above choices, we also provide a scikit-learn compatible splitter: TimeBasedCVSplitter. Considering the signature that .split(...) requires and the fact that CV Splitters need to know the number of splits a priori, TimeBasedCVSplitter is initialized with the time series containing the time information used to generate the train and test indices of each split.
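The two return styles described above can be sketched with plain Python lists. This is only an illustration under stated assumptions (toy data, half-open time windows, hypothetical variable names); it is not the library's implementation:

```python
from datetime import datetime

# Hypothetical toy data, not taken from the library.
time_series = [
    datetime(2023, 1, 1, 8), datetime(2023, 1, 1, 20),
    datetime(2023, 1, 2, 9),
    datetime(2023, 1, 3, 7), datetime(2023, 1, 3, 12), datetime(2023, 1, 3, 18),
]
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]

train_start, train_end = datetime(2023, 1, 1), datetime(2023, 1, 3)
forecast_start, forecast_end = datetime(2023, 1, 3), datetime(2023, 1, 4)

# Boolean masks computed from the timestamps alone (half-open windows assumed).
train_mask = [train_start <= t < train_end for t in time_series]
forecast_mask = [forecast_start <= t < forecast_end for t in time_series]

# Style 1: return the sliced data directly.
X_train = [row for row, keep in zip(X, train_mask) if keep]
X_forecast = [row for row, keep in zip(X, forecast_mask) if keep]

# Style 2: return integer positions, as index-based splitters do.
train_idx = [i for i, keep in enumerate(train_mask) if keep]
forecast_idx = [i for i, keep in enumerate(forecast_mask) if keep]

print(len(X_train), len(X_forecast))
print(train_idx, forecast_idx)
```

Both styles come from the same masks, which is why an index-based splitter needs the time series up front: without it there is nothing to compute the masks from.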
Installation
TL;DR:
python -m pip install timebasedcv
For further information, please refer to the dedicated installation section.
Quickstart
The following code snippet is all you need to get started, but consider checking out the getting started section of the documentation for a detailed guide on how to use the library.
The main takeaway is that TimeBasedSplit allows for a lot of flexibility at the cost of having to specify a long list of parameters. This flexibility is what lets the library cover the large majority of use cases.
First, let's generate some data with a different number of points per day:
import numpy as np
import pandas as pd
RNG = np.random.default_rng(seed=42)
dates = pd.Series(pd.date_range("2023-01-01", "2023-01-31", freq="D"))
size = len(dates)
# Build one block of samples per day, each with a random number of points.
df = (
    pd.concat([
        pd.DataFrame({
            # inclusive="left" drops the right endpoint, leaving _size - 1 timestamps
            "time": pd.date_range(start, end, periods=_size, inclusive="left"),
            "a": RNG.normal(size=_size-1),
            "b": RNG.normal(size=_size-1),
        })
        for start, end, _size in zip(dates[:-1], dates[1:], RNG.integers(2, 24, size-1))
    ])
    .reset_index(drop=True)
    .assign(y=lambda t: t[["a", "b"]].sum(axis=1) + RNG.normal(size=t.shape[0])/25)
)

# Count the number of samples per day.
df.set_index("time").resample("D").agg(count=("y", np.size)).head(5)
            count
time
2023-01-01      2
2023-01-02     18
2023-01-03     15
2023-01-04     10
2023-01-05     10
Then let's instantiate the TimeBasedSplit class:
from timebasedcv import TimeBasedSplit
tbs = TimeBasedSplit(
    frequency="days",
    train_size=10,
    forecast_horizon=5,
    gap=1,
    stride=3,
    window="rolling",
    mode="forward",
)
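To build intuition for how these parameters interact, the sketch below generates the date windows by hand using only the standard library. The function name forward_windows, the half-open window convention, and the stopping rule are assumptions made for illustration; the library's actual boundary handling may differ:

```python
from datetime import date, timedelta


def forward_windows(start, end, train_size, forecast_horizon, gap, stride, window="rolling"):
    """Yield (train_start, train_end, forecast_start, forecast_end), in days.

    Assumptions of this sketch: windows are half-open [start, end), and
    generation stops as soon as a forecast window would run past `end`.
    """
    one_day = timedelta(days=1)
    train_start = start
    train_end = start + train_size * one_day
    while True:
        # The gap separates the end of training from the start of forecasting.
        forecast_start = train_end + gap * one_day
        forecast_end = forecast_start + forecast_horizon * one_day
        if forecast_end > end:
            break
        yield train_start, train_end, forecast_start, forecast_end
        # Move forward by `stride`; a rolling window also shifts its start,
        # while an expanding window keeps the original start.
        train_end += stride * one_day
        if window == "rolling":
            train_start += stride * one_day


splits = list(
    forward_windows(
        start=date(2023, 1, 1),
        end=date(2023, 1, 31),
        train_size=10,
        forecast_horizon=5,
        gap=1,
        stride=3,
    )
)
for train_start, train_end, forecast_start, forecast_end in splits:
    print(f"train [{train_start}, {train_end})  forecast [{forecast_start}, {forecast_end})")
```

Passing window="expanding" to this sketch keeps train_start fixed, so each successive training window grows by stride days instead of sliding.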
Now let's split the data with the provided TimeBasedSplit instance:
X, y, time_series = df.loc[:, ["a", "b"]], df["y"], df["time"]
for X_train, X_forecast, y_train, y_forecast in tbs.split(X, y, time_series=time_series):
    print(f"Train: {X_train.shape}, Forecast: {X_forecast.shape}")
Train: (100, 2), Forecast: (51, 2)
Train: (114, 2), Forecast: (50, 2)
...
Train: (124, 2), Forecast: (40, 2)
Train: (137, 2), Forecast: (22, 2)
As we can see, each split does not necessarily have the same number of points. This is because the time series has a different number of points per day.
A picture is worth a thousand words, so let's visualize the splits (blue dots represent the train points, while red dots represent the forecasting points):
Contributing
Please read the Contributing guidelines on the documentation site.
License
The project is released under the MIT License.