Time based cross validation
timebasedcv is a Python codebase that provides a cross validation strategy based on time.
Documentation: https://fbruzzesi.github.io/timebasedcv
Source Code: https://github.com/fbruzzesi/timebasedcv
Alpha Notice
This codebase is experimental and works for my use cases. It very likely has uncovered cases in which it breaks (badly). If you find one, please feel free to open an issue on the issue page of the repo.
Description
The current implementation of scikit-learn's TimeSeriesSplit splits on the number of samples, so it lacks the flexibility to handle multiple samples within the same time period/unit.
This codebase addresses this problem by providing a cross validation strategy based on a time period rather than the number of samples. This is useful when the data is time dependent, and the model should be trained on past data and tested on future data, independently of the number of observations present within a given time period.
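To make the limitation concrete, the sketch below runs scikit-learn's TimeSeriesSplit on six samples spread unevenly over three days. The toy data and the variable names are assumptions made for illustration. Because the splitter cuts by sample count, the same day ends up on both sides of each split.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data (an assumption for illustration): 3 samples on the first day,
# 2 on the second and 1 on the third.
days = np.array(["2023-01-01"] * 3 + ["2023-01-02"] * 2 + ["2023-01-03"])
X = np.arange(len(days)).reshape(-1, 1)

overlaps = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=2).split(X):
    # Days that appear in both the train fold and the test fold.
    shared = sorted(str(d) for d in set(days[train_idx]) & set(days[test_idx]))
    overlaps.append(shared)
    print(f"train={train_idx.tolist()} test={test_idx.tolist()} shared days={shared}")
```

A time based strategy would instead place the cut on a day boundary, so every sample of a given day lands on the same side.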
We introduce two main classes:
- TimeBasedSplit: a class that defines a time based split with a given frequency, train size, test size, gap, stride and window type. Its core method split requires the time series as input in order to create the boolean masks for train and test from the instance information defined above. Therefore it is not compatible with scikit-learn CV Splitters.
- TimeBasedCVSplitter: a class that conforms with scikit-learn CV Splitters but requires the time series to be passed to the instance. That is because a CV Splitter needs to know the number of splits a priori, and its split method shouldn't take any extra arguments other than the arrays to split.
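The constraint described for TimeBasedCVSplitter follows from the scikit-learn splitter protocol, in which get_n_splits and split only receive X, y and groups. Below is a minimal sketch of an object following that protocol. The class DayWiseSplitter and its fold scheme (test on one calendar day, train on all earlier days) are hypothetical and are not the library's implementation. Plain index lists are used instead of NumPy arrays to keep the example dependency free.

```python
from datetime import date


class DayWiseSplitter:
    """Hypothetical splitter: one fold per distinct day after the first."""

    def __init__(self, time_series):
        # The time information lives on the instance because the
        # scikit-learn protocol passes only X, y and groups to the methods.
        self.time_series = list(time_series)
        self._test_days = sorted(set(self.time_series))[1:]

    def get_n_splits(self, X=None, y=None, groups=None):
        # Known before seeing X: it depends only on the stored dates.
        return len(self._test_days)

    def split(self, X, y=None, groups=None):
        if len(X) != len(self.time_series):
            raise ValueError("X and time_series must have the same length")
        for test_day in self._test_days:
            train_idx = [i for i, d in enumerate(self.time_series) if d < test_day]
            test_idx = [i for i, d in enumerate(self.time_series) if d == test_day]
            yield train_idx, test_idx


time_series = [date(2023, 1, 1)] * 3 + [date(2023, 1, 2)] * 2 + [date(2023, 1, 3)]
X = list(range(len(time_series)))

splitter = DayWiseSplitter(time_series)
folds = list(splitter.split(X))
print(splitter.get_n_splits(), folds)
```

Storing the time series at construction is what lets get_n_splits answer without any extra argument, which is the trade-off the second class makes.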
Installation
timebasedcv is published on PyPI, so it can be installed directly via pip, from source using pip and git, or from a local clone:
- pip (suggested):
    python -m pip install timebasedcv
- pip + source/git:
    python -m pip install git+https://github.com/FBruzzesi/timebasedcv.git
- local clone:
    git clone https://github.com/FBruzzesi/timebasedcv.git
    cd timebasedcv
    python -m pip install .
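After installing with any of the options above, a quick way to confirm that the package is visible to the current interpreter is to query its installed version. The helper name installed_version is made up for this sketch; it relies only on the standard library's importlib.metadata.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version if the install succeeded, otherwise None.
print(installed_version("timebasedcv"))
```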
Quickstart
As a quickstart, you can use the following code snippet to get started. Consider checking out the Getting Started section of the documentation for a detailed guide on how to use the library.
First let's generate some data with a different number of points per day:
import pandas as pd
import numpy as np

np.random.seed(42)

dates = pd.Series(pd.date_range("2023-01-01", "2023-01-31", freq="D"))
size = len(dates)

# For each pair of consecutive days, draw a random number of timestamps.
# With inclusive="left", date_range returns _size - 1 points, which matches
# the number of random values generated for that day.
df = pd.concat([
    pd.DataFrame({
        "time": pd.date_range(start, end, periods=_size, inclusive="left"),
        "value": np.random.randn(_size-1)/25,
    })
    for start, end, _size in zip(dates[:size], dates[1:], np.random.randint(2, 24, size-1))
]).reset_index(drop=True)

time_series, X = df["time"], df["value"]

# Number of points per day
df.set_index("time").resample("D").count().head(5)
# time        value
# 2023-01-01     14
# 2023-01-02      2
# 2023-01-03     22
# 2023-01-04     11
# 2023-01-05      1
Now let's run the split with a given frequency, train size, forecast horizon (test size), gap, stride and window type:
from timebasedcv import TimeBasedSplit

configs = [
    {
        "frequency": "days",
        "train_size": 14,
        "forecast_horizon": 7,
        "gap": 2,
        "stride": 5,
        "window": "expanding"
    },
    ...
]

# Use the first configuration; repeat with the others to compare strategies.
config = configs[0]
tbs = TimeBasedSplit(
    **config,
)

fmt = "%Y-%m-%d"
for train_set, forecast_set in tbs.split(X, time_series=time_series):
    # Do some magic here
    pass
Let's see how the train_set and forecast_set splits would look for different split strategies (or configurations).
The green dots represent the train points, while the red dots represent the forecasting points.
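The effect of the window type can also be inspected as plain date ranges. The sketch below is a simplified illustration and not the library's algorithm: the function names day_windows and window_mask are hypothetical, splits are assumed to start from the first day and move forward by stride days, and only complete forecast windows are kept. It reuses the values from the configuration above over the 30 days of generated data.

```python
from datetime import date, timedelta


def day_windows(first_day, last_day, train_size, forecast_horizon, gap, stride, window):
    """Yield (train_start, train_end, forecast_start, forecast_end) dates.

    All intervals are half-open: the start is included, the end is excluded.
    """
    if window not in ("expanding", "rolling"):
        raise ValueError("window must be 'expanding' or 'rolling'")
    data_end = last_day + timedelta(days=1)
    train_end = first_day + timedelta(days=train_size)
    while True:
        forecast_start = train_end + timedelta(days=gap)
        forecast_end = forecast_start + timedelta(days=forecast_horizon)
        if forecast_end > data_end:
            break  # assumption: only complete forecast windows are kept
        if window == "expanding":
            train_start = first_day  # train set grows at every split
        else:
            train_start = train_end - timedelta(days=train_size)  # fixed length
        yield train_start, train_end, forecast_start, forecast_end
        train_end += timedelta(days=stride)


def window_mask(times, start, end):
    """Boolean mask selecting the points that fall in [start, end)."""
    return [start <= t < end for t in times]


first_day, last_day = date(2023, 1, 1), date(2023, 1, 30)
common = dict(train_size=14, forecast_horizon=7, gap=2, stride=5)

for window in ("expanding", "rolling"):
    print(window)
    for tr_s, tr_e, fc_s, fc_e in day_windows(first_day, last_day, window=window, **common):
        print(f"  train [{tr_s}, {tr_e})  forecast [{fc_s}, {fc_e})")
```

Under these assumptions both strategies produce the same forecast windows; they differ only in where the train window starts from the second split onward.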
Contributing
Please read the Contributing guidelines in the documentation site.
License
The project is released under the MIT License.