Time based cross validation

timebasedcv is a Python package that provides a cross validation strategy based on time.


Documentation | Repository | Issue Tracker


Disclaimer ⚠️

This codebase is experimental and works for my use cases. It is quite likely that there are cases it does not cover and for which it could break (badly). If you find any, please feel free to open an issue in the issue page of the repo.

Description ✨

The current implementation of scikit-learn's TimeSeriesSplit lacks the flexibility to handle multiple samples within the same time period (or time unit).

timebasedcv addresses this problem by providing a cross validation strategy based on a time period rather than the number of samples. This is useful when the data is time dependent and the split should keep together samples that fall within the same time window.

Temporal data leakage is a real concern: the splits make sure the past and the future are well separated, so that leakage from the future does not spoil model cross validation.

Again, these split points depend solely on the time period and not on the number of observations.
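
To see the contrast concretely, consider a minimal sketch of scikit-learn's index-based behaviour: TimeSeriesSplit looks only at row positions, so two rows sharing the same timestamp can end up on opposite sides of a split.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten samples: the splitter sees only their positions, not their timestamps,
# so rows belonging to the same day may be separated into train and test.
X = np.arange(10).reshape(-1, 1)

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print(f"train: {train_idx}, test: {test_idx}")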

Features 📜

We introduce two main classes:

  • TimeBasedSplit allows you to define a split based on a time unit (frequency), train size, test size, gap, stride, window type and mode. Remark that TimeBasedSplit is not compatible with scikit-learn CV Splitters. In fact, we have made the (opinionated) choice to:

    • Return the sliced arrays from .split(...), while scikit-learn CV Splitters return train and test indices of the split.
    • Require the time series to be passed as input to the .split(...) method, while scikit-learn CV Splitters require only X, y and groups.
    • This time series is used to generate the boolean masks with which we slice the original arrays into train and test sets for each split.
  • Given the above choices, we also provide a scikit-learn compatible splitter: TimeBasedCVSplitter. Since CV Splitters need to know the number of splits a priori and .split(...) has a fixed signature, TimeBasedCVSplitter is initialized with the time series containing the time information used to generate the train and test indices of each split (see the sketch below).
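
Given that description, a minimal sketch of plugging TimeBasedCVSplitter into scikit-learn could look as follows. Treat it as an illustration under assumptions rather than the guaranteed API: the timebasedcv.sklearn import path and the exact parameter set are assumed here and may differ across versions.

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

from timebasedcv.sklearn import TimeBasedCVSplitter  # assumed import path

rng = np.random.default_rng(seed=42)
time_series = pd.Series(pd.date_range("2023-01-01", periods=90, freq="D"))
X, y = rng.normal(size=(90, 2)), rng.normal(size=90)

# The time series is provided at initialization, so the number of splits
# is known a priori, as scikit-learn requires.
cv = TimeBasedCVSplitter(
    frequency="days",
    train_size=30,
    forecast_horizon=7,
    gap=0,
    stride=7,
    window="rolling",
    time_series=time_series,
)

scores = cross_val_score(Ridge(), X, y, cv=cv)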

Installation 💻

TL;DR:

python -m pip install timebasedcv

For further information, please refer to the dedicated installation section.

Quickstart 🏃

The following code snippet is all you need to get started; still, consider checking out the getting started section of the documentation for a detailed guide on how to use the library.

The main takeaway should be that TimeBasedSplit allows for a lot of flexibility at the cost of having to specify a long list of parameters. This is what makes the library powerful and flexible enough to cover the large majority of use cases.

First, let's generate some data with a varying number of points per day:

import numpy as np
import pandas as pd

RNG = np.random.default_rng(seed=42)

dates = pd.Series(pd.date_range("2023-01-01", "2023-01-31", freq="D"))
size = len(dates)

# For each day, draw a random number of timestamps (between 2 and 23) and
# attach two normally distributed features.
df = (pd.concat([
        pd.DataFrame({
            "time": pd.date_range(start, end, periods=_size, inclusive="left"),
            "a": RNG.normal(size=_size - 1),
            "b": RNG.normal(size=_size - 1),
        })
        for start, end, _size in zip(dates[:-1], dates[1:], RNG.integers(2, 24, size - 1))
    ])
    .reset_index(drop=True)
    # The target is the sum of the features plus a bit of noise.
    .assign(y=lambda t: t[["a", "b"]].sum(axis=1) + RNG.normal(size=t.shape[0]) / 25)
)

df.set_index("time").resample("D").agg(count=("y", "count")).head(5)
            count
time
2023-01-01      2
2023-01-02     18
2023-01-03     15
2023-01-04     10
2023-01-05     10

Then let's instantiate the TimeBasedSplit class:

from timebasedcv import TimeBasedSplit

tbs = TimeBasedSplit(
    frequency="days",
    train_size=10,
    forecast_horizon=5,
    gap=1,
    stride=3,
    window="rolling",
    mode="forward",
)
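
In broad terms, this configuration yields train windows of 10 days and forecast windows of 5 days, separated by a 1-day gap; each new split starts 3 days after the previous one, the train window keeps a fixed ("rolling") size rather than expanding, and splits are generated moving forward in time.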

Now let's split the data with the TimeBasedSplit instance we just created:

X, y, time_series = df.loc[:, ["a", "b"]], df["y"], df["time"]

for X_train, X_forecast, y_train, y_forecast in tbs.split(X, y, time_series=time_series):
    print(f"Train: {X_train.shape}, Forecast: {X_forecast.shape}")
Train: (100, 2), Forecast: (51, 2)
Train: (114, 2), Forecast: (50, 2)
...
Train: (124, 2), Forecast: (40, 2)
Train: (137, 2), Forecast: (22, 2)

As we can see, each split does not necessarily have the same number of points: this is because the time series has a varying number of points per day.
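
To make the loop concrete, here is a minimal sketch that fits a model on each split; the estimator is illustrative (any object with fit and predict methods would do):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit a fresh model on each train window and score it on the corresponding
# forecast window.
for X_train, X_forecast, y_train, y_forecast in tbs.split(X, y, time_series=time_series):
    model = LinearRegression().fit(X_train, y_train)
    preds = model.predict(X_forecast)
    print(f"train size: {len(X_train)}, MSE: {mean_squared_error(y_forecast, preds):.4f}")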

A picture is worth a thousand words; let's visualize the splits (blue dots represent the train points, while red dots represent the forecast points):

[Figure: cross-validation splits, train points in blue and forecast points in red]

Contributing ✌️

Please read the Contributing guidelines on the documentation site.

License 👀

The project is licensed under the MIT License.

