Time based cross validation

timebasedcv is a Python codebase that provides a cross validation strategy based on time.


Documentation | Repository | Issue Tracker


Alpha Notice

This codebase is experimental and works for my use cases. It very likely has uncovered cases in which it breaks (badly). If you find one, please feel free to open an issue on the repository's issue tracker.

Description

The current implementation of scikit-learn's TimeSeriesSplit splits data by number of samples, so it lacks the flexibility to handle multiple samples within the same time period/unit.

This codebase addresses this problem by providing a cross validation strategy based on time periods rather than on the number of samples. This is useful when the data is time dependent and the model should be trained on past data and tested on future data, regardless of how many observations fall within a given time period.

Temporal data leakage is an issue and we want to prevent that from happening!
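
To make the leakage concrete, here is a minimal sketch that uses only the standard library. The toy data, the cutoff and all variable names are made up for illustration; it does not use timebasedcv itself. It shows a sample-based split placing observations from the same day on both sides, while a time-based mask does not.

```python
from datetime import date

# Hypothetical toy data: one date per observation, with uneven counts per day.
observation_days = (
    [date(2023, 1, 1)] * 3
    + [date(2023, 1, 2)] * 1
    + [date(2023, 1, 3)] * 4
    + [date(2023, 1, 4)] * 2
)

# Sample-based split: the first 6 observations train, the rest test.
n_train = 6
count_train_days = set(observation_days[:n_train])
count_test_days = set(observation_days[n_train:])
# Days that appear on both sides of the sample-based split.
shared_days = count_train_days & count_test_days

# Time-based split: everything strictly before the cutoff day is training data.
cutoff = date(2023, 1, 3)
train_mask = [day < cutoff for day in observation_days]
test_mask = [day >= cutoff for day in observation_days]
time_train_days = {d for d, keep in zip(observation_days, train_mask) if keep}
time_test_days = {d for d, keep in zip(observation_days, test_mask) if keep}

print("shared days, sample-based split:", sorted(shared_days))
print("shared days, time-based split:", sorted(time_train_days & time_test_days))
```

In this toy case the sample-based split cuts through 2023-01-03, whereas the time-based masks keep each day entirely on one side.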

We introduce two main classes:

  • TimeBasedSplit lets you define a time based split with a given frequency, train size, test size, gap, stride and window type. Its core method, split, requires a time series as input in order to create the boolean masks for train and test from the instance configuration defined above. Therefore it is not compatible with scikit-learn CV Splitters.
  • TimeBasedCVSplitter conforms to the scikit-learn CV Splitter interface, but requires the time series to be passed to the instance. This is because a CV Splitter needs to know the number of splits a priori, and its split method shouldn't take any arguments other than the arrays to split.
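
The design constraint in the second bullet can be sketched with a toy class. This is a simplified, hypothetical illustration (ToyTimeSplitter and its parameters are invented here) and not how TimeBasedCVSplitter is implemented. The point is only that, once the timestamps are stored on the instance, the number of splits is known before split is called, and split needs nothing beyond the arrays.

```python
from datetime import date, timedelta


class ToyTimeSplitter:
    """Hypothetical, simplified splitter; not timebasedcv's implementation."""

    def __init__(self, time_series, train_days, test_days):
        self.time_series = list(time_series)
        train = timedelta(days=train_days)
        test = timedelta(days=test_days)
        # The windows depend only on the timestamps, so they can be computed
        # here, before any data is passed to split().
        self._windows = []
        start, last = min(self.time_series), max(self.time_series)
        while start + train <= last:
            cut = start + train
            self._windows.append((start, cut, cut + test))
            start += test

    def get_n_splits(self, X=None, y=None, groups=None):
        # Known a priori, without looking at X.
        return len(self._windows)

    def split(self, X, y=None, groups=None):
        # Only the arrays to split are passed in; timestamps come from self.
        for start, cut, end in self._windows:
            train_idx = [i for i, t in enumerate(self.time_series) if start <= t < cut]
            test_idx = [i for i, t in enumerate(self.time_series) if cut <= t < end]
            yield train_idx, test_idx


# Toy usage: nine observations spread unevenly over six days.
days = [date(2023, 1, 1) + timedelta(days=d) for d in (0, 0, 1, 2, 2, 2, 3, 4, 5)]
splitter = ToyTimeSplitter(days, train_days=3, test_days=1)
all_splits = list(splitter.split(X=list(range(len(days)))))
print(splitter.get_n_splits())
for train_idx, test_idx in all_splits:
    print(train_idx, test_idx)
```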

Installation

timebasedcv is published on PyPI, so it can be installed directly via pip, from source using pip and git, or from a local clone:

pip (suggested)
python -m pip install timebasedcv
pip + source/git
python -m pip install git+https://github.com/FBruzzesi/timebasedcv.git
local clone
git clone https://github.com/FBruzzesi/timebasedcv.git
cd timebasedcv
python -m pip install .

Dependencies

As of timebasedcv v0.1.0, the only two dependencies are numpy and narwhals>=0.7.15.

The latter provides a compatibility layer between polars, pandas and other dataframe libraries. Therefore, as long as narwhals supports a given dataframe object, so do we.

Quickstart

The following code snippet is all you need to get started; for a detailed guide on how to use the library, check out the Getting Started section of the documentation.

First let's generate some data with a different number of points per day:

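import pandas as pd
import numpy as np
np.random.seed(42)  # make the random data reproducible

# One timestamp per day of January 2023, used as day boundaries
dates = pd.Series(pd.date_range("2023-01-01", "2023-01-31", freq="D"))
size = len(dates)

# For each pair of consecutive days, create evenly spaced timestamps in
# [start, end); inclusive="left" drops `end`, leaving _size-1 points per day
df = pd.concat([
    pd.DataFrame({
        "time": pd.date_range(start, end, periods=_size, inclusive="left"),
        "value": np.random.randn(_size-1)/25,
    })
    for start, end, _size in zip(dates[:-1], dates[1:], np.random.randint(2, 24, size-1))
]).reset_index(drop=True)

time_series, X = df["time"], df["value"]
# Count the observations per day
df.set_index("time").resample("D").count().head(5)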
time	        value
2023-01-01	14
2023-01-02	2
2023-01-03	22
2023-01-04	11
2023-01-05	1

Now let's run the split with a given frequency, train size, forecast horizon (test size), gap, stride and window type:

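from timebasedcv import TimeBasedSplit

# Each dict is one split configuration; further configurations are elided
configs = [
    {
        "frequency": "days",
        "train_size": 14,
        "forecast_horizon": 7,
        "gap": 2,
        "stride": 5,
        "window": "expanding"
    },
    ...
]

# Pick one configuration to build the splitter
config = configs[0]
tbs = TimeBasedSplit(**config)

fmt = "%Y-%m-%d"
for train_set, forecast_set in tbs.split(X, time_series=time_series):
    # Do some magic here; as a placeholder, report the size of each split
    print(f"Train: {len(train_set)}, Forecast: {len(forecast_set)}")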

Let's see how the train_set and forecast_set splits would look for different split strategies (or configurations).

The blue dots represent the train points, while the red dots represent the forecasting points.

[Figure: cross-validation splits]
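
For readers who prefer dates to dots, the sketch below works through the window arithmetic for the configuration above using only the standard library. It is an illustration under stated assumptions, not the library's code: toy_windows is a hypothetical helper, the half-open intervals and the stopping rule are assumed conventions, and the first and last dates are taken from the quickstart data. Check the documentation for the exact boundaries timebasedcv uses.

```python
from datetime import date, timedelta


def toy_windows(first, last, train_size, forecast_horizon, gap, stride, window):
    """Yield (train_start, train_end, forecast_start, forecast_end) dates.

    Hypothetical helper for illustration only. Intervals are assumed to be
    half-open (start <= t < end) and all sizes are expressed in days.
    """
    day = timedelta(days=1)
    train_start = current = first
    # Assumed stopping rule: continue while a forecast window can still start.
    while current + (train_size + gap) * day < last:
        train_end = current + train_size * day
        forecast_start = train_end + gap * day
        forecast_end = forecast_start + forecast_horizon * day
        yield train_start, train_end, forecast_start, forecast_end
        current += stride * day
        if window == "rolling":
            # A rolling window moves its start; "expanding" keeps the first one.
            train_start = current


splits = list(
    toy_windows(
        first=date(2023, 1, 1),
        last=date(2023, 1, 30),
        train_size=14,
        forecast_horizon=7,
        gap=2,
        stride=5,
        window="expanding",
    )
)
for train_start, train_end, forecast_start, forecast_end in splits:
    print(f"train [{train_start}, {train_end})  forecast [{forecast_start}, {forecast_end})")
```

Under these assumptions the training window always starts on 2023-01-01 and its end moves forward by the stride (5 days) at each split, with the forecast window starting 2 days after the training window ends.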

Contributing

Please read the Contributing guidelines in the documentation site.

License

The project is released under the MIT License.
