A machine learning library based on sklearn that supports grouped time series cross-validation

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Grouped Time Series Cross-Validation

This repository provides tools for classifying and predicting event-based time series data using pipelines, parameter tuning, cross-validation, and model evaluation. Automating the trial-and-error steps of model selection (trying several models and hyperparameter combinations) saves development time. The tools build on Scikit-learn and are intended to get the most out of a model even when little data is available.

Models are trained with cross-validation grouped by date: each date is held out in turn as independent test data, and the remaining dates are used as training data.
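The hold-out-one-date scheme can be illustrated with Scikit-learn's LeaveOneGroupOut splitter. This is a minimal sketch on made-up toy data, not the library's internal implementation; the timestamps, features, and labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical toy data: six timestamps spread over three dates.
timestamps = pd.to_datetime([
    "2024-01-01 08:00", "2024-01-01 16:00",
    "2024-01-02 08:00", "2024-01-02 16:00",
    "2024-01-03 08:00", "2024-01-03 16:00",
])
X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 1, 1, 0, 0])

# One integer group code per calendar date.
groups, _ = pd.factorize(timestamps.normalize())

# Each split holds out every row of one date and trains on the other dates.
splits = list(LeaveOneGroupOut().split(X, y, groups=groups))
for train_idx, test_idx in splits:
    held_out_date = timestamps[test_idx[0]].date()
    print(held_out_date, "train rows:", train_idx, "test rows:", test_idx)
```

Because all rows of a date land on the same side of the split, no information from the held-out date leaks into training.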

1. Pipelines Definition

We define pipelines for three classification models: Gaussian Naive Bayes, Decision Tree, and Logistic Regression. However, you can easily swap these for other classifiers such as Support Vector Machines, Neural Networks, XGBoost, or Random Forests. Each pipeline includes the following steps:

Scaling: Standardization of features.
Feature Selection: Selecting the top features.
Model: The classifier to train.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

pipelines = [
    Pipeline([
        ('scaler', StandardScaler()),
        ('selector', SelectKBest()),
        ('model', GaussianNB())
    ]),
    Pipeline([
        ('scaler', StandardScaler()),
        ('selector', SelectKBest()),
        ('model', DecisionTreeClassifier())
    ]),
    Pipeline([
        ('scaler', StandardScaler()),
        ('selector', SelectKBest()),
        ('model', LogisticRegression())
    ])
]
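Swapping in another classifier only changes the last pipeline step. The sketch below uses a hypothetical helper, make_pipeline_for, which is not part of this library, to wrap any Scikit-learn estimator in the same scaler, selector, and model steps.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_pipeline_for(model):
    """Wrap an estimator in the scaler -> selector -> model steps (hypothetical helper)."""
    return Pipeline([
        ('scaler', StandardScaler()),
        ('selector', SelectKBest()),
        ('model', model),
    ])


# Alternative classifiers mentioned in the text above.
alternative_pipelines = [
    make_pipeline_for(RandomForestClassifier()),
    make_pipeline_for(SVC()),
]
```

Each new pipeline still needs its own parameter grid whose keys match the hyperparameters of the swapped-in model.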

2. Parameter Grids

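Each pipeline requires a corresponding parameter grid that defines the hyperparameters to tune. Keys follow Scikit-learn's step__parameter convention: for example, selector__k sets k on the pipeline step named 'selector'. Below are the grids for the Gaussian Naive Bayes, Decision Tree, and Logistic Regression models, in the same order as the pipelines.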

from sklearn.feature_selection import mutual_info_classif
param_grids = [
    # GaussianNB
    {
        'selector__k': [3, 5, 'all'],
        'selector__score_func': [mutual_info_classif],
        'model__var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6]
    },
    
    # DecisionTreeClassifier
    {
        'selector__k': [3, 5, 'all'],
        'selector__score_func': [mutual_info_classif],
        'model__criterion': ['gini', 'entropy'],
        'model__splitter': ['best', 'random'],
        'model__max_depth': [None, 10, 20, 30],
        'model__min_samples_split': [2, 5, 10],
        'model__min_samples_leaf': [1, 2, 4],
        'model__random_state': [0, 12, 22, 42]
    },
    
    # LogisticRegression
    # Only penalty/solver pairs that scikit-learn accepts are listed:
    # 'lbfgs' rejects 'l1', and 'elasticnet' needs 'saga' plus an l1_ratio.
    {
        'selector__k': [3, 5, 'all'],
        'selector__score_func': [mutual_info_classif],
        'model__penalty': ['l1', 'l2'],
        'model__C': [0.1, 1.0, 10.0],
        'model__solver': ['liblinear', 'saga'],
        'model__max_iter': [100, 200, 500],
        'model__random_state': [0, 12, 22, 42]
    }
]
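Grid size grows multiplicatively with every added value, so it can help to count the candidates before starting a search. The sketch below uses Scikit-learn's ParameterGrid to expand the Gaussian Naive Bayes grid and applies one candidate to a pipeline. It only illustrates the step__parameter naming and does not show how this library runs its search.

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import ParameterGrid
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', SelectKBest()),
    ('model', GaussianNB()),
])

gnb_grid = {
    'selector__k': [3, 5, 'all'],
    'selector__score_func': [mutual_info_classif],
    'model__var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6],
}

# Every combination of the listed values: 3 * 1 * 4 = 12 candidates.
candidates = list(ParameterGrid(gnb_grid))
print(len(candidates))

# 'selector__k' is routed to the parameter k of the step named 'selector'.
pipeline.set_params(**candidates[0])
print(pipeline.named_steps['selector'].k)
```

The same count for the Decision Tree grid above is 3 * 2 * 2 * 4 * 3 * 3 * 4 = 1728 candidates, each of which is fitted once per cross-validation split.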

3. Load the Dataset

Load the dataset from a CSV file and ensure the 'DateTime' column is converted to a datetime object.

import pandas as pd

data = pd.read_csv('model_data.csv')
data['DateTime'] = pd.to_datetime(data['DateTime'])
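As an alternative, pandas can parse the timestamp column while reading the file. The sketch below reads a small in-memory CSV in place of model_data.csv; the rows and the Temperature and Moisture columns are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for model_data.csv; the rows are made up.
csv_text = """DateTime,Temperature,Moisture,Label
2024-01-01 08:00:00,20.5,0.31,0
2024-01-01 16:00:00,23.1,0.28,0
2024-01-02 08:00:00,18.9,0.40,1
"""

# parse_dates converts the column during loading.
data = pd.read_csv(io.StringIO(csv_text), parse_dates=['DateTime'])

# Fail early if the column did not end up with a datetime dtype.
if not pd.api.types.is_datetime64_any_dtype(data['DateTime']):
    raise TypeError("DateTime column was not parsed as datetime")
print(data.dtypes)
```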

4. Classification

Perform classification using the grouped time series cross-validation with the defined pipelines and parameter grids. The GroupedTimeSerieCV class handles the cross-validation logic.

from grouped_timeserie_cv import GroupedTimeSerieCV

grouped_cv = GroupedTimeSerieCV()
result = grouped_cv.classify(data, pipelines, param_grids, 'D', 'DateTime', 'Label', 'accuracy')

Optional parameters:

Frequency ('D'): The interval at which the data is resampled into groups; 'D' means daily.
DateTime column ('DateTime'): The name of the column containing timestamps.
Label column ('Label'): The name of the column containing the target label.
Scoring method ('accuracy'): The metric used to evaluate and compare models.

Note: If a group contains more than one unique label, it may negatively impact the model's performance.
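A quick check for mixed-label groups can be run before training. The sketch below assumes daily groups and the 'DateTime' and 'Label' column names used above; the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the second day contains two different labels.
data = pd.DataFrame({
    'DateTime': pd.to_datetime([
        '2024-01-01 08:00', '2024-01-01 16:00',
        '2024-01-02 08:00', '2024-01-02 16:00',
    ]),
    'Label': [0, 0, 1, 0],
})

# Count the distinct labels within each daily group.
labels_per_group = (
    data.groupby(pd.Grouper(key='DateTime', freq='D'))['Label'].nunique()
)

# Groups with more than one distinct label deserve a closer look.
mixed_groups = labels_per_group[labels_per_group > 1]
print(mixed_groups)
```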

5. Expected Output

During training, you will see output like the following in the console:

Process model: GaussianNB
Score: 0.781
Process model: DecisionTreeClassifier
Score: 0.811
Process model: LogisticRegression
Score: 0.836
Best model: LogisticRegression
Best parameters: {
 'model__C': 1,
 'model__class_weight': 'balanced',
 'model__max_iter': 1000,
 'model__penalty': 'l2',
 'model__solver': 'liblinear',
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'selector__k': 3
}
Selected features: ['Moisture', 'Temperature', 'MeanTemperaturePeak']

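In this example, Logistic Regression is the best model with an accuracy of 83.6%. The best parameters reported depend on the grids you supply; the listing above includes settings such as model__class_weight and scaler__with_mean that are not in the grids from section 2, so your output will differ.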

6. Define Result Data Class

The CrossValidationResult class encapsulates the results from the cross-validation process, including confusion matrices, model performance, and selected features.

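from dataclasses import dataclass

import numpy as np

# The @dataclass decorator is an assumption based on the section title;
# the fields below are the ones the library documents.
@dataclass
class CrossValidationResult: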
    confusion_matrices: np.ndarray
    class_labels: list
    train_sizes: np.ndarray
    train_mean: np.ndarray
    train_std: np.ndarray
    test_mean: np.ndarray
    test_std: np.ndarray
    best_model: object
    selected_feature_names: list
    best_params: dict
    incorrect_dates: np.ndarray
    actual_values: np.ndarray
    predicted_values: np.ndarray
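The array fields can be inspected with plain NumPy. The sketch below uses hypothetical stand-ins for result.actual_values and result.predicted_values, assuming both are one-dimensional arrays of equal length.

```python
import numpy as np

# Hypothetical stand-ins for result.actual_values and result.predicted_values.
actual_values = np.array([0, 1, 1, 0, 1])
predicted_values = np.array([0, 1, 0, 0, 1])

# Fraction of matching predictions.
accuracy = float(np.mean(actual_values == predicted_values))

# Positions where the prediction differs from the actual value.
misclassified = np.flatnonzero(actual_values != predicted_values)

print(f"accuracy={accuracy:.2f}, misclassified positions={misclassified.tolist()}")
```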

7. Plot Results

Once the classification is complete, use the plotting utilities to visualize the results, such as the confusion matrix and learning curve.

# Plot confusion matrix
grouped_cv.plotter.plot_confusion_matrix(result.confusion_matrices, result.class_labels)

# Plot learning curve
grouped_cv.plotter.plot_learning_curve(result.train_sizes, result.train_mean, result.train_std, result.test_mean, result.test_std)
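How the library computes train_sizes, train_mean, train_std, test_mean, and test_std is not described here. One plausible way to obtain arrays of the same shape is Scikit-learn's learning_curve with a group-aware splitter. This is a sketch on synthetic data, not the library's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneGroupOut, learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 120 samples in 12 groups of 10 (standing in for 12 dates).
X, y = make_classification(n_samples=120, n_features=8, random_state=0)
groups = np.repeat(np.arange(12), 10)

train_sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    groups=groups, cv=LeaveOneGroupOut(),
    train_sizes=[0.3, 0.6, 1.0], scoring='accuracy',
)

# Average over the cross-validation splits: one value per training size.
train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
test_mean, test_std = test_scores.mean(axis=1), test_scores.std(axis=1)
print(train_sizes, test_mean)
```

A large gap between train_mean and test_mean that does not close as the training size grows usually points to overfitting.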

8. Regression

In addition to classification, the framework supports regression models. Below is an example using multilayer perceptron (MLP), k-nearest neighbors, and linear regression models.

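from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from grouped_timeserie_cv import GroupedTimeSerieCV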

pipelines = [
    Pipeline([
        ('scaler', StandardScaler()),
        ('selector', SelectKBest()),
        ('model', MLPRegressor())
    ]),
    Pipeline([
        ('scaler', StandardScaler()),
        ('selector', SelectKBest()),
        ('model', KNeighborsRegressor())
    ]),
    Pipeline([
        ('scaler', StandardScaler()),
        ('selector', SelectKBest()),
        ('model', LinearRegression())
    ])
]

param_grids = [
    # MLPRegressor
    {
        'selector__k': [3, 5, 'all'],
        'selector__score_func': [mutual_info_regression],
        'model__hidden_layer_sizes': [(50,), (100,), (50, 50)],
        'model__activation': ['relu', 'tanh', 'logistic'],
        'model__solver': ['adam', 'sgd'],
        'model__alpha': [0.0001, 0.001, 0.01],
        'model__learning_rate': ['constant', 'adaptive'],
        'model__random_state': [0, 12, 22, 42]
    },
    
    # KNeighborsRegressor
    {
        'selector__k': [3, 5, 'all'],
        'selector__score_func': [mutual_info_regression],
        'model__n_neighbors': [3, 5, 7, 9],
        'model__weights': ['uniform', 'distance'],
        'model__algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
        'model__p': [1, 2]
    },
    
    # LinearRegression
    {
        'selector__k': [3, 5, 'all'],
        'selector__score_func': [mutual_info_regression]
    }
]

grouped_cv = GroupedTimeSerieCV()
result = grouped_cv.predict(data, pipelines, param_grids, 'D', 'DateTime', 'Label', 'neg_mean_squared_error')

# Plot predictions vs. actual values
grouped_cv.plotter.plot_prediction_vs_actual(result.actual_values, result.predicted_values)
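The 'neg_mean_squared_error' scorer returns the negated mean squared error, so that higher scores are always better and a score closer to zero means a better fit. The sketch below checks this on a tiny made-up dataset using Scikit-learn only; it does not call this library.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error

# Hypothetical toy regression data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 0.9, 2.1, 2.9])

model = LinearRegression().fit(X, y)

# Plain mean squared error (lower is better).
mse = mean_squared_error(y, model.predict(X))

# The scorer flips the sign so that higher is better.
neg_mse = get_scorer('neg_mean_squared_error')(model, X, y)

# Root mean squared error is in the same units as the target.
rmse = float(np.sqrt(mse))
print(f"mse={mse:.4f}, scorer={neg_mse:.4f}, rmse={rmse:.4f}")
```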


Release history

0.1.1 (this version): Oct 12, 2025
0.1: Sep 26, 2024

Download files

Source Distribution

grouped_timeserie_cv-0.1.1.tar.gz (10.4 kB)

Uploaded Oct 12, 2025 Source

Built Distribution

grouped_timeserie_cv-0.1.1-py3-none-any.whl (10.1 kB)

Uploaded Oct 12, 2025 Python 3

File details

Details for the file grouped_timeserie_cv-0.1.1.tar.gz.

File metadata

Download URL: grouped_timeserie_cv-0.1.1.tar.gz
Upload date: Oct 12, 2025
Size: 10.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for grouped_timeserie_cv-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`9ff583b5d6c1659cf2888b9974318c6ae53c2cf74b5b5270499f8175abfa1fa8`
MD5	`2e1559c855ca36b8c92d6a440f8b2728`
BLAKE2b-256	`63a9c13de4ec3dc5c62fd1c5e31e328cdde03024c6a96e059bb3dda116772e92`

File details

Details for the file grouped_timeserie_cv-0.1.1-py3-none-any.whl.

File metadata

Download URL: grouped_timeserie_cv-0.1.1-py3-none-any.whl
Upload date: Oct 12, 2025
Size: 10.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for grouped_timeserie_cv-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`248de4ea2562736c09b515031dd15941ea61d93c931a1d4e91450a6c91f57859`
MD5	`170d9c4d085babbb8a27d35c95c5d04e`
BLAKE2b-256	`652c2492db1d443ff5ab47b21f6deb09d57c0d885b14c9d2c75340d64aea83d0`
