A tool for panel data analysis.

PanelSplit: a tool for panel data analysis

PanelSplit is a Python package for time series cross-validation on panel data, that is, data with multiple entities observed over time. It is useful at several stages of the data pipeline, including feature engineering, hyperparameter tuning, and model estimation.

Installation

You can install PanelSplit using pip:

pip install panelsplit

Documentation

Initialization Parameters

  • periods: Pandas Series. The time period of each observation (row) in the DataFrame.
  • unique_periods: Pandas Series, default=None. The unique periods to split on. If None, they are derived from periods and then sorted.
  • snapshots: Pandas Series, default=None. Defines the snapshot for the observation, i.e. when the observation was updated.
  • n_splits: int, default=5. Number of splits for the underlying TimeSeriesSplit.
  • gap: int, default=0. Gap between train and test sets in TimeSeriesSplit.
  • test_size: int, default=1. Size of the test set in TimeSeriesSplit.
  • max_train_size: int, default=None. Maximum size for a single training set in TimeSeriesSplit.
  • plot: bool, default=False. Flag to visualize time series splits.
  • drop_splits: bool, default=False. Flag to drop splits with either empty or single unique values in train or test sets.
  • y: Pandas Series, default=None. Target variable. Required if drop_splits is set to True.
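
To make the roles of n_splits, gap, test_size, and max_train_size concrete, here is a minimal pure-Python sketch of TimeSeriesSplit-style splitting applied to sorted unique periods. It is an illustration under stated assumptions, not PanelSplit's actual code; the function name split_unique_periods and the sample years are hypothetical.

```python
def split_unique_periods(unique_periods, n_splits=5, gap=0, test_size=1,
                         max_train_size=None):
    """Illustrative TimeSeriesSplit-style splitting over sorted unique periods."""
    periods = sorted(unique_periods)
    n = len(periods)
    first_test_start = n - n_splits * test_size
    if first_test_start - gap <= 0:
        raise ValueError("Too many splits for the number of unique periods.")
    splits = []
    for test_start in range(first_test_start, n, test_size):
        train_end = test_start - gap  # leave `gap` periods out before the test set
        train_start = 0
        if max_train_size is not None and max_train_size < train_end:
            train_start = train_end - max_train_size  # keep only the latest periods
        train = periods[train_start:train_end]
        test = periods[test_start:test_start + test_size]
        splits.append((train, test))
    return splits


# Hypothetical example: six yearly periods, three splits, one test period each.
years = [2001, 2002, 2003, 2004, 2005, 2006]
period_splits = split_unique_periods(years, n_splits=3)
for train, test in period_splits:
    print(train, "->", test)
```

Each split trains on the periods that precede its test period, so later splits see a longer history unless max_train_size caps it.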

Methods

split(X=None, y=None, groups=None, init=False)

Generate train/test indices based on unique periods.

Parameters
  • X, y, groups: Always ignored; they exist for compatibility with the scikit-learn splitter interface.
  • init: bool, default=False. Set to True only during initialization, when n_splits may be adjusted because drop_splits is True. Leave as False for any call to split made after initialization.
Returns

List of train/test indices.
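
The idea of generating indices "based on unique periods" can be sketched as follows: a split is first defined at the period level, and every row whose period falls in the train (or test) periods joins that side of the split. This is a rough illustration, not the library's implementation; the helper rows_for, the toy panel, and the use of positional row indices are assumptions.

```python
# Toy panel: two entities observed in each of four years, stacked row by row.
periods = [2001, 2001, 2002, 2002, 2003, 2003, 2004, 2004]

# Period-level splits, written out by hand for this illustration.
period_splits = [
    ([2001, 2002], [2003]),
    ([2001, 2002, 2003], [2004]),
]


def rows_for(periods, wanted):
    """Return the positions of all rows whose period is in `wanted`."""
    wanted = set(wanted)
    return [i for i, p in enumerate(periods) if p in wanted]


# Row-level splits: all entities observed in a period move together.
row_splits = [
    (rows_for(periods, train), rows_for(periods, test))
    for train, test in period_splits
]
for train_idx, test_idx in row_splits:
    print(train_idx, "->", test_idx)
```

Because rows are assigned by period rather than by position, no entity's future observations end up in a training set.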

get_n_splits(X=None, y=None, groups=None)

Returns the number of splitting iterations in the cross-validator.

Parameters
  • X, y, groups: Always ignored; they exist for compatibility with the scikit-learn splitter interface.
Returns

Number of splits.

gen_snapshots(data, period_col = None)

Generate snapshots for each split.

Parameters
  • data: Pandas DataFrame. DataFrame from which snapshots are generated.
  • period_col: str, default=None. The column in data from which the column snapshot_period is created.
Returns

A pandas DataFrame where each split has its own set of observations.
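
One plausible reading of "each split has its own set of observations" is sketched below with plain dictionaries instead of a DataFrame. The split column, the choice of the latest period in the split as snapshot_period, and the function name snapshots_sketch are all assumptions made for illustration; the library's actual output may differ.

```python
data = [
    {"entity": "A", "year": 2001, "x": 1.0},
    {"entity": "B", "year": 2001, "x": 2.0},
    {"entity": "A", "year": 2002, "x": 1.5},
    {"entity": "B", "year": 2002, "x": 2.5},
    {"entity": "A", "year": 2003, "x": 1.7},
    {"entity": "B", "year": 2003, "x": 2.7},
]
# Row-level (train, test) index pairs, written out by hand.
row_splits = [([0, 1], [2, 3]), ([0, 1, 2, 3], [4, 5])]


def snapshots_sketch(data, row_splits, period_col=None):
    """Copy the rows of every split and tag them with the split they belong to."""
    out = []
    for split_id, (train_idx, test_idx) in enumerate(row_splits):
        rows = [dict(data[i]) for i in train_idx + test_idx]  # copies, not views
        snapshot_period = None
        if period_col is not None:
            # Assumption: the snapshot is dated by the latest period in the split.
            snapshot_period = max(r[period_col] for r in rows)
        for r in rows:
            r["split"] = split_id
            if period_col is not None:
                r["snapshot_period"] = snapshot_period
        out.extend(rows)
    return out


snapshots = snapshots_sketch(data, row_splits, period_col="year")
print(len(snapshots))
```

The result is longer than the input because observations that belong to several splits are repeated once per split.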

gen_train_labels(labels)

Generate train labels for each split.

Parameters
  • labels: Pandas DataFrame or Series. The labels used to identify observations.
Returns

The labels of each fold's train set as a single DataFrame.

gen_test_labels(labels)

Generate test labels for each split.

Parameters
  • labels: Pandas DataFrame or Series. The labels used to identify observations.
Returns

The labels of each fold's test set as a single DataFrame.
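
The two label methods can be pictured as selecting the labels at each fold's train or test positions and stacking the results. The sketch below uses plain lists and a hypothetical helper fold_labels; it illustrates the idea and is not the library's code.

```python
labels = ["A-2001", "B-2001", "A-2002", "B-2002", "A-2003", "B-2003"]
# Row-level (train, test) index pairs, written out by hand.
row_splits = [([0, 1], [2, 3]), ([0, 1, 2, 3], [4, 5])]


def fold_labels(labels, row_splits, part):
    """Concatenate the labels of every fold's train or test rows."""
    position = 0 if part == "train" else 1
    return [labels[i] for split in row_splits for i in split[position]]


train_labels = fold_labels(labels, row_splits, "train")
test_labels = fold_labels(labels, row_splits, "test")
print(test_labels)
```

Train labels repeat across folds because an expanding training window reuses earlier observations, whereas each test label appears once here.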

cross_val_fit(estimator, X, y, sample_weight=None, n_jobs=1)

Fit a given estimator on the training set of each split.

Parameters
  • estimator: estimator object implementing ‘fit’. The object to use to fit the data.
  • X: Pandas DataFrame. Features.
  • y: Pandas Series. Target variable.
  • sample_weight: Pandas Series. Sample weights for the training data.
  • n_jobs: Optional int (default=1). The number of jobs to run in parallel. See the n_jobs argument for the Parallel class in the joblib package for further details.
Returns

fitted_estimators: A list containing fitted estimators for each split.
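
As a rough sketch of what fitting one estimator per split involves, the example below copies a toy estimator for each split and fits it on that split's training rows only. MeanRegressor, cross_val_fit_sketch, and the explicit row_splits argument are hypothetical and exist only for this illustration; parallelism via n_jobs is left out.

```python
import copy


class MeanRegressor:
    """Toy estimator: predicts the (weighted) mean of the training targets."""

    def fit(self, X, y, sample_weight=None):
        weights = sample_weight if sample_weight is not None else [1.0] * len(y)
        self.mean_ = sum(w * t for w, t in zip(weights, y)) / sum(weights)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


def cross_val_fit_sketch(estimator, X, y, row_splits, sample_weight=None):
    """Fit an independent copy of `estimator` on each split's training rows."""
    fitted = []
    for train_idx, _test_idx in row_splits:
        est = copy.deepcopy(estimator)  # fresh copy so splits do not share state
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        w_train = None
        if sample_weight is not None:
            w_train = [sample_weight[i] for i in train_idx]
        fitted.append(est.fit(X_train, y_train, sample_weight=w_train))
    return fitted


X = [[1.0], [2.0], [1.5], [2.5], [1.7], [2.7]]
y = [10.0, 20.0, 12.0, 22.0, 14.0, 24.0]
row_splits = [([0, 1], [2, 3]), ([0, 1, 2, 3], [4, 5])]
fitted_estimators = cross_val_fit_sketch(MeanRegressor(), X, y, row_splits)
print([est.mean_ for est in fitted_estimators])
```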

cross_val_predict(fitted_estimators, X, prediction_method='predict', return_train_preds=False, n_jobs=1)

Perform cross-validated predictions using a list of fitted estimators.

Parameters
  • fitted_estimators: A list of fitted estimators, one for each split.
  • X: Pandas DataFrame. Features.
  • prediction_method: The prediction method to use. It can be 'predict', 'predict_proba', or 'predict_log_proba'. Default is 'predict'.
  • return_train_preds: Optional bool (default=False). If True, return predictions for the training set as well.
  • n_jobs: Optional int (default=1). The number of jobs to run in parallel. See the n_jobs argument for the Parallel class in the joblib package for further details.
Returns

y: ndarray of shape (n_samples,) or (n_samples, n_outputs). The predicted values concatenated across folds. If return_train_preds is True, the output will be y_test, y_train.
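
The way predictions are concatenated across folds can be sketched like this: each fitted estimator predicts only for its own split, and the per-split outputs are joined in split order. ConstantModel, cross_val_predict_sketch, and the explicit row_splits argument are hypothetical stand-ins, not part of the library.

```python
class ConstantModel:
    """Stand-in for an already fitted estimator."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value] * len(X)


def cross_val_predict_sketch(fitted_estimators, X, row_splits,
                             prediction_method="predict",
                             return_train_preds=False):
    """Predict each split's test rows with that split's own estimator."""
    test_preds, train_preds = [], []
    for est, (train_idx, test_idx) in zip(fitted_estimators, row_splits):
        predict = getattr(est, prediction_method)  # e.g. 'predict' or 'predict_proba'
        test_preds.extend(predict([X[i] for i in test_idx]))
        if return_train_preds:
            train_preds.extend(predict([X[i] for i in train_idx]))
    if return_train_preds:
        return test_preds, train_preds
    return test_preds


X = [[1.0], [2.0], [1.5], [2.5], [1.7], [2.7]]
row_splits = [([0, 1], [2, 3]), ([0, 1, 2, 3], [4, 5])]
fitted_estimators = [ConstantModel(15.0), ConstantModel(16.0)]

y_test = cross_val_predict_sketch(fitted_estimators, X, row_splits)
y_test_2, y_train = cross_val_predict_sketch(
    fitted_estimators, X, row_splits, return_train_preds=True
)
print(y_test)
```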

cross_val_fit_predict(estimator, X, y, prediction_method='predict', sample_weight=None, n_jobs=1)

Fit a given estimator on each split and return its cross-validated predictions together with the fitted estimators.

Parameters
  • estimator: estimator object.
  • X: Pandas DataFrame. Features.
  • y: Pandas Series. Target variable.
  • prediction_method: The prediction method to use. It can be 'predict', 'predict_proba', or 'predict_log_proba'. Default is 'predict'.
  • sample_weight: Pandas Series. Sample weights for the training data.
  • n_jobs: Optional int (default=1). The number of jobs to run in parallel. See the n_jobs argument for the Parallel class in the joblib package for further details.
Returns

y, fitted_estimators: The predicted values concatenated across folds as well as a list containing fitted estimators for each split. If return_train_preds is True, the output will be y_test, y_train, fitted_estimators.

cross_val_fit_transform(transformer, X, include_test_in_fit=False, transform_train=False)

Perform cross-validated transformation using a given transformer.

Parameters
  • transformer: Transformer object.
  • X: Features.
  • include_test_in_fit: bool (default=False). Whether to include test data in fitting for each split.
  • transform_train: bool (default=False). Whether to transform train set as well as the test set.
Returns

X, fitted_transformers: DataFrame containing transformed values during cross-validation as well as a list containing fitted transformers for each split.
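
The effect of include_test_in_fit and transform_train can be illustrated with a toy mean-centering transformer. This is a sketch under assumptions, not the library's code: MeanCenterer, cross_val_fit_transform_sketch, and the explicit row_splits argument are hypothetical, and leaving rows that are never transformed at their original values is a choice made only for this example.

```python
import copy


class MeanCenterer:
    """Toy transformer: subtracts the mean learned during fit."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


def cross_val_fit_transform_sketch(transformer, X, row_splits,
                                   include_test_in_fit=False,
                                   transform_train=False):
    """Fit a copy of `transformer` per split and transform that split's rows."""
    transformed = list(X)  # assumption: untouched rows keep their original values
    fitted = []
    for train_idx, test_idx in row_splits:
        t = copy.deepcopy(transformer)
        fit_idx = train_idx + test_idx if include_test_in_fit else train_idx
        t.fit([X[i] for i in fit_idx])
        target_idx = train_idx + test_idx if transform_train else test_idx
        new_values = t.transform([X[i] for i in target_idx])
        for i, value in zip(target_idx, new_values):
            transformed[i] = value
        fitted.append(t)
    return transformed, fitted


X = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0]
row_splits = [([0, 1], [2, 3]), ([0, 1, 2, 3], [4, 5])]
X_new, fitted_transformers = cross_val_fit_transform_sketch(
    MeanCenterer(), X, row_splits
)
print(X_new)
```

With the defaults, each test row is transformed using statistics learned from earlier periods only, which avoids leaking test information into the features.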


Examples

For more examples and detailed usage instructions, refer to the examples directory in this repository. Also feel free to check out an article I wrote about PanelSplit.

Background

Work on panelsplit started at EconAI in December 2023 and has been under active development since then.

Contributing

Contributions to PanelSplit are welcome! If you encounter any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License - see the LICENSE file for details.
