A tool for panel data analysis.
PanelSplit: a tool for panel data analysis
PanelSplit is a Python package designed to facilitate time series cross-validation when working with multiple entities (aka panel data). This tool is useful for handling panel data in various stages throughout the data pipeline, including feature engineering, hyper-parameter tuning, and model estimation.
Installation
You can install PanelSplit using pip:
pip install panelsplit
Documentation
Initialization Parameters
- periods: Pandas Series. Represents the time series of the DataFrame.
- unique_periods: Pandas Series. Contains unique periods. Default is None, in which case unique periods are derived from periods and then sorted.
- snapshots: Pandas Series, default=None. Defines the snapshot for the observation, i.e. when the observation was updated.
- n_splits: int, default=5. Number of splits for the underlying TimeSeriesSplit.
- gap: int, default=0. Gap between train and test sets in TimeSeriesSplit.
- test_size: int, default=1. Size of the test set in TimeSeriesSplit.
- max_train_size: int, default=None. Maximum size for a single training set in TimeSeriesSplit.
- plot: bool, default=False. Flag to visualize time series splits.
- drop_splits: bool, default=False. Flag to drop splits with either empty or single unique values in train or test sets.
- y: Pandas Series, default=None. Target variable. Required if drop_splits is set to True.
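The interaction of n_splits, gap, test_size, and max_train_size can be sketched in plain Python. This is a minimal illustration, assuming the splits follow scikit-learn's documented TimeSeriesSplit behavior applied to the sorted unique periods. The helper period_splits is hypothetical and is not part of PanelSplit.

```python
def period_splits(unique_periods, n_splits=5, gap=0, test_size=1, max_train_size=None):
    """Return (train_periods, test_periods) pairs over the sorted unique periods.

    Hypothetical helper that mimics TimeSeriesSplit on period values.
    """
    periods = sorted(unique_periods)
    n = len(periods)
    first_test = n - n_splits * test_size
    if first_test - gap <= 0:
        raise ValueError("Too many splits for the number of unique periods.")

    splits = []
    for test_start in range(first_test, n, test_size):
        train_end = test_start - gap  # leave `gap` periods unused before the test set
        train_start = 0
        if max_train_size is not None and max_train_size < train_end:
            train_start = train_end - max_train_size  # keep only the most recent periods
        splits.append((periods[train_start:train_end],
                       periods[test_start:test_start + test_size]))
    return splits


years = [2001, 2002, 2003, 2004, 2005]
for train, test in period_splits(years, n_splits=2):
    print(train, "->", test)
# [2001, 2002, 2003] -> [2004]
# [2001, 2002, 2003, 2004] -> [2005]
```

Each later split trains on a longer history, and the test period always lies after every training period.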
Methods
split(X=None, y=None, groups=None, init=False)
Generate train/test indices based on unique periods.
Parameters
- X, y, groups: Always ignored, exist for compatibility.
- init: bool, default=False. Flag indicating initialization phase, when n_splits is modified depending on whether or not drop_splits is True. When split is called apart from initialization, this should be set to False.
Returns
List of train/test indices.
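As a rough sketch of what "indices based on unique periods" means, the snippet below maps period-level splits onto row positions of a small panel, then filters out degenerate splits in the spirit of drop_splits. The helpers rows_for_splits and drop_degenerate are hypothetical names, and the filtering rule (fewer than two distinct target values in train or test) is one reading of the description above, not the package's confirmed logic.

```python
def rows_for_splits(periods, splits):
    """Map (train_periods, test_periods) pairs to positional row indices."""
    row_splits = []
    for train_periods, test_periods in splits:
        train_set, test_set = set(train_periods), set(test_periods)
        train_rows = [i for i, p in enumerate(periods) if p in train_set]
        test_rows = [i for i, p in enumerate(periods) if p in test_set]
        row_splits.append((train_rows, test_rows))
    return row_splits


def drop_degenerate(row_splits, y):
    """Keep only splits whose train and test targets each have 2+ distinct values."""
    kept = []
    for train_rows, test_rows in row_splits:
        train_values = {y[i] for i in train_rows}
        test_values = {y[i] for i in test_rows}
        if len(train_values) > 1 and len(test_values) > 1:
            kept.append((train_rows, test_rows))
    return kept


# Two entities (rows 0-3 and rows 4-7), each observed in four years.
periods = [2001, 2002, 2003, 2004, 2001, 2002, 2003, 2004]
y = [0, 1, 1, 1, 0, 0, 0, 1]
splits = [([2001, 2002], [2003]), ([2001, 2002, 2003], [2004])]

row_splits = rows_for_splits(periods, splits)
print(row_splits)
# [([0, 1, 4, 5], [2, 6]), ([0, 1, 2, 4, 5, 6], [3, 7])]

# The second split is dropped: both test rows have y == 1.
print(drop_degenerate(row_splits, y))
# [([0, 1, 4, 5], [2, 6])]
```

Because every entity shares the same period boundaries, no row from a test period can appear in that split's training set.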
get_n_splits(X=None, y=None, groups=None)
Returns the number of splitting iterations in the cross-validator.
Parameters
- X, y, groups: Always ignored, exist for compatibility.
Returns
Number of splits.
gen_snapshots(data, period_col=None)
Generate snapshots for each split.
Parameters
- data: Pandas DataFrame. DataFrame from which snapshots are generated.
- period_col: str, default=None. The column in data from which the column snapshot_period is created.
Returns
A pandas DataFrame where each split has its own set of observations.
gen_train_labels(labels)
Generate train labels for each split.
Parameters
- labels: Pandas DataFrame or Series. The labels used to identify observations.
Returns
The labels of each fold's train set as a single DataFrame.
gen_test_labels(labels)
Generate test labels for each split.
Parameters
- labels: Pandas DataFrame or Series. The labels used to identify observations.
Returns
The labels of each fold's test set as a single DataFrame.
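The idea behind the two label methods can be illustrated without pandas. The sketch below gathers the labels of each fold's train or test rows into one list; collect_labels and the hard-coded row_splits are hypothetical stand-ins for what the package computes internally.

```python
def collect_labels(labels, row_splits, part="test"):
    """Concatenate the labels of every split's train or test rows."""
    side = 0 if part == "train" else 1
    return [labels[i] for split in row_splits for i in split[side]]


# (entity, year) labels for two entities observed over four years.
labels = [("A", 2001), ("A", 2002), ("A", 2003), ("A", 2004),
          ("B", 2001), ("B", 2002), ("B", 2003), ("B", 2004)]
row_splits = [([0, 1, 4, 5], [2, 6]), ([0, 1, 2, 4, 5, 6], [3, 7])]

test_labels = collect_labels(labels, row_splits, part="test")
train_labels = collect_labels(labels, row_splits, part="train")

print(test_labels)
# [('A', 2003), ('B', 2003), ('A', 2004), ('B', 2004)]
print(len(train_labels))
# 10
```

Test labels appear once each, so they line up with predictions concatenated across folds. Train labels repeat because an observation can belong to the training set of several splits.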
cross_val_fit(estimator, X, y, sample_weight=None, n_jobs=1)
Fit a given estimator on the training set of each split.
Parameters
- estimator: estimator object implementing ‘fit’. The object to use to fit the data.
- X: Pandas DataFrame. Features.
- y: Pandas Series. Target variable.
- sample_weight: Pandas Series. Sample weights for the training data.
- n_jobs: Optional int (default=1). The number of jobs to run in parallel. See the n_jobs argument for the Parallel class in the joblib package for further details.
Returns
fitted_estimators: A list containing fitted estimators for each split.
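The per-split fitting loop might look like the following sketch. It uses a toy estimator and a hypothetical helper fit_per_split, runs sequentially (the n_jobs parallelism is omitted), and assumes each split gets its own independent copy of the estimator.

```python
import copy


class WeightedMean:
    """Toy estimator: predicts the (weighted) mean of the training target."""

    def fit(self, X, y, sample_weight=None):
        w = sample_weight if sample_weight is not None else [1.0] * len(y)
        self.mean_ = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


def fit_per_split(estimator, X, y, row_splits, sample_weight=None):
    """Fit an independent copy of `estimator` on each split's training rows."""
    fitted = []
    for train_rows, _ in row_splits:
        est = copy.deepcopy(estimator)  # never refit the same object across splits
        X_train = [X[i] for i in train_rows]
        y_train = [y[i] for i in train_rows]
        w_train = None
        if sample_weight is not None:
            w_train = [sample_weight[i] for i in train_rows]
        fitted.append(est.fit(X_train, y_train, sample_weight=w_train))
    return fitted


X = [[float(i)] for i in range(8)]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
row_splits = [([0, 1, 4, 5], [2, 6]), ([0, 1, 2, 4, 5, 6], [3, 7])]

fitted_estimators = fit_per_split(WeightedMean(), X, y, row_splits)
print([est.mean_ for est in fitted_estimators])
# [3.5, 4.0]
```

Keeping one fitted estimator per split makes it possible to predict later without refitting.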
cross_val_predict(fitted_estimators, X, prediction_method='predict', return_train_preds=False, n_jobs=1)
Perform cross-validated predictions using a list of fitted estimators.
Parameters
- fitted_estimators: A list of fitted estimators, one for each split.
- X: Pandas DataFrame. Features.
- prediction_method: The prediction method to use. It can be 'predict', 'predict_proba', or 'predict_log_proba'. Default is 'predict'.
- return_train_preds: Optional bool (default=False). If True, return predictions for the training set as well.
- n_jobs: Optional int (default=1). The number of jobs to run in parallel. See the n_jobs argument for the Parallel class in the joblib package for further details.
Returns
y: ndarray of shape (n_samples,) or (n_samples, n_outputs). The predicted values concatenated across folds. If return_train_preds is True, the output will be y_test, y_train.
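To make the roles of prediction_method and return_train_preds concrete, here is a small sketch. ConstantClassifier stands in for already-fitted estimators and predict_per_split is a hypothetical helper; neither is part of PanelSplit, and the real method returns arrays rather than plain lists.

```python
class ConstantClassifier:
    """Stand-in for a fitted binary classifier with a fixed positive-class probability."""

    def __init__(self, p):
        self.p = p

    def predict(self, X):
        return [int(self.p >= 0.5) for _ in X]

    def predict_proba(self, X):
        return [[1 - self.p, self.p] for _ in X]


def predict_per_split(fitted_estimators, X, row_splits,
                      prediction_method="predict", return_train_preds=False):
    """Predict each split's test rows with that split's estimator and concatenate."""
    test_preds, train_preds = [], []
    for est, (train_rows, test_rows) in zip(fitted_estimators, row_splits):
        method = getattr(est, prediction_method)  # e.g. est.predict or est.predict_proba
        test_preds.extend(method([X[i] for i in test_rows]))
        if return_train_preds:
            train_preds.extend(method([X[i] for i in train_rows]))
    if return_train_preds:
        return test_preds, train_preds
    return test_preds


X = [[float(i)] for i in range(8)]
row_splits = [([0, 1, 4, 5], [2, 6]), ([0, 1, 2, 4, 5, 6], [3, 7])]
models = [ConstantClassifier(0.25), ConstantClassifier(0.75)]

print(predict_per_split(models, X, row_splits))
# [0, 0, 1, 1]
print(predict_per_split(models, X, row_splits, prediction_method="predict_proba"))
# [[0.75, 0.25], [0.75, 0.25], [0.25, 0.75], [0.25, 0.75]]
```

Each test row is predicted only by the estimator whose training set precedes it in time.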
cross_val_fit_predict(estimator, X, y, prediction_method='predict', sample_weight=None, n_jobs=1)
Fit a given estimator on each split and generate cross-validated predictions in one call.
Parameters
- estimator: estimator object.
- X: Pandas DataFrame. Features.
- y: Pandas Series. Target variable.
- prediction_method: The prediction method to use. It can be 'predict', 'predict_proba', or 'predict_log_proba'. Default is 'predict'.
- sample_weight: Pandas Series. Sample weights for the training data.
- n_jobs: Optional int (default=1). The number of jobs to run in parallel. See the n_jobs argument for the Parallel class in the joblib package for further details.
Returns
y, fitted_estimators: The predicted values concatenated across folds as well as a list containing fitted estimators for each split. If return_train_preds is True, the output will be y_test, y_train, fitted_estimators.
cross_val_fit_transform(transformer, X, include_test_in_fit=False, transform_train=False)
Perform cross-validated transformation using a given transformer.
Parameters
- transformer: Transformer object.
- X: Features.
- include_test_in_fit: bool (default=False). Whether to include test data in fitting for each split.
- transform_train: bool (default=False). Whether to transform train set as well as the test set.
Returns
X, fitted_transformers: DataFrame containing transformed values during cross-validation as well as a list containing fitted transformers for each split.
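The effect of include_test_in_fit is easiest to see with a toy transformer. The sketch below is illustrative only: Centerer and fit_transform_per_split are hypothetical, the feature is a single list of numbers, and the transform_train option is left out for brevity.

```python
import copy


class Centerer:
    """Toy transformer: subtracts the mean learned during fit."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


def fit_transform_per_split(transformer, X, row_splits, include_test_in_fit=False):
    """Fit a copy of `transformer` per split, then transform that split's test rows."""
    transformed, fitted = [], []
    for train_rows, test_rows in row_splits:
        fit_rows = train_rows + test_rows if include_test_in_fit else train_rows
        t = copy.deepcopy(transformer).fit([X[i] for i in fit_rows])
        transformed.extend(t.transform([X[i] for i in test_rows]))
        fitted.append(t)
    return transformed, fitted


X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
row_splits = [([0, 1, 4, 5], [2, 6]), ([0, 1, 2, 4, 5, 6], [3, 7])]

values, transformers = fit_transform_per_split(Centerer(), X, row_splits)
print(values)
# [-0.5, 3.5, 0.0, 4.0]

values_with_test, _ = fit_transform_per_split(Centerer(), X, row_splits,
                                              include_test_in_fit=True)
print(values_with_test)
# [-1.0, 3.0, -0.5, 3.5]
```

With the default, the test rows never influence the fitted statistics, which avoids leaking future information into feature engineering.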
Examples
For more examples and detailed usage instructions, refer to the examples directory in this repository. Also feel free to check out an article I wrote about PanelSplit.
Background
Work on panelsplit started at EconAI in December 2023 and has been under active development since then.
Contributing
Contributions to PanelSplit are welcome! If you encounter any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request on GitHub.
License
This project is licensed under the MIT License - see the LICENSE file for details.