A package for imputing missing data in time series
timefiller
timefiller is a Python package for time series imputation and forecasting. When applied to a set of correlated time series, each series is processed individually, leveraging correlations with the other series as well as its own auto-regressive patterns. The package is designed to be easy to use, even for non-experts.
Installation
You can get timefiller from PyPI:
pip install timefiller
It is also available from conda-forge:
conda install -c conda-forge timefiller
mamba install timefiller
Why this package?
While there are other Python packages for similar tasks, this one is lightweight, with a simple and straightforward API. Its speed may currently be a limitation on large datasets, but it remains useful in many cases.
Basic Usage
The simplest usage example:
from timefiller import TimeSeriesImputer
df = load_your_dataset()
tsi = TimeSeriesImputer()
df_imputed = tsi(X=df)
Advanced Usage
from sklearn.linear_model import LassoCV
from timefiller import PositiveOutput, TimeSeriesImputer
df = load_your_dataset()
tsi = TimeSeriesImputer(estimator=LassoCV(),
                        ar_lags=(1, 2, 3, 6, 24),
                        multivariate_lags=6,
                        preprocessing=PositiveOutput())
df_imputed = tsi(X=df,
                 subset_cols=['col_1', 'col_17'],
                 after='2024-06-14',
                 n_nearest_features=35)
Check out the documentation for details on available options to customize your imputation.
Real data example
Let's evaluate how timefiller performs on a real-world dataset, the PeMS-Bay traffic data. A sensor ID is selected for the experiment, and a contiguous block of missing values is introduced. To increase the complexity, additional Missing At Random (MAR) data is simulated, representing 1% of the entire dataset:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from timefiller import TimeSeriesImputer
from timefiller.utils import add_mar_nan, fetch_pems_bay
# Fetch the time series dataset (e.g., PeMS-Bay traffic data)
df = fetch_pems_bay()
dfm = df.copy() # Create a copy to introduce missing values later
# Randomly select one column (sensor ID) to introduce missing values
k = np.random.randint(df.shape[1])
col = df.columns[k]
i, j = 20_000, 22_500 # Define a range in the dataset to set as NaN (missing values)
dfm.iloc[i:j, k] = np.nan # Introduce missing values in this range for the selected column
# Add more missing values randomly across the dataset (1% of the data)
dfm = add_mar_nan(dfm, ratio=0.01)
# Initialize the TimeSeriesImputer with AR lags and multivariate lags
tsi = TimeSeriesImputer(ar_lags=48, multivariate_lags=6)
# Apply the imputation method on the modified dataframe
df_imputed = tsi(dfm, subset_cols=col, n_nearest_features=75)
# Plot the imputed data alongside the data with missing values
df_imputed[col].rename('imputation').plot(figsize=(10, 3), lw=0.8, c='C0')
dfm[col].rename('data to impute').plot(ax=plt.gca(), lw=0.8, c='C1')
plt.title(f'sensor_id {col}')
plt.legend()
plt.show()
# Plot the imputed data vs the original complete data for comparison
df_imputed[col].rename('imputation').plot(figsize=(10, 3), lw=0.8, c='C0')
df[col].rename('complete data').plot(ax=plt.gca(), lw=0.8, c='C2')
plt.xlim(dfm.index[i], dfm.index[j]) # Focus on the region where data was missing
plt.legend()
plt.show()
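The plots give a visual impression only. To attach a number to the comparison, one option is a mean absolute error restricted to the values that were hidden on purpose. The sketch below is not part of timefiller: masked_mae is a hypothetical helper, and the toy arrays stand in for the real series.

```python
import numpy as np


def masked_mae(truth, imputed, mask):
    """Mean absolute error computed only where `mask` is True.

    Hypothetical helper for illustration; not provided by timefiller.
    """
    truth, imputed, mask = (np.asarray(a) for a in (truth, imputed, mask))
    return float(np.mean(np.abs(truth[mask] - imputed[mask])))


# Toy stand-ins for the complete series, the imputed series,
# and the positions that were set to NaN before imputation.
truth = np.array([1.0, 2.0, 3.0, 4.0])
imputed = np.array([1.0, 2.5, 2.0, 4.0])
hidden = np.array([False, True, True, False])

score = masked_mae(truth, imputed, hidden)
print(score)
```

With the variables from the example above, the same helper could presumably be called as masked_mae(df[col], df_imputed[col], dfm[col].isna() & df[col].notna()), so that only positions with a known ground truth are scored.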
Algorithmic Approach
timefiller relies heavily on scikit-learn for the learning process and uses optimask to create NaN-free train and predict matrices for the estimator.
For each column requiring imputation, the algorithm differentiates between rows with valid data and those with missing values. For rows with missing data, it identifies the available sets of other columns (features). For each set, OptiMask is called to train the chosen sklearn estimator on the largest possible submatrix without any NaNs. This process can become computationally expensive if the available sets of features vary greatly or occur infrequently. In such cases, multiple calls to OptiMask and repeated fitting and predicting using the estimator may be necessary.
One important point to keep in mind is that within a single column, two different rows (timestamps) may be imputed using different estimators (regressors), each trained on distinct sets of columns (covariate features) and samples (rows/timestamps).
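The per-pattern logic described above can be sketched in a few lines of NumPy. This is a simplified illustration, not timefiller's implementation: impute_column is a hypothetical function, ordinary least squares replaces the configurable sklearn estimator, and the training rows are simply those where the target and every available feature are present, whereas timefiller relies on OptiMask to find the largest NaN-free submatrix.

```python
import numpy as np


def impute_column(X, target):
    """Fill NaNs in column `target`, fitting one regressor per availability pattern.

    Hypothetical sketch; returns the filled column and the number of fits.
    """
    X = np.asarray(X, dtype=float)
    out = X[:, target].copy()
    others = [c for c in range(X.shape[1]) if c != target]
    available = ~np.isnan(X[:, others])  # per row: which features are present
    to_fill = np.flatnonzero(np.isnan(X[:, target]))

    # Group the rows to impute by their set of available features.
    patterns = {}
    for r in to_fill:
        patterns.setdefault(tuple(available[r]), []).append(r)

    n_fits = 0
    for pattern, rows in patterns.items():
        feats = [c for c, ok in zip(others, pattern) if ok]
        if not feats:
            continue  # no feature available: leave these rows as NaN
        # Simplification (assumption): keep rows where the target and all
        # features of this pattern are observed. OptiMask is not used here.
        train = ~np.isnan(X[:, target]) & ~np.isnan(X[:, feats]).any(axis=1)
        if train.sum() <= len(feats):
            continue  # too few samples to fit
        A = np.column_stack([X[train][:, feats], np.ones(train.sum())])
        coef, *_ = np.linalg.lstsq(A, X[train, target], rcond=None)
        B = np.column_stack([X[rows][:, feats], np.ones(len(rows))])
        out[rows] = B @ coef
        n_fits += 1
    return out, n_fits


# Synthetic data: y depends linearly on a and b.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([2 * a - b + 1, a, b])
X[5, 0] = np.nan               # y missing, a and b available
X[6, 0] = X[6, 2] = np.nan     # y missing, only a available

filled, n_fits = impute_column(X, target=0)
print(n_fits)  # two patterns, hence two separate fits
```

Rows 5 and 6 belong to the same column but are imputed by different regressors, trained on different feature sets, which is the behaviour the paragraph above warns about. The cost grows with the number of distinct patterns, since each one triggers its own fit.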