A small toolbox for MLOps

TinyShift

TinyShift is a lightweight, sklearn-compatible Python library designed for data drift detection, outlier identification, and MLOps monitoring in production machine learning systems. The library provides modular, easy-to-use tools for detecting when data distributions or model performance change over time, with comprehensive visualization capabilities.

For enterprise-grade solutions, consider NannyML.

Features

  • Data Drift Detection: Categorical and continuous data drift monitoring with multiple distance metrics
  • Outlier Detection: HBOS, PCA-based and SPAD outlier detection algorithms
  • Time Series Analysis: Seasonality decomposition, trend analysis, forecasting diagnostics, and forecast stabilization
  • Forecast Stability: Metrics and interpolation methods for stable forecasting

Technologies Used

  • Python 3.10+
  • Scikit-learn 1.3.0+
  • Pandas 2.3.0+
  • NumPy
  • SciPy
  • Statsmodels 0.14.5+
  • Plotly 5.22.0+ (optional, for plotting)

📦 Installation

Install TinyShift using pip:

pip install tinyshift

Development Installation

Clone and install from source:

git clone https://github.com/HeyLucasLeao/tinyshift.git
cd tinyshift
pip install -e .

📖 Quick Start

1. Categorical Data Drift Detection

TinyShift provides sklearn-compatible drift detectors that follow the familiar fit() and predict() pattern:

import pandas as pd
from tinyshift.drift import CatDrift

# Load your data
df = pd.read_csv("data.csv")
reference_data = df[df["date"] < '2024-07-01']
analysis_data = df[df["date"] >= '2024-07-01'] 

# Initialize and fit the drift detector
detector = CatDrift(
    freq="D",                    # Daily frequency
    func="chebyshev",           # Distance metric
    drift_limit="auto",         # Automatic threshold detection
    method="expanding"          # Comparison method
)

# Fit on reference data
detector.fit(reference_data)

# Score new data for drift
drift_scores = detector.predict(analysis_data)
print(drift_scores)

Available distance metrics for categorical data:

  • "chebyshev": Maximum absolute difference between distributions
  • "jensenshannon": Jensen-Shannon divergence
  • "psi": Population Stability Index
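To make the three metrics concrete, here is a minimal NumPy sketch that computes them on two small categorical samples. It is an illustration of the textbook definitions, not TinyShift's implementation; the helper names (to_distribution, chebyshev, jensen_shannon, psi) and the eps guard are assumptions made for this example.

```python
import numpy as np

def to_distribution(values, categories):
    """Relative frequency of each category, in the fixed order given."""
    values = np.asarray(values)
    counts = np.array([(values == c).sum() for c in categories], dtype=float)
    return counts / counts.sum()

def chebyshev(p, q):
    """Largest absolute gap between two probability vectors."""
    return float(np.max(np.abs(p - q)))

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

def psi(p, q, eps=1e-6):
    """Population Stability Index; eps guards against empty categories."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum((q - p) * np.log(q / p)))

# Hypothetical reference and analysis samples
categories = ["a", "b", "c"]
reference = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
analysis = ["a"] * 30 + ["b"] * 30 + ["c"] * 40

p = to_distribution(reference, categories)
q = to_distribution(analysis, categories)
print(chebyshev(p, q), jensen_shannon(p, q), psi(p, q))
```

All three return 0 for identical distributions and grow as the analysis distribution moves away from the reference; they differ in how strongly they weight changes in rare categories.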

2. Continuous Data Drift Detection

For numerical features, use the continuous drift detector:

from tinyshift.drift import ConDrift

# Initialize continuous drift detector
detector = ConDrift(
    freq="W",                   # Weekly frequency  
    func="ws",                  # Wasserstein distance
    drift_limit="auto",
    method="expanding"
)

# Fit and score
detector.fit(reference_data)
drift_predicts = detector.predict(analysis_data)
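For intuition about the "ws" metric, the sketch below computes the one-dimensional Wasserstein distance between two samples as the area between their empirical CDFs. It is written in plain NumPy as an illustration under that standard definition; the function name wasserstein_1d is an assumption and this is not TinyShift's internal code.

```python
import numpy as np

def wasserstein_1d(x, y):
    """1-D Wasserstein distance: area between the two empirical CDFs."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    all_values = np.sort(np.concatenate([x, y]))
    deltas = np.diff(all_values)  # width of each interval between observed values
    # Empirical CDF of each sample, evaluated at the left end of every interval
    cdf_x = np.searchsorted(x, all_values[:-1], side="right") / x.size
    cdf_y = np.searchsorted(y, all_values[:-1], side="right") / y.size
    return float(np.sum(np.abs(cdf_x - cdf_y) * deltas))

# Hypothetical data: a reference sample and the same sample shifted by 0.5
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=1000)
shifted = reference + 0.5

print(wasserstein_1d(reference, reference))  # identical samples
print(wasserstein_1d(reference, shifted))    # pure location shift
```

A pure shift of the data by a constant moves the distance by that same constant, which makes the metric easy to interpret in the units of the feature.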

3. Outlier Detection

TinyShift includes sklearn-compatible outlier detection algorithms:

from tinyshift.outlier import SPAD, HBOS, PCAReconstructionError

# SPAD (Simple Probabilistic Anomaly Detector)
spad = SPAD(plus=True)
spad.fit(X_train)

outlier_scores = spad.decision_function(X_test)
outlier_labels = spad.predict(X_test)

# HBOS (Histogram-Based Outlier Score)
hbos = HBOS(dynamic_bins=True)
hbos.fit(X_train, nbins="fd")
scores = hbos.predict(X_test)

# PCA-based outlier detection
pca_detector = PCAReconstructionError()
pca_detector.fit(X_train)
pca_scores = pca_detector.predict(X_test)
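As a rough picture of how a histogram-based score behaves, the following NumPy sketch builds one histogram per feature on the training data and sums the negative log densities of each test point. It is a simplified, HBOS-style illustration with a fixed bin count; the function name hbos_style_scores, the n_bins default and the floor density are assumptions, and TinyShift's HBOS class may bin and normalise differently.

```python
import numpy as np

def hbos_style_scores(X_train, X_test, n_bins=10, floor=1e-12):
    """Sum of per-feature negative log densities; higher means more unusual."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    scores = np.zeros(X_test.shape[0])
    for j in range(X_train.shape[1]):
        density, edges = np.histogram(X_train[:, j], bins=n_bins, density=True)
        col = X_test[:, j]
        # Map each test value to its training bin (clipped to a valid index)
        idx = np.clip(np.searchsorted(edges, col, side="right") - 1, 0, n_bins - 1)
        d = density[idx]
        # Values outside the training range receive the floor density
        d = np.where((col < edges[0]) | (col > edges[-1]), floor, d)
        # Empty bins are also lifted to the floor before taking the log
        scores += -np.log(np.maximum(d, floor))
    return scores

# Hypothetical data: Gaussian training set, one typical and one extreme test point
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))
X_test = np.array([[0.0, 0.0], [10.0, 10.0]])

scores = hbos_style_scores(X_train, X_test)
print(scores)
```

Because each feature is scored independently, this kind of detector is cheap to fit but does not see outliers that are only unusual in combination.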

4. Time Series Analysis and Diagnostics

TinyShift provides comprehensive time series analysis capabilities:

from tinyshift.plot import seasonal_decompose
from tinyshift.series import (
    trend_significance, 
    foreca, 
    sample_entropy,
    permutation_entropy,
    theoretical_limit,
    hurst_exponent,
    hampel_filter,
    bollinger_bands
)

seasonal_decompose(
    time_series, 
    periods=[7, 365],  # Weekly and yearly patterns
    width=1200, 
    height=800
)

# Test for significant trends
r_squared, p_value = trend_significance(time_series)

# Assess forecastability
forecastability = foreca(time_series)
print(f"Forecastability (Omega): {forecastability}")

# Measure complexity and regularity
complexity = sample_entropy(time_series, m=2, tolerance=0.2)
print(f"Sample Entropy: {complexity}")

# Measure ordinal complexity
perm_entropy = permutation_entropy(time_series, m=3, delay=1, normalize=True)
print(f"Permutation Entropy: {perm_entropy}")

# Calculate theoretical predictability limit
theo_limit = theoretical_limit(time_series, m=3, delay=1)
print(f"Theoretical Limit (Πmax): {theo_limit}")

# Detect long-term memory
hurst, p_value = hurst_exponent(time_series)
print(f"Hurst Exponent: {hurst}, P-value: {p_value}")

# Outlier detection in time series
hampel_outliers = hampel_filter(time_series, window_size=5)
bollinger_outliers = bollinger_bands(time_series, window_size=20)

# Plot lag analysis with PAMI (Permutation Auto-Mutual Information)
from tinyshift.plot import pami
pami(time_series, nlags=20, m=3, delay=1, normalize=False)
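To show what the m, delay and normalize arguments of a permutation entropy mean, here is a small self-contained sketch of the standard ordinal-pattern definition. The function name perm_entropy_sketch is hypothetical and the code is an illustration only; TinyShift's permutation_entropy may handle ties and edge cases differently.

```python
import math
from collections import Counter

import numpy as np

def perm_entropy_sketch(x, m=3, delay=1, normalize=True):
    """Shannon entropy (bits) of the ordinal patterns of length m in x."""
    x = np.asarray(x, dtype=float)
    n_windows = x.size - (m - 1) * delay
    span = (m - 1) * delay + 1
    # Each window is reduced to the rank order of its m values
    patterns = Counter(
        tuple(np.argsort(x[i:i + span:delay], kind="stable"))
        for i in range(n_windows)
    )
    probs = np.array(list(patterns.values()), dtype=float) / n_windows
    entropy = -np.sum(probs * np.log2(probs))
    if normalize:
        entropy /= math.log2(math.factorial(m))  # maximum entropy for m! patterns
    return float(entropy)

# Hypothetical series: a perfectly regular ramp and white noise
ramp = np.arange(100)
noise = np.random.default_rng(0).normal(size=2000)

print(perm_entropy_sketch(ramp))
print(perm_entropy_sketch(noise))
```

With normalisation, values near 0 indicate a highly regular series and values near 1 indicate a series whose ordinal patterns are close to uniformly random.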

5. Forecast Stability and Interpolation

TinyShift includes forecast stability metrics and interpolation methods:

from tinyshift.series import (
    macv, mach,           # Mean Absolute Change metrics
    mascv, masch,         # Mean Absolute Scaled Change metrics
    rmsscv, rmssch,       # Root Mean Squared Scaled Change metrics
    vi, hpi, hfi          # Interpolation methods
)

# Calculate forecast stability metrics
vertical_stability = macv(y_hat, y_hat_t_minus_1)
horizontal_stability = mach(y_hat) 

# Scaled stability metrics
scaled_v_stability = mascv(y_train, y_hat, y_hat_t_minus_1, seasonality=12)
scaled_h_stability = masch(y_train, y_hat, seasonality=12)

# Apply forecast stabilization techniques
# Vertical Interpolation
stable_forecast = vi(y_hat, anchor, w_s=0.3)

# Horizontal Partial Interpolation
smooth_forecast = hpi(y_hat, w_s=0.4)

# Horizontal Full Interpolation
fully_stable_forecast = hfi(y_hat, w_s=0.5)
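The sketch below illustrates one common formulation of these ideas: vertical change compares forecasts for the same periods made at consecutive origins, horizontal change compares adjacent horizons within one forecast, and the interpolation methods blend each forecast with an anchor using a weight w_s. All function names here are hypothetical and the formulas are an assumption about the usual definitions, so TinyShift's macv, mach, vi, hpi and hfi may differ in detail.

```python
import numpy as np

def mean_abs_change_vertical(y_hat, y_hat_prev):
    """Average revision between two forecasts of the same target periods."""
    diff = np.asarray(y_hat, dtype=float) - np.asarray(y_hat_prev, dtype=float)
    return float(np.mean(np.abs(diff)))

def mean_abs_change_horizontal(y_hat):
    """Average jump between adjacent horizons of a single forecast."""
    return float(np.mean(np.abs(np.diff(np.asarray(y_hat, dtype=float)))))

def vertical_interpolation(y_hat, anchor, w_s):
    """Blend the new forecast with the previous-origin forecast (the anchor)."""
    y = np.asarray(y_hat, dtype=float)
    return w_s * np.asarray(anchor, dtype=float) + (1 - w_s) * y

def horizontal_partial(y_hat, w_s):
    """Blend each horizon with the ORIGINAL forecast of the previous horizon."""
    y = np.asarray(y_hat, dtype=float)
    out = y.copy()
    out[1:] = w_s * y[:-1] + (1 - w_s) * y[1:]
    return out

def horizontal_full(y_hat, w_s):
    """Blend each horizon with the already STABILISED previous horizon."""
    y = np.asarray(y_hat, dtype=float)
    out = y.copy()
    for h in range(1, y.size):
        out[h] = w_s * out[h - 1] + (1 - w_s) * y[h]
    return out

# Hypothetical forecasts from the current and the previous origin
y_hat = np.array([10.0, 12.0, 11.0, 15.0])
y_hat_prev = np.array([9.0, 12.0, 13.0, 14.0])

print(mean_abs_change_vertical(y_hat, y_hat_prev))
print(mean_abs_change_horizontal(y_hat))
print(vertical_interpolation(y_hat, y_hat_prev, w_s=0.5))
print(horizontal_partial(y_hat, w_s=0.5))
print(horizontal_full(y_hat, w_s=0.5))
```

A larger w_s pulls the result closer to the anchor, trading responsiveness to new information for stability.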

6. Advanced Modeling Tools

import numpy as np

# FeatureResidualizer is assumed to be exported by tinyshift.modelling
# (the project structure lists modelling/residualizer.py)
from tinyshift.modelling import FeatureResidualizer, filter_features_by_vif
from tinyshift.stats import bootstrap_bca_interval

# Residualizer: fit on the training columns only
residualizer = FeatureResidualizer()
residualizer.fit(X_train[preprocess_columns], corrcoef=0.70)

# Train
X_train = X_train.astype({x: float for x in preprocess_columns})
X_train.loc[:, preprocess_columns] = residualizer.transform(X_train[preprocess_columns])

# Test: reuse the residualizer fitted on the training data
X_test = X_test.astype({x: float for x in preprocess_columns})
X_test.loc[:, preprocess_columns] = residualizer.transform(X_test[preprocess_columns])

# Detect multicollinearity and keep the same columns in both sets
mask = filter_features_by_vif(X_train, threshold=5, verbose=True)
X_train = X_train.loc[:, mask]
X_test = X_test.loc[:, mask]

# Bootstrap confidence intervals
confidence_interval = bootstrap_bca_interval(
    data,
    statistic=np.mean,
    alpha=0.05,
    n_bootstrap=1000
)
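For readers unfamiliar with VIF-based filtering, the sketch below shows the general idea in plain NumPy: the variance inflation factor of each standardised feature is the corresponding diagonal entry of the inverse correlation matrix, and features are dropped one at a time until every remaining VIF is at or below the threshold. The helper names vif_values and vif_mask are hypothetical, and this is an illustration of the technique rather than the logic inside filter_features_by_vif.

```python
import numpy as np

def vif_values(X):
    """VIF of each column: diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def vif_mask(X, threshold=5.0):
    """Boolean mask of columns kept after iteratively dropping the worst VIF."""
    X = np.asarray(X, dtype=float)
    keep = np.ones(X.shape[1], dtype=bool)
    while keep.sum() > 1:
        vifs = vif_values(X[:, keep])
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            break
        # Translate the position within the kept columns back to the full index
        keep[np.flatnonzero(keep)[worst]] = False
    return keep

# Hypothetical data: column 2 is almost an exact sum of columns 0 and 1
rng = np.random.default_rng(0)
a, b, d = rng.normal(size=(3, 500))
c = a + b + 0.01 * rng.normal(size=500)
X = np.column_stack([a, b, c, d])

mask = vif_mask(X, threshold=5.0)
print(mask)
```

Dropping one feature at a time matters: removing a single member of a collinear group usually brings the VIFs of the remaining members back under the threshold.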

📁 Project Structure

tinyshift/
├── association_mining/            # Market basket analysis tools
│   ├── README.md                  # Module documentation
│   ├── analyzer.py                # Transaction pattern analysis
│   └── encoder.py                 # Data encoder
├── drift/                         # Data drift detection
│   ├── README.md                  # Module documentation
│   ├── base.py                    # Base drift detection classes
│   ├── categorical.py             # CatDrift for categorical features
│   └── continuous.py              # ConDrift for numerical features
├── examples/                      # Jupyter notebook examples
│   ├── decomp_mstl_ml.ipynb       # MSTL decomposition and ML examples
│   ├── drift.ipynb                # Drift detection examples
│   ├── outlier.ipynb              # Outlier detection demos
│   ├── series.ipynb               # Time series analysis
│   ├── transaction_analyzer.ipynb # Transaction analysis examples
│   └── ts_diagnostics.ipynb       # Time series diagnostics
├── modelling/                     # ML modeling utilities
│   ├── README.md                  # Module documentation
│   ├── multicollinearity.py       # VIF-based multicollinearity detection
│   ├── residualizer.py            # Residualizer Feature
│   └── scaler.py                  # Custom scaling transformations
├── outlier/                       # Outlier detection algorithms
│   ├── README.md                  # Module documentation
│   ├── base.py                    # Base outlier detection classes
│   ├── hbos.py                    # Histogram-Based Outlier Score
│   ├── pca.py                     # PCA-based outlier detection
│   └── spad.py                    # Simple Probabilistic Anomaly Detector
├── plot/                          # Visualization capabilities
│   ├── README.md                  # Module documentation
│   ├── correlation.py             # Correlation analysis plots
│   └── diagnostic.py              # Time series diagnostics plots
├── series/                        # Time series analysis tools
│   ├── README.md                  # Module documentation
│   ├── forecastability.py         # Forecast quality and complexity metrics
│   ├── interpolation.py           # Forecast stabilization methods
│   ├── outlier.py                 # Time series outlier detection
│   ├── stability.py               # Forecast stability metrics
│   └── stats.py                   # Statistical analysis functions
└── stats/                         # Statistical utilities
    ├── bootstrap_bca.py           # Bootstrap confidence intervals
    ├── statistical_interval.py    # Statistical interval estimation
    └── utils.py                   # General statistical utilities

Development Setup

git clone https://github.com/HeyLucasLeao/tinyshift.git
cd tinyshift
pip install -e ".[all]"

📋 Requirements

  • Python: 3.10+
  • Core Dependencies:
    • pandas (>2.3.0)
    • scikit-learn (>1.3.0)
    • statsmodels (>=0.14.5)
  • Optional Dependencies:
    • plotly (>5.22.0) - for visualization
    • kaleido (<=0.2.1) - for static plot export
    • nbformat (>=5.10.4) - for notebook support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments
