A small toolbox for MLOps

TinyShift

TinyShift is a lightweight, sklearn-compatible Python library designed for data drift detection, outlier identification, and MLOps monitoring in production machine learning systems. The library provides modular, easy-to-use tools for detecting when data distributions or model performance change over time, with comprehensive visualization capabilities.

For enterprise-grade solutions, consider NannyML.

Features

  • Data Drift Detection: Categorical and continuous data drift monitoring with multiple distance metrics
  • Outlier Detection: HBOS, PCA-based and SPAD outlier detection algorithms
  • Time Series Analysis: Seasonality decomposition, trend analysis, and forecasting diagnostics

Technologies Used

  • Python 3.10+
  • Scikit-learn 1.3.0+
  • Pandas 2.3.0+
  • NumPy
  • SciPy
  • Statsmodels 0.14.5+
  • Plotly 5.22.0+ (optional, for plotting)

📦 Installation

Install TinyShift using pip:

pip install tinyshift

Development Installation

Clone and install from source:

git clone https://github.com/HeyLucasLeao/tinyshift.git
cd tinyshift
pip install -e .

📖 Quick Start

1. Categorical Data Drift Detection

TinyShift provides sklearn-compatible drift detectors that follow the familiar fit() and score() pattern:

import pandas as pd
from tinyshift.drift import CatDrift

# Load your data
df = pd.read_csv("data.csv")
reference_data = df[df["date"] < '2024-07-01']
analysis_data = df[df["date"] >= '2024-07-01'] 

# Initialize and fit the drift detector
detector = CatDrift(
    freq="D",                    # Daily frequency
    func="chebyshev",           # Distance metric
    drift_limit="auto",         # Automatic threshold detection
    method="expanding"          # Comparison method
)

# Fit on reference data
detector.fit(reference_data)

# Score new data for drift (same fit()/score() pattern as described above)
drift_scores = detector.score(analysis_data)
print(drift_scores)

Available distance metrics for categorical data:

  • "chebyshev": Maximum absolute difference between distributions
  • "jensenshannon": Jensen-Shannon divergence
  • "psi": Population Stability Index
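To make the three metrics concrete, here is a minimal NumPy sketch that computes each one between two category distributions. It follows the textbook definitions and is not TinyShift's own implementation; the helper names, the `eps` floor, and the example counts are all assumptions made for illustration. Library implementations may differ in details such as log base, or in returning the Jensen-Shannon distance (the square root) rather than the divergence.

```python
import numpy as np


def to_distribution(counts, eps=1e-6):
    """Turn raw category counts into probabilities, flooring zeros at eps."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = np.clip(p, eps, None)  # avoids log(0) and division by zero below
    return p / p.sum()


def chebyshev(p, q):
    """Largest absolute difference between matching category shares."""
    return float(np.max(np.abs(p - q)))


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded by [0, 1]."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log2(p / m))
    kl_qm = np.sum(q * np.log2(q / m))
    return float(0.5 * kl_pm + 0.5 * kl_qm)


def psi(p, q):
    """Population Stability Index: sum of (q - p) * ln(q / p)."""
    return float(np.sum((q - p) * np.log(q / p)))


# Hypothetical category counts for a reference and an analysis period.
reference = to_distribution([50, 30, 20])
analysis = to_distribution([20, 30, 50])

scores = {
    "chebyshev": chebyshev(reference, analysis),
    "jensenshannon": jensen_shannon_divergence(reference, analysis),
    "psi": psi(reference, analysis),
}
print(scores)
```

All three scores are zero when the two distributions are identical and grow as they diverge. Chebyshev reacts only to the single largest change, while PSI and Jensen-Shannon accumulate changes across every category.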

2. Continuous Data Drift Detection

For numerical features, use the continuous drift detector:

from tinyshift.drift import ConDrift

# Initialize continuous drift detector
detector = ConDrift(
    freq="W",                   # Weekly frequency  
    func="ws",                  # Wasserstein distance
    drift_limit="auto",
    method="expanding"
)

# Fit and score
detector.fit(reference_data)
drift_scores = detector.score(analysis_data)
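As a rough illustration of what a weekly Wasserstein drift score measures, the sketch below compares each calendar week of analysis data against a fixed reference sample using `scipy.stats.wasserstein_distance`. This is not how ConDrift is implemented internally (the source does not say); the synthetic data, column names, and the fixed-reference comparison are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Hypothetical data: a stable reference sample and four weeks of analysis
# data whose mean shifts upward in the last two weeks.
reference = rng.normal(loc=0.0, scale=1.0, size=2000)
dates = pd.date_range("2024-07-01", periods=28, freq="D")
analysis = pd.DataFrame(
    {
        "date": dates.repeat(50),  # 50 observations per day
        "value": np.concatenate(
            [rng.normal(0.0, 1.0, 14 * 50), rng.normal(1.5, 1.0, 14 * 50)]
        ),
    }
)

# One Wasserstein distance per calendar week, each measured against the reference.
weekly_scores = (
    analysis.set_index("date")["value"]
    .resample("W")
    .apply(lambda week: wasserstein_distance(reference, week.to_numpy()))
)
print(weekly_scores)
```

The first two weekly scores stay close to zero, while the last two rise to roughly the size of the mean shift. A drift limit is then a cut-off on this series.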

3. Outlier Detection

TinyShift includes sklearn-compatible outlier detection algorithms:

from tinyshift.outlier import SPAD, HBOS, PCAReconstructionError

# SPAD (Simple Probabilistic Anomaly Detector)
spad = SPAD(plus=True)
spad.fit(X_train)

outlier_scores = spad.decision_function(X_test)
outlier_labels = spad.predict(X_test)

# HBOS (Histogram-Based Outlier Score)
hbos = HBOS(dynamic_bins=True)
hbos.fit(X_train, nbins="fd")
scores = hbos.decision_function(X_test)

# PCA-based outlier detection
pca_detector = PCAReconstructionError()
pca_detector.fit(X_train)
pca_scores = pca_detector.decision_function(X_test)
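The idea behind PCA-based outlier scoring can be sketched with scikit-learn alone: project each point onto the leading principal components, map it back, and use the squared reconstruction error as the score. The snippet below is a minimal sketch under assumptions (synthetic data, two components, a 99th-percentile threshold); it does not reproduce the internals of `PCAReconstructionError`, which the source does not describe.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical training data lying close to a 2-D plane inside 5-D space.
mixing = rng.normal(size=(2, 5))
X_train = rng.normal(size=(500, 2)) @ mixing + 0.05 * rng.normal(size=(500, 5))

# Test set: 20 points from the same process plus 5 points far from the plane.
X_inliers = rng.normal(size=(20, 2)) @ mixing + 0.05 * rng.normal(size=(20, 5))
X_outliers = rng.normal(scale=5.0, size=(5, 5))
X_test = np.vstack([X_inliers, X_outliers])

pca = PCA(n_components=2).fit(X_train)


def reconstruction_error(model, X):
    """Squared distance between each row and its projection onto the PCA subspace."""
    X_hat = model.inverse_transform(model.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1)


# Flag points whose error exceeds the 99th percentile seen in training.
threshold = np.quantile(reconstruction_error(pca, X_train), 0.99)
test_errors = reconstruction_error(pca, X_test)
is_outlier = test_errors > threshold
print(is_outlier.sum(), "of", len(X_test), "test points flagged")
```

Points that follow the correlation structure learned from the training data reconstruct almost perfectly. Points that break it leave a large residual, which is why this score catches outliers that look unremarkable on each feature individually.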

4. Time Series Analysis and Diagnostics

TinyShift provides time series analysis capabilities:

from tinyshift.plot import seasonal_decompose
from tinyshift.series import trend_significance, permutation_auto_mutual_information

# Seasonal decomposition with multiple periods
seasonal_decompose(
    time_series, 
    periods=[7, 365],  # Weekly and yearly patterns
    width=1200, 
    height=800
)

# Test for significant trends
trend_result = trend_significance(time_series, alpha=0.05)
print(f"Significant trend: {trend_result}")

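# Stationarity analysis
# NOTE: stationarity_analysis is not imported above. Because it returns a
# figure, it is assumed here to live in tinyshift.plot; check your installed version.
fig = stationarity_analysis(time_series)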
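For intuition about what a trend significance test at `alpha=0.05` decides, the sketch below fits an ordinary least squares line against time with `scipy.stats.linregress` and checks whether the slope's p-value falls below alpha. The test used by `trend_significance` is not stated in the source, so this is a stand-in technique; the helper name `has_linear_trend` and the synthetic series are hypothetical.

```python
import numpy as np
from scipy.stats import linregress


def has_linear_trend(values, alpha=0.05):
    """Return True when the OLS slope against time differs from zero at level alpha."""
    values = np.asarray(values, dtype=float)
    time_index = np.arange(len(values))
    fit = linregress(time_index, values)
    return bool(fit.pvalue < alpha)


rng = np.random.default_rng(7)
t = np.arange(365)
weekly_cycle = np.sin(2 * np.pi * t / 7)

# Hypothetical daily series: upward drift plus weekly seasonality and noise.
drifting = 0.02 * t + weekly_cycle + rng.normal(scale=0.5, size=t.size)
print(has_linear_trend(drifting))
```

The OLS p-value assumes independent residuals. Strongly autocorrelated series can therefore look more significant than they are, which is one reason to check stationarity alongside any trend test.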

5. Advanced Modeling Tools

import numpy as np

from tinyshift.modelling import filter_features_by_vif
from tinyshift.stats import bootstrap_bca_interval

# Detect multicollinearity
mask = filter_features_by_vif(X, threshold=5, verbose=True)
X.columns[mask]

# Bootstrap confidence intervals
confidence_interval = bootstrap_bca_interval(
    data, 
    statistic=np.mean, 
    alpha=0.05, 
    n_bootstrap=1000
)
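The variance inflation factor behind the multicollinearity check above can be computed directly: for column j, VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry of the inverse correlation matrix. The NumPy sketch below uses that identity on synthetic data. It only illustrates the statistic; whether `filter_features_by_vif` removes features iteratively, and whether its mask marks columns to keep or to drop, is not stated in the source. The function name and data here are hypothetical.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + b + 0.1 * rng.normal(size=1000)  # nearly a linear combination of a and b
d = rng.normal(size=1000)                # unrelated to the other columns

features = np.column_stack([a, b, c, d])
vif = variance_inflation_factors(features)
below_cutoff = vif <= 5  # same cut-off value as the threshold used above
print(vif.round(1), below_cutoff)
```

All three entangled columns exceed the cut-off, not only the derived one, because each can be predicted from the other two. In practice this is why VIF filtering is usually applied one feature at a time, recomputing after each removal.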

📁 Project Structure

tinyshift/
├── association_mining/         # Market basket analysis tools
│   ├── analyzer.py             # Transaction pattern analysis
│   └── encoder.py              # Data encoder
├── drift/                      # Data drift detection
│   ├── base.py                 # Base drift detection classes
│   ├── categorical.py          # CatDrift for categorical features
│   └── continuous.py           # ConDrift for numerical features
├── examples/                   # Jupyter notebook examples
│   ├── drift.ipynb             # Drift detection examples
│   ├── outlier.ipynb           # Outlier detection demos
│   ├── series.ipynb            # Time series analysis
│   └── transaction_analyzer.ipynb
├── modelling/                  # ML modeling utilities
│   ├── multicollinearity.py    # VIF-based multicollinearity detection
│   ├── residualizer.py         # Residualizer Feature
│   └── scaler.py               # Custom scaling transformations
├── outlier/                    # Outlier detection algorithms
│   ├── base.py                 # Base outlier detection classes
│   ├── hbos.py                 # Histogram-Based Outlier Score
│   ├── pca.py                  # PCA-based outlier detection
│   └── spad.py                 # Simple Probabilistic Anomaly Detector
├── plot/                       # Visualization capabilities
│   ├── correlation.py          # Correlation analysis plots
│   └── diagnostic.py           # Time series diagnostics plots
├── series/                     # Time series analysis tools
│   ├── forecastability.py      # Forecast quality metrics
│   ├── outlier.py              # Time series outlier detection
│   └── stats.py                # Statistical analysis functions
└── stats/                      # Statistical utilities
    ├── bootstrap_bca.py        # Bootstrap confidence intervals
    ├── statistical_interval.py # Statistical interval estimation
    └── utils.py                # General statistical utilities
tinyshift
├── LICENSE
├── README.md
├── poetry.lock
├── pyproject.toml
├── tinyshift
│   ├── association_mining
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── analyzer.py
│   │   └── encoder.py
│   ├── examples
│   │   ├── outlier.ipynb
│   │   ├── tracker.ipynb
│   │   └── transaction_analyzer.ipynb
│   ├── modelling
│   │   ├── __init__.py
│   │   ├── multicollinearity.py
│   │   ├── residualizer.py
│   │   └── scaler.py
│   ├── outlier
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── hbos.py
│   │   ├── pca.py
│   │   └── spad.py
│   ├── plot
│   │   ├── __init__.py
│   │   ├── correlation.py
│   │   └── plot.py
│   ├── series
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── forecastability.py
│   │   ├── outlier.py
│   │   └── stats.py
│   ├── stats
│   │   ├── __init__.py
│   │   ├── bootstrap_bca.py
│   │   ├── series.py
│   │   ├── statistical_interval.py
│   │   └── utils.py
│   ├── tests
│   │   ├── test.pca.py
│   │   ├── test_hbos.py
│   │   └── test_spad.py
│   └── tracker
│       ├── __init__.py
│       ├── anomaly.py
│       ├── base.py
│       ├── categorical.py
│       ├── continuous.py
│       └── performance.py

Development Setup

git clone https://github.com/HeyLucasLeao/tinyshift.git
cd tinyshift
pip install -e ".[all]"

📋 Requirements

  • Python: 3.10+
  • Core Dependencies:
    • pandas (>2.3.0)
    • scikit-learn (>1.3.0)
    • statsmodels (>=0.14.5)
  • Optional Dependencies:
    • plotly (>5.22.0) - for visualization
    • kaleido (<=0.2.1) - for static plot export
    • nbformat (>=5.10.4) - for notebook support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

