A small toolbox for MLOps

TinyShift

TinyShift is a lightweight, sklearn-compatible Python library designed for data drift detection, outlier identification, and MLOps monitoring in production machine learning systems. The library provides modular, easy-to-use tools for detecting when data distributions or model performance change over time, with comprehensive visualization capabilities.

For enterprise-grade solutions, consider NannyML.

Features

  • Data Drift Detection: Categorical and continuous data drift monitoring with multiple distance metrics
  • Outlier Detection: HBOS, PCA-based and SPAD outlier detection algorithms
  • Time Series Analysis: Seasonality decomposition, trend analysis, and forecasting diagnostics

Technologies Used

  • Python 3.10+
  • Scikit-learn 1.3.0+
  • Pandas 2.3.0+
  • NumPy
  • SciPy
  • Statsmodels 0.14.5+
  • Plotly 5.22.0+ (optional, for plotting)

📦 Installation

Install TinyShift using pip:

pip install tinyshift

Development Installation

Clone and install from source:

git clone https://github.com/HeyLucasLeao/tinyshift.git
cd tinyshift
pip install -e .

📖 Quick Start

1. Categorical Data Drift Detection

TinyShift provides sklearn-compatible drift detectors that follow the familiar fit() and score() pattern:

import pandas as pd
from tinyshift.drift import CatDrift

# Load your data
df = pd.read_csv("data.csv")
reference_data = df[df["date"] < '2024-07-01']
analysis_data = df[df["date"] >= '2024-07-01'] 

# Initialize and fit the drift detector
detector = CatDrift(
    freq="D",                    # Daily frequency
    func="chebyshev",           # Distance metric
    drift_limit="auto",         # Automatic threshold detection
    method="expanding"          # Comparison method
)

# Fit on reference data
detector.fit(reference_data)

# Score new data for drift (same fit()/score() pattern as described above)
drift_scores = detector.score(analysis_data)
print(drift_scores)

Available distance metrics for categorical data:

  • "chebyshev": Maximum absolute difference between distributions
  • "jensenshannon": Jensen-Shannon divergence
  • "psi": Population Stability Index
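
To make the three metrics concrete, the sketch below computes each of them directly with NumPy and SciPy on two small, made-up categorical samples. It does not use TinyShift's internals: the helper names category_frequencies and psi are hypothetical, and the PSI formula is the common textbook definition. Note that SciPy's jensenshannon returns the Jensen-Shannon distance, which is the square root of the divergence.

```python
import numpy as np
from scipy.spatial.distance import chebyshev, jensenshannon


def category_frequencies(values, categories):
    """Relative frequency of each category, in a fixed order (hypothetical helper)."""
    values = np.asarray(values)
    counts = np.array([np.sum(values == c) for c in categories], dtype=float)
    return counts / counts.sum()


def psi(p, q, eps=1e-6):
    """Population Stability Index between two discrete distributions (textbook form)."""
    p = np.clip(p, eps, None)  # avoid log(0) for empty categories
    q = np.clip(q, eps, None)
    return float(np.sum((q - p) * np.log(q / p)))


# Made-up reference and analysis samples over the same three categories.
categories = ["a", "b", "c"]
reference = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
analysis = ["a"] * 30 + ["b"] * 30 + ["c"] * 40

p = category_frequencies(reference, categories)
q = category_frequencies(analysis, categories)

print("chebyshev:", chebyshev(p, q))                   # largest per-category gap
print("jensenshannon:", jensenshannon(p, q, base=2))   # JS distance, bounded by 1
print("psi:", psi(p, q))
```

All three values are zero when the two distributions match and grow as they diverge, but they are on different scales, so a threshold tuned for one metric does not carry over to another.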

2. Continuous Data Drift Detection

For numerical features, use the continuous drift detector:

from tinyshift.drift import ConDrift

# Initialize continuous drift detector
detector = ConDrift(
    freq="W",                   # Weekly frequency  
    func="ws",                  # Wasserstein distance
    drift_limit="auto",
    method="expanding"
)

# Fit and score
detector.fit(reference_data)
drift_scores = detector.score(analysis_data)
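
As a rough illustration of what a weekly Wasserstein drift score measures, the sketch below compares each weekly window of synthetic data against a fixed reference sample using scipy.stats.wasserstein_distance. This is an assumption about the general idea, not a description of how ConDrift is implemented; the data, column names, and window logic are all invented for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Synthetic data: a stable reference sample, then eight weeks of analysis data
# whose mean shifts halfway through.
reference = rng.normal(0.0, 1.0, 500)
dates = pd.date_range("2024-07-01", periods=56, freq="D")
analysis = pd.DataFrame({
    "date": np.repeat(dates, 20),
    "value": np.concatenate([
        rng.normal(0.0, 1.0, 28 * 20),   # first four weeks: no drift
        rng.normal(1.5, 1.0, 28 * 20),   # last four weeks: mean shifted by 1.5
    ]),
})

# One Wasserstein distance per weekly window, each compared to the reference sample.
weekly_scores = (
    analysis.set_index("date")["value"]
    .resample("W")
    .apply(lambda window: wasserstein_distance(reference, window))
)
print(weekly_scores)
```

The scores for the shifted weeks should sit close to the size of the mean shift, while the earlier weeks stay near zero; a drift limit would be placed somewhere between the two.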

3. Outlier Detection

TinyShift includes sklearn-compatible outlier detection algorithms:

from tinyshift.outlier import SPAD, HBOS, PCAReconstructionError

# SPAD (Simple Probabilistic Anomaly Detector)
spad = SPAD(plus=True)
spad.fit(X_train)

outlier_scores = spad.decision_function(X_test)
outlier_labels = spad.predict(X_test)

# HBOS (Histogram-Based Outlier Score)
hbos = HBOS(dynamic_bins=True)
hbos.fit(X_train, nbins="fd")
scores = hbos.decision_function(X_test)

# PCA-based outlier detection
pca_detector = PCAReconstructionError()
pca_detector.fit(X_train)
pca_scores = pca_detector.decision_function(X_test)
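
The snippet above assumes X_train and X_test already exist. For intuition about the PCA-based approach, here is a minimal sketch of reconstruction-error scoring built on scikit-learn's PCA with synthetic data. It is not TinyShift's PCAReconstructionError class; the reconstruction_error helper and the 99th-percentile threshold are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Training data lies close to a 2-D plane embedded in 5 dimensions.
mixing = rng.normal(size=(2, 5))
X_train = rng.normal(size=(300, 2)) @ mixing + 0.05 * rng.normal(size=(300, 5))

# Test data: ten inliers from the same plane plus three points far off it.
X_inliers = rng.normal(size=(10, 2)) @ mixing + 0.05 * rng.normal(size=(10, 5))
X_outliers = rng.normal(scale=5.0, size=(3, 5))
X_test = np.vstack([X_inliers, X_outliers])

pca = PCA(n_components=2).fit(X_train)


def reconstruction_error(model, X):
    """Squared distance between each row and its projection onto the PCA subspace."""
    X_hat = model.inverse_transform(model.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1)


# Flag points whose error exceeds the 99th percentile seen during training.
threshold = np.quantile(reconstruction_error(pca, X_train), 0.99)
scores = reconstruction_error(pca, X_test)
labels = scores > threshold
print(scores.round(3))
print(labels)
```

Points that follow the correlation structure learned from the training data reconstruct almost perfectly, so a large error signals a point that breaks that structure.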

4. Time Series Analysis and Diagnostics

TinyShift provides time series analysis capabilities:

# stationarity_analysis is assumed to live in tinyshift.plot; adjust the import if it does not
from tinyshift.plot import seasonal_decompose, stationarity_analysis
from tinyshift.series import trend_significance, permutation_auto_mutual_information

# Seasonal decomposition with multiple periods
seasonal_decompose(
    time_series,
    periods=[7, 365],  # Weekly and yearly patterns
    width=1200,
    height=800
)

# Test for significant trends
trend_result = trend_significance(time_series, alpha=0.05)
print(f"Significant trend: {trend_result}")

# Stationarity analysis
fig = stationarity_analysis(time_series)
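
For readers who want to see what a trend significance test can look like, the sketch below runs a Kendall tau test of the series against time, which is closely related to the Mann-Kendall trend test. Which test trend_significance uses is not stated here, so treat this as an independent illustration; has_monotonic_trend and the synthetic series are hypothetical.

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(7)
t = np.arange(200)

trending = 0.05 * t + rng.normal(0.0, 1.0, t.size)   # upward drift plus noise
flat = rng.normal(0.0, 1.0, t.size)                  # noise only


def has_monotonic_trend(series, alpha=0.05):
    """Return (is_significant, tau, p_value) for a Kendall tau test against time."""
    tau, p_value = kendalltau(np.arange(len(series)), series)
    return bool(p_value < alpha), float(tau), float(p_value)


print(has_monotonic_trend(trending))
print(has_monotonic_trend(flat))
```

The sign of tau gives the direction of the trend, and alpha plays the same role as the alpha=0.05 argument in the call above: it is the significance level below which the trend is reported.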

5. Advanced Modeling Tools

import numpy as np
from tinyshift.modelling import filter_features_by_vif
from tinyshift.stats import bootstrap_bca_interval

# Detect multicollinearity
mask = filter_features_by_vif(X, threshold=5, verbose=True)
selected_columns = X.columns[mask]

# Bootstrap confidence intervals
confidence_interval = bootstrap_bca_interval(
    data,
    statistic=np.mean,
    alpha=0.05,
    n_bootstrap=1000
)
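
As background for the VIF threshold used above, the sketch below computes variance inflation factors with plain NumPy on synthetic data. It relies on the standard identity that the VIF of feature j, 1 / (1 - R_j^2), equals the j-th diagonal entry of the inverse correlation matrix. It is not the implementation of filter_features_by_vif; the function name and the single-pass mask are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1
X = np.column_stack([x1, x2, x3])


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


vif = variance_inflation_factors(X)
keep = vif <= 5   # boolean mask of columns at or below the cut-off
print(vif.round(2), keep)
```

Here both x1 and x3 exceed the cut-off because each is explained by the other. A practical filter would usually drop one feature at a time and recompute, since removing x3 alone brings the VIF of x1 back near 1.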

๐Ÿ“ Project Structure

tinyshift/
├── association_mining/         # Market basket analysis tools
│   ├── analyzer.py             # Transaction pattern analysis
│   └── encoder.py              # Data encoder
├── drift/                      # Data drift detection
│   ├── base.py                 # Base drift detection classes
│   ├── categorical.py          # CatDrift for categorical features
│   └── continuous.py           # ConDrift for numerical features
├── examples/                   # Jupyter notebook examples
│   ├── drift.ipynb             # Drift detection examples
│   ├── outlier.ipynb           # Outlier detection demos
│   ├── series.ipynb            # Time series analysis
│   └── transaction_analyzer.ipynb
├── modelling/                  # ML modeling utilities
│   ├── multicollinearity.py    # VIF-based multicollinearity detection
│   ├── residualizer.py         # Residualizer Feature
│   └── scaler.py               # Custom scaling transformations
├── outlier/                    # Outlier detection algorithms
│   ├── base.py                 # Base outlier detection classes
│   ├── hbos.py                 # Histogram-Based Outlier Score
│   ├── pca.py                  # PCA-based outlier detection
│   └── spad.py                 # Simple Probabilistic Anomaly Detector
├── plot/                       # Visualization capabilities
│   ├── correlation.py          # Correlation analysis plots
│   └── diagnostic.py           # Time series diagnostics plots
├── series/                     # Time series analysis tools
│   ├── forecastability.py      # Forecast quality metrics
│   ├── outlier.py              # Time series outlier detection
│   └── stats.py                # Statistical analysis functions
└── stats/                      # Statistical utilities
    ├── bootstrap_bca.py        # Bootstrap confidence intervals
    ├── statistical_interval.py # Statistical interval estimation
    └── utils.py                # General statistical utilities

tinyshift
├── LICENSE
├── README.md
├── poetry.lock
├── pyproject.toml
├── tinyshift
│   ├── association_mining
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── analyzer.py
│   │   └── encoder.py
│   ├── examples
│   │   ├── outlier.ipynb
│   │   ├── tracker.ipynb
│   │   └── transaction_analyzer.ipynb
│   ├── modelling
│   │   ├── __init__.py
│   │   ├── multicollinearity.py
│   │   ├── residualizer.py
│   │   └── scaler.py
│   ├── outlier
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── hbos.py
│   │   ├── pca.py
│   │   └── spad.py
│   ├── plot
│   │   ├── __init__.py
│   │   ├── correlation.py
│   │   └── plot.py
│   ├── series
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── forecastability.py
│   │   ├── outlier.py
│   │   └── stats.py
│   ├── stats
│   │   ├── __init__.py
│   │   ├── bootstrap_bca.py
│   │   ├── series.py
│   │   ├── statistical_interval.py
│   │   └── utils.py
│   ├── tests
│   │   ├── test.pca.py
│   │   ├── test_hbos.py
│   │   └── test_spad.py
│   └── drift
│       ├── __init__.py
│       ├── base.py
│       ├── categorical.py
│       ├── continuous.py

Development Setup

git clone https://github.com/HeyLucasLeao/tinyshift.git
cd tinyshift
pip install -e ".[all]"

📋 Requirements

  • Python: 3.10+
  • Core Dependencies:
    • pandas (>2.3.0)
    • scikit-learn (>1.3.0)
    • statsmodels (>=0.14.5)
  • Optional Dependencies:
    • plotly (>5.22.0) - for visualization
    • kaleido (<=0.2.1) - for static plot export
    • nbformat (>=5.10.4) - for notebook support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments
