
A library for normalizing streams of incoming data, particularly focused on improving sequential experimentation.


Online Normalization (onorm)


onorm provides online (incremental) normalization algorithms for streaming data. These normalizers update their statistics incrementally without storing historical data, making them suitable for large-scale or real-time applications.
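To make "updating statistics incrementally without storing historical data" concrete, here is a minimal sketch of a running mean and standard deviation using Welford's algorithm. It is an illustration of the general technique only: the `RunningStats` class and its attributes are hypothetical names, and nothing here is taken from onorm's implementation.

```python
import math


class RunningStats:
    """Running mean and variance of a scalar stream (Welford's algorithm).

    Hypothetical helper for illustration; not part of onorm.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Only three numbers are kept, no matter how long the stream is.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation; 0.0 until two values have been seen.
        return math.sqrt(self.m2 / self.count) if self.count > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

print(stats.mean, stats.std)
```

Each update costs constant time and memory, which is the property that makes this style of estimator usable on unbounded streams.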

Installation

pip install onorm

Features

  • StandardScaler: Online standardization (z-score normalization)
  • MinMaxScaler: Online min-max scaling to [0, 1]
  • Winsorizer: Online outlier clipping using quantiles
  • MultivariateNormalizer: Online decorrelation and standardization
  • Pipeline: Chain multiple normalizers sequentially

All normalizers support:

  • Incremental updates via partial_fit()
  • Transformation via transform()
  • Combined operation via partial_fit_transform()
  • State reset via reset()
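The four methods listed above can be pictured with a toy scalar min-max scaler. This is a sketch under stated assumptions, not onorm's code: `ToyMinMaxScaler` is a hypothetical class, it handles single floats rather than arrays, and having `partial_fit()` return `self` is a choice made for this example only.

```python
class ToyMinMaxScaler:
    """Hypothetical scalar min-max scaler; not onorm's implementation."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Forget everything seen so far.
        self.lo = float("inf")
        self.hi = float("-inf")

    def partial_fit(self, x):
        # Update the running minimum and maximum with one observation.
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)
        return self  # assumption of this sketch, to allow chaining

    def transform(self, x):
        # Scale with the statistics seen so far; state is not modified.
        span = self.hi - self.lo
        if not span > 0:  # fewer than two distinct values seen so far
            return 0.0
        return (x - self.lo) / span

    def partial_fit_transform(self, x):
        # Update first, then scale the same observation.
        return self.partial_fit(x).transform(x)


scaler = ToyMinMaxScaler()
scaled = [scaler.partial_fit_transform(x) for x in [5.0, 10.0, 0.0, 7.5]]
print(scaled)
```

Note that the scaled value of an observation depends on what has been seen up to that point, so early outputs of any online normalizer are less reliable than later ones.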

Usage Example

Let's compare online normalization with and without outlier handling. We'll process a stream of data points and track how well each approach maintains normalized statistics.

import numpy as np
import pandas as pd
from numpy.random import default_rng
from onorm import Pipeline, StandardScaler, Winsorizer
from plotnine import aes, geom_line, geom_vline, ggplot, labs, theme, theme_minimal

rng = default_rng(2024)

# Generate streaming data with outliers
n_samples = 1000
n_dim = 5

X = rng.normal(loc=10, scale=1, size=(n_samples, n_dim))

# Add some outliers
outlier_indices = [100, 250, 500, 750]
for idx in outlier_indices:
    X[idx] = rng.uniform(-100, 100, size=n_dim)

print(f"Generated {n_samples} samples with {len(outlier_indices)} outliers")
# Output: Generated 1000 samples with 4 outliers

# Approach 1: StandardScaler only (sensitive to outliers)
scaler_only = StandardScaler(n_dim=n_dim)

# Approach 2: Pipeline with Winsorizer + StandardScaler (robust to outliers)
pipeline = Pipeline([Winsorizer(n_dim=n_dim, clip_q=(0.05, 0.95)), StandardScaler(n_dim=n_dim)])

# Track mean estimates over time
scaler_means = []
pipeline_means = []

for x in X:
    scaler_only.partial_fit(x)
    pipeline.partial_fit(x)

    scaler_means.append(scaler_only.mean[0])
    pipeline_means.append(pipeline.normalizers[1].mean[0])

print(f"StandardScaler final mean: {scaler_only.mean[0]:.2f}")
print(f"Pipeline final mean: {pipeline.normalizers[1].mean[0]:.2f}")
# Output:
# StandardScaler final mean: 9.84
# Pipeline final mean: 10.02
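The gap between the two final means comes from clipping extreme values before they reach the mean estimate. The sketch below shows the effect in isolation with plain Python. It uses fixed clip bounds as a simplifying assumption (the Winsorizer is described above as clipping at quantiles), and `running_mean` and `clip` are hypothetical helpers, not onorm functions.

```python
def running_mean(values):
    """Return the list of running means of `values`."""
    means, total = [], 0.0
    for i, v in enumerate(values, start=1):
        total += v
        means.append(total / i)
    return means


def clip(value, lo, hi):
    """Limit `value` to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


stream = [10.0] * 9 + [100.0]  # nine typical values, then one outlier

raw = running_mean(stream)
# Fixed bounds of 8 and 12 are an assumption made for this illustration.
clipped = running_mean([clip(v, 8.0, 12.0) for v in stream])

print(f"raw final mean:     {raw[-1]:.2f}")
print(f"clipped final mean: {clipped[-1]:.2f}")
```

A single outlier moves the unclipped mean from 10 to 19, while the clipped mean only moves to 10.2.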

Visualization

The plot shows how the estimated mean of the first feature evolves as data streams in. Vertical red lines mark the sample indices where outliers were injected. The pipeline with winsorization keeps its estimate stable at those points, while the standalone StandardScaler is pulled further by the extreme values.

# Prepare data for plotting
true_mean = X[~np.isin(np.arange(len(X)), outlier_indices), 0].mean()

df = pd.DataFrame(
    {
        "Sample": range(n_samples),
        "StandardScaler": scaler_means,
        "Pipeline": pipeline_means,
        "True Mean": true_mean,
    }
)

df_long = pd.melt(df, id_vars=["Sample"], var_name="Method", value_name="Estimated Mean")

# Plot
(
    ggplot(df_long, aes(x="Sample", y="Estimated Mean", color="Method"))
    + geom_line()
    + geom_vline(xintercept=outlier_indices, color="red", alpha=0.3)
    + labs(title="Mean Estimation Over Time", x="Sample Index", y="Estimated Mean")
    + theme_minimal()
    + theme(legend_position="bottom")
)


Key Takeaways

  • Online Learning: All normalizers update incrementally without storing historical data
  • Robustness: Use Pipeline with Winsorizer to handle outliers in streaming data
  • Efficiency: Memory footprint remains constant regardless of stream length
  • Flexibility: Mix and match normalizers to build custom preprocessing pipelines

For more details, see the documentation.
