A library for normalizing streams of incoming data, particularly focused on improving sequential experimentation.
Online Normalization (onorm)
onorm provides online (incremental) normalization algorithms for streaming data. These normalizers update their statistics incrementally without storing historical data, making them suitable for large-scale or real-time applications.
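To make "incremental without storing historical data" concrete, here is a minimal sketch of a running mean and variance using Welford's update. It is an illustration of the general technique, not onorm's implementation; the `RunningStats` class and its attribute names are hypothetical.

```python
import math


class RunningStats:
    """Welford-style running mean and variance for a scalar stream (illustrative only)."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one observation into the statistics; nothing else is stored
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Population standard deviation; 0.0 before any data arrives
        return math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

print(stats.mean, stats.std)
```

The state is three numbers regardless of how many values have been seen. For this small stream the exact mean is 5 and the population standard deviation is 2.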
Installation
pip install onorm
Features
- StandardScaler: Online standardization (z-score normalization)
- MinMaxScaler: Online min-max scaling to [0, 1]
- Winsorizer: Online outlier clipping using quantiles
- MultivariateNormalizer: Online decorrelation and standardization
- Pipeline: Chain multiple normalizers sequentially
All normalizers support:
- Incremental updates via partial_fit()
- Transformation via transform()
- Combined operation via partial_fit_transform()
- State reset via reset()
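The four methods above can be pictured with a small stand-in class. This is a sketch only: `RunningMinMax` is hypothetical and is not onorm code, and details such as `partial_fit()` returning `self` or `partial_fit_transform()` fitting before transforming are assumptions of the sketch rather than documented behavior.

```python
class RunningMinMax:
    """Hypothetical stand-in (not onorm code) that mirrors the four-method interface."""

    def __init__(self, n_dim):
        self.n_dim = n_dim
        self.reset()

    def reset(self):
        # Forget everything learned so far
        self.lo = [float("inf")] * self.n_dim
        self.hi = [float("-inf")] * self.n_dim

    def partial_fit(self, x):
        # Update the per-feature running min and max with one observation
        self.lo = [min(l, v) for l, v in zip(self.lo, x)]
        self.hi = [max(h, v) for h, v in zip(self.hi, x)]
        return self

    def transform(self, x):
        # Scale to [0, 1] with the current range; return 0.0 when the range is empty
        return [(v - l) / (h - l) if h > l else 0.0
                for l, h, v in zip(self.lo, self.hi, x)]

    def partial_fit_transform(self, x):
        # Assumption of this sketch: learn from x first, then transform it
        return self.partial_fit(x).transform(x)


scaler = RunningMinMax(n_dim=2)
scaler.partial_fit([0.0, 10.0])
scaler.partial_fit([4.0, 30.0])
scaled = scaler.partial_fit_transform([1.0, 20.0])
print(scaled)  # [0.25, 0.5]

scaler.reset()  # statistics are back to their initial state
```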
Usage Example
Let's compare online normalization with and without outlier handling. We'll process a stream of data points and track how well each approach maintains normalized statistics.
import numpy as np
import pandas as pd
from numpy.random import default_rng
from onorm import Pipeline, StandardScaler, Winsorizer
from plotnine import aes, geom_line, geom_vline, ggplot, labs, theme, theme_minimal
rng = default_rng(2024)
# Generate streaming data with outliers
n_samples = 1000
n_dim = 5
X = rng.normal(loc=10, scale=1, size=(n_samples, n_dim))
# Add some outliers
outlier_indices = [100, 250, 500, 750]
for idx in outlier_indices:
    X[idx] = rng.uniform(-100, 100, size=n_dim)
print(f"Generated {n_samples} samples with {len(outlier_indices)} outliers")
Generated 1000 samples with 4 outliers
# Approach 1: StandardScaler only (sensitive to outliers)
scaler_only = StandardScaler(n_dim=n_dim)
# Approach 2: Pipeline with Winsorizer + StandardScaler (robust to outliers)
pipeline = Pipeline([Winsorizer(n_dim=n_dim, clip_q=(0.05, 0.95)), StandardScaler(n_dim=n_dim)])
# Track mean estimates over time
scaler_means = []
pipeline_means = []
for x in X:
    scaler_only.partial_fit(x)
    pipeline.partial_fit(x)
    scaler_means.append(scaler_only.mean[0])
    pipeline_means.append(pipeline.normalizers[1].mean[0])
print(f"StandardScaler final mean: {scaler_only.mean[0]:.2f}")
print(f"Pipeline final mean: {pipeline.normalizers[1].mean[0]:.2f}")
StandardScaler final mean: 9.84
Pipeline final mean: 10.02
Visualization
The plot shows how the estimated mean evolves as data streams in. The pipeline with winsorization keeps its estimate stable when outliers arrive (red vertical lines), while the standalone StandardScaler is pulled further by extreme values.
# Prepare data for plotting
true_mean = X[~np.isin(np.arange(len(X)), outlier_indices), 0].mean()
df = pd.DataFrame(
    {
        "Sample": range(n_samples),
        "StandardScaler": scaler_means,
        "Pipeline": pipeline_means,
        "True Mean": true_mean,
    }
)
df_long = pd.melt(df, id_vars=["Sample"], var_name="Method", value_name="Estimated Mean")
# Plot
(
    ggplot(df_long, aes(x="Sample", y="Estimated Mean", color="Method"))
    + geom_line()
    + geom_vline(xintercept=outlier_indices, color="red", alpha=0.3)
    + labs(title="Mean Estimation Over Time", x="Sample Index", y="Estimated Mean")
    + theme_minimal()
    + theme(legend_position="bottom")
)
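The size of the jumps at the red lines follows from the running-mean update: the n-th observation moves the mean by (x - mean) / n. The sketch below works through one such step with and without clipping. The outlier value and the clip bounds of 8 and 12 are illustrative assumptions, not the quantiles the Winsorizer would actually estimate.

```python
def running_mean_step(mean, n, x):
    """Return the mean after adding x as the n-th observation."""
    return mean + (x - mean) / n


def clip(x, lower, upper):
    """Clamp x to the interval [lower, upper]."""
    return max(lower, min(upper, x))


mean_before = 10.0   # running mean after 100 inliers centered near 10
outlier = -90.0      # illustrative extreme value arriving as observation 101

# Without clipping, the outlier enters the mean at full size
raw_mean = running_mean_step(mean_before, 101, outlier)

# With clipping (hypothetical bounds), only the clipped value enters the mean
clipped_mean = running_mean_step(mean_before, 101, clip(outlier, 8.0, 12.0))

print(f"unclipped: {raw_mean:.3f}, clipped: {clipped_mean:.3f}")
# unclipped: 9.010, clipped: 9.980
```

The same outlier arriving later in the stream would move either estimate less, since the step is divided by n.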
Key Takeaways
- Online Learning: All normalizers update incrementally without storing historical data
- Robustness: Use Pipeline with Winsorizer to handle outliers in streaming data
- Efficiency: Memory footprint remains constant regardless of stream length
- Flexibility: Mix and match normalizers to build custom preprocessing pipelines
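As a rough picture of how chaining can work, the sketch below passes each observation through a list of steps, fitting every step on the output of the previous one. That ordering is an assumption: the differing final means in the usage example suggest it, but the text does not state it. The `Clip`, `Center`, and `Chain` classes are hypothetical stand-ins, not onorm classes.

```python
class Clip:
    """Hypothetical stateless step that clips to fixed bounds."""

    def __init__(self, lower, upper):
        self.lower, self.upper = lower, upper

    def partial_fit(self, x):
        return self  # nothing to learn in this simplified step

    def transform(self, x):
        return max(self.lower, min(self.upper, x))


class Center:
    """Hypothetical step that subtracts a running mean."""

    def __init__(self):
        self.n, self.mean = 0, 0.0

    def partial_fit(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        return self

    def transform(self, x):
        return x - self.mean


class Chain:
    """Feed each observation through the steps in order."""

    def __init__(self, steps):
        self.steps = steps

    def partial_fit_transform(self, x):
        for step in self.steps:
            # Each step learns from, then transforms, the previous step's output
            x = step.partial_fit(x).transform(x)
        return x


chain = Chain([Clip(0.0, 20.0), Center()])
for value in [10.0, 12.0, 500.0]:
    out = chain.partial_fit_transform(value)

# Center only ever saw 10, 12, and the clipped value 20
print(chain.steps[1].mean, out)  # 14.0 6.0
```

Because the centering step never sees the raw 500, its mean stays within the clipped range.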
For more details, see the documentation.