Streaming survey raking via SGD and MWU

onlinerake

Real-time survey weighting for streaming data.

The Problem

You're collecting survey responses or observational data one record at a time. Your sample doesn't match population demographics: too many young respondents, too few from certain regions. Traditional weighting methods (raking/IPF) require reprocessing the entire dataset whenever a new response arrives.

onlinerake updates weights incrementally as each observation streams in, keeping weighted margins aligned with population targets in real time.
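
For contrast, the batch approach described above can be sketched in a few lines of NumPy. This is a generic textbook-style raking loop for binary indicators, not code from onlinerake; the function name `batch_rake` and the synthetic data are assumptions made for illustration. Every new response would require running the whole loop again over all rows.

```python
import numpy as np

def batch_rake(X, targets, n_iter=50, tol=1e-10):
    """Classic raking (IPF) for binary indicators (illustrative sketch).

    X: (n, k) array of 0/1 indicators; targets: length-k target proportions.
    Assumes every indicator takes both values in the sample.
    Returns weights normalised to sum to n.
    """
    X = np.asarray(X, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n, k = X.shape
    w = np.ones(n)
    for _ in range(n_iter):
        for j in range(k):
            ones = X[:, j] == 1
            p = w[ones].sum() / w.sum()              # current weighted share with indicator = 1
            w[ones] *= targets[j] / p                # scale the "1" group to the target share
            w[~ones] *= (1 - targets[j]) / (1 - p)   # scale the "0" group to the remainder
        margins = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.max(np.abs(margins - targets)) < tol:
            break
    return w * n / w.sum()

# Synthetic sample that over-represents the first indicator
rng = np.random.default_rng(0)
X = (rng.random((300, 2)) < np.array([0.7, 0.2])).astype(float)
targets = np.array([0.51, 0.32])

w = batch_rake(X, targets)
margins = (w[:, None] * X).sum(axis=0) / w.sum()
print(margins)
```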

When to Use This

  • Online surveys where responses arrive continuously
  • A/B tests that need demographic balance during collection
  • Passive data collection (app usage, sensor data) requiring real-time calibration
  • Any streaming scenario where batch reweighting is too slow or impractical

Quick Start

pip install onlinerake

from onlinerake import OnlineRakingSGD, Targets

# Define population targets (proportion with indicator = 1)
targets = Targets(
    female=0.51,      # 51% female in population
    college=0.32,     # 32% college educated
    age_65_plus=0.17  # 17% age 65+
)

# Create raker
raker = OnlineRakingSGD(targets, learning_rate=5.0)

# Process observations as they arrive (survey_stream: any iterable of responses)
for response in survey_stream:
    raker.partial_fit(response)

    # Check current state anytime
    print(f"Weighted margins: {raker.margins}")
    print(f"Effective sample size: {raker.effective_sample_size:.0f}")

# Get final weights
weights = raker.weights[:raker.n_obs]
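
The Quick Start leaves `survey_stream` undefined. For a dry run, a synthetic stream might look like the sketch below. The dict-of-indicators record format, the helper name `synthetic_survey_stream`, and the sampling proportions are all assumptions made for illustration; check the package documentation for the observation format that `partial_fit` actually expects.

```python
import random

def synthetic_survey_stream(n, seed=0):
    """Yield n fake responses as dicts of 0/1 indicators (hypothetical format)."""
    rng = random.Random(seed)
    for _ in range(n):
        yield {
            "female": int(rng.random() < 0.40),       # under-represents women vs. 0.51 target
            "college": int(rng.random() < 0.50),      # over-represents college vs. 0.32 target
            "age_65_plus": int(rng.random() < 0.10),  # under-represents 65+ vs. 0.17 target
        }

survey_stream = list(synthetic_survey_stream(300))
print(survey_stream[0])
```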

Which Algorithm?

| Use Case | Algorithm | Learning Rate |
|---|---|---|
| Most cases | OnlineRakingSGD | 5.0 |
| Smoother weights, higher ESS | OnlineRakingSGD | 2.0-5.0 |
| IPF-like multiplicative updates | OnlineRakingMWU | 0.5-1.0 |
| Starting from unequal base weights | OnlineRakingMWU | 0.5-1.0 |

Recommendation: Start with OnlineRakingSGD(targets, learning_rate=5.0). It converges faster, maintains higher effective sample size, and handles most scenarios well.
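
To make the difference between the two update styles concrete, here is a minimal sketch of an additive (SGD-style) and a multiplicative (MWU-style) step on a squared-error margin loss. It illustrates the general technique only; the function names, the exact loss, and the clipping bounds are assumptions, and the library's internal updates may differ.

```python
import numpy as np

def margin_gradient(w, X, targets):
    """Gradient of sum_k (m_k - t_k)^2 w.r.t. each weight, where m_k is the weighted mean of column k."""
    total = w.sum()
    margins = (w[:, None] * X).sum(axis=0) / total
    # d m_k / d w_i = (x_ik - m_k) / total, combined via the chain rule
    return ((X - margins) * 2.0 * (margins - targets)).sum(axis=1) / total

def sgd_step(w, X, targets, lr, lo=0.1, hi=10.0):
    """Additive update, then clip weights to [lo, hi]."""
    return np.clip(w - lr * margin_gradient(w, X, targets), lo, hi)

def mwu_step(w, X, targets, lr, lo=0.1, hi=10.0):
    """Multiplicative update: each weight is rescaled, so relative base weights carry through."""
    return np.clip(w * np.exp(-lr * margin_gradient(w, X, targets)), lo, hi)

def loss(w, X, targets):
    margins = (w[:, None] * X).sum(axis=0) / w.sum()
    return float(((margins - targets) ** 2).sum())

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = (rng.random((300, 2)) < np.array([0.7, 0.2])).astype(float)
targets = np.array([0.51, 0.32])

w_sgd = np.ones(300)
w_mwu = np.ones(300)
for _ in range(50):
    w_sgd = sgd_step(w_sgd, X, targets, lr=5.0)
    w_mwu = mwu_step(w_mwu, X, targets, lr=5.0)

initial_loss = loss(np.ones(300), X, targets)
print(initial_loss, loss(w_sgd, X, targets), loss(w_mwu, X, targets))
```

Because the multiplicative step rescales existing weights rather than adding to them, it is a natural fit when observations start from unequal base weights.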

Performance

In simulation studies across linear drift, sudden shift, and oscillating bias scenarios:

| Method | Margin Error Reduction | Effective Sample Size |
|---|---|---|
| SGD | 72-80% | 225-280 (of 300) |
| MWU | 47-52% | 175-276 (of 300) |
| Unweighted | baseline | 300 |

SGD reduces margin error more than MWU across these scenarios (72-80% vs. 47-52%) while keeping a comparable or higher effective sample size.
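
Effective sample size is commonly computed with Kish's approximation, (Σw)² / Σw². The sketch below assumes that definition; whether `raker.effective_sample_size` uses exactly this formula is an assumption, and `kish_ess` is a hypothetical helper name.

```python
import numpy as np

def kish_ess(weights):
    """Kish effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

equal = kish_ess(np.ones(300))        # equal weights lose nothing
unequal = kish_ess([1.0, 1.0, 2.0, 2.0])  # unequal weights shrink the effective n
print(equal, unequal)
```

Equal weights give an ESS equal to the raw sample size, which is why the unweighted baseline sits at 300; any variation in the weights pushes the value below that.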

Features

Continuous Covariates (v1.3.0)

Target means instead of proportions:

targets = Targets(
    age=(42.0, "mean"),      # Target mean age = 42
    income=(55000, "mean"),  # Target mean income = $55,000
    female=0.51              # Binary: 51% female
)
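
Binary and continuous targets can be viewed the same way: a weighted proportion is just the weighted mean of a 0/1 indicator. The helper below is a plain-Python sketch of that idea; `weighted_margins` and the dict record format are assumptions for illustration, not part of the package.

```python
def weighted_margins(records, weights):
    """Weighted mean of every feature; for 0/1 features this is the weighted proportion."""
    total = sum(weights)
    keys = records[0].keys()
    return {
        k: sum(w * r[k] for r, w in zip(records, weights)) / total
        for k in keys
    }

records = [
    {"age": 30, "female": 1},
    {"age": 60, "female": 0},
]
weights = [1.0, 3.0]

margins = weighted_margins(records, weights)
print(margins)
```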

Learning Rate Schedules

For theoretical convergence guarantees:

from onlinerake import OnlineRakingSGD, Targets, PolynomialDecayLR
from onlinerake.convergence import verify_robbins_monro

schedule = PolynomialDecayLR(initial_lr=10.0, power=0.6)
raker = OnlineRakingSGD(targets, learning_rate=schedule)

# Verify Robbins-Monro conditions (analytical for known schedules)
result = verify_robbins_monro(schedule)
print(result.condition_1_satisfied)  # True: Σ η_t = ∞
print(result.condition_2_satisfied)  # True: Σ η_t² < ∞

For known schedule types, verify_robbins_monro() verifies both conditions analytically, using mathematical proofs.
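
For a polynomial decay schedule, the analytical check reduces to a p-series argument. The sketch below assumes the schedule has the form η_t = initial_lr / t^power; the exact form used by PolynomialDecayLR may differ (for example, by an offset in t), and `robbins_monro_polynomial` is a hypothetical helper, not the library function.

```python
def robbins_monro_polynomial(initial_lr, power):
    """Check both Robbins-Monro conditions for eta_t = initial_lr / t**power (assumed form)."""
    if initial_lr <= 0:
        raise ValueError("initial_lr must be positive")
    sum_diverges = power <= 1.0          # sum of t^-p diverges iff p <= 1
    sum_sq_converges = 2 * power > 1.0   # sum of t^-2p converges iff 2p > 1
    return sum_diverges, sum_sq_converges

decaying = robbins_monro_polynomial(10.0, 0.6)   # power inside (0.5, 1]
constant = robbins_monro_polynomial(5.0, 0.0)    # constant step size
too_fast = robbins_monro_polynomial(10.0, 1.5)   # decays too quickly
print(decaying, constant, too_fast)
```

Under this assumed form, both conditions hold exactly when 0.5 < power <= 1. A constant learning rate fails the second condition, which is why a decaying schedule is needed for the guarantee.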

Diagnostics

from onlinerake import check_target_feasibility, compute_design_effect

# Check if targets are achievable with your data
feasibility = check_target_feasibility(raker)
print(f"Feasible: {feasibility.is_feasible}")

# Measure weighting efficiency
deff = compute_design_effect(raker)
print(f"Design effect: {deff:.2f}")
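
One simple way to think about feasibility: when every weight is bounded, the weighted proportion of a binary feature can only move within a limited range. The sketch below computes that range for a single feature. It is a necessary condition only, uses the documented default weight bounds (0.1 and 10.0), and is not necessarily what check_target_feasibility implements; `achievable_range` is a hypothetical helper.

```python
def achievable_range(n_ones, n_zeros, min_weight=0.1, max_weight=10.0):
    """Range of weighted proportions reachable for one binary feature
    when every weight lies in [min_weight, max_weight]."""
    if n_ones + n_zeros == 0:
        raise ValueError("need at least one observation")
    # Lowest share: "1" records at the minimum weight, "0" records at the maximum
    lo = n_ones * min_weight / (n_ones * min_weight + n_zeros * max_weight)
    # Highest share: the reverse assignment
    hi = n_ones * max_weight / (n_ones * max_weight + n_zeros * min_weight)
    return lo, hi

rare = achievable_range(10, 290)    # few "1" records: the upper bound is still well below 1
absent = achievable_range(0, 300)   # no "1" records: any positive target is unreachable
print(rare, absent)
```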

Batch Comparison

Compare streaming results against traditional IPF:

from onlinerake import BatchIPF

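batch_raker = BatchIPF(targets)
batch_raker.fit(all_observations)  # all_observations: the full dataset, available up front

# online_raker: a streaming raker already fitted on the same observations
print(f"Online loss: {online_raker.loss:.6f}")
print(f"Batch loss: {batch_raker.loss:.6f}")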

API Reference

Core Classes

Targets(**features) - Define population margins

  • Binary features: female=0.51 (proportion of the population with indicator = 1)
  • Continuous features: age=(42.0, "mean") (target mean)

OnlineRakingSGD(targets, learning_rate=5.0) - SGD-based streaming raker

  • .partial_fit(obs) - Process one observation
  • .margins - Current weighted margins (dict)
  • .loss - Current squared-error loss
  • .weights - Weight array (use [:raker.n_obs] to slice)
  • .effective_sample_size - ESS accounting for weight variation
  • .converged - Whether loss is below tolerance

OnlineRakingMWU(targets, learning_rate=1.0) - Multiplicative weights raker

  • Same API as OnlineRakingSGD

Key Parameters

| Parameter | Default | Description |
|---|---|---|
| learning_rate | 5.0 (SGD), 1.0 (MWU) | Step size for updates |
| min_weight | 0.1 | Minimum allowed weight |
| max_weight | 10.0 | Maximum allowed weight |
| n_steps | 3 | Gradient steps per observation |
| convergence_tol | 1e-6 | Loss threshold for convergence |

Installation

pip install onlinerake

Development install:

git clone https://github.com/finite-sample/onlinerake.git
cd onlinerake
pip install -e ".[docs]"

Testing

pytest tests/ -v

Examples

See examples/ for complete worked examples:

  • real_survey_example.py - Basic survey weighting
  • ab_test_calibration.py - Balancing treatment/control groups
  • ad_targeting_calibration.py - Real-time ad delivery calibration
  • recommendation_balancing.py - Content recommendation fairness

Interactive notebooks in docs/notebooks/:

  • 01_getting_started.ipynb - Visual introduction
  • 02_performance_comparison.ipynb - Algorithm benchmarking
  • 03_advanced_diagnostics.ipynb - Convergence and diagnostics

Citation

If you use this package in research, please cite:

@software{onlinerake,
  author = {Sood, Gaurav},
  title = {onlinerake: Streaming Survey Raking},
  url = {https://github.com/finite-sample/onlinerake},
  version = {1.3.0},
  year = {2026}
}

License

MIT
