Streaming survey raking via SGD and MWU

onlinerake

Real-time survey weighting for streaming data.

The Problem

You're collecting survey responses or observational data one record at a time. Your sample doesn't match population demographics—too many young respondents, too few from certain regions. Traditional weighting methods (raking/IPF) require reprocessing the entire dataset whenever a new response arrives.

onlinerake updates weights incrementally as each observation streams in, keeping weighted margins aligned with population targets in real time.

When to Use This

  • Online surveys where responses arrive continuously
  • A/B tests that need demographic balance during collection
  • Passive data collection (app usage, sensor data) requiring real-time calibration
  • Any streaming scenario where batch reweighting is too slow or impractical

Quick Start

pip install onlinerake

from onlinerake import OnlineRakingSGD, Targets

# Define population targets (proportion with indicator = 1)
targets = Targets(
    female=0.51,      # 51% female in population
    college=0.32,     # 32% college educated
    age_65_plus=0.17  # 17% age 65+
)

# Create raker
raker = OnlineRakingSGD(targets, learning_rate=5.0)

# Process observations as they arrive (survey_stream stands for any iterable of responses)
for response in survey_stream:
    raker.partial_fit(response)

    # Check current state anytime
    print(f"Weighted margins: {raker.margins}")
    print(f"Effective sample size: {raker.effective_sample_size:.0f}")

# Get final weights
weights = raker.weights[:raker.n_obs]
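
To make the two quantities printed in the loop concrete, here is a minimal standalone sketch of how weighted margins and an effective sample size can be computed from a set of weights. It does not call onlinerake. The helper names `weighted_margins` and `kish_ess`, the dict-per-respondent layout, and the toy data are assumptions for illustration, and Kish's formula is one common ESS definition that may differ from what `raker.effective_sample_size` reports.

```python
from typing import Dict, Sequence


def weighted_margins(
    observations: Sequence[Dict[str, float]], weights: Sequence[float]
) -> Dict[str, float]:
    """Weighted mean of each feature; for a 0/1 indicator this is a proportion."""
    total = sum(weights)
    features = observations[0].keys()
    return {
        f: sum(w * obs[f] for obs, w in zip(observations, weights)) / total
        for f in features
    }


def kish_ess(weights: Sequence[float]) -> float:
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


# Hypothetical toy sample: four respondents, two binary indicators.
sample = [
    {"female": 1, "college": 0},
    {"female": 0, "college": 1},
    {"female": 0, "college": 0},
    {"female": 1, "college": 1},
]
equal = [1.0, 1.0, 1.0, 1.0]
skewed = [2.0, 1.0, 1.0, 2.0]  # up-weight the two female respondents

print(weighted_margins(sample, equal))
print(weighted_margins(sample, skewed))
print(kish_ess(equal), kish_ess(skewed))
```

Up-weighting the two female respondents moves the female margin from 0.5 to about 0.667, and the effective sample size drops from 4.0 to 3.6. That trade-off between margin accuracy and weight variation is what the two printed diagnostics track.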

Which Algorithm?

| Use Case | Algorithm | Learning Rate |
|---|---|---|
| Most cases | OnlineRakingSGD | 5.0 |
| Smoother weights, higher ESS | OnlineRakingSGD | 2.0-5.0 |
| IPF-like multiplicative updates | OnlineRakingMWU | 0.5-1.0 |
| Starting from unequal base weights | OnlineRakingMWU | 0.5-1.0 |

Recommendation: Start with OnlineRakingSGD(targets, learning_rate=5.0). It converges faster, maintains higher effective sample size, and handles most scenarios well.
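
The difference between the two update styles can be sketched in a few lines. The code below is an illustration under stated assumptions, not onlinerake's implementation. It assumes a squared-error loss on the weighted margins, takes the gradient of that loss with respect to each weight, and applies either an additive step (SGD-style) or a multiplicative step (MWU-style), clipping to the default weight bounds listed under Key Parameters. The function names and the synthetic sample are hypothetical.

```python
import math
from typing import Dict, List

Row = Dict[str, float]
TargetMap = Dict[str, float]  # plain dict, not onlinerake's Targets class


def margins(rows: List[Row], w: List[float], targets: TargetMap) -> Dict[str, float]:
    """Weighted proportion m_k for each targeted indicator."""
    total = sum(w)
    return {k: sum(wi * r[k] for wi, r in zip(w, rows)) / total for k in targets}


def loss(rows: List[Row], w: List[float], targets: TargetMap) -> float:
    """Assumed objective: sum over features of (m_k - t_k)^2."""
    m = margins(rows, w, targets)
    return sum((m[k] - targets[k]) ** 2 for k in targets)


def gradient(rows: List[Row], w: List[float], targets: TargetMap) -> List[float]:
    """d(loss)/d(w_i) = sum_k 2 (m_k - t_k) (x_ik - m_k) / sum(w)."""
    m = margins(rows, w, targets)
    total = sum(w)
    return [
        sum(2.0 * (m[k] - targets[k]) * (r[k] - m[k]) for k in targets) / total
        for r in rows
    ]


def clip(value: float, low: float = 0.1, high: float = 10.0) -> float:
    return max(low, min(high, value))


def sgd_step(rows, w, targets, lr):
    """Additive update: w_i <- clip(w_i - lr * g_i)."""
    return [clip(wi - lr * gi) for wi, gi in zip(w, gradient(rows, w, targets))]


def mwu_step(rows, w, targets, lr):
    """Multiplicative update: w_i <- clip(w_i * exp(-lr * g_i))."""
    return [clip(wi * math.exp(-lr * gi)) for wi, gi in zip(w, gradient(rows, w, targets))]


# Hypothetical biased sample: 30% female, 50% college.
rows = [{"female": float(i % 10 < 3), "college": float(i % 2 == 0)} for i in range(100)]
targets = {"female": 0.51, "college": 0.32}

w_sgd = [1.0] * len(rows)
w_mwu = [1.0] * len(rows)
initial_loss = loss(rows, w_sgd, targets)
for _ in range(200):
    w_sgd = sgd_step(rows, w_sgd, targets, lr=5.0)
    w_mwu = mwu_step(rows, w_mwu, targets, lr=1.0)

print(f"initial loss: {initial_loss:.4f}")
print(f"SGD loss:     {loss(rows, w_sgd, targets):.6f}")
print(f"MWU loss:     {loss(rows, w_mwu, targets):.6f}")
```

Because the multiplicative step rescales each weight instead of shifting it, relative differences in any starting weights carry through the update. That is one way to read the table's suggestion of MWU for unequal base weights.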

Performance

In simulation studies across linear drift, sudden shift, and oscillating bias scenarios:

| Method | Margin Error Reduction | Effective Sample Size |
|---|---|---|
| SGD | 72-80% | 225-280 (of 300) |
| MWU | 47-52% | 175-276 (of 300) |
| Unweighted | baseline | 300 |

SGD consistently outperforms MWU on margin accuracy while maintaining comparable or higher effective sample sizes (225-280 versus 175-276 of 300).

Features

Continuous Covariates (v1.3.0)

Target means instead of proportions:

targets = Targets(
    age=(42.0, "mean"),      # Target mean age = 42
    income=(55000, "mean"),  # Target mean income = $55,000
    female=0.51              # Binary: 51% female
)
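
A proportion is the weighted mean of a 0/1 indicator, so binary and continuous targets can share one margin computation. The sketch below illustrates that idea with plain dicts. `normalize_targets` and `margin_gaps` are hypothetical helpers rather than part of onlinerake, and the three-row sample is made up.

```python
from typing import Dict, List, Sequence, Tuple, Union

TargetSpec = Union[float, Tuple[float, str]]


def normalize_targets(spec: Dict[str, TargetSpec]) -> Dict[str, Tuple[float, str]]:
    """Map each feature to (target_value, kind), kind in {"proportion", "mean"}."""
    out: Dict[str, Tuple[float, str]] = {}
    for name, value in spec.items():
        if isinstance(value, tuple):
            target, kind = value
            if kind != "mean":
                raise ValueError(f"unsupported target kind for {name!r}: {kind!r}")
            out[name] = (float(target), "mean")
        else:
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"proportion for {name!r} must lie in [0, 1]")
            out[name] = (float(value), "proportion")
    return out


def margin_gaps(
    rows: List[Dict[str, float]], weights: Sequence[float], spec: Dict[str, TargetSpec]
) -> Dict[str, float]:
    """Weighted mean of each feature minus its target."""
    total = sum(weights)
    gaps = {}
    for name, (target, _kind) in normalize_targets(spec).items():
        weighted_mean = sum(w * r[name] for r, w in zip(rows, weights)) / total
        gaps[name] = weighted_mean - target
    return gaps


# Hypothetical three-respondent sample with one continuous and one binary feature.
rows = [
    {"age": 30.0, "female": 1},
    {"age": 50.0, "female": 0},
    {"age": 40.0, "female": 1},
]
spec = {"age": (42.0, "mean"), "female": 0.51}
print(margin_gaps(rows, [1.0, 1.0, 1.0], spec))
```

With equal weights the sample mean age is 40, two years below the target of 42, while the female share of 2/3 sits above 0.51. Gaps on continuous features are in the feature's own units, so they are not directly comparable to gaps on proportions.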

Learning Rate Schedules

For theoretical convergence guarantees:

from onlinerake import OnlineRakingSGD, Targets, PolynomialDecayLR
from onlinerake.convergence import verify_robbins_monro

schedule = PolynomialDecayLR(initial_lr=10.0, power=0.6)
raker = OnlineRakingSGD(targets, learning_rate=schedule)

# Verify Robbins-Monro conditions (analytical for known schedules)
result = verify_robbins_monro(schedule)
print(result.condition_1_satisfied)  # True: Σ η_t = ∞
print(result.condition_2_satisfied)  # True: Σ η_t² < ∞

For known schedule types, verify_robbins_monro() checks both conditions analytically, based on mathematical proofs for each schedule, rather than by numerical approximation.
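
For a polynomially decaying step size the two Robbins-Monro conditions reduce to the p-series test, as the standalone sketch below shows. It assumes the schedule has the form η_t = initial_lr / t^power for t = 1, 2, ..., which may not be the exact form `PolynomialDecayLR` uses. `check_polynomial_decay` and `RobbinsMonroCheck` are hypothetical names, not the library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RobbinsMonroCheck:
    sum_diverges: bool           # condition 1: sum of eta_t is infinite
    sum_of_squares_finite: bool  # condition 2: sum of eta_t^2 is finite


def check_polynomial_decay(initial_lr: float, power: float) -> RobbinsMonroCheck:
    """Apply the p-series test to eta_t = initial_lr / t**power (assumed form)."""
    if initial_lr <= 0:
        raise ValueError("initial_lr must be positive")
    return RobbinsMonroCheck(
        # sum 1/t^p diverges exactly when p <= 1
        sum_diverges=power <= 1.0,
        # sum 1/t^(2p) converges exactly when 2p > 1
        sum_of_squares_finite=2.0 * power > 1.0,
    )


print(check_polynomial_decay(10.0, 0.6))  # both conditions hold
print(check_polynomial_decay(5.0, 0.0))   # constant rate: condition 2 fails
print(check_polynomial_decay(10.0, 1.5))  # decays too fast: condition 1 fails
```

Under this assumed form, both conditions hold exactly when the power lies in (0.5, 1], which is why power=0.6 passes. A constant learning rate such as 5.0 does not satisfy the second condition.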

Diagnostics

from onlinerake import check_target_feasibility, compute_design_effect

# Check if targets are achievable with your data
feasibility = check_target_feasibility(raker)
print(f"Feasible: {feasibility.is_feasible}")

# Measure weighting efficiency
deff = compute_design_effect(raker)
print(f"Design effect: {deff:.2f}")
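
As a rough illustration of what these diagnostics measure, the sketch below computes Kish's design effect from a weight vector and the range of weighted proportions a single 0/1 feature can reach when every weight is bounded. Both helpers are hypothetical and independent of onlinerake, and the library's own checks may use different definitions. The range check looks at one margin in isolation, so it is a necessary condition only: several targets that each pass can still be jointly infeasible.

```python
from typing import Sequence, Tuple


def kish_design_effect(weights: Sequence[float]) -> float:
    """Kish's design effect from weighting: n * sum(w^2) / (sum w)^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


def achievable_range(
    indicator: Sequence[int], min_weight: float, max_weight: float
) -> Tuple[float, float]:
    """Lowest and highest weighted proportion for one 0/1 feature
    when every weight must lie in [min_weight, max_weight]."""
    if not indicator:
        raise ValueError("indicator must be non-empty")
    ones = sum(indicator)
    zeros = len(indicator) - ones
    # Lowest share: ones at the minimum weight, zeros at the maximum.
    lowest = ones * min_weight / (ones * min_weight + zeros * max_weight)
    # Highest share: ones at the maximum weight, zeros at the minimum.
    highest = ones * max_weight / (ones * max_weight + zeros * min_weight)
    return lowest, highest


# Equal weights give a design effect of 1; unequal weights inflate it.
print(kish_design_effect([1.0, 1.0, 1.0, 1.0]))
print(kish_design_effect([2.0, 1.0, 1.0, 2.0]))

# Hypothetical sample with 1 female respondent out of 100, default bounds 0.1 and 10.0.
low, high = achievable_range([1] + [0] * 99, min_weight=0.1, max_weight=10.0)
print(f"reachable female share: {low:.4f} to {high:.4f}")
print("target 0.51 reachable:", low <= 0.51 <= high)
```

In the last example the highest reachable share is 10 / 19.9, just over 0.50, so a 0.51 target cannot be met with the default bounds no matter how the weights are updated.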

Batch Comparison

Compare streaming results against traditional IPF:

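from onlinerake import BatchIPF

# all_observations: the full set of responses already streamed through `raker`
batch_raker = BatchIPF(targets)
batch_raker.fit(all_observations)

# `raker` is the streaming raker fitted earlier
print(f"Online loss: {raker.loss:.6f}")
print(f"Batch loss: {batch_raker.loss:.6f}")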
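
For readers unfamiliar with the batch baseline, classical raking (IPF) on binary margins can be sketched as follows. This is a textbook-style illustration, not the code behind `BatchIPF`. `rake_binary`, `weighted_share`, and the synthetic sample are hypothetical. Each sweep rescales the weights of the 1-group and the 0-group of one feature so that margin matches its target exactly, then moves to the next feature, repeating until all margins agree.

```python
from typing import Dict, List

Row = Dict[str, int]


def weighted_share(rows: List[Row], weights: List[float], name: str) -> float:
    """Weighted proportion of rows whose indicator `name` equals 1."""
    return sum(w for w, r in zip(weights, rows) if r[name] == 1) / sum(weights)


def rake_binary(
    rows: List[Row], targets: Dict[str, float], max_iter: int = 100, tol: float = 1e-10
) -> List[float]:
    """Iterative proportional fitting over 0/1 margins, starting from equal weights."""
    weights = [1.0] * len(rows)
    for _ in range(max_iter):
        worst_gap = 0.0
        for name, target in targets.items():
            share = weighted_share(rows, weights, name)
            if not 0.0 < share < 1.0:
                raise ValueError(f"{name!r} has no variation; target unreachable")
            worst_gap = max(worst_gap, abs(share - target))
            up = target / share                    # factor for rows with value 1
            down = (1.0 - target) / (1.0 - share)  # factor for rows with value 0
            weights = [w * (up if r[name] == 1 else down) for w, r in zip(weights, rows)]
        if worst_gap < tol:
            break
    return weights


# Hypothetical biased sample: 30% female, 50% college.
rows = [{"female": int(i % 10 < 3), "college": int(i % 2 == 0)} for i in range(100)]
targets = {"female": 0.51, "college": 0.32}

weights = rake_binary(rows, targets)
for name, target in targets.items():
    print(f"{name}: {weighted_share(rows, weights, name):.6f} (target {target})")
```

Every sweep touches every row, which is the reprocessing cost that motivates the streaming updates. Unlike the online rakers, this sketch applies no weight bounds.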

API Reference

Core Classes

Targets(**features) - Define population margins

  • Binary features: female=0.51 (proportion with indicator = 1)
  • Continuous features: age=(42.0, "mean") (target mean)

OnlineRakingSGD(targets, learning_rate=5.0) - SGD-based streaming raker

  • .partial_fit(obs) - Process one observation
  • .margins - Current weighted margins (dict)
  • .loss - Current squared-error loss
  • .weights - Weight array (use [:raker.n_obs] to slice)
  • .effective_sample_size - ESS accounting for weight variation
  • .converged - Whether loss is below tolerance

OnlineRakingMWU(targets, learning_rate=1.0) - Multiplicative weights raker

  • Same API as OnlineRakingSGD

Key Parameters

| Parameter | Default | Description |
|---|---|---|
| learning_rate | 5.0 (SGD), 1.0 (MWU) | Step size for updates |
| min_weight | 0.1 | Minimum allowed weight |
| max_weight | 10.0 | Maximum allowed weight |
| n_steps | 3 | Gradient steps per observation |
| convergence_tol | 1e-6 | Loss threshold for convergence |

Installation

pip install onlinerake

Development install:

git clone https://github.com/finite-sample/onlinerake.git
cd onlinerake
pip install -e ".[docs]"

Testing

pytest tests/ -v

Examples

See examples/ for complete worked examples:

  • real_survey_example.py - Basic survey weighting
  • ab_test_calibration.py - Balancing treatment/control groups
  • ad_targeting_calibration.py - Real-time ad delivery calibration
  • recommendation_balancing.py - Content recommendation fairness

Interactive notebooks in docs/notebooks/:

  • 01_getting_started.ipynb - Visual introduction
  • 02_performance_comparison.ipynb - Algorithm benchmarking
  • 03_advanced_diagnostics.ipynb - Convergence and diagnostics

Citation

If you use this package in research, please cite:

@software{onlinerake,
  author = {Sood, Gaurav},
  title = {onlinerake: Streaming Survey Raking},
  url = {https://github.com/finite-sample/onlinerake},
  version = {1.3.0},
  year = {2026}
}

License

MIT
