Streaming survey raking via SGD and MWU
Project description
onlinerake: Streaming Survey Raking Via MWU and SGD
Modern online surveys and passive data collection streams generate
responses one record at a time. Classic weighting methods such as
iterative proportional fitting (IPF, or “raking”) and calibration
weighting are inherently batch procedures: they reprocess the entire
dataset whenever a new case arrives. The onlinerake package
provides incremental, per‑observation updates to survey weights so
that weighted margins track known population totals in real time.
The package implements two complementary algorithms:
- SGD raking – an additive update that performs stochastic gradient descent on a squared–error loss over the margins. It produces smooth weight trajectories and maintains high effective sample size (ESS).
- MWU raking – a multiplicative update inspired by the multiplicative‑weights update rule. It corresponds to mirror descent under the Kullback–Leibler divergence and yields weight distributions reminiscent of classic IPF. However, it can produce heavier tails when the learning rate is large.
Both methods share the same API: call .partial_fit(obs) for each
incoming observation and inspect properties such as .margins, .loss
and .effective_sample_size to monitor progress.
Installation
Install from PyPI:
pip install onlinerake
For development, clone the repository and install in editable mode:
git clone https://github.com/finite-sample/onlinerake.git
cd onlinerake
pip install -e .
No external dependencies are required beyond numpy and pandas.
Usage
from onlinerake import OnlineRakingSGD, OnlineRakingMWU, Targets
# define target population margins (proportion of the population with indicator = 1)
targets = Targets(age=0.5, gender=0.5, education=0.4, region=0.3)
# instantiate a raker
raker = OnlineRakingSGD(targets, learning_rate=5.0)
# stream demographic observations
for obs in stream_of_dicts:
raker.partial_fit(obs)
print(raker.margins) # current weighted margins
print("final effective sample size", raker.effective_sample_size)
To use the multiplicative‑weights version, replace
OnlineRakingSGD with OnlineRakingMWU and adjust the
learning_rate (a typical default is 1.0). See the docstrings
for full parameter descriptions.
Simulation results
To understand the behavior of the two update rules we simulated three typical non‑stationary bias patterns: a linear drift in demographic composition, a sudden shift halfway through the stream, and an oscillation around the target frame. For each scenario we generated 300 observations per seed and averaged results over five random seeds. SGD used a learning rate of 5.0 and MWU used a learning rate of 1.0 with three update steps per observation. The table below summarizes the mean improvement in absolute margin error relative to the unweighted baseline (positive values indicate an improvement), the final effective sample size (ESS) and the mean final loss (squared‑error on margins). Higher ESS and larger improvements are better.
| Scenario | Method | Age Imp (%) | Gender Imp (%) | Education Imp (%) | Region Imp (%) | Overall Imp (%) | Final ESS | Final Loss |
|---|---|---|---|---|---|---|---|---|
| linear | SGD | 82.8 | 78.6 | 76.8 | 67.5 | 77.0 | 251.8 | 0.00147 |
| linear | MWU | 57.2 | 53.6 | 46.9 | 34.6 | 48.8 | 240.9 | 0.00676 |
| sudden | SGD | 82.9 | 82.3 | 79.6 | 63.5 | 79.5 | 225.5 | 0.00102 |
| sudden | MWU | 52.6 | 51.2 | 46.3 | 26.3 | 47.3 | 175.9 | 0.01235 |
| oscillating | SGD | 69.7 | 78.5 | 65.6 | 72.0 | 72.2 | 278.7 | 0.00023 |
| oscillating | MWU | 49.6 | 57.3 | 48.3 | 50.1 | 52.0 | 276.0 | 0.00048 |
Interpretation
- In all scenarios the online rakers dramatically reduce the margin errors relative to the unweighted baseline. For example, in the sudden‑shift scenario the SGD raker reduces the average age error from 0.20 to about 0.03 (a 83% improvement).
- The SGD update consistently yields higher improvements and lower final loss than the MWU update, albeit at the cost of choosing a more aggressive learning rate.
- The MWU update, while less accurate in these settings, maintains comparable effective sample sizes and might be preferable when multiplicative adjustments are desired (e.g., when starting from unequal base weights).
You can reproduce these results and explore interactive examples by running the Jupyter notebooks:
# Install with documentation dependencies
pip install onlinerake[docs]
# Launch interactive notebooks
jupyter notebook docs/notebooks/
See the 02_performance_comparison.ipynb notebook for detailed algorithm analysis and benchmarking.
Interactive Examples
Comprehensive examples with visualizations are provided as Jupyter notebooks in docs/notebooks/:
01_getting_started.ipynb- Introduction with visual demonstrations02_performance_comparison.ipynb- Algorithm benchmarking and analysis03_advanced_diagnostics.ipynb- Convergence monitoring and diagnostics
The notebooks include:
- Correcting feature bias in online surveys
- Real-time polling with demographic shifts
- Performance comparison between SGD and MWU algorithms
- Visual validation that the algorithms work correctly
Run the interactive examples:
# Install with documentation dependencies
pip install onlinerake[docs]
# Launch Jupyter notebooks
jupyter notebook docs/notebooks/
Testing
Run the comprehensive test suite:
pytest tests/test_onlinerake.py -v
Contributing
Pull requests are welcome! Feel free to open issues if you find bugs or have suggestions for new features, such as support for multi‑level controls or adaptive learning‑rate schedules.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file onlinerake-1.0.0.tar.gz.
File metadata
- Download URL: onlinerake-1.0.0.tar.gz
- Upload date:
- Size: 13.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
67b90574f775ff06cf950ebd597c40c138e3ab06bfcf2b883f224aa50c1d57d7
|
|
| MD5 |
2ca82469e4a8f0b99c02656ce29b4fae
|
|
| BLAKE2b-256 |
dd7fe32a008d1cb72205b06115230e6e4aba860309e71a59a78a32fb2f1a5683
|
Provenance
The following attestation bundles were made for onlinerake-1.0.0.tar.gz:
Publisher:
python-publish.yml on finite-sample/onlinerake
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
onlinerake-1.0.0.tar.gz -
Subject digest:
67b90574f775ff06cf950ebd597c40c138e3ab06bfcf2b883f224aa50c1d57d7 - Sigstore transparency entry: 774436857
- Sigstore integration time:
-
Permalink:
finite-sample/onlinerake@d8af17a118ccca3c5d95ef83b26123ff6d4bbb41 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/finite-sample
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@d8af17a118ccca3c5d95ef83b26123ff6d4bbb41 -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file onlinerake-1.0.0-py3-none-any.whl.
File metadata
- Download URL: onlinerake-1.0.0-py3-none-any.whl
- Upload date:
- Size: 16.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
dbf7aa1b411059640863f9a0ad963c9c69ea9f8823848a18680c9aca54a6a551
|
|
| MD5 |
2b57b4a0872173cfc1502503bc946835
|
|
| BLAKE2b-256 |
d688b6c93b17b6ab52ce526ebf1ac805ccce1954e7e0c42af9a3c838246fb8fc
|
Provenance
The following attestation bundles were made for onlinerake-1.0.0-py3-none-any.whl:
Publisher:
python-publish.yml on finite-sample/onlinerake
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
onlinerake-1.0.0-py3-none-any.whl -
Subject digest:
dbf7aa1b411059640863f9a0ad963c9c69ea9f8823848a18680c9aca54a6a551 - Sigstore transparency entry: 774436859
- Sigstore integration time:
-
Permalink:
finite-sample/onlinerake@d8af17a118ccca3c5d95ef83b26123ff6d4bbb41 -
Branch / Tag:
refs/heads/main - Owner: https://github.com/finite-sample
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
python-publish.yml@d8af17a118ccca3c5d95ef83b26123ff6d4bbb41 -
Trigger Event:
workflow_dispatch
-
Statement type: