This package helps you stratify, sample, and estimate.
ssepy: A Library for Efficient Model Evaluation through Stratification, Sampling, and Estimation in Python
Given an unlabeled dataset and model predictions, how can we select which instances to annotate in one go to maximize the precision of our estimates of model performance on the entire dataset?
The ssepy package helps you do that! The implementation of the ssepy package revolves around the following sequential framework:
- Predict: Predict the expected model performance for each example.
- Stratify: Divide the dataset into strata using the base predictions.
- Sample: Sample a data subset using the chosen sampling method.
- Annotate: Acquire annotations for the sampled subset.
- Estimate: Estimate model performance.
See our paper for a technical overview of the framework.
Getting started
To install the package, run
pip install ssepy
Alternatively, clone the repo, cd into it, and run
pip install .
You may want to initialize a conda environment before running this operation.
Test your setup using this example, which demonstrates data stratification, budget allocation for annotation via proportional allocation, sampling via stratified simple random sampling, and estimation using the Horvitz-Thompson estimator:
import numpy as np
from sklearn.cluster import KMeans
from ssepy import ModelPerformanceEvaluator
np.random.seed(0)
# Generate data
N = 100000
Y = np.random.normal(0, 1, N) # Ground truth
# Unobserved target
print(np.mean(Y))
n = 100 # Annotation budget
# 1. Proxy for ground truth
Yh = Y + np.random.normal(0, 0.1, N)
evaluator = ModelPerformanceEvaluator(Yh = Yh, budget = n) # Initialize evaluator
# 2. Stratify on Yh
evaluator.stratify_data(clustering_algo=KMeans(n_clusters=5, random_state=0, n_init="auto"), X=Yh) # 5 strata
# 3. Allocate budget with proportional allocation and sample
evaluator.allocate_budget(allocation_type="proportional")
sampled_idx = evaluator.sample()
# 4. Annotate
Yl = Y[sampled_idx]
# 5. Estimate target and variance of estimate
estimate, variance_estimate = evaluator.compute_estimate(Yl, estimator="ht")
print(estimate, variance_estimate)
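For intuition about what steps 2 to 5 compute, here is a minimal NumPy-only sketch that does not use ssepy. It assumes quantile bins of the proxy as strata (a stand-in for the KMeans clustering above), and the helper names `proportional_allocation` and `stratified_mean_estimate` are hypothetical, not part of the package API. The package's own implementation may differ in its details.

```python
import numpy as np

def proportional_allocation(stratum_sizes, budget):
    """Split `budget` across strata in proportion to stratum size.

    Uses the largest-remainder rule so the allocations sum to `budget`.
    """
    sizes = np.asarray(stratum_sizes, dtype=float)
    raw = budget * sizes / sizes.sum()
    alloc = np.floor(raw).astype(int)
    remainder = int(budget - alloc.sum())
    if remainder > 0:
        # Give the leftover units to the strata with the largest fractional parts.
        order = np.argsort(-(raw - np.floor(raw)))
        alloc[order[:remainder]] += 1
    return alloc

def stratified_mean_estimate(y_sampled, strata_sampled, stratum_sizes):
    """Estimate the population mean under stratified SRS without replacement.

    Returns the estimate and an estimate of its variance.
    Assumes at least two sampled units in every stratum.
    """
    sizes = np.asarray(stratum_sizes, dtype=float)
    total = sizes.sum()
    estimate, variance = 0.0, 0.0
    for h, size in enumerate(sizes):
        y_h = y_sampled[strata_sampled == h]
        n_h = len(y_h)
        w_h = size / total
        estimate += w_h * y_h.mean()
        # Finite population correction times the within-stratum sample variance.
        variance += w_h**2 * (1 - n_h / size) * y_h.var(ddof=1) / n_h
    return estimate, variance

rng = np.random.default_rng(0)
N, budget, H = 100_000, 100, 5
y = rng.normal(0, 1, N)              # ground truth (unobserved in practice)
proxy = y + rng.normal(0, 0.1, N)    # proxy for the ground truth

# Assumption: strata are quantile bins of the proxy, labelled 0..H-1.
edges = np.quantile(proxy, np.linspace(0, 1, H + 1)[1:-1])
strata = np.digitize(proxy, edges)
stratum_sizes = np.bincount(strata, minlength=H)

alloc = proportional_allocation(stratum_sizes, budget)
sampled_idx = np.concatenate([
    rng.choice(np.flatnonzero(strata == h), size=alloc[h], replace=False)
    for h in range(H)
])

estimate, variance = stratified_mean_estimate(
    y[sampled_idx], strata[sampled_idx], stratum_sizes
)
print(y.mean(), estimate, variance)
```

Because the proxy is strongly correlated with the ground truth, the within-stratum variances are small, which is where the gain over simple random sampling comes from.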
For the difference estimator under simple random sampling, run
evaluator = ModelPerformanceEvaluator(Yh=Yh, budget=n) # initialize sampler
sampled_idx = evaluator.sample(sampling_method="srs") # 3. sample
Yl = Y[sampled_idx] # 4. annotate
estimate, variance_estimate = evaluator.compute_estimate(Yl, estimator="df") # 5. estimate
print(estimate, variance_estimate)
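As a rough illustration of the idea behind the difference estimator under simple random sampling, the sketch below corrects the population mean of the proxy by the mean residual observed on the annotated sample. It is a NumPy-only sketch under stated assumptions: the function name `difference_estimate` is hypothetical, and the textbook formulas used here are not guaranteed to match the package's internals line for line.

```python
import numpy as np

def difference_estimate(y_sampled, proxy_sampled, proxy_all):
    """Difference estimator of the population mean under SRS without replacement.

    estimate = mean(proxy over the population) + mean(y - proxy over the sample)
    """
    n, N = len(y_sampled), len(proxy_all)
    residuals = y_sampled - proxy_sampled
    estimate = proxy_all.mean() + residuals.mean()
    # Variance estimate with finite population correction.
    variance = (1 - n / N) * residuals.var(ddof=1) / n
    return estimate, variance

rng = np.random.default_rng(0)
N, n = 100_000, 100
y = rng.normal(0, 1, N)              # ground truth (unobserved in practice)
proxy = y + rng.normal(0, 0.1, N)    # proxy for the ground truth

sampled_idx = rng.choice(N, size=n, replace=False)  # SRS without replacement
y_labelled = y[sampled_idx]                          # annotation step

df_estimate, df_variance = difference_estimate(y_labelled, proxy[sampled_idx], proxy)

# Plain sample mean on the same sample, for comparison.
srs_variance = (1 - n / N) * y_labelled.var(ddof=1) / n
print(y.mean(), df_estimate, df_variance, srs_variance)
```

The variance depends on the spread of the residuals rather than of the labels themselves, so the estimator helps most when the proxy tracks the ground truth closely.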
See also some examples in the associated folder.
Features
The supported sampling designs are simple random sampling without replacement (SRS), stratified simple random sampling without replacement (SSRS) with proportional or optimal (Neyman) allocation, and Poisson sampling. Each sampling method has associated Horvitz-Thompson (HT) and difference (DF) estimators.
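To make the difference between proportional and Neyman allocation concrete, the following sketch allocates a budget in proportion to stratum size times within-stratum standard deviation. The function name `neyman_allocation` is hypothetical and not part of the package API. In practice the standard deviations are unknown and would have to be estimated, for example from the proxy predictions, which is an assumption of this sketch.

```python
import numpy as np

def neyman_allocation(stratum_sizes, stratum_stds, budget):
    """Allocate `budget` across strata proportionally to N_h * S_h.

    Uses the largest-remainder rule so the allocations sum to `budget`.
    Does not cap allocations at the stratum size.
    """
    weights = np.asarray(stratum_sizes, dtype=float) * np.asarray(stratum_stds, dtype=float)
    raw = budget * weights / weights.sum()
    alloc = np.floor(raw).astype(int)
    remainder = int(budget - alloc.sum())
    if remainder > 0:
        # Give the leftover units to the strata with the largest fractional parts.
        order = np.argsort(-(raw - np.floor(raw)))
        alloc[order[:remainder]] += 1
    return alloc

# Two equally sized strata; the second is three times as variable.
print(neyman_allocation([100, 100], [1.0, 3.0], budget=40))

# With equal standard deviations, the rule reduces to proportional allocation.
print(neyman_allocation([200, 300, 500], [1.0, 1.0, 1.0], budget=10))
```

Strata that are larger or more heterogeneous receive more annotations, since that is where additional labels reduce the variance of the estimate the most.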
Bugs and contributions
Feel free to reach out if you find any bugs or you would like other features to be implemented in the package.