This package helps you stratify, sample, and estimate.

ssepy: A Library for Efficient Model Evaluation through Stratification, Sampling, and Estimation in Python

Paper | License: Apache-2.0

Given an unlabeled dataset and model predictions, how can we select which instances to annotate in one go to maximize the precision of our estimates of model performance on the entire dataset?

The ssepy package helps you do that! Its implementation follows a five-step sequential framework:

  1. Predict: Predict the expected model performance for each example.
  2. Stratify: Divide the dataset into strata using the base predictions.
  3. Sample: Sample a data subset using the chosen sampling method.
  4. Annotate: Acquire annotations for the sampled subset.
  5. Estimate: Estimate model performance.

See our paper here for a technical overview of the framework.

Getting started

To install the package, run

pip install ssepy

Alternatively, clone the repo, cd into it, and run

pip install .

You may want to create and activate a conda environment before installing.

Test your setup using this example, which demonstrates data stratification, allocation of the annotation budget n across strata via proportional allocation, sampling via stratified simple random sampling, and estimation using the Horvitz-Thompson estimator:

import numpy as np
from sklearn.cluster import KMeans
from ssepy import ModelPerformanceEvaluator

np.random.seed(0)
# Generate data
N = 100000
Y = np.random.normal(0, 1, N) # Ground truth

# Unobserved target
print(np.mean(Y))

n = 100 # Annotation budget: number of instances to label
# 1. Proxy for ground truth
Yh = Y + np.random.normal(0, 0.1, N)
evaluator = ModelPerformanceEvaluator(Yh=Yh, budget=n) # Initialize evaluator
# 2. Stratify on Yh
evaluator.stratify_data(clustering_algo=KMeans(n_clusters=5, random_state=0, n_init="auto"), X=Yh) # 5 strata
# 3. Allocate the budget across strata with proportional allocation and sample
evaluator.allocate_budget(allocation_type="proportional")
sampled_idx = evaluator.sample()
# 4. Annotate
Yl = Y[sampled_idx]
# 5. Estimate target and variance of estimate
estimate, variance_estimate = evaluator.compute_estimate(Yl, estimator="ht")
print(estimate, variance_estimate)
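To make the quantities in this example concrete, here is a minimal NumPy-only sketch of proportional allocation and the Horvitz-Thompson estimate of a population mean under stratified simple random sampling. It does not use ssepy and is not a description of the package's internals: the function names `proportional_allocation` and `stratified_ht_mean` are hypothetical, strata are built from quantiles of the proxy rather than with KMeans, and every stratum is assumed to receive at least one sample.

```python
import numpy as np

def proportional_allocation(stratum_sizes, budget):
    """Split `budget` across strata in proportion to stratum size."""
    sizes = np.asarray(stratum_sizes, dtype=float)
    raw = budget * sizes / sizes.sum()
    alloc = np.floor(raw).astype(int)
    # Give the units lost to rounding to the strata with the largest remainders
    leftover = budget - alloc.sum()
    order = np.argsort(-(raw - alloc))
    alloc[order[:leftover]] += 1
    return alloc

def stratified_ht_mean(y_sampled, strata_sampled, stratum_sizes):
    """Estimate the population mean and the variance of that estimate."""
    sizes = np.asarray(stratum_sizes, dtype=float)
    N = sizes.sum()
    estimate, variance = 0.0, 0.0
    for h, N_h in enumerate(sizes):
        y_h = y_sampled[strata_sampled == h]  # assumed non-empty
        n_h = len(y_h)
        w_h = N_h / N  # stratum weight
        estimate += w_h * y_h.mean()
        if n_h > 1:
            # Finite-population correction times the sample variance over n_h
            variance += w_h**2 * (1 - n_h / N_h) * y_h.var(ddof=1) / n_h
    return estimate, variance

rng = np.random.default_rng(0)
N = 10_000
y = rng.normal(0, 1, N)            # ground truth (unobserved in practice)
yh = y + rng.normal(0, 0.1, N)     # proxy for ground truth

# Stratify on the proxy using its quintiles (labels 0..4)
edges = np.quantile(yh, [0.2, 0.4, 0.6, 0.8])
strata = np.searchsorted(edges, yh)
sizes = np.bincount(strata, minlength=5)

# Allocate a budget of 100 annotations, then sample without replacement per stratum
alloc = proportional_allocation(sizes, budget=100)
idx = np.concatenate([
    rng.choice(np.flatnonzero(strata == h), size=n_h, replace=False)
    for h, n_h in enumerate(alloc)
])

est, var = stratified_ht_mean(y[idx], strata[idx], sizes)
print(est, var, y.mean())
```

Because the strata are formed on a proxy that tracks the ground truth closely, the within-stratum variances are small, which is what makes the stratified estimate more precise than a plain sample mean at the same budget.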

For the difference estimator under simple random sampling, run

evaluator = ModelPerformanceEvaluator(Yh=Yh, budget=n) # initialize sampler
sampled_idx = evaluator.sample(sampling_method="srs") # 3. sample
Yl = Y[sampled_idx] # 4. annotate
estimate, variance_estimate = evaluator.compute_estimate(Yl, estimator="df") # 5. estimate
print(estimate, variance_estimate)
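For intuition, the textbook difference estimator under simple random sampling without replacement can be sketched in plain NumPy as follows. The sketch assumes the target is the population mean of Y and that Yh is available for every instance; the function name `difference_estimate_srs` is hypothetical, and whether ssepy's "df" option computes exactly these expressions is an assumption, not something stated above.

```python
import numpy as np

def difference_estimate_srs(yh_all, y_sampled, sampled_idx):
    """Difference estimate of the population mean and its variance estimate."""
    yh_all = np.asarray(yh_all, dtype=float)
    # Residuals between labels and proxy on the annotated sample
    d = np.asarray(y_sampled, dtype=float) - yh_all[sampled_idx]
    n, N = len(d), len(yh_all)
    # Population mean of the proxy, corrected by the mean sampled residual
    estimate = yh_all.mean() + d.mean()
    # SRS variance of the residual mean, with finite-population correction
    variance = (1 - n / N) * d.var(ddof=1) / n
    return estimate, variance

rng = np.random.default_rng(0)
N = 10_000
y = rng.normal(0, 1, N)            # ground truth (unobserved in practice)
yh = y + rng.normal(0, 0.1, N)     # proxy for ground truth

idx = rng.choice(N, size=100, replace=False)  # simple random sample
est, var = difference_estimate_srs(yh, y[idx], idx)
print(est, var, y.mean())
```

The variance depends on the spread of the residuals Y - Yh rather than of Y itself, so the better the proxy, the tighter the estimate.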

See also some examples in the associated folder.

Features

The supported sampling designs are simple random sampling without replacement (SRS), stratified simple random sampling without replacement (SSRS) with proportional or optimal (Neyman) allocation, and Poisson sampling. Each design has associated Horvitz-Thompson (HT) and difference (DF) estimators.
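As an illustration of how optimal (Neyman) allocation differs from proportional allocation, the sketch below assigns each stratum a share of the budget proportional to its size times its standard deviation. It is a standalone NumPy example, not ssepy code: the function name `neyman_allocation` is hypothetical, and the per-stratum standard deviations are assumed to be supplied by the caller (in practice they would have to be estimated, for instance from the proxy predictions).

```python
import numpy as np

def neyman_allocation(stratum_sizes, stratum_stds, budget):
    """Allocate `budget` with n_h proportional to N_h * S_h."""
    sizes = np.asarray(stratum_sizes, dtype=float)
    stds = np.asarray(stratum_stds, dtype=float)
    weights = sizes * stds  # assumes at least one stratum has nonzero spread
    raw = budget * weights / weights.sum()
    alloc = np.floor(raw).astype(int)
    # Give the units lost to rounding to the strata with the largest remainders
    leftover = budget - alloc.sum()
    order = np.argsort(-(raw - alloc))
    alloc[order[:leftover]] += 1
    # Never request more samples than a stratum contains; this simple cap
    # may leave part of the budget unused rather than redistributing it
    return np.minimum(alloc, sizes.astype(int))

# Two equally sized strata; the second is three times as variable
alloc = neyman_allocation([100, 100], [1.0, 3.0], budget=40)
print(alloc)  # -> [10 30]
```

Proportional allocation would give both strata 20 samples here; Neyman allocation shifts annotations toward the noisier stratum, where each extra label reduces the variance of the estimate the most.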

Bugs and contributions

Feel free to reach out if you find any bugs or you would like other features to be implemented in the package.
