
This package helps you stratify, sample, and estimate.


ssepy: A Library for Efficient Model Evaluation through Stratification, Sampling, and Estimation in Python

Paper Apache-2.0

Given an unlabeled dataset and model predictions, how can we select which instances to annotate in one go to maximize the precision of our estimates of model performance on the entire dataset?

The ssepy package helps you do that. It is built around the following sequential framework:

  1. Predict: Predict the expected model performance for each example.
  2. Stratify: Divide the dataset into strata using the base predictions.
  3. Sample: Sample a data subset using the chosen sampling method.
  4. Annotate: Acquire annotations for the sampled subset.
  5. Estimate: Estimate model performance.

See our paper here for a technical overview of the framework.

Getting started

To install the package, download the repo, cd into it, and run

pip install .

You may want to initialize a conda environment before running this operation.

Test your setup using this example, which demonstrates data stratification, budget allocation for annotation via proportional allocation, sampling via stratified simple random sampling, and estimation using the Horvitz-Thompson estimator:

import numpy as np
from sklearn.cluster import KMeans
from ssepy import ModelPerformanceEvaluator

np.random.seed(0)
# Generate data
total_samples = 100000
true_performance = np.random.normal(0, 1, total_samples) # Ground truth

# Unobserved target
print(np.mean(true_performance))

annotation_budget = 100 # Annotation budget
# 1. Proxy for ground truth
proxy_performance = true_performance + np.random.normal(0, 0.1, total_samples)
evaluator = ModelPerformanceEvaluator(proxy_performance=proxy_performance, budget=annotation_budget) # Initialize evaluator
# 2. Stratify on proxy_performance
evaluator.stratify_data(clustering_algorithm=KMeans(n_clusters=5, random_state=0, n_init="auto"), features=proxy_performance) # 5 strata
# 3. Allocate budget with proportional allocation and sample
evaluator.allocate_budget(allocation_type="proportional")
sample_indices = evaluator.sample_data(sampling_method="ssrs")
# 4. Annotate
sampled_performance = true_performance[sample_indices]
# 5. Estimate target and variance of estimate
estimate, variance_estimate = evaluator.compute_estimate(sampled_performance, estimator="ht")
print(estimate, variance_estimate)
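
For intuition, the stratified estimate in the example above can be reproduced by hand with NumPy alone. The following is a minimal sketch, not ssepy's implementation: it assumes equal-frequency bins of the proxy instead of KMeans clusters, and it uses the standard textbook formulas for the stratified mean and its variance under sampling without replacement. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data mirroring the example above (illustrative only)
N = 100_000
true_performance = rng.normal(0, 1, N)  # unobserved in practice
proxy_performance = true_performance + rng.normal(0, 0.1, N)
budget = 100

# Stratify: five equal-frequency bins of the proxy
# (an assumption made here; the example above uses KMeans)
H = 5
edges = np.quantile(proxy_performance, [0.2, 0.4, 0.6, 0.8])
strata = np.digitize(proxy_performance, edges)  # labels 0..4

# Proportional allocation: n_h proportional to the stratum size N_h
N_h = np.bincount(strata, minlength=H)
n_h = np.maximum(2, np.round(budget * N_h / N).astype(int))

# Sample without replacement within each stratum, then "annotate"
stratum_means = np.empty(H)
stratum_vars = np.empty(H)
for h in range(H):
    members = np.flatnonzero(strata == h)
    sampled = rng.choice(members, size=n_h[h], replace=False)
    labels = true_performance[sampled]  # annotation step
    stratum_means[h] = labels.mean()
    stratum_vars[h] = labels.var(ddof=1)

# Estimate: weighted stratum means, with finite-population correction
W_h = N_h / N
estimate = np.sum(W_h * stratum_means)
variance_estimate = np.sum(W_h**2 * (1 - n_h / N_h) * stratum_vars / n_h)
print(estimate, variance_estimate)
```

Because the strata are built from a proxy that tracks the ground truth closely, the within-stratum variances are small, which is what makes the stratified estimate more precise than a plain sample mean at the same budget.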

For the difference estimator under simple random sampling, run

evaluator = ModelPerformanceEvaluator(proxy_performance=proxy_performance, budget=annotation_budget) # initialize sampler
sample_indices = evaluator.sample_data(sampling_method="srs") # 2. sample
sampled_performance = true_performance[sample_indices] # 4. annotate
estimate, variance_estimate = evaluator.compute_estimate(sampled_performance, estimator="df") # 5. estimate
print(estimate, variance_estimate)
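
As a rough illustration of what the difference estimator computes under simple random sampling, here is a NumPy-only sketch. It is not ssepy's code: it assumes the usual textbook form, in which the population mean of the proxy is corrected by the average residual observed on the annotated sample. Variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data mirroring the example above (illustrative only)
N = 100_000
true_performance = rng.normal(0, 1, N)  # unobserved in practice
proxy_performance = true_performance + rng.normal(0, 0.1, N)
n = 100

# Simple random sample without replacement, then "annotate"
sampled = rng.choice(N, size=n, replace=False)
residuals = true_performance[sampled] - proxy_performance[sampled]

# Difference estimator: proxy mean over the whole dataset + mean residual
df_estimate = proxy_performance.mean() + residuals.mean()
df_variance = (1 - n / N) * residuals.var(ddof=1) / n

# For comparison: the plain sample mean, which ignores the proxy
mean_estimate = true_performance[sampled].mean()
mean_variance = (1 - n / N) * true_performance[sampled].var(ddof=1) / n

print(df_estimate, df_variance)
print(mean_estimate, mean_variance)
```

The variance of the difference estimator depends on the spread of the residuals rather than on the spread of the ground truth, so an accurate proxy yields a much tighter estimate.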

The difference estimator is also implemented in the ppi_py package, whose prediction-powered estimator coincides with the difference estimator in the case of mean estimation. That package offers more functionality than ssepy does for this estimator, so check it out if you're interested in using it.

Examples

The repo comes with a series of examples contained in the examples folder:

  • sampling-and-estimation.ipynb for an example on how to stratify, sample, and estimate
  • oracle-estimation.ipynb on the computation of the (oracle) efficiency of the estimators under various survey designs, assuming we had access to all ground truth variables. This file contains the core part of the code underlying the results in the paper

Features

The supported sampling designs are simple random sampling without replacement (SRS), stratified simple random sampling without replacement (SSRS) with proportional or optimal (Neyman) allocation, and Poisson sampling. All sampling methods have associated Horvitz-Thompson (HT) and difference (DF) estimators.
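
To make the two allocation rules concrete, the sketch below compares them on a toy population. It is an illustration under stated assumptions, not ssepy's API: the helper names are hypothetical, the stratum standard deviations `S_h` are assumed known (in practice they would have to be estimated, for example from the proxy), and the results are left as real numbers, so rounding to integer sample sizes is omitted.

```python
import numpy as np

def proportional_allocation(N_h, budget):
    """Allocate the budget in proportion to stratum sizes N_h."""
    N_h = np.asarray(N_h, dtype=float)
    return budget * N_h / N_h.sum()

def neyman_allocation(N_h, S_h, budget):
    """Allocate the budget in proportion to N_h * S_h (Neyman allocation)."""
    weights = np.asarray(N_h, dtype=float) * np.asarray(S_h, dtype=float)
    return budget * weights / weights.sum()

# Toy population: three strata with different sizes and spreads
N_h = np.array([6000, 2000, 2000])
S_h = np.array([0.25, 0.5, 1.25])  # assumed within-stratum standard deviations
budget = 100

prop = proportional_allocation(N_h, budget)  # 60, 20, 20
neyman = neyman_allocation(N_h, S_h, budget)  # 30, 20, 50
print(prop)
print(neyman)
```

Neyman allocation shifts annotations away from the large but homogeneous first stratum and toward the small but highly variable third one, which is where additional labels reduce the variance of the estimate the most.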

Bugs and contributions

Feel free to reach out if you find any bugs or you would like other features to be implemented in the package.
