
DRrank

DRrank is a Python library to implement the Empirical Bayes ranking scheme developed in Kline, Rose, and Walters (2023). This code was originally developed by Hadar Avivi.

Installation

The package uses the Gurobi optimizer. To use DRrank you must first install Gurobi and acquire a license; Gurobi's documentation provides guidance on both steps. Gurobi offers a variety of free licenses for academic use; see Gurobi's academic program page for more information.

After setting up gurobipy, install DRrank via pip:

pip install drrank

Usage

1. Load sample data

DRrank grades units based on noisy estimates of a latent attribute. You can construct these estimates however you'd like; all DRrank requires is a vector of estimates, $\hat{\theta}_i$, and their associated standard errors, $s_i$.

To illustrate the package's features, this readme uses the data in example/name_example.csv, which contains estimates of name-specific contact rates from the experiment studied in Kline, Rose, and Walters (2023). These contact rates have been adjusted to stabilize their variances using the Bartlett (1936) transformation. Variance stabilization is useful because the deconvolution procedure used in Step 2 below requires that $s_i$ be independent of $\theta_i$. In cases where variance stabilization is not possible, independence can sometimes be restored by residualizing $\hat{\theta}_i$ against $s_i$; see Section 5 of Kline, Rose, and Walters (2023) for a detailed example. The transformation used in our names example computes estimates as $\hat{\theta}_i = \sin^{-1}\sqrt{\hat{p}_i}$, where $\hat{p}_i$ is the share of applications with name $i$ that received a callback. As discussed in the paper, $\hat{\theta}_i$ has an asymptotic variance of $(4N_i)^{-1}$, where $N_i$ is the number of applications sent with name $i$.
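For reference, here is a minimal sketch of the Bartlett (1936) transform, assuming a hypothetical raw dataset with per-name callback and application counts (the column names and values are illustrative and not part of the shipped example file):

import numpy as np
import pandas as pd

# Hypothetical raw counts per name (illustration only)
raw = pd.DataFrame({'callbacks': [120, 95, 110], 'apps': [1000, 980, 1005]})

p_hat = raw['callbacks'] / raw['apps']  # raw contact rates
theta_hat = np.arcsin(np.sqrt(p_hat))   # variance-stabilizing arcsine transform
se = np.sqrt(1 / (4 * raw['apps']))     # standard errors from the (4 N_i)^{-1} asymptotic variance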

import os
import pandas as pd

# Read in the data
data = pd.read_csv(os.path.join(os.getcwd(), 'example', 'name_example.csv'))
data.head()

name_id  thetas    s          firstname
1        0.53788   0.0137214  Adam
2        0.518394  0.0173332  Aisha
3        0.53767   0.0174933  Allison
4        0.532129  0.0138126  Amanda
5        0.534998  0.0136503  Amy

While DRrank provides the functionality to account for any variance-stabilizing transformation, using one is not strictly necessary. DRrank can also accommodate cases where the $\hat{\theta}_i$ directly capture untransformed estimates of the relevant latent attribute, as in the sketch below.
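Looking ahead to Step 2, initializing the estimator without a transform would then look like this (a sketch assuming the transform argument is optional):

from drrank_distribution import prior_estimate

# No variance-stabilizing transform: thetas are already on the scale of interest
# (assumes transform is optional and defaults to the identity)
G_raw = prior_estimate(thetas, s)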

2. Estimating the prior

DRrank provides the functionality to estimate a prior distribution using a variation on the log-spline deconvolution approach of Efron (2016), which uses an exponential family mixing distribution whose density is parameterized by a flexible $B$-th order natural cubic spline (the order is set via the spline_order option below).

To estimate the prior, generate an instance of the prior_estimate class with each unit's estimated latent attribute, $\hat{\theta}_i$, and its associated standard error. You also have the option of supplying an inverse transform in case the $\hat{\theta}_i$ have been transformed to stabilize variances. The appropriate inverse transform for ranking the name-specific contact rates in Kline, Rose, and Walters (2023), for example, is $f(x) = \sin^2(x)$. The inverse transform function should be vectorized.

import numpy as np
from drrank_distribution import prior_estimate

thetas = data.thetas.values  # vector of estimates
s = data.s.values  # vector of standard errors

# Initialize the estimator
G = prior_estimate(thetas, s, transform=lambda x: np.power(np.sin(x), 2))

# Estimate the prior distribution
G.estimate_prior(supp_points=5000, spline_order=5, seed=123)

Use the supp_points option (default=5000) to pick the number of points of support over which to evaluate the prior density. The minimum and maximum of the support are the minimum and maximum of thetas.

Use the spline_order option (default=5) to adjust the degrees of freedom of the spline that parameterizes the mixing distribution.
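For example, a coarser support grid and a lower-order spline trade flexibility for speed; this is simply a usage sketch of the two documented options:

# Coarser support grid and lower spline order: faster, less flexible fit
G.estimate_prior(supp_points=1000, spline_order=3, seed=123)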

The estimated prior distribution will be saved as a dictionary. You can access it with the following code:

# Dictionary with the prior distribution
G.prior_g

# Keys:
# mean_theta: mean of the prior
G.prior_g['mean_theta']
# sd_theta: standard deviation of the prior
G.prior_g['sd_theta']
# g_theta: estimated prior density
G.prior_g['g_theta']
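If you prefer to inspect the density directly, you can plot it against its support grid yourself. This sketch assumes that G.supp_theta (referenced in Step 3 below) holds the grid on which g_theta is evaluated:

import matplotlib.pyplot as plt

# Plot the estimated prior density over its support grid
plt.plot(G.supp_theta, G.prior_g['g_theta'])
plt.xlabel('theta')
plt.ylabel('g(theta)')
plt.show()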

You can then graph the results by calling the following function:

# Plot the estimated prior distribution
G.plot_estimates(save_path="example/prior_distribution.jpg",
                 binwidth=0.0030,
                 line_kws={'alpha': 0.6},
                 fill=True,
                 alpha=0.3)

(Figure: estimated prior distribution)

Within the function you can specify the following arguments:

  • g_theta: provide your own prior distribution G. None implies the function will utilize the estimated G from the estimate_prior() method (default = None).
  • show_plot: whether to show the plot or not (default = True).
  • save_path: path to where the plot will be saved. None implies the graph will not be saved (default = None).

You can also further customize the histogram, which inherits all the arguments of seaborn.histplot; see the sketch below.
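For instance, standard seaborn.histplot keywords can be passed through directly (color and kde below are histplot arguments, not DRrank-specific ones; this assumes plot_estimates forwards them unchanged):

# Pass seaborn.histplot keywords through to customize the histogram
G.plot_estimates(binwidth=0.002, color='grey', kde=False)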

3. Estimation of posterior features and $P$ matrix

Once the prior distribution has been estimated, you can estimate posterior means and credible intervals, as well as the matrix of pairwise posterior ordering probabilities $P$.

# Compute the posterior features
G.compute_posteriors(alpha=.05, g_theta=None)
G.pmean # posterior means
G.pmean_trans # inverse transformed posterior means
G.lci # lower limit of 1-alpha credible interval
G.uci # upper limit of 1-alpha credible interval
G.lci_trans # lower limit of inverse transformed 1-alpha credible interval
G.uci_trans # upper limit of inverse transformed 1-alpha credible interval

G.posterior_df.head()  # DataFrame of posterior features
   pmean     pmean_trans  lci       uci       lci_trans  uci_trans
0  0.522524  0.249088     0.50771   0.532169  0.236368   0.257458
1  0.515032  0.242678     0.493166  0.529646  0.224124   0.255255
2  0.52051   0.247361     0.497425  0.531542  0.227686   0.25691
3  0.521265  0.248005     0.500928  0.531446  0.23063    0.256826
4  0.521993  0.248631     0.50456   0.531815  0.233697   0.257149
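Because posterior_df is an ordinary pandas DataFrame, it is straightforward to attach labels and sort, e.g. by posterior mean (a sketch assuming its rows align with the input data):

# Attach names and sort by posterior mean (rows align with the input data)
post = G.posterior_df.copy()
post['firstname'] = data.firstname.values
post.sort_values('pmean', ascending=False).head()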

Then compute the pairwise ordering probabilities $P$ using:

# Compute the pairwise ordering probabilities
pis = G.compute_pis(g_theta=None, ncores=-1, power=0)

In both functions, it is possible to provide your own prior distribution G by passing an array as the g_theta argument. This density must take support on the values determined by G.supp_theta.

compute_pis also provides the option to compute the elements of $P$, $\pi_{ij}$, as the posterior expectation of $\max(\theta_i - \theta_j, 0)^{\text{power}}$, providing an extension to weighted ranking exercises. The default, power=0, produces the posterior ordering probabilities discussed in the next section and implies that ranking mistakes are considered equally costly regardless of the cardinal difference between $\theta_i$ and $\theta_j$.
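As a quick sanity check (a sketch assuming compute_pis returns a square NumPy array), you can verify the coherence condition $\pi_{ij} = 1 - \pi_{ji}$ that the ranking step expects:

import numpy as np

# Off-diagonal elements should satisfy pi_ij = 1 - pi_ji
P = np.asarray(pis)
off_diag = ~np.eye(P.shape[0], dtype=bool)
assert np.allclose(P[off_diag] + P.T[off_diag], 1.0, atol=1e-6)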

4. Computing grades

To compute rankings, use the fit function with a matrix $P$. In the unweighted case, $P$ reflects the posterior probabilities that unit $i$'s latent attribute $\theta_i$ exceeds unit $j$'s. That is, each element of this matrix takes the form:

$\pi_{ij} = Pr(\theta_i > \theta_j | Y_i = y_i, Y_j = y_j)$

DRrank expects these probabilities to satisfy $\pi_{ij} = 1-\pi_{ji}$ and will report a warning if that does not appear to be the case.

In the weighted case, $P$ captures expected differences between $i$'s and $j$'s values of $\theta$, as discussed above.

There are two ways to use DRrank.

First, one can supply a parameter $\lambda \in [0,1]$, which corresponds to the user's value of correctly ranking pairs of units relative to the cost of misclassifying them. $\lambda=1$ implies correct and incorrect rankings are valued equally, while $\lambda=0$ implies correct rankings are not valued at all. In the unweighted case, it is optimal to assign unit $i$ a higher grade than unit $j$ when $\pi_{ij} > 1/(1+\lambda)$, so $\lambda$ also determines the minimum level of posterior certainty required to rank units pairwise. For example, with $\lambda = 0.25$, unit $i$ receives a higher grade than unit $j$ only when $\pi_{ij} > 0.8$.

from drrank import fit

# Fit the report card function
results = fit(pis, lamb=0.25, DR=None)

The results object contains the row index of $P$, the assigned grades, and the Condorcet rank (i.e., grade under $\lambda=1$).

obs_idx  grades_lamb0.25  condorcet_rank
1        1                3
2        1                32
3        1                12
4        1                9
5        1                5
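For instance, you can count how many distinct grades the $\lambda = 0.25$ fit assigns (the column name matches the output shown above):

# Number of distinct grades assigned under lambda = 0.25
results['grades_lamb0.25'].nunique()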

We also provide functionality to compute results for a list of values of $\lambda$ in parallel:

import numpy as np
from drrank import fit_multiple

# Try different values of Lambda
results_l = fit_multiple(pis, np.arange(0, 0.9, 0.01))

Second, one can ask DRrank to compute grades that maximize Kendall's (1938) $\tau$, a measure of the rank correlation between units' latent rankings and assigned grades, subject to a constraint on the expected share of pairs of units incorrectly classified, which we refer to as the Discordance Rate (DR).

# Fit the report card function
results_dr = fit(pis, lamb=None, DR=0.05)

Finally, we provide functionality to plot grades along with posterior means and credible intervals:

from drrank import fig_ranks

# Merge the results with the identity of our observations
results['firstname'] = data.firstname
fig_ranks(ranking=results,
          posterior_features=G.posterior_df,
          ylabels='firstname',
          ylabel_fontsize=8,
          save_path='example/name_ranking.jpg')

(Figure: name ranking with posterior means and credible intervals)

Within the function you can specify the following arguments:

  • ranking: ranking results from drrank.fit
  • posterior_features: posterior features computed through G.compute_posteriors()
  • ylabels: optional, specify the column in results where we have stored the labels of each observation (default = None)
  • ylabel_fontsize: optional, specify font size for labels (default = 8)
  • show_plot: whether to show the plot or not (default = True)
  • save_path: path to where the plot will be saved; None implies the graph will not be saved (default = None)
