Two-sample Higher Criticism
Project description
TwoSampleHC: Higher Criticism Test between Two Frequency Tables
An adaptation of the Donoho-Jin-Tukey Higher Criticism (HC) test to frequency tables. This adaptation uses a binomial allocation model for the number of occurrences of each feature in two samples, each of which is associated with a frequency table. An exact binomial test on each feature yields a P-value. The HC statistic combines these P-values into a single score that can be used either as a measure of similarity between the tables or to construct a test against the null hypothesis that the two tables are sampled from the same population.
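To make the per-feature step concrete, here is a minimal sketch of an exact two-sided binomial test applied to each feature, using only the standard library. It is not the package's implementation: the helper names (`binom_log_pmf`, `exact_binom_pvalue`, `binomial_pvals_sketch`) are hypothetical, and the null success probability is assumed to be the first sample's share of all counts, which may differ from the allocation rule the package uses. Both samples are assumed to be non-empty.

```python
import math


def binom_log_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def exact_binom_pvalue(k, n, p):
    """Two-sided exact P-value: total mass of outcomes no more likely than k."""
    if n == 0:
        # Assumption: a feature absent from both samples carries no evidence.
        return 1.0
    log_pmf = [binom_log_pmf(j, n, p) for j in range(n + 1)]
    cutoff = log_pmf[k] + 1e-7  # small tolerance for floating-point ties
    return min(1.0, sum(math.exp(v) for v in log_pmf if v <= cutoff))


def binomial_pvals_sketch(counts1, counts2):
    """Per-feature P-values for two aligned lists of counts (hypothetical helper)."""
    n1, n2 = sum(counts1), sum(counts2)
    # Assumption: under the null, each occurrence of a feature falls in
    # sample 1 with probability equal to sample 1's share of all counts.
    p = n1 / (n1 + n2)
    return [exact_binom_pvalue(x, x + y, p) for x, y in zip(counts1, counts2)]


# Four features; the third one does not occur in either sample.
pvals = binomial_pvals_sketch([30, 12, 0, 8], [10, 14, 0, 16])
print(pvals)
```

Features whose counts split between the two samples roughly in proportion to the sample sizes get P-values near 1, while features concentrated in one sample get small P-values.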
This test is particularly useful in identifying non-null effects under weak and sparse alternatives, i.e., when the difference between the tables is due to a few features and the evidence each such feature provides is relatively weak. More details and applications in text classification challenges can be found in:
[1] Alon Kipnis, "Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship", 2019.
[2] David Donoho and Alon Kipnis, "Two-sample Testing for Large, Sparse High-Dimensional Multinomials under Rare and Weak Perturbations", 2020.
Example:
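import numpy as np
# assumption: both functions are importable from the TwoSampleHC package
from TwoSampleHC import two_sample_pvals, hc_vals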
N = 1000 # number of features
n = 5 * N # number of draws in each sample
P = 1 / np.arange(1, N + 1) # Zipf base distribution
P = P / P.sum()
ep = 0.03 # fraction of features to perturb
mu = 0.005 # intensity of perturbation
TH = np.random.rand(N) < ep # random mask of perturbed features
Q = P.copy()
Q[TH] += mu
Q = Q / np.sum(Q)
smp_P = np.random.multinomial(n, P) # sample from P
smp_Q = np.random.multinomial(n, Q) # sample from Q
pv = two_sample_pvals(smp_Q, smp_P) # binomial P-values
HC, p_th = hc_vals(pv, alpha = 0.25) # Higher Criticism test
print("TV distance between P and Q: ", 0.5*np.sum(np.abs(P - Q)))
print("Higher-Criticism score for testing P == Q: ", HC)
# TV distance between P and Q: 0.11229216095188953
# Higher-Criticism score for testing P == Q: 3.874043440201504
# (HC score rarely goes above 2.5 if P == Q)
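For readers curious about what a call like `hc_vals(pv, alpha = 0.25)` might compute, the following is a minimal sketch of the standard Donoho-Jin form of the HC statistic: sort the P-values, compare each one with its expected rank under uniformity, and take the largest standardized gap over the smallest `alpha` fraction. The function name `hc_sketch` is hypothetical, and whether the package uses exactly this normalization or handles edge cases the same way is an assumption, not something the description above confirms.

```python
import math


def hc_sketch(pvals, alpha=0.25):
    """Return (HC score, P-value at which the maximum is attained)."""
    p_sorted = sorted(pvals)
    n = len(p_sorted)
    # Only the smallest alpha-fraction of the P-values is scanned.
    max_i = max(1, int(alpha * n))
    best_score, best_p = -math.inf, None
    for i in range(1, max_i + 1):
        p_i = p_sorted[i - 1]
        spread = math.sqrt(p_i * (1.0 - p_i))
        if spread == 0.0:
            # Assumption: skip P-values equal to 0 or 1 to avoid dividing by zero.
            continue
        # Standardized gap between the empirical fraction i/n and p_i.
        score = math.sqrt(n) * (i / n - p_i) / spread
        if score > best_score:
            best_score, best_p = score, p_i
    return best_score, best_p


# One very small P-value among otherwise unremarkable ones.
hc, p_th = hc_sketch([0.001, 0.5, 0.6, 0.9], alpha=0.5)
print(hc, p_th)
```

The second return value plays the role of a threshold: features whose P-values fall at or below it are the ones driving the score. If every scanned P-value is exactly 0 or 1, this sketch returns `(-inf, None)`.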
Download files
Filename                                      Size    File type  Python version
TwoSampleHC_kipnisal-0.0.1-py3-none-any.whl   3.8 kB  Wheel      py3
TwoSampleHC-kipnisal-0.0.1.tar.gz             2.3 kB  Source     None
Hashes for TwoSampleHC_kipnisal-0.0.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       19d30d8c1bc8a30e5ee2f150a59fb3a16146b3fcfe30fc46c4259408ea25ff17
MD5          27283d7c9a1c20ace426a2f8de7ed815
BLAKE2-256   c54761e492a8c2099124f6c2321a6a7cb8308ef1eaeeaeda1d0bcb1838f7806f
Hashes for TwoSampleHC-kipnisal-0.0.1.tar.gz
Algorithm    Hash digest
SHA256       ae099e9141a78ccc089f33410571fe820c75a7b54a7426dc8863e13e44d506b8
MD5          559a2a14f1c01a5d903de66c5ceb74ea
BLAKE2-256   e473e297e92c0442e253c52ba3e7003ff52db97c9e55fd79349ccc53a044541f