Several two-sample tests for contingency tables with count data

Project description

TwoSampleHC -- Higher Criticism Test between Two Frequency Tables

This package provides an adaptation of the Donoho-Jin-Tukey Higher Criticism (HC) test to frequency tables. This adaptation uses a binomial allocation model for the number of occurrences of each feature in two samples, each of which is associated with a frequency table. The exact binomial test associated with each feature yields a P-value. The HC statistic combines these P-values into a global test against the null hypothesis that the two tables are two realizations of the same data-generating mechanism.
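The per-feature step described above can be illustrated with a minimal sketch that uses only the standard library. It is not the package's implementation: the function names `binom_pmf` and `exact_binom_pvalue` are hypothetical, and the null success probability n1 / (n1 + n2) is a simplifying assumption that may differ from what `two_sample_pvals` uses internally.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binom_pvalue(x, y, n1, n2):
    """Two-sided exact binomial P-value for a single feature.

    x, y   -- counts of the feature in sample 1 and sample 2
    n1, n2 -- total counts in sample 1 and sample 2

    Assumption (hypothetical, for illustration): under the null, given the
    combined count m = x + y, x follows Binomial(m, n1 / (n1 + n2)).
    """
    m = x + y
    if m == 0:
        return 1.0  # a feature that never occurs carries no evidence
    p = n1 / (n1 + n2)
    observed = binom_pmf(x, m, p)
    # Sum the probabilities of all outcomes that are no more likely than
    # the observed one; the small tolerance guards against float ties.
    total = sum(
        pr
        for pr in (binom_pmf(k, m, p) for k in range(m + 1))
        if pr <= observed * (1 + 1e-7)
    )
    return min(1.0, total)
```

For instance, a feature seen 10 times in one table and never in an equally sized second table gets a small P-value, while a feature split evenly between the two tables gets a P-value of 1.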

This test is particularly useful for identifying non-null effects under weak and sparse alternatives, i.e., when the difference between the tables is due to a few features, and the evidence each such feature provides is relatively weak. See the references below for more details.

[1] Alon Kipnis. (2022). Higher Criticism for Discriminating Word Frequency Tables and Testing Authorship. Annals of Applied Statistics.

[2] David L. Donoho and Alon Kipnis. (2022). Higher criticism to compare two large frequency tables, with sensitivity to possible rare and weak differences. Annals of Statistics.

Example:

from TwoSampleHC import two_sample_pvals, HC
import numpy as np

N = 1000   # number of features
n = 5 * N  # number of samples

P = 1 / np.arange(1,N+1) # Zipf base distribution
P = P / P.sum()

ep = 0.02   # fraction of features to perturb
mu = 0.005  # intensity of perturbation

TH = np.random.rand(N) < ep
Q = P.copy()
Q[TH] += mu
Q = Q / np.sum(Q)

smp_P = np.random.multinomial(n, P)  # sample from P
smp_Q = np.random.multinomial(n, Q)  # sample from Q

pv = two_sample_pvals(smp_Q, smp_P) # binomial P-values
hc = HC(pv)
hc_val, p_th = hc.HCstar(gamma = 0.25) # Small sample Higher Criticism test

print("TV distance between P and Q: ", 0.5*np.sum(np.abs(P-Q)))
print("Higher-Criticism score for testing P == Q: ", hc_val)  
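To give a sense of how the P-values are combined, here is a minimal sketch of one common form of the HC* statistic from the Higher Criticism literature. It is an illustration under stated assumptions, not the package's `HC.HCstar` code: the function name `hc_star`, the normalization by sqrt(p(1 - p)), and the rule that skips P-values below 1/N are assumptions, and the package's own normalization and edge-case handling may differ.

```python
from math import sqrt


def hc_star(pvals, gamma=0.25):
    """Return (score, threshold) for a small-sample HC* variant.

    The score is the maximum, over the smallest gamma * N sorted P-values
    p_(i) with 1/N <= p_(i) < 1, of
        sqrt(N) * (i / N - p_(i)) / sqrt(p_(i) * (1 - p_(i))).
    The threshold is the P-value at which the maximum is attained.
    Returns (-inf, None) if no P-value falls in the admissible range.
    """
    p_sorted = sorted(pvals)
    N = len(p_sorted)
    max_i = max(1, int(gamma * N))  # only the lower gamma fraction is scanned
    best, best_p = float("-inf"), None
    for i, p in enumerate(p_sorted[:max_i], start=1):
        if p < 1.0 / N or p >= 1.0:
            continue  # skip very small P-values (the HC* rule) and p == 1
        z = sqrt(N) * (i / N - p) / sqrt(p * (1 - p))
        if z > best:
            best, best_p = z, p
    return best, best_p
```

With P-values spread evenly over (0, 1) the score stays small, whereas a handful of moderately small P-values pushes it up. Features whose P-values fall at or below the returned threshold are the ones the statistic singles out as likely sources of the difference.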

Download files

Source Distribution

twosamplehc-0.4.0.tar.gz (9.3 kB view details)

Uploaded Source

Built Distribution

TwoSampleHC-0.4.0-py3-none-any.whl (8.4 kB view details)

Uploaded Python 3

File details

Details for the file twosamplehc-0.4.0.tar.gz.

File metadata

  • Download URL: twosamplehc-0.4.0.tar.gz
  • Size: 9.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.20

File hashes

Hashes for twosamplehc-0.4.0.tar.gz
Algorithm Hash digest
SHA256 f4479e37a1d74a6c189f04c49c1311826617b1d642bffca9849944af0d65a545
MD5 9229489c930d7a40719f03a4fc16e556
BLAKE2b-256 4db466ca5076d74b6fb4c899421c65a472942a53d8e9bd52612fcecbbe0d0e85

File details

Details for the file TwoSampleHC-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: TwoSampleHC-0.4.0-py3-none-any.whl
  • Size: 8.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.20

File hashes

Hashes for TwoSampleHC-0.4.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b3d7ce5acc780e346cff482c1c505b1bfd3c1a2ede6191a095e4f885104d5dcc
MD5 df1f6a6ae2c411ce2cc50cd1d8ddfecf
BLAKE2b-256 8fecc7ef1f9b42740eeaf4ac5eb66add6bf004a6f83c07bc821370edd1288b87
