CachedContingency

Python 3.9+ classes to compute and cache Fisher's exact test and Boschloo's test more efficiently.

Installation

This package requires at least Python 3.9.

pip install cached_contingency

Idea

I have to compute many of these tests and want to speed up the process. Two optimizations came to mind:

1. My contingency tables often have identical column sums, so many tests can be recycled.
2. Some contingency tables are equivalent and only have to be computed once:
   - Fisher's test: abcd, acbd, dbca and dcba are equivalent
   - Boschloo's test: abcd, badc, cdab and dcba are equivalent
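
The equivalences listed above can be illustrated with a minimal sketch that maps every table to one representative, so that equivalent tables share a single cache key. The function names and the choice of the smallest tuple as representative are assumptions made for illustration; the package may pick its representative differently.

```python
def canonical_fisher(a, b, c, d):
    """Smallest ordering among the four Fisher-equivalent tables."""
    # abcd, acbd (transpose), dbca (anti-transpose), dcba (rotate by 180 degrees)
    return min((a, b, c, d), (a, c, b, d), (d, b, c, a), (d, c, b, a))


def canonical_boschloo(a, b, c, d):
    """Smallest ordering among the four Boschloo-equivalent tables."""
    # abcd, badc (swap columns), cdab (swap rows), dcba (swap both)
    return min((a, b, c, d), (b, a, d, c), (c, d, a, b), (d, c, b, a))


# Two Fisher-equivalent tables collapse to the same key.
print(canonical_fisher(74, 31, 43, 32))
print(canonical_fisher(32, 43, 31, 74))
```

Because each set of four orderings is closed under its own swaps, any member of a group yields the same representative.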

Furthermore, tools sometimes have to be re-run. In that case, all previously computed results can be reused.

A SQLite database serves as the cache.
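
To make the caching idea concrete, here is a small sketch built on Python's standard `sqlite3` module. The table name, the column layout and the helper names `cache_get` and `cache_put` are hypothetical; the actual schema used by the package is not documented here.

```python
import sqlite3

# Hypothetical schema: one row per table, keyed by its four cells.
conn = sqlite3.connect(":memory:")  # a real cache would use a file path
conn.execute(
    "CREATE TABLE IF NOT EXISTS fisher ("
    "c1r1 INTEGER, c2r1 INTEGER, c1r2 INTEGER, c2r2 INTEGER, "
    "pval REAL, odds_ratio REAL, "
    "PRIMARY KEY (c1r1, c2r1, c1r2, c2r2))"
)


def cache_get(conn, table):
    """Return (pval, odds_ratio) for a cached table, or None on a miss."""
    return conn.execute(
        "SELECT pval, odds_ratio FROM fisher "
        "WHERE c1r1 = ? AND c2r1 = ? AND c1r2 = ? AND c2r2 = ?",
        table,
    ).fetchone()


def cache_put(conn, table, pval, odds_ratio):
    """Store a result; a table that is already cached is left unchanged."""
    conn.execute(
        "INSERT OR IGNORE INTO fisher VALUES (?, ?, ?, ?, ?, ?)",
        (*table, pval, odds_ratio),
    )
    conn.commit()
```

The composite primary key makes lookups by table cheap and prevents duplicate rows for the same table.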

Execution

1. Replace equivalent contingency tables with the same contingency table
2. Find all tests that are not cached yet
3. Calculate them in parallel, using all CPU cores
4. Add them to the cache
5. Return results
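
The steps above can be sketched roughly as follows. This is not the package's implementation: the cache is a plain dict, and `compute`, `canonical` and `executor_cls` are hypothetical parameters introduced for illustration. With `ProcessPoolExecutor`, `compute` has to be a picklable top-level function.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def get_or_create_many(tables, cache, compute, canonical,
                       executor_cls=ProcessPoolExecutor):
    """Return one result per input table, computing only what the cache lacks."""
    # 1. Replace equivalent tables with the same representative.
    keys = [canonical(*t) for t in tables]
    # 2. Find all tests that are not cached yet (deduplicated, order kept).
    missing = [k for k in dict.fromkeys(keys) if k not in cache]
    if missing:
        # 3. Calculate them in parallel; ProcessPoolExecutor defaults to
        #    one worker per available CPU.
        with executor_cls() as pool:
            results = list(pool.map(compute, missing))
        # 4. Add them to the cache.
        cache.update(zip(missing, results))
    # 5. Return results in input order.
    return [cache[k] for k in keys]


def rotation_key(a, b, c, d):
    """Toy canonical form: treat abcd and dcba as the same table."""
    return min((a, b, c, d), (d, c, b, a))


calls = []


def toy_compute(key):
    """Stand-in for a real test; records which keys were actually computed."""
    calls.append(key)
    return sum(key)


cache = {}
tables = [(1, 2, 3, 4), (4, 3, 2, 1), (5, 6, 7, 8)]
# Threads keep this demo runnable without a __main__ guard.
first = get_or_create_many(tables, cache, toy_compute, rotation_key, ThreadPoolExecutor)
second = get_or_create_many(tables, cache, toy_compute, rotation_key, ThreadPoolExecutor)
print(first, second, len(calls))
```

In the toy run, three input tables trigger only two computations, and the second call is served entirely from the cache.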

Usage

Set the location of the cache database:

export CACHED_CONTINGENCY_DB=/custom/path.db  # default: ~/.cache/contingency.db
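
For scripts that need to know which file will be used, the lookup can be mirrored in Python. The sketch below only assumes the variable name and the default shown above; `cache_db_path` is a hypothetical helper, and how the package resolves the path internally is not confirmed here.

```python
import os
from pathlib import Path


def cache_db_path():
    """Resolve the cache location: environment variable first, then default."""
    default = Path.home() / ".cache" / "contingency.db"
    return Path(os.environ.get("CACHED_CONTINGENCY_DB", default))


print(cache_db_path())
```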

Calculate single tests:

from cached_contingency import CachedFisher, CachedBoschloo
from scipy.stats import fisher_exact, boschloo_exact

# Create class (automatically creates database if none exists yet)
cf = CachedFisher()
# Calculate Fisher's test
pval, odds_ratio = cf.get_or_create(74, 31, 43, 32)
# This is equivalent to:
odds_ratio, pval = fisher_exact([[74, 31], [43, 32]])

# Create class (automatically creates database if none exists yet)
cb = CachedBoschloo()
# Calculate Boschloo's test
pval_b, pval_f = cb.get_or_create(74, 31, 43, 32)
# This is almost* equivalent to:
boschloo_result = boschloo_exact([[74, 31], [43, 32]])
pval_b, pval_f = boschloo_result.pvalue, boschloo_result.statistic

*: Not exactly equivalent: my function never returns p-values greater than 1 and never returns nan as a p-value. (See scipy issue.)

Calculate multiple tests:

from cached_contingency import CachedFisher, CachedBoschloo
import pandas as pd
import numpy as np

# Create class (automatically creates database if none exists yet)
cb = CachedBoschloo()

# Create test DataFrame, column names are important!
np.random.seed(42)
test_df = pd.DataFrame(
    [(np.random.randint(200) for _ in range(4)) for _ in range(5)],
    columns=['c1r1', 'c2r1', 'c1r2', 'c2r2']
)
print(test_df)
#    c1r1  c2r1  c1r2  c2r2
# 0   102   179    92    14
# 1   106    71   188    20
# 2   102   121    74    87
# 3   116    99   103   151
# 4   130   149    52     1

# Calculate multiple Boschloo's tests
result_df = cb.get_or_create_many(test_df)
print(result_df)
#    c1r1  c2r1  c1r2  c2r2          pval   fisher_stat
# 0   102   179    92    14  3.442564e-20  3.974758e-20
# 1   106    71   188    20  1.144156e-12  1.197655e-12
# 2   102   121    74    87  9.692791e-01  5.239450e-01
# 3   116    99   103   151  3.821222e-03  2.490365e-03
# 4   130   149    52     1  1.831830e-14  1.595989e-14

# If you run this again, the results will be loaded from cache:
result_df = cb.get_or_create_many(test_df)
print('Like a flash!')

Release history

- 0.0.4: Jan 5, 2022
- 0.0.3: Jan 5, 2022
- 0.0.2: Jan 4, 2022
- 0.0.1: Jan 3, 2022 (this version)

Download files

Source Distribution

cached-contingency-0.0.1.tar.gz (6.6 kB)

Uploaded Jan 3, 2022 Source

Hashes for cached-contingency-0.0.1.tar.gz

Algorithm	Hash digest
SHA256	`319ae465ad032a77b3aaf7160460c99b570d22e4d35f82e03cdab920978e0d50`
MD5	`621612ed01c99324ed8275f8921fa813`
BLAKE2b-256	`a806ab12cabfc3460368d4ba215a93f3436dc77dda57637b302966eed8642798`