Skip to main content

A comprehensive statistical assumption checking package

Project description

AssumptionSheriff Documentation

Overview

AssumptionSheriff is a comprehensive Python package designed to validate statistical test assumptions. It provides automated checking of common statistical assumptions and offers recommendations when assumptions are violated. The packges supports 16 commonly used statistical tests, more tests will be added in the coming package updates.

Python 3.10+ License: MIT

Installation

The package can be installed using pip. The dependdencies are pandas,numpy, scipy, lifelines, and statsmodels.

pip install assumption_sheriff

Quick Start

# Basic import of everything
import assumption_sheriff as ash
# or direct import of specific components
from assumption_sheriff import StatisticalTestAssumptions

Supported Statistical Tests

AssumptionSheriff supports assumption checking for the following statistical tests:

  1. Independent Samples t-test t_test_ind
  2. Repeated-Measures ANOVA repeated_anova
  3. Logistic Regression logistic
  4. Factorial ANOVA (Two-way ANOVA) factorial_anova
  5. One-Way ANOVA (one_way_anova) one_way_anova
  6. Pearson Correlation pearson_correlation
  7. Paired t-test paired_ttest
  8. Chi-Square Test of Independence chi_square_independence
  9. Multiple Regression (multiple_regression)
  10. Two-Way ANOVA two_way_anova
  11. Kaplan-Meier Analysis kaplan_meier
  12. Cox Proportional Hazards cox_ph
  13. Poisson Regression poisson
  14. Spearman Correlation spearman
  15. Wilcoxon Signed-Rank Test wilcoxon_signed_rank
  16. MANOVA (Multivariate Analysis of Variance) manova

More tests to be added in future vesrions.

Key Features

  1. Comprehensive assumption checking
  2. Recommendations for alternative methods
  3. Flexible integration
  4. Commonly used test support

Package structure

The package is divided into:

Mixin classes:

To handle specific assumption checks: Such as Noramlity, Homoscedasticity, Monotonicity, etc.

Specific Checkers classes:

Which are specific checkers for various statistical tests (T-test, ANOVA, Pearson, etc.), each inheriting from relevant Mixin classes.

Detailed user example

# Generate a sample data to test the package 
import numpy as np
import pandas as pd

np.random.seed(123)
n = 100

# 1. Independent groups for t-test / one-way ANOVA / two-way ANOVA (factor A/B)
group_bin = np.random.choice([0, 1], size=n)  
factorA = np.random.choice(['A1','A2'], size=n)  
factorB = np.random.choice(['B1','B2'], size=n) 

# 2. Continuous variables (for t-tests, ANOVAs, correlations, regressions, etc.)
cont_var1 = np.random.normal(loc=50, scale=10, size=n)   
cont_var2 = np.random.normal(loc=0, scale=5, size=n)     
cont_var3 = np.random.normal(loc=100, scale=20, size=n) 

# 3. Repeated-measures variables (for repeated-measures ANOVA)
rm_time1 = np.random.normal(loc=5, scale=1, size=n)  
rm_time2 = rm_time1 + np.random.normal(loc=0.5, scale=0.5, size=n)  
rm_time3 = rm_time1 + np.random.normal(loc=1.0, scale=0.5, size=n)  

# 4. Paired data (for paired t-tests or Wilcoxon signed-rank)
paired_pre = np.random.normal(loc=10, scale=2, size=n)
paired_post = paired_pre + np.random.normal(loc=-1, scale=1, size=n)

# 5. Logistic outcome (binary) for logistic regression
logistic_outcome = np.random.binomial(n=1, p=0.4, size=n)

# 6. Categorical variables (for Chi-Square)
cat_var1 = np.random.choice(['Yes','No'], size=n)
cat_var2 = np.random.choice(['High','Low'], size=n)

# 7. Survival data (time-to-event + event indicator for KM/Cox)
time_to_event = np.random.exponential(scale=10, size=n)
event_occurred = np.random.binomial(n=1, p=0.7, size=n)

# 8. Count data (for Poisson regression)
count_data = np.random.poisson(lam=2, size=n)

# 9. Ordinal data (for Spearman correlation or ordinal logistic)
ordinal_data = np.random.choice(['Mild','Moderate','Severe'], size=n)

# 10. Additional continuous variables for correlations / MANOVA
manova_var1 = np.random.normal(loc=30, scale=5, size=n)
manova_var2 = np.random.normal(loc=60, scale=10, size=n)

# Assemble everything into a DataFrame
data = pd.DataFrame({
    'group_bin': group_bin,
    'factorA': factorA,
    'factorB': factorB,
    'cont_var1': cont_var1,
    'cont_var2': cont_var2,
    'cont_var3': cont_var3,
    'rm_time1': rm_time1,
    'rm_time2': rm_time2,
    'rm_time3': rm_time3,
    'paired_pre': paired_pre,
    'paired_post': paired_post,
    'logistic_outcome': logistic_outcome,
    'cat_var1': cat_var1,
    'cat_var2': cat_var2,
    'time_to_event': time_to_event,
    'event_occurred': event_occurred,
    'count_data': count_data,
    'ordinal_data': ordinal_data,
    'manova_var1': manova_var1,
    'manova_var2': manova_var2
})

print(data.head(5))
   group_bin factorA factorB  cont_var1  cont_var2   cont_var3  rm_time1  \
0          0      A2      B2  54.743473  -6.186766   96.147700  4.413384   
1          1      A1      B2  44.360761   0.620279  108.982712  5.154290   
2          0      A1      B1  40.026785  -8.002203   97.092729  3.852763   
3          0      A2      B1  38.999569   3.769344  137.374529  6.520166   
4          0      A2      B1  42.435628  -1.234079   89.625923  5.189043   

   rm_time2  rm_time3  paired_pre  paired_post  logistic_outcome cat_var1  \
0  5.256786  5.338739   10.545469     8.320940                 0       No   
1  5.947174  7.147672   10.850672    10.859323                 0      Yes   
2  4.585648  4.515028    9.538192     8.055827                 1       No   
3  6.311219  7.372252   17.143158    17.222955                 0      Yes   
4  5.909335  5.162844    9.207688     7.786610                 1       No   

  cat_var2  time_to_event  event_occurred  count_data ordinal_data  \
0      Low       0.491169               1           5       Severe   
1      Low      21.322712               1           2       Severe   
2      Low       6.101167               1           0       Severe   
3      Low      17.788497               1           1     Moderate   
4     High       3.188856               0           1         Mild   

   manova_var1  manova_var2  
0    35.227235    49.717239  
1    25.034866    52.544302  
2    32.711234    56.899888  
3    32.427786    48.581779  
4    33.709235    69.263657  

T-tests

# for independent t-test
# ---------------------
# Initialize checker
checker = ash.StatisticalTestAssumptions()

# Check assumptions for independent t-test
results = checker.check_assumptions(
    data=data,
    test_type='t_test_ind',
    variables=['cont_var1', 'cont_var2'],
    group_column='group_bin'
)

# Get recommendation
recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Independent t-test.
# for paired t-test
# ---------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='paired_ttest',
    variables=['paired_pre', 'paired_post']
)
# Get recommendations
recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Paired t-test.

ANOVAs

# One-way ANOVA
# -------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='one_way_anova',
    variables=['cont_var1', 'cont_var2'],
    group_column='group_bin'
)

recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the One-way ANOVA.
# Two-Way ANOVA
# ----------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='two_way_anova',
    variables=['cont_var1'],
    #dependent_var='cont_var1',
    factors=['factorA', 'factorA']
)

recommendation = checker.get_recommendation(results)
print(recommendation) 
✓ All assumptions are met. You can proceed with the Two-way ANOVA.
# Factorial (Two-way) ANOVA
# -----------------------
checker = ash.StatisticalTestAssumptions()
# Sample data structure
data2 = pd.DataFrame({
    'fertilizer': ['A', 'A', 'B', 'B'] * 25,
    'watering': ['daily', 'weekly'] * 50,
    'yield': np.random.normal(loc=[50, 45, 60, 55] * 25, scale=5)
})

results = checker.check_assumptions(
    data=data2,
    test_type='factorial_anova',
    variables=['yield'],
    group_columns= ['fertilizer', 'watering']
)

recommendation = checker.get_recommendation(results)
print(recommendation)
⚠ Some assumptions for Factorial ANOVA are violated:

- Insufficient sample size in some cells (minimum 25 < required 30)

Consider these alternatives:
- Non-parametric factorial analysis
- Mixed-effects model
- Robust ANOVA
# Repeated measures ANOVA
#-------------------------
checker = ash.StatisticalTestAssumptions()
# Check assumptions
results = checker.check_assumptions(
    data=data,
    test_type='repeated_anova',
    variables=['rm_time1', 'rm_time2', 'rm_time3'],
    subject_column='group_bin'
)

recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Repeated Measures ANOVA.
# for MANOVA (Multivariate Analysis of Variance)
#---------------------------------------------------

checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='manova',
    variables=['manova_var1', 'manova_var2'],
    group_col='group_bin'
)

recommendation = checker.get_recommendation(results)
print(recommendation)
⚠ Some assumptions for MANOVA are violated:

- Multivariate normality violated in group_1
- Number of dependent variables should ideally be greater than number of groups

Consider these alternatives:
- Separate univariate ANOVAs with Bonferroni correction
- Robust MANOVA
- Permutation MANOVA
- Non-parametric multivariate tests (e.g., NPMANOVA)
- Linear Discriminant Analysis

Correlation tests

# for Pearson corraltion 
# --------------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='pearson_correlation',
    variables=['cont_var1', 'cont_var2']
)

recommendation = checker.get_recommendation(results)
print(recommendation)
⚠ Some assumptions for Pearson Correlation are violated:

- Variable pair cont_var1_vs_cont_var2 may not have a monotonic relationship (Spearman correlation=0.07)

Consider these alternatives:
- Spearman rank correlation
- Kendall rank correlation
- Robust correlation methods
# for spearman correlation
#----------------------------

checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='spearman',
    variables=['ordinal_data', 'cont_var2']
)

recommendation = checker.get_recommendation(results)
print(recommendation) 
✓ All assumptions are met. You can proceed with the Spearman's Rank Correlation.

Chi-square independence

# for Chi-square test 
# ---------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='chi_square_independence',
    variables=['cat_var1', 'cat_var1']
)

recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Chi-square test of independence.

Regression

# For Logistic Regression
#------------------------
checker = ash.StatisticalTestAssumptions()
# Check assumptions
results = checker.check_assumptions(
    data=data,
    test_type='logistic',
    variables=['cont_var1', 'cont_var2', 'cont_var3'],
    dependent_var='logistic_outcome'
)

recommendation = checker.get_recommendation(results)
print(recommendation)
⚠ Some assumptions for Logistic Regression are violated:

- High multicollinearity detected for 'const' (VIF=58.70)

Consider these alternatives:
- Penalized regression (Ridge, Lasso)
- Decision trees
# Multiple Linear Regression 
# ----------------------------
checker = StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='multiple_regression',
    variables=['cont_var1', 'cont_var2', 'cont_var3'],
    dependent_var='logistic_outcome'
)

recommendation = checker.get_recommendation(results)
print(recommendation)
⚠ Some assumptions for Multiple Linear Regression are violated:

- Residuals are not normally distributed (Shapiro-Wilk p=0.0000)
- Non-linear relationship detected for predictor 'cont_var1'
- Non-linear relationship detected for predictor 'cont_var2'
- High multicollinearity detected for variables: ['const']
# for Poisson regression 
#-----------------------------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='poisson',
    variables=[
        'count_data',    # dependent variable must be first
        'cont_var1',     # predictors follow
        'cont_var2'
    ],
    offset_var='exposure_time'  # Optional
)

recommendation = checker.get_recommendation(results)
print(recommendation)
⚠ Some assumptions for Poisson Regression are violated:

- High multicollinearity detected for variables: ['const']

Consider these alternatives:
- Negative Binomial Regression
- Zero-inflated Poisson Regression
- Zero-inflated Negative Binomial Regression
- Quasi-Poisson Regression

Survival analysis

# for Kaplan-Meier 
# ----------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='kaplan_meier',
    variables=['time_to_event', 'event_occurred'],  # time variable first, event variable second
    group_col='group_bin'  # optional grouping variable
)

recommendation = checker.get_recommendation(results)
print(recommendation) 
✓ All assumptions are met. You can proceed with the Kaplan-Meier survival analysis.
# for Cox Proportional Hazards 
# -----------------------------
checker = ash.StatisticalTestAssumptions()

cox_results = checker.check_assumptions(
    data=data,
    test_type='cox_ph',
    variables=['time_to_event', 'event_occurred'],  # time variable first, event variable second
    group_col='group_bin' 
) 

recommendation = checker.get_recommendation(cox_results)
print(recommendation) 
✓ All assumptions are met. You can proceed with the Cox Proportional Hazards Regression.

Non-parametric tests

# for Wilcoxon Signed-Rank Test
#-------------------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='wilcoxon_signed_rank',
    variables=['paired_pre', 'paired_post']
)

recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Wilcoxon Signed-Rank Test.

Common issues and solutions

1. Handling missing data

AssumptionSheriff automatically handles missing data in most cases. However, for best results:

  • Remove or impute missing values before checking assumptions
  • Ensure complete cases for paired tests
  • Document any data preprocessing steps

2. Dealing with outliers

When outliers are detected:

  • Review them for data entry errors
  • Consider robust statistical methods
  • Document justification for outlier handling

3. Small sample sizes

For small samples:

  • Consider non-parametric alternatives
  • Use exact tests when available
  • Be cautious with assumption violations

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

assumption_sheriff-0.1.0.tar.gz (28.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

assumption_sheriff-0.1.0-py3-none-any.whl (23.8 kB view details)

Uploaded Python 3

File details

Details for the file assumption_sheriff-0.1.0.tar.gz.

File metadata

  • Download URL: assumption_sheriff-0.1.0.tar.gz
  • Upload date:
  • Size: 28.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.10.13

File hashes

Hashes for assumption_sheriff-0.1.0.tar.gz
Algorithm Hash digest
SHA256 3996f345ec771bc0a9a2db27e05dce9a71f65ef2382ae89bc25908a20fa2141b
MD5 4ae0c83f031b136ff616fea45b6b0f18
BLAKE2b-256 83f06329d67557c4fe4a340b58143ad772dc5beede886c7e38eb13948b9882ef

See more details on using hashes here.

File details

Details for the file assumption_sheriff-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for assumption_sheriff-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4fc8eddbff8db693af88e14e19807ea5756ccc1ea97dc87020961d2231a2a6fe
MD5 831b729d16c78e3376b1bf38b84023e9
BLAKE2b-256 3175bf39faf0bc76326270032a95d19b492c26ff15d8984b7e9a8bd2bfe68a12

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page