A comprehensive statistical assumption checking package

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3


`AssumptionSheriff` Documentation

Overview

AssumptionSheriff is a Python package for validating the assumptions of statistical tests. It automates checks of common assumptions and recommends alternatives when an assumption is violated. The package currently supports 16 commonly used statistical tests; more will be added in future releases.

Installation

The package can be installed using pip. Its dependencies are pandas, numpy, scipy, lifelines, and statsmodels.

pip install assumption_sheriff

Quick Start

# Basic import of everything
import assumption_sheriff as ash

# or direct import of specific components
from assumption_sheriff import StatisticalTestAssumptions

Supported Statistical Tests

AssumptionSheriff supports assumption checking for the following statistical tests. Each test is listed with the `test_type` value used to select it:

- Independent Samples t-test: `t_test_ind`
- Repeated-Measures ANOVA: `repeated_anova`
- Logistic Regression: `logistic`
- Factorial ANOVA (Two-way ANOVA): `factorial_anova`
- One-Way ANOVA: `one_way_anova`
- Pearson Correlation: `pearson_correlation`
- Paired t-test: `paired_ttest`
- Chi-Square Test of Independence: `chi_square_independence`
- Multiple Regression: `multiple_regression`
- Two-Way ANOVA: `two_way_anova`
- Kaplan-Meier Analysis: `kaplan_meier`
- Cox Proportional Hazards: `cox_ph`
- Poisson Regression: `poisson`
- Spearman Correlation: `spearman`
- Wilcoxon Signed-Rank Test: `wilcoxon_signed_rank`
- MANOVA (Multivariate Analysis of Variance): `manova`

More tests will be added in future versions.

Key Features

- Comprehensive assumption checking
- Recommendations for alternative methods
- Flexible integration
- Support for commonly used tests

Package structure

The package is divided into:

Mixin classes:

These handle specific assumption checks, such as normality, homoscedasticity, and monotonicity.

Specific Checkers classes:

These are checkers for individual statistical tests (t-test, ANOVA, Pearson correlation, etc.), each inheriting from the relevant mixin classes.
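
The mixin-plus-checker layout can be pictured with a small sketch. It is not the package's source: the class names `NormalityMixin`, `EqualVarianceMixin`, and `TTestChecker`, the method names, and the thresholds are all hypothetical, and the skewness and variance-ratio screens are crude stand-ins for the formal tests a real checker would run.

```python
import statistics


class NormalityMixin:
    """Hypothetical mixin: a rough symmetry screen, not a formal normality test."""

    def check_normality(self, values, max_abs_skew=1.0):
        mean = statistics.fmean(values)
        sd = statistics.pstdev(values)
        # Population skewness: the third standardized moment.
        skew = sum(((v - mean) / sd) ** 3 for v in values) / len(values)
        return abs(skew) <= max_abs_skew


class EqualVarianceMixin:
    """Hypothetical mixin: rule-of-thumb variance ratio between two groups."""

    def check_equal_variance(self, group_a, group_b, max_ratio=4.0):
        var_a = statistics.variance(group_a)
        var_b = statistics.variance(group_b)
        return max(var_a, var_b) / min(var_a, var_b) <= max_ratio


class TTestChecker(NormalityMixin, EqualVarianceMixin):
    """Hypothetical test-specific checker that composes the mixins it needs."""

    def check(self, group_a, group_b):
        return {
            "normality_a": self.check_normality(group_a),
            "normality_b": self.check_normality(group_b),
            "equal_variance": self.check_equal_variance(group_a, group_b),
        }


group_a = [4.8, 5.1, 5.0, 5.3, 4.9, 5.2, 5.0]
group_b = [5.9, 6.1, 6.0, 6.4, 5.8, 6.2, 6.1]
report = TTestChecker().check(group_a, group_b)
print(report)
```

Each test-specific checker inherits only the mixins whose assumptions apply to it, so a check such as normality is written once and shared.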

Detailed user example

# Generate sample data to test the package
import numpy as np
import pandas as pd

np.random.seed(123)
n = 100

# 1. Independent groups for t-test / one-way ANOVA / two-way ANOVA (factor A/B)
group_bin = np.random.choice([0, 1], size=n)  
factorA = np.random.choice(['A1','A2'], size=n)  
factorB = np.random.choice(['B1','B2'], size=n) 

# 2. Continuous variables (for t-tests, ANOVAs, correlations, regressions, etc.)
cont_var1 = np.random.normal(loc=50, scale=10, size=n)   
cont_var2 = np.random.normal(loc=0, scale=5, size=n)     
cont_var3 = np.random.normal(loc=100, scale=20, size=n) 

# 3. Repeated-measures variables (for repeated-measures ANOVA)
rm_time1 = np.random.normal(loc=5, scale=1, size=n)  
rm_time2 = rm_time1 + np.random.normal(loc=0.5, scale=0.5, size=n)  
rm_time3 = rm_time1 + np.random.normal(loc=1.0, scale=0.5, size=n)  

# 4. Paired data (for paired t-tests or Wilcoxon signed-rank)
paired_pre = np.random.normal(loc=10, scale=2, size=n)
paired_post = paired_pre + np.random.normal(loc=-1, scale=1, size=n)

# 5. Logistic outcome (binary) for logistic regression
logistic_outcome = np.random.binomial(n=1, p=0.4, size=n)

# 6. Categorical variables (for Chi-Square)
cat_var1 = np.random.choice(['Yes','No'], size=n)
cat_var2 = np.random.choice(['High','Low'], size=n)

# 7. Survival data (time-to-event + event indicator for KM/Cox)
time_to_event = np.random.exponential(scale=10, size=n)
event_occurred = np.random.binomial(n=1, p=0.7, size=n)

# 8. Count data (for Poisson regression)
count_data = np.random.poisson(lam=2, size=n)

# 9. Ordinal data (for Spearman correlation or ordinal logistic)
ordinal_data = np.random.choice(['Mild','Moderate','Severe'], size=n)

# 10. Additional continuous variables for correlations / MANOVA
manova_var1 = np.random.normal(loc=30, scale=5, size=n)
manova_var2 = np.random.normal(loc=60, scale=10, size=n)

# Assemble everything into a DataFrame
data = pd.DataFrame({
    'group_bin': group_bin,
    'factorA': factorA,
    'factorB': factorB,
    'cont_var1': cont_var1,
    'cont_var2': cont_var2,
    'cont_var3': cont_var3,
    'rm_time1': rm_time1,
    'rm_time2': rm_time2,
    'rm_time3': rm_time3,
    'paired_pre': paired_pre,
    'paired_post': paired_post,
    'logistic_outcome': logistic_outcome,
    'cat_var1': cat_var1,
    'cat_var2': cat_var2,
    'time_to_event': time_to_event,
    'event_occurred': event_occurred,
    'count_data': count_data,
    'ordinal_data': ordinal_data,
    'manova_var1': manova_var1,
    'manova_var2': manova_var2
})

print(data.head(5))

   group_bin factorA factorB  cont_var1  cont_var2   cont_var3  rm_time1  \
0          0      A2      B2  54.743473  -6.186766   96.147700  4.413384   
1          1      A1      B2  44.360761   0.620279  108.982712  5.154290   
2          0      A1      B1  40.026785  -8.002203   97.092729  3.852763   
3          0      A2      B1  38.999569   3.769344  137.374529  6.520166   
4          0      A2      B1  42.435628  -1.234079   89.625923  5.189043   

   rm_time2  rm_time3  paired_pre  paired_post  logistic_outcome cat_var1  \
0  5.256786  5.338739   10.545469     8.320940                 0       No   
1  5.947174  7.147672   10.850672    10.859323                 0      Yes   
2  4.585648  4.515028    9.538192     8.055827                 1       No   
3  6.311219  7.372252   17.143158    17.222955                 0      Yes   
4  5.909335  5.162844    9.207688     7.786610                 1       No   

  cat_var2  time_to_event  event_occurred  count_data ordinal_data  \
0      Low       0.491169               1           5       Severe   
1      Low      21.322712               1           2       Severe   
2      Low       6.101167               1           0       Severe   
3      Low      17.788497               1           1     Moderate   
4     High       3.188856               0           1         Mild   

   manova_var1  manova_var2  
0    35.227235    49.717239  
1    25.034866    52.544302  
2    32.711234    56.899888  
3    32.427786    48.581779  
4    33.709235    69.263657

T-tests

# for independent t-test
# ---------------------
# Initialize checker
checker = ash.StatisticalTestAssumptions()

# Check assumptions for independent t-test
results = checker.check_assumptions(
    data=data,
    test_type='t_test_ind',
    variables=['cont_var1', 'cont_var2'],
    group_column='group_bin'
)

# Get recommendation
recommendation = checker.get_recommendation(results)
print(recommendation)

✓ All assumptions are met. You can proceed with the Independent t-test.

# for paired t-test
# ---------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='paired_ttest',
    variables=['paired_pre', 'paired_post']
)
# Get recommendations
recommendation = checker.get_recommendation(results)
print(recommendation)

✓ All assumptions are met. You can proceed with the Paired t-test.

ANOVAs

# One-way ANOVA
# -------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='one_way_anova',
    variables=['cont_var1', 'cont_var2'],
    group_column='group_bin'
)

recommendation = checker.get_recommendation(results)
print(recommendation)

✓ All assumptions are met. You can proceed with the One-way ANOVA.

# Two-Way ANOVA
# ----------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='two_way_anova',
    variables=['cont_var1'],
    #dependent_var='cont_var1',
    factors=['factorA', 'factorB']
)

recommendation = checker.get_recommendation(results)
print(recommendation)

✓ All assumptions are met. You can proceed with the Two-way ANOVA.

# Factorial (Two-way) ANOVA
# -----------------------
checker = ash.StatisticalTestAssumptions()
# Sample data structure
data2 = pd.DataFrame({
    'fertilizer': ['A', 'A', 'B', 'B'] * 25,
    'watering': ['daily', 'weekly'] * 50,
    'yield': np.random.normal(loc=[50, 45, 60, 55] * 25, scale=5)
})

results = checker.check_assumptions(
    data=data2,
    test_type='factorial_anova',
    variables=['yield'],
    group_columns= ['fertilizer', 'watering']
)

recommendation = checker.get_recommendation(results)
print(recommendation)

⚠ Some assumptions for Factorial ANOVA are violated:

- Insufficient sample size in some cells (minimum 25 < required 30)

Consider these alternatives:
- Non-parametric factorial analysis
- Mixed-effects model
- Robust ANOVA
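
The cell-size warning above can be reproduced by hand. This is a minimal sketch, assuming the same `fertilizer` by `watering` layout as `data2`; the threshold of 30 is taken from the message above, and `min_cell_size` is an illustrative name, not a package parameter.

```python
import pandas as pd

# Same factor layout as data2: 2 fertilizers x 2 watering schedules, 100 rows.
design = pd.DataFrame({
    'fertilizer': ['A', 'A', 'B', 'B'] * 25,
    'watering': ['daily', 'weekly'] * 50,
})

# Count the observations in every fertilizer x watering cell.
cell_counts = design.groupby(['fertilizer', 'watering']).size()
print(cell_counts)

min_cell_size = 30  # threshold quoted in the warning; assumed, not read from the package
small_cells = cell_counts[cell_counts < min_cell_size]
print(f"{len(small_cells)} of {len(cell_counts)} cells fall below {min_cell_size}")
```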

# Repeated measures ANOVA
#-------------------------
checker = ash.StatisticalTestAssumptions()
# Check assumptions
results = checker.check_assumptions(
    data=data,
    test_type='repeated_anova',
    variables=['rm_time1', 'rm_time2', 'rm_time3'],
    subject_column='group_bin'
)

recommendation = checker.get_recommendation(results)
print(recommendation)

✓ All assumptions are met. You can proceed with the Repeated Measures ANOVA.

# for MANOVA (Multivariate Analysis of Variance)
#---------------------------------------------------

checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='manova',
    variables=['manova_var1', 'manova_var2'],
    group_col='group_bin'
)

recommendation = checker.get_recommendation(results)
print(recommendation)

⚠ Some assumptions for MANOVA are violated:

- Multivariate normality violated in group_1
- Number of dependent variables should ideally be greater than number of groups

Consider these alternatives:
- Separate univariate ANOVAs with Bonferroni correction
- Robust MANOVA
- Permutation MANOVA
- Non-parametric multivariate tests (e.g., NPMANOVA)
- Linear Discriminant Analysis

Correlation tests

# for Pearson correlation
# --------------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='pearson_correlation',
    variables=['cont_var1', 'cont_var2']
)

recommendation = checker.get_recommendation(results)
print(recommendation)

⚠ Some assumptions for Pearson Correlation are violated:

- Variable pair cont_var1_vs_cont_var2 may not have a monotonic relationship (Spearman correlation=0.07)

Consider these alternatives:
- Spearman rank correlation
- Kendall rank correlation
- Robust correlation methods

# for spearman correlation
#----------------------------

checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='spearman',
    variables=['ordinal_data', 'cont_var2']
)

recommendation = checker.get_recommendation(results)
print(recommendation)

✓ All assumptions are met. You can proceed with the Spearman's Rank Correlation.

Chi-square independence

# for Chi-square test 
# ---------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='chi_square_independence',
    variables=['cat_var1', 'cat_var2']
)

recommendation = checker.get_recommendation(results)
print(recommendation)

✓ All assumptions are met. You can proceed with the Chi-square test of independence.

Regression

# For Logistic Regression
#------------------------
checker = ash.StatisticalTestAssumptions()
# Check assumptions
results = checker.check_assumptions(
    data=data,
    test_type='logistic',
    variables=['cont_var1', 'cont_var2', 'cont_var3'],
    dependent_var='logistic_outcome'
)

recommendation = checker.get_recommendation(results)
print(recommendation)

⚠ Some assumptions for Logistic Regression are violated:

- High multicollinearity detected for 'const' (VIF=58.70)

Consider these alternatives:
- Penalized regression (Ridge, Lasso)
- Decision trees
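
Variance inflation factors are normally read for the predictors. The `const` entry presumably refers to an intercept column, whose VIF typically reflects how far the predictor means sit from zero rather than collinearity among the predictors themselves. The NumPy sketch below follows the usual definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the others. The helper name `vif` and the simulated data are illustrative; this is not the package's implementation.

```python
import numpy as np


def vif(X):
    """Return one VIF per column of X, regressing each column on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        r_squared = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        factors.append(1.0 / (1.0 - r_squared))
    return factors


rng = np.random.default_rng(0)
x1 = rng.normal(50, 10, 200)
x2 = rng.normal(0, 5, 200)
x3 = x1 + rng.normal(0, 1, 200)  # nearly a copy of x1

independent = vif(np.column_stack([x1, x2]))
collinear = vif(np.column_stack([x1, x2, x3]))
print("independent predictors:", np.round(independent, 2))
print("with near-duplicate:   ", np.round(collinear, 2))
```

Values near 1 indicate little shared variance; the near-duplicate column inflates the VIF of both itself and the predictor it copies.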

# Multiple Linear Regression 
# ----------------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='multiple_regression',
    variables=['cont_var1', 'cont_var2', 'cont_var3'],
    dependent_var='logistic_outcome'
)

recommendation = checker.get_recommendation(results)
print(recommendation)

⚠ Some assumptions for Multiple Linear Regression are violated:

- Residuals are not normally distributed (Shapiro-Wilk p=0.0000)
- Non-linear relationship detected for predictor 'cont_var1'
- Non-linear relationship detected for predictor 'cont_var2'
- High multicollinearity detected for variables: ['const']

# for Poisson regression 
#-----------------------------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='poisson',
    variables=[
        'count_data',    # dependent variable must be first
        'cont_var1',     # predictors follow
        'cont_var2'
    ],
    # offset_var='exposure_time'  # Optional; needs an exposure column, which the sample data does not include
)

recommendation = checker.get_recommendation(results)
print(recommendation)

⚠ Some assumptions for Poisson Regression are violated:

- High multicollinearity detected for variables: ['const']

Consider these alternatives:
- Negative Binomial Regression
- Zero-inflated Poisson Regression
- Zero-inflated Negative Binomial Regression
- Quasi-Poisson Regression
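
Several of the alternatives listed above (negative binomial, quasi-Poisson) address overdispersion, where the variance of the counts exceeds their mean. A quick screen is the variance-to-mean ratio, which is close to 1 for Poisson-like data. The sketch below uses made-up counts and ignores predictors, so it is only a first look and not the check the package performs; `dispersion_ratio` is an illustrative helper name.

```python
import statistics


def dispersion_ratio(counts):
    """Sample variance divided by the mean; about 1 for Poisson-like counts."""
    return statistics.variance(counts) / statistics.fmean(counts)


poisson_like = [0, 1, 1, 1, 2, 2, 2, 3, 3, 5]
overdispersed = [0, 0, 0, 0, 1, 1, 2, 5, 9, 12]

print(f"Poisson-like counts:  {dispersion_ratio(poisson_like):.2f}")
print(f"Overdispersed counts: {dispersion_ratio(overdispersed):.2f}")
```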

Survival analysis

# for Kaplan-Meier 
# ----------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='kaplan_meier',
    variables=['time_to_event', 'event_occurred'],  # time variable first, event variable second
    group_col='group_bin'  # optional grouping variable
)

recommendation = checker.get_recommendation(results)
print(recommendation)

✓ All assumptions are met. You can proceed with the Kaplan-Meier survival analysis.

# for Cox Proportional Hazards 
# -----------------------------
checker = ash.StatisticalTestAssumptions()

cox_results = checker.check_assumptions(
    data=data,
    test_type='cox_ph',
    variables=['time_to_event', 'event_occurred'],  # time variable first, event variable second
    group_col='group_bin' 
) 

recommendation = checker.get_recommendation(cox_results)
print(recommendation)

✓ All assumptions are met. You can proceed with the Cox Proportional Hazards Regression.

Non-parametric tests

# for Wilcoxon Signed-Rank Test
#-------------------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
    data=data,
    test_type='wilcoxon_signed_rank',
    variables=['paired_pre', 'paired_post']
)

recommendation = checker.get_recommendation(results)
print(recommendation)

✓ All assumptions are met. You can proceed with the Wilcoxon Signed-Rank Test.

Common issues and solutions

1. Handling missing data

AssumptionSheriff automatically handles missing data in most cases. However, for best results:

- Remove or impute missing values before checking assumptions
- Ensure complete cases for paired tests
- Document any data preprocessing steps
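
As a sketch of the complete-case step for paired data, using pandas only: the column names mirror the sample data above, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

paired = pd.DataFrame({
    'paired_pre':  [10.2, 9.8, np.nan, 11.1, 10.5],
    'paired_post': [9.1, np.nan, 8.7, 10.3, 9.9],
})

# Report missingness before changing anything, so the step can be documented.
print(paired.isna().sum())

# Keep only rows where both measurements are present (complete cases).
complete = paired.dropna(subset=['paired_pre', 'paired_post'])
print(f"kept {len(complete)} of {len(paired)} pairs")
```

The `complete` frame can then be passed as `data` to the paired checks.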

2. Dealing with outliers

When outliers are detected:

- Review them for data entry errors
- Consider robust statistical methods
- Document justification for outlier handling
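
One common way to list candidate outliers for review is the 1.5 x IQR rule. The sketch below is an assumption about how a user might screen values by hand, not a description of the package's own outlier detection; `iqr_outliers` is a hypothetical helper.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


sample = [9.8, 10.1, 10.4, 9.9, 10.0, 10.2, 25.0]
mask = iqr_outliers(sample)
flagged = np.asarray(sample)[mask]
print("values to review:", flagged)
```

Flagged values are candidates for review, not automatic deletions.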

3. Small sample sizes

For small samples:

- Consider non-parametric alternatives
- Use exact tests when available
- Be cautious with assumption violations
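
As an example of an exact test, Fisher's exact test for a 2x2 table can be computed directly from the hypergeometric distribution. The sketch below is a plain-Python illustration with made-up counts; in practice `scipy.stats.fisher_exact` serves the same purpose. The function name `fisher_exact_two_sided` is illustrative.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def weight(x):
        # Unnormalised hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest = max(0, col1 - row2)
    highest = min(col1, row1)
    # Sum every table that is no more probable than the observed one.
    extreme = sum(weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed)
    return extreme / total


p_value = fisher_exact_two_sided([[8, 2], [1, 5]])
print(f"two-sided p = {p_value:.4f}")
```

Because the weights are integers, the comparison against the observed table is exact and needs no floating-point tolerance.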



This version

0.1.0

Dec 24, 2024

Download files


Source Distribution

assumption_sheriff-0.1.0.tar.gz (28.3 kB)

Uploaded Dec 24, 2024 Source

Built Distribution


assumption_sheriff-0.1.0-py3-none-any.whl (23.8 kB)

Uploaded Dec 24, 2024 Python 3

File details

Details for the file assumption_sheriff-0.1.0.tar.gz.

File metadata

Download URL: assumption_sheriff-0.1.0.tar.gz
Upload date: Dec 24, 2024
Size: 28.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.10.13

File hashes

Hashes for assumption_sheriff-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`3996f345ec771bc0a9a2db27e05dce9a71f65ef2382ae89bc25908a20fa2141b`
MD5	`4ae0c83f031b136ff616fea45b6b0f18`
BLAKE2b-256	`83f06329d67557c4fe4a340b58143ad772dc5beede886c7e38eb13948b9882ef`


File details

Details for the file assumption_sheriff-0.1.0-py3-none-any.whl.

File metadata

Download URL: assumption_sheriff-0.1.0-py3-none-any.whl
Upload date: Dec 24, 2024
Size: 23.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.0.1 CPython/3.10.13

File hashes

Hashes for assumption_sheriff-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4fc8eddbff8db693af88e14e19807ea5756ccc1ea97dc87020961d2231a2a6fe`
MD5	`831b729d16c78e3376b1bf38b84023e9`
BLAKE2b-256	`3175bf39faf0bc76326270032a95d19b492c26ff15d8984b7e9a8bd2bfe68a12`

