A comprehensive statistical assumption checking package
AssumptionSheriff Documentation
Overview
AssumptionSheriff is a Python package for validating the assumptions of statistical tests. It automates checks of common assumptions and recommends alternatives when assumptions are violated. The package supports 16 commonly used statistical tests; more will be added in future releases.
Installation
The package can be installed with pip. Its dependencies are pandas, numpy, scipy, lifelines, and statsmodels.
pip install assumption_sheriff
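To confirm that the dependencies listed above are importable in the current environment, a small check like the following may help. It is a sketch that uses only the standard library; the helper name `missing_dependencies` is hypothetical and not part of AssumptionSheriff.

```python
from importlib.util import find_spec

# Import names of the dependencies listed in this section.
DEPENDENCIES = ["pandas", "numpy", "scipy", "lifelines", "statsmodels"]


def missing_dependencies(names=DEPENDENCIES):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_dependencies()
print("Missing:", ", ".join(missing) if missing else "none")
```

An empty result only means the modules can be found; it does not verify their versions.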
Quick Start
# Basic import of everything
import assumption_sheriff as ash
# or direct import of specific components
from assumption_sheriff import StatisticalTestAssumptions
Supported Statistical Tests
AssumptionSheriff supports assumption checking for the following statistical tests:
- Independent Samples t-test (t_test_ind)
- Repeated-Measures ANOVA (repeated_anova)
- Logistic Regression (logistic)
- Factorial ANOVA (Two-way ANOVA) (factorial_anova)
- One-Way ANOVA (one_way_anova)
- Pearson Correlation (pearson_correlation)
- Paired t-test (paired_ttest)
- Chi-Square Test of Independence (chi_square_independence)
- Multiple Regression (multiple_regression)
- Two-Way ANOVA (two_way_anova)
- Kaplan-Meier Analysis (kaplan_meier)
- Cox Proportional Hazards (cox_ph)
- Poisson Regression (poisson)
- Spearman Correlation (spearman)
- Wilcoxon Signed-Rank Test (wilcoxon_signed_rank)
- MANOVA (Multivariate Analysis of Variance) (manova)
More tests will be added in future versions.
Key Features
- Comprehensive assumption checking
- Recommendations for alternative methods
- Flexible integration
- Commonly used test support
Package structure
The package is divided into:
Mixin classes:
Handle specific assumption checks, such as normality, homoscedasticity, and monotonicity.
Specific checker classes:
Checkers for individual statistical tests (t-test, ANOVA, Pearson correlation, etc.), each inheriting from the relevant mixin classes.
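The mixin-plus-checker layout described above can be sketched as follows. This is an illustration of the pattern only: the class names, method names, and thresholds are hypothetical and are not taken from the AssumptionSheriff source. The equal-variance screen here is a simple variance-ratio rule of thumb, swapped in for a formal test so the example runs on the standard library alone.

```python
from statistics import variance


class SampleSizeMixin:
    """Hypothetical mixin: checks that every group has enough observations."""

    def check_sample_size(self, groups, minimum=30):
        smallest = min(len(values) for values in groups.values())
        return {"passed": smallest >= minimum, "smallest_group": smallest}


class VarianceRatioMixin:
    """Hypothetical mixin: rough equal-variance screen (not Levene's test)."""

    def check_variance_ratio(self, groups, max_ratio=4.0):
        variances = [variance(values) for values in groups.values()]
        ratio = max(variances) / min(variances)
        return {"passed": ratio <= max_ratio, "ratio": ratio}


class TTestChecker(SampleSizeMixin, VarianceRatioMixin):
    """Hypothetical test-specific checker composed from the mixins above."""

    def check(self, groups):
        return {
            "sample_size": self.check_sample_size(groups, minimum=5),
            "equal_variance": self.check_variance_ratio(groups),
        }


groups = {"a": [4.1, 5.0, 5.9, 5.2, 4.8], "b": [6.1, 6.8, 7.4, 5.9, 6.6]}
report = TTestChecker().check(groups)
print(report)
```

Each mixin owns one kind of check, so a checker for another test can reuse it by adding the mixin to its base classes.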
Detailed user example
# Generate sample data to test the package
import numpy as np
import pandas as pd
np.random.seed(123)
n = 100
# 1. Independent groups for t-test / one-way ANOVA / two-way ANOVA (factor A/B)
group_bin = np.random.choice([0, 1], size=n)
factorA = np.random.choice(['A1','A2'], size=n)
factorB = np.random.choice(['B1','B2'], size=n)
# 2. Continuous variables (for t-tests, ANOVAs, correlations, regressions, etc.)
cont_var1 = np.random.normal(loc=50, scale=10, size=n)
cont_var2 = np.random.normal(loc=0, scale=5, size=n)
cont_var3 = np.random.normal(loc=100, scale=20, size=n)
# 3. Repeated-measures variables (for repeated-measures ANOVA)
rm_time1 = np.random.normal(loc=5, scale=1, size=n)
rm_time2 = rm_time1 + np.random.normal(loc=0.5, scale=0.5, size=n)
rm_time3 = rm_time1 + np.random.normal(loc=1.0, scale=0.5, size=n)
# 4. Paired data (for paired t-tests or Wilcoxon signed-rank)
paired_pre = np.random.normal(loc=10, scale=2, size=n)
paired_post = paired_pre + np.random.normal(loc=-1, scale=1, size=n)
# 5. Logistic outcome (binary) for logistic regression
logistic_outcome = np.random.binomial(n=1, p=0.4, size=n)
# 6. Categorical variables (for Chi-Square)
cat_var1 = np.random.choice(['Yes','No'], size=n)
cat_var2 = np.random.choice(['High','Low'], size=n)
# 7. Survival data (time-to-event + event indicator for KM/Cox)
time_to_event = np.random.exponential(scale=10, size=n)
event_occurred = np.random.binomial(n=1, p=0.7, size=n)
# 8. Count data (for Poisson regression)
count_data = np.random.poisson(lam=2, size=n)
# 9. Ordinal data (for Spearman correlation or ordinal logistic)
ordinal_data = np.random.choice(['Mild','Moderate','Severe'], size=n)
# 10. Additional continuous variables for correlations / MANOVA
manova_var1 = np.random.normal(loc=30, scale=5, size=n)
manova_var2 = np.random.normal(loc=60, scale=10, size=n)
# Assemble everything into a DataFrame
data = pd.DataFrame({
'group_bin': group_bin,
'factorA': factorA,
'factorB': factorB,
'cont_var1': cont_var1,
'cont_var2': cont_var2,
'cont_var3': cont_var3,
'rm_time1': rm_time1,
'rm_time2': rm_time2,
'rm_time3': rm_time3,
'paired_pre': paired_pre,
'paired_post': paired_post,
'logistic_outcome': logistic_outcome,
'cat_var1': cat_var1,
'cat_var2': cat_var2,
'time_to_event': time_to_event,
'event_occurred': event_occurred,
'count_data': count_data,
'ordinal_data': ordinal_data,
'manova_var1': manova_var1,
'manova_var2': manova_var2
})
print(data.head(5))
group_bin factorA factorB cont_var1 cont_var2 cont_var3 rm_time1 \
0 0 A2 B2 54.743473 -6.186766 96.147700 4.413384
1 1 A1 B2 44.360761 0.620279 108.982712 5.154290
2 0 A1 B1 40.026785 -8.002203 97.092729 3.852763
3 0 A2 B1 38.999569 3.769344 137.374529 6.520166
4 0 A2 B1 42.435628 -1.234079 89.625923 5.189043
rm_time2 rm_time3 paired_pre paired_post logistic_outcome cat_var1 \
0 5.256786 5.338739 10.545469 8.320940 0 No
1 5.947174 7.147672 10.850672 10.859323 0 Yes
2 4.585648 4.515028 9.538192 8.055827 1 No
3 6.311219 7.372252 17.143158 17.222955 0 Yes
4 5.909335 5.162844 9.207688 7.786610 1 No
cat_var2 time_to_event event_occurred count_data ordinal_data \
0 Low 0.491169 1 5 Severe
1 Low 21.322712 1 2 Severe
2 Low 6.101167 1 0 Severe
3 Low 17.788497 1 1 Moderate
4 High 3.188856 0 1 Mild
manova_var1 manova_var2
0 35.227235 49.717239
1 25.034866 52.544302
2 32.711234 56.899888
3 32.427786 48.581779
4 33.709235 69.263657
T-tests
# for independent t-test
# ---------------------
# Initialize checker
checker = ash.StatisticalTestAssumptions()
# Check assumptions for independent t-test
results = checker.check_assumptions(
data=data,
test_type='t_test_ind',
variables=['cont_var1', 'cont_var2'],
group_column='group_bin'
)
# Get recommendation
recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Independent t-test.
# for paired t-test
# ---------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
data=data,
test_type='paired_ttest',
variables=['paired_pre', 'paired_post']
)
# Get recommendations
recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Paired t-test.
ANOVAs
# One-way ANOVA
# -------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
data=data,
test_type='one_way_anova',
variables=['cont_var1', 'cont_var2'],
group_column='group_bin'
)
recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the One-way ANOVA.
# Two-Way ANOVA
# ----------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
data=data,
test_type='two_way_anova',
variables=['cont_var1'],
#dependent_var='cont_var1',
factors=['factorA', 'factorB']
)
recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Two-way ANOVA.
# Factorial (Two-way) ANOVA
# -----------------------
checker = ash.StatisticalTestAssumptions()
# Sample data structure
data2 = pd.DataFrame({
'fertilizer': ['A', 'A', 'B', 'B'] * 25,
'watering': ['daily', 'weekly'] * 50,
'yield': np.random.normal(loc=[50, 45, 60, 55] * 25, scale=5)
})
results = checker.check_assumptions(
data=data2,
test_type='factorial_anova',
variables=['yield'],
group_columns= ['fertilizer', 'watering']
)
recommendation = checker.get_recommendation(results)
print(recommendation)
⚠ Some assumptions for Factorial ANOVA are violated:
- Insufficient sample size in some cells (minimum 25 < required 30)
Consider these alternatives:
- Non-parametric factorial analysis
- Mixed-effects model
- Robust ANOVA
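The sample-size warning above can be reproduced by counting observations per design cell. The sketch below rebuilds the two factor columns of data2 with plain lists and compares the smallest cell against the threshold of 30 quoted in the warning; that the package counts cells exactly this way is an assumption.

```python
from collections import Counter

# Same design as data2 above: 2 fertilizers x 2 watering schedules, 100 rows.
fertilizer = ["A", "A", "B", "B"] * 25
watering = ["daily", "weekly"] * 50

# One cell per (fertilizer, watering) combination.
cell_counts = Counter(zip(fertilizer, watering))
smallest_cell = min(cell_counts.values())
REQUIRED_PER_CELL = 30  # threshold quoted in the warning above

for cell, count in sorted(cell_counts.items()):
    print(cell, count)
print("meets threshold:", smallest_cell >= REQUIRED_PER_CELL)
```

With 100 rows spread evenly over four cells, each cell holds 25 observations, which matches the "minimum 25" in the message.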
# Repeated measures ANOVA
#-------------------------
checker = ash.StatisticalTestAssumptions()
# Check assumptions
results = checker.check_assumptions(
data=data,
test_type='repeated_anova',
variables=['rm_time1', 'rm_time2', 'rm_time3'],
subject_column='group_bin'
)
recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Repeated Measures ANOVA.
# for MANOVA (Multivariate Analysis of Variance)
#---------------------------------------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
data=data,
test_type='manova',
variables=['manova_var1', 'manova_var2'],
group_col='group_bin'
)
recommendation = checker.get_recommendation(results)
print(recommendation)
⚠ Some assumptions for MANOVA are violated:
- Multivariate normality violated in group_1
- Number of dependent variables should ideally be greater than number of groups
Consider these alternatives:
- Separate univariate ANOVAs with Bonferroni correction
- Robust MANOVA
- Permutation MANOVA
- Non-parametric multivariate tests (e.g., NPMANOVA)
- Linear Discriminant Analysis
Correlation tests
# for Pearson correlation
# --------------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
data=data,
test_type='pearson_correlation',
variables=['cont_var1', 'cont_var2']
)
recommendation = checker.get_recommendation(results)
print(recommendation)
⚠ Some assumptions for Pearson Correlation are violated:
- Variable pair cont_var1_vs_cont_var2 may not have a monotonic relationship (Spearman correlation=0.07)
Consider these alternatives:
- Spearman rank correlation
- Kendall rank correlation
- Robust correlation methods
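The warning above quotes a Spearman coefficient as evidence about monotonicity. As background, the sketch below computes Spearman's coefficient as the Pearson correlation of ranks, using only the standard library. The function names are hypothetical, and this is not the package's implementation.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = list(range(1, 11))
y = [value ** 3 for value in x]  # monotonic but clearly non-linear
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

A perfectly monotonic relationship gives a Spearman coefficient of 1 even when it is not linear, whereas a value near 0 indicates no monotonic trend in either direction.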
# for spearman correlation
#----------------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
data=data,
test_type='spearman',
variables=['ordinal_data', 'cont_var2']
)
recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Spearman's Rank Correlation.
Chi-square independence
# for Chi-square test
# ---------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
data=data,
test_type='chi_square_independence',
variables=['cat_var1', 'cat_var2']
)
recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Chi-square test of independence.
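A widely used rule of thumb for the chi-square test of independence is that expected cell counts should be at least 5. Whether AssumptionSheriff applies exactly this rule is an assumption; the sketch below shows how expected counts follow from the row and column totals, with a hypothetical helper and made-up data.

```python
from collections import Counter


def expected_counts(rows, cols):
    """Expected cell counts under independence: row total * column total / n."""
    n = len(rows)
    row_totals, col_totals = Counter(rows), Counter(cols)
    return {
        (r, c): row_totals[r] * col_totals[c] / n
        for r in row_totals
        for c in col_totals
    }


# Illustrative data: 20 paired categorical observations.
answers = ["Yes"] * 12 + ["No"] * 8
levels = ["High"] * 5 + ["Low"] * 7 + ["High"] * 5 + ["Low"] * 3

expected = expected_counts(answers, levels)
all_at_least_five = all(value >= 5 for value in expected.values())
print(expected)
print("all expected counts >= 5:", all_at_least_five)
```

In this example the two "No" cells have an expected count of 4, so the rule of thumb is not met and an exact test would be the usual fallback.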
Regression
# For Logistic Regression
#------------------------
checker = ash.StatisticalTestAssumptions()
# Check assumptions
results = checker.check_assumptions(
data=data,
test_type='logistic',
variables=['cont_var1', 'cont_var2', 'cont_var3'],
dependent_var='logistic_outcome'
)
recommendation = checker.get_recommendation(results)
print(recommendation)
⚠ Some assumptions for Logistic Regression are violated:
- High multicollinearity detected for 'const' (VIF=58.70)
Consider these alternatives:
- Penalized regression (Ridge, Lasso)
- Decision trees
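Note that the flagged variable above is 'const', the intercept column, not one of the predictors. One plausible reading, assuming the package computes a VIF for every column of a design matrix that includes an intercept, is that the value reflects predictor means far from zero and not collinearity among the predictors. The sketch below illustrates the effect for a single predictor using the uncentered R² of a no-intercept regression; `vif_of_constant` is a hypothetical helper.

```python
def vif_of_constant(x):
    """VIF of an intercept column regressed on x alone, with no extra intercept.

    Uses the uncentered R^2 = (sum x)^2 / (n * sum x^2), which is what a
    no-intercept least-squares fit of a column of ones on x yields.
    """
    n = len(x)
    r_squared = sum(x) ** 2 / (n * sum(v * v for v in x))
    return 1.0 / (1.0 - r_squared)


raw = [40.0, 45.0, 50.0, 55.0, 60.0]   # mean far from zero
mean = sum(raw) / len(raw)
centered = [v - mean for v in raw]     # same spread, mean zero

print(vif_of_constant(raw), vif_of_constant(centered))
```

Centering the predictor drops the intercept's VIF to 1 without changing the relationships among predictors, which suggests checking the predictors' own VIF values before switching to an alternative model.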
# Multiple Linear Regression
# ----------------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
data=data,
test_type='multiple_regression',
variables=['cont_var1', 'cont_var2', 'cont_var3'],
dependent_var='logistic_outcome'
)
recommendation = checker.get_recommendation(results)
print(recommendation)
⚠ Some assumptions for Multiple Linear Regression are violated:
- Residuals are not normally distributed (Shapiro-Wilk p=0.0000)
- Non-linear relationship detected for predictor 'cont_var1'
- Non-linear relationship detected for predictor 'cont_var2'
- High multicollinearity detected for variables: ['const']
# for Poisson regression
#-----------------------------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
data=data,
test_type='poisson',
variables=[
'count_data', # dependent variable must be first
'cont_var1', # predictors follow
'cont_var2'
],
offset_var='exposure_time' # Optional; the sample DataFrame above has no 'exposure_time' column, so add one or omit this argument
)
recommendation = checker.get_recommendation(results)
print(recommendation)
⚠ Some assumptions for Poisson Regression are violated:
- High multicollinearity detected for variables: ['const']
Consider these alternatives:
- Negative Binomial Regression
- Zero-inflated Poisson Regression
- Zero-inflated Negative Binomial Regression
- Quasi-Poisson Regression
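Several of the alternatives listed above (Negative Binomial, Quasi-Poisson) are typically used when counts are overdispersed, that is, when the variance clearly exceeds the mean. A rough screen is the ratio of sample variance to sample mean, sketched below with made-up data. This ignores predictors, so it is cruder than a model-based dispersion check, and it is not presented as the package's method.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean; about 1 for Poisson-like counts."""
    return variance(counts) / mean(counts)


poisson_like = [2, 1, 3, 2, 0, 4, 2, 1, 3, 2]
overdispersed = [0, 0, 0, 1, 0, 12, 0, 0, 9, 0]

print(round(dispersion_index(poisson_like), 2))
print(round(dispersion_index(overdispersed), 2))
```

A ratio well above 1, as in the second sample, points toward one of the overdispersion-tolerant alternatives.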
Survival analysis
# for Kaplan-Meier
# ----------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
data=data,
test_type='kaplan_meier',
variables=['time_to_event', 'event_occurred'], # time variable first, event variable second
group_col='group_bin' # optional grouping variable
)
recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Kaplan-Meier survival analysis.
# for Cox Proportional Hazards
# -----------------------------
checker = ash.StatisticalTestAssumptions()
cox_results = checker.check_assumptions(
data=data,
test_type='cox_ph',
variables=['time_to_event', 'event_occurred'], # time variable first, event variable second
group_col='group_bin'
)
recommendation = checker.get_recommendation(cox_results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Cox Proportional Hazards Regression.
Non-parametric tests
# for Wilcoxon Signed-Rank Test
#-------------------------------
checker = ash.StatisticalTestAssumptions()
results = checker.check_assumptions(
data=data,
test_type='wilcoxon_signed_rank',
variables=['paired_pre', 'paired_post']
)
recommendation = checker.get_recommendation(results)
print(recommendation)
✓ All assumptions are met. You can proceed with the Wilcoxon Signed-Rank Test.
Common issues and solutions
1. Handling missing data
AssumptionSheriff automatically handles missing data in most cases. However, for best results:
- Remove or impute missing values before checking assumptions
- Ensure complete cases for paired tests
- Document any data preprocessing steps
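For paired tests, "complete cases" means keeping only the subjects for whom both measurements are present. A minimal standard-library sketch, with a hypothetical helper name and made-up values:

```python
from math import isnan


def complete_pairs(pre, post):
    """Keep only positions where both measurements are present (not None/NaN)."""

    def present(value):
        return value is not None and not (isinstance(value, float) and isnan(value))

    return [(a, b) for a, b in zip(pre, post) if present(a) and present(b)]


pre = [10.2, None, 9.8, 11.1, float("nan")]
post = [9.1, 8.7, float("nan"), 10.4, 9.9]

pairs = complete_pairs(pre, post)
print(f"kept {len(pairs)} of {len(pre)} pairs:", pairs)
```

With a pandas DataFrame, the same effect is usually achieved with `data.dropna(subset=['paired_pre', 'paired_post'])` before calling the checker.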
2. Dealing with outliers
When outliers are detected:
- Review them for data entry errors
- Consider robust statistical methods
- Document justification for outlier handling
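To shortlist values for the review step above, one common convention is Tukey's rule, which flags points more than 1.5 interquartile ranges beyond the quartiles. The sketch below uses the standard library's default quartile method; the rule that AssumptionSheriff itself applies is not stated here, so treat this as one possible screen and not the package's definition of an outlier.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


measurements = [9, 10, 10, 11, 11, 12, 12, 13, 40]
outliers = iqr_outliers(measurements)
print(outliers)
```

Flagged values are candidates for inspection, not automatic removal.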
3. Small sample sizes
For small samples:
- Consider non-parametric alternatives
- Use exact tests when available
- Be cautious with assumption violations
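As one example of an exact, non-parametric option for small paired samples, the sign test computes its p-value directly from the binomial distribution, with no large-sample approximation. The sketch below is a standard-library illustration with made-up differences; the sign test is not among the tests the package lists as supported.

```python
from math import comb


def exact_sign_test(differences):
    """Two-sided exact sign test p-value for paired differences (zeros dropped)."""
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    positives = sum(d > 0 for d in nonzero)
    k = min(positives, n - positives)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


differences = [-1.2, -0.8, -1.5, 0.3, -0.9, -1.1, -0.4, -2.0]
p_value = exact_sign_test(differences)
print(p_value)
```

Because it uses only the signs of the differences, the test makes very weak assumptions but has less power than the paired t-test or Wilcoxon signed-rank test when their assumptions hold.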