An autonomous data-analysis agent that red-teams its own conclusions and reports what it cannot prove.

statskeptic

A data-analysis agent that red-teams its own conclusions.

Give it a dataset and a question. statskeptic profiles the data, picks a vetted statistical method, runs it, and then turns on the result: it attacks its own analysis against a methodological rubric (assumption violations, multiple comparisons, confounding, underpowered samples, data leakage, outlier sensitivity), revises what it can, and reports what the data shows and what it cannot conclude.

Two rules make it different from the fluent-but-wrong tools it competes with:

  1. The model never produces a number. Every statistic comes from real, tested code (scipy / statsmodels) and ships with the exact call that produced it, so any figure can be re-run and checked. statskeptic selects methods and interprets them; it does not invent them.
  2. "Cannot conclude" is a success state. Over-claiming is the cardinal sin here. When the data does not support a reliable answer, statskeptic says so plainly, and a non-zero exit code lets a pipeline act on it.
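The first rule can be pictured as a small wrapper that records the call alongside the value it returned. This is a minimal sketch, not statskeptic's internals: `TracedStat` and `traced` are hypothetical names, and the standard library's `statistics.mean` stands in for a scipy / statsmodels routine.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Callable


@dataclass(frozen=True)
class TracedStat:
    """A number paired with the call that produced it (hypothetical type)."""
    value: float
    call: str


def traced(func: Callable[..., float], *args: Any, **kwargs: Any) -> TracedStat:
    """Run a tested routine and keep a re-runnable description of the call."""
    parts = [repr(a) for a in args] + [f"{k}={v!r}" for k, v in kwargs.items()]
    call = f"{func.__module__}.{func.__name__}({', '.join(parts)})"
    return TracedStat(value=func(*args, **kwargs), call=call)


# statistics.mean is a stand-in for a vetted statistical routine.
stat = traced(mean, [12.0, 15.0, 18.0])
print(stat.value)  # 15.0
print(stat.call)   # statistics.mean([12.0, 15.0, 18.0])
```

Because the call string is built from the same arguments that were passed to the routine, the reported number and its provenance cannot drift apart.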

A trap a naive tool walks into

examples/skewed_trial.csv is a two-arm trial where recovery time is heavily right-skewed and there is no real difference between the arms. Point a tool that reaches straight for a t-test at it and you get a confident false positive: p = 0.014, "significant," ship it.

```
$ statskeptic analyze examples/skewed_trial.csv -q "Does the drug reduce recovery hours?"

## Mann-Whitney U
comparing 'recovery_hours' across 'arm': two groups, so a t-test is the usual first pass

- Result: U = 814, p = 0.110 (not significant at alpha=0.05)
- Effect: rank_biserial_r = -0.196
- location shift (drug - placebo): 95% CI [-17.5, 1.1]
- n = 90

### Revisions
- Switched from Student's t-test to Mann-Whitney U (assumption.normality): data is
  non-normal; the rank-based test is valid here. p 0.014 -> 0.110.

### Objections raised
- None outstanding.

## What this cannot conclude
- Nothing beyond the assumptions and caveats noted above.
```

statskeptic planned the same t-test a careful analyst would reach for first. Its normality check then fired, the revision loop switched to the rank-based test, and the "significant" result evaporated. The audit trail records the switch and the p-value before and after, so the false positive never leaves the building.
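The effect size in that report can be cross-checked by hand from U. The sketch below uses the common definition r = 2U / (n1 * n2) - 1 and rests on two assumptions that the report does not state: that the arms are balanced at 45 each (it gives only n = 90), and that U = 814 is counted for the drug arm. The helper names are illustrative, not statskeptic's API.

```python
from typing import Sequence


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """U for sample x: pairs where x exceeds y, with ties counted as half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def rank_biserial(u: float, n1: int, n2: int) -> float:
    """Rank-biserial correlation from U; ranges from -1 to 1."""
    return 2.0 * u / (n1 * n2) - 1.0


# Tiny illustration on made-up values (not taken from skewed_trial.csv).
drug = [3.0, 5.0, 8.0]
placebo = [4.0, 6.0, 9.0, 10.0]
u = mann_whitney_u(drug, placebo)
print(u, rank_biserial(u, len(drug), len(placebo)))  # 3.0 -0.5

# Cross-check of the report, ASSUMING 45 per arm and U counted for the drug arm.
print(round(rank_biserial(814, 45, 45), 3))  # -0.196
```

Under those assumptions the arithmetic reproduces the reported rank_biserial_r, which is the kind of re-check the audit trail is meant to make possible.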

What it catches

Each objection is grounded in the actual numbers and carries a concrete remedy. Some are fixed automatically by re-running; others can only be flagged, and those push the verdict toward "cannot conclude."

| Objection | What fires it | What statskeptic does |
| --- | --- | --- |
| Non-normality | Shapiro plus a real skew magnitude, not a trivial deviation | switch to the rank test (Mann-Whitney, Kruskal-Wallis, Spearman) |
| Unequal variance | Levene on a pooled-variance t-test | switch to Welch's t-test |
| Sparse contingency cells | expected counts below Cochran's threshold | switch a 2x2 to Fisher's exact test |
| Multiple comparisons | many tests run against one outcome | apply a Holm correction and re-read significance |
| Confounding | a causal question on observational data | name a candidate confounder; refuse the causal claim |
| Low power | a non-significant result where only a large effect was detectable | report the minimum detectable effect; refuse to read "no effect" |
| Data leakage | an identifier used as a predictor | drop it and re-fit |
| Outlier sensitivity | dropping extreme points flips significance | switch to a rank-based test |
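To make the multiple-comparisons remedy concrete, here is a minimal pure-Python sketch of the Holm step-down adjustment. It illustrates the technique only; `holm_adjust` is a hypothetical helper, the p-values are made up, and statskeptic's own routine may be implemented differently.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values grow.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up p-values for illustration, not the biomarker_screen.csv results.
raw = [0.001, 0.04, 0.03, 0.20]
adj = holm_adjust(raw)
print([round(p, 3) for p in adj])  # [0.004, 0.09, 0.09, 0.2]
print([p < 0.05 for p in adj])     # [True, False, False, False]
```

Three of the four raw p-values fall below 0.05, but only one survives the correction, which is the "re-read significance" step in miniature.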

The vetted toolset covers two-group comparisons (Student's t, Welch, Mann-Whitney), k-group comparisons (one-way ANOVA, Kruskal-Wallis), association (Pearson, Spearman, chi-square, Fisher's exact), and regression (OLS, logistic). Each routine reports an effect size and, where one is defined, a confidence interval, and lists the assumptions it checked against your data.

Install

```
git clone https://github.com/Burton-David/statskeptic
cd statskeptic
pip install -e .
```

Python 3.10 or newer. The core needs no API key and makes no network calls.

Usage

```
statskeptic analyze data.csv --question "Does the treatment change recovery?"
```

The reader detects the file's dialect (delimiter, quoting, encoding) with CleverCSV, so semicolon-delimited, tab-delimited, or non-UTF-8 files load as the table they actually are rather than a single mangled column, and infinities are treated as missing data.
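The idea of dialect detection can be illustrated with the standard library. The sketch below deliberately swaps in `csv.Sniffer` for CleverCSV (a simpler and less robust detector) and handles only the delimiter, not encoding. `read_table` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io
import math
from typing import List, Optional


def _is_infinite(cell: str) -> bool:
    """True when the cell parses as +inf or -inf."""
    try:
        return math.isinf(float(cell))
    except ValueError:
        return False


def read_table(text: str) -> List[List[Optional[str]]]:
    """Sniff the delimiter, then replace infinite cells with None (missing)."""
    dialect = csv.Sniffer().sniff(text, delimiters=";,\t")
    rows = []
    for row in csv.reader(io.StringIO(text), dialect):
        rows.append([None if _is_infinite(cell) else cell for cell in row])
    return rows


# A semicolon-delimited sample with infinities in the numeric column.
sample = "arm;recovery_hours\ndrug;12.5\nplacebo;inf\ndrug;-inf\n"
for row in read_table(sample):
    print(row)
```

Read with a hard-coded comma delimiter, the same text would come back as one column per row, which is the failure the dialect check is there to prevent.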

Options:

  • --json emits the full typed report, every number traceable to its computation.
  • --outcome, --group / --by, --predictors name columns when the question is ambiguous (the planner declines rather than guess).
  • --alpha sets the significance level (default 0.05).
  • --quiet suppresses the report body and returns only the exit code.

Exit codes make it scriptable as a gate:

| code | meaning |
| --- | --- |
| 0 | a defensible result (with caveats counts as defensible) |
| 2 | the data cannot support a reliable answer |
| 3 | the question does not map to a vetted method |
| 64 | usage error (bad flags, missing file, unknown column) |
| 70 | a statistical routine failed and the cause is reported, not hidden |
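A pipeline step might use those codes roughly as follows. This is a sketch under assumptions: `gate` and `describe_exit` are hypothetical helpers, the command line uses only the flags listed above, and it presumes the `statskeptic` executable is on the PATH.

```python
import subprocess
from typing import Dict

# Meanings copied from the exit-code table.
EXIT_MEANINGS: Dict[int, str] = {
    0: "defensible result",
    2: "data cannot support a reliable answer",
    3: "question does not map to a vetted method",
    64: "usage error",
    70: "a statistical routine failed",
}


def describe_exit(code: int) -> str:
    """Translate an exit code into the meaning documented in the table."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def gate(path: str, question: str) -> bool:
    """Return True only when statskeptic reports a defensible result."""
    result = subprocess.run(
        ["statskeptic", "analyze", path, "--question", question, "--quiet"]
    )
    print(describe_exit(result.returncode))
    return result.returncode == 0


print(describe_exit(2))  # data cannot support a reliable answer
```

Treating only 0 as a pass keeps "cannot conclude" (2) and "declined" (3) distinct from crashes (70), so a pipeline can react to each differently.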

As a library:

```python
from statskeptic import analyze

report = analyze("data.csv", "Does exercise cause better health?")
print(report.explain())     # markdown
report.to_json()            # the full typed report
report.verdict              # defensible / defensible_with_caveats / cannot_conclude / declined
```

Try the planted-trap corpus

examples/ ships five datasets, each with one planted flaw, generated by a seeded script so the numbers above are reproducible (python examples/make_demo_data.py). skewed_trial.csv is the one walked through above; the other four run as follows:

```
statskeptic analyze examples/biomarker_screen.csv  -q "Which markers are associated with the outcome?"
statskeptic analyze examples/exercise_health.csv   -q "Does more exercise cause a better health score?"
statskeptic analyze examples/small_trial.csv       -q "Does the treatment change the test score?"
statskeptic analyze examples/clean_ab_test.csv     -q "Does the variant change order value?"
```

The biomarker screen finds 4 markers significant at p<0.05, then a Holm correction across the 24 tests leaves only the one real signal standing. The exercise question reports a strong correlation and still refuses to call it causal, naming age as the likely confounder. The small trial returns "cannot conclude": at nine per arm, only a large effect was ever detectable. The clean A/B test returns a plain, defensible yes.
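The small-trial verdict can be sanity-checked with a back-of-the-envelope power calculation. The sketch below uses the normal approximation for a two-sided, two-sample comparison. The 80% power target and the use of a standardized difference (Cohen's d) are assumptions made here, and the source does not say how statskeptic computes its minimum detectable effect. For samples this small the approximation understates the true value somewhat.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_d(n_per_arm: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest standardized mean difference detectable at the given power
    (normal approximation, equal arm sizes, two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = z.inv_cdf(power)          # quantile for the target power
    return (z_alpha + z_power) * sqrt(2 / n_per_arm)


print(round(minimum_detectable_d(9), 2))    # 1.32
print(round(minimum_detectable_d(100), 2))  # 0.4
```

At nine per arm the detectable difference is about 1.3 standard deviations, well beyond the 0.8 that Cohen's convention calls a large effect, so a non-significant result there says little about whether a smaller effect exists.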

Honest limits

  • Causal critique is a flag, not an engine. statskeptic names a candidate confounder and declines the causal claim; it does not estimate causal effects.
  • The rule-based planner maps a question to a method by keywords and column structure. It declines ambiguous questions rather than guess, so you may need --outcome / --group to point it at the right columns.
  • Independence is assumed and stated, not tested. It is a property of the study design, which the data alone cannot reveal.
  • An optional LLM critic (for context-specific objections the static rubric cannot encode) and clinical / financial domain packs are planned extensions, not yet shipped. The check registry and the planner are built as the seams for them.

License

MIT.
