Skip to main content

Find unit tests that give false positives: green tests that protect nothing, and tests that pass while asserting the wrong expected value.

Project description

falsegreen

CI PyPI Python License: MIT

Unit-test quality, focused on one failure: the false positive. falsegreen finds Python/pytest tests that pass green without protecting anything, and tests that pass while asserting the wrong value.

A green test that never fails when the code breaks is worse than no test: it tells you a broken program is safe. AI coding assistants produce these in volume. The tool catches them in two layers.

The scanner is a zero-dependency AST pass that runs on every commit. It validates each test against more than twenty mechanical smells, the ones a parser can prove: an assertion that never runs, a check that is empty or always true, a swallowed exception, a mock of the unit under test, an assertion stranded in dead code, a weak truthiness check, an async test that never awaits. High-confidence findings block the commit; the rest warn. The Claude Code skill then does the part a parser cannot: it reads the production code and judges whether each test asserts the right value, measured against the intended behavior rather than the current (possibly buggy) output.

The checks are grounded in the rotten-green-test research (Soares 2023; Delplanque et al., ICSE 2019) and cross-walked against the published test-smell catalog. See CREDITS.md.

Live on PyPI: pip install falsegreen. Also a pre-commit hook and a Claude Code plugin (see the three install paths below).


Table of contents


Why this exists

Coverage tells you which lines ran. It does not tell you whether anything was checked. A suite can report 90% coverage while a chunk of those tests assert nothing real, so the green bar is a comfort, not a guarantee. The danger is not a test that fails. It is a test that passes when it should not.

falsegreen targets that single, high-cost problem: the false positive. It is not a general style linter and it is not a coverage tool. It answers one question per test: is there a way for the code to be wrong and this test to stay green? If the answer is yes, the test is not protecting what it claims.

This matters more now that a large share of tests are written by AI assistants. The common machine-written failure modes are exactly the ones here: assertion roulette, mocking the unit under test, asserting the value you fed the mock, copying the expected value from the current output. They look thorough and they pass. falsegreen is the second reader that asks whether they actually verify anything.


The methodology and its basis

One rule sits under everything: a test is only useful if it fails when the code breaks. If you have never seen a test go red, you do not know that it tests anything. Every pattern falsegreen flags is a variation on tests that never fail, fail for the wrong reason, or check the wrong thing.

The patterns are organized into five families:

  • A. The test never checks anything. The assertion is skipped, missing, or swallowed, or the test is never even collected by the runner.
  • B. The check exists but is weak or always true. It accepts almost any output, or it is true by construction.
  • C. The test checks itself, not the program. It mocks the thing under test, asserts the value it configured, or re-implements the production logic as its own answer key.
  • D. Green depends on outside factors. Time, randomness, test order, a fixed sleep.
  • E. The test passes but checks the wrong thing. The assertion runs and compares a real result, yet the expected value contradicts what the code should do, so the test freezes a bug.

Family E is the reason the project has a semantic layer, and it carries the core principle of the methodology: the expected value must come from an independent oracle, never from the code's current output. The oracle hierarchy, highest first, is: explicit spec or requirement, documented contract (docstring, types, API doc), independent human judgment, and only last the current code. Code is the lowest-priority oracle. Promoting it above the others is how you end up rubber-stamping a bug.

Because intent changes which oracle is authoritative, the semantic pass first classifies the test: a spec/TDD test (the test is the authority), a characterization test (it intentionally freezes current behavior), a regression test for a known bug, or a plain behavior test. A red TDD test is not a false positive, and a labeled characterization snapshot is not a frozen bug. That classification step keeps the tool from flagging legitimate styles.

The plain-language guide behind every case, with a real-world analogy and a before/after for each, is in docs/guide.md. The detection reference that maps each code to its scanner code and to established tooling is in skills/falsegreen/reference.md.

The basis is the rotten-green-test research: a passing test that holds an assertion which never runs (Elvys Soares, A Multimethod Study of Test Smells, 2023; Delplanque et al., Rotten Green Tests, ICSE 2019). The catalog of patterns is cross-walked against the published test-smell catalog and against PyNose, the Python test-smell detector. Mutation testing (mutmut, cosmic-ray) is the honest measure of whether a suite actually fails when the code is wrong. Every reference, and the specific thing falsegreen took from each one, is in CREDITS.md.


What it validates, how, and why

The catalog has 18 named cases across the five families, and the scanner now ships 21 codes (the five families are the scanner-facing view; the semantic pass asks the same questions as six judgments, J1 to J6, which is the LLM-facing view of the same thing, mapped code by code in reference.md). A case is caught either by the deterministic scanner (a code like C5) or only by the semantic pass (it needs to read the production code). HIGH-confidence scanner findings block a commit; LOW ones warn.

# Case Why it fools you Detected by Conf
1 assert inside an if/for that may not run the check silently skips C1 LOW
2 Test with no assertion at all proves only that it did not crash C2/C2b HIGH/LOW
3 assert inside a try whose except swallows it the failure is discarded C3 HIGH
4 Test the runner never collects it vanishes from the count C4/C4b HIGH/LOW
5 Always-true check (assert True, non-empty tuple, or True) passes by construction C5 HIGH
6 Weak check (truthiness, len>0, "x" in str(...)) accepts almost anything C6 LOW
7 Compares a value to itself true by construction C7 HIGH
8 Exact equality on a float fails on rounding, not on bugs C8 LOW
9 pytest.raises(Exception) too broad accepts the wrong error too C9 LOW
10 Mocks the unit under test tests the mock, not the code semantic -
11 Asserts the value fed to the mock an echo, not a result semantic -
12 Re-implements the production formula both agree on the same wrong number semantic -
13 Mock assertion misspelled / not called becomes a no-op that always passes C13/C13b HIGH/LOW
14 Golden/snapshot written from the output records today's bug as correct C14 LOW
15 Passes only if another test ran first borrowed state semantic -
16 Depends on time, randomness, or a sleep passes or fails by luck C16 LOW
17 skip inside a broad except turns red into yellow, hides the defect C17 HIGH
18 Expected value contradicts what the code should do freezes a bug as "correct" semantic -

Five codes added after the original eighteen, each in an existing family:

Code Pattern Why it fools you Family Conf
C18 Compares str()/repr()/an f-string to a literal checks formatting, not the value B LOW
C19 pytest.raises block wraps more than one call an earlier line raises, the target is never reached A LOW
C20 assert in dead code after return/raise/fail() the check never runs A HIGH
C21 Every assert is conditional, none runs unconditionally a false condition passes the whole test A LOW
C22 async test asserts but never awaits the unit the assertion checks an un-awaited coroutine A off by default

(CC, a commented-out assert, is also flagged LOW.)

How the scanner detects. It parses each test file with Python's ast module and inspects the tree. It never imports or runs the test, so a malicious or broken test cannot execute through it. Detection is structural: an assert whose expression is a constant, both sides of a comparison being AST-identical, a pytest.raises whose argument is Exception, a mock-named receiver with a no-parentheses assert_called_once, a Test* class with an __init__, and so on. Precision is the priority for HIGH codes, because they block commits: each one is stress-tested against legitimate look-alikes (optional-dependency skips, abstract base test classes, @patch-injected mocks, exact-count len(x) == N) and stays quiet on them.

How the semantic pass detects. Cases 10, 11, 12, 15, and 18 cannot be proven by structure. A parser sees a mock but cannot tell whether it replaced an edge (network, disk, clock) or the thing under test. It sees an arithmetic expression but cannot tell whether the expected value was derived independently or copied from the code. The /falsegreen skill reads the production code, derives the intended behavior from the oracle hierarchy, compares it against what the test asserts, and when they disagree, names which side is wrong. It is told to favor precision over recall and to ground a verdict in a cited contract line, never in the code's current output alone.

Why two confidence levels. A blocking gate that cries wolf gets disabled. So only near-certain, mechanically-unambiguous patterns are HIGH (they block). The rest are LOW (they warn) and are starting points for human or semantic judgment, not verdicts.

How falsegreen is validated

A tool that flags tests for not protecting anything has to show it protects something itself. The two layers are validated differently, because they have different natures.

  • The scanner (deterministic). Every rule ships with two tests: one proving it fires on the bad pattern, and at least one proving it stays quiet on a legitimate look-alike. The second matters more. The scanner also runs on its own source on every commit, the self-scan, because the false-positive detector is not allowed to contain one. And it is validated against real-world Python projects (Python web and service codebases with large test suites), where it has found and fixed its own false positives before they shipped. The most recent pass ran the scanner over 8 projects, each above 200 GitHub stars with 500 or more test functions (httpx, starlette, flask, fastapi, django-rest-framework, aiohttp, sanic, werkzeug). That pass surfaced about 47 HIGH-confidence false positives in two rule classes (C7 on deliberate __eq__ tests, C4 on test-named route handlers and local callbacks). Both were fixed, each with a fires-on-bad and a stays-clean regression test, and a re-scan brought the HIGH count to 0 across all 8 projects. Each false positive is recorded as it is fixed, with its regression tests, in the commit history and the CHANGELOG.
  • The semantic pass (LLM, any language). Cross-language coverage runs through this pass, so its reliability is measured, not assumed. The validation is a benchmark corpus: tests planted with a known ground truth, a test that mocks the unit under test, one that copies the expected value from current output, one that re-implements the production formula, in Python and in other languages, scored for precision and recall with precision held above recall. Because the pass runs on an LLM it is non-deterministic, so this is a periodic skill-validation artifact, not a CI gate. The first labeled corpus has 24 Python cases (10 rotten, 14 sound) across cases 10, 11, 12, and 18, with sound look-alikes and plain controls. Run blind on a small model (Claude Haiku), the pass scored precision 1.00 (no false alarms on the 14 sound tests), recall 0.70, and 1.00 recall on the clear-cut smells; the only misses were borderline cases (a pure-delegation passthrough, a trivial one-operator formula) where the precision-first guardrail defers to "sound". That is the evidence behind the design claim that a small model is enough for a precision-first semantic pass. The number to grow is recall: a larger corpus, a second annotator, and multi-vote runs are the next step. A second corpus of 20 TypeScript cases (Jest/Vitest) reproduced the pattern: precision 1.00, recall 0.625, with the only misses being the same boundary cases, evidence that the pass carries across languages and frameworks, not just Python.

What falsegreen takes from the research, and what it leaves out

The detection logic is a deliberate subset of the test-smell literature, not a copy of it. falsegreen keeps the smells that make a passing test ineffective and drops the rest on purpose. The specific debt to each work, with citations, is in CREDITS.md; the short version:

  • From the rotten-green-test work (Soares 2023; Delplanque et al., 2019): the core definition (a passing test with an assertion that never runs) and the rules that follow from it, C1, C20, C21, plus the AAA Assert framing.
  • From the test-smell catalog: the semantic-logic slice, reorganized as the six judgments. The other seven categories (naming, size, duplication, performance, manual tests) stay out.
  • From PyNose: the confirmation that the unittest/xUnit dialect is a real gap worth covering. PyNose's maintainability smells (Assertion Roulette, Redundant Print, Lack of Cohesion, Test Maverick) stay out.
  • From the LLM studies (Agentic LMs, SBES 2025; Santana Jr. et al., 2025): the method, an LLM agent that detects and proposes fixes, validated by a coverage-and-mutation check. Santana Jr. et al. found standalone LLM refactoring drops coverage and recommend exactly that validating agent, which is why the AI-fix path here is gated, not trusted.

Why leave the rest out: a cloned test, a badly named test, an Eager Test, an Assertion Roulette still goes red when the code is wrong. They cost maintainability, not protection. Flagging them as false positives would itself be a false positive, and precision over recall is the whole point. ruff's PT rules and PyNose cover that maintainability layer well; run them alongside falsegreen.


The two layers

Layer What it is When it runs Catches
Scanner Zero-dependency AST analysis (Python/pytest), one self-contained module CLI, CI, pre-commit the mechanical patterns (21 codes)
Semantic pass A Claude Code skill (/falsegreen) that reads the code on demand, in Claude Code the bug-freezing patterns no static tool can see (cases 10/11/12/15/18)

The scanner is the fast, deterministic pre-filter. It overlaps in part with ruff's PT rules and with research tools like PyNose, and that overlap is fine: run them together. The semantic pass is the part nobody else ships, and it is the reason the project exists.

The semantic pass runs on whatever Claude model your Claude Code session uses. It is not pinned to one model, and it does not need a frontier one: the research it draws on (Agentic LMs, SBES 2025; Santana Jr. et al., 2025) shows that small, locally-runnable models detect and refactor these patterns well. The value is in the protocol, not in any single model.


Download and use: the three ways

Pick one or combine them. The CLI and pre-commit need no Claude Code; the skill adds the semantic pass on top.

1. As a Python package (CLI, no skill needed)

Install from PyPI:

pip install falsegreen

Run it:

falsegreen                      # scan the current directory
falsegreen tests/               # scan a folder or a single file
falsegreen --staged             # only the test files staged in git
falsegreen --format sarif       # text (default) | json | sarif | junit
falsegreen --summary            # one-line "N scanned, M flagged" to stderr
falsegreen --output report.sarif  # write the formatted output to a file
falsegreen --json               # alias for --format json
falsegreen --disable C6,C2b     # turn specific codes off

--format sarif emits SARIF 2.1.0 (HIGH -> error, LOW -> warning) for GitHub code scanning / PR annotations; --format junit emits JUnit XML (HIGH -> <failure>, LOW -> <skipped>) for CI dashboards.

python -m falsegreen ... is equivalent to the falsegreen command. Exit codes: 0 clean, 10 low-confidence findings only, 20 at least one high-confidence finding. Wire those into any CI step. No third-party runtime dependencies; Python 3.8+.

Try it on the bundled demo (one bad test per case):

pipx run falsegreen skills/falsegreen/examples/bad_tests_sample.py

2. As a pre-commit hook

This is the standard, version-pinned way to gate every commit. Add to your .pre-commit-config.yaml:

  - repo: https://github.com/vinicq/falsegreen
    rev: v0.1.0
    hooks:
      - id: falsegreen

Then pre-commit install. On each commit it scans the staged test files. HIGH-confidence findings block the commit. Bypass once with git commit --no-verify, or set FALSEGREEN_BLOCK=0 in the environment to make the hook warn-only.

If you do not use the pre-commit framework, install a raw git hook instead:

python -m falsegreen.hook_install --repo .      # install
python -m falsegreen.hook_install --uninstall   # remove

3. As a Claude Code skill (the semantic pass)

Install the plugin:

/plugin marketplace add vinicq/falsegreen

Then, in a Claude Code session, run:

/falsegreen

against a diff or a module. The skill triages the scanner output first, then does the semantic work: for each test it finds the unit under test, derives the intended behavior from the oracle hierarchy, and reports tests that pass while asserting the wrong thing, with the cited evidence and a concrete fix. It is read-only by default (it proposes fixes, it does not edit your tests unless you ask).

The scanner is bundled inside the skill, so the plugin works on its own. On another Agent Skills client that does not define ${CLAUDE_SKILL_DIR}, install the package (pip install falsegreen) and the skill falls back to the CLI.

With the skill vs without the skill

  • Without the skill (CLI / pre-commit / CI): you get the deterministic scanner. It catches the 16 mechanical codes and blocks commits on the high-confidence ones. This is everything a non-Claude-Code user needs and runs anywhere Python runs.
  • With the skill (/falsegreen in Claude Code): you additionally get the semantic pass, which catches the five code-aware cases (10, 11, 12, 15, 18), including the headline one: a test that is green while its expected value contradicts the spec. No static tool, this one included, can find that on its own.

Configuration

  • Inline suppression: add # falsegreen: ignore to silence every code on a line, or # falsegreen: ignore[C8] to silence one.
  • Disable codes globally: --disable C6,C2b.
  • Environment: FALSEGREEN_BLOCK=0 makes the pre-commit hook warn instead of block.

Project config file

Put a [tool.falsegreen] table in pyproject.toml, or a flat .falsegreen.toml at the repo root (the .falsegreen.toml wins if both exist):

[tool.falsegreen]
disable = ["C13b"]          # turn these codes off everywhere
exclude = ["tests/legacy/*"] # skip files matching these globs

[tool.falsegreen.severity]
C8 = "high"                  # promote: now blocks the commit (exit 20)
C6 = "off"                   # same as adding C6 to disable

severity values are high, low, or off. Precedence, highest first: --disable on the CLI, then the inline # falsegreen: ignore, then this config, then the built-in default. Point at a specific file with --config PATH. The config reader uses the standard library on Python 3.11+ and tomli on older versions; on 3.8 without tomli it is a silent no-op.

Baseline (adopt on a legacy repo without blocking day one)

Record the findings you already have, then fail only on new ones:

falsegreen --write-baseline tests/   # writes .falsegreen-baseline.json, exits 0
falsegreen --baseline tests/         # suppresses the recorded findings, fails on new

A finding is fingerprinted by its relative path, code, detail, and the normalized source line, not by line number, so prepending or moving code does not re-trigger a baselined finding. Commit .falsegreen-baseline.json and the ratchet only tightens. Both flags default to .falsegreen-baseline.json; pass a path to override.


Technologies used

  • Python 3.8+, standard library only at runtime. The scanner uses ast for parsing, argparse for the CLI, json for machine output, and subprocess only to ask git for staged files.
  • Zero runtime dependencies. Dev and test use pytest and ruff (installed via pip install -e ".[dev]").
  • Packaging: hatchling build backend, SPDX license metadata (PEP 639), console entry point, distributed on PyPI.
  • Distribution: a pre-commit hook (.pre-commit-hooks.yaml) and a Claude Code plugin following the Agent Skills open standard (SKILL.md plus a .claude-plugin/ marketplace manifest).
  • CI: GitHub Actions across Python 3.8 / 3.11 / 3.13, running ruff, pytest, a self-scan (the tool must stay clean on its own code), and a drift-check that the bundled scanner copy matches the package byte for byte.

How it compares

  • ruff / flake8-pytest-style - mature, fast lint rules. Overlaps on broad raises (PT011) and assert-in-except (PT017). Run both. falsegreen adds uncollected tests, always-true asserts, self-comparison, mock typos, and the semantic pass.
  • PyNose / pytest-smell / TEMPY - test-smell catalogs from research. Broader taxonomy, but no commit gate and no oracle-correctness check.
  • mutmut / cosmic-ray - mutation testing, the most honest measure of whether a green suite fails when the code is wrong. Complementary and heavier. falsegreen is the cheap pre-filter you run on every commit; mutation testing is the deep audit you run on the suites that matter.

The defensible gap: nobody else combines a deterministic commit gate with a code-as-evidence semantic pass aimed at oracle correctness (cases 12 and 18).


Project layout

falsegreen/
  src/falsegreen/scanner.py        the deterministic scanner (canonical)
  src/falsegreen/hook_install.py   raw git-hook installer
  skills/falsegreen/
    SKILL.md                       the semantic-pass protocol
    reference.md                   the 18-case detection rubric
    scripts/scan.py                bundled scanner (kept identical to the package)
    examples/bad_tests_sample.py   one bad test per case (demo + regression)
  docs/guide.md                    plain-language guide to every case
  tests/test_scanner.py            the scanner's own tests
  .pre-commit-hooks.yaml           pre-commit integration
  .claude-plugin/                  plugin + marketplace manifests
  pyproject.toml                   packaging

Contributing, security, license

  • CONTRIBUTING.md - dev setup, how to add a detection rule (the four places it touches), the false-positive policy, Conventional Commits.
  • SECURITY.md - how to report a vulnerability privately.
  • CODE_OF_CONDUCT.md - Contributor Covenant 2.1.
  • CREDITS.md - the research falsegreen builds on (the Soares rotten- green work, PyNose, the test-smell catalog, and the agentic-LLM refactoring study), with author credit.
  • License: MIT, see LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

falsegreen-0.2.0.tar.gz (99.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

falsegreen-0.2.0-py3-none-any.whl (34.7 kB view details)

Uploaded Python 3

File details

Details for the file falsegreen-0.2.0.tar.gz.

File metadata

  • Download URL: falsegreen-0.2.0.tar.gz
  • Upload date:
  • Size: 99.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for falsegreen-0.2.0.tar.gz
Algorithm Hash digest
SHA256 2d8a34b80d181ea256d667a4c2566e6ab4b681cf6a50a2c393ba911db0da8efd
MD5 c3dc6c6d17eaa7021b3c3574249c0cfd
BLAKE2b-256 31b50638367ad37715c88cc1947cc324d248a43559904df7f9617e04aba079fc

See more details on using hashes here.

Provenance

The following attestation bundles were made for falsegreen-0.2.0.tar.gz:

Publisher: release.yml on vinicq/falsegreen

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file falsegreen-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: falsegreen-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 34.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for falsegreen-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0f9fe7879c2e31f5e29ef2b6ede0164a53140bfbbe49b3e22c26f5812e08f26f
MD5 2950e7956d488d84f5c99e58cd00f649
BLAKE2b-256 3b4ffa94b98bee1ab34455c3d39a3ce65557c600522c60f9f3cdb3be64c52319

See more details on using hashes here.

Provenance

The following attestation bundles were made for falsegreen-0.2.0-py3-none-any.whl:

Publisher: release.yml on vinicq/falsegreen

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page