Rule-first three-filter pipeline for cleaning unit-test training data, packaged as four SKILL.md skills.

CleanTest-Agent

Strip noise from unit-test training data in seconds, not hours --- with rules where rules win, and an LLM only where it actually helps.

What it does

CleanTest-Agent takes a CSV of (focal_method, test_case) pairs --- the canonical format used by Methods2Test, ATLAS, and most modern test-generation benchmarks --- and removes the noisy ones, leaving only the samples that actually teach a model how to write good tests.

It does this through three composable filter skills, coordinated by a fourth orchestrator skill. Each one follows the SKILL.md protocol, so they drop directly into CodeBuddy, Claude Code, or Cursor:

Syntax filter --- AST parsing with tree-sitter plus an Aho-Corasick automaton over a 21,954-pattern annotation dictionary.
Relevance filter --- AST name matching with an opt-in 5-rule LLM reflection step for borderline indirect-testing cases.
Coverage filter --- a JaCoCo-label scan by default, or a fine-tuned Qwen2.5-Coder-0.5B regression model when ground-truth labels are missing.
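
The name-matching idea behind the relevance filter can be illustrated without an AST. The following is a minimal sketch, assuming Java source strings; it substitutes regular expressions for the tree-sitter parsing the package uses, and the helper names `focal_method_name` and `is_relevant` are hypothetical, not part of the package API.

```python
import re
from typing import Optional

# Hypothetical helpers: a regex approximation of AST name matching.
# An identifier directly followed by "(" is treated as a call or declaration;
# identifiers that belong to an annotation ("@Foo(") are skipped.
_CALL_LIKE = re.compile(r"(?<!@)\b([A-Za-z_]\w*)\s*\(")
_JAVA_KEYWORDS = {"if", "for", "while", "switch", "catch", "return", "new", "synchronized"}


def focal_method_name(focal_method: str) -> Optional[str]:
    """Return the first non-keyword identifier that is followed by '('."""
    for match in _CALL_LIKE.finditer(focal_method):
        name = match.group(1)
        if name not in _JAVA_KEYWORDS:
            return name
    return None


def is_relevant(focal_method: str, test_case: str) -> bool:
    """True when the test body appears to call the focal method by name."""
    name = focal_method_name(focal_method)
    if name is None:
        return False
    return re.search(rf"\b{re.escape(name)}\s*\(", test_case) is not None


focal = "public int add(int a, int b) { return a + b; }"
direct = "@Test public void testAdd() { assertEquals(3, calc.add(1, 2)); }"
unrelated = "@Test public void testSize() { assertEquals(0, list.size()); }"
print(is_relevant(focal, direct), is_relevant(focal, unrelated))
```

A pair that fails this direct check is not necessarily noise: the test may reach the focal method indirectly, which is the borderline case the opt-in LLM reflection step is meant for.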

The whole pipeline processes 593,953 deduplicated Methods2Test samples in under three minutes on a laptop. A baseline that calls an LLM once per sample would take ~20 days and still miss 78% of the noise.
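
The ~20-day figure is consistent with a straight-line extrapolation of the zero-shot timing in the Results table (1,487 s for 500 samples). A small sketch of that arithmetic, assuming strictly sequential API calls with no batching, parallelism, or rate limiting:

```python
SECONDS_PER_DAY = 86_400


def extrapolate_days(measured_seconds: float, measured_samples: int, target_samples: int) -> float:
    """Scale a measured wall-clock time linearly to a larger sample count."""
    per_sample = measured_seconds / measured_samples
    return per_sample * target_samples / SECONDS_PER_DAY


days = extrapolate_days(1_487, 500, 593_953)
print(f"{days:.1f} days")  # prints "20.4 days"
```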

Why this exists

The CleanTest paper (Zhang et al., FSE 2025 Distinguished Paper) showed that 43.5% of the Methods2Test corpus is noisy, and that filtering the noise improves downstream branch coverage by ~67% across CodeBERT, AthenaTest, StarCoder, and CodeLlama-7B on Defects4J. Their pipeline is effective but exists as one-shot scripts: hard to compose, hard to drop into a coding assistant, and missing a path for projects that have no JaCoCo labels.

This repository is a from-scratch reimplementation that splits the pipeline into reusable skills, swaps the original CodeGPT coverage model for a fine-tuned Qwen2.5-Coder-0.5B (~2.6x lower MAE), and adds an optional Reflection step on the LLM relevance check. It is also the source artefact for the companion 67-page paper under report/.

Use it if...

...you train code models on Methods2Test, ATLAS, or any (focal_method, test) corpus and want to remove noisy samples before training.
...you want a rule-first pipeline (deterministic, free, fast) with an LLM only on the borderline cases.
...you want skills that drop into CodeBuddy / Claude Code / Cursor and trigger on natural language ("clean my test data", "check this test's relevance").
...you want to predict branch coverage for a (focal, test) pair without running JaCoCo --- with held-out MAE of 0.031 from a 0.5B fine-tuned model.

If you want a dialogue-based "ask the LLM if this looks like a bad test" service, this is not that --- the whole point of this project is that you do not need to call the LLM 593,953 times.

Quick start

Install from PyPI:

pip install cleantest-agent

Or install from source for development (also installs the bundled sample dataset and tests):

git clone https://github.com/jimmy0717/cleantest-agent.git
cd cleantest-agent
pip install -e ".[dev]"

# Clean the bundled 5,000-row sample (no API needed):
cleantest --input_csv data/sample_5000.csv --output_dir output/

# Inspect the noise report:
cat output/noise_report.json

You should see something like:

{
  "total_input": 5000,
  "total_kept": 2389,
  "removed": {
    "unnecessary_annotation": 2122,
    "no_relevance":           201,
    "syntax_error":           107,
    "non_english_literal":     66,
    "ambiguous_data_type":     63,
    "missing_implementation":  12,
    "empty_exception":          5
  },
  "wall_clock_seconds": 1.4
}
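
To see which noise types dominate, the report can be post-processed in a few lines. This sketch assumes the report keeps the shape shown above (a `removed` object mapping noise type to count); `load_report` and `removal_shares` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_report(path: str) -> dict:
    """Read a noise report such as output/noise_report.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def removal_shares(report: dict) -> dict:
    """Map each noise type to its share of all removals, largest first."""
    removed = report.get("removed", {})
    total = sum(removed.values())
    if total == 0:
        return {}
    ordered = sorted(removed.items(), key=lambda item: item[1], reverse=True)
    return {noise_type: count / total for noise_type, count in ordered}


# Counts copied from the sample report above.
sample = {
    "removed": {
        "unnecessary_annotation": 2122,
        "no_relevance": 201,
        "syntax_error": 107,
        "non_english_literal": 66,
        "ambiguous_data_type": 63,
        "missing_implementation": 12,
        "empty_exception": 5,
    }
}
for noise_type, share in removal_shares(sample).items():
    print(f"{noise_type:<24} {share:6.1%}")
```

For a real run, replace `sample` with `load_report("output/noise_report.json")`.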

To enable the optional LLM relevance check on borderline samples, set an OpenAI-compatible endpoint and add --llm_enhance; add --reflection as well to turn on the reflection step:

export OPENAI_API_KEY="your-key"
export OPENAI_BASE_URL="https://api.deepseek.com/v1"
cleantest --input_csv data/sample_5000.csv \
          --output_dir output/ \
          --llm_enhance --reflection

Results

We ran the four pipelines below on a 500-sample stratified subset of Methods2Test (231 noise / 269 clean), with real DeepSeek-V4-Flash API calls for the LLM rows.

Method	Precision	Recall	F1	Time (500 samples)
Rule-based (ours)	1.000	1.000	1.000	0.11 s
LLM zero-shot	0.505	0.221	0.307	1,487 s
LLM few-shot	0.534	0.303	0.387	1,642 s
Hybrid (rules + LLM borderline only)	0.974	0.956	0.965	< 60 s
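
The F1 column follows from the precision and recall columns as their harmonic mean. The sketch below recomputes it from the table values; the dictionary layout is only for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "Rule-based (ours)": (1.000, 1.000),
    "LLM zero-shot": (0.505, 0.221),
    "LLM few-shot": (0.534, 0.303),
    "Hybrid": (0.974, 0.956),
}
for method, (precision, recall) in rows.items():
    print(f"{method:<20} F1 = {f1_score(precision, recall):.3f}")
```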

Why the LLM baselines fail: 78% of the noise in Methods2Test is "this focal method is annotated with @ApiOperation / @SwaggerDefinition / ... and is therefore not useful for test-generation training". The LLM has no way to recall the 21,954-pattern dictionary that defines this class of noise; the Aho-Corasick automaton recalls it deterministically in microseconds.
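
For readers unfamiliar with the technique, here is a compact pure-Python Aho-Corasick automaton. It is a teaching sketch, not the package's implementation: the class name `AhoCorasick` is hypothetical, and the two patterns are the annotations named above rather than the bundled dictionary. The point it illustrates is that the text is scanned once, regardless of how many patterns are loaded.

```python
from collections import deque


class AhoCorasick:
    """Minimal multi-pattern matcher: a trie plus failure links."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # longest proper suffix state for each state
        self.out = [[]]    # patterns recognised on reaching each state
        for pattern in patterns:
            self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern):
        state = 0
        for ch in pattern:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first, so every shallower state is finished first.
        queue = deque(self.goto[0].values())  # depth-1 states fail to the root
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                fallback = self.fail[state]
                while fallback and ch not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                self.fail[nxt] = self.goto[fallback].get(ch, 0)
                # Inherit matches that end at the suffix state.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find_all(self, text):
        """Return the set of patterns that occur anywhere in text."""
        state, hits = 0, set()
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            hits.update(self.out[state])
        return hits


automaton = AhoCorasick(["@ApiOperation", "@SwaggerDefinition"])
focal = '@ApiOperation(value = "list users") public List<User> list() { return repo.all(); }'
print(sorted(automaton.find_all(focal)))
```

A naive alternative checks each pattern with `in`, which costs one pass over the text per pattern; with tens of thousands of patterns that difference is what the automaton removes.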

The full evaluation (RQ1-RQ4, Filter 3 model-mode validation, a case study, and an ablation study) is in report/main.tex, Section 7. Per-sample predictions are archived in experiments/results/labeled_samples.csv.

Filter 3: replacing CodeGPT with Qwen2.5-Coder-0.5B

The default Filter 3 reads condition_cover_rate straight from a JaCoCo column. For label-free settings, we fine-tuned Qwen2.5-Coder-0.5B on a stratified 80/10/10 split of LessIsMore-FSE2025 filter_train.csv (469,174 rows). Training ran on a single A800-SXM4-80 GB GPU (bf16, batch size 64, max_seq 512, learning rate 3e-5 with a cosine schedule, 2 epochs, ~3.3 h wall-clock). Results on the held-out 46,921-sample test split:

Metric	This work (Qwen2.5-Coder-0.5B)	CodeGPT (Zhang et al., FSE 2025)
MAE	0.0309	0.0798 (~2.6x higher)
MSE	0.0039	0.0105 (~2.7x higher)
RMSE	0.0628	--
R-squared	0.604	--
Pearson r	0.778	--
Spearman rho	0.848	--
F1 at tau = 0.10	0.857	--
F1 at tau = 0.15	0.912	--
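
For reference, the three error metrics at the top of the table are defined as below. This is a generic sketch on toy values, not the evaluation script from the notebook; the last two lines reproduce the ratios quoted in the CodeGPT column from the table's own numbers.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, and RMSE for paired true/predicted coverage rates."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}


# Toy coverage rates, for illustration only.
metrics = regression_metrics([0.0, 0.5, 1.0], [0.1, 0.5, 0.8])
print({name: round(value, 4) for name, value in metrics.items()})

print(f"MAE ratio: {0.0798 / 0.0309:.1f}x")  # prints "MAE ratio: 2.6x"
print(f"MSE ratio: {0.0105 / 0.0039:.1f}x")  # prints "MSE ratio: 2.7x"
```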

Raw artefacts (training_metrics.json, metrics.jsonl, test_metrics.json, test_pred_a800.csv, test_threshold_sweep.json) live under experiments/results/coverage_run/. The end-to-end notebook that produced them is experiments/main-final.ipynb.

How it works

User / Coding Assistant
     |  natural-language trigger ("clean my test data")
     v
+-----------------------------------+
|  Orchestrator skill               |
|  (cleantest-pipeline)             |
+-----+----------+--------+---------+
      |          |        |
      v          v        v
+----------+ +-----------+ +--------------+
| Filter 1 | | Filter 2  | | Filter 3     |
| Syntax   | | Relevance | | Coverage     |
|          | |           | |              |
| AST +    | | Name      | | Qwen2.5-     |
| Aho-     | | match +   | | Coder-0.5B   |
| Corasick | | LLM       | | regression   |
| (21,954  | | fallback  | | (model mode) |
| patterns)| |           | |              |
+----+-----+ +-----+-----+ +------+-------+
     |             |              |
     +-------------+--------------+
                   |
                   v
         Clean dataset + noise report

Each filter is an independent Agent Skill; they share state through a small NoiseReport accumulator and emit a single JSON + Markdown report at the end.
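
The shared-accumulator pattern can be sketched as follows. Everything here is an assumption made for illustration: the field names mirror the sample JSON report, but the `NoiseReport` class, `run_pipeline`, and the two toy filters below are hypothetical stand-ins, not the package's code.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional

Sample = dict                                # {"focal_method": ..., "test_case": ...}
Filter = Callable[[Sample], Optional[str]]   # returns a noise label, or None to keep


@dataclass
class NoiseReport:
    """Accumulates counts as samples pass through the filters."""
    total_input: int = 0
    total_kept: int = 0
    removed: Counter = field(default_factory=Counter)

    def to_dict(self) -> dict:
        return {
            "total_input": self.total_input,
            "total_kept": self.total_kept,
            "removed": dict(self.removed),
        }


def run_pipeline(samples: Iterable[Sample], filters: list) -> tuple:
    """Apply filters in order; the first label returned removes the sample."""
    report, kept = NoiseReport(), []
    for sample in samples:
        report.total_input += 1
        for check in filters:
            label = check(sample)
            if label is not None:
                report.removed[label] += 1
                break
        else:  # no filter objected
            kept.append(sample)
            report.total_kept += 1
    return kept, report


# Toy stand-ins for the real syntax and relevance filters.
def unbalanced_braces(sample: Sample) -> Optional[str]:
    test = sample["test_case"]
    return "syntax_error" if test.count("{") != test.count("}") else None


def never_calls_add(sample: Sample) -> Optional[str]:
    return None if "add(" in sample["test_case"] else "no_relevance"


focal = "int add(int a, int b) { return a + b; }"
samples = [
    {"focal_method": focal, "test_case": "void t() { assertEquals(3, add(1, 2)); }"},
    {"focal_method": focal, "test_case": "void t() { assertEquals(3, add(1, 2)); "},
    {"focal_method": focal, "test_case": "void t() { assertEquals(0, size()); }"},
]
kept, report = run_pipeline(samples, [unbalanced_braces, never_calls_add])
print(report.to_dict())
```

Because each filter only receives a sample and returns a label, filters can be reordered, dropped, or run alone without touching the reporting code.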

How it compares to the alternatives

	Original CleanTest scripts	ChatUniTest / pure-LLM workflows	Hand-rolled regex pipeline	CleanTest-Agent
Faithful to the FSE 2025 paper's definitions	yes	partial	varies	yes
Composable into coding-assistant skills	no	partial	no	yes
Deterministic on the rule-decidable cases	yes	no	usually	yes
Aho-Corasick acceleration	no (linear scan)	n/a	rare	yes (~11.5x)
Optional Reflection on borderline LLM verdicts	no	no	no	yes
Filter 3 without JaCoCo labels	no	no	no	yes (Qwen 0.5B)
Cost on 593,953 samples	~free	~$35-58 (real-API)	~free	~$4.5

Use it from a coding assistant

CleanTest-Agent ships four skills that follow the SKILL.md protocol and drop directly into CodeBuddy, Claude Code, Cursor, or any assistant that implements the protocol.

Skill	What it does	Triggers on
`cleantest-pipeline`	full pipeline orchestration	"clean test data", "run cleantest"
`cleantest-syntax-filter`	syntax noise (AST + Aho-Corasick)	"check syntax noise"
`cleantest-relevance-filter`	test-focal relevance + reflection	"check test relevance"
`cleantest-coverage-filter`	branch coverage prediction	"predict coverage"

Install all four into the local CodeBuddy skills directory:

make install   # copies skills/* to ~/.codebuddy/skills/

Then in your assistant, just ask in natural language:

"Help me clean this unit test training dataset under ~/datasets/methods2test_train.csv and write the report to ~/datasets/cleaned/."

For the full distribution recipe (without cloning this repo), see docs/skill-distribution-guide.md. For a worked example of each skill, see docs/code-assistant-guide.md.

Project structure

cleantest-agent/
|-- cleantest_agent/            installable Python package
|   |-- pipeline.py             orchestrator (Aho-Corasick + 3 filters)
|   |-- parser_utils.py         tree-sitter AST utilities
|   |-- llm_client.py           OpenAI-compatible wrapper
|   |-- report_generator.py     JSON + Markdown reports
|   `-- data/noise_modifier_fm.txt   21,954-pattern dictionary
|-- skills/                     four SKILL.md skill bundles
|-- tests/                      36 pytest test cases
|-- experiments/                run_baselines.py + results/
|-- data/sample_5000.csv        bundled 5,000-row Methods2Test subset
|-- docs/                       user-facing guides
|-- report/                     LaTeX research paper (ACM acmlarge)
|-- .github/workflows/ci.yml    CI: Python 3.10/3.11/3.12 matrix
`-- pyproject.toml              package metadata + `cleantest` console script

Contributing

Bug reports, feature requests, and pull requests are all welcome. The starting points are:

CONTRIBUTING.md for the development workflow,
.github/ISSUE_TEMPLATE/ for filing bug / feature / question issues,
CODE_OF_CONDUCT.md for the baseline expectations (Contributor Covenant 2.1).

When you fix a bug, the convention is to add a failing test under tests/ first, confirm that it fails on main, and then push the fix. The existing tests in tests/ serve as representative examples.

Citation

If you use CleanTest-Agent in academic work, please cite both the original CleanTest paper and this implementation:

@inproceedings{zhang2025cleantest,
  title     = {Less is More: On the Importance of Data Quality for Unit Test Generation},
  author    = {Zhang, Junwei and Hu, Xing and Gao, Shan and Xia, Xin and Lo, David and Li, Shanping},
  booktitle = {Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering (FSE)},
  year      = {2025},
  note      = {Distinguished Paper Award; arXiv:2502.14212}
}

@misc{yang2026cleantestagent,
  title  = {{CleanTest-Agent}: A Multi-Agent Skill-Orchestrated System for Unit Test Training Data Quality Assurance},
  author = {Yang, Yong},
  year   = {2026},
  howpublished = {\url{https://github.com/jimmy0717/cleantest-agent}}
}

License

MIT. The bundled data/sample_5000.csv is a derivative subset of Microsoft's MIT-licensed Methods2Test dataset and is redistributed under the same terms; see data/README.md for the full attribution.

Submission notes (course reviewers)

This repository was originally produced as the final-project artefact for Software Requirements Analysis and System Design at the School of Software, Beihang University. The deliverables map to the following entry points:

Deliverable	Location
Research report (LaTeX, 67 pp.)	`report/main.tex`, bibliography `report/references.bib`, compiled PDF `report/main.pdf`
Slides (10 pages, 3-min talk)	`ppt/slides.md` (English), `ppt/PPT大纲.md` (Chinese outline)
Reproducible code	this repository
Test suite (36 cases)	`tests/`, `make test`
CI pipeline	`.github/workflows/ci.yml`
Real DeepSeek API experiments	`experiments/run_baselines.py`
Filter 3 model-mode training	`experiments/main-final.ipynb` (end-to-end), `skills/cleantest-coverage-filter/scripts_paddle/`
Code-assistant skill bundles	`skills/`
Skill installation guide	`docs/skill-distribution-guide.md`
Code-assistant usage guide	`docs/code-assistant-guide.md`
Baidu AI Studio training guide	`docs/training-on-baidu-aistudio.md`

Project details

This version

0.1.1

May 24, 2026

Download files

Source Distribution

cleantest_agent-0.1.1.tar.gz (519.2 kB)

Uploaded May 24, 2026 Source

Built Distribution

cleantest_agent-0.1.1-py3-none-any.whl (517.1 kB)

Uploaded May 24, 2026 Python 3

File details

Details for the file cleantest_agent-0.1.1.tar.gz.

File metadata

Download URL: cleantest_agent-0.1.1.tar.gz
Upload date: May 24, 2026
Size: 519.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cleantest_agent-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`95035634d909be5469ad7bd7366c31db2c7d4432a2cbf4049d02c78b4e1a8106`
MD5	`05f496ab4c5a3b4165804aa81878eef5`
BLAKE2b-256	`010f6d1326c1dd44d90e0bf3b604d2493431ac3b3b6c12526b472842daee9478`

Provenance

The following attestation bundles were made for cleantest_agent-0.1.1.tar.gz:

Publisher: publish.yml on jimmy0717/cleantest-agent

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cleantest_agent-0.1.1.tar.gz
- Subject digest: 95035634d909be5469ad7bd7366c31db2c7d4432a2cbf4049d02c78b4e1a8106
- Sigstore transparency entry: 1621961495
- Sigstore integration time: May 24, 2026
Source repository:
- Permalink: jimmy0717/cleantest-agent@e5dd851d6f7c861aba7de381604ea806c41b4663
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/jimmy0717
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e5dd851d6f7c861aba7de381604ea806c41b4663
- Trigger Event: push

File details

Details for the file cleantest_agent-0.1.1-py3-none-any.whl.

File metadata

Download URL: cleantest_agent-0.1.1-py3-none-any.whl
Upload date: May 24, 2026
Size: 517.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for cleantest_agent-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`34472583193ca1d9c5a556d076a35ba05fd46d356031e96829fd942cda18dc3d`
MD5	`9379b219774314d6aecbf86454a678aa`
BLAKE2b-256	`3562d5dfe1a9f7759222c4f7e6d6c144d5f84471e80ecf8848ba58c4f36a4534`

Provenance

The following attestation bundles were made for cleantest_agent-0.1.1-py3-none-any.whl:

Publisher: publish.yml on jimmy0717/cleantest-agent

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: cleantest_agent-0.1.1-py3-none-any.whl
- Subject digest: 34472583193ca1d9c5a556d076a35ba05fd46d356031e96829fd942cda18dc3d
- Sigstore transparency entry: 1621961719
- Sigstore integration time: May 24, 2026
Source repository:
- Permalink: jimmy0717/cleantest-agent@e5dd851d6f7c861aba7de381604ea806c41b4663
- Branch / Tag: refs/tags/v0.1.1
- Owner: https://github.com/jimmy0717
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@e5dd851d6f7c861aba7de381604ea806c41b4663
- Trigger Event: push
