AI-Readiness Auditor: audit how well LLMs can work with a codebase
Project description
█████╗ ██████╗ ███████╗███╗ ██╗████████╗ ███████╗██╗████████╗
██╔══██╗██╔════╝ ██╔════╝████╗ ██║╚══██╔══╝ ██╔════╝██║╚══██╔══╝
███████║██║ ███╗█████╗ ██╔██╗ ██║ ██║ █████╗ ██║ ██║
██╔══██║██║ ██║██╔══╝ ██║╚██╗██║ ██║ ██╔══╝ ██║ ██║
██║ ██║╚██████╔╝███████╗██║ ╚████║ ██║ ██║ ██║ ██║
╚═╝ ╚═╝ ╚═════╝ ╚══════╝╚═╝ ╚═══╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝
AgentFit
Does your codebase speak LLM?
AgentFit audits how well AI models can actually work with your Python code — not just read it, but complete functions, fix bugs, navigate a live repo with tools, and explain architecture. It scores five static AI-readiness metrics, then verifies them by benchmarking real LLMs against auto-generated challenges.
AgentFit eats its own dog food — it scores ≥ 80/100 on its own metrics.
What it does
agentfit benchmark ./src
Stage 1 — Static Analysis scores your codebase on five dimensions that predict how well LLMs will perform on it:
| Metric | What it measures |
|---|---|
| Schema Density | How many data-passing functions use Pydantic / TypedDict / dataclasses |
| DRYness | Absence of duplicated function bodies |
| Docstring Richness | Presence of >>> usage examples in public docstrings |
| Test Coverage Structural | Ratio of test files to source files |
| Import Clarity | Absence of circular imports and dependency tangles |
Stage 2 — LLM Benchmarking auto-generates coding challenges (completion, debugging, explanation, refactoring) from your source tree, sends them to every configured provider, and judges responses with a second LLM.
Stage 3 — Agentic Benchmarking introduces a real mutation into a copy of your codebase and lets the model use filesystem + test tools over multiple turns to find and fix the bug — just like a developer would.
Stage 4 — Correlated Reporting finds which static metrics actually correlate with LLM performance on your specific codebase and surfaces prioritised, actionable recommendations.
Install
pip install agentfit
Requires Python 3.11+. Optional provider SDKs:
pip install agentfit[anthropic] # Anthropic Claude
pip install agentfit[openai] # OpenAI / any OpenAI-compatible endpoint
pip install agentfit[all] # everything
Quick start
# 1. Scaffold a config file
agentfit init
# 2. Full audit — static analysis + LLM benchmarking
agentfit benchmark ./src
# 3. Static analysis only (no API keys needed)
agentfit benchmark ./src --no-llm
# 4. Gate CI — exit code 1 if score drops below 60
agentfit benchmark ./src --fail-below 60
# 5. Save full results to JSON
agentfit benchmark ./src --save-results
Try it without any API key
--no-llm runs the full static analysis pipeline and gives you a scored report — no LLM provider, no API key, no cost:
pip install agentfit
agentfit init
agentfit benchmark ./src --no-llm
You get scores for all five metrics (Schema Density, DRYness, Docstring Richness, Test Coverage Structural, Import Clarity) plus ranked recommendations. The LLM benchmarking and correlation stages are skipped — those need a provider configured in ai-bench.yml.
Sample output
╭──────────────────────── AgentFit Report ─────────────────────────╮
│ Source: ./src │
│ Generated: 2026-03-26T14:00:00+00:00 │
│ Overall Score: 71.4 Threshold: 60.0 ✓ PASS │
╰────────────────────────────────────────────────────────────────────╯
Static Analysis
Metric Score Bar Correlation
Schema Density 82.0 ████████░░ strong ↑ (r=0.81)
DRYness 71.0 ███████░░░ —
Docstring Richness 43.0 ████░░░░░░ strong ↑ (r=0.74)
Test Coverage Structural 55.0 █████░░░░░ —
Import Clarity 89.0 ████████░░ —
LLM Benchmark Results
Provider Model Attempted Passed Mean Score P50ms P95ms
anthropic claude-sonnet-4-6 15 12 74.2 1203 2847
qwen local 15 9 61.3 3100 5200
Recommendations
1. [HIGH] Add usage examples to public functions
Docstring Richness is 43.0/100. Strong positive correlation with LLM
scores (r=0.74). Adding >>> examples significantly improves LLM performance.
2. [MEDIUM] Increase structural test coverage
...
Use --verbose to also print per-metric warnings.
Configuration
agentfit init writes an ai-bench.yml to the current directory:
version: "1"
analysis:
source_path: "."
languages:
- python
metric_weights:
schema_density: 1.0 # set to 0 to exclude from overall score
dryness: 1.0
docstring_richness: 1.0
test_coverage: 1.0
import_clarity: 1.0
providers:
anthropic:
enabled: true
model: "claude-sonnet-4-6"
openai:
enabled: false
model: "gpt-4o"
# base_url: "https://your-local-endpoint/v1" # any OpenAI-compatible API
# name: "my-provider" # display name in reports
benchmarking:
challenges_per_module: 3
max_concurrent_requests: 5
max_tool_rounds: 10 # agentic mode: max turns per challenge
scoring:
judge_model: "claude-sonnet-4-6"
judge_provider: "anthropic"
reporting:
output_format: "text"
fail_below: null
Agentic benchmarking
AgentFit automatically generates agentic_debugging challenges for any source file that has a matching test file. Each challenge:
- Introduces one mutation into a copy of your source tree (e.g. flips
==→!=) - Gives the model access to five tools:
read_file,list_files,search_code,write_file,run_tests - Runs a multi-turn loop until the model fixes the bug or
max_tool_roundsis reached - Scores the result on correctness, fix quality, test verification, and round efficiency
Supported providers: Anthropic and any OpenAI-compatible endpoint. Ollama is not supported (no tool-use API).
You can also force any manual challenge through the agentic loop with agentic: true:
# challenges.yml
- id: "explain-scoring"
source_module: "agentfit.scoring"
challenge_type: "explanation"
agentic: true # model reads the real codebase before answering
prompt: |
Read the source files and explain how challenge scoring works end-to-end.
context_code: ""
expected_behavior: |
A detailed explanation covering ChallengeGenerator, JudgeLLM, and Scorer.
Run with manual challenges:
agentfit benchmark ./src --challenges challenges.yml --save-results
Local / self-hosted LLM endpoints
Any OpenAI-compatible API works — Ollama, LM Studio, ngrok tunnels, Qwen, Mistral, etc.:
providers:
openai:
enabled: true
model: "qwen2.5-coder:14b"
base_url: "https://xxxx.ngrok-free.app/v1"
name: "qwen" # shows as "qwen" in the report table
No API key required when base_url is set.
CLI reference
agentfit init [--output PATH]
Scaffold ai-bench.yml in the current directory.
agentfit benchmark SOURCE_PATH
[--config PATH] Override config file location
[--fail-below SCORE] Exit 1 if overall score < SCORE
[--no-llm] Static analysis only (no API keys needed)
[--verbose, -v] Show per-metric warnings
[--challenges PATH] YAML file of manually authored challenges
[--max-challenges N] Cap auto-generated challenges (manual always included)
[--save-results] Write full report to agentfit-results.json
[--manual-eval] Export challenge/response/verdict triples to JSONL
[--load-evals PATH] Merge a manual eval JSONL into auto scores
[--output-format FORMAT] 'text' (default) or 'html'
[--output-file PATH] Write HTML report to file
[--badge] Write agentfit-badge.json
Multi-language support
AgentFit analyses Python natively and has regex-based analysers for:
| Language | Schema Density | DRYness | Docstring Richness | Import Clarity | Test Coverage |
|---|---|---|---|---|---|
| Python | ✓ | ✓ | ✓ | ✓ | ✓ |
| TypeScript / JS | ✓ | ✓ | ✓ | ✓ | needs tests¹ |
| Rust | ✓ | ✓ | ✓ | ✓ | needs tests¹ |
| Go | ✓ | ✓ | ✓ | ✓ | needs tests¹ |
| Java | ✓ | ✓ | ✓ | ✓ | needs tests¹ |
¹ Test Coverage Structural works by pairing source files with test files (e.g.
engine.go→engine_test.go). For non-Python languages the metric will score 0 if your project has no test files alongside the source — add tests to your project to get a meaningful score.
analysis:
languages:
- python
- typescript
- rust
Roadmap
| Version | Theme | Status |
|---|---|---|
| v0.1 | Python static analysis + Anthropic/OpenAI/Ollama runners | ✓ Done |
| v0.2 | TypeScript/JavaScript AST analysis | ✓ Done |
| v0.3 | Rust/C/C++ + Go + Java support | x Partialy |
| v0.4 | Agentic tool harness + multi-turn debugging challenges | ✓ Done |
| v0.5 | HTML report export + CI badge generation | ✓ Done |
| v1.0 | Real pytest-cov integration + VS Code extension | Planned |
What's left (v1.0)
- Real
pytest-covintegration — blend runtime branch coverage with the structural score - VS Code extension — inline metric decorations, status bar score, WebView report panel
- CI/GitHub Actions — self-audit job (
agentfit benchmark ./agentfit --fail-below 90) on every PR -
mypy --strict— full type-checking across all modules - PyPI publish —
pip i stall agentfitfrom the public registry
Development
git clone https://github.com/voicutomut/AgentFit
cd AgentFit
pip install -e ".[dev]"
pytest # 739 tests
ruff check agentfit/ # lint
See ROADMAP.md for the full phased implementation plan.
Community suggestions — help shape AgentFit
AgentFit is early and the five metrics are our first take at what makes a codebase LLM-friendly. We want your input.
Open an issue or start a discussion if you have thoughts on any of these:
Are the current metrics the right ones?
The five we picked:
| Metric | Our hypothesis |
|---|---|
| Schema Density | Typed data structures give LLMs clear contracts to reason about |
| DRYness | Duplicated logic confuses context windows and wastes tokens |
| Docstring Richness | >>> examples are the most information-dense context you can give a model |
| Test Coverage Structural | Tests tell the model what "correct" looks like |
| Import Clarity | Circular deps and star imports obscure the dependency graph |
Do these match your experience? Have you noticed other code properties that seem to help or hurt LLM performance on your projects?
What metrics are we missing?
Some candidates we're considering — tell us which matter most to you:
- Naming consistency — do identifiers follow a single convention? Does the model have to context-switch between styles?
- Function length / cyclomatic complexity — do shorter, focused functions produce better LLM completions?
- Comment density — inline comments vs. docstrings, which helps more?
- Dependency freshness — does using up-to-date libraries (in the model's training data) improve results?
- Magic number / constant density — does replacing raw literals with named constants help?
- Error handling coverage — does consistent exception handling improve LLM-generated patches?
Other ways to contribute
- Share a benchmark result — run
agentfit benchmark ./your-repo --save-resultsand share the JSON output. Real data helps us validate which metrics actually correlate with LLM performance. - Propose a new challenge type — beyond completion, debugging, refactoring, and explanation, what coding tasks should we be measuring?
- Report false positives — if a metric scores your codebase unfairly, open an issue with a minimal example.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distributions
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file agentfit-0.1.0-py3-none-any.whl.
File metadata
- Download URL: agentfit-0.1.0-py3-none-any.whl
- Upload date:
- Size: 165.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
429bc650520ff5e8edef75d90a73fc4131263f9a06915e33e9fbba817f703ae5
|
|
| MD5 |
57c80a82d96fb020d92492e611c55686
|
|
| BLAKE2b-256 |
0cf6186298ec0b039a33fbe170c7c524550c52ebd8ddbd766b790a6368bd2eb0
|