Multi-agent codebase evaluation and reliability optimization.
Project description
OneCode - Agentic Codebase Evaluation
Quantify agent and GenAI reliability • Analyze, search, and refactor codebases • Run and debug code using natural language • Intelligent code retrieval via semantic knowledge graphs • Track agent improvements over time.
Quick Navigation
- Evaluation Metrics
- Example: Evaluation Output
- Installation
- Setup
- How to run
- Example queries
- Development
- License
Evaluation Metrics
OneCode evaluates agents using industry-standard metrics:
Core GenAI Reliability
- Faithfulness - How faithful the output is to the provided context
- Hallucination - How far the output diverges from the context (lower is better)
- Answer Accuracy - How closely the answer matches the ground truth
Agent-Specific
- Agent Goal Accuracy - Did the agent achieve its intended objective?
- Tool Call F1 - Precision and recall of tool invocations
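For reference, Tool Call F1 can be derived from precision and recall over the sets of expected and actual tool invocations. This is a generic sketch of the standard formula, not OneCode's internal implementation, and the function name is our own:

```python
def tool_call_f1(expected, actual):
    """F1 over tool invocations.

    precision = correct calls / calls actually made
    recall    = correct calls / calls that should have been made
    Generic sketch; not OneCode's internal implementation.
    """
    expected_set, actual_set = set(expected), set(actual)
    correct = len(expected_set & actual_set)
    if correct == 0:
        return 0.0
    precision = correct / len(actual_set)
    recall = correct / len(expected_set)
    return 2 * precision * recall / (precision + recall)
```

For example, if the agent was expected to call `search` and `read_file` but called `search` and `run_tests`, precision and recall are both 0.5, giving an F1 of 0.5.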
Quality & Coherence
- Answer Relevancy - How relevant the output is to the input question
- Response Groundedness - How grounded the response is in retrieved context
Retrieval Quality
- Context Precision - Ratio of relevant to total retrieved context chunks
- Context Recall - Ratio of retrieved to total relevant context chunks
- Context Relevance - How relevant the retrieved context is to the question
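The two ratio metrics above follow the standard precision/recall pattern over context chunks. A minimal sketch, using set overlap as a stand-in for whatever relevance judgment the evaluator actually makes:

```python
def context_precision(retrieved, relevant):
    """Fraction of retrieved chunks that are actually relevant.
    Set-overlap sketch; the real evaluator judges relevance with an LLM."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    if not retrieved_set:
        return 0.0
    return len(retrieved_set & relevant_set) / len(retrieved_set)


def context_recall(retrieved, relevant):
    """Fraction of relevant chunks that were actually retrieved."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    if not relevant_set:
        return 0.0
    return len(retrieved_set & relevant_set) / len(relevant_set)
```

With `retrieved = ["a", "b", "c", "d"]` and `relevant = ["a", "b"]`, precision is 0.5 (half the retrieved chunks mattered) while recall is 1.0 (everything relevant was found).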
Context-Aware Datasets
OneCode automatically generates test datasets tailored to each module by analyzing its purpose and code. These datasets are:
- Automatically refreshed when module code changes
- Reused consistently across evaluation runs for reliable trend tracking
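One common way to detect "module code changes" is to fingerprint the module source and invalidate the cached dataset when the fingerprint moves. This sketch assumes content hashing; OneCode's actual change-detection mechanism may differ:

```python
import hashlib
from pathlib import Path


def module_fingerprint(path):
    """SHA256 of a module's source bytes, usable as a cache key.
    Illustrative sketch; not OneCode's actual mechanism."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def dataset_is_stale(path, cached_fingerprint):
    """True when the module changed since the dataset was generated."""
    return module_fingerprint(path) != cached_fingerprint
```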
Example: Evaluation Output
You: evaluate the summarizer agent
Here is the complete evaluation report for the Summarizer Agent
(agents/summarizer.py):
Metric Scores (5 samples)
- Hallucination: 0.90 ✗ Critical (lower is better)
- Answer Accuracy: 0.45 ⚠ Needs Improvement
- Context Precision: 0.27 ✗ Critical
- Answer Relevancy: 0.39 ✗ Critical
- Faithfulness: 0.10 ✗ Critical
- Response Groundedness: 0.10 ✗ Critical
Root Cause Analysis:
Faithfulness (0.10) & Response Groundedness (0.10) — The agent is
largely fabricating content rather than grounding summaries in the
provided input. This is a fundamental failure for a summarizer.
Comparison with Prior Run (3 days ago):
- Faithfulness: 0.10 (no change)
- Answer Accuracy: 0.45 (↑ +0.05 improvement)
- Context Precision: 0.27 (↓ -0.12 regression)
Recommendations:
1. Add input validation to reject malformed text
2. Implement a grounding constraint that requires citations
3. Test with diverse document types
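The ✓/⚠/✗ labels in the report above can be reproduced with a small scoring helper. The thresholds here are illustrative guesses, not OneCode's actual cutoffs:

```python
def metric_status(score, lower_is_better=False):
    """Map a 0-1 metric score to a report label.
    Thresholds (0.7 / 0.4) are illustrative guesses, not OneCode's cutoffs."""
    if lower_is_better:
        score = 1.0 - score  # e.g. Hallucination: a high raw score is bad
    if score >= 0.7:
        return "✓ Good"
    if score >= 0.4:
        return "⚠ Needs Improvement"
    return "✗ Critical"
```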
Accountability & Comparative Analysis
Track agent improvements over time and compare across versions:
You: how does this agent compare to last week's version?
→ Shows metrics side-by-side with delta (+/- changes)
You: which agents regressed in the last evaluation?
→ Flags agents with metric drops and explains why
You: show me the evaluation history for the coder agent
→ Displays trend chart showing faithfulness, accuracy over time
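The regression check can be sketched as a diff over two runs' metric dictionaries. `find_regressions` is a hypothetical helper for illustration, not part of the onecode CLI:

```python
def find_regressions(current, previous, tol=0.0):
    """Return metrics whose score dropped by more than `tol` between runs.
    Hypothetical helper; OneCode tracks history internally."""
    return {
        name: round(current[name] - previous[name], 3)
        for name in current
        if name in previous and current[name] - previous[name] < -tol
    }
```

Given the sample report, only Context Precision (0.39 → 0.27) would be flagged; an unchanged Faithfulness and an improved Answer Accuracy are left out.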
Installation
From PyPI:
```shell
pip install onecode-cli
```
Setup
1. Configure environment
Provide API keys using one of two methods:
Method A: Create a .env file. Add API keys to .env in your project or home directory; OPENAI_API_KEY is always required (used for embeddings):

```shell
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...  # only needed for Claude models
```
Method B: Export environment variables:

```shell
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...  # only needed for Claude models
```
How to run
After installation, the onecode command is available globally:
```shell
# Default model (claude-sonnet-4-6) with an explicit path
onecode ~/path/to/project

# Use the current directory (default if no path is specified)
onecode

# Specify a different model
onecode --model gpt-4o

# From within the codebase directory (same as above)
cd ~/myproject
onecode
```
First installation check:

```shell
onecode --help
```
First run — output looks like this:
```
$ onecode ~/myproject
OneCode - Codebase Analyzer
----------------------------------------
Model: claude-sonnet-4-6
Indexing: /Users/you/myproject
Ready: 42 nodes (class:12, file:18, function:12) | 42 embeddings
Type a question or task (or 'exit' to quit).
----------------------------------------
You:
```
Example queries
Evaluate code quality with RAGAS metrics
You: evaluate the codebase
You: what is the accuracy of the coder agent?
You: compare this run with the previous evaluation
Understand the codebase
You: what does this codebase do?
You: explain the authentication flow
You: what agents/modules are in this project?
Find specific code
You: search for all calls to connect_db
You: where is the retry logic implemented?
You: find all async functions
Write and modify code
You: add input validation to the login function
You: write a utility function that validates emails
You: refactor the parse_config function to handle missing keys gracefully
Write, run, and debug
You: create a function that reverses a string, write a test for it, and run the test
You: add a health check endpoint and run the server to verify it starts
You: debug why the executor agent is failing on error handling
File management
You: rename src/helpers.py to src/utils.py
You: delete the tmp/ directory
You: move all test files into a tests/ directory
Git operations
You: show git status
You: show the diff of uncommitted changes
You: commit all staged files with message "add retry logic"
Project details
File details
Details for the file onecode_cli-0.1.6.tar.gz.
File metadata
- Download URL: onecode_cli-0.1.6.tar.gz
- Upload date:
- Size: 61.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `13dbfc6f53516de458981a247a5ff4298dfd1bb0ead969e5baa4947e727001be` |
| MD5 | `f2bd43ded772bb6c0733a6ae61bda81c` |
| BLAKE2b-256 | `aa5d410378f922a92931c65450214807d4c66093df694003ccee8706dceb7e73` |
File details
Details for the file onecode_cli-0.1.6-py3-none-any.whl.
File metadata
- Download URL: onecode_cli-0.1.6-py3-none-any.whl
- Upload date:
- Size: 62.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `d71897549061ffd986a01c90f473f0e8c8035c5571e0b60b048679eb6c316c3c` |
| MD5 | `204693acb7d4a922142452d1d5d98a32` |
| BLAKE2b-256 | `da5dfa67b2f618e7dba738f1343a1351191325858415bbaf3c13f8cca04bbacd` |
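To verify a downloaded artifact against the hashes published above, compute its SHA256 locally and compare. `sha256_of` is a plain stdlib helper, not part of OneCode:

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file in chunks and return its SHA256 hex digest,
    for comparison against the published hash table."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# expected = "<SHA256 from the table above>"
# assert sha256_of("onecode_cli-0.1.6.tar.gz") == expected
```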