
Multi-agent codebase evaluation and reliability optimization.


OneCode - Agentic Codebase Evaluation

Release v0.1.5 • License: MIT • Python 3.9+ • PyPI: onecode-cli • Platform: All

Quantify agent and GenAI reliability • Analyze, search, and refactor codebases • Run and debug code using natural language • Intelligent code retrieval via semantic knowledge graphs • Track agent improvements over time.


Evaluation Metrics

OneCode evaluates agents using industry-standard metrics:

Core GenAI Reliability

  • Faithfulness - Whether the claims in the output are supported by the provided context
  • Hallucination - How far the output diverges from the provided context; lower is better
  • Answer Accuracy - Agreement between the answer and the ground-truth reference

Agent-Specific

  • Agent Goal Accuracy - Did the agent achieve its intended objective?
  • Tool Call F1 - Precision and recall of tool invocations

Quality & Coherence

  • Answer Relevancy - How directly the output addresses the input question
  • Response Groundedness - How well the response is anchored in the retrieved context

Retrieval Quality

  • Context Precision - Fraction of retrieved context chunks that are relevant
  • Context Recall - Fraction of relevant context chunks that were retrieved
  • Context Relevance - How well the retrieved context matches the question
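Of these, Tool Call F1 is the most mechanical: precision is the fraction of invoked tools that were expected, and recall is the fraction of expected tools that were actually invoked. A minimal sketch of that computation (illustrative only — not OneCode's internal implementation):

```python
from collections import Counter

def tool_call_f1(expected: list[str], actual: list[str]) -> float:
    """F1 over tool invocations, counting duplicates via multiset overlap."""
    if not expected and not actual:
        return 1.0  # nothing expected, nothing called: trivially perfect
    overlap = sum((Counter(expected) & Counter(actual)).values())
    precision = overlap / len(actual) if actual else 0.0
    recall = overlap / len(expected) if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, expecting ["search", "read_file"] but observing ["search", "run_tests"] gives precision = recall = 0.5, so F1 = 0.5.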

Context-Aware Datasets

OneCode automatically generates test datasets tailored to each module by analyzing its purpose and code. These datasets are:

  • Automatically refreshed when module code changes
  • Reused consistently across evaluation runs for reliable trend tracking

Example: Evaluation Output

You: evaluate the summarizer agent

Here is the complete evaluation report for the Summarizer Agent
(agents/summarizer.py):

Metric Scores (5 samples)
- Hallucination: 0.90 ✗ Critical (lower is better)
- Answer Accuracy: 0.45 ⚠ Needs Improvement
- Context Precision: 0.27 ✗ Critical
- Answer Relevancy: 0.39 ✗ Critical
- Faithfulness: 0.10 ✗ Critical
- Response Groundedness: 0.10 ✗ Critical

Root Cause Analysis:
Faithfulness (0.10) & Response Groundedness (0.10) — The agent is 
largely fabricating content rather than grounding summaries in the 
provided input. This is a fundamental failure for a summarizer.

Comparison with Prior Run (3 days ago):
- Faithfulness: 0.10 (no change)
- Answer Accuracy: 0.45 (↑ +0.05 improvement)
- Context Precision: 0.27 (↓ -0.12 regression)

Recommendations:
1. Add input validation to reject malformed text
2. Implement a grounding constraint that requires citations
3. Test with diverse document types
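Recommendation 2 can be prototyped outside the tool with a crude lexical groundedness check: flag any summary sentence that shares too little vocabulary with the source text. This is a naive stand-in for a real LLM-judged metric, shown only to make the idea concrete:

```python
import re

def ungrounded_sentences(summary: str, source: str, min_overlap: float = 0.4):
    """Return summary sentences whose word overlap with the source is below threshold."""
    source_words = set(re.findall(r"\w+", source.lower()))
    flagged = []
    for sent in re.split(r"(?<=[.!?])\s+", summary.strip()):
        words = set(re.findall(r"\w+", sent.lower()))
        if not words:
            continue
        overlap = len(words & source_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sent)  # likely fabricated relative to the source
    return flagged
```

A sentence with no support in the source ("Aliens built the pyramids.") gets flagged; sentences that reuse the source's vocabulary pass.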

Accountability & Comparative Analysis

Track agent improvements over time and compare across versions:

You: how does this agent compare to last week's version?
→ Shows metrics side-by-side with delta (+/- changes)

You: which agents regressed in the last evaluation?
→ Flags agents with metric drops and explains why

You: show me the evaluation history for the coder agent
→ Displays trend chart showing faithfulness, accuracy over time
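Under the hood, a regression query like this reduces to diffing two metric snapshots and flagging moves in the wrong direction. A minimal sketch (the 0.05 tolerance is illustrative; note that for Hallucination a rise, not a drop, is the regression):

```python
LOWER_IS_BETTER = {"Hallucination"}  # for these metrics a rise is a regression

def find_regressions(prev: dict, curr: dict, tolerance: float = 0.05) -> dict:
    """Return {metric: delta} for metrics that moved the wrong way beyond tolerance."""
    regressions = {}
    for name, before in prev.items():
        if name not in curr:
            continue  # metric absent from the new run; nothing to compare
        delta = curr[name] - before
        worse = delta > tolerance if name in LOWER_IS_BETTER else delta < -tolerance
        if worse:
            regressions[name] = round(delta, 3)
    return regressions
```

Fed the numbers from the report above (prior Context Precision 0.39, current 0.27), this flags only Context Precision with a delta of -0.12.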

Installation

From PyPI:

pip install onecode-cli

Setup

1. Configure environment

Provide API keys using one of two methods:

Method A: Create a .env file

Add your API keys to a .env file in your project or home directory. OPENAI_API_KEY is always required (used for embeddings):

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...   # only needed for Claude models

Method B: Export environment variables

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...   # only needed for Claude models
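Whichever method you use, a quick way to see which keys a given model needs is a check like the following — the `claude` prefix heuristic here is an assumption for illustration, not OneCode's actual logic:

```python
import os

def missing_keys(model: str, env=os.environ) -> list[str]:
    """Return required API keys that are absent from the environment."""
    required = ["OPENAI_API_KEY"]  # always needed: embeddings use OpenAI
    if model.startswith("claude"):
        required.append("ANTHROPIC_API_KEY")  # only Claude models need this
    return [key for key in required if not env.get(key)]
```

Running with only OPENAI_API_KEY set works for OpenAI models; a Claude model additionally reports ANTHROPIC_API_KEY as missing.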

How to run

After installation, the onecode command is available globally:

# Default model (claude-sonnet-4-6) with explicit path
onecode ~/path/to/project

# Use current directory (default if no path specified)
onecode

# Specify a different model
onecode --model gpt-4o

# From within the codebase directory (same as above)
cd ~/myproject
onecode

Verify the installation:

onecode --help

Startup output looks like this, on the first and on subsequent runs:

$ onecode ~/myproject
OneCode - Codebase Analyzer
----------------------------------------
Model:    claude-sonnet-4-6
Indexing: /Users/you/myproject
Ready:    42 nodes (class:12, file:18, function:12) | 42 embeddings

Type a question or task (or 'exit' to quit).
----------------------------------------

You: 

Example queries

Evaluate code quality with RAGAS metrics

You: evaluate the codebase
You: what is the accuracy of the coder agent?
You: compare this run with the previous evaluation

Understand the codebase

You: what does this codebase do?
You: explain the authentication flow
You: what agents/modules are in this project?

Find specific code

You: search for all calls to connect_db
You: where is the retry logic implemented?
You: find all async functions

Write and modify code

You: add input validation to the login function
You: write a utility function that validates emails
You: refactor the parse_config function to handle missing keys gracefully

Write, run, and debug

You: create a function that reverses a string, write a test for it, and run the test
You: add a health check endpoint and run the server to verify it starts
You: debug why the executor agent is failing on error handling

File management

You: rename src/helpers.py to src/utils.py
You: delete the tmp/ directory
You: move all test files into a tests/ directory

Git operations

You: show git status
You: show the diff of uncommitted changes
You: commit all staged files with message "add retry logic"
