
OneCode

Evaluate, analyze, and refactor multi-agent codebases in natural language.

OneCode is an AI system designed specifically for evaluating agentic workflows. Measure agent reliability, test coverage, and output quality with 10 RAGAS metrics. Then analyze, refactor, and improve your code—all through natural conversation.

Point OneCode at any project folder, and ask it anything:

  • 📊 "Evaluate the faithfulness of my coding agent"
  • 🤔 "Which agent has the least tool calling accuracy?"
  • 🔍 "Evaluate the entire workflow"
  • ✏️ "Add input validation to the login function"
  • 🧪 "Write a test for this function and run it"
  • 📁 "Move all test files to a tests/ directory"

No complex commands. No context switching. Just natural conversation.


Why OneCode?

Problem                               Solution
Measuring agent reliability           Automatic RAGAS evaluation (faithfulness, accuracy, hallucination)
Understanding unfamiliar codebases    Semantic search + AI analysis
Time-consuming refactoring            AI-powered code modification
Manual testing & debugging            Automated test generation & self-correction
Context switching between tools       Single natural language interface
Code maintenance at scale             Intelligent file operations & git integration

Evaluation Metrics

OneCode evaluates agents using industry-standard metrics:

Core Reliability (Critical)

  • Faithfulness — How faithful the output is to the provided context (foundational)
  • Hallucination — How far the output diverges from the context; lower is better
  • Answer Accuracy — How closely the answer matches the ground truth

Agent-Specific (Critical for agentic systems)

  • Agent Goal Accuracy — Did the agent achieve its intended objective?
  • Tool Call F1 — Precision and recall of tool invocations, combined into one score (critical for multi-agent workflows; see the sketch after this list)
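
Tool Call F1 is the harmonic mean of tool-call precision and recall. A minimal sketch of the computation, assuming expected and actual calls are compared as (tool name, arguments) pairs; the helper below is illustrative, not OneCode's implementation:

def tool_call_f1(expected: set, actual: set) -> float:
    """Harmonic mean of precision and recall over (tool, args) pairs."""
    if not expected or not actual:
        return 0.0
    true_positives = len(expected & actual)
    precision = true_positives / len(actual)    # correct calls / calls made
    recall = true_positives / len(expected)     # correct calls / calls expected
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 2 of 3 actual calls match, covering 2 of 4 expected calls,
# so precision = 2/3, recall = 1/2, F1 ≈ 0.57.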

Quality & Coherence

  • Answer Relevancy — How relevant the output is to the input question
  • Response Groundedness — How grounded the response is in retrieved context

Retrieval Quality (Diagnostic)

  • Context Precision — Ratio of relevant chunks to total retrieved chunks
  • Context Recall — Ratio of relevant chunks retrieved to all relevant chunks in the corpus (see the worked example after this list)
  • Context Relevance — How relevant the retrieved context is to the question
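
A worked example with illustrative numbers (not OneCode output): suppose the retriever returns 5 chunks, 4 of which are relevant, while the corpus contains 8 relevant chunks in total.

retrieved_total = 5        # chunks returned by the retriever
relevant_retrieved = 4     # of those, chunks that are actually relevant
relevant_total = 8         # relevant chunks available in the corpus

context_precision = relevant_retrieved / retrieved_total   # 4/5 = 0.8
context_recall = relevant_retrieved / relevant_total       # 4/8 = 0.5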

Evaluation Speed:

  • Use "quick" for fast evaluation (2 samples)
  • Use "comprehensive" for detailed analysis (10 samples)
  • Default: 5 samples

Note: Evaluation uses gpt-4o-mini for reliable metrics computation, independent of your chosen model.


Installation

From PyPI:

pip install onecode-cli

Setup

1. Configure environment

Provide API keys using one of two methods:

Method A: Create a .env file

Add API keys to .env in your project or home directory. OPENAI_API_KEY is always required (used for embeddings):

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...   # only needed for Claude models

Method B: Export environment variables

export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...   # only needed for Claude models
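
To confirm the variables are visible before launching, a quick check from the same shell (a sanity-check sketch, not an OneCode command):

import os

# Print whether the keys OneCode needs are present in the environment.
print("OPENAI_API_KEY set:   ", bool(os.environ.get("OPENAI_API_KEY")))
print("ANTHROPIC_API_KEY set:", bool(os.environ.get("ANTHROPIC_API_KEY")))  # Claude models only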

How to run

After installation, the onecode command is available globally:

# Default model (claude-sonnet-4-6) with explicit path
onecode ~/path/to/project

# Use current directory (default if no path specified)
onecode

# Specify a different model
onecode --model gpt-4o

# From within the codebase directory (same as running onecode with no path)
cd ~/myproject
onecode

To verify the installation:

onecode --help

First run — output looks like this:

$ onecode ~/myproject
OneCode - Codebase Analyzer
----------------------------------------
Model:    claude-sonnet-4-6
Indexing: /Users/you/myproject
Ready:    42 nodes (class:12, file:18, function:12) | 42 embeddings

Type a question or task (or 'exit' to quit).
----------------------------------------

You: 

Subsequent runs produce the same output.

Development

Setup for development

pip install -e ".[dev]"

This installs OneCode in development mode with test dependencies.

Run tests

pytest tests/              # Run all tests
pytest tests/ -v           # Verbose output with details
pytest tests/ -v --tb=short  # With short error tracebacks

Test coverage

The test suite validates the evaluation system:

  • Query parsing (6 tests) — Natural language query parsing for target module, metrics, and sample count
    • Extracts agent names and metric aliases correctly
    • Handles sample count keywords (quick=2, comprehensive=10); see the sketch after this list
  • Target selection (4 tests) — Module matching and filtering logic
    • Exact filename matching (prevents false positives)
    • Substring fallback for flexibility
    • "codebase" target returns all modules
  • Metrics filtering (3 tests) — Metrics display selection
    • Shows only requested metrics when specified
    • Shows all metrics when none specified
    • Gracefully handles missing metric values
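
A hedged sketch of such a test, using a hypothetical parse_query helper (the import path, function name, and result shape are assumptions, not OneCode's actual API):

import pytest

from onecode.evaluation import parse_query  # hypothetical import path

@pytest.mark.parametrize("query, expected_samples", [
    ("quick evaluation of reader agent", 2),          # "quick" keyword
    ("comprehensive evaluation of all modules", 10),  # "comprehensive" keyword
    ("evaluate the codebase", 5),                     # default sample count
])
def test_sample_count_keywords(query, expected_samples):
    parsed = parse_query(query)
    assert parsed.sample_count == expected_samples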

Example queries

Evaluate code quality with RAGAS metrics

You: evaluate the codebase
You: quick evaluation of reader agent focusing on faithfulness
You: what is the accuracy of the coder agent?
You: comprehensive evaluation of all modules
You: evaluate the code writing agent
You: which agent has the best answer relevancy?

Understand the codebase

You: what does this codebase do?
You: explain the authentication flow
You: what classes exist and what are their responsibilities?
You: how does the database connection work?
You: analyze the modules in this codebase
You: what agents/modules are in this project?

Find specific code

You: search for all calls to connect_db
You: search for TODO comments
You: where is the retry logic implemented?
You: find all async functions

Write and modify code

You: add input validation to the login function
You: write a utility function that paginates a list and add it to utils.py
You: refactor the parse_config function to handle missing keys gracefully

Write, run, and self-correct

You: create a function that reverses a string, write a test for it, and run the test
You: add a health check endpoint and run the server to verify it starts
You: write a script that counts lines of code per file and run it

File management

You: rename src/helpers.py to src/utils.py
You: delete the tmp/ directory
You: move all test files into a tests/ directory

Git operations

You: show git status
You: show the diff of uncommitted changes
You: commit all staged files with message "add retry logic"
You: show the last 5 commits
