
Moatless Tools

Moatless Tools is a hobby project where I experiment with some ideas I have about how LLMs can be used to edit code in large existing codebases. I believe that rather than relying on an agent to reason its way to a solution, it is crucial to build good tools to insert the right context into the prompt and handle the response.

For the implementation used in the paper SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement, please see moatless-tree-search.

SWE-Bench

I use the SWE-bench benchmark as a way to verify my ideas.

Try it out

Environment Setup

You can install Moatless Tools either from PyPI or from source:

Install from PyPI

# Install base package only
pip install moatless

# Install with streamlit visualization tools
pip install "moatless[streamlit]"

# Install with API server
pip install "moatless[api]"

# Install everything (including dev dependencies)
pip install "moatless[all]"

Install from source

Clone the repository and install using Poetry:

# Clone the repository
git clone https://github.com/aorwall/moatless-tools.git
cd moatless-tools

# Using Poetry:

# Install base package only
poetry install

# Install with streamlit visualization tools
poetry install --with streamlit

# Install with API server
poetry install --with api

# Alternative: Install all optional components at once
poetry install --all-extras


Environment Variables

Before running the evaluation, you'll need:
1. At least one LLM provider API key (e.g., OpenAI, Anthropic, etc.)
2. A Voyage AI API key from [voyageai.com](https://voyageai.com) to use the pre-embedded vector stores for SWE-Bench instances.
3. (Optional) Access to a testbed environment - see [moatless-testbeds](https://github.com/aorwall/moatless-testbeds) for setup instructions

You can configure these settings in one of two ways:

1. Create a `.env` file in the project root (copy from `.env.example`):

```bash
# If you installed from source (cloned repository):
cp .env.example .env
# Edit .env with your values

# If you installed from PyPI, download the example first:
curl -O https://raw.githubusercontent.com/aorwall/moatless-tools/main/.env.example
mv .env.example .env
# Edit .env with your values
```

2. Or export the variables directly:

```bash
# Directory for storing vector index store files
export INDEX_STORE_DIR="/tmp/index_store"

# Directory for storing cloned repositories
export REPO_DIR="/tmp/repos"

# Required: at least one LLM provider API key
export OPENAI_API_KEY="<your-key>"
export ANTHROPIC_API_KEY="<your-key>"

# ...or a custom LLM API service (optional)
export CUSTOM_LLM_API_BASE="<your-base-url>"
export CUSTOM_LLM_API_KEY="<your-key>"

# Required: API key for Voyage embeddings
export VOYAGE_API_KEY="<your-key>"

# Optional: testbed environment (https://github.com/aorwall/moatless-testbeds)
export TESTBED_API_KEY="<your-key>"
export TESTBED_BASE_URL="<your-base-url>"
```
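Before kicking off a run, it can help to verify that the variables above are actually set. A minimal sketch in Python (the variable names mirror the list above; `check_env` is a hypothetical helper, not part of Moatless):

```python
import os

# Keys the README marks as required, and the accepted LLM provider keys.
REQUIRED = ["VOYAGE_API_KEY"]
LLM_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "CUSTOM_LLM_API_KEY"]


def check_env() -> list[str]:
    """Return human-readable problems with the current environment."""
    problems = [f"{name} is not set" for name in REQUIRED if not os.getenv(name)]
    if not any(os.getenv(key) for key in LLM_KEYS):
        problems.append("no LLM provider API key is set")
    return problems


if __name__ == "__main__":
    for problem in check_env():
        print(f"warning: {problem}")
```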

Verified Models

Note: The current version of litellm lacks support for the computer-use tools required by Claude 3.5 Sonnet, so you need to pin a patched fork as a dependency:

litellm = { git = "https://github.com/aorwall/litellm.git", branch = "anthropic-computer-use" }

Default model configurations are provided for verified models, i.e. models that have been tested and found to work with the Verified Mini subset of the SWE-Bench dataset. Other models may work but have not been extensively tested.

When specifying just the --model argument, the following configurations are used:

| Model | Response Format | Message History | Thoughts in Action | Verified Mini |
|---|---|---|---|---|
| claude-3-5-sonnet-20241022 | tool_call | messages | no | 46% |
| claude-3-5-haiku-20241022 | tool_call | messages | no | 28% |
| gpt-4o-2024-11-20 | tool_call | messages | yes | 32% |
| gpt-4o-mini-2024-07-18 | tool_call | messages | yes | 16% |
| o1-mini-2024-09-12 | react | react | no (disabled thoughts) | 28% |
| deepseek/deepseek-chat | react | react | no | 36% |
| deepseek/deepseek-reasoner | react | react | no (disabled thoughts) | 50% |
| gemini/gemini-2.0-flash-exp | react | react | no | 38% |
| openrouter/meta-llama/llama-3.1-70b-instruct | react | react | no | - |
| openrouter/meta-llama/llama-3.1-405b-instruct | react | react | no | 28% |
| openrouter/qwen/qwen-2.5-coder-32b-instruct | react | react | no | 32% |
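The table can be read as a lookup from model identifier to its default format settings. A rough sketch (values transcribed from the table above; the authoritative defaults live in `model_config.py`, so treat this dict as illustrative only):

```python
# (response_format, message_history) defaults transcribed from the table;
# the authoritative values live in moatless's model_config.py.
MODEL_DEFAULTS = {
    "claude-3-5-sonnet-20241022": ("tool_call", "messages"),
    "claude-3-5-haiku-20241022": ("tool_call", "messages"),
    "gpt-4o-2024-11-20": ("tool_call", "messages"),
    "deepseek/deepseek-chat": ("react", "react"),
    "deepseek/deepseek-reasoner": ("react", "react"),
    "gemini/gemini-2.0-flash-exp": ("react", "react"),
}


def defaults_for(model: str) -> tuple[str, str]:
    """Return (response_format, message_history) for a model, falling back
    to the documented defaults for custom models."""
    return MODEL_DEFAULTS.get(model, ("tool_call", "messages"))
```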

Verify Setup

Before running the full evaluation, you can verify your setup using the integration test script:

# Run a single model test
python -m moatless.validation.validate_simple_code_flow --model claude-3-5-sonnet-20241022

The script will run the model against a sample SWE-Bench instance.

Results are saved in `test_results/integration_test_<timestamp>/`.

Run evaluation

The evaluation script supports various configuration options through command line arguments:

python -m moatless.benchmark.run_evaluation [OPTIONS]

Required arguments:

  • --model MODEL: Model to use for evaluation. Can be a supported model from the table above or any custom model identifier (e.g., 'claude-3-5-sonnet-20241022', 'gpt-4o')

Optional arguments:

  • Model settings:

    • --api-key KEY: API key for the model
    • --base-url URL: Base URL for the model API
    • --response-format FORMAT: Response format ('tool_call' or 'react'). Defaults to 'tool_call' for custom models
    • --message-history TYPE: Message history type ('messages', 'summary', 'react', 'messages_compact', 'instruct'). Defaults to 'messages' for custom models
    • --thoughts-in-action: Enable thoughts in action
    • --temperature FLOAT: Temperature for model sampling. Defaults to 0.0
  • Dataset settings:

    • --split SPLIT: Dataset split to use. Defaults to 'lite'
    • --instance-ids ID [ID ...]: Specific instance IDs to evaluate
  • Loop settings:

    • --max-iterations INT: Maximum number of iterations
    • --max-cost FLOAT: Maximum cost in dollars
  • Runner settings:

    • --num-workers INT: Number of parallel workers. Defaults to 10
    • --evaluation-name NAME: Custom name for the evaluation run
    • --rerun-errors: Rerun instances that previously errored

Available dataset splits that can be specified with the --split argument:

| Split Name | Description | Instance Count |
|---|---|---|
| lite | All instances from the lite dataset | 300 |
| verified | All instances from the verified dataset | 500 |
| verified_mini | MariusHobbhahn/swe-bench-verified-mini, a subset of SWE-Bench Verified | 50 |
| lite_and_verified_solvable | Instances that exist in both lite and verified datasets and have at least one solved submission to SWE-Bench | 84 |

Example usage:

# Run evaluation with Claude 3.5 Sonnet using the ReACT format
python -m moatless.benchmark.run_evaluation \
  --model claude-3-5-sonnet-20241022 \
  --response-format react \
  --message-history react \
  --num-workers 10

# Run specific instances with GPT-4o
python -m moatless.benchmark.run_evaluation \
  --model gpt-4o-2024-11-20 \
  --instance-ids "django__django-16527"

Running the UI and API

The project includes a web UI for visualizing saved trajectory files, built with SvelteKit. The UI is packaged with the Python package and will be served by the API server.

First, make sure you have the required components installed:

pip install "moatless[api]"

Start the API Server

moatless-api

This will start the FastAPI server on http://localhost:8000 and serve the UI at the same address.

Development Mode

If you want to develop the UI, you can run it in development mode:

# From the ui directory
cd ui
pnpm install
pnpm run dev

The UI development server will be available at http://localhost:5173.

Code Examples

Basic setup using the AgenticLoop to solve a SWE-Bench instance.

Example 1: Using Claude 3.5 Sonnet

import os

from moatless.benchmark.swebench import create_repository
from moatless.benchmark.utils import get_moatless_instance
from moatless.agent.code_agent import CodingAgent
from moatless.index import CodeIndex
from moatless.loop import AgenticLoop
from moatless.file_context import FileContext
from moatless.completion.base import LLMResponseFormat
from moatless.schema import MessageHistoryType

index_store_dir = os.getenv("INDEX_STORE_DIR", "/tmp/index_store")
repo_base_dir = os.getenv("REPO_DIR", "/tmp/repos")
persist_path = "trajectory.json"

instance = get_moatless_instance("django__django-16379")
repository = create_repository(instance)
code_index = CodeIndex.from_index_name(
    instance["instance_id"], 
    index_store_dir=index_store_dir, 
    file_repo=repository
)
file_context = FileContext(repo=repository)

# Create agent using Claude 3.5 Sonnet with explicit config
agent = CodingAgent.create(
    repository=repository, # Repository instance with codebase
    code_index=code_index, # Code index for semantic search
    
    model="claude-3-5-sonnet-20241022",
    temperature=0.0, 
    max_tokens=4000,
    few_shot_examples=False, # We don't need few-shot examples for this model
    
    response_format=LLMResponseFormat.TOOLS,
    message_history_type=MessageHistoryType.MESSAGES, # Show the full message history to make use of Claude's prompt cache
)

loop = AgenticLoop.create(
    message=instance["problem_statement"],
    agent=agent,
    file_context=file_context,
    repository=repository,
    persist_path=persist_path,
    max_iterations=50,
    max_cost=2.0
)

final_node = loop.run()
if final_node:
    print(final_node.observation.message)

Example 2: Using Deepseek V3

import os

from moatless.benchmark.swebench import create_repository
from moatless.benchmark.utils import get_moatless_instance
from moatless.agent.code_agent import CodingAgent
from moatless.index import CodeIndex
from moatless.loop import AgenticLoop
from moatless.file_context import FileContext
from moatless.completion.base import LLMResponseFormat
from moatless.schema import MessageHistoryType

index_store_dir = os.getenv("INDEX_STORE_DIR", "/tmp/index_store")
repo_base_dir = os.getenv("REPO_DIR", "/tmp/repos")
persist_path = "trajectory.json"

instance = get_moatless_instance("django__django-16379")
repository = create_repository(instance)
code_index = CodeIndex.from_index_name(
    instance["instance_id"], 
    index_store_dir=index_store_dir, 
    file_repo=repository
)
file_context = FileContext(repo=repository)

# Create agent using Deepseek Chat with explicit config
agent = CodingAgent.create(
    repository=repository,
    code_index=code_index,
    
    model="deepseek/deepseek-chat",
    temperature=0.0,
    max_tokens=4000,
    few_shot_examples=True,
    
    response_format=LLMResponseFormat.REACT,
    message_history_type=MessageHistoryType.REACT
)

loop = AgenticLoop.create(
    message=instance["problem_statement"],
    agent=agent,
    file_context=file_context,
    repository=repository,
    persist_path=persist_path,
    max_iterations=50,
    max_cost=2.0
)

final_node = loop.run()
if final_node:
    print(final_node.observation.message)

CodingAgent Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | Required | Model identifier from the supported models table (e.g., "claude-3-5-sonnet-20241022") |
| repository | Repository | Required | Repository instance containing the codebase |
| code_index | CodeIndex | None | Code index for semantic search functionality |
| runtime | RuntimeEnvironment | None | Environment for running tests |
| message_history_type | MessageHistoryType | From config | How to format the message history in the prompt ('messages', 'react', etc.) |
| thoughts_in_action | bool | From config | Whether to include thoughts in action responses; used when the LLM can't provide reasoning in the message content |
| disable_thoughts | bool | From config | Whether to disable thought generation entirely; used for reasoning models like o1 and Deepseek R1 |
| few_shot_examples | bool | From config | Whether to use few-shot examples in prompts |
| temperature | float | From config | Temperature for model sampling (0.0 = deterministic) |
| max_tokens | int | From config | Maximum tokens per model completion |

The default values for optional parameters are taken from the model's configuration in model_config.py. See the Verified Models table above for model-specific defaults.
