Evaluation framework for Letta AI agents

These details have not been verified by PyPI

Project links

Project description

Letta Evals

Letta Evals provides a framework for evaluating AI agents built with Letta. We offer a flexible evaluation system to test different dimensions of agent behavior and the ability to write your own custom evals for use cases you care about. You can use your own datasets to build private evals that represent common patterns in your agentic workflows.

If you are building with agentic systems, creating high quality evals is one of the most impactful things you can do. Without evals, it can be very difficult and time intensive to understand how agent configurations, model versions, or prompt changes might affect your use case.

Setup

To run evals against Letta agents, you will need a running Letta server. You can either:

Self-hosted: Follow the Letta installation guide to get started with self-hosting your server.

Letta Cloud: Create an account at app.letta.com and configure your environment:

export LETTA_API_KEY=your-api-key        # Get from Letta Cloud dashboard
export LETTA_PROJECT_ID=your-project-id  # Get from Letta Cloud dashboard

Then set `base_url: https://api.letta.com/` in your suite YAML.

If you plan to use LLM-based grading (rubric graders), you'll also need to configure API keys for your chosen provider (e.g., OPENAI_API_KEY).

Minimum Required Version: Python 3.9

Installing Letta Evals

If you are going to be creating custom evals or contributing to this repository, clone the repo directly from GitHub and install using:

# we recommend uv
uv sync --extra dev

Using the editable install, changes you make to your evals will be reflected immediately without having to reinstall.

Running Evals Only

If you simply want to run existing evals locally, you can install the package via pip:

pip install letta-evals

Quick Start

Create a test dataset (dataset.jsonl):

{"input": "What's the capital of France?", "ground_truth": "Paris"}
{"input": "Calculate 2+2", "ground_truth": "4"}

Write a suite configuration (suite.yaml):

name: my-eval-suite
dataset: dataset.jsonl
target:
  kind: letta_agent
  agent_file: my_agent.af  # or use agent_id for existing agents
  base_url: http://localhost:8283
graders:
  quality:
    kind: tool
    function: contains  # or exact_match
    extractor: last_assistant
gate:
  kind: simple
  metric_key: quality
  aggregation: avg_score
  op: gte
  value: 0.75  # require average score >= 0.75

Run the evaluation:

letta-evals run suite.yaml

Running Evals

The core evaluation flow is:

Dataset → Target (Agent) → Extractor → Grader → Gate → Result

# run an evaluation suite with real-time progress
letta-evals run suite.yaml

# save results to a directory (suite.json, summary.json, <model>.jsonl)
letta-evals run suite.yaml --output results

# run multiple times for statistical analysis
letta-evals run suite.yaml --num-runs 5

# validate suite configuration before running
letta-evals validate suite.yaml

# list available components
letta-evals list-extractors
letta-evals list-graders

You can also set run defaults directly in suite.yaml:

max_concurrent: 5
output: results/

output supports both absolute and relative paths. Relative paths are resolved from the suite file's directory. CLI flags (--max-concurrent, --output) override suite values when provided.

See the examples/ directory for complete working examples of different eval types.

Writing Evals

Letta Evals supports multiple approaches for creating evaluations, from simple YAML-based configs to fully custom Python implementations.

Getting Started

We suggest getting started with these examples:

Basic tool grading: examples/simple-tool-grader/ - Simple string matching with exact_match and contains functions
LLM-as-judge grading: examples/simple-rubric-grader/ - Using rubric graders with custom prompts for nuanced evaluation
Per-sample rubric overrides: examples/per-sample-rubric/ - Use a different rubric for each dataset row via inline rubric or rubric_path fields
Agent-as-judge grading: examples/letta-agent-rubric-grader/ - Using a Letta agent as an LLM judge (no API keys required!)
Multi-grader gates: examples/multi-grader-gate/ - Combining multiple graders with logical AND/OR gates, weighted averages, and advanced aggregation functions
Memory block extraction: examples/multiturn-memory-block-extractor/ - Extracting and evaluating agent memory across multiturn conversations
Per-turn evaluation: examples/multiturn-per-turn-grading/ - Grade each turn independently in multi-turn conversations with proportional scoring
Multi-model evaluation: examples/multi-model-simple-rubric-grader/ - Testing across multiple LLM configurations
Programmatic agent creation: examples/programmatic-agent-creation/ - Using agent factories to create agents dynamically per sample
Custom graders and extractors: examples/custom-tool-grader-and-extractor/ - Implementing custom evaluation logic with Python decorators
Letta Code CLI evaluation: examples/letta-code-simple-edit/ - Testing autonomous coding agents with async graders and subprocess execution, including multi-model evaluation support

Writing Custom Components

Letta Evals provides Python decorators for extending the framework:

@grader: Register custom scoring functions for domain-specific evaluation logic
@extractor: Create custom extractors to parse agent responses in specialized ways
@agent_factory: Define programmatic agent creation for dynamic instantiation per sample
@suite_setup: Run initialization code before evaluation starts. Supports three signatures:
- () -> None - Run once at the start with no parameters
- (client: AsyncLetta) -> None - Run once at the start with client access
- (client: AsyncLetta, model_name: str) -> None - Run once per model when evaluating multiple models (useful for model-specific setup like creating isolated working directories)

See examples/custom-tool-grader-and-extractor/ for implementation examples.

FAQ

Do you have examples of different eval types?

Yes! See the examples/ directory. Each subdirectory contains a complete working example with dataset, suite config, and any custom components.

Can I use this without writing any Python code?

Absolutely! You can create powerful evals using just YAML configs and JSONL datasets. See examples/simple-tool-grader/ or examples/simple-rubric-grader/ for code-free examples.

How do I evaluate multi-turn agent interactions?

Letta Evals natively supports multiturn conversations! Simply provide input as a list of strings in your dataset instead of a single string. The framework will send each message sequentially and capture the full trajectory. Use extractors like last_turn, all_assistant, or memory_block to evaluate different aspects of the multiturn interaction. See examples/multiturn-memory-block-extractor/ for a complete example testing memory updates across conversation turns.

Can I grade each turn independently in a multi-turn conversation?

Yes! Use per-turn evaluation by providing both input and ground_truth as lists of the same length in your dataset:
```
{"input": ["What is 2+2?", "What is 3+3?"], "ground_truth": ["4", "6"]}
```
Each turn is graded independently against its corresponding ground truth, and the final score is the average across all turns (e.g., 2/3 correct = 0.67). Access per-turn results via sample_result.grades["grader_key"].per_turn_grades. See examples/multiturn-per-turn-grading/ for a complete example.

Can I test the same agent with different LLM models?

Yes! Use the multi-model configuration feature. See examples/multi-model-simple-rubric-grader/ for an example that tests one agent with multiple model configurations.

Can I run evaluations multiple times to measure consistency?

Yes! Run evaluations multiple times to measure consistency and variance. See examples/simple-tool-grader/multi_run_tool_output_suite.yaml for an example.
```
# run 5 times and get mean/std dev statistics
letta-evals run suite.yaml --num-runs 5 --output results/
```
Results include aggregate statistics across runs with mean and standard deviation for all metrics.

Can I monitor long-running evaluations in real-time?

Yes! Results are written incrementally as JSONL, allowing you to monitor evaluations in real-time and resume interrupted runs.

Can I reuse agent trajectories when testing different graders?

Yes! Use --cached-results to reuse agent trajectories across evaluations, avoiding redundant agent runs when testing different graders.

Can I evaluate Letta Code agents across different models?

Yes! The Letta Code target supports evaluating multiple models. In your suite YAML, specify multiple model handles:
```
target:
  kind: letta_code
  model_handles:
    - anthropic/claude-sonnet-4-5-20250929
    - gpt-5-low
```
The framework automatically creates isolated working directories for each model to prevent interference between concurrent evaluations. When combined with @suite_setup functions that accept model_name, you can perform model-specific initialization for each evaluation run.

Can I use this in CI/CD pipelines?

Absolutely! Letta Evals is designed to integrate seamlessly into continuous integration workflows. Check out our .github/workflows/e2e-tests.yml for an example of running evaluations in GitHub Actions. The workflow automatically discovers and runs all suite files, making it easy to gate releases or validate changes to your agents.

I don't have access to LLM provider API keys - can I still use LLM-as-judge / rubric grading?

Yes! Use the agent-as-judge feature instead of the standard rubric grader. With agent-as-judge, you configure a Letta agent (with its own LLM access) to act as the evaluator. This is perfect for:
- Teams without direct LLM API access (using Letta Cloud or managed instances)
- Scenarios where you want the judge to use tools (e.g., web search, database queries) during evaluation
- Organizations with centralized LLM access through Letta
See examples/letta-agent-rubric-grader/ for a complete working example. The judge agent just needs a submit_grade(score: float, rationale: str) tool, and the framework handles the rest!

Contributing

Contributions are welcome! If you have an interesting eval or feature, please submit an issue or contact us on Discord.

License

This project is licensed under the MIT License. By contributing to evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an eval. Letta reserves the right to use this data in future service improvements to our product.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.17.1

May 21, 2026

0.16.0

May 21, 2026

0.15.0

May 5, 2026

0.14.0

Apr 29, 2026

0.13.0

Apr 9, 2026

0.9.0

Dec 23, 2025

0.8.1

Dec 18, 2025

0.8.0

Dec 17, 2025

0.7.0

Nov 4, 2025

0.6.1

Oct 30, 2025

0.6.0

Oct 30, 2025

0.5.0

Oct 23, 2025

0.4.1

Oct 21, 2025

0.4.0

Oct 21, 2025

0.3.2

Oct 20, 2025

0.3.1

Oct 20, 2025

0.3.0

Oct 20, 2025

0.2.0

Oct 15, 2025

0.1.0b1 pre-release

Sep 15, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

letta_evals-0.17.1.tar.gz (86.2 kB view details)

Uploaded May 21, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

letta_evals-0.17.1-py3-none-any.whl (124.7 kB view details)

Uploaded May 21, 2026 Python 3

File details

Details for the file letta_evals-0.17.1.tar.gz.

File metadata

Download URL: letta_evals-0.17.1.tar.gz
Upload date: May 21, 2026
Size: 86.2 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for letta_evals-0.17.1.tar.gz
Algorithm	Hash digest
SHA256	`648743dd48a7f7c8bd9ef262d9ef82f020a15d851e0e437be2c34860cae1e8fd`
MD5	`153bbbc53c86495a7135b5ea45780ec0`
BLAKE2b-256	`9c767ab9170b82cdeeba3408b8281f7b2924ebabdcbfee6a05fc3363beb7f953`

See more details on using hashes here.

Provenance

The following attestation bundles were made for letta_evals-0.17.1.tar.gz:

Publisher: publish.yml on letta-ai/letta-evals

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: letta_evals-0.17.1.tar.gz
- Subject digest: 648743dd48a7f7c8bd9ef262d9ef82f020a15d851e0e437be2c34860cae1e8fd
- Sigstore transparency entry: 1589907992
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: letta-ai/letta-evals@958b517305a91a7e96cae95acc806d5fc799de44
- Branch / Tag: refs/heads/main
- Owner: https://github.com/letta-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@958b517305a91a7e96cae95acc806d5fc799de44
- Trigger Event: workflow_dispatch

File details

Details for the file letta_evals-0.17.1-py3-none-any.whl.

File metadata

Download URL: letta_evals-0.17.1-py3-none-any.whl
Upload date: May 21, 2026
Size: 124.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for letta_evals-0.17.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`af5b0e12aa5f3decb2dfd6331735af250b11a34fa114c7dfc336458eea409c0f`
MD5	`3ad8f3690287b543d0bba2dd2360e3d3`
BLAKE2b-256	`c74056df700bdfb3e01be3ceec8dd9f3272030200bd8eee8389782dd6638bdfc`

See more details on using hashes here.

Provenance

The following attestation bundles were made for letta_evals-0.17.1-py3-none-any.whl:

Publisher: publish.yml on letta-ai/letta-evals

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: letta_evals-0.17.1-py3-none-any.whl
- Subject digest: af5b0e12aa5f3decb2dfd6331735af250b11a34fa114c7dfc336458eea409c0f
- Sigstore transparency entry: 1589908097
- Sigstore integration time: May 21, 2026
Source repository:
- Permalink: letta-ai/letta-evals@958b517305a91a7e96cae95acc806d5fc799de44
- Branch / Tag: refs/heads/main
- Owner: https://github.com/letta-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@958b517305a91a7e96cae95acc806d5fc799de44
- Trigger Event: workflow_dispatch

letta-evals 0.17.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Letta Evals

Setup

Installing Letta Evals

Running Evals Only

Quick Start

Running Evals

Writing Evals

Getting Started

Writing Custom Components

FAQ

Contributing

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance