This library enables granular testing of LLM applications based on expert input.
Stop believing your chatbot. Take the ragpill.
ragpill is an evaluation framework for LLM agents and RAG pipelines. Define facts, sources, and tool call expectations — and find out what your AI actually does.
What is RAGPill?
RAGPill helps you:
- Create test datasets from CSV files: easy collaboration with domain experts
- Define custom evaluators: add domain-specific knowledge to evaluations
- Track results in MLflow: full experiment tracking and tracing
- Follow best practices: opinionated design guides you toward robust testing
It specializes in "offline" evaluation of LLM-based systems: it is designed to run in your CI/CD pipeline or as scheduled tests, not as real-time monitoring.
Core Philosophy
This documentation focuses heavily on the LLM judge evaluator, even though it should be the last evaluator you reach for - prefer deterministic evaluators (regex, exact match) whenever possible. For deterministic tests, however, plenty of tooling already exists, pytest for example (yes, we like the 'code-first' approach).
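A minimal sketch of that ordering in pydantic_evals terms (the evals framework ragpill builds on): an exact-match check first, an LLM judge only as a fallback. The rubric wording is illustrative, not part of ragpill's API:

```python
from pydantic_evals.evaluators import EqualsExpected, LLMJudge

evaluators = [
    # Deterministic first: passes only when the output equals expected_output exactly.
    EqualsExpected(),
    # LLM judge as a last resort, constrained by an explicit rubric.
    LLMJudge(rubric='The answer names the correct city, with no extra claims.'),
]
```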
Expert-Defined Attributes
LLM judges usually lack the context to judge which discrepancies between a chatbot's answer and the expected answer actually matter - especially in specialized fields like law, engineering, and science, where words have precise definitions.
Domain experts should define specific attributes and criteria for evaluation.
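For instance, a legal expert can pin down in an LLMJudge rubric exactly which deviations count as failures; the scenario and rubric text below are illustrative assumptions:

```python
from pydantic_evals.evaluators import LLMJudge

# Rubric authored by a domain expert: it states which discrepancies matter
# instead of leaving that judgment to the model.
notice_period_judge = LLMJudge(
    rubric=(
        'The answer states a notice period of exactly 14 calendar days. '
        'Paraphrasing is acceptable; a different number or "business days" is a fail.'
    ),
    include_input=True,  # let the judge see the question as well as the answer
)
```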
Binary Evaluations
We use boolean pass/fail values only, not scoring scales (1-10), because:
- Scales are arbitrary, and the exact score is effectively picked by the LLM
- Binary decisions are more stable and reproducible (although LLMs of course remain probabilistic)
- Easier to track and reason about over time
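Custom evaluators follow the same rule: return a plain bool rather than a score. A minimal sketch in the pydantic_evals style (the [n]-citation convention is an assumption for illustration):

```python
import re
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class CitesAtLeastOneSource(Evaluator[str, str]):
    """Pass/fail: the answer contains at least one [n]-style citation marker."""

    def evaluate(self, ctx: EvaluatorContext[str, str]) -> bool:
        return bool(re.search(r'\[\d+\]', ctx.output))
```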
Tags and Attributes for Organization
Evaluators can have:
- Tags: categorical labels for filtering (e.g., `retrieval`, `time-aware-rag`, `basic_logic`)
- Attributes: key-value metadata for categorization (e.g., `importance: high`, `scope: Phase1`)
Metrics are automatically calculated per tag and attribute.
Key Concepts
As this library is built on pydantic-ai evals, please have a look at the pydantic-ai evals documentation as well.
Key Components
- Dataset: From pydantic-ai, contains test cases with inputs, evaluators, and metadata
- Evaluators: Check if outputs meet criteria (LLMJudge, regex matchers, custom evaluators)
- MLflow Integration: Wraps execution, traces runs, evaluates outputs, uploads results
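Wiring these components together with pydantic_evals' native API looks roughly like this; the toy `answer` task is a stand-in assumption for your real agent, and ragpill's MLflow wrapping is omitted:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import LLMJudge

dataset = Dataset(
    cases=[
        Case(
            name='capital-of-france',
            inputs='What is the capital of France?',
            expected_output='Paris',
            metadata={'importance': 'high'},
        ),
    ],
    evaluators=[LLMJudge(rubric='The answer names the correct capital city.')],
)


async def answer(question: str) -> str:
    # Stand-in for your real agent or RAG pipeline.
    return 'Paris'


report = dataset.evaluate_sync(answer)
report.print(include_input=True, include_output=True)
```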
Features
- Great MLflow Integration: Traces your agent/function execution to MLflow with evaluations in the native format
- CSV/Excel Adapter: Load test cases from CSV files with evaluator configurations
- Flexible Evaluators: Built-in LLM judges, regex matchers, and easy custom evaluator creation
- Metrics per Tags/Attributes: Automatic metric calculation for each tag and attribute combination
- Type Safety: Built on pydantic-ai with full type safety throughout
Built-in Evaluators
- LLMJudge: Uses an LLM to judge correctness based on a rubric
- RegexInSourcesEvaluator: Checks if regex patterns appear in retrieved sources
- RegexInDocumentMetadataEvaluator: Checks regex in document metadata
- Custom Evaluators: Inherit from `BaseEvaluator` and implement your logic
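ragpill's actual `RegexInSourcesEvaluator` signature lives in the API reference; purely as a conceptual sketch, such a check boils down to something like the following, assuming the task output is a dict carrying a `sources` list of strings:

```python
import re
from dataclasses import dataclass

from pydantic_evals.evaluators import Evaluator, EvaluatorContext


@dataclass
class SourcesMatchRegex(Evaluator):
    """Illustrative re-implementation; see RegexInSourcesEvaluator for the real one."""

    pattern: str

    def evaluate(self, ctx: EvaluatorContext) -> bool:
        # Assumes the task returns {'answer': ..., 'sources': [str, ...]};
        # ragpill's actual output contract may differ.
        sources = ctx.output.get('sources', [])
        return any(re.search(self.pattern, source) for source in sources)
```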
Best Practices
[!TIP] TDD Mindset — Begin by defining a test set with potential users before you even start developing the solution. This enables clear expectation management and progress tracking.
[!TIP] Create Multiple Testsets — It can make sense to keep a core set of tests that runs quickly and cheaply - use it during development. Before deploying to prod, run an exhaustive dataset integrated into your CI/CD.
[!TIP] Separate Evaluation Experiments — Create dedicated MLflow experiments for evaluations. Don't mix evaluation traces with production traces.
[!TIP] Use Domain Experts — Have domain experts define evaluation criteria rather than relying solely on generic LLM judges.
[!TIP] Version Your Tests — Keep test datasets in version control alongside your code.
Documentation
Full documentation is available at joelgotsch.github.io/ragpill/latest, including:
- Installation Guide: Setup instructions
- Quickstart Tutorial: Run your first evaluation
- CSV Adapter Guide: Learn the CSV format and column meanings
- Evaluators Guide: Create custom evaluators
- MLflow Integration: Advanced MLflow usage
- API Reference: Complete API documentation
Roadmap
- Adapter for testset from CSV
- Documentation via mkdocs
- Evaluators for sources and regex
- Repeat Task Evaluations (run task multiple times and evaluate with threshold)
- Adapter for task from CSV (upload to mlflow)
- Create demo video
- CI/CD (tests, build package, publish docs)
- Global evaluators from CSV (empty input)
- Track git-commit hash in experiment
- Tests with mlflow server
- Dependency injection for llm, input_to_key functions
- pytest integration
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.