
This library allows for granular testing of LLM applications based on expert input.


ragpill logo

Stop believing your chatbot. Take the ragpill.



ragpill is an evaluation framework for LLM agents and RAG pipelines. Define facts, sources, and tool-call expectations, then find out what your AI actually does.

What is RAGPill?

RAGPill helps you:

  • Create test datasets from CSV files - Easy collaboration with domain experts
  • Define custom evaluators - Add domain-specific knowledge to evaluations
  • Track results in MLflow - Full experiment tracking and tracing
  • Follow best practices - Opinionated design guides you to robust testing

It specializes in "offline" evaluation of LLM-based systems: it is designed to run in your CI/CD pipeline or as scheduled tests, not as real-time monitoring.
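To give a feel for the CSV-based workflow, here is a minimal, standard-library-only sketch of loading test cases from a CSV file. The column names below are hypothetical; the real format and column meanings are documented in the CSV Adapter Guide.

```python
import csv
import io

# Hypothetical CSV layout: one test case per row, with tags separated
# by semicolons. The actual column names are defined by the CSV adapter.
CSV_TEXT = """\
name,inputs,expected_output,tags
capital_q,What is the capital of France?,Paris,basic_logic
report_q,Who wrote the 2023 annual report?,Finance team,retrieval;time-aware-rag
"""

def load_cases(text: str) -> list[dict]:
    """Parse CSV rows into plain dicts, splitting the tags column."""
    cases = []
    for row in csv.DictReader(io.StringIO(text)):
        row["tags"] = row["tags"].split(";") if row["tags"] else []
        cases.append(row)
    return cases

cases = load_cases(CSV_TEXT)
print(cases[0]["name"])  # capital_q
```

Keeping the file in this shape lets domain experts edit test cases in a spreadsheet while the test set stays in version control.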

Core Philosophy

This documentation focuses heavily on the LLM Judge evaluator, even though it should be the last evaluator you reach for: prefer deterministic evaluators (regex, exact match) whenever possible. For deterministic tests, plenty of tooling already exists, such as pytest (yes, we like the 'code-first' approach).
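As a sketch of what a deterministic evaluator can look like (a plain function for illustration, not ragpill's actual API):

```python
import re
from typing import Callable

def regex_evaluator(pattern: str) -> Callable[[str], bool]:
    """Build a pass/fail evaluator that checks an output against a regex."""
    compiled = re.compile(pattern)
    def evaluate(output: str) -> bool:
        return compiled.search(output) is not None
    return evaluate

# Example: the answer must contain an ISO-style date.
contains_date = regex_evaluator(r"\b\d{4}-\d{2}-\d{2}\b")
print(contains_date("Released on 2024-05-01"))  # True
print(contains_date("Released last spring"))    # False
```

A check like this is cheap, fully reproducible, and needs no model call, which is why it should be preferred over an LLM judge whenever the criterion can be expressed deterministically.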

Expert-Defined Attributes

LLM judges usually lack the context awareness to judge which discrepancies between chatbot answers and expected answers are relevant, especially in specialized fields like law, engineering, and science, where words have precise definitions.

Domain experts should define specific attributes and criteria for evaluation.

Binary Evaluations

We use boolean pass/fail values only, not scoring scales (1-10), because:

  • Scales are arbitrary and often decided by LLMs
  • Binary decisions are more stable and reproducible (although LLMs of course remain probabilistic)
  • Easier to track and reason about over time

Tags and Attributes for Organization

Evaluators can have:

  • Tags: Categorical labels for filtering (e.g., retrieval, time-aware-rag, basic_logic)
  • Attributes: Key-value metadata for categorization (e.g., importance: high, scope: Phase1)

Metrics are automatically calculated per tag and attribute.
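Conceptually, the per-tag and per-attribute metrics are just pass rates over boolean results. The following sketch shows the aggregation idea (ragpill computes this automatically; the data layout here is illustrative):

```python
from collections import defaultdict

# Each result: (passed, tags, attributes) — booleans only, per the
# binary-evaluation philosophy above.
results = [
    (True,  ["retrieval"],                {"importance": "high"}),
    (False, ["retrieval", "basic_logic"], {"importance": "high"}),
    (True,  ["basic_logic"],              {"importance": "low"}),
]

def pass_rates(results: list) -> dict:
    """Compute the pass rate for every tag and attribute key-value pair."""
    totals = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for passed, tags, attrs in results:
        keys = list(tags) + [f"{k}: {v}" for k, v in attrs.items()]
        for key in keys:
            totals[key][0] += int(passed)
            totals[key][1] += 1
    return {key: p / t for key, (p, t) in totals.items()}

rates = pass_rates(results)
print(rates["retrieval"])  # 0.5
```

Because every result is a boolean, the aggregated numbers have an unambiguous meaning (fraction of passing cases), which would not hold if evaluators returned scores on arbitrary scales.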


Key Concepts

As this library is built on pydantic-ai evals, please have a look at the pydantic-ai evals documentation.

Key Components

  • Dataset: From pydantic-ai, contains test cases with inputs, evaluators, and metadata
  • Evaluators: Check if outputs meet criteria (LLMJudge, regex matchers, custom evaluators)
  • MLflow Integration: Wraps execution, traces runs, evaluates outputs, uploads results
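At a conceptual level, the evaluation loop ties these components together simply: run the task on each case's inputs, then apply every evaluator to the output. The stdlib-only sketch below mimics that flow (real runs go through pydantic-ai evals Dataset objects and the MLflow wrapper; the names here are illustrative):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    name: str
    inputs: str
    evaluators: list  # each: Callable[[str], bool]

@dataclass
class Dataset:
    cases: list

def run(dataset: Dataset, task: Callable[[str], str]) -> dict:
    """Call the task on each case's inputs and apply every evaluator."""
    report = {}
    for case in dataset.cases:
        output = task(case.inputs)
        report[case.name] = all(ev(output) for ev in case.evaluators)
    return report

ds = Dataset(cases=[
    Case("echo", "hello", [lambda out: "hello" in out]),
])
print(run(ds, lambda text: text.upper()))  # {'echo': False}
```

In ragpill, the MLflow integration wraps this loop so that each task invocation is traced and each evaluator verdict is uploaded alongside the run.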

Features

  • Great MLflow Integration: Traces your agent/function execution to MLflow with evaluations in the native format
  • CSV/Excel Adapter: Load test cases from CSV files with evaluator configurations
  • Flexible Evaluators: Built-in LLM judges, regex matchers, and easy custom evaluator creation
  • Metrics per Tags/Attributes: Automatic metric calculation for each tag and attribute combination
  • Type Safety: Built on pydantic-ai with full type safety throughout


Best Practices

[!TIP] TDD Mindset — Begin by defining a test set with potential users before you even start developing the solution. This enables clear expectation management and progress tracking.

[!TIP] Create Multiple Testsets — It can make sense to keep a core set of tests that runs quickly and inexpensively for day-to-day development, plus an exhaustive dataset integrated into your CI/CD that runs before deploying to prod.

[!TIP] Separate Evaluation Experiments — Create dedicated MLflow experiments for evaluations. Don't mix evaluation traces with production traces.

[!TIP] Use Domain Experts — Have domain experts define evaluation criteria rather than relying solely on generic LLM judges.

[!TIP] Version Your Tests — Keep test datasets in version control alongside your code.

Documentation

Full documentation is available at joelgotsch.github.io/ragpill/latest including:

  • Installation Guide: Setup instructions
  • Quickstart Tutorial: Run your first evaluation
  • CSV Adapter Guide: Learn the CSV format and column meanings
  • Evaluators Guide: Create custom evaluators
  • MLflow Integration: Advanced MLflow usage
  • API Reference: Complete API documentation

Roadmap

  • Adapter for testset from CSV
  • Documentation via mkdocs
  • Evaluators for sources and regex
  • Repeat Task Evaluations (run task multiple times and evaluate with threshold)
  • Adapter for task from CSV (upload to mlflow)
  • Create demo video
  • CI/CD (tests, build package, publish docs)
  • Global evaluators from CSV (empty input)
  • Track git-commit hash in experiment
  • Tests with mlflow server
  • Dependency injection for llm, input_to_key functions
  • pytest integration

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
