Tiny CI tool for RAG retrieval regression testing.
rag-contract
Fail CI when your RAG app stops retrieving the right documents for important test questions.
rag-contract compares expected document IDs with the document IDs your retriever actually returned. It returns pass or fail.
A retriever is the part of a RAG system that finds documents before the LLM writes an answer.
A golden query is a saved test question with the document IDs that should be returned. It works like an answer key for retrieval.
A baseline is a known-good retrieval run. Future runs are compared against it.
Use rag-contract when you change:
chunking
embeddings
reranking
filters
document parsing
vector database settings
indexed documents
Example
You save this test question:
{"id":"refund_policy","query":"What is the refund policy for enterprise customers?","relevant_doc_ids":["doc_refund_policy"],"must_rank_at_most":3}
This means:
When the query asks about refund policy, doc_refund_policy should appear in the top 3 retrieved documents.
After a code change, your retriever returns this:
{"query_id":"refund_policy","results":[{"doc_id":"doc_pricing"},{"doc_id":"doc_terms"},{"doc_id":"doc_support"}]}
doc_refund_policy is missing, so the check fails:
FAIL refund_policy
Expected doc_refund_policy in top 3
Found: missing from top 5
This catches the retrieval bug before the PR is merged.
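The check in this example comes down to finding the best rank of any expected document and comparing it with must_rank_at_most. The following is a minimal Python sketch of that idea, not rag-contract's actual implementation; the helper name best_rank is hypothetical.

```python
import json

# The golden row and the run row from the example above.
golden = json.loads(
    '{"id":"refund_policy","query":"What is the refund policy for enterprise customers?",'
    '"relevant_doc_ids":["doc_refund_policy"],"must_rank_at_most":3}'
)
run = json.loads(
    '{"query_id":"refund_policy","results":[{"doc_id":"doc_pricing"},'
    '{"doc_id":"doc_terms"},{"doc_id":"doc_support"}]}'
)


def best_rank(relevant_doc_ids, results):
    """Return the 1-based rank of the first relevant document, or None if absent."""
    relevant = set(relevant_doc_ids)
    for rank, result in enumerate(results, start=1):
        if result["doc_id"] in relevant:
            return rank
    return None


rank = best_rank(golden["relevant_doc_ids"], run["results"])
passed = rank is not None and rank <= golden["must_rank_at_most"]
print("PASS" if passed else "FAIL", golden["id"])
```

With the rows above, no expected document is found, so the sketch reports a failure for refund_policy.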
Install
pip install rag-contract
Setup
You need four files:
golden.jsonl
baseline_run.jsonl
baseline.json
ragcontract.yml
1. Create golden.jsonl
This file contains your test questions and the document IDs that should be retrieved.
Example:
{"id":"refund_policy","query":"What is the refund policy for enterprise customers?","relevant_doc_ids":["doc_refund_policy"],"must_rank_at_most":3}
{"id":"hipaa_baa","query":"Do we offer a BAA for HIPAA customers?","relevant_doc_ids":["doc_hipaa_compliance","doc_baa_terms"],"must_rank_at_most":5}
Each row is one test case.
2. Export a known-good retrieval run
Run your retriever on the golden queries and save what it returned.
Example baseline_run.jsonl:
{"query_id":"refund_policy","results":[{"doc_id":"doc_refund_policy","score":0.92},{"doc_id":"doc_terms","score":0.74},{"doc_id":"doc_pricing","score":0.61}]}
{"query_id":"hipaa_baa","results":[{"doc_id":"doc_baa_terms","score":0.89},{"doc_id":"doc_hipaa_compliance","score":0.82},{"doc_id":"doc_security","score":0.64}]}
The run file can come from any retriever. The only required fields are:
query_id
results[].doc_id
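One way to produce such a run file is a small export script that loops over the golden queries. The sketch below assumes a hypothetical retrieve(query, k) function that returns ranked (doc_id, score) pairs; replace the stub with a call into your own retriever. The function names are illustrative and not part of rag-contract.

```python
import json


def retrieve(query, k):
    """Hypothetical stand-in for your retriever: returns ranked (doc_id, score) pairs."""
    canned = [("doc_refund_policy", 0.92), ("doc_terms", 0.74), ("doc_pricing", 0.61)]
    return canned[:k]


def build_run_row(golden_row, k=5):
    """Convert one golden row plus retriever output into a run-file row."""
    hits = retrieve(golden_row["query"], k)
    return {
        "query_id": golden_row["id"],  # must match the id in golden.jsonl
        "results": [{"doc_id": doc_id, "score": score} for doc_id, score in hits],
    }


def export_run(golden_path, out_path, k=5):
    """Read golden.jsonl and write one run row per golden query."""
    with open(golden_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                row = build_run_row(json.loads(line), k)
                dst.write(json.dumps(row) + "\n")
```

Calling export_run("golden.jsonl", "baseline_run.jsonl") would then write a file in the format shown above.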
3. Create a baseline
rag-contract baseline \
--golden golden.jsonl \
--run baseline_run.jsonl \
--out baseline.json
This saves the known-good retrieval scores.
4. Create ragcontract.yml
k: 5
fail_on:
  mrr_drop_gt: 0.10
  recall_drop_gt: 0.10
  hitrate_drop_gt: 0.05
minimums:
  mrr_at_k: 0.70
  recall_at_k: 0.80
  hitrate_at_k: 0.90
per_query:
  enforce_must_rank_at_most: true
  enforce_must_include: true
  enforce_forbidden_docs: true
5. Check a new retrieval run
After changing your RAG pipeline, export a new run.
Example current_run.jsonl:
{"query_id":"refund_policy","results":[{"doc_id":"doc_pricing","score":0.81},{"doc_id":"doc_terms","score":0.72},{"doc_id":"doc_support","score":0.66}]}
{"query_id":"hipaa_baa","results":[{"doc_id":"doc_baa_terms","score":0.87},{"doc_id":"doc_security","score":0.68},{"doc_id":"doc_hipaa_compliance","score":0.62}]}
Run the check:
rag-contract check \
--golden golden.jsonl \
--run current_run.jsonl \
--baseline baseline.json \
--config ragcontract.yml
Exit codes:
0 = pass
1 = retrieval check failed
2 = invalid input
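If you call the check from a Python script instead of a shell step, the exit codes can be mapped to messages. This is a sketch that assumes rag-contract is on the PATH and uses the file names from this guide; the function names are hypothetical.

```python
import subprocess

# Meanings taken from the exit code list above.
EXIT_MEANINGS = {0: "pass", 1: "retrieval check failed", 2: "invalid input"}


def describe_exit_code(code):
    """Map a rag-contract exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_check():
    """Run `rag-contract check` and return its exit code."""
    completed = subprocess.run(
        [
            "rag-contract", "check",
            "--golden", "golden.jsonl",
            "--run", "current_run.jsonl",
            "--baseline", "baseline.json",
            "--config", "ragcontract.yml",
        ],
        check=False,  # inspect the return code ourselves instead of raising
    )
    return completed.returncode
```

A caller could then write print(describe_exit_code(run_check())) and decide how to react to each outcome.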
Input files
rag-contract uses two JSONL input files.
JSONL means one JSON object per line.
Golden file
The golden file contains the expected retrieval behavior.
Example:
{"id":"refund_policy","query":"What is the refund policy for enterprise customers?","relevant_doc_ids":["doc_refund_policy"],"must_rank_at_most":3,"tags":["policy"],"weight":2}
Required fields:
id
query
relevant_doc_ids
Optional fields:
must_rank_at_most
must_include_any
forbidden_doc_ids
weight
tags
metadata
Field meanings:
id                 stable ID for the test query
query              the question being tested
relevant_doc_ids   document IDs that should be retrieved
must_rank_at_most  lowest allowed position for the expected document, counting from 1 (3 means it must appear in the top 3)
must_include_any   pass if at least one relevant document appears
forbidden_doc_ids  document IDs that must not appear in the retrieved results
weight             importance of this query in aggregate metrics
tags               labels used for grouped reporting
metadata           extra information saved with the test case
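Before committing a golden file, it can help to check each line for the required fields. The sketch below is an illustration only and may be stricter or looser than the validate command; the function name is hypothetical, and the assumption that relevant_doc_ids is a list comes from the examples in this guide.

```python
import json

REQUIRED_GOLDEN_FIELDS = ("id", "query", "relevant_doc_ids")


def golden_row_errors(line):
    """Return a list of problems found in one golden.jsonl line (empty if none)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]
    errors = [
        f"missing required field: {name}"
        for name in REQUIRED_GOLDEN_FIELDS
        if name not in row
    ]
    # Assumption: relevant_doc_ids is a list of document IDs, as in the examples.
    if "relevant_doc_ids" in row and not isinstance(row["relevant_doc_ids"], list):
        errors.append("relevant_doc_ids must be a list")
    return errors
```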
Retriever run file
The run file contains the documents returned by your retriever.
Example:
{"query_id":"refund_policy","results":[{"doc_id":"doc_pricing","score":0.81},{"doc_id":"doc_refund_policy","score":0.77},{"doc_id":"doc_terms","score":0.61}],"latency_ms":42}
Required fields:
query_id
results[].doc_id
Optional fields:
results[].score
results[].chunk_id
results[].metadata
latency_ms
embedding_model
index_version
chunking_version
retriever_version
Field meanings:
query_id            ID from the golden file
results             ranked list of retrieved documents
results[].doc_id    document ID returned by the retriever
results[].score     retriever score, if available
results[].chunk_id  chunk ID, if retrieval happens at chunk level
latency_ms          retrieval latency for the query
embedding_model     embedding model used for this run
index_version       index version used for this run
chunking_version    chunking version used for this run
retriever_version   retriever version used for this run
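Because query_id must match an id in the golden file, a quick cross-check of the two files can catch typos before a run is scored. This is a sketch of such a check, not a description of what rag-contract validate does; the function name is hypothetical.

```python
def cross_check_ids(golden_rows, run_rows):
    """Compare golden ids with run query_ids.

    Returns (missing_runs, unknown_runs): golden ids with no run row,
    and run query_ids that match no golden row.
    """
    golden_ids = {row["id"] for row in golden_rows}
    run_ids = {row["query_id"] for row in run_rows}
    missing_runs = sorted(golden_ids - run_ids)
    unknown_runs = sorted(run_ids - golden_ids)
    return missing_runs, unknown_runs


golden_rows = [{"id": "refund_policy"}, {"id": "hipaa_baa"}]
run_rows = [
    {"query_id": "refund_policy", "results": []},
    {"query_id": "refund_polcy", "results": []},  # deliberate typo
]
missing, unknown = cross_check_ids(golden_rows, run_rows)
print("golden queries without a run row:", missing)
print("run rows without a golden query:", unknown)
```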
Your RAG stack only needs to export this format.
The retriever can use:
LangChain
LlamaIndex
Chroma
Pinecone
Weaviate
Postgres
Elasticsearch
custom code
Config
Example ragcontract.yml:
k: 5
fail_on:
  mrr_drop_gt: 0.10
  recall_drop_gt: 0.10
  hitrate_drop_gt: 0.05
minimums:
  mrr_at_k: 0.70
  recall_at_k: 0.80
  hitrate_at_k: 0.90
per_query:
  enforce_must_rank_at_most: true
  enforce_must_include: true
  enforce_forbidden_docs: true
Config fields:
k                          number of retrieved documents to evaluate
mrr_drop_gt                fail if MRR drops by more than this amount
recall_drop_gt             fail if Recall drops by more than this amount
hitrate_drop_gt            fail if HitRate drops by more than this amount
mrr_at_k                   minimum allowed MRR@k
recall_at_k                minimum allowed Recall@k
hitrate_at_k               minimum allowed HitRate@k
enforce_must_rank_at_most  fail when an expected document ranks below its must_rank_at_most limit
enforce_must_include       fail when expected docs are missing
enforce_forbidden_docs     fail when forbidden docs appear
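The fail_on and minimums settings can be read as two independent tests per metric: how far the value dropped from the baseline, and whether the current value is still above its floor. The sketch below illustrates that reading with the thresholds from the example config. The metric dictionary keys (mrr, recall, hitrate) and the function name are assumptions made for this example, not rag-contract's internal format.

```python
config = {
    "fail_on": {"mrr_drop_gt": 0.10, "recall_drop_gt": 0.10, "hitrate_drop_gt": 0.05},
    "minimums": {"mrr_at_k": 0.70, "recall_at_k": 0.80, "hitrate_at_k": 0.90},
}


def threshold_failures(baseline, current, config):
    """Return a list of threshold violations for the aggregate metrics."""
    failures = []
    for metric in ("mrr", "recall", "hitrate"):
        drop = baseline[metric] - current[metric]
        limit = config["fail_on"][f"{metric}_drop_gt"]
        minimum = config["minimums"][f"{metric}_at_k"]
        if drop > limit:
            failures.append(f"{metric} dropped by {drop:.2f} (limit {limit:.2f})")
        if current[metric] < minimum:
            failures.append(f"{metric} is {current[metric]:.2f}, below minimum {minimum:.2f}")
    return failures


baseline = {"mrr": 0.82, "recall": 0.91, "hitrate": 0.96}
current = {"mrr": 0.68, "recall": 0.76, "hitrate": 0.84}
for failure in threshold_failures(baseline, current, config):
    print(failure)
```

With these numbers, each of the three metrics violates both its drop limit and its minimum.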
Metrics
rag-contract computes:
MRR@k
Recall@k
Precision@k
HitRate@k
Plain-English meanings:
MRR@k        how high the first correct document appears
Recall@k     what fraction of the expected documents appeared in the top k
Precision@k  what fraction of the retrieved documents were expected
HitRate@k    whether at least one expected document appeared in the top k
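For a single query, these metrics can be computed from the expected IDs and the ranked list of retrieved IDs. The sketch below uses the common textbook definitions and assumes Precision@k divides by k; rag-contract's exact conventions may differ, and the reported values are presumably averages over all golden queries. The function name is hypothetical.

```python
def retrieval_metrics(relevant_doc_ids, retrieved_doc_ids, k):
    """Compute per-query retrieval metrics over the top k retrieved documents."""
    relevant = set(relevant_doc_ids)
    top_k = retrieved_doc_ids[:k]
    # Distinct expected documents found in the top k.
    hits = {doc_id for doc_id in top_k if doc_id in relevant}
    # 1-based rank of the first expected document, or None if there is none.
    first_rank = next(
        (rank for rank, doc_id in enumerate(top_k, start=1) if doc_id in relevant),
        None,
    )
    return {
        "mrr": 1.0 / first_rank if first_rank else 0.0,  # reciprocal rank
        "recall": len(hits) / len(relevant) if relevant else 0.0,
        "precision": len(hits) / k,  # assumption: denominator is k
        "hitrate": 1.0 if hits else 0.0,
    }


# The hipaa_baa query from current_run.jsonl, evaluated at k = 5.
metrics = retrieval_metrics(
    ["doc_hipaa_compliance", "doc_baa_terms"],
    ["doc_baa_terms", "doc_security", "doc_hipaa_compliance"],
    k=5,
)
print(metrics)
```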
Example metric output:
MRR@5 0.82 -> 0.68 FAIL
Recall@5 0.91 -> 0.76 FAIL
Precision@5 0.44 -> 0.41 PASS
HitRate@5 0.96 -> 0.84 FAIL
When golden queries include tags, report.json includes tag-level metrics.
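Tag-level metrics can be thought of as the same averages computed over the subset of queries that carry each tag. The sketch below shows that grouping for one metric. It is an assumption about the general approach, not the layout of report.json, and the "compliance" tag and the function name are made up for the example.

```python
from collections import defaultdict


def mean_by_tag(per_query):
    """Average a per-query metric within each tag.

    per_query is a list of (tags, value) pairs; a query with several tags
    counts toward each of them.
    """
    buckets = defaultdict(list)
    for tags, value in per_query:
        for tag in tags:
            buckets[tag].append(value)
    return {tag: sum(values) / len(values) for tag, values in buckets.items()}


# HitRate per query: 1.0 means an expected document was found.
hitrate_by_tag = mean_by_tag([
    (["policy"], 1.0),
    (["policy", "compliance"], 0.0),
])
print(hitrate_by_tag)
```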
Per-query checks
You can define query-specific rules.
Example:
{"id":"public_pricing","query":"What is public pricing?","relevant_doc_ids":["pricing_public"],"forbidden_doc_ids":["internal_discount_policy"],"must_rank_at_most":3}
This check fails when:
pricing_public is missing from the top results
pricing_public appears below rank 3
internal_discount_policy appears in the retrieved results
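The three failure conditions above can be expressed as one function that returns a list of reasons. This is an illustrative sketch rather than rag-contract's implementation; the function name, the message wording, and the retrieved document IDs other than those in the example row are hypothetical.

```python
def per_query_failures(golden_row, run_row, k=5):
    """Return the reasons a single query fails its per-query rules."""
    doc_ids = [result["doc_id"] for result in run_row["results"]][:k]
    relevant = set(golden_row["relevant_doc_ids"])
    failures = []

    # Rank of the first expected document within the top k, if any.
    rank = next(
        (i for i, doc_id in enumerate(doc_ids, start=1) if doc_id in relevant),
        None,
    )
    limit = golden_row.get("must_rank_at_most")
    if rank is None:
        failures.append(f"expected document missing from top {k}")
    elif limit is not None and rank > limit:
        failures.append(f"expected document at rank {rank}, allowed at most {limit}")

    # Forbidden documents must not appear anywhere in the evaluated results.
    for doc_id in golden_row.get("forbidden_doc_ids", []):
        if doc_id in doc_ids:
            failures.append(f"forbidden document retrieved: {doc_id}")
    return failures


golden_row = {
    "id": "public_pricing",
    "query": "What is public pricing?",
    "relevant_doc_ids": ["pricing_public"],
    "forbidden_doc_ids": ["internal_discount_policy"],
    "must_rank_at_most": 3,
}
run_row = {
    "query_id": "public_pricing",
    "results": [
        {"doc_id": "internal_discount_policy"},
        {"doc_id": "pricing_faq"},
        {"doc_id": "pricing_history"},
        {"doc_id": "pricing_public"},
    ],
}
problems = per_query_failures(golden_row, run_row)
print(problems)
```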
Commands
Validate input files:
rag-contract validate \
--golden golden.jsonl \
--run current_run.jsonl
Score one run:
rag-contract score \
--golden golden.jsonl \
--run current_run.jsonl \
--k 5
Create a baseline:
rag-contract baseline \
--golden golden.jsonl \
--run baseline_run.jsonl \
--out baseline.json
Check against a baseline:
rag-contract check \
--golden golden.jsonl \
--run current_run.jsonl \
--baseline baseline.json \
--config ragcontract.yml
Show query-level changes:
rag-contract diff \
--golden golden.jsonl \
--run current_run.jsonl \
--baseline baseline.json
GitHub Actions
name: RAG Contract Tests
on:
  pull_request:
jobs:
  rag-contract:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install rag-contract
        run: pip install rag-contract
      - name: Run retriever
        run: python examples/export_retrieval_run.py --out current_run.jsonl
      - name: Check retrieval contracts
        run: |
          rag-contract check \
            --golden golden.jsonl \
            --run current_run.jsonl \
            --baseline baseline.json \
            --config ragcontract.yml \
            --report-md report.md \
            --junit junit.xml
Output files
By default, check writes:
report.md
report.json
junit.xml
report.md
Human-readable report for local review or CI artifacts.
report.json
Machine-readable report with:
global metrics
per-query results
failed checks
tag-level metrics
junit.xml
JUnit-compatible test report for CI systems.
Checked failures
rag-contract detects:
expected document is missing
expected document moved below the allowed rank
forbidden document appeared in retrieved results
MRR@k dropped more than allowed
Recall@k dropped more than allowed
HitRate@k dropped more than allowed
overall metric is below the configured minimum
Scope
rag-contract evaluates retrieval output only.
Out of scope:
generated answer grading
LLM judges
synthetic test question generation
hosted dashboards
document ingestion
direct vector database connections
framework-specific requirements
File format summary
Golden row:
{"id":"query_id","query":"user question","relevant_doc_ids":["doc_id"],"must_rank_at_most":5}
Run row:
{"query_id":"query_id","results":[{"doc_id":"doc_id","score":0.91}]}
Minimum setup:
golden.jsonl
baseline_run.jsonl
baseline.json
current_run.jsonl
ragcontract.yml