# AutoJudge Base
Core infrastructure for implementing TREC AutoJudge systems.
## Installation

```bash
pip install autojudge-base
```
## Quick Start

```python
from autojudge_base import (
    AutoJudge,
    Report,
    Request,
    Leaderboard,
    LeaderboardBuilder,
    LlmConfigBase,
)

# Define your judge
class MyJudge:
    nugget_banks_type = None  # Or your NuggetBanks class

    def judge(self, rag_responses, rag_topics, llm_config, **kwargs):
        # Your judging logic
        builder = LeaderboardBuilder(...)
        return builder.build()

    def create_nuggets(self, rag_responses, rag_topics, llm_config, **kwargs):
        return None  # Or create nuggets

    def create_qrels(self, rag_responses, rag_topics, llm_config, **kwargs):
        return None  # Or create qrels
```
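To see the judge in action outside the CLI, here is a minimal wiring sketch using the loaders described under Data Loading Utilities below. It assumes `load_llm_config()` is exported from the package root and resolves a configuration from env/yaml/cli defaults when called without arguments; in a normal deployment the `auto-judge run` CLI performs this wiring for you.

```python
from pathlib import Path

from autojudge_base import load_llm_config
from autojudge_base.report import load_report
from autojudge_base.request import load_requests_from_file

reports = load_report(Path("responses.jsonl"))          # RAG system outputs
topics = load_requests_from_file(Path("topics.jsonl"))  # Evaluation topics
llm_config = load_llm_config()  # Assumption: no-arg call picks up env defaults

leaderboard = MyJudge().judge(
    rag_responses=reports, rag_topics=topics, llm_config=llm_config
)
```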
## Components

### Protocols

- `AutoJudge` - Combined protocol for all three phases (see Quick Start). Examples: `NaiveJudge`, `RetrievalJudge` in the starterkit.
- `LeaderboardJudgeProtocol` - Produces leaderboard scores (see Writing Leaderboards). Example: `TinyJudge` (minimal LLM judge).
- `QrelsCreatorProtocol` - Creates relevance judgments (see Writing Qrels).
- `NuggetCreatorProtocol` - Creates nugget banks (see Writing NuggetBanks).

For modular composition (separate classes per protocol), see `CompleteExampleJudge` in the starterkit; a sketch follows below.
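For illustration, a minimal class that covers only the leaderboard phase. The method name and signature are assumed to mirror the combined `AutoJudge` protocol from the Quick Start; the constant score is a placeholder for real LLM-based scoring.

```python
from autojudge_base import Leaderboard, LeaderboardBuilder, LeaderboardSpec, MeasureSpec

class ScoreOnlyJudge:
    """Covers only the leaderboard phase (LeaderboardJudgeProtocol)."""

    def judge(self, rag_responses, rag_topics, llm_config, **kwargs) -> Leaderboard:
        builder = LeaderboardBuilder(LeaderboardSpec(measures=(MeasureSpec("RELEVANCE"),)))
        for report in rag_responses:
            builder.add(
                run_id=report.metadata.run_id,
                topic_id=report.metadata.topic_id,
                values={"RELEVANCE": 1.0},  # Placeholder score
            )
        return builder.build()
```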
### Input Data Models

- `Report` - RAG system output (see Loading Reports)
- `Request` - Evaluation topic/query (see Loading Requests)
- `Document` - Document with content (see Loading Documents)
### Output Containers

- `Leaderboard` - System rankings (see Writing Leaderboards)
- `Qrels` - Relevance judgments (see Writing Qrels)
- `NuggetBanks`, `NuggetizerNuggetBanks` - Nugget collections (see Writing NuggetBanks)
### Configuration

- `LlmConfigProtocol`, `LlmConfigBase` - LLM configuration
- `load_llm_config()` - Load config from env/yaml/cli

See the auto-judge-starterkit README for LLM configuration examples.
## CLI

```bash
# Run a judge workflow
auto-judge run --workflow workflow.yml --rag-responses responses.jsonl

# Export corpus
auto-judge export-corpus --output corpus.tar.gz

# List available models
auto-judge list-models
```
## Data Loading Utilities

### Loading Reports (RAG System Outputs)

A `Report` contains sentences with text and citations. Three sentence formats are supported:
| Format | Citations Field | Description |
|---|---|---|
| `NeuclirReportSentence` | `List[str]` | Doc IDs ordered by priority |
| `RagtimeReportSentence` | `Dict[str, float]` | Doc ID → confidence score (0-100) |
| `Rag24ReportSentence` | `List[int]` | Indices into `report.references` |
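To make the table concrete, here is a sketch of how the three citation shapes reduce to a single priority-ordered list of doc IDs. This mirrors what `get_sentences_with_citations()` (below) does for you; the helper itself is hypothetical.

```python
from typing import Dict, List, Union

def citation_doc_ids(
    citations: Union[List[str], Dict[str, float], List[int]],
    references: List[str],
) -> List[str]:
    """Hypothetical normalization of the three citation shapes to doc IDs."""
    if isinstance(citations, dict):
        # Ragtime: doc ID -> confidence (0-100); highest confidence first
        return sorted(citations, key=citations.get, reverse=True)
    if citations and isinstance(citations[0], int):
        # RAG24: indices into report.references
        return [references[i] for i in citations]
    # NeuCLIR: already doc IDs ordered by priority
    return list(citations)
```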
Getting text and citations from sentences:

```python
from pathlib import Path

from autojudge_base.report import Report, load_report

reports: list[Report] = load_report(Path("responses.jsonl"))

for report in reports:
    # Get sentences with citations in unified format (does not modify report)
    for sentence in report.get_sentences_with_citations():
        text = sentence.text                  # The sentence text
        citations = sentence.citations or []  # Doc IDs ordered by priority

        # Get the cited document content
        for doc_id in citations:
            if report.documents and doc_id in report.documents:
                doc = report.documents[doc_id]
                print(f"Citation: {doc.title} - {doc.text[:100]}...")
```
Convenience methods:

```python
# Text only (no citations)
texts: list[str] = report.get_sentences()

# Text with citations (unified format, non-mutating)
sentences: list[NeuclirReportSentence] = report.get_sentences_with_citations()

# Full response as single string
text: str = report.get_report_text()

# Full text of cited documents (keyed by doc_id)
documents: dict[str, Document] = report.documents
```
Report metadata:

```python
for report in reports:
    print(report.metadata.run_id)    # Which system produced this
    print(report.metadata.topic_id)  # Which topic/query this answers
```

Note that the report formats of the various TREC tasks differ slightly. This module loads any of them automatically and exposes them through this one unified interface. Task-specific fields, such as `narrative_id` vs `request_id`, remain available.
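As a hedged illustration of those task-specific fields (where they live is an assumption; inspect your loaded reports for what is actually present):

```python
for report in reports:
    print(report.metadata.topic_id)  # Unified field, always present
    # Assumption: task-specific fields sit on the metadata object and may
    # be absent for other tasks, hence the getattr fallback.
    print(getattr(report.metadata, "narrative_id", None))
```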
### Loading Requests (Topics/Queries)

A `Request` represents an evaluation topic with the query and context:

```python
from pathlib import Path

from autojudge_base.request import Request, load_requests_from_file, load_requests_from_irds

# Load from JSONL file
requests: list[Request] = load_requests_from_file(Path("topics.jsonl"))

# Load from ir_datasets
requests: list[Request] = load_requests_from_irds("trec-rag-2025")

# Access request fields
for req in requests:
    print(req.request_id)         # Unique topic identifier
    print(req.title)              # The query/question (required)
    print(req.problem_statement)  # Detailed description of the information need
    print(req.background)         # User background/context for personalization
```

RAG narratives were converted to this format, exposing narratives in the `problem_statement` field.
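A typical use of these fields is assembling context for an LLM judge. A sketch, assuming `problem_statement` and `background` may be empty; the prompt wording is purely illustrative:

```python
from autojudge_base.request import Request

def request_to_prompt(req: Request) -> str:
    """Format an evaluation topic as judging context (illustrative wording)."""
    parts = [f"Question: {req.title}"]
    if req.problem_statement:
        parts.append(f"Information need: {req.problem_statement}")
    if req.background:
        parts.append(f"User background: {req.background}")
    return "\n".join(parts)

prompt = request_to_prompt(requests[0])
```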
### Loading Documents (Background Corpus)

Use this when you need to fetch additional documents from a background corpus beyond what's included in reports.

```python
from pathlib import Path

from autojudge_base.document import Document, load_documents, RetrievedDocuments, load_retrieved_docs

# Load corpus documents from JSONL
docs: list[Document] = load_documents(Path("corpus.jsonl"))

# Access document content
for doc in docs:
    print(doc.id)          # Document identifier
    print(doc.title)       # Document title (optional)
    print(doc.text)        # Document content
    print(doc.get_text())  # Title + text combined

# Load pre-retrieved documents with rankings
retrieved: list[RetrievedDocuments] = load_retrieved_docs(Path("retrieved.jsonl"))
for result in retrieved:
    print(result.query_id)
    for ranked_doc in result.ranked_docs:
        print(f"  Rank {ranked_doc.rank}: {ranked_doc.doc.id} (score: {ranked_doc.score})")
```
## Writing Leaderboards

Use `LeaderboardBuilder` with a `LeaderboardSpec` to create type-safe leaderboards:

```python
from autojudge_base import LeaderboardBuilder, LeaderboardSpec, MeasureSpec

# Define your measures
spec = LeaderboardSpec(measures=(
    MeasureSpec("RELEVANCE"),  # Default: mean aggregation, float values
    MeasureSpec("FLUENCY"),
    MeasureSpec("CITATION_QUALITY", aggregate=sum),  # Custom aggregation
))

# Build the leaderboard
builder = LeaderboardBuilder(spec)
for report in reports:
    builder.add(
        run_id=report.metadata.run_id,
        topic_id=report.metadata.topic_id,
        values={
            "RELEVANCE": 0.85,
            "FLUENCY": 0.92,
            "CITATION_QUALITY": 3,
        },
    )

# Finalize with expected topics (handles missing data)
leaderboard = builder.build(
    expected_topic_ids=["topic1", "topic2", "topic3"],
    on_missing="fix_aggregate",  # or "error", "warn", "ignore"
)
```
In your judge:

```python
def judge(self, rag_responses, rag_topics, llm_config, **kwargs) -> Leaderboard:
    # ... build leaderboard as above ...
    return leaderboard
```
Manual file I/O (FYI): judge implementations return objects; the framework handles persistence.

```python
# Write (formats: "trec_eval", "tot", "ir_measures", "jsonl")
leaderboard.write(Path("output.eval"), format="trec_eval")

# Load
leaderboard = Leaderboard.load(Path("output.eval"), format="trec_eval")
```
## Writing NuggetBanks

NuggetBanks store evaluation nuggets (questions/claims) per topic. Two formats are supported:

| Format | Configure in workflow.yml | Description |
|---|---|---|
| `NuggetBanks` | `nugget_banks_type: "autojudge_base.nugget_data.NuggetBanks"` | AutoArgue format with questions and claims |
| `NuggetizerNuggetBanks` | `nugget_banks_type: "autojudge_base.nugget_data.NuggetizerNuggetBanks"` | Nuggetizer format |
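In workflow.yml this is a single key; an excerpt (other workflow settings omitted):

```yaml
# Select the nugget-bank format your judge produces
nugget_banks_type: "autojudge_base.nugget_data.NuggetBanks"
```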
```python
from pathlib import Path

from autojudge_base.nugget_data import (
    NuggetBanks,
    NuggetBank,
    NuggetQuestion,
    load_nugget_banks_from_file,
    write_nugget_banks,
)

# Load existing nuggets
nuggets: NuggetBanks = load_nugget_banks_from_file(Path("nuggets.jsonl"))

# Access by topic
bank = nuggets.banks["topic-1"]
for question in bank.nuggets_as_list():
    print(question.question)
    print(question.gold_answers)

# Create new nuggets
bank = NuggetBank(query_id="topic-1")
bank.add_nuggets([
    NuggetQuestion.from_lazy(
        query_id="topic-1",
        question="What is the capital of France?",
        gold_answers=["Paris"],
        references=["doc123"],
        creator="my-judge",
    )
])
nuggets = NuggetBanks.from_banks_list([bank])
```
In your judge:

```python
nugget_banks_type = NuggetBanks  # Required class attribute

def create_nuggets(self, rag_responses, rag_topics, llm_config, **kwargs) -> NuggetBanks:
    # ... create nugget banks as above ...
    return nuggets
```
Manual file I/O (FYI): judge implementations return objects; the framework handles persistence.

```python
# Write
write_nugget_banks(nuggets, Path("nuggets.jsonl"))

# Load
nuggets = load_nugget_banks_from_file(Path("nuggets.jsonl"))
```
## Writing Qrels

Qrels store relevance judgments as (topic_id, doc_id, grade) tuples.

```python
from dataclasses import dataclass
from pathlib import Path

from autojudge_base.qrels import Qrels, QrelRow, QrelsSpec, build_qrels, read_qrel_file, write_qrel_file

# Option 1: Build directly from QrelRow objects
rows = [
    QrelRow(topic_id="topic1", doc_id="doc123", grade=2),
    QrelRow(topic_id="topic1", doc_id="doc456", grade=1),
    QrelRow(topic_id="topic2", doc_id="doc789", grade=3),
]
qrels = Qrels(rows=rows)

# Option 2: Build from arbitrary records using QrelsSpec
@dataclass
class MyJudgment:
    query: str
    document: str
    relevance: int

judgments = [
    MyJudgment("topic1", "doc123", 2),
    MyJudgment("topic1", "doc456", 1),
]

spec = QrelsSpec(
    topic_id=lambda j: j.query,
    doc_id=lambda j: j.document,
    grade=lambda j: j.relevance,
    on_duplicate="keep_max",  # or "error", "keep_last"
)
qrels = build_qrels(records=judgments, spec=spec)
```
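To illustrate the `on_duplicate` option: when the same (topic, doc) pair is judged more than once, `keep_max` is assumed to retain the highest grade. A sketch of that expectation, not verified against the implementation:

```python
dup_judgments = [
    MyJudgment("topic1", "doc123", 1),
    MyJudgment("topic1", "doc123", 2),  # Same pair, higher grade
]
dup_qrels = build_qrels(records=dup_judgments, spec=spec)
# Under on_duplicate="keep_max", the expectation is a single surviving row
# for ("topic1", "doc123") with grade 2.
```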
In your judge:

```python
def create_qrels(self, rag_responses, rag_topics, llm_config, **kwargs) -> Qrels:
    # ... build qrels as above ...
    return qrels
```
Manual file I/O (FYI): judge implementations return objects; the framework handles persistence.

```python
# Write to TREC format (topic_id 0 doc_id grade)
write_qrel_file(qrel_out_file=Path("output.qrels"), qrels=qrels)

# Load from TREC format
qrels = read_qrel_file(Path("input.qrels"))
```
## License

MIT