
AutoJudge Base

Core infrastructure for implementing TREC AutoJudge systems.

Installation

pip install autojudge-base

Quick Start

from autojudge_base import (
    AutoJudge,
    Report,
    Request,
    Leaderboard,
    LeaderboardBuilder,
    LlmConfigBase,
)

# Define your judge
class MyJudge:
    nugget_banks_type = None  # Or your NuggetBanks class

    def judge(self, rag_responses, rag_topics, llm_config, **kwargs):
        # Your judging logic
        builder = LeaderboardBuilder(...)
        return builder.build()

    def create_nuggets(self, rag_responses, rag_topics, llm_config, **kwargs):
        return None  # Or create nuggets

    def create_qrels(self, rag_responses, rag_topics, llm_config, **kwargs):
        return None  # Or create qrels
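
In practice the trec-auto-judge CLI (see below) drives these three methods; wired by hand it looks roughly like this, where reports, requests, and llm_config are placeholders for data loaded as shown in later sections:

judge = MyJudge()

# Placeholder inputs; normally the framework loads and passes these.
leaderboard = judge.judge(reports, requests, llm_config)
nuggets = judge.create_nuggets(reports, requests, llm_config)
qrels = judge.create_qrels(reports, requests, llm_config)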

Components

Protocols

  • AutoJudge - Combined protocol for all three phases (see Quick Start)
    • Example: NaiveJudge, RetrievalJudge in starterkit
  • LeaderboardJudgeProtocol - Produces leaderboard scores (see Writing Leaderboards)
    • Example: TinyJudge (minimal LLM judge)
  • QrelsCreatorProtocol - Creates relevance judgments (see Writing Qrels)
  • NuggetCreatorProtocol - Creates nugget banks (see Writing NuggetBanks)

For modular composition (separate classes per protocol), see CompleteExampleJudge in starterkit.
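
A rough sketch of that style, implementing only the leaderboard phase (the class name and constant scoring below are illustrative, not the starterkit's code):

from autojudge_base import LeaderboardBuilder, LeaderboardSpec, MeasureSpec

# Illustrative judge covering only LeaderboardJudgeProtocol; the
# constant score stands in for real judging logic.
class ConstantScoreJudge:
    def judge(self, rag_responses, rag_topics, llm_config, **kwargs):
        spec = LeaderboardSpec(measures=(MeasureSpec("RELEVANCE"),))
        builder = LeaderboardBuilder(spec)
        for report in rag_responses:
            builder.add(
                run_id=report.metadata.run_id,
                topic_id=report.metadata.topic_id,
                values={"RELEVANCE": 1.0},  # placeholder score
            )
        return builder.build()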

Input Data Models

  • Report - A RAG system's response for a topic (see Loading Reports)
  • Request - An evaluation topic/query (see Loading Requests)
  • Document - A background-corpus document (see Loading Documents)

Output Containers

  • Leaderboard - Per-run scores across topics (see Writing Leaderboards)
  • NuggetBanks - Evaluation nuggets per topic (see Writing NuggetBanks)
  • Qrels - Relevance judgments (see Writing Qrels)

Configuration

  • LlmConfigProtocol, LlmConfigBase - LLM configuration
  • load_llm_config() - Load config from env/yaml/cli

See the auto-judge-starterkit README for LLM configuration examples.
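
A minimal sketch (the zero-argument call and the import path are assumptions based on the bullet above):

from autojudge_base import load_llm_config  # import path assumed

# Assumption: with no arguments, settings are resolved from
# environment variables, a YAML file, or CLI flags.
llm_config = load_llm_config()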

CLI

The trec-auto-judge CLI (provided by the trec_auto_judge package) uses autojudge-base:

# Run a judge workflow
trec-auto-judge run --workflow workflow.yml --rag-responses responses.jsonl

# Export corpus
trec-auto-judge export-corpus --output corpus.tar.gz

# List available models
trec-auto-judge list-models

See the Workflow Guide for details.

Data Loading Utilities

Loading Reports (RAG System Outputs)

A Report contains sentences with text and citations. Three sentence formats are supported:

Format                  Citations Field     Description
NeuclirReportSentence   List[str]           Doc IDs ordered by priority
RagtimeReportSentence   Dict[str, float]    Doc ID → confidence score (0-100)
Rag24ReportSentence     List[int]           Indices into report.references
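
get_sentences_with_citations() (shown below) normalizes all three shapes into one priority-ordered list of doc IDs. For intuition only, a hand-rolled equivalent of that normalization (our sketch; the descending-confidence ordering for Ragtime scores is an assumption):

# Sketch only: the library's unified accessor does this for you.
def normalize_citations(citations, references=None):
    if isinstance(citations, dict):                  # Ragtime: doc_id -> score
        return sorted(citations, key=citations.get, reverse=True)
    if citations and isinstance(citations[0], int):  # RAG'24: indices
        return [references[i] for i in citations]
    return list(citations or [])                     # NeuCLIR: already ordered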

Getting text and citations from sentences:

from pathlib import Path

from autojudge_base.report import Report, load_report

reports: list[Report] = load_report(Path("responses.jsonl"))

for report in reports:
    # Get sentences with citations in unified format (does not modify report)
    for sentence in report.get_sentences_with_citations():
        text = sentence.text                    # The sentence text
        citations = sentence.citations or []    # Doc IDs ordered by priority

        # Get the cited document content
        for doc_id in citations:
            if report.documents and doc_id in report.documents:
                doc = report.documents[doc_id]
                print(f"Citation: {doc.title} - {doc.text[:100]}...")

Convenience methods:

# Text only (no citations)
texts: list[str] = report.get_sentences()

# Text with citations (unified format, non-mutating)
sentences: list[NeuclirReportSentence] = report.get_sentences_with_citations()

# Full response as a single string
text: str = report.get_report_text()

# Full text of cited documents (keyed by doc_id)
documents: dict[str, Document] = report.documents

Report metadata:

for report in reports:
    print(report.metadata.run_id)      # Which system produced this
    print(report.metadata.topic_id)    # Which topic/query this answers

Note that the report formats of the different TREC tasks differ slightly. This module automatically loads any of them and exposes them through this single interface. Task-specific fields, such as narrative_id vs. request_id, are also available.
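
For example, a defensive way to probe for such fields (hypothetical; the attribute names and their location on metadata are assumptions, not documented above):

# Hypothetical: getattr keeps this safe if a task's format lacks the field.
narrative_id = getattr(report.metadata, "narrative_id", None)
request_id = getattr(report.metadata, "request_id", None)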

Loading Requests (Topics/Queries)

A Request represents an evaluation topic with the query and context:

from autojudge_base.request import Request, load_requests_from_file, load_requests_from_irds

# Load from JSONL file
requests: list[Request] = load_requests_from_file(Path("topics.jsonl"))

# Load from ir_datasets
requests: list[Request] = load_requests_from_irds("trec-rag-2025")

# Access request fields
for req in requests:
    print(req.request_id)         # Unique topic identifier
    print(req.title)              # The query/question (required)
    print(req.problem_statement)  # Detailed description of the information need 
    print(req.background)         # User background/context for personalization

TREC RAG narratives were converted to this format; the narrative text is exposed in the problem_statement field.

Loading Documents (Background Corpus)

Use this when you need to fetch additional documents from a background corpus beyond what's included in reports.

from autojudge_base.document import Document, load_documents, RetrievedDocuments, load_retrieved_docs

# Load corpus documents from JSONL
docs: list[Document] = load_documents(Path("corpus.jsonl"))

# Access document content
for doc in docs:
    print(doc.id)          # Document identifier
    print(doc.title)       # Document title (optional)
    print(doc.text)        # Document content
    print(doc.get_text())  # Title + text combined

# Load pre-retrieved documents with rankings
retrieved: list[RetrievedDocuments] = load_retrieved_docs(Path("retrieved.jsonl"))
for result in retrieved:
    print(result.query_id)
    for ranked_doc in result.ranked_docs:
        print(f"  Rank {ranked_doc.rank}: {ranked_doc.doc.id} (score: {ranked_doc.score})")

Writing Leaderboards

Use LeaderboardBuilder with a LeaderboardSpec to create type-safe leaderboards:

from autojudge_base import LeaderboardBuilder, LeaderboardSpec, MeasureSpec

# Define your measures
spec = LeaderboardSpec(measures=(
    MeasureSpec("RELEVANCE"),           # Default: mean aggregation, float values
    MeasureSpec("FLUENCY"),
    MeasureSpec("CITATION_QUALITY", aggregate=sum),  # Custom aggregation
))

# Build the leaderboard
builder = LeaderboardBuilder(spec)

for report in reports:
    builder.add(
        run_id=report.metadata.run_id,
        topic_id=report.metadata.topic_id,
        values={
            "RELEVANCE": 0.85,
            "FLUENCY": 0.92,
            "CITATION_QUALITY": 3,
        }
    )

# Finalize with expected topics (handles missing data)
leaderboard = builder.build(
    expected_topic_ids=["topic1", "topic2", "topic3"],
    on_missing="fix_aggregate"  # or "error", "warn", "ignore"
)

In your judge:

def judge(self, rag_responses, rag_topics, llm_config, **kwargs) -> Leaderboard:
    # ... build leaderboard as above ...
    return leaderboard

Manual file I/O (FYI): Judge implementations return objects; the framework handles persistence.

# Write (formats: "ir_measures", "tot", "jsonl")
leaderboard.write(Path("output.eval"), format="ir_measures")

# Load
leaderboard = Leaderboard.load(Path("output.eval"), format="ir_measures")

Writing NuggetBanks

NuggetBanks store evaluation nuggets (questions/claims) per topic. Two formats are supported:

Format                  Configure in workflow.yml                                               Description
NuggetBanks             nugget_banks_type: "autojudge_base.nugget_data.NuggetBanks"             AutoArgue format with questions and claims
NuggetizerNuggetBanks   nugget_banks_type: "autojudge_base.nugget_data.NuggetizerNuggetBanks"   Nuggetizer format

from autojudge_base.nugget_data import (
    NuggetBanks,
    NuggetBank,
    NuggetQuestion,
    load_nugget_banks_from_file,
    write_nugget_banks,
)

# Load existing nuggets
nuggets: NuggetBanks = load_nugget_banks_from_file(Path("nuggets.jsonl"))

# Access by topic
bank = nuggets.banks["topic-1"]
for question in bank.nuggets_as_list():
    print(question.question)
    print(question.gold_answers)

# Create new nuggets
bank = NuggetBank(query_id="topic-1")
bank.add_nuggets([
    NuggetQuestion.from_lazy(
        query_id="topic-1",
        question="What is the capital of France?",
        gold_answers=["Paris"],
        references=["doc123"],
        creator="my-judge",
    )
])

nuggets = NuggetBanks.from_banks_list([bank])

In your judge:

nugget_banks_type = NuggetBanks  # Required class attribute

def create_nuggets(self, rag_responses, rag_topics, llm_config, **kwargs) -> NuggetBanks:
    # ... create nugget banks as above ...
    return nuggets

Manual file I/O (FYI): Judge implementations return objects; the framework handles persistence.

# Write
write_nugget_banks(nuggets, Path("nuggets.jsonl"))

# Load
nuggets = load_nugget_banks_from_file(Path("nuggets.jsonl"))

Writing Qrels

Qrels store relevance judgments as (topic_id, doc_id, grade) tuples.

from dataclasses import dataclass

from autojudge_base.qrels import Qrels, QrelRow, QrelsSpec, build_qrels, write_qrel_file, read_qrel_file

# Option 1: Build directly from QrelRow objects
rows = [
    QrelRow(topic_id="topic1", doc_id="doc123", grade=2),
    QrelRow(topic_id="topic1", doc_id="doc456", grade=1),
    QrelRow(topic_id="topic2", doc_id="doc789", grade=3),
]
qrels = Qrels(rows=rows)

# Option 2: Build from arbitrary records using QrelsSpec
@dataclass
class MyJudgment:
    query: str
    document: str
    relevance: int

judgments = [
    MyJudgment("topic1", "doc123", 2),
    MyJudgment("topic1", "doc456", 1),
]

spec = QrelsSpec(
    topic_id=lambda j: j.query,
    doc_id=lambda j: j.document,
    grade=lambda j: j.relevance,
    on_duplicate="keep_max",  # or "error", "keep_last"
)
qrels = build_qrels(records=judgments, spec=spec)

In your judge:

def create_qrels(self, rag_responses, rag_topics, llm_config, **kwargs) -> Qrels:
    # ... build qrels as above ...
    return qrels

Manual file I/O (FYI): Judge implementations return objects; the framework handles persistence.

# Write to TREC format (topic_id  0  doc_id  grade)
write_qrel_file(qrel_out_file=Path("output.qrels"), qrels=qrels)

# Load from TREC format
qrels = read_qrel_file(Path("input.qrels"))

License

MIT
