# AutoJudge Base
Core infrastructure for implementing TREC AutoJudge systems.
## Installation

```bash
pip install autojudge-base
```
## Quick Start

```python
from autojudge_base import (
    AutoJudge,
    Report,
    Request,
    Leaderboard,
    LeaderboardBuilder,
    LlmConfigBase,
)

# Define your judge
class MyJudge:
    nugget_banks_type = None  # Or your NuggetBanks class

    def judge(self, rag_responses, rag_topics, llm_config, **kwargs):
        # Your judging logic
        builder = LeaderboardBuilder(...)
        return builder.build()

    def create_nuggets(self, rag_responses, rag_topics, llm_config, **kwargs):
        return None  # Or create nuggets

    def create_qrels(self, rag_responses, rag_topics, llm_config, **kwargs):
        return None  # Or create qrels
```
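To see the judge in action outside the CLI, here is a minimal wiring sketch using the loaders described under Data Loading Utilities below. It assumes `load_llm_config()` is exported from the package root and resolves a configuration from env/yaml/cli defaults when called without arguments; in a normal deployment the `auto-judge run` CLI performs this wiring for you.

```python
from pathlib import Path

from autojudge_base import load_llm_config
from autojudge_base.report import load_report
from autojudge_base.request import load_requests_from_file

reports = load_report(Path("responses.jsonl"))          # RAG system outputs
topics = load_requests_from_file(Path("topics.jsonl"))  # Evaluation topics
llm_config = load_llm_config()  # Assumption: no-arg call picks up env defaults

leaderboard = MyJudge().judge(
    rag_responses=reports, rag_topics=topics, llm_config=llm_config
)
```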
## Components

### Protocols

- `AutoJudge` - Combined protocol for all three phases (see Quick Start). Examples: `NaiveJudge`, `RetrievalJudge` in the starterkit.
- `LeaderboardJudgeProtocol` - Produces leaderboard scores (see Writing Leaderboards). Example: `TinyJudge` (minimal LLM judge).
- `QrelsCreatorProtocol` - Creates relevance judgments (see Writing Qrels).
- `NuggetCreatorProtocol` - Creates nugget banks (see Writing NuggetBanks).

For modular composition (separate classes per protocol), see `CompleteExampleJudge` in the starterkit; a sketch follows below.
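For illustration, a minimal class that covers only the leaderboard phase. The method name and signature are assumed to mirror the combined `AutoJudge` protocol from the Quick Start; the constant score is a placeholder for real LLM-based scoring.

```python
from autojudge_base import Leaderboard, LeaderboardBuilder, LeaderboardSpec, MeasureSpec

class ScoreOnlyJudge:
    """Covers only the leaderboard phase (LeaderboardJudgeProtocol)."""

    def judge(self, rag_responses, rag_topics, llm_config, **kwargs) -> Leaderboard:
        builder = LeaderboardBuilder(LeaderboardSpec(measures=(MeasureSpec("RELEVANCE"),)))
        for report in rag_responses:
            builder.add(
                run_id=report.metadata.run_id,
                topic_id=report.metadata.topic_id,
                values={"RELEVANCE": 1.0},  # Placeholder score
            )
        return builder.build()
```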
### Input Data Models

- `Report` - RAG system output (see Loading Reports)
- `Request` - Evaluation topic/query (see Loading Requests)
- `Document` - Document with content (see Loading Documents)
### Output Containers

- `Leaderboard` - System rankings (see Writing Leaderboards)
- `Qrels` - Relevance judgments (see Writing Qrels)
- `NuggetBanks`, `NuggetizerNuggetBanks` - Nugget collections (see Writing NuggetBanks)
### Configuration

- `LlmConfigProtocol`, `LlmConfigBase` - LLM configuration
- `load_llm_config()` - Load config from env/yaml/cli

See the auto-judge-starterkit README for LLM configuration examples.
## CLI

```bash
# Run a judge workflow
auto-judge run --workflow workflow.yml --rag-responses responses.jsonl

# Export corpus
auto-judge export-corpus --output corpus.tar.gz

# List available models
auto-judge list-models
```
## Data Loading Utilities

### Loading Reports (RAG System Outputs)

A `Report` contains sentences with text and citations. Three sentence formats are supported:
| Format | Citations Field | Description |
|---|---|---|
| `NeuclirReportSentence` | `List[str]` | Doc IDs ordered by priority |
| `RagtimeReportSentence` | `Dict[str, float]` | Doc ID → confidence score (0-100) |
| `Rag24ReportSentence` | `List[int]` | Indices into `report.references` |
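To make the table concrete, here is a sketch of how the three citation shapes reduce to a single priority-ordered list of doc IDs. This mirrors what `get_sentences_with_citations()` (below) does for you; the helper itself is hypothetical.

```python
from typing import Dict, List, Union

def citation_doc_ids(
    citations: Union[List[str], Dict[str, float], List[int]],
    references: List[str],
) -> List[str]:
    """Hypothetical normalization of the three citation shapes to doc IDs."""
    if isinstance(citations, dict):
        # Ragtime: doc ID -> confidence (0-100); highest confidence first
        return sorted(citations, key=citations.get, reverse=True)
    if citations and isinstance(citations[0], int):
        # RAG24: indices into report.references
        return [references[i] for i in citations]
    # NeuCLIR: already doc IDs ordered by priority
    return list(citations)
```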
Getting text and citations from sentences:

```python
from pathlib import Path

from autojudge_base.report import Report, load_report

reports: list[Report] = load_report(Path("responses.jsonl"))

for report in reports:
    # Get sentences with citations in unified format (does not modify report)
    for sentence in report.get_sentences_with_citations():
        text = sentence.text                  # The sentence text
        citations = sentence.citations or []  # Doc IDs ordered by priority

        # Get the cited document content
        for doc_id in citations:
            if report.documents and doc_id in report.documents:
                doc = report.documents[doc_id]
                print(f"Citation: {doc.title} - {doc.text[:100]}...")
```
Convenience methods:

```python
# Text only (no citations)
texts: list[str] = report.get_sentences()

# Text with citations (unified format, non-mutating)
sentences: list[NeuclirReportSentence] = report.get_sentences_with_citations()

# Full response as single string
text: str = report.get_report_text()

# Full text of cited documents (keyed by doc_id)
documents: dict[str, Document] = report.documents
```
Report metadata:

```python
for report in reports:
    print(report.metadata.run_id)    # Which system produced this
    print(report.metadata.topic_id)  # Which topic/query this answers
```

Note that the report formats of the various TREC tasks differ slightly. This module loads any of them automatically and exposes them through this one unified interface. Task-specific fields, such as `narrative_id` vs `request_id`, remain available.
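As a hedged illustration of those task-specific fields (where they live is an assumption; inspect your loaded reports for what is actually present):

```python
for report in reports:
    print(report.metadata.topic_id)  # Unified field, always present
    # Assumption: task-specific fields sit on the metadata object and may
    # be absent for other tasks, hence the getattr fallback.
    print(getattr(report.metadata, "narrative_id", None))
```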
### Loading Requests (Topics/Queries)

A `Request` represents an evaluation topic with the query and context:

```python
from pathlib import Path

from autojudge_base.request import Request, load_requests_from_file, load_requests_from_irds

# Load from JSONL file
requests: list[Request] = load_requests_from_file(Path("topics.jsonl"))

# Load from ir_datasets
requests: list[Request] = load_requests_from_irds("trec-rag-2025")

# Access request fields
for req in requests:
    print(req.request_id)         # Unique topic identifier
    print(req.title)              # The query/question (required)
    print(req.problem_statement)  # Detailed description of the information need
    print(req.background)         # User background/context for personalization
```

RAG narratives were converted to this format, exposing narratives in the `problem_statement` field.
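A typical use of these fields is assembling context for an LLM judge. A sketch, assuming `problem_statement` and `background` may be empty; the prompt wording is purely illustrative:

```python
from autojudge_base.request import Request

def request_to_prompt(req: Request) -> str:
    """Format an evaluation topic as judging context (illustrative wording)."""
    parts = [f"Question: {req.title}"]
    if req.problem_statement:
        parts.append(f"Information need: {req.problem_statement}")
    if req.background:
        parts.append(f"User background: {req.background}")
    return "\n".join(parts)

prompt = request_to_prompt(requests[0])
```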
### Loading Documents (Background Corpus)

Use this when you need to fetch additional documents from a background corpus beyond what's included in reports.

```python
from pathlib import Path

from autojudge_base.document import Document, load_documents, RetrievedDocuments, load_retrieved_docs

# Load corpus documents from JSONL
docs: list[Document] = load_documents(Path("corpus.jsonl"))

# Access document content
for doc in docs:
    print(doc.id)          # Document identifier
    print(doc.title)       # Document title (optional)
    print(doc.text)        # Document content
    print(doc.get_text())  # Title + text combined

# Load pre-retrieved documents with rankings
retrieved: list[RetrievedDocuments] = load_retrieved_docs(Path("retrieved.jsonl"))
for result in retrieved:
    print(result.query_id)
    for ranked_doc in result.ranked_docs:
        print(f"  Rank {ranked_doc.rank}: {ranked_doc.doc.id} (score: {ranked_doc.score})")
```
## Writing Leaderboards

Use `LeaderboardBuilder` with a `LeaderboardSpec` to create type-safe leaderboards:

```python
from autojudge_base import LeaderboardBuilder, LeaderboardSpec, MeasureSpec

# Define your measures
spec = LeaderboardSpec(measures=(
    MeasureSpec("RELEVANCE"),  # Default: mean aggregation, float values
    MeasureSpec("FLUENCY"),
    MeasureSpec("CITATION_QUALITY", aggregate=sum),  # Custom aggregation
))

# Build the leaderboard
builder = LeaderboardBuilder(spec)
for report in reports:
    builder.add(
        run_id=report.metadata.run_id,
        topic_id=report.metadata.topic_id,
        values={
            "RELEVANCE": 0.85,
            "FLUENCY": 0.92,
            "CITATION_QUALITY": 3,
        },
    )

# Finalize with expected topics (handles missing data)
leaderboard = builder.build(
    expected_topic_ids=["topic1", "topic2", "topic3"],
    on_missing="fix_aggregate",  # or "error", "warn", "ignore"
)
```
In your judge:

```python
def judge(self, rag_responses, rag_topics, llm_config, **kwargs) -> Leaderboard:
    # ... build leaderboard as above ...
    return leaderboard
```
Manual file I/O (FYI): judge implementations return objects; the framework handles persistence.

```python
# Write (formats: "trec_eval", "tot", "ir_measures", "jsonl")
leaderboard.write(Path("output.eval"), format="trec_eval")

# Load
leaderboard = Leaderboard.load(Path("output.eval"), format="trec_eval")
```
## Writing NuggetBanks

NuggetBanks store evaluation nuggets (questions/claims) per topic. Two formats are supported:

| Format | Configure in workflow.yml | Description |
|---|---|---|
| `NuggetBanks` | `nugget_banks_type: "autojudge_base.nugget_data.NuggetBanks"` | AutoArgue format with questions and claims |
| `NuggetizerNuggetBanks` | `nugget_banks_type: "autojudge_base.nugget_data.NuggetizerNuggetBanks"` | Nuggetizer format |
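In workflow.yml this is a single key; an excerpt (other workflow settings omitted):

```yaml
# Select the nugget-bank format your judge produces
nugget_banks_type: "autojudge_base.nugget_data.NuggetBanks"
```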
```python
from pathlib import Path

from autojudge_base.nugget_data import (
    NuggetBanks,
    NuggetBank,
    NuggetQuestion,
    load_nugget_banks_from_file,
    write_nugget_banks,
)

# Load existing nuggets
nuggets: NuggetBanks = load_nugget_banks_from_file(Path("nuggets.jsonl"))

# Access by topic
bank = nuggets.banks["topic-1"]
for question in bank.nuggets_as_list():
    print(question.question)
    print(question.gold_answers)

# Create new nuggets
bank = NuggetBank(query_id="topic-1")
bank.add_nuggets([
    NuggetQuestion.from_lazy(
        query_id="topic-1",
        question="What is the capital of France?",
        gold_answers=["Paris"],
        references=["doc123"],
        creator="my-judge",
    )
])
nuggets = NuggetBanks.from_banks_list([bank])
```
In your judge:

```python
nugget_banks_type = NuggetBanks  # Required class attribute

def create_nuggets(self, rag_responses, rag_topics, llm_config, **kwargs) -> NuggetBanks:
    # ... create nugget banks as above ...
    return nuggets
```
Manual file I/O (FYI): judge implementations return objects; the framework handles persistence.

```python
# Write
write_nugget_banks(nuggets, Path("nuggets.jsonl"))

# Load
nuggets = load_nugget_banks_from_file(Path("nuggets.jsonl"))
```
## Writing Qrels

Qrels store relevance judgments as (topic_id, doc_id, grade) tuples.

```python
from dataclasses import dataclass
from pathlib import Path

from autojudge_base.qrels import Qrels, QrelRow, QrelsSpec, build_qrels, read_qrel_file, write_qrel_file

# Option 1: Build directly from QrelRow objects
rows = [
    QrelRow(topic_id="topic1", doc_id="doc123", grade=2),
    QrelRow(topic_id="topic1", doc_id="doc456", grade=1),
    QrelRow(topic_id="topic2", doc_id="doc789", grade=3),
]
qrels = Qrels(rows=rows)

# Option 2: Build from arbitrary records using QrelsSpec
@dataclass
class MyJudgment:
    query: str
    document: str
    relevance: int

judgments = [
    MyJudgment("topic1", "doc123", 2),
    MyJudgment("topic1", "doc456", 1),
]

spec = QrelsSpec(
    topic_id=lambda j: j.query,
    doc_id=lambda j: j.document,
    grade=lambda j: j.relevance,
    on_duplicate="keep_max",  # or "error", "keep_last"
)
qrels = build_qrels(records=judgments, spec=spec)
```
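To illustrate the `on_duplicate` option: when the same (topic, doc) pair is judged more than once, `keep_max` is assumed to retain the highest grade. A sketch of that expectation, not verified against the implementation:

```python
dup_judgments = [
    MyJudgment("topic1", "doc123", 1),
    MyJudgment("topic1", "doc123", 2),  # Same pair, higher grade
]
dup_qrels = build_qrels(records=dup_judgments, spec=spec)
# Under on_duplicate="keep_max", the expectation is a single surviving row
# for ("topic1", "doc123") with grade 2.
```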
In your judge:

```python
def create_qrels(self, rag_responses, rag_topics, llm_config, **kwargs) -> Qrels:
    # ... build qrels as above ...
    return qrels
```
Manual file I/O (FYI): judge implementations return objects; the framework handles persistence.

```python
# Write to TREC format (topic_id 0 doc_id grade)
write_qrel_file(qrel_out_file=Path("output.qrels"), qrels=qrels)

# Load from TREC format
qrels = read_qrel_file(Path("input.qrels"))
```
## License

MIT