Forge better rankings from candidate documents with LLM reranking.
ranksmith
Forge better rankings from candidate documents.
ranksmith is a small Python package for LLM-based reranking. Version 1 focuses
on Azure OpenAI powered zero-shot listwise reranking for candidate documents.
Install
pip install ranksmith
Quick Start
from ranksmith import AzureOpenAIReranker, Document

reranker = AzureOpenAIReranker(
    api_key="...",
    azure_endpoint="https://example.openai.azure.com",
    azure_deployment="gpt-4o-mini",
)
results = reranker.rerank(
    query="What is listwise reranking?",
    documents=[
        Document(id="a", text="Listwise reranking compares candidates together."),
        Document(id="b", text="Vector search retrieves candidate documents."),
    ],
    top_k=2,
)
for result in results:
    print(result.rank, result.original_index, result.document.id)
rank is 1-based for display. original_index is 0-based so it maps back to
the input list.
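A minimal sketch of why the two index conventions matter, assuming you keep side data (here, hypothetical first-stage scores) aligned with the input list. `StubResult` is a stand-in defined only for this example and mirrors just the `rank` and `original_index` fields; it is not the package's result class.

```python
from dataclasses import dataclass


@dataclass
class StubResult:
    """Stand-in for a rerank result (illustration only)."""

    rank: int            # 1-based position in the reranked output
    original_index: int  # 0-based position in the input list


# Hypothetical first-stage scores, aligned with the input documents.
first_stage_scores = [0.12, 0.87, 0.45]

# Pretend the reranker ordered the inputs as: index 1, index 2, index 0.
results = [
    StubResult(rank=1, original_index=1),
    StubResult(rank=2, original_index=2),
    StubResult(rank=3, original_index=0),
]

# original_index indexes straight into anything aligned with the input list.
reranked_scores = [first_stage_scores[r.original_index] for r in results]

# rank is meant for display, so it needs no +1 adjustment.
labels = [f"#{r.rank} (input position {r.original_index})" for r in results]
```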
Supported Strategies & Algorithms
ranksmith separates the evaluation methodology (Strategy) from its specific execution logic (Algorithm). Version 1 supports listwise reranking (RankGPT), pairwise PRP reranking, and tournament-style TourRank-r reranking.
1. ListwiseStrategy (RankGPT)
This strategy places multiple documents into a single prompt and asks the LLM to rank them all at once.
rankgpt_sliding_window Algorithm (Default)
- Implements the RankGPT-style back-to-first sliding window with bubble-up behavior.
- Useful when you want RankGPT's window traversal semantics while keeping ranksmith's strict JSON output validation.
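The back-to-first traversal can be sketched as follows. This is an illustration of the general sliding-window idea, not ranksmith's implementation: `sliding_window_rerank` and `rank_by_value` are hypothetical names, and the LLM call is replaced by a deterministic stand-in that sorts integers.

```python
from typing import Callable, List, Sequence


def sliding_window_rerank(
    items: Sequence[int],
    rank_window: Callable[[List[int]], List[int]],
    window_size: int = 20,
    stride: int = 10,
) -> List[int]:
    """Rerank back-to-first: start at the tail and slide toward the head."""
    if window_size < 1 or stride < 1:
        raise ValueError("window_size and stride must be positive")
    order = list(items)
    if not order:
        return order
    end = len(order)
    start = max(end - window_size, 0)
    while True:
        # The ranker returns the window's items in best-first order.
        order[start:end] = rank_window(order[start:end])
        if start == 0:
            break
        # Move toward the head; the best items of this window stay inside
        # the next window, which is what lets them bubble up.
        end -= stride
        start = max(end - window_size, 0)
    return order


calls: List[List[int]] = []


def rank_by_value(window: List[int]) -> List[int]:
    """Hypothetical stand-in for the LLM call: larger value = more relevant."""
    calls.append(window)
    return sorted(window, reverse=True)


reranked = sliding_window_rerank(
    [3, 1, 4, 1, 5, 9, 2, 6], rank_by_value, window_size=4, stride=2
)
```

In this toy run the two strongest items start near the tail and reach the head after three window calls. Items outside the final window are only partially ordered, which is the usual trade-off of a single back-to-first pass.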
2. PairwiseStrategy (PRP)
This strategy compares two documents at a time using Pairwise Ranking Prompting.
prp_sliding_k Algorithm
- Starts from the bottom of the current ranking and compares adjacent pairs.
- Calls the provider twice per pair, swapping A/B order to reduce position bias.
- Conflicting valid comparisons are treated as ties and keep the current order.
- Default passes=10, matching the PRP-Sliding-10 setting from the reference paper.
- Expected provider calls per query: 2 * passes * max(document_count - 1, 0).
- AsyncPairwiseStrategy can run each pair's A/B and B/A calls concurrently with pair_order_parallelism=2 without changing PRP traversal or call count.
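The traversal and tie rule in the list above can be sketched like this. It is a simplified illustration under stated assumptions, not the package's code: `prp_sliding_k` here is a hypothetical free function, and `pick` stands in for a provider call that answers "A" or "B" for the two documents in the order shown.

```python
from typing import Callable, List, Sequence, Tuple


def prp_sliding_k(
    order: Sequence[int],
    pick: Callable[[int, int], str],
    passes: int = 10,
) -> Tuple[List[int], int]:
    """Bottom-to-top adjacent comparisons with order-swapped double checks."""
    ranking = list(order)
    calls = 0
    for _ in range(passes):
        for i in range(len(ranking) - 1, 0, -1):
            upper, lower = ranking[i - 1], ranking[i]
            first = pick(upper, lower)   # shown as A=upper, B=lower
            second = pick(lower, upper)  # shown as A=lower, B=upper
            calls += 2
            # Swap only when both orderings agree that `lower` is better.
            # Disagreement counts as a tie and keeps the current order.
            if first == "B" and second == "A":
                ranking[i - 1], ranking[i] = lower, upper
    return ranking, calls


def pick_larger(a: int, b: int) -> str:
    """Hypothetical unbiased judge: the larger value is more relevant."""
    return "A" if a >= b else "B"


def always_first(a: int, b: int) -> str:
    """Hypothetical fully position-biased judge: always prefers slot A."""
    return "A"


sorted_ranking, sorted_calls = prp_sliding_k([1, 2, 3, 4], pick_larger, passes=3)
biased_ranking, biased_calls = prp_sliding_k([1, 2, 3, 4], always_first, passes=3)
```

With the position-biased judge every pair produces conflicting answers, so the order never changes. That is the behavior the swapped second call is meant to guarantee. Both runs use 2 * 3 * (4 - 1) = 18 calls.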
3. TourRankStrategy (TourRank-r)
This strategy treats candidate documents as tournament participants. In each
stage, the provider selects the top-m documents from each group; selected
documents advance and earn points. The final ranking is sorted by accumulated
points.
tourrank_r Algorithm
- Default rounds=2 for a practical cost/performance trade-off.
- Prefer rounds=10 for quality-focused evaluation, paper-style reproduction, or final offline reranking when the extra LLM calls are acceptable.
- Default stages assume exactly 100 candidate documents: 100 -> 50 -> 20 -> 10 -> 5 -> 2.
- For other candidate counts, pass explicit stage_configs; ranksmith fails fast instead of silently deriving or trimming stages.
- TourRankStrategy defaults to group_parallelism=1 for serial sync calls. Increase it to run groups in the same stage concurrently. If one parallel group fails, already-started group calls may still finish.
- AsyncTourRankStrategy runs groups concurrently by default. Set group_parallelism to cap concurrent provider calls.
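The points-based tournament can be sketched on a toy scale. Everything below is an assumption made for illustration: the stage plan `(group_count, top_m)`, the round-robin grouping, one point per advancement, and the `select_top` stand-in for the provider call. ranksmith's actual stage configuration format and scoring may differ.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Stage = Tuple[int, int]  # (group_count, top_m) - hypothetical stage format


def tournament_round(
    doc_ids: Sequence[int],
    stages: Sequence[Stage],
    select_top: Callable[[List[int], int], List[int]],
) -> Dict[int, int]:
    """Run one tournament; each advancement earns a document one point."""
    points = {doc_id: 0 for doc_id in doc_ids}
    survivors = list(doc_ids)
    for group_count, top_m in stages:
        # Assumed grouping: deal survivors round-robin into the groups.
        groups = [survivors[i::group_count] for i in range(group_count)]
        survivors = []
        for group in groups:
            for winner in select_top(group, top_m):
                points[winner] += 1
                survivors.append(winner)
    return points


def tourrank(doc_ids, stages, select_top, rounds=2):
    """Sum points over several rounds and sort by accumulated points."""
    total = {doc_id: 0 for doc_id in doc_ids}
    for _ in range(rounds):
        for doc_id, earned in tournament_round(doc_ids, stages, select_top).items():
            total[doc_id] += earned
    # sorted() is stable, so ties keep their input order.
    return sorted(doc_ids, key=lambda doc_id: -total[doc_id])


selection_calls: List[List[int]] = []


def select_largest(group: List[int], top_m: int) -> List[int]:
    """Hypothetical stand-in for the provider: larger value = more relevant."""
    selection_calls.append(group)
    return sorted(group, reverse=True)[:top_m]


# Toy plan for 8 documents: 8 -> 4 -> 2 -> 1.
toy_stages = [(2, 2), (1, 2), (1, 1)]
ranking = tourrank([5, 3, 8, 1, 9, 2, 7, 4], toy_stages, select_largest, rounds=2)
```

With a deterministic selector every round is identical, so extra rounds add nothing here. With a real LLM the selections vary between rounds, which is why accumulating points over more rounds can change the final order.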
How to Apply a Strategy
You can configure and inject a custom strategy into the AzureOpenAIReranker.
from ranksmith import AzureOpenAIReranker, ListwiseStrategy, PairwiseStrategy

# 1. Configure the strategy and algorithm
strategy = ListwiseStrategy(
    algorithm="rankgpt_sliding_window",
    window_size=20,  # Number of documents evaluated at once
    stride=10,  # Step between consecutive windows (overlap = window_size - stride)
    max_document_chars=4000,  # Max characters allowed per document
)
# 2. Inject into the Reranker
reranker = AzureOpenAIReranker(
    api_key="...",
    azure_endpoint="https://example.openai.azure.com",
    azure_deployment="gpt-4o-mini",
    strategy=strategy,  # <-- Inject the strategy here
)
results = reranker.rerank("query", documents)
Pairwise PRP can be injected the same way:
strategy = PairwiseStrategy(
    algorithm="prp_sliding_k",
    passes=10,
    max_document_chars=4000,
)
reranker = AzureOpenAIReranker(
    api_key="...",
    azure_endpoint="https://example.openai.azure.com",
    azure_deployment="gpt-4o-mini",
    strategy=strategy,
)
TourRank-r can also be injected:
from ranksmith import AzureOpenAIReranker, TourRankStrategy

reranker = AzureOpenAIReranker(
    api_key="...",
    azure_endpoint="https://example.openai.azure.com",
    azure_deployment="gpt-4o-mini",
    strategy=TourRankStrategy(rounds=2, group_parallelism=1),
)
For quality-focused runs, explicitly switch to TourRank-10:
reranker = AzureOpenAIReranker(
    api_key="...",
    azure_endpoint="https://example.openai.azure.com",
    azure_deployment="gpt-4o-mini",
    strategy=TourRankStrategy(rounds=10),
)
Note: If strategy is not provided, it defaults to ListwiseStrategy(algorithm="rankgpt_sliding_window"). Pairwise PRP and TourRank-r use more LLM calls than listwise reranking, so check call estimates before live benchmarks.
Custom Strategies
Custom reranking methods should be implemented as new strategy classes instead
of adding new string values to ListwiseStrategy.algorithm. A strategy receives
the normalized Document objects, a provider, and optional top_k, then returns
RerankResult objects.
from collections.abc import Sequence

from ranksmith import (
    AzureOpenAIReranker,
    Document,
    RerankResult,
)


class LengthStrategy:
    def rerank(
        self,
        *,
        query: str,
        documents: Sequence[Document],
        provider: object,
        top_k: int | None = None,
    ) -> list[RerankResult]:
        # This strategy is deterministic and ignores the query and provider.
        del query, provider
        ordered_indexes = sorted(
            range(len(documents)),
            key=lambda index: len(documents[index].text),
            reverse=True,
        )
        results = [
            RerankResult(
                document=documents[original_index],
                rank=rank,
                original_index=original_index,
                metadata={"strategy": "length"},
            )
            for rank, original_index in enumerate(ordered_indexes, start=1)
        ]
        return results if top_k is None else results[:top_k]


reranker = AzureOpenAIReranker(
    api_key="...",
    azure_endpoint="https://example.openai.azure.com",
    azure_deployment="gpt-4o-mini",
    strategy=LengthStrategy(),
)
A custom strategy can also use the public provider protocols directly.
from collections.abc import Sequence

from ranksmith import (
    Document,
    LLMProvider,
    RerankResult,
    parse_ranking_response,
)


class ProviderBackedStrategy:
    def rerank(
        self,
        *,
        query: str,
        documents: Sequence[Document],
        provider: LLMProvider,
        top_k: int | None = None,
    ) -> list[RerankResult]:
        ranking = parse_ranking_response(
            provider.rank(query, list(documents)),
            expected_count=len(documents),
        )
        # The parsed ranking is 1-based; convert to 0-based input indexes.
        ordered_indexes = [number - 1 for number in ranking]
        results = [
            RerankResult(
                document=documents[original_index],
                rank=rank,
                original_index=original_index,
                metadata={"strategy": "provider-backed"},
            )
            for rank, original_index in enumerate(ordered_indexes, start=1)
        ]
        return results if top_k is None else results[:top_k]
Async strategies use the same contract with async def rerank(...) and can be
typed with AsyncRerankStrategy. If a custom strategy fails with an unexpected
exception, AzureOpenAIReranker wraps it as RerankStrategyError. Raise
RerankError subclasses directly when the error category matters.
See examples/custom_strategy.py for a runnable
offline example that covers deterministic strategies, provider-backed
strategies, strict ranking parsing, and provider error classification.
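As a rough sketch of the async contract described above, the following mirrors the keyword-only `rerank` signature with `async def`. To stay runnable without the package or credentials, it uses stand-in `StubDocument` and `StubResult` dataclasses (hypothetical, defined here only for illustration) rather than ranksmith's own classes.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Stand-in for a document (illustration only)."""

    id: str
    text: str


@dataclass
class StubResult:
    """Stand-in for a rerank result (illustration only)."""

    document: StubDocument
    rank: int
    original_index: int
    metadata: dict = field(default_factory=dict)


class AsyncLengthStrategy:
    async def rerank(self, *, query, documents, provider, top_k=None):
        del query, provider
        # Placeholder for awaiting a real provider call.
        await asyncio.sleep(0)
        order = sorted(
            range(len(documents)),
            key=lambda index: len(documents[index].text),
            reverse=True,
        )
        results = [
            StubResult(documents[index], rank, index, {"strategy": "length"})
            for rank, index in enumerate(order, start=1)
        ]
        return results if top_k is None else results[:top_k]


docs = [StubDocument("a", "short"), StubDocument("b", "a much longer text")]
top = asyncio.run(
    AsyncLengthStrategy().rerank(query="q", documents=docs, provider=None, top_k=1)
)
```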
For lower PRP wall time, use the async strategy. This preserves the PRP-Sliding-K method: adjacent pairs are still processed bottom-to-top, while only the two order-swapped calls for the same pair are concurrent.
from ranksmith import AsyncAzureOpenAIReranker, AsyncPairwiseStrategy

reranker = AsyncAzureOpenAIReranker(
    api_key="...",
    azure_endpoint="https://example.openai.azure.com",
    azure_deployment="gpt-4o-mini",
    strategy=AsyncPairwiseStrategy(
        passes=10,
        pair_order_parallelism=2,
    ),
)
Async Support
ranksmith provides first-class asynchronous support for high-throughput environments like FastAPI.
from ranksmith import AsyncAzureOpenAIReranker

reranker = AsyncAzureOpenAIReranker(
    api_key="...",
    azure_endpoint="https://example.openai.azure.com",
    azure_deployment="gpt-4o-mini",
)
results = await reranker.rerank("query", documents)
Examples
Ready-to-use example code for integrating the RankGPT algorithm into your production environment can be found in the examples/ directory.
- examples/rankgpt_sync.py: Synchronous RankGPT integration guide
- examples/rankgpt_async.py: High-performance asynchronous RankGPT integration guide
Benchmarking
ranksmith includes a qrels-backed comparison runner for reranking algorithms. It
can run against the committed smoke fixture or a local BEIR/SciFact cache. BEIR
mode requires a first-stage candidate TSV, because qrels alone are not a valid
reranking benchmark.
Expected BEIR/SciFact cache layout:
.benchmark-cache/scifact/
  corpus.jsonl
  queries.jsonl
  qrels/test.tsv
Candidate TSV rows must start with query_id and document_id:
query_id document_id rank
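A minimal sketch of reading such a file into per-query candidate lists, using only the standard library. `load_candidates` is a hypothetical helper, not part of ranksmith. It assumes tab-separated columns and skips a header row if one is present, since the source does not say whether the column names appear in the file itself.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List


def load_candidates(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group document ids by query id, preserving file order."""
    candidates: Dict[str, List[str]] = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0] == "query_id":
            continue  # skip blank lines and an optional header row
        if len(row) < 2:
            raise ValueError(f"expected query_id and document_id, got {row!r}")
        query_id, document_id = row[0], row[1]
        candidates[query_id].append(document_id)
    return dict(candidates)


sample = io.StringIO(
    "query_id\tdocument_id\trank\n"
    "q1\td3\t1\n"
    "q1\td7\t2\n"
    "q2\td1\t1\n"
)
by_query = load_candidates(sample)
```

For a file on disk, pass an open handle instead, for example `open(path, newline="", encoding="utf-8")`.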
Run a live Azure comparison and write a JSON artifact:
python scripts/compare_reranking.py \
--dataset beir-scifact \
--cache-dir .benchmark-cache/scifact \
--split test \
--candidates path/to/candidates.tsv \
--algorithm all \
--top-k 10 \
--window-size 20 \
--stride 10 \
--output benchmark-results/scifact.json \
--allow-live
The JSON report includes per-query metrics and macro-averaged NDCG@k, MRR@k,
and Recall@k. Raw benchmark artifacts are intentionally ignored by git; publish
only reviewed summaries. The committed smoke fixture currently verifies the
deterministic offline RankGPT path at NDCG@3, MRR@3, and Recall@3 = 1.000.
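For readers unfamiliar with the three metrics, here is a sketch of their common textbook definitions for a single query. It assumes linear gain for DCG and treats any positive qrels grade as relevant; the runner's exact variants are not confirmed by this README, so use it only to build intuition.

```python
import math
from typing import Dict, Sequence


def ndcg_at_k(ranked_ids: Sequence[str], qrels: Dict[str, int], k: int) -> float:
    """NDCG@k with linear gain: sum(rel / log2(position + 1)) over ideal DCG."""
    dcg = sum(
        qrels.get(doc_id, 0) / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(position + 2) for position, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def mrr_at_k(ranked_ids: Sequence[str], qrels: Dict[str, int], k: int) -> float:
    """Reciprocal rank of the first relevant document within the top k."""
    for position, doc_id in enumerate(ranked_ids[:k], start=1):
        if qrels.get(doc_id, 0) > 0:
            return 1.0 / position
    return 0.0


def recall_at_k(ranked_ids: Sequence[str], qrels: Dict[str, int], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


qrels = {"d1": 1, "d3": 1}
perfect = ["d1", "d3", "d2"]
imperfect = ["d2", "d1", "d3"]
```

Macro-averaging then means computing each metric per query and taking the plain mean across queries.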
Call accounting
compare_reranking.py estimates and prints the number of live LLM reranking
calls before execution. The count depends on the number of benchmark cases, the
selected algorithms, window_size, stride, passes, and candidate count per query:
- rankgpt_sliding_window: one LLM call per back-to-front RankGPT window.
- prp_sliding_k: 2 * passes * max(document_count - 1, 0) pairwise LLM calls per query.
- tourrank_r: tourrank_rounds * sum(stage.group_count) selection LLM calls per query. The runner uses the paper top-100 stages for exactly 100 candidates, and an explicit single-group halving stage plan for other candidate counts. With the paper top-100 stages, TourRank-2 uses 26 calls per query and TourRank-10 uses 130 calls per query.
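The per-query formulas above can be written out as a small estimator. The function names are hypothetical and this is not the runner's code. The window count assumes one call when the candidates fit in a single window and one extra call per stride step otherwise. The per-stage group counts are assumptions chosen to be consistent with the call totals quoted in this README (13 groups per round for the top-100 plan, 4 single-group stages for 20 candidates).

```python
import math
from typing import Sequence


def rankgpt_calls(document_count: int, window_size: int, stride: int) -> int:
    """One call per back-to-front window (assumed traversal)."""
    if document_count == 0:
        return 0
    if document_count <= window_size:
        return 1
    return math.ceil((document_count - window_size) / stride) + 1


def prp_calls(document_count: int, passes: int) -> int:
    """Two order-swapped calls per adjacent pair, per pass."""
    return 2 * passes * max(document_count - 1, 0)


def tourrank_calls(rounds: int, group_counts: Sequence[int]) -> int:
    """One selection call per group, per stage, per round."""
    return rounds * sum(group_counts)


# Assumed group counts per stage for 100 -> 50 -> 20 -> 10 -> 5 -> 2.
TOP_100_GROUPS = [5, 5, 1, 1, 1]
# Assumed single-group halving plan for 20 candidates.
TOP_20_GROUPS = [1, 1, 1, 1]

estimates = {
    "rankgpt@100": rankgpt_calls(100, window_size=20, stride=10),
    "rankgpt@20": rankgpt_calls(20, window_size=20, stride=10),
    "prp@20": prp_calls(20, passes=10),
    "prp@100": prp_calls(100, passes=10),
    "tourrank@100:r2": tourrank_calls(2, TOP_100_GROUPS),
    "tourrank@100:r10": tourrank_calls(10, TOP_100_GROUPS),
    "tourrank@20:r2": tourrank_calls(2, TOP_20_GROUPS),
}
```

Multiply each per-query estimate by the number of benchmark queries to get the total for a run.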
The runner does not create first-stage candidates, embeddings, or communities. If your candidate TSV is produced by an upstream retrieval or community-building pipeline, account for those calls separately. A typical full pipeline has two cost surfaces:
- Candidate generation: embedding calls for corpus/query vectors, plus any LLM calls used to create or summarize communities.
- Reranking: LLM calls made by ranksmith for the selected reranking algorithms.
Benchmark summaries should report both numbers when community retrieval is part
of the experiment, for example: embedding calls=<n>, community LLM calls=<n>,
and reranking LLM calls=<n>.
Result Model
result.document # Document
result.rank # 1-based rank
result.original_index # 0-based input index
result.metadata # strategy-specific metadata
Error Handling
ranksmith fails fast. It does not silently truncate long documents, repair
invalid rankings, or return unvalidated LLM output.
from ranksmith import (
    DocumentTooLongError,
    RerankParseError,
    RerankProviderError,
    RerankStrategyError,
)

try:
    results = reranker.rerank("query", documents)
except DocumentTooLongError:
    ...
except RerankParseError:
    ...
except RerankProviderError:
    ...
except RerankStrategyError:
    ...
MTEB Reranking Reference Evaluation
These results are intended as practical reference points, not a universal ranking. Results depend on dataset, model, candidate count, latency budget, and invalid output rate. This benchmark measures reranking over fixed native MTEB candidate sets, not first-stage retrieval.
uv run python scripts/evaluate_mteb_reranking.py \
--tasks AskUbuntuDupQuestions SciDocsRR StackOverflowDupQuestions \
--methods \
original \
rankgpt_sliding_window@20 \
prp_sliding_k@20 \
tourrank_r@20:r2 \
tourrank_r@20:r10 \
--output-dir benchmark-results/mteb-reranking/example \
--max-queries 50 \
--max-document-chars 4000 \
--shuffle-candidates --shuffle-seed 13 \
--rankgpt-window-size 20 --rankgpt-step 10 \
--prp-passes 10 \
--concurrency 4 \
--input-token-price-per-1m 2.50 \
--output-token-price-per-1m 10.00 \
--allow-live
TourRank methods use tourrank_r@N:rR, where N is the number of native MTEB
candidates to rerank and R is the number of tournament rounds. If :rR is
omitted, the runner normalizes to :r2. The recommended compact comparison is
tourrank_r@20:r2 versus tourrank_r@20:r10, which keeps the candidate scope
fixed and isolates the effect of TourRank rounds.
PRP and TourRank methods use AsyncAzureOpenAIReranker in this runner.
--concurrency parallelizes independent query-method executions; it does not
change each strategy's traversal or call count.
TourRank-r live smoke snapshot
The smoke comparison below is an actual live Azure run that verifies the
TourRank-r execution path. It uses only one AskUbuntuDupQuestions query, so
treat it as an integration and call accounting check, not as a quality
conclusion.
uv run python scripts/evaluate_mteb_reranking.py \
--tasks AskUbuntuDupQuestions \
--methods \
original \
rankgpt_sliding_window@20 \
prp_sliding_k@20 \
tourrank_r@20:r2 \
tourrank_r@20:r10 \
--output-dir benchmark-results/mteb-reranking/tourrank-smoke-20260520-121112 \
--max-queries 1 \
--max-document-chars 4000 \
--shuffle-candidates --shuffle-seed 13 \
--rankgpt-window-size 20 --rankgpt-step 10 \
--prp-passes 10 \
--concurrency 2 \
--input-token-price-per-1m 2.50 \
--output-token-price-per-1m 10.00 \
--allow-live
Scope:
- Task: AskUbuntuDupQuestions
- Split: test
- Queries: 1
- Candidate order: shuffled with seed 13
- Max document length: 4000 characters
- Validation: strict JSON validation, invalid outputs score 0
- Artifact: benchmark-results/mteb-reranking/tourrank-smoke-20260520-121112
| Method | NDCG@10 | MRR@10 | MAP | Recall@10 | p50 latency | Invalid rate | LLM calls/query | Queries |
|---|---|---|---|---|---|---|---|---|
| original | 0.7126 | 1.0000 | 0.7784 | 0.5000 | 0.0 ms | 0.000 | 0 | 1 |
| rankgpt_sliding_window@20 | 0.9337 | 1.0000 | 0.9248 | 0.7500 | 3645.6 ms | 0.000 | 1 | 1 |
| prp_sliding_k@20 | 0.7569 | 1.0000 | 0.8209 | 0.5833 | 198438.8 ms | 0.000 | 380 | 1 |
| tourrank_r@20:r2 | 0.8630 | 1.0000 | 0.8941 | 0.6667 | 9285.5 ms | 0.000 | 8 | 1 |
| tourrank_r@20:r10 | 0.8580 | 1.0000 | 0.8738 | 0.6667 | 41412.7 ms | 0.000 | 40 | 1 |
On this smoke run, both TourRank-r variants completed with valid strict JSON
outputs. tourrank_r@20:r10 used 5x the selection calls of
tourrank_r@20:r2, matching the configured round count.
Current MTEB snapshot
The committed reference snapshot below is from
benchmark-results/mteb-reranking/n30-ask-fixed.
Scope:
- Task: AskUbuntuDupQuestions
- Split: test
- Queries: 30
- Candidate order: shuffled with seed 13
- Max document length: 4000 characters
- Validation: strict JSON validation, invalid outputs score 0
- Measured methods: original, rankgpt_sliding_window@20
| Method | NDCG@10 | MRR@10 | MAP | Recall@10 | p50 latency | p95 latency | Invalid rate | Queries |
|---|---|---|---|---|---|---|---|---|
| original | 0.4431 | 0.5668 | 0.3895 | 0.5871 | 0.0 ms | 0.0 ms | 0.000 | 30 |
| rankgpt_sliding_window@20 | 0.6825 | 0.6753 | 0.6424 | 0.7870 | 1953.3 ms | 2893.9 ms | 0.000 | 30 |
On this small snapshot, rankgpt_sliding_window@20 improved NDCG@10 and
Recall@10 over the original candidate order. This is not a general claim about
all datasets; it is a smoke-sized reference result for this task and
configuration.
PRP vs RankGPT Snapshot
The PRP comparison run below uses the same AskUbuntuDupQuestions setup and is
saved under benchmark-results/mteb-reranking/n30-prp-vs-rankgpt-rerun.
This is a native MTEB candidate-set benchmark: this task exposes 20 candidates
per query, so it is not the standard top-100 RankGPT setting.
| Method | NDCG@10 | MRR@10 | MAP | Recall@10 | p50 latency | p95 latency | Invalid rate | LLM calls/query | Total LLM calls | Mean cost/query | Queries |
|---|---|---|---|---|---|---|---|---|---|---|---|
| original | 0.4431 | 0.5668 | 0.3895 | 0.5871 | 0.0 ms | 0.0 ms | 0.000 | 0 | 0 | - | 30 |
| rankgpt_sliding_window@20 | 0.6830 | 0.6834 | 0.6400 | 0.7706 | 1842.6 ms | 2542.6 ms | 0.033 | 1 | 30 | $0.001530 | 30 |
| prp_sliding_k@20 | 0.6714 | 0.7837 | 0.6132 | 0.7451 | 213583.6 ms | 230670.9 ms | 0.000 | 380 | 11,400 | $0.172772 | 30 |
RankGPT listwise led on NDCG@10, MAP, Recall@10, latency, and cost. PRP led on
MRR@10, but it required 380 pairwise LLM calls per query (2 * 10 * 19) with
passes=10 and 20 candidates. Strict validation is applied: the RankGPT row
includes one invalid LLM output (1 of 30 queries) scored as zero.
For the common top-100 RankGPT setup with window_size=20 and step=10,
rankgpt_sliding_window@100 would use 9 listwise LLM calls per query. The
matching prp_sliding_k@100 setting would use
2 * 10 * (100 - 1) = 1,980 pairwise LLM calls per query.