CoREB: Code Retrieval and Reranking Benchmark
CoREB is a graded-relevance benchmark for evaluating code retrieval and reranking models across three tasks:
| Task | Query | Target | Example |
|---|---|---|---|
| Text-to-Code (T2C) | Natural language description | Code solution | "Find the longest substring without repeating characters" → Python solution |
| Code-to-Code (C2C) | Code in language A | Equivalent code in language B | Python solution → Java translation |
| Code-to-Text (C2T) | Code snippet | Problem description | Python solution → problem statement |
Key Features
- Graded relevance: 3-level qrel scheme (rel=2: positive, rel=1: hard negative, rel=0: irrelevant) — hard negatives are same-problem distractors that penalize nDCG when retrieved above true positives
- 5 programming languages: Python, C++, Java, Go, Ruby
- Problem-disjoint train/test splits: v202602 (training) and v202603 (testing) cover non-overlapping contest windows
- Drop-in evaluation: compatible with standard IR evaluation (pytrec_eval) with relevance_level=2
Installation
pip install coreb
For HuggingFace model support:
pip install coreb[hf] # transformers backend
pip install coreb[gemini] # Google Gemini API
pip install coreb[all] # everything
Quick Start
Load the Dataset
from datasets import load_dataset
# Load v202603 release (latest)
code_corpus = load_dataset("hq-bench/coreb", "code_corpus", split="release_v2603")
text_corpus = load_dataset("hq-bench/coreb", "text_corpus", split="release_v2603")
# Load task-specific queries and qrels
t2c_queries = load_dataset("hq-bench/coreb", "text2code_queries", split="release_v2603")
t2c_qrels = load_dataset("hq-bench/coreb", "text2code_qrels", split="release_v2603")
print(f"Code corpus: {len(code_corpus)} documents")
print(f"T2C queries: {len(t2c_queries)} queries, {len(t2c_qrels)} qrels")
Run Evaluation
from coreb_runner.benchmark import (
load_jsonl,
convert_corpus_to_coir_format,
convert_queries_to_coir_format,
convert_qrels_to_coir_format,
EvaluateRetrieval,
DenseRetrievalExactSearch,
create_model_wrapper,
)
# Load data (from local JSONL files or convert from HF datasets)
corpus = convert_corpus_to_coir_format(load_jsonl("code_corpus.jsonl"))
queries = convert_queries_to_coir_format(load_jsonl("text2code_queries.jsonl"))
qrels = convert_qrels_to_coir_format(load_jsonl("text2code_qrels.jsonl"))
# Create model wrapper
model = create_model_wrapper("jinaai/jina-embeddings-v3", model_type="huggingface")
# Run retrieval + evaluation
retriever = DenseRetrievalExactSearch(model, batch_size=64)
evaluator = EvaluateRetrieval(retriever, k_values=[1, 3, 5, 10])
results = evaluator.retrieve(corpus, queries)
ndcg, _map, recall, precision = evaluator.evaluate(qrels, results, evaluator.k_values)
print(f"nDCG@10: {ndcg['NDCG@10']:.4f}")
print(f"Recall@10: {recall['Recall@10']:.4f}")
Evaluation with Graded Relevance
CoREB uses relevance_level=2 — only rel>=2 items count as relevant for binary metrics (Recall, MAP, Precision). Hard negatives (rel=1) penalize nDCG by occupying top ranks with zero gain but do not inflate Recall/MRR.
# The EvaluateRetrieval class handles this automatically:
# - rel=1 (hard negatives) are zeroed out for nDCG computation
# - relevance_level=2 is set for pytrec_eval binary metrics
print(f"Relevance threshold: {EvaluateRetrieval.RELEVANCE_LEVEL}") # 2
Dataset Structure
Available on HuggingFace: hq-bench/coreb
8 configs x 2 splits (release_v2602, release_v2603):
| Config | v2603 Rows | Description |
|---|---|---|
| code_corpus | 1,744 | Code solutions (5 languages, 2 generator models) |
| text_corpus | 875 | Problem descriptions (175 original + 700 LLM noise) |
| text2code_queries | 1,123 | T2C queries (canonical, full, search subtasks) |
| text2code_qrels | 5,950 | T2C relevance judgments (2,814 pos + 3,136 hard neg) |
| code2code_queries | 278 | C2C queries (cross-language) |
| code2code_qrels | 1,457 | C2C relevance judgments (623 pos + 834 hard neg) |
| code2text_queries | 1,200 | C2T queries (canonical, full, match subtasks) |
| code2text_qrels | 4,610 | C2T relevance judgments (820 pos + 2,650 hard neg) |
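Each config/split pair can be loaded by name, so a quick way to check a local download against the row counts above is a loop like this sketch (config and split names are taken from the table; nothing else is assumed):
from datasets import load_dataset

configs = [
    "code_corpus", "text_corpus",
    "text2code_queries", "text2code_qrels",
    "code2code_queries", "code2code_qrels",
    "code2text_queries", "code2text_qrels",
]

# Print row counts for the v202603 (test) release; use "release_v2602" for the training split.
for config in configs:
    ds = load_dataset("hq-bench/coreb", config, split="release_v2603")
    print(f"{config}: {len(ds)} rows")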
Benchmark Results (v202603, nDCG@10)
| Rank | Model | Avg | T2C | C2C | C2T |
|---|---|---|---|---|---|
| 1 | GemEmb-2 | 0.639 | 0.434 | 0.698 | 0.784 |
| 2 | C2LLM-7B | 0.623 | 0.443 | 0.659 | 0.766 |
| 3 | jina-code-1.5b | 0.607 | 0.414 | 0.671 | 0.735 |
| 4 | C2LLM-0.5B | 0.604 | 0.430 | 0.657 | 0.725 |
| 5 | jina-code-0.5b | 0.596 | 0.386 | 0.677 | 0.725 |
| 6 | F2LLM-4B | 0.547 | 0.407 | 0.500 | 0.735 |
| 7 | Qwen3-Emb-4B | 0.495 | 0.390 | 0.392 | 0.704 |
| 8 | F2LLM-1.7B | 0.485 | 0.383 | 0.383 | 0.690 |
| 9 | Qwen3-Emb-0.6B | 0.443 | 0.349 | 0.384 | 0.597 |
| 10 | F2LLM-0.6B | 0.439 | 0.344 | 0.334 | 0.641 |
| 11 | Qwen3-Emb-8B | 0.428 | 0.328 | 0.320 | 0.635 |
Citation
Coming soon.
License
This project is licensed under the Apache License 2.0 — see LICENSE for details.