
CoREB: Code Retrieval and Reranking Benchmark


CoREB is a graded-relevance benchmark for evaluating code retrieval and reranking models across three tasks:

| Task | Query | Target | Example |
|------|-------|--------|---------|
| Text-to-Code (T2C) | Natural language description | Code solution | "Find the longest substring without repeating characters" → Python solution |
| Code-to-Code (C2C) | Code in language A | Equivalent code in language B | Python solution → Java translation |
| Code-to-Text (C2T) | Code snippet | Problem description | Python solution → problem statement |

Key Features

  • Graded relevance: 3-level qrel scheme (rel=2: positive, rel=1: hard negative, rel=0: irrelevant) — hard negatives are same-problem distractors that penalize nDCG when retrieved above true positives
  • 5 programming languages: Python, C++, Java, Go, Ruby
  • Problem-disjoint train/test splits: v202602 (training) and v202603 (testing) cover non-overlapping contest windows
  • Drop-in evaluation: compatible with standard IR evaluation (pytrec_eval) with relevance_level=2
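
The problem-disjoint guarantee can be spot-checked directly once both releases are loaded. A minimal sketch, assuming each corpus row carries a `problem_id` field (the field name is an assumption — check the dataset card for the actual schema):

```python
# Verify that two releases share no problems. The `problem_id` field
# used here is hypothetical; substitute the real field name.
def assert_problem_disjoint(train_rows, test_rows, key="problem_id"):
    train_ids = {row[key] for row in train_rows}
    test_ids = {row[key] for row in test_rows}
    overlap = train_ids & test_ids
    if overlap:
        raise ValueError(f"Splits share {len(overlap)} problems: {sorted(overlap)[:5]}")
    return True

# Toy rows standing in for the v202602 / v202603 releases
train = [{"problem_id": "c2602_a"}, {"problem_id": "c2602_b"}]
test = [{"problem_id": "c2603_a"}]
assert_problem_disjoint(train, test)  # passes: no shared problems
```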

Installation

pip install coreb

For HuggingFace model support:

pip install coreb[hf]        # transformers backend
pip install coreb[gemini]    # Google Gemini API
pip install coreb[all]       # everything

Quick Start

Load the Dataset

from datasets import load_dataset

# Load v202603 release (latest)
code_corpus = load_dataset("hq-bench/coreb", "code_corpus", split="release_v2603")
text_corpus = load_dataset("hq-bench/coreb", "text_corpus", split="release_v2603")

# Load task-specific queries and qrels
t2c_queries = load_dataset("hq-bench/coreb", "text2code_queries", split="release_v2603")
t2c_qrels = load_dataset("hq-bench/coreb", "text2code_qrels", split="release_v2603")

print(f"Code corpus: {len(code_corpus)} documents")
print(f"T2C queries: {len(t2c_queries)} queries, {len(t2c_qrels)} qrels")
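
If you load from HuggingFace rather than local JSONL files, the rows need reshaping into the nested dicts that BEIR-style runners expect. A minimal sketch of that bridge, assuming BEIR-like field names (`_id`, `text`, `query_id`, `corpus_id`, `score`) — verify these against the dataset card before relying on them:

```python
# Reshape flat rows into {doc_id: {...}}, {query_id: text}, and
# {query_id: {doc_id: rel}} layouts. Field names are assumptions.
def rows_to_corpus(rows):
    return {r["_id"]: {"text": r["text"]} for r in rows}

def rows_to_queries(rows):
    return {r["_id"]: r["text"] for r in rows}

def rows_to_qrels(rows):
    qrels = {}
    for r in rows:
        qrels.setdefault(r["query_id"], {})[r["corpus_id"]] = int(r["score"])
    return qrels

corpus = rows_to_corpus([{"_id": "d1", "text": "def solve(): ..."}])
qrels = rows_to_qrels([{"query_id": "q1", "corpus_id": "d1", "score": 2}])
```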

Run Evaluation

from coreb_runner.benchmark import (
    load_jsonl,
    convert_corpus_to_coir_format,
    convert_queries_to_coir_format,
    convert_qrels_to_coir_format,
    EvaluateRetrieval,
    DenseRetrievalExactSearch,
    create_model_wrapper,
)

# Load data (from local JSONL files or convert from HF datasets)
corpus = convert_corpus_to_coir_format(load_jsonl("code_corpus.jsonl"))
queries = convert_queries_to_coir_format(load_jsonl("text2code_queries.jsonl"))
qrels = convert_qrels_to_coir_format(load_jsonl("text2code_qrels.jsonl"))

# Create model wrapper
model = create_model_wrapper("jinaai/jina-embeddings-v3", model_type="huggingface")

# Run retrieval + evaluation
retriever = DenseRetrievalExactSearch(model, batch_size=64)
evaluator = EvaluateRetrieval(retriever, k_values=[1, 3, 5, 10])
results = evaluator.retrieve(corpus, queries)
ndcg, _map, recall, precision = evaluator.evaluate(qrels, results, evaluator.k_values)

print(f"nDCG@10: {ndcg['NDCG@10']:.4f}")
print(f"Recall@10: {recall['Recall@10']:.4f}")
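
Conceptually, exact dense search just embeds both sides and ranks every document by similarity to the query. A toy sketch of that ranking step with hand-made 2-D vectors (no model involved — not the library's actual implementation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def exact_search(query_vec, doc_vecs, k=2):
    """Score every document against the query; return top-k doc ids."""
    scored = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]

docs = {"d1": [1.0, 0.0], "d2": [0.7, 0.7], "d3": [0.0, 1.0]}
print(exact_search([1.0, 0.1], docs))  # ['d1', 'd2']: d1 is nearest
```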

Evaluation with Graded Relevance

CoREB uses relevance_level=2 — only rel>=2 items count as relevant for binary metrics (Recall, MAP, Precision). Hard negatives (rel=1) penalize nDCG by occupying top ranks with zero gain but do not inflate Recall/MRR.

# The EvaluateRetrieval class handles this automatically:
# - rel=1 (hard negatives) are zeroed out for nDCG computation
# - relevance_level=2 is set for pytrec_eval binary metrics
print(f"Relevance threshold: {EvaluateRetrieval.RELEVANCE_LEVEL}")  # 2
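
The effect of the graded scheme can be reproduced by hand. A minimal sketch of nDCG@k under the convention described above (gains below rel=2 are zeroed, so a hard negative contributes nothing but still displaces positives to lower ranks):

```python
import math

def ndcg_at_k(ranked_rels, k=10):
    """nDCG@k with sub-positive grades (rel<2) zeroed out."""
    gains = [(r if r >= 2 else 0) for r in ranked_rels[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(((r if r >= 2 else 0) for r in ranked_rels), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([2, 1, 0]))  # 1.0: the positive is ranked first
print(ndcg_at_k([1, 2, 0]))  # ~0.631: a hard negative outranks the positive
```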

Dataset Structure

Available on HuggingFace: hq-bench/coreb

8 configs × 2 splits (release_v2602, release_v2603):

| Config | Rows (v2603) | Description |
|--------|--------------|-------------|
| code_corpus | 1,744 | Code solutions (5 languages, 2 generator models) |
| text_corpus | 875 | Problem descriptions (175 original + 700 LLM noise) |
| text2code_queries | 1,123 | T2C queries (canonical, full, search subtasks) |
| text2code_qrels | 5,950 | T2C relevance judgments (2,814 pos + 3,136 hard neg) |
| code2code_queries | 278 | C2C queries (cross-language) |
| code2code_qrels | 1,457 | C2C relevance judgments (623 pos + 834 hard neg) |
| code2text_queries | 1,200 | C2T queries (canonical, full, match subtasks) |
| code2text_qrels | 4,610 | C2T relevance judgments (820 pos + 2,650 hard neg) |
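
The pos/hard-neg counts above can be recomputed directly from the qrels rows. A minimal sketch, assuming each row carries its grade in a `score` field (the field name is an assumption):

```python
from collections import Counter

def qrel_breakdown(rows, key="score"):
    """Count judgments per grade: rel=2 positives, rel=1 hard negatives."""
    counts = Counter(int(r[key]) for r in rows)
    return {"pos": counts.get(2, 0), "hard_neg": counts.get(1, 0)}

# Toy rows; on the real text2code_qrels this should report
# 2,814 positives and 3,136 hard negatives.
toy_rows = [{"score": 2}, {"score": 1}, {"score": 1}]
print(qrel_breakdown(toy_rows))  # {'pos': 1, 'hard_neg': 2}
```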

Benchmark Results (v202603, nDCG@10)

| Rank | Model | Avg | T2C | C2C | C2T |
|------|-------|-----|-----|-----|-----|
| 1 | GemEmb-2 | 0.639 | 0.434 | 0.698 | 0.784 |
| 2 | C2LLM-7B | 0.623 | 0.443 | 0.659 | 0.766 |
| 3 | jina-code-1.5b | 0.607 | 0.414 | 0.671 | 0.735 |
| 4 | C2LLM-0.5B | 0.604 | 0.430 | 0.657 | 0.725 |
| 5 | jina-code-0.5b | 0.596 | 0.386 | 0.677 | 0.725 |
| 6 | F2LLM-4B | 0.547 | 0.407 | 0.500 | 0.735 |
| 7 | Qwen3-Emb-4B | 0.495 | 0.390 | 0.392 | 0.704 |
| 8 | F2LLM-1.7B | 0.485 | 0.383 | 0.383 | 0.690 |
| 9 | Qwen3-Emb-0.6B | 0.443 | 0.349 | 0.384 | 0.597 |
| 10 | F2LLM-0.6B | 0.439 | 0.344 | 0.334 | 0.641 |
| 11 | Qwen3-Emb-8B | 0.428 | 0.328 | 0.320 | 0.635 |

Citation

Coming soon.

License

This project is licensed under the Apache License 2.0 — see LICENSE for details.
