Retrieve + rerank over a closed label bank: LLM bi-encoders with self-mined hard negatives and a generative listwise reranker
labelbank is the generalized core of a silver-medal (top 5%) solution to Kaggle's Eedi — Mining Misconceptions in Mathematics, extracted into a small, tested library you can run on your own label catalog with any Hugging Face backbone. The exact competition artifacts are preserved untouched in competition/, and golden tests pin the library's default behavior to the medal-winning code byte for byte.
Use it when your problem looks like this: given a piece of free text, find the matching entry in a fixed catalog of labels — a few hundred to a few tens of thousands of entries that all look frustratingly similar. Support tickets → known-issue KB, error logs → root-cause catalog, symptoms → diagnosis codes, content → policy categories, student mistakes → misconception taxonomies (the original task: 2,587 fine-grained math misconceptions).
Why not just an off-the-shelf embedding model?
Generic embedders retrieve "something related". In a fine-grained bank, related isn't enough — "ignores order of operations" and "evaluates left to right" are nearly identical sentences and different labels. Three design choices close that gap, and they are exactly what this library packages:
1. No in-batch negatives — mined pools instead.
Standard contrastive recipes use other in-batch examples as negatives. In a closed bank that's poison: another query's positive is often a sibling label of your gold (a false negative), and random negatives are trivially easy. labelbank trains on explicit per-query pools — [gold, hard negatives…] — with cross-entropy over the group (no_in_batch_neg_loss, temperature 0.01).
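The grouped loss described above can be sketched in plain Python. This is a minimal illustration, not the library's `no_in_batch_neg_loss`: the function names are hypothetical, and it assumes each pool is a list of similarity scores with the gold at index 0.

```python
import math

def group_cross_entropy(scores, temperature=0.01):
    """Cross-entropy over one query's pool; scores[0] belongs to the gold label."""
    logits = [s / temperature for s in scores]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[0]  # equals -log softmax(logits)[0]

def batch_loss(pools, temperature=0.01):
    """Mean loss over queries; each query is scored only against its own pool."""
    return sum(group_cross_entropy(p, temperature) for p in pools) / len(pools)

# Similarities for two queries, laid out as [gold, hard negative, hard negative].
pools = [[0.82, 0.80, 0.55], [0.60, 0.71, 0.40]]
loss = batch_loss(pools)
```

At temperature 0.01 a similarity gap of 0.02 already becomes a logit gap of 2, so near-ties between sibling labels still produce a strong gradient. Because no query ever sees another query's positive, a sibling label cannot enter the loss as an accidental negative.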
2. The hard negatives come from the model itself.
Train round N → rank the whole bank for every training query → take each query's own top-k as round N+1's negative pool, gold forced to the front (gold_first_pool). A self-bootstrapping curriculum: every round, the negatives are precisely the mistakes the current model still makes. This loop was decisive for the medal.
flowchart LR
T["labeled pairs<br>(text → label id)"] --> R1["bi-encoder round N<br>(LoRA fine-tune)"]
R1 -- "rank full bank<br>per training query" --> M["top-k pools<br>gold first"]
M -- "hard negatives" --> R2["bi-encoder round N+1"]
R2 -- "top-k candidates" --> RR["generative listwise reranker<br>(letters A–E, completion-only SFT)"]
RR --> O["final ranking"]
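One mining round can be sketched as follows. This is an illustration under stated assumptions, not the library's `gold_first_pool`: `gold_first`, `mine_pools` and the word-overlap scorer are hypothetical stand-ins, and the choice to keep `top_k` negatives in addition to the gold (rather than `top_k` entries in total) is an assumption.

```python
def gold_first(ranked_ids, gold_id, top_k):
    """Return [gold, hardest negatives...] with the gold removed from the negatives."""
    negatives = [i for i in ranked_ids if i != gold_id][:top_k]
    return [gold_id] + negatives

def mine_pools(score, queries, gold_ids, bank, top_k):
    """Rank the full bank for every training query and keep each query's top-k."""
    pools = []
    for query, gold in zip(queries, gold_ids):
        ranked = sorted(bank, key=lambda label_id: score(query, bank[label_id]), reverse=True)
        pools.append(gold_first(ranked, gold, top_k))
    return pools

# Toy stand-in for the round-N model: word overlap between query and label text.
def overlap(query, label_text):
    return len(set(query.lower().split()) & set(label_text.lower().split()))

bank = {
    1: "ignores order of operations",
    2: "evaluates left to right",
    3: "adds numerators and denominators",
}
pools = mine_pools(overlap, ["student evaluates the sum left to right"], [1], bank, top_k=2)
```

Here the toy scorer ranks label 2 above the gold label 1, so label 2 becomes the first hard negative for the next round. That is the behavior the loop relies on: whatever the current model wrongly prefers is what it trains against next.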
3. A generative listwise reranker with no position prior.
The retriever's top-k candidates are inlined into one prompt as lettered options; a causal LLM is fine-tuned (completion-only) to answer the letter. The gold's position is shuffled at training time — the reranker must judge content, not slot — and at inference the next-token logits over A…E re-order the candidates (ListwiseReranker).
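The two halves of that idea, shuffled lettered options at training time and a letter-score reorder at inference, can be sketched without a model. The prompt wording, function names and score values below are hypothetical; the library's actual templates are not shown in this README.

```python
import random
import string

def build_listwise_example(query, candidates, gold, rng):
    """Shuffle the candidates, letter them, and return (prompt, target letter)."""
    options = list(candidates)
    rng.shuffle(options)  # the gold lands in a random slot
    letters = string.ascii_uppercase[:len(options)]
    lines = [f"{letter}. {text}" for letter, text in zip(letters, options)]
    prompt = f"Text: {query}\nOptions:\n" + "\n".join(lines) + "\nAnswer:"
    return prompt, letters[options.index(gold)]

def reorder_by_letter_scores(candidates, letter_scores):
    """Sort candidates by the score assigned to their letter, best first."""
    paired = zip(string.ascii_uppercase, candidates)
    ranked = sorted(paired, key=lambda pair: letter_scores[pair[0]], reverse=True)
    return [text for _, text in ranked]

rng = random.Random(0)
gold = "ignores order of operations"
prompt, target = build_listwise_example(
    "2 + 3 x 4 = 20",
    [gold, "evaluates left to right", "adds numerators and denominators"],
    gold,
    rng,
)

# Stand-in for next-token logits over the letters A to E.
scores = {"A": 0.1, "B": 2.3, "C": -0.5, "D": 1.0, "E": 0.0}
order = reorder_by_letter_scores(["c0", "c1", "c2", "c3", "c4"], scores)
```

In a real run the scores would come from the causal LLM's logits at the answer position, restricted to the letter tokens. The training target is only the single letter, which is what completion-only fine-tuning optimizes.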
Install
pip install labelbank # core: metrics, mining, formatting, data (no torch)
pip install "labelbank[retrieve]" # + bi-encoder retrieval (torch, transformers, peft)
pip install "labelbank[train]" # + both training stages (adds trl, datasets)
60 seconds
from labelbank import LabelBank, BiEncoderRetriever, gold_first_pool
# 1. Your closed catalog, and some labeled (text -> label id) pairs.
bank = LabelBank.from_csv("catalog.csv", id_col="LabelId", text_col="LabelText")
queries = ["my failing log line…", "another report…"] # free text
gold_ids = [1042, 17] # matching catalog ids
# 2. Retrieve with any HF backbone (last-token pooling + L2 norm).
retriever = BiEncoderRetriever.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", trainable=True,
    query_prefix="<instruct>Match the text to the best catalog entry.\n<query>",
)
ranked = retriever.retrieve(queries, bank, top_k=25)
# 3. Mine hard negatives from the model's own rankings, then retrain.
pools = [gold_first_pool(r, g, top_k=25) for r, g in zip(ranked, gold_ids)]
from labelbank import RetrieverTrainConfig, train_retriever
train_retriever(retriever, queries, [bank.texts_of(p) for p in pools],
                RetrieverTrainConfig(epochs=1, temperature=0.01))
# 4. Evaluate against the whole bank.
metrics = retriever.evaluate(queries, gold_ids, bank) # map@25 + recall@{1,10,25,50,100}
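The "last-token pooling + L2 norm" step mentioned in the comments above can be illustrated on plain lists. This is a sketch of the general technique rather than the library's code; the function names are hypothetical and the hidden states are made-up numbers.

```python
import math

def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of the last real (non-padding) token."""
    last_index = max(i for i, keep in enumerate(attention_mask) if keep)
    return hidden_states[last_index]

def l2_normalize(vector):
    """Scale the vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Three token positions with 2-dimensional hidden states; the last one is padding.
hidden_states = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 0]
embedding = l2_normalize(last_token_pool(hidden_states, attention_mask))
```

A causal model only sees the full sequence at its final token, which is why that position is used as the sentence embedding instead of a mean over tokens.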
Rerank the top-5 with a generative judge:
from labelbank import build_training_rows, ListwiseReranker
rows = build_training_rows(queries, candidate_texts, gold_texts, k=5) # gold position shuffled
reranker = ListwiseReranker.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
reranker.train(rows, output_dir="out/reranker", lora={"r": 16})
order = reranker.rerank(query_text, candidate_texts) # letter-logit reorder
Configs for both scales ship in examples/configs/: quickstart.yaml (0.5B, one consumer GPU) and reproduce_competition.yaml (Qwen2.5-32B + 4-bit NF4 + LoRA — the medal setup).
Measured: the competition run
Numbers come from the preserved training logs (competition/stage1_train.log) for the retriever stage: Qwen2.5-32B-Instruct + LoRA over a 2,587-entry bank, scored on a held-out fold.
| metric | value |
|---|---|
| MAP@25 | 0.4238 |
| Recall@1 | 0.3017 |
| Recall@10 | 0.6906 |
| Recall@25 | 0.8126 |
| Recall@50 | 0.8978 |
| Recall@100 | 0.9391 |
With the listwise reranker on top, the full two-stage system scored 0.50 on the private leaderboard — silver medal, top 5%. For intuition: Recall@25 of 0.81 means the retriever alone puts the right label among 25 candidates four times out of five — out of 2,587 that all describe subtly different math mistakes.
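For readers who want to check such numbers on their own rankings, here is a small sketch of the two metrics. It assumes exactly one gold label per query, in which case average precision reduces to the reciprocal rank of the gold inside the cutoff; the function names are illustrative and not the library's API.

```python
def recall_at_k(ranked, gold_ids, k):
    """Fraction of queries whose gold label appears in the top k."""
    hits = sum(gold in row[:k] for row, gold in zip(ranked, gold_ids))
    return hits / len(gold_ids)

def map_at_k(ranked, gold_ids, k=25):
    """Mean of 1/rank of the gold label, counting 0 when it misses the top k."""
    total = 0.0
    for row, gold in zip(ranked, gold_ids):
        top = row[:k]
        if gold in top:
            total += 1.0 / (top.index(gold) + 1)
    return total / len(gold_ids)

# Three queries: gold at rank 1, gold at rank 2, gold not retrieved.
ranked = [[7, 3, 9], [4, 2, 8], [5, 6, 1]]
gold_ids = [7, 2, 0]
recall_1 = recall_at_k(ranked, gold_ids, 1)
recall_3 = recall_at_k(ranked, gold_ids, 3)
map_3 = map_at_k(ranked, gold_ids, 3)
```

The gap between Recall@25 and MAP@25 in the table reflects this definition: a gold label at rank 20 counts fully toward recall but contributes only 0.05 to MAP, which is the headroom a reranker works on.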
How it relates to existing tools
| | sentence-transformers / BGE | RAG over a corpus | labelbank |
|---|---|---|---|
| Target | open-ended similarity | open document collection | closed catalog (can re-embed every eval) |
| Negatives | in-batch by default | n/a | explicit mined pools, no in-batch |
| Mining loop | bring your own | n/a | built in, gold-first, iterative |
| Reranker | cross-encoder (pointwise) | LLM reads retrieved docs | generative listwise letters, position-shuffled |
| Backbone | encoder models | any | any HF causal model as bi-encoder (last-token pool, LoRA, 4-bit) |
If you need general-purpose embeddings, use sentence-transformers. If your labels are a fixed, fine-grained catalog and generic embeddings keep confusing siblings, this is the recipe that medaled on exactly that problem.
Provenance & validation
- The competition scripts, configs, training logs, inference notebook, certificate and the full original write-up are preserved verbatim in competition/.
- Golden tests pin the library to the medal-winning code: the contrastive loss, last-token pooling, hard-negative pool construction, both prompt templates, and the Eedi data pipeline are each fuzz-tested against verbatim copies of the originals (tests/reference_impl.py) and assert identical output. The library is the competition code, not a reimplementation of it.
- Final result: silver medal (top 5%), private LB 0.50 (certificate).
Citation
@misc{li2024labelbank,
author = {Daoyuan Li},
title = {labelbank: retrieval and listwise reranking over closed label banks with self-mined hard negatives},
year = {2024},
url = {https://github.com/DaoyuanLi2816/labelbank},
note = {Generalized from a silver-medal solution, Kaggle Eedi — Mining Misconceptions in Mathematics}
}
License
MIT — see LICENSE.
Author
Daoyuan Li — Kaggle (distiller) · lidaoyuan2816@gmail.com