Retrieve + rerank over a closed label bank: LLM bi-encoders with self-mined hard negatives and a generative listwise reranker

labelbank — retrieve and rerank over a closed label bank: LLM bi-encoders, self-mined hard negatives, generative listwise reranking. Kaggle Silver, Eedi.

labelbank is the generalized core of a silver-medal (top 5%) solution to Kaggle's Eedi — Mining Misconceptions in Mathematics, extracted into a small, tested library you can run on your own label catalog with any Hugging Face backbone. The exact competition artifacts are preserved untouched in competition/, and golden tests pin the library's default behavior to the medal-winning code byte for byte.

Use it when your problem looks like this: given a piece of free text, find the matching entry in a fixed catalog of labels — a few hundred to a few tens of thousands of entries that all look frustratingly similar. Support tickets → known-issue KB, error logs → root-cause catalog, symptoms → diagnosis codes, content → policy categories, student mistakes → misconception taxonomies (the original task: 2,587 fine-grained math misconceptions).

Why not just an off-the-shelf embedding model?

Generic embedders retrieve "something related". In a fine-grained bank, related isn't enough — "ignores order of operations" and "evaluates left to right" are nearly identical sentences and different labels. Three design choices close that gap, and they are exactly what this library packages:

1. No in-batch negatives — mined pools instead. Standard contrastive recipes use other in-batch examples as negatives. In a closed bank that's poison: another query's positive is often a sibling label of your gold (a false negative), and random negatives are trivially easy. labelbank trains on explicit per-query pools — [gold, hard negatives…] — with cross-entropy over the group (no_in_batch_neg_loss, temperature 0.01).

2. The hard negatives come from the model itself. Train round N → rank the whole bank for every training query → take each query's own top-k as round N+1's negative pool, gold forced to the front (gold_first_pool). A self-bootstrapping curriculum: every round, the negatives are precisely the mistakes the current model still makes. This loop was decisive for the medal.

flowchart LR
    T["labeled pairs<br>(text → label id)"] --> R1["bi-encoder round N<br>(LoRA fine-tune)"]
    R1 -- "rank full bank<br>per training query" --> M["top-k pools<br>gold first"]
    M -- "hard negatives" --> R2["bi-encoder round N+1"]
    R2 -- "top-k candidates" --> RR["generative listwise reranker<br>(letters A–E, completion-only SFT)"]
    RR --> O["final ranking"]

3. A generative listwise reranker with no position prior. The retriever's top-k candidates are inlined into one prompt as lettered options; a causal LLM is fine-tuned (completion-only) to answer the letter. The gold's position is shuffled at training time — the reranker must judge content, not slot — and at inference the next-token logits over A…E re-order the candidates (ListwiseReranker).
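
Design choices 1 and 2 fit in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the library's implementation: `gold_first_pool_sketch` and `group_cross_entropy` are hypothetical names, the pool is assumed to hold exactly `top_k` entries including the gold, and embeddings are assumed to be L2-normalised lists of floats.

```python
import math
from typing import Hashable, List, Sequence


def gold_first_pool_sketch(
    ranked_ids: Sequence[Hashable], gold_id: Hashable, top_k: int
) -> List[Hashable]:
    """Gold label first, then the model's highest-ranked non-gold labels.

    Assumption: the pool has `top_k` entries in total, gold included.
    """
    negatives = [i for i in ranked_ids if i != gold_id][: top_k - 1]
    return [gold_id] + negatives


def group_cross_entropy(
    query_vec: Sequence[float],
    pool_vecs: Sequence[Sequence[float]],
    temperature: float = 0.01,
) -> float:
    """Cross-entropy over one query's own pool, gold at index 0.

    Only this query's pool enters the softmax, so no other query's
    positive can act as a (possibly false) negative.
    """
    logits = [
        sum(q * p for q, p in zip(query_vec, vec)) / temperature
        for vec in pool_vecs
    ]
    peak = max(logits)  # subtract the max for a stable log-sum-exp
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[0]


# The model ranked label 9 third; the pool puts it first and keeps
# the two labels that outranked it as hard negatives.
pool = gold_first_pool_sketch([7, 3, 9, 1, 4], gold_id=9, top_k=3)
print(pool)  # [9, 7, 3]
```

At temperature 0.01 a cosine gap of 0.4 between a hard negative and the gold becomes a logit gap of 40, so near-miss siblings dominate the loss.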

Install

pip install labelbank              # core: metrics, mining, formatting, data (no torch)
pip install labelbank[retrieve]    # + bi-encoder retrieval (torch, transformers, peft)
pip install labelbank[train]       # + both training stages (adds trl, datasets)

60 seconds

from labelbank import LabelBank, BiEncoderRetriever, gold_first_pool

# 1. Your closed catalog, and some labeled (text -> label id) pairs.
bank = LabelBank.from_csv("catalog.csv", id_col="LabelId", text_col="LabelText")
queries = ["my failing log line…", "another report…"]   # free text
gold_ids = [1042, 17]                                    # matching catalog ids

# 2. Retrieve with any HF backbone (last-token pooling + L2 norm).
retriever = BiEncoderRetriever.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", trainable=True,
    query_prefix="<instruct>Match the text to the best catalog entry.\n<query>",
)
ranked = retriever.retrieve(queries, bank, top_k=25)

# 3. Mine hard negatives from the model's own rankings, then retrain.
pools = [gold_first_pool(r, g, top_k=25) for r, g in zip(ranked, gold_ids)]

from labelbank import RetrieverTrainConfig, train_retriever
train_retriever(retriever, queries, [bank.texts_of(p) for p in pools],
                RetrieverTrainConfig(epochs=1, temperature=0.01))

# 4. Evaluate against the whole bank.
metrics = retriever.evaluate(queries, gold_ids, bank)   # map@25 + recall@{1,10,25,50,100}
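
Step 2 mentions last-token pooling followed by L2 normalisation. The sketch below shows that idea on plain Python lists for a single sequence. The library presumably works on batched tensors, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence


def last_token_pool(
    hidden_states: Sequence[Sequence[float]], attention_mask: Sequence[int]
) -> List[float]:
    """Return the hidden state of the last non-padding token.

    Taking the highest index whose mask is 1 works for both left- and
    right-padded sequences.
    """
    last_real = max(i for i, m in enumerate(attention_mask) if m == 1)
    return list(hidden_states[last_real])


def l2_normalize(vec: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


# Three token positions, the last one is padding.
hidden = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = l2_normalize(last_token_pool(hidden, mask))
print(embedding)
```

A causal model only sees the whole input at its final real token, which is why that position, rather than a mean over tokens, is the natural summary for a decoder-style backbone.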

Rerank the top-5 with a generative judge:

from labelbank import build_training_rows, ListwiseReranker

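# candidate_texts, gold_texts and query_text are not defined in this snippet;
# build them from the retrieval output first (e.g. with bank.texts_of).
rows = build_training_rows(queries, candidate_texts, gold_texts, k=5)   # gold position shuffled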
reranker = ListwiseReranker.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
reranker.train(rows, output_dir="out/reranker", lora={"r": 16})
order = reranker.rerank(query_text, candidate_texts)                     # letter-logit reorder
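
To make the two reranker mechanics concrete (shuffling the gold's slot at training time, reordering by letter logits at inference), here is a minimal sketch in plain Python. The function names and the prompt layout are hypothetical. The library ships its own templates, and in practice the letter scores would come from the model's next-token logits.

```python
import random
import string
from typing import Dict, List, Sequence, Tuple


def shuffled_options(
    candidates: Sequence[str], gold: str, rng: random.Random
) -> Tuple[List[str], str]:
    """Shuffle candidates so the gold lands in a random slot.

    Returns the shuffled options and the letter that now marks the gold.
    """
    options = list(candidates)
    if gold not in options:
        options[-1] = gold  # assumption: replace the last candidate with the gold
    rng.shuffle(options)
    return options, string.ascii_uppercase[options.index(gold)]


def format_prompt(query: str, options: Sequence[str]) -> str:
    """Hypothetical prompt layout: query, lettered options, answer cue."""
    lines = [f"{string.ascii_uppercase[i]}. {t}" for i, t in enumerate(options)]
    return query + "\n" + "\n".join(lines) + "\nAnswer:"


def reorder_by_letter_logits(
    options: Sequence[str], letter_logits: Dict[str, float]
) -> List[str]:
    """Sort options by the score the model assigns to each option's letter."""
    letters = string.ascii_uppercase[: len(options)]
    order = sorted(
        range(len(options)), key=lambda i: letter_logits[letters[i]], reverse=True
    )
    return [options[i] for i in order]


options, answer = shuffled_options(
    ["adds denominators", "ignores order of operations", "evaluates left to right"],
    gold="evaluates left to right",
    rng=random.Random(0),
)
print(format_prompt("Student computed 2 + 3 * 4 = 20.", options))
print("target letter:", answer)
```

Because the target letter changes from row to row, the only reliable signal left for the model is the content of each option.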

Configs for both scales ship in examples/configs/: quickstart.yaml (0.5B, one consumer GPU) and reproduce_competition.yaml (Qwen2.5-32B + 4-bit NF4 + LoRA — the medal setup).

Measured: the competition run

Numbers from the preserved training logs (competition/stage1_train.log) for the retriever stage: Qwen2.5-32B-Instruct + LoRA over a 2,587-entry bank, scored on a held-out fold.

| Metric | Value |
| --- | --- |
| MAP@25 | 0.4238 |
| Recall@1 | 0.3017 |
| Recall@10 | 0.6906 |
| Recall@25 | 0.8126 |
| Recall@50 | 0.8978 |
| Recall@100 | 0.9391 |

With the listwise reranker on top, the full two-stage system scored 0.50 on the private leaderboard — silver medal, top 5%. For intuition: Recall@25 of 0.81 means the retriever alone puts the right label among 25 candidates four times out of five — out of 2,587 that all describe subtly different math mistakes.
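
For readers who want to check such numbers on their own rankings, the two metrics can be sketched as follows. This assumes exactly one gold label per query, in which case average precision at k reduces to the reciprocal rank of the gold within the top k. The function names are hypothetical and independent of the library's own evaluation code.

```python
from typing import Hashable, Sequence


def recall_at_k(
    ranked: Sequence[Sequence[Hashable]], gold: Sequence[Hashable], k: int
) -> float:
    """Fraction of queries whose gold label appears in the top k."""
    hits = sum(1 for r, g in zip(ranked, gold) if g in r[:k])
    return hits / len(gold)


def map_at_k(
    ranked: Sequence[Sequence[Hashable]], gold: Sequence[Hashable], k: int = 25
) -> float:
    """Mean average precision at k, assuming one gold label per query.

    Each query contributes 1 / rank if the gold is in the top k, else 0.
    """
    total = 0.0
    for r, g in zip(ranked, gold):
        top = list(r[:k])
        if g in top:
            total += 1.0 / (top.index(g) + 1)
    return total / len(gold)


ranked = [[5, 2, 9], [1, 4, 7]]   # two queries, three candidates each
gold = [2, 8]                     # the second gold is never retrieved
print(recall_at_k(ranked, gold, k=3))  # 0.5
print(map_at_k(ranked, gold, k=3))     # 0.25
```

Recall@k is the ceiling for any reranker working on the top k: it can move the gold up, but only if the retriever included it.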

How it relates to existing tools

| | sentence-transformers / BGE | RAG over a corpus | labelbank |
| --- | --- | --- | --- |
| Target | open-ended similarity | open document collection | closed catalog (can re-embed every eval) |
| Negatives | in-batch by default | n/a | explicit mined pools, no in-batch |
| Mining loop | bring your own | n/a | built in, gold-first, iterative |
| Reranker | cross-encoder (pointwise) | LLM reads retrieved docs | generative listwise letters, position-shuffled |
| Backbone | encoder models | any | any HF causal model as bi-encoder (last-token pool, LoRA, 4-bit) |

If you need general-purpose embeddings, use sentence-transformers. If your labels are a fixed, fine-grained catalog and generic embeddings keep confusing siblings, this is the recipe that medaled on exactly that problem.

Provenance & validation

  • The competition scripts, configs, training logs, inference notebook, certificate and the full original write-up are preserved verbatim in competition/.
  • Golden tests pin the library to the medal-winning code: the contrastive loss, last-token pooling, hard-negative pool construction, both prompt templates, and the Eedi data pipeline are each fuzz-tested against verbatim copies of the originals (tests/reference_impl.py) and assert identical output — the library is the competition code, not a reimplementation of it.
  • Final result: silver medal (top 5%), private LB 0.50 (certificate).

Kaggle Eedi silver medal certificate — Daoyuan Li

Citation

@misc{li2024labelbank,
  author = {Daoyuan Li},
  title  = {labelbank: retrieval and listwise reranking over closed label banks with self-mined hard negatives},
  year   = {2024},
  url    = {https://github.com/DaoyuanLi2816/labelbank},
  note   = {Generalized from a silver-medal solution, Kaggle Eedi — Mining Misconceptions in Mathematics}
}

License

MIT — see LICENSE.

Author

Daoyuan Li — Kaggle (distiller) · lidaoyuan2816@gmail.com
