LLM-based Query Reformulation Toolkit
A lightweight, reproducible toolkit for LLM-based query reformulation
📚 Documentation • 📊 Leaderboard • 📦 PyPI • 📄 Paper
Features
- Single Prompt Bank (YAML) with metadata
- Simple DataLoader: Dependency-free file loading for queries, qrels, and contexts
- Format Loaders: Optional BEIR and MS MARCO format loaders in querygym.loaders
- OpenAI-compatible LLM client (works with any OpenAI API–compatible endpoint)
- Pyserini optional: either pass contexts (JSONL) or pass a retriever instance to build contexts
- Export-only: emits reformulated queries; optionally generates a bash script for Pyserini + trec_eval
Supported Methods
QueryGym implements the following query reformulation methods:
| Method | Description | Paper |
|---|---|---|
| GenQR | Generic keyword expansion using an LLM | Wang et al., 2023 |
| GenQR Ensemble | Ensemble of 10 instruction variants for diverse keyword expansion | Dhole & Agichtein, 2024 |
| Query2Doc | Generates pseudo-documents from LLM knowledge | Wang et al., 2023 |
| QA Expand | Question-answer based expansion with sub-questions | Seo et al., 2025 |
| MuGI | Multi-granularity information expansion with adaptive concatenation | Zhang et al., 2024 |
| LameR | Context-based passage synthesis using retrieved documents | Mackie et al., 2023 |
| CSQE | Context-based sentence-level query expansion (KEQE + CSQE) | Lee et al., 2024 |
| Query2E | Query to entity/keyword expansion | Jagerman et al., 2023 |
For detailed usage and parameters, see the Methods Reference.
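To give a feel for what an expansion method in the table emits, here is a minimal sketch in the spirit of Query2Doc: the original query is repeated a few times and concatenated with an LLM-written pseudo-document. The names fake_llm and expand_query, the prompt wording, and the default repeat count are illustrative assumptions, not QueryGym's API or its actual prompts.

```python
from typing import Callable


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; returns a canned passage."""
    return "Aspirin is a common pain reliever that also reduces fever."


def expand_query(query: str, generate: Callable[[str], str], repeats: int = 5) -> str:
    """Query2Doc-style expansion: repeated query followed by a pseudo-document."""
    # Assumed prompt wording; the toolkit's prompt bank holds the real templates.
    prompt = f"Write a passage that answers the given query.\nQuery: {query}\nPassage:"
    pseudo_doc = generate(prompt).strip()
    # Repeating the query keeps its terms from being drowned out by the longer passage.
    return " ".join([query] * repeats + [pseudo_doc])


expanded = expand_query("what is aspirin", fake_llm, repeats=2)
print(expanded)
```

Swapping fake_llm for a function that calls an OpenAI-compatible endpoint would turn the sketch into a working expander; the string it returns is what a sparse retriever such as BM25 would then receive as the query.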
Installation
Option 1: Install from PyPI
pip install querygym
Option 2: Use Docker (Recommended for Quick Start)
# GPU version (default)
docker pull ghcr.io/ls3-lab/querygym:latest
docker run -it --gpus all ghcr.io/ls3-lab/querygym:latest
# CPU version (lightweight)
docker pull ghcr.io/ls3-lab/querygym:cpu
docker run -it ghcr.io/ls3-lab/querygym:cpu
# Or use Docker Compose
docker compose run --rm querygym
📖 Docker Setup: See DOCKER_SETUP.md for quick start or the full Docker guide for detailed usage.
Quickstart
Python API (Recommended)
import querygym as qg
# Load data
queries = qg.load_queries("queries.tsv")
qrels = qg.load_qrels("qrels.txt")
contexts = qg.load_contexts("contexts.jsonl")
# Create reformulator
reformulator = qg.create_reformulator("genqr_ensemble", model="gpt-4")
# Reformulate
results = reformulator.reformulate_batch(queries)
# Save
qg.DataLoader.save_queries(
    [qg.QueryItem(r.qid, r.reformulated) for r in results],
    "reformulated.tsv",
)
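The quickstart assumes the input files already exist. The sketch below writes tiny sample files using only the standard library. The layouts are assumptions: one tab-separated qid and query text per line for queries.tsv, and the TREC layout (qid, 0, docid, relevance) for qrels.txt. Check the documentation for the exact formats the loaders accept.

```python
import csv
import tempfile
from pathlib import Path

# Made-up sample data for illustration only.
queries = {"q1": "what is aspirin", "q2": "causes of insomnia"}
qrels = [("q1", "doc7", 1), ("q2", "doc3", 2)]

workdir = Path(tempfile.mkdtemp())

# queries.tsv: one query per line as "qid<TAB>text" (assumed layout).
with open(workdir / "queries.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    for qid, text in queries.items():
        writer.writerow([qid, text])

# qrels.txt: TREC layout "qid 0 docid relevance" (assumed to match load_qrels).
with open(workdir / "qrels.txt", "w", encoding="utf-8") as f:
    for qid, docid, rel in qrels:
        f.write(f"{qid} 0 {docid} {rel}\n")

queries_text = (workdir / "queries.tsv").read_text(encoding="utf-8")
qrels_text = (workdir / "qrels.txt").read_text(encoding="utf-8")
print(queries_text)
```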
CLI
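# From a source checkout: editable install with optional extras
# (quoted so that shells such as zsh do not expand the brackets)
pip install -e ".[hf,beir,dev]"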
export OPENAI_API_KEY=sk-...
# Run a method (e.g., genqr_ensemble)
querygym run --method genqr_ensemble \
    --queries-tsv queries.tsv \
    --output-tsv reformulated.tsv \
    --cfg-path querygym/config/defaults.yaml
Loading Datasets
BEIR:
import querygym as qg
# Download with the BEIR library (the download helper lives in beir.util)
from beir import util
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
data_path = util.download_and_unzip(url, "datasets")
# Load with querygym
queries = qg.loaders.beir.load_queries(data_path)
qrels = qg.loaders.beir.load_qrels(data_path)
MS MARCO:
import querygym as qg
# Load from local files (download with ir_datasets)
queries = qg.loaders.msmarco.load_queries("queries.tsv")
qrels = qg.loaders.msmarco.load_qrels("qrels.tsv")
Examples
See the examples directory for:
- Code snippets - Quick reference examples
- Docker examples - Containerized workflows with Jupyter notebooks
- QueryGym + Pyserini - Complete retrieval pipelines
- Methods Reference - Complete guide to all query reformulation methods
Check examples/README.md for the full guide.
Contributing
We welcome contributions! Here's how you can help:
Adding a New Prompt
- Edit querygym/prompt_bank.yaml
- Add an entry with fields: id, method_family, version, introduced_by, license, authors, tags, template:{system,user}, notes
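As a rough picture of what such an entry carries, the sketch below mirrors the listed fields as a Python dict and fills the template with a query. Every value is made up, and both the {query} placeholder syntax and the render helper are assumptions for illustration; the real prompt bank is a YAML file and its rendering logic may differ.

```python
from typing import Dict, List

# Hypothetical prompt entry mirroring the fields listed above; all values are made up.
prompt_entry = {
    "id": "my_method.keywords.v1",
    "method_family": "my_method",
    "version": 1,
    "introduced_by": "your-name",
    "license": "Apache-2.0",
    "authors": ["Your Name"],
    "tags": ["keywords", "zero-shot"],
    "template": {
        "system": "You are a helpful search assistant.",
        "user": "List five keywords for the query: {query}",
    },
    "notes": "Illustrative only.",
}


def render(entry: dict, **variables: str) -> List[Dict[str, str]]:
    """Fill both template parts with str.format and return chat-style messages."""
    return [
        {"role": role, "content": entry["template"][role].format(**variables)}
        for role in ("system", "user")
    ]


messages = render(prompt_entry, query="what is aspirin")
print(messages[1]["content"])
```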
Adding a New Method
- Create a class under querygym/methods/*.py
- Subclass BaseReformulator, annotate VERSION, and register with @register_method("name")
- Pull templates via PromptBank.render(prompt_id, query=...)
Reporting Issues
- Found a bug? Open an issue
- Have a feature request? We'd love to hear it!
For detailed development guidelines, see the Contributing Guide in our documentation.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Citation
If you use QueryGym in your research, please cite:
@misc{bigdeli2025querygymtoolkitreproduciblellmbased,
title={QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation},
author={Amin Bigdeli and Radin Hamidi Rad and Mert Incesu and Negar Arabzadeh and Charles L. A. Clarke and Ebrahim Bagheri},
year={2025},
eprint={2511.15996},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2511.15996},
}