LLM-based Query Reformulation Toolkit

Python 3.9+ | License: Apache 2.0

A lightweight, reproducible toolkit for LLM-based query reformulation

📚 Documentation | 📊 Leaderboard | 📦 PyPI | 📄 Paper


Features

  • Single Prompt Bank (YAML) with metadata
  • Simple DataLoader: Dependency-free file loading for queries, qrels, and contexts
  • Format Loaders: Optional BEIR and MS MARCO format loaders in querygym.loaders
  • OpenAI-compatible LLM client (works with any OpenAI API–compatible endpoint)
  • Pyserini optional: either pass contexts (JSONL) or pass a retriever instance to build contexts
  • Export-only: emits reformulated queries; optionally generates a bash script for Pyserini + trec_eval
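
As an illustration of the "pass contexts (JSONL)" option above, the sketch below reads a JSONL file into a mapping from query ID to a list of passages using only the standard library. The field names `qid` and `contexts` and the helper name are assumptions made for this example; check the documentation for the schema QueryGym actually expects.

```python
import json
from typing import Dict, List


def read_contexts_jsonl(path: str) -> Dict[str, List[str]]:
    """Map each query ID to its list of context passages.

    Assumes one JSON object per line with the hypothetical fields
    "qid" (query identifier) and "contexts" (list of passage strings).
    """
    contexts: Dict[str, List[str]] = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines between records
                continue
            record = json.loads(line)
            # Normalize IDs to strings so numeric and string IDs compare equal.
            contexts[str(record["qid"])] = list(record["contexts"])
    return contexts
```

Keeping contexts in a plain file like this is what makes the retriever optional: the passages can be produced once by any system and reused across runs.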

Supported Methods

QueryGym implements the following query reformulation methods:

Method         | Description                                                            | Paper
-------------- | ---------------------------------------------------------------------- | -----------------------
GenQR          | Generic keyword expansion using LLM                                    | Wang et al., 2023
GenQR Ensemble | Ensemble of 10 instruction variants for diverse keyword expansion      | Dhole & Agichtein, 2024
Query2Doc      | Generates pseudo-documents from LLM knowledge                          | Wang et al., 2023
QA Expand      | Question-answer based expansion with sub-questions                     | Seo et al., 2025
MuGI           | Multi-granularity information expansion with adaptive concatenation    | Zhang et al., 2024
LameR          | Context-based passage synthesis using retrieved documents              | Mackie et al., 2023
CSQE           | Context-based sentence-level query expansion (KEQE + CSQE)             | Lee et al., 2024
Query2E        | Query to entity/keyword expansion                                      | Jagerman et al., 2023

For detailed usage and parameters, see the Methods Reference.
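
To make the expansion idea concrete, here is a minimal sketch of how LLM-generated text can be combined with the original query, in the style the Query2Doc paper describes for sparse retrieval: the query is repeated a few times so that it is not drowned out by the longer generated passage. The function names, the `repeats` default, and the stub generator are illustrative assumptions, not QueryGym's implementation.

```python
from typing import Callable


def expand_with_pseudo_doc(
    query: str,
    generate: Callable[[str], str],
    repeats: int = 5,
) -> str:
    """Concatenate `repeats` copies of the query with generated text.

    `generate` is any callable that maps a query to generated text,
    for example a wrapper around an LLM client.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    pseudo_doc = generate(query).strip()
    parts = [query] * repeats
    if pseudo_doc:  # fall back to the repeated query if generation is empty
        parts.append(pseudo_doc)
    return " ".join(parts)


def fake_llm(query: str) -> str:
    """Hypothetical stand-in for an LLM call so the sketch runs offline."""
    return f"A short passage about {query}."
```

Calling `expand_with_pseudo_doc("vitamin d", fake_llm, repeats=2)` yields the query twice followed by the stub passage. Passing the generator as a callable keeps the concatenation logic testable without network access.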

Installation

Option 1: Install from PyPI

pip install querygym

Option 2: Use Docker (Recommended for Quick Start)

# GPU version (default)
docker pull ghcr.io/ls3-lab/querygym:latest
docker run -it --gpus all ghcr.io/ls3-lab/querygym:latest

# CPU version (lightweight)
docker pull ghcr.io/ls3-lab/querygym:cpu
docker run -it ghcr.io/ls3-lab/querygym:cpu

# Or use Docker Compose
docker compose run --rm querygym

📖 Docker Setup: See DOCKER_SETUP.md for quick start or the full Docker guide for detailed usage.

Quickstart

Python API (Recommended)

import querygym as qg

# Load data
queries = qg.load_queries("queries.tsv")
qrels = qg.load_qrels("qrels.txt")
contexts = qg.load_contexts("contexts.jsonl")

# Create reformulator
reformulator = qg.create_reformulator("genqr_ensemble", model="gpt-4")

# Reformulate
results = reformulator.reformulate_batch(queries)

# Save
qg.DataLoader.save_queries(
    [qg.QueryItem(r.qid, r.reformulated) for r in results],
    "reformulated.tsv"
)

CLI

# From a source checkout; the quotes stop shells such as zsh from expanding the brackets
pip install -e ".[hf,beir,dev]"
export OPENAI_API_KEY=sk-...

# Run a method (e.g., genqr_ensemble)
querygym run --method genqr_ensemble \
  --queries-tsv queries.tsv \
  --output-tsv reformulated.tsv \
  --cfg-path querygym/config/defaults.yaml

Loading Datasets

BEIR:

import querygym as qg

# Download with the BEIR library
from beir import util
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
data_path = util.download_and_unzip(url, "datasets")

# Load with querygym
queries = qg.loaders.beir.load_queries(data_path)
qrels = qg.loaders.beir.load_qrels(data_path)

MS MARCO:

import querygym as qg

# Load from local files (download with ir_datasets)
queries = qg.loaders.msmarco.load_queries("queries.tsv")
qrels = qg.loaders.msmarco.load_qrels("qrels.tsv")
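
For reference, the plain-text formats involved are simple enough to parse by hand. The sketch below assumes a two-column, tab-separated queries file (`qid<TAB>text`) and qrels in the standard four-column TREC layout (`qid iteration docid relevance`). Whether your files match these layouts is an assumption to verify, and the helper names are hypothetical rather than part of querygym.

```python
import csv
from collections import defaultdict
from typing import Dict


def parse_queries_tsv(path: str) -> Dict[str, str]:
    """Read `qid<TAB>text` rows into a dict; rows with fewer columns are skipped."""
    queries: Dict[str, str] = {}
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside query text untouched.
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 2:
                queries[row[0]] = row[1]
    return queries


def parse_trec_qrels(path: str) -> Dict[str, Dict[str, int]]:
    """Read TREC-style qrels into {qid: {docid: relevance}}."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()  # accepts tabs or spaces
            if len(fields) != 4:
                continue  # skip blank or malformed lines
            qid, _iteration, docid, relevance = fields
            qrels[qid][docid] = int(relevance)
    return dict(qrels)
```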

Examples

See the examples directory, and check examples/README.md for the full guide.

Contributing

We welcome contributions! Here's how you can help:

Adding a New Prompt

  1. Edit querygym/prompt_bank.yaml
  2. Add an entry with fields: id, method_family, version, introduced_by, license, authors, tags, template:{system,user}, notes

Adding a New Method

  1. Create a class under querygym/methods/*.py
  2. Subclass BaseReformulator, annotate VERSION, and register with @register_method("name")
  3. Pull templates via PromptBank.render(prompt_id, query=...)

Reporting Issues

  • Found a bug? Open an issue
  • Have a feature request? We'd love to hear it!

For detailed development guidelines, see the Contributing Guide in our documentation.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Citation

If you use QueryGym in your research, please cite:

@misc{bigdeli2025querygymtoolkitreproduciblellmbased,
      title={QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation}, 
      author={Amin Bigdeli and Radin Hamidi Rad and Mert Incesu and Negar Arabzadeh and Charles L. A. Clarke and Ebrahim Bagheri},
      year={2025},
      eprint={2511.15996},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2511.15996}, 
}
