
LLM-based Query Reformulation Toolkit


Python 3.9+ | License: Apache 2.0


A lightweight, reproducible toolkit for LLM-based query reformulation

📚 Documentation | 📊 Leaderboard | 📦 PyPI | 📄 Paper


Features

  • Single Prompt Bank (YAML) with metadata
  • Simple DataLoader: Dependency-free file loading for queries, qrels, and contexts
  • Format Loaders: Optional BEIR and MS MARCO format loaders in querygym.loaders
  • OpenAI-compatible LLM client (works with any OpenAI API–compatible endpoint)
  • Pyserini optional: either pass contexts (JSONL) or pass a retriever instance to build contexts
  • Export-only: emits reformulated queries; optionally generates a bash script for Pyserini + trec_eval
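
To make the "dependency-free file loading" bullet concrete, here is a minimal sketch of reading a queries file with only the standard library. It assumes a two-column, tab-separated layout (query ID, then query text) with no header row. The function name `read_queries_tsv` is hypothetical and this is not QueryGym's own loader.

```python
import csv
from pathlib import Path
from typing import Dict


def read_queries_tsv(path: str) -> Dict[str, str]:
    """Read a headerless `qid<TAB>query` file into a dict (assumed layout)."""
    queries: Dict[str, str] = {}
    with Path(path).open(newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks inside queries untouched.
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            qid, text = row[0], row[1]
            queries[qid] = text.strip()
    return queries
```

In practice `qg.load_queries("queries.tsv")` from the Quickstart is the call to use; the sketch only shows the kind of file it is expected to consume.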

Supported Methods

QueryGym implements the following query reformulation methods:

Method | Description | Paper
GenQR | Generic keyword expansion using LLM | Wang et al., 2023
GenQR Ensemble | Ensemble of 10 instruction variants for diverse keyword expansion | Dhole & Agichtein, 2024
Query2Doc | Generates pseudo-documents from LLM knowledge | Wang et al., 2023
QA Expand | Question-answer based expansion with sub-questions | Seo et al., 2025
MuGI | Multi-granularity information expansion with adaptive concatenation | Zhang et al., 2024
LameR | Context-based passage synthesis using retrieved documents | Mackie et al., 2023
CSQE | Context-based sentence-level query expansion (KEQE + CSQE) | Lee et al., 2024
ThinkQE | Multi-round reasoning-based query expansion with corpus feedback | Le et al., 2025
Query2E | Query to entity/keyword expansion | Jagerman et al., 2023

For detailed usage and parameters, see the Methods Reference.
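
As an illustration of how a generated pseudo-document is typically combined with the original query for sparse retrieval, the sketch below follows the concatenation described in the Query2Doc paper: the query is repeated several times before the pseudo-document is appended, so the original terms keep their weight. The function name and the default of five repetitions are assumptions for illustration; QueryGym's own implementation and defaults may differ.

```python
def concat_query_and_pseudo_doc(query: str, pseudo_doc: str, repeats: int = 5) -> str:
    """Repeat the query `repeats` times, then append the pseudo-document."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    parts = [query.strip()] * repeats
    pseudo_doc = " ".join(pseudo_doc.split())  # collapse newlines and extra spaces
    if pseudo_doc:
        parts.append(pseudo_doc)
    return " ".join(parts)


expanded = concat_query_and_pseudo_doc(
    "what causes tides",
    "Tides are caused by the gravitational pull\nof the moon and the sun.",
    repeats=2,
)
print(expanded)
```

The pseudo-document text here is hand-written; in a real run it would come from the LLM.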

Installation

Option 1: Install from PyPI

pip install querygym

Option 2: Use Docker (Recommended for Quick Start)

# GPU version (default)
docker pull ghcr.io/ls3-lab/querygym:latest
docker run -it --gpus all ghcr.io/ls3-lab/querygym:latest

# CPU version (lightweight)
docker pull ghcr.io/ls3-lab/querygym:cpu
docker run -it ghcr.io/ls3-lab/querygym:cpu

# Or use Docker Compose
docker compose run --rm querygym

📖 Docker Setup: See DOCKER_SETUP.md for quick start or the full Docker guide for detailed usage.

Quickstart

Python API (Recommended)

import querygym as qg

# Load data
queries = qg.load_queries("queries.tsv")
qrels = qg.load_qrels("qrels.txt")
contexts = qg.load_contexts("contexts.jsonl")

# Create reformulator
reformulator = qg.create_reformulator("genqr_ensemble", model="gpt-4")

# Reformulate
results = reformulator.reformulate_batch(queries)

# Save
qg.DataLoader.save_queries(
    [qg.QueryItem(r.qid, r.reformulated) for r in results],
    "reformulated.tsv"
)

CLI

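# Run from a clone of the repository; the quotes stop shells such as zsh from globbing the brackets
pip install -e ".[hf,beir,dev]"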
export OPENAI_API_KEY=sk-...

# Run a method (e.g., genqr_ensemble)
querygym run --method genqr_ensemble \
  --queries-tsv queries.tsv \
  --output-tsv reformulated.tsv \
  --cfg-path querygym/config/defaults.yaml

Loading Datasets

BEIR:

import querygym as qg

# Download with the BEIR library (the download helper lives in beir.util)
from beir import util
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip"
data_path = util.download_and_unzip(url, "datasets")

# Load with querygym
queries = qg.loaders.beir.load_queries(data_path)
qrels = qg.loaders.beir.load_qrels(data_path)

MS MARCO:

import querygym as qg

# Load from local files (download with ir_datasets)
queries = qg.loaders.msmarco.load_queries("queries.tsv")
qrels = qg.loaders.msmarco.load_qrels("qrels.tsv")
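
For readers unfamiliar with qrels files, the following sketch parses the four-column TREC-style layout (`qid`, an unused iteration field, `docid`, relevance) that MS MARCO qrels files use. It is a standard-library illustration with a hypothetical function name, not the code behind `qg.loaders.msmarco.load_qrels`, whose return type may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map each query ID to {doc ID: relevance} from TREC-style qrels lines."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        fields = line.split()  # handles both tabs and spaces
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, docid, relevance = fields
        qrels[qid][docid] = int(relevance)
    return dict(qrels)


sample = ["1\t0\tD7\t1", "1\t0\tD9\t0", "", "2 0 D3 2"]
print(parse_qrels(sample))
```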

Examples

See the examples directory for runnable examples; examples/README.md has the full guide.

Contributing

We welcome contributions! Here's how you can help:

Adding a New Prompt

  1. Edit querygym/prompt_bank.yaml
  2. Add an entry with fields: id, method_family, version, introduced_by, license, authors, tags, template:{system,user}, notes
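
The field list above can be pictured as the following entry, written here as a Python dict that mirrors the YAML structure. Only the field names come from the list. Every value, the `{query}` placeholder syntax, and the helper `render_entry` are hypothetical, so check an existing entry in querygym/prompt_bank.yaml for the real conventions.

```python
REQUIRED_FIELDS = {
    "id", "method_family", "version", "introduced_by", "license",
    "authors", "tags", "template", "notes",
}

# Hypothetical entry; all values are placeholders.
entry = {
    "id": "my_method.v1",
    "method_family": "my_method",
    "version": 1,
    "introduced_by": "Example et al.",
    "license": "CC-BY-4.0",
    "authors": ["A. Example"],
    "tags": ["keywords"],
    "template": {
        "system": "You expand search queries.",
        "user": "List keywords for: {query}",
    },
    "notes": "Illustrative only.",
}


def render_entry(entry: dict, **variables: str) -> list:
    """Check required fields, then fill the system and user templates."""
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        raise KeyError(f"missing fields: {sorted(missing)}")
    return [
        {"role": role, "content": entry["template"][role].format(**variables)}
        for role in ("system", "user")
    ]


messages = render_entry(entry, query="solar panels")
print(messages[1]["content"])  # List keywords for: solar panels
```

Validating the field set before rendering is a design choice of this sketch; it makes a missing `template` or `license` fail early rather than at query time.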

Adding a New Method

  1. Create a class under querygym/methods/*.py
  2. Subclass BaseReformulator, annotate VERSION, and register with @register_method("name")
  3. Pull templates via PromptBank.render(prompt_id, query=...)

Reporting Issues

  • Found a bug? Open an issue
  • Have a feature request? We'd love to hear it!

For detailed development guidelines, see the Contributing Guide in our documentation.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Citation

If you use QueryGym in your research, please cite:

@misc{bigdeli2025querygymtoolkitreproduciblellmbased,
      title={QueryGym: A Toolkit for Reproducible LLM-Based Query Reformulation}, 
      author={Amin Bigdeli and Radin Hamidi Rad and Mert Incesu and Negar Arabzadeh and Charles L. A. Clarke and Ebrahim Bagheri},
      year={2025},
      eprint={2511.15996},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2511.15996}, 
}


