Query Generator
Query Generator is a typed Python library and CLI for generating search-query records from document rows. It is designed for retrieval and evaluation workflows where each source document should be expanded into natural-language queries while preserving the original document content and metadata.
Highlights
- Configurable QueryGenerator contract for producing structured query output.
- OpenAI and Ollama-compatible generator implementations.
- Prompt abstraction for project-specific query-generation instructions.
- CLI for generating queries from a JSON row, file, or standard input.
- RetrievalBase TextPreprocessor integration for expanding datasets into query-enriched rows.
- Typed Pydantic settings models for config-driven component loading.
- Retry handling for empty, invalid, or provider-failed model responses.
Overview
Query Generator turns document rows into structured query records. A row is expected to look like a RetrievalBase text row:
{
  "page_content": "Retrieval augmented generation combines search with language models.",
  "metadata": {
    "source": "paper-1",
    "page": 3
  }
}
A generator renders a prompt from the row, calls a model provider, and returns JSON in this shape:
{
  "queries": [
    {
      "query": "what is retrieval augmented generation?"
    }
  ]
}
When used as a TextPreprocessor, each generated query becomes a new dataset row with the original page_content and metadata plus a query metadata field. This is useful for retrieval evaluation datasets, synthetic search-query generation, query-document pair creation, and batch preparation before indexing or scoring.
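The row expansion described above can be sketched in plain Python. This is a minimal illustration rather than the library's implementation: the helper name `expand_row` is hypothetical, and it assumes the generator output already has the `queries` shape shown earlier.

```python
from copy import deepcopy
from typing import Any


def expand_row(row: dict[str, Any], generated: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one output row per generated query (hypothetical helper)."""
    expanded = []
    for item in generated["queries"]:
        # Copy the metadata so each output row owns an independent dictionary.
        metadata = deepcopy(row.get("metadata", {}))
        metadata["query"] = item["query"]
        expanded.append({"page_content": row["page_content"], "metadata": metadata})
    return expanded


row = {
    "page_content": "Retrieval augmented generation combines search with language models.",
    "metadata": {"source": "paper-1", "page": 3},
}
generated = {"queries": [{"query": "what is retrieval augmented generation?"}]}
rows = expand_row(row, generated)
print(rows[0]["metadata"])
```

Copying the metadata per query keeps the source row unchanged, so the same input row can safely be expanded more than once.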
Installation
This project requires Python 3.11 or newer.
For local development from this repository, use uv:
uv sync --group dev --all-extras
Install production dependencies only:
make install
Usage
Define a Prompt
Prompts are application-specific. Implement Prompt.render() to turn a document row into model instructions.
from typing import Any

from query_generator.prompt import Prompt
from query_generator.settings import PromptSettings


class RetrievalPrompt(Prompt[PromptSettings]):
    def render(self, row: dict[str, Any]) -> str:
        return (
            f"Generate {self.config.n_queries} search queries for this passage.\n"
            "Return JSON with a top-level 'queries' list. "
            "Each item must contain a string field named 'query'.\n\n"
            f"Passage:\n{row['page_content']}"
        )
Generate with OpenAI
from query_generator.generators.openai import OpenAIQueryGenerator
from query_generator.settings import OpenAIQueryGeneratorSettings, PromptSettings

generator = OpenAIQueryGenerator(
    OpenAIQueryGeneratorSettings(
        module_path="query_generator.generators.openai.OpenAIQueryGenerator",
        prompt=PromptSettings(
            module_path="your_package.prompts.RetrievalPrompt",
            n_queries=3,
        ),
        model_name="gpt-4.1-mini",
        temperature=0.2,
        max_retries=3,
    )
)

result = generator.generate(
    {
        "page_content": "Retrieval augmented generation combines search with language models.",
        "metadata": {"source": "paper-1", "page": 3},
    }
)
Set provider credentials in the environment expected by the OpenAI Python client, for example:
export OPENAI_API_KEY="..."
Generate with Ollama
The Ollama generator uses Ollama's OpenAI-compatible API and automatically normalizes the base URL to include /v1.
from query_generator.generators.ollama import OllamaQueryGenerator
from query_generator.settings import OllamaQueryGeneratorSettings, PromptSettings

generator = OllamaQueryGenerator(
    OllamaQueryGeneratorSettings(
        module_path="query_generator.generators.ollama.OllamaQueryGenerator",
        prompt=PromptSettings(
            module_path="your_package.prompts.RetrievalPrompt",
            n_queries=3,
        ),
        model="llama3.1",
        base_url="http://localhost:11434",
        temperature=0.2,
        max_retries=3,
    )
)

result = generator.generate(
    {
        "page_content": "A document passage to turn into search queries.",
        "metadata": {"source": "local-doc"},
    }
)
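The base URL normalization mentioned above could look roughly like the following. This is only a sketch of the idea: `normalize_base_url` is a hypothetical name, and the package's actual handling of trailing slashes or existing `/v1` suffixes may differ.

```python
def normalize_base_url(base_url: str) -> str:
    """Ensure the URL ends with a single /v1 segment (hypothetical helper)."""
    trimmed = base_url.rstrip("/")
    if trimmed.endswith("/v1"):
        # Already points at the OpenAI-compatible endpoint.
        return trimmed
    return f"{trimmed}/v1"


print(normalize_base_url("http://localhost:11434"))
```

With this approach, `http://localhost:11434`, `http://localhost:11434/`, and `http://localhost:11434/v1` all resolve to the same endpoint.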
Use the CLI
Create a YAML config that resolves to a QueryGenerator:
generator:
  module_path: query_generator.generators.openai.OpenAIQueryGenerator
  prompt:
    module_path: your_package.prompts.RetrievalPrompt
    n_queries: 3
  model_name: gpt-4.1-mini
  temperature: 0.2
  max_retries: 3
Generate queries from an inline JSON row:
query-generator generate \
--config config.yaml \
--config-key generator \
--row-json '{"page_content":"Retrieval augmented generation combines search with language models.","metadata":{"source":"paper-1","page":3}}'
Generate from a row file:
query-generator generate \
--config config.yaml \
--config-key generator \
--row-file row.json
Or pipe JSON through standard input:
cat row.json | query-generator generate --config config.yaml --config-key generator
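If rows are produced by another Python process, a `row.json` file like the one used above can be written with the standard library. The helper name `build_row_json` is hypothetical and only illustrates the expected row shape.

```python
import json
from pathlib import Path
from typing import Any


def build_row_json(page_content: str, metadata: dict[str, Any]) -> str:
    """Serialize a single document row in the page_content/metadata shape."""
    row = {"page_content": page_content, "metadata": metadata}
    return json.dumps(row, ensure_ascii=False)


if __name__ == "__main__":
    payload = build_row_json(
        "Retrieval augmented generation combines search with language models.",
        {"source": "paper-1", "page": 3},
    )
    # Write the row where the CLI examples expect to find it.
    Path("row.json").write_text(payload, encoding="utf-8")
```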
Use as a RetrievalBase Preprocessor
QueryGeneratorPreprocessor expands each input row into one output row per generated query. The output row keeps the original page_content; metadata is copied and augmented with query.
from query_generator.preprocessor import QueryGeneratorPreprocessor
from query_generator.settings import (
    OpenAIQueryGeneratorSettings,
    PromptSettings,
    QueryGeneratorPreprocessorSettings,
)

settings = QueryGeneratorPreprocessorSettings[OpenAIQueryGeneratorSettings](
    module_path="query_generator.preprocessor.QueryGeneratorPreprocessor",
    kind="query_generator",
    query_generator=OpenAIQueryGeneratorSettings(
        module_path="query_generator.generators.openai.OpenAIQueryGenerator",
        prompt=PromptSettings(
            module_path="your_package.prompts.RetrievalPrompt",
            n_queries=2,
        ),
        model_name="gpt-4.1-mini",
        temperature=0.2,
        max_retries=3,
    ),
)

preprocessor = QueryGeneratorPreprocessor.from_config(settings)
expanded_dataset = preprocessor.apply(text_dataset)
Expected Output Contract
Generators must return a dictionary with a top-level queries list. Each item must be a dictionary with a string query field:
{
    "queries": [
        {"query": "first generated query"},
        {"query": "second generated query"},
    ]
}
The preprocessor raises InvalidQueryGeneratorOutputError when this shape is not met.
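A check for this contract might look like the sketch below. It is not the package's validation code: `validate_output` is a hypothetical helper, and `InvalidOutputError` is a local stand-in for `InvalidQueryGeneratorOutputError` so the example runs without the package installed.

```python
from typing import Any


class InvalidOutputError(ValueError):
    """Local stand-in for the package's error type (assumption)."""


def validate_output(output: Any) -> list[str]:
    """Return the query strings, or raise if the contract is violated."""
    if not isinstance(output, dict):
        raise InvalidOutputError("output must be a dictionary")
    queries = output.get("queries")
    if not isinstance(queries, list):
        raise InvalidOutputError("'queries' must be a list")
    result = []
    for index, item in enumerate(queries):
        # Each item must be a dictionary carrying a string 'query' field.
        if not isinstance(item, dict) or not isinstance(item.get("query"), str):
            raise InvalidOutputError(f"item {index} needs a string 'query' field")
        result.append(item["query"])
    return result


print(validate_output({"queries": [{"query": "first generated query"}]}))
```

Validating before expansion means a malformed model response fails loudly instead of producing rows with missing or non-string queries.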
Project Structure
query-generator/
|-- src/query_generator/
| |-- generators/
| | |-- openai.py # OpenAI chat-completions generator
| | `-- ollama.py # Ollama OpenAI-compatible generator
| |-- prompt/ # Prompt base contract
| |-- exceptions.py # Package and CLI exceptions
| |-- main.py # query-generator CLI
| |-- preprocessor.py # RetrievalBase TextPreprocessor integration
| |-- settings.py # Typed component settings
| `-- py.typed # Type information marker
|-- tests/
| |-- fixtures/ # Shared pytest fixtures and test components
| |-- unit/ # Unit tests
| `-- integration/ # Config-loading tests
|-- pyproject.toml
|-- Makefile
|-- uv.lock
`-- README.md
Common Use Cases
- Generate synthetic queries for retrieval evaluation datasets.
- Expand document rows into query-document training or scoring records.
- Keep model-provider logic replaceable behind a common generator interface.
- Run query generation from YAML or JSON component configs.
- Use local Ollama models for development before switching to a hosted provider.
- Integrate query generation into RetrievalBase dataset preprocessing pipelines.
Development
Run tests:
make test
Run formatting and linting:
make format
make lint
Run type checking:
make type-check
Run security checks:
make security
Run the local CI equivalent:
make ci
Feedback and Contributing
Bug reports, feature requests, and implementation ideas are welcome. Include:
- What you expected to happen.
- What actually happened.
- A minimal config, row, or test case when possible.
- The Python version, provider, model, and relevant dependency versions.
Good contributions include new generator providers, reusable prompt implementations, validation improvements, tests, examples, and documentation updates.