Query Generator

Query Generator is a typed Python library and CLI for generating search-query records from document rows. It is designed for retrieval and evaluation workflows where each source document should be expanded into natural-language queries while preserving the original document content and metadata.

Highlights

Configurable QueryGenerator contract for producing structured query output.
OpenAI and Ollama-compatible generator implementations.
Prompt abstraction for project-specific query-generation instructions.
CLI for generating queries from a JSON row, file, or standard input.
RetrievalBase TextPreprocessor integration for expanding datasets into query-enriched rows.
Typed Pydantic settings models for config-driven component loading.
Retry handling for empty, invalid, or provider-failed model responses.

Overview

Query Generator turns document rows into structured query records. A row is expected to look like a RetrievalBase text row:

{
  "page_content": "Retrieval augmented generation combines search with language models.",
  "metadata": {
    "source": "paper-1",
    "page": 3
  }
}

A generator renders a prompt from the row, calls a model provider, and returns JSON in this shape:

{
  "queries": [
    {
      "query": "what is retrieval augmented generation?"
    }
  ]
}

When used as a TextPreprocessor, each generated query becomes a new dataset row that keeps the original page_content and metadata and adds a query field to the metadata. This is useful for building retrieval evaluation datasets, generating synthetic search queries, creating query-document pairs, and preparing batches before indexing or scoring.

Installation

This project requires Python 3.11 or newer.

For local development from this repository, use uv:

uv sync --group dev --all-extras

Install production dependencies only:

make install

Usage

Define a Prompt

Prompts are application-specific. Implement Prompt.render() to turn a document row into model instructions.

from typing import Any

from query_generator.prompt import Prompt
from query_generator.settings import PromptSettings


class RetrievalPrompt(Prompt[PromptSettings]):
    def render(self, row: dict[str, Any]) -> str:
        return (
            f"Generate {self.config.n_queries} search queries for this passage.\n"
            "Return JSON with a top-level 'queries' list. "
            "Each item must contain a string field named 'query'.\n\n"
            f"Passage:\n{row['page_content']}"
        )

Generate with OpenAI

from query_generator.generators.openai import OpenAIQueryGenerator
from query_generator.settings import OpenAIQueryGeneratorSettings, PromptSettings


generator = OpenAIQueryGenerator(
    OpenAIQueryGeneratorSettings(
        module_path="query_generator.generators.openai.OpenAIQueryGenerator",
        prompt=PromptSettings(
            module_path="your_package.prompts.RetrievalPrompt",
            n_queries=3,
        ),
        model_name="gpt-4.1-mini",
        temperature=0.2,
        max_retries=3,
    )
)

result = generator.generate(
    {
        "page_content": "Retrieval augmented generation combines search with language models.",
        "metadata": {"source": "paper-1", "page": 3},
    }
)

Set provider credentials in the environment variables that the OpenAI Python client reads, for example:

export OPENAI_API_KEY="..."
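The max_retries setting bounds how often a generator asks the model again when a response is empty, invalid, or the provider call fails. The sketch below shows one way such a loop might look; it is not the library's implementation. The names generate_with_retries and call_model are hypothetical, and treating max_retries as the total number of attempts is an assumption.

```python
import json
from collections.abc import Callable
from typing import Any


def generate_with_retries(
    call_model: Callable[[], str],  # hypothetical: returns the raw model response text
    max_retries: int = 3,  # assumption: counted as the total number of attempts
) -> dict[str, Any]:
    """Call the model until it returns a non-empty JSON object or attempts run out."""
    last_error: Exception | None = None
    for _ in range(max_retries):
        try:
            raw = call_model()
            if not raw or not raw.strip():
                raise ValueError("empty model response")
            parsed = json.loads(raw)  # raises a ValueError subclass on invalid JSON
            if not isinstance(parsed, dict):
                raise ValueError("model response is not a JSON object")
            return parsed
        except Exception as exc:  # provider failure, empty response, or invalid JSON
            last_error = exc
    raise RuntimeError(f"no valid response after {max_retries} attempts") from last_error
```

A real generator would likely catch narrower exception types and might wait between attempts; both are left out here to keep the control flow visible.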

Generate with Ollama

The Ollama generator uses Ollama's OpenAI-compatible API and automatically normalizes the base URL to include /v1.

from query_generator.generators.ollama import OllamaQueryGenerator
from query_generator.settings import OllamaQueryGeneratorSettings, PromptSettings


generator = OllamaQueryGenerator(
    OllamaQueryGeneratorSettings(
        module_path="query_generator.generators.ollama.OllamaQueryGenerator",
        prompt=PromptSettings(
            module_path="your_package.prompts.RetrievalPrompt",
            n_queries=3,
        ),
        model="llama3.1",
        base_url="http://localhost:11434",
        temperature=0.2,
        max_retries=3,
    )
)

result = generator.generate(
    {
        "page_content": "A document passage to turn into search queries.",
        "metadata": {"source": "local-doc"},
    }
)
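The base URL normalization mentioned above can be pictured with a small helper. This is a sketch of the idea only: normalize_base_url is a hypothetical name, and the exact rules the library applies (for example, how it treats trailing slashes) are assumptions.

```python
def normalize_base_url(base_url: str) -> str:
    """Return base_url ending in exactly one /v1 path segment."""
    trimmed = base_url.rstrip("/")  # drop trailing slashes before checking the suffix
    if trimmed.endswith("/v1"):
        return trimmed
    return f"{trimmed}/v1"


local_url = normalize_base_url("http://localhost:11434")
```

With this behavior, passing either http://localhost:11434 or http://localhost:11434/v1 as base_url would reach the same OpenAI-compatible endpoint.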

Use the CLI

Create a YAML config that resolves to a QueryGenerator:

generator:
  module_path: query_generator.generators.openai.OpenAIQueryGenerator
  prompt:
    module_path: your_package.prompts.RetrievalPrompt
    n_queries: 3
  model_name: gpt-4.1-mini
  temperature: 0.2
  max_retries: 3
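Each module_path value in the config is a dotted path to a class. How the library resolves these paths is not shown here; the sketch below illustrates the general technique with the standard importlib module. The helper name resolve_module_path is hypothetical, and a standard-library class stands in for a real component.

```python
import importlib
from typing import Any


def resolve_module_path(module_path: str) -> Any:
    """Import the object named by a dotted path such as 'package.module.ClassName'."""
    module_name, _, attribute = module_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted path, got {module_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# A standard-library class stands in for a generator or prompt class.
ordered_dict_cls = resolve_module_path("collections.OrderedDict")
```

For a path such as your_package.prompts.RetrievalPrompt to resolve this way, your_package must be importable in the environment where the CLI runs.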

Generate queries from an inline JSON row:

query-generator generate \
  --config config.yaml \
  --config-key generator \
  --row-json '{"page_content":"Retrieval augmented generation combines search with language models.","metadata":{"source":"paper-1","page":3}}'

Generate from a row file:

query-generator generate \
  --config config.yaml \
  --config-key generator \
  --row-file row.json

Or pipe JSON through standard input:

cat row.json | query-generator generate --config config.yaml --config-key generator

Use as a RetrievalBase Preprocessor

QueryGeneratorPreprocessor expands each input row into one output row per generated query. Each output row keeps the original page_content; its metadata is copied and extended with a query field.

from query_generator.preprocessor import QueryGeneratorPreprocessor
from query_generator.settings import (
    OpenAIQueryGeneratorSettings,
    PromptSettings,
    QueryGeneratorPreprocessorSettings,
)


settings = QueryGeneratorPreprocessorSettings[OpenAIQueryGeneratorSettings](
    module_path="query_generator.preprocessor.QueryGeneratorPreprocessor",
    kind="query_generator",
    query_generator=OpenAIQueryGeneratorSettings(
        module_path="query_generator.generators.openai.OpenAIQueryGenerator",
        prompt=PromptSettings(
            module_path="your_package.prompts.RetrievalPrompt",
            n_queries=2,
        ),
        model_name="gpt-4.1-mini",
        temperature=0.2,
        max_retries=3,
    ),
)

preprocessor = QueryGeneratorPreprocessor.from_config(settings)
expanded_dataset = preprocessor.apply(text_dataset)
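To make the row expansion concrete, the sketch below applies the same idea to a single plain dictionary, without RetrievalBase or a model call. The function expand_row is hypothetical and only mirrors the behavior described in this section; the real preprocessor works on datasets and may differ in detail.

```python
import copy
from typing import Any


def expand_row(row: dict[str, Any], generated: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one row per generated query, keeping page_content and copying metadata."""
    expanded = []
    for item in generated["queries"]:
        # Copy so that rows do not share (or mutate) the original metadata.
        metadata = copy.deepcopy(row.get("metadata", {}))
        metadata["query"] = item["query"]
        expanded.append({"page_content": row["page_content"], "metadata": metadata})
    return expanded


row = {
    "page_content": "Retrieval augmented generation combines search with language models.",
    "metadata": {"source": "paper-1", "page": 3},
}
generated = {
    "queries": [
        {"query": "what is retrieval augmented generation?"},
        {"query": "how does RAG combine search and language models?"},
    ]
}
rows = expand_row(row, generated)
```

Here rows holds two entries with identical page_content, each carrying the original source and page values plus its own query.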

Expected Output Contract

Generators must return a dictionary with a top-level queries list. Each item must be a dictionary with a string query field:

{
    "queries": [
        {"query": "first generated query"},
        {"query": "second generated query"},
    ]
}

The preprocessor raises InvalidQueryGeneratorOutputError when this shape is not met.
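A check for this contract can be written in a few lines. The sketch below is illustrative and is not the package's validation code: validate_output is a hypothetical helper, and InvalidOutputError is a local stand-in for InvalidQueryGeneratorOutputError so that the example runs without the package installed.

```python
from typing import Any


class InvalidOutputError(ValueError):
    """Local stand-in for the package's InvalidQueryGeneratorOutputError."""


def validate_output(output: Any) -> list[str]:
    """Check the generator output shape and return the query strings."""
    if not isinstance(output, dict):
        raise InvalidOutputError("output must be a dictionary")
    queries = output.get("queries")
    if not isinstance(queries, list):
        raise InvalidOutputError("'queries' must be a list")
    result = []
    for index, item in enumerate(queries):
        if not isinstance(item, dict) or not isinstance(item.get("query"), str):
            raise InvalidOutputError(f"item {index} needs a string 'query' field")
        result.append(item["query"])
    return result


valid_queries = validate_output(
    {"queries": [{"query": "first generated query"}, {"query": "second generated query"}]}
)
```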

Project Structure

query-generator/
|-- src/query_generator/
|   |-- generators/
|   |   |-- openai.py       # OpenAI chat-completions generator
|   |   `-- ollama.py       # Ollama OpenAI-compatible generator
|   |-- prompt/             # Prompt base contract
|   |-- exceptions.py       # Package and CLI exceptions
|   |-- main.py             # query-generator CLI
|   |-- preprocessor.py     # RetrievalBase TextPreprocessor integration
|   |-- settings.py         # Typed component settings
|   `-- py.typed            # Type information marker
|-- tests/
|   |-- fixtures/           # Shared pytest fixtures and test components
|   |-- unit/               # Unit tests
|   `-- integration/        # Config-loading tests
|-- pyproject.toml
|-- Makefile
|-- uv.lock
`-- README.md

Common Use Cases

Generate synthetic queries for retrieval evaluation datasets.
Expand document rows into query-document training or scoring records.
Keep model-provider logic replaceable behind a common generator interface.
Run query generation from YAML or JSON component configs.
Use local Ollama models for development before switching to a hosted provider.
Integrate query generation into RetrievalBase dataset preprocessing pipelines.

Development

Run tests:

make test

Run formatting and linting:

make format
make lint

Run type checking:

make type-check

Run security checks:

make security

Run the local CI equivalent:

make ci

Feedback and Contributing

Bug reports, feature requests, and implementation ideas are welcome. Include:

What you expected to happen.
What actually happened.
A minimal config, row, or test case when possible.
The Python version, provider, model, and relevant dependency versions.

Good contributions include new generator providers, reusable prompt implementations, validation improvements, tests, examples, and documentation updates.

Release history

2.0.0 (this version): May 21, 2026
1.0.0: May 20, 2026

