Query Generator

Query Generator is a typed Python library and CLI for generating search-query records from document rows. It is designed for retrieval and evaluation workflows where each source document should be expanded into natural-language queries while preserving the original document content and metadata.

Highlights

  • Configurable QueryGenerator contract for producing structured query output.
  • OpenAI and Ollama-compatible generator implementations.
  • Prompt abstraction for project-specific query-generation instructions.
  • CLI for generating queries from a JSON row, file, or standard input.
  • RetrievalBase TextPreprocessor integration for expanding datasets into query-enriched rows.
  • Typed Pydantic settings models for config-driven component loading.
  • Retry handling for empty, invalid, or provider-failed model responses.
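
The retry behavior in the last highlight can be pictured with a small loop. This is a minimal sketch, not the library's implementation: generate_with_retries and call_model are hypothetical names, and treating max_retries as the total number of attempts is an assumption.

```python
import json
from typing import Any, Callable


def generate_with_retries(
    call_model: Callable[[], str],
    max_retries: int = 3,
) -> dict[str, Any]:
    """Call the model until it returns usable JSON or attempts run out.

    Hypothetical helper: max_retries is assumed to mean total attempts.
    """
    last_error: Exception | None = None
    for _ in range(max_retries):
        try:
            raw = call_model()  # may raise on a provider failure
            if not raw or not raw.strip():
                raise ValueError("empty model response")
            parsed = json.loads(raw)  # raises a ValueError subclass on invalid JSON
            if not isinstance(parsed, dict) or not isinstance(parsed.get("queries"), list):
                raise ValueError("response lacks a top-level 'queries' list")
            return parsed
        except Exception as exc:  # broad on purpose: every failure is retried here
            last_error = exc
    raise RuntimeError(f"no valid response after {max_retries} attempts") from last_error


# A fake provider that fails twice (empty, then invalid JSON) before succeeding.
fake_responses = iter(["", "not json", '{"queries": [{"query": "what is rag?"}]}'])
result = generate_with_retries(lambda: next(fake_responses), max_retries=3)
print(result["queries"][0]["query"])
```

A real generator would likely distinguish retryable provider errors from permanent ones; the sketch retries on everything to keep the control flow visible.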

Overview

Query Generator turns document rows into structured query records. A row is expected to look like a RetrievalBase text row:

{
  "page_content": "Retrieval augmented generation combines search with language models.",
  "metadata": {
    "source": "paper-1",
    "page": 3
  }
}

A generator renders a prompt from the row, calls a model provider, and returns JSON in this shape:

{
  "queries": [
    {
      "query": "what is retrieval augmented generation?"
    }
  ]
}

When used as a TextPreprocessor, each generated query becomes a new dataset row that keeps the original page_content and metadata and stores the generated text under a query key in the metadata. This is useful for building retrieval evaluation datasets, generating synthetic search queries, creating query-document pairs, and preparing batches before indexing or scoring.
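
The row expansion described above can be illustrated in plain Python. This is a sketch of the idea only, assuming the row and output shapes shown in this section; expand_row is a hypothetical helper, not part of the package.

```python
import copy
from typing import Any


def expand_row(row: dict[str, Any], generated: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one new row per generated query, leaving the input row untouched."""
    expanded = []
    for item in generated["queries"]:
        # Copy the metadata so each output row owns its own dictionary.
        metadata = copy.deepcopy(row.get("metadata", {}))
        metadata["query"] = item["query"]
        expanded.append({"page_content": row["page_content"], "metadata": metadata})
    return expanded


row = {
    "page_content": "Retrieval augmented generation combines search with language models.",
    "metadata": {"source": "paper-1", "page": 3},
}
generated = {
    "queries": [
        {"query": "what is retrieval augmented generation?"},
        {"query": "how does RAG combine search and language models?"},
    ]
}

rows = expand_row(row, generated)
for expanded_row in rows:
    print(expanded_row["metadata"])
```

Two generated queries yield two rows that share the same page_content and differ only in metadata["query"].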

Installation

This project requires Python 3.11 or newer.

For local development from this repository, use uv:

uv sync --group dev --all-extras

Install production dependencies only:

make install

Usage

Define a Prompt

Prompts are application-specific. Implement Prompt.render() to turn a document row into model instructions.

from typing import Any

from query_generator.prompt import Prompt
from query_generator.settings import PromptSettings


class RetrievalPrompt(Prompt[PromptSettings]):
    def render(self, row: dict[str, Any]) -> str:
        return (
            f"Generate {self.config.n_queries} search queries for this passage.\n"
            "Return JSON with a top-level 'queries' list. "
            "Each item must contain a string field named 'query'.\n\n"
            f"Passage:\n{row['page_content']}"
        )

Generate with OpenAI

from query_generator.generators.openai import OpenAIQueryGenerator
from query_generator.settings import OpenAIQueryGeneratorSettings, PromptSettings


generator = OpenAIQueryGenerator(
    OpenAIQueryGeneratorSettings(
        module_path="query_generator.generators.openai.OpenAIQueryGenerator",
        prompt=PromptSettings(
            module_path="your_package.prompts.RetrievalPrompt",
            n_queries=3,
        ),
        model_name="gpt-4.1-mini",
        temperature=0.2,
        max_retries=3,
    )
)

result = generator.generate(
    {
        "page_content": "Retrieval augmented generation combines search with language models.",
        "metadata": {"source": "paper-1", "page": 3},
    }
)

Set provider credentials in the environment expected by the OpenAI Python client, for example:

export OPENAI_API_KEY="..."

Generate with Ollama

The Ollama generator uses Ollama's OpenAI-compatible API and automatically normalizes the base URL to include /v1.

from query_generator.generators.ollama import OllamaQueryGenerator
from query_generator.settings import OllamaQueryGeneratorSettings, PromptSettings


generator = OllamaQueryGenerator(
    OllamaQueryGeneratorSettings(
        module_path="query_generator.generators.ollama.OllamaQueryGenerator",
        prompt=PromptSettings(
            module_path="your_package.prompts.RetrievalPrompt",
            n_queries=3,
        ),
        model="llama3.1",
        base_url="http://localhost:11434",
        temperature=0.2,
        max_retries=3,
    )
)

result = generator.generate(
    {
        "page_content": "A document passage to turn into search queries.",
        "metadata": {"source": "local-doc"},
    }
)
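
The base URL normalization mentioned at the start of this section might work roughly as follows. This is a guess at the behavior, not the package's code; normalize_base_url is a hypothetical name.

```python
def normalize_base_url(base_url: str) -> str:
    """Return base_url ending in exactly one /v1 path segment."""
    trimmed = base_url.rstrip("/")
    if trimmed.endswith("/v1"):
        return trimmed
    return f"{trimmed}/v1"


for url in ("http://localhost:11434", "http://localhost:11434/", "http://localhost:11434/v1/"):
    print(normalize_base_url(url))
```

Under this assumption, passing either the bare host or a URL that already ends in /v1 gives the same endpoint.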

Use the CLI

Create a YAML config that resolves to a QueryGenerator:

generator:
  module_path: query_generator.generators.openai.OpenAIQueryGenerator
  prompt:
    module_path: your_package.prompts.RetrievalPrompt
    n_queries: 3
  model_name: gpt-4.1-mini
  temperature: 0.2
  max_retries: 3

Generate queries from an inline JSON row:

query-generator generate \
  --config config.yaml \
  --config-key generator \
  --row-json '{"page_content":"Retrieval augmented generation combines search with language models.","metadata":{"source":"paper-1","page":3}}'

Generate from a row file:

query-generator generate \
  --config config.yaml \
  --config-key generator \
  --row-file row.json

Or pipe JSON through standard input:

cat row.json | query-generator generate --config config.yaml --config-key generator
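
The file and standard-input commands above need a row.json file. One way to create it is sketched below, assuming the file holds a single JSON object in the row shape shown in the Overview; write_row_file is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any


def write_row_file(path: Path, row: dict[str, Any]) -> None:
    """Serialize one document row as a JSON object on disk."""
    path.write_text(json.dumps(row, indent=2), encoding="utf-8")


row = {
    "page_content": "Retrieval augmented generation combines search with language models.",
    "metadata": {"source": "paper-1", "page": 3},
}
write_row_file(Path("row.json"), row)
```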

Use as a RetrievalBase Preprocessor

QueryGeneratorPreprocessor expands each input row into one output row per generated query. Each output row keeps the original page_content, and its metadata is a copy of the original metadata with an added query field.

from query_generator.preprocessor import QueryGeneratorPreprocessor
from query_generator.settings import (
    OpenAIQueryGeneratorSettings,
    PromptSettings,
    QueryGeneratorPreprocessorSettings,
)


settings = QueryGeneratorPreprocessorSettings[OpenAIQueryGeneratorSettings](
    module_path="query_generator.preprocessor.QueryGeneratorPreprocessor",
    kind="query_generator",
    query_generator=OpenAIQueryGeneratorSettings(
        module_path="query_generator.generators.openai.OpenAIQueryGenerator",
        prompt=PromptSettings(
            module_path="your_package.prompts.RetrievalPrompt",
            n_queries=2,
        ),
        model_name="gpt-4.1-mini",
        temperature=0.2,
        max_retries=3,
    ),
)

preprocessor = QueryGeneratorPreprocessor.from_config(settings)
expanded_dataset = preprocessor.apply(text_dataset)

Expected Output Contract

Generators must return a dictionary with a top-level queries list. Each item must be a dictionary with a string query field:

{
    "queries": [
        {"query": "first generated query"},
        {"query": "second generated query"},
    ]
}

The preprocessor raises InvalidQueryGeneratorOutputError when this shape is not met.
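
The shape check can be sketched as a small validator. This is an illustration under assumptions, not the package's validation code: InvalidOutputError is a local stand-in for InvalidQueryGeneratorOutputError, and validate_output is a hypothetical function.

```python
from typing import Any


class InvalidOutputError(ValueError):
    """Local stand-in for the package's InvalidQueryGeneratorOutputError."""


def validate_output(output: Any) -> list[str]:
    """Check the contract and return the query strings in order."""
    if not isinstance(output, dict):
        raise InvalidOutputError("output must be a dictionary")
    queries = output.get("queries")
    if not isinstance(queries, list):
        raise InvalidOutputError("'queries' must be a list")
    texts = []
    for index, item in enumerate(queries):
        if not isinstance(item, dict) or not isinstance(item.get("query"), str):
            raise InvalidOutputError(
                f"queries[{index}] must be a dictionary with a string 'query' field"
            )
        texts.append(item["query"])
    return texts


valid = {"queries": [{"query": "first generated query"}, {"query": "second generated query"}]}
print(validate_output(valid))
```

Running a check like this on a custom generator's output before wiring it into the preprocessor can surface shape problems early.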

Project Structure

query-generator/
|-- src/query_generator/
|   |-- generators/
|   |   |-- openai.py       # OpenAI chat-completions generator
|   |   `-- ollama.py       # Ollama OpenAI-compatible generator
|   |-- prompt/             # Prompt base contract
|   |-- exceptions.py       # Package and CLI exceptions
|   |-- main.py             # query-generator CLI
|   |-- preprocessor.py     # RetrievalBase TextPreprocessor integration
|   |-- settings.py         # Typed component settings
|   `-- py.typed            # Type information marker
|-- tests/
|   |-- fixtures/           # Shared pytest fixtures and test components
|   |-- unit/               # Unit tests
|   `-- integration/        # Config-loading tests
|-- pyproject.toml
|-- Makefile
|-- uv.lock
`-- README.md

Common Use Cases

  • Generate synthetic queries for retrieval evaluation datasets.
  • Expand document rows into query-document training or scoring records.
  • Keep model-provider logic replaceable behind a common generator interface.
  • Run query generation from YAML or JSON component configs.
  • Use local Ollama models for development before switching to a hosted provider.
  • Integrate query generation into RetrievalBase dataset preprocessing pipelines.

Development

Run tests:

make test

Run formatting and linting:

make format
make lint

Run type checking:

make type-check

Run security checks:

make security

Run the local CI equivalent:

make ci

Feedback and Contributing

Bug reports, feature requests, and implementation ideas are welcome. Include:

  • What you expected to happen.
  • What actually happened.
  • A minimal config, row, or test case when possible.
  • The Python version, provider, model, and relevant dependency versions.

Good contributions include new generator providers, reusable prompt implementations, validation improvements, tests, examples, and documentation updates.
