
Lightweight Natural Language Query validator — keep your LLM assistant on-topic


nlq-validator

A lightweight Natural Language Query (NLQ) validator that keeps your LLM assistant on-topic. Train it on a handful of example questions, and it will accept in-domain queries while rejecting off-topic ones — no server, no API key required for the core functionality.

Features

  • TF-IDF scoring out of the box — no model downloads needed
  • Semantic embeddings via sentence-transformers for paraphrase-aware matching
  • Threshold calibration — find the F1-optimal cutoff for your domain
  • Incremental retraining — add examples without rebuilding from scratch
  • LLM-powered question generation — auto-generate training data from your system prompt (Claude, ChatGPT, Gemini, Mistral, Grok, Perplexity)
  • Async support for all LLM integrations
  • Minimal footprint: scikit-learn is the only runtime dependency of the core validator
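To make the TF-IDF scoring idea concrete, here is a minimal sketch of how a query can be scored against a handful of training questions and compared with a threshold. It is not the package's implementation: it uses only the standard library to stay self-contained (the package itself relies on scikit-learn), and the names `tokenize`, `build_idf`, `tfidf_vector`, `cosine`, and `score` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(documents: list[str]) -> dict[str, float]:
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(tokenize(doc)))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(text: str, idf: dict[str, float]) -> dict[str, float]:
    """Sparse TF-IDF vector; terms never seen in training are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(query: str, examples: list[str]) -> float:
    """Highest similarity between the query and any training example."""
    idf = build_idf(examples)
    query_vec = tfidf_vector(query, idf)
    return max(cosine(query_vec, tfidf_vector(ex, idf)) for ex in examples)
```

A query whose score falls below the threshold would be treated as off-topic. A query that shares no vocabulary with the training set scores exactly 0.0 in this sketch, which is consistent with the `score=0.000` shown in the quick start error message.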

Installation

pip install nlq-validator

Optional extras

pip install 'nlq-validator[embeddings]'   # sentence-transformers for semantic matching
pip install 'nlq-validator[anthropic]'    # Claude integration
pip install 'nlq-validator[openai]'       # ChatGPT, Grok, Perplexity integrations
pip install 'nlq-validator[gemini]'       # Google Gemini integration
pip install 'nlq-validator[mistral]'      # Mistral integration
pip install 'nlq-validator[all-llm]'      # All LLM integrations

Quick start

from nlq_validator import NLQValidator

SYSTEM_PROMPT = (
    "You are a SQL assistant. You help users write queries, "
    "understand JOINs, indexes, and query optimization."
)

# Train from a plain-text file (one question per line)
v = NLQValidator.from_training_file("questions.txt", SYSTEM_PROMPT)

result = v.validate("How do I write a SELECT statement?")
print(result.is_valid)   # True

result = v.validate("What is my horoscope today?")
print(result.is_valid)   # False
print(result.errors)     # ['Query appears off-topic (score=0.000, threshold=0.250)']
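One way to put the validator in front of an assistant is a small gate that only forwards accepted queries. The sketch below assumes nothing beyond what the quick start shows (`validate()`, `is_valid`, `errors`); `answer_if_on_topic`, `ask_llm`, and the refusal text are hypothetical names, not part of the package.

```python
from typing import Callable, Protocol


class Result(Protocol):
    """Shape of the object returned by validate(), per the quick start."""

    is_valid: bool
    errors: list[str]


class Validator(Protocol):
    def validate(self, query: str) -> Result: ...


def answer_if_on_topic(
    validator: Validator,
    query: str,
    ask_llm: Callable[[str], str],  # hypothetical: your own LLM call
    refusal: str = "Sorry, I can only help with questions in my domain.",
) -> str:
    """Forward the query to the LLM only when the validator accepts it."""
    result = validator.validate(query)
    if not result.is_valid:
        # Off-topic: skip the LLM call entirely and return a fixed reply.
        return refusal
    return ask_llm(query)
```

Rejecting before the LLM call means off-topic queries cost no tokens, and `result.errors` remains available for logging.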

Training data format

Supported file formats: .txt (one question per line), .csv (first column), .json (list of strings or list of {"text": "..."} objects).

# questions.txt
How do I write a SELECT statement?
What is a SQL JOIN?
How do I filter rows with a WHERE clause?
What is the difference between INNER JOIN and LEFT JOIN?
...
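If you want to inspect or pre-clean a training file before passing it to the validator, the three formats can be read along these lines. This is a sketch under assumptions, not the package's loader: `load_questions` is a hypothetical helper, and whether the package skips CSV header rows or blank lines is not stated above.

```python
import csv
import json
from pathlib import Path


def load_questions(path: str) -> list[str]:
    """Hypothetical reader for the .txt, .csv, and .json formats."""
    file = Path(path)
    suffix = file.suffix.lower()
    if suffix == ".txt":
        # One question per line.
        rows = file.read_text(encoding="utf-8").splitlines()
    elif suffix == ".csv":
        # Only the first column is used; empty rows are skipped.
        with file.open(newline="", encoding="utf-8") as handle:
            rows = [row[0] for row in csv.reader(handle) if row]
    elif suffix == ".json":
        # Either a list of strings or a list of {"text": "..."} objects.
        items = json.loads(file.read_text(encoding="utf-8"))
        rows = [item["text"] if isinstance(item, dict) else item for item in items]
    else:
        raise ValueError(f"Unsupported training file format: {suffix}")
    # Assumption: blank entries carry no signal, so drop them.
    return [row.strip() for row in rows if row.strip()]
```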

Threshold calibration

The default threshold of 0.25 is a conservative starting point. Use calibrate() with labelled in-domain and off-domain examples to find the F1-optimal threshold for your domain:

in_domain = ["How do I use GROUP BY?", "What is a primary key?", ...]
off_domain = ["How do I bake bread?", "What is my horoscope?", ...]

result = v.calibrate(in_domain, off_domain)
result.summary()          # prints precision/recall/F1 table
v.apply_calibration(result)  # applies suggested threshold
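For intuition, an F1-optimal cutoff can be found by sweeping candidate thresholds over the scores of labelled examples. The sketch below is illustrative only and is not how calibrate() is implemented: `f1_optimal_threshold` is a hypothetical function, and it assumes a query is accepted when its score is greater than or equal to the threshold.

```python
def f1_optimal_threshold(
    in_scores: list[float], off_scores: list[float]
) -> tuple[float, float]:
    """Return (threshold, f1) maximizing F1, with in-domain as the positive class."""
    best_threshold, best_f1 = 0.0, 0.0
    # Only observed scores can change the confusion matrix, so sweep those.
    for threshold in sorted(set(in_scores) | set(off_scores)):
        tp = sum(s >= threshold for s in in_scores)   # in-domain accepted
        fp = sum(s >= threshold for s in off_scores)  # off-domain accepted
        fn = len(in_scores) - tp                      # in-domain rejected
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1
```

With in-domain scores `[0.6, 0.4, 0.8]` and off-domain scores `[0.1, 0.05, 0.3]`, the sweep settles on a threshold of 0.4, the lowest value that accepts every in-domain example while rejecting every off-domain one.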

Incremental retraining

v.retrain(["How do I write a CTE?", "What is a window function?"])
# or from a file:
v.retrain_from_file("more_questions.txt")

Semantic embeddings

To match queries that use different words but mean the same thing, pass an embedding model when building the validator:

v = NLQValidator.from_training_file(
    "questions.txt",
    SYSTEM_PROMPT,
    embedding_model="all-MiniLM-L6-v2",   # requires nlq-validator[embeddings]
)

LLM-powered question generation

Generate training data automatically from your system prompt:

from nlq_validator.integrations.claude import ClaudeIntegration

llm = ClaudeIntegration()   # reads ANTHROPIC_API_KEY env var
v = NLQValidator.from_llm(llm, SYSTEM_PROMPT, count=50)

Async variant:

# Must run inside a coroutine (or a REPL/notebook that supports top-level await)
v = await NLQValidator.from_llm_async(llm, SYSTEM_PROMPT, count=50)

Supported LLM providers

Provider    Extra        Class                  Env vars
Claude      [anthropic]  ClaudeIntegration      ANTHROPIC_API_KEY, ANTHROPIC_MODEL
ChatGPT     [openai]     ChatGPTIntegration     OPENAI_API_KEY, OPENAI_MODEL
Gemini      [gemini]     GeminiIntegration      GEMINI_API_KEY, GEMINI_MODEL
Mistral     [mistral]    MistralIntegration     MISTRAL_API_KEY, MISTRAL_MODEL
Grok        [openai]     GrokIntegration        XAI_API_KEY, XAI_MODEL
Perplexity  [openai]     PerplexityIntegration  PERPLEXITY_API_KEY, PERPLEXITY_MODEL

Save and load

v.save("my_model.pkl")
v2 = NLQValidator.load("my_model.pkl")

License

MIT — see LICENSE.


Download files


Source Distribution

nlq_validator-0.1.0.tar.gz (15.6 kB)

Uploaded Source

Built Distribution


nlq_validator-0.1.0-py3-none-any.whl (15.7 kB)

Uploaded Python 3

File details

Details for the file nlq_validator-0.1.0.tar.gz.

File metadata

  • Download URL: nlq_validator-0.1.0.tar.gz
  • Upload date:
  • Size: 15.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nlq_validator-0.1.0.tar.gz
Algorithm Hash digest
SHA256 468a9fdde3956fae09ef8dfd693c2ff5d57dc9e9cba20eb5f0b8d794cfe0c828
MD5 d937683d1bea319e6bdd1ce239345a28
BLAKE2b-256 8195107978ce9a49487b9d3fc6f8513397bffff13e5481718cbd026edaab55e0
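A downloaded file can be checked against a published digest with the standard library. The following is a minimal sketch: `sha256_of_file` and `matches_published_hash` are hypothetical helper names, and the expected value is the SHA256 digest listed above for the source distribution.

```python
import hashlib

# SHA256 digest published above for nlq_validator-0.1.0.tar.gz
EXPECTED_SDIST_SHA256 = (
    "468a9fdde3956fae09ef8dfd693c2ff5d57dc9e9cba20eb5f0b8d794cfe0c828"
)


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash the file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, expected: str = EXPECTED_SDIST_SHA256) -> bool:
    """True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_published_hash("nlq_validator-0.1.0.tar.gz")` would return True only for a byte-identical copy of the published archive.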


Provenance

The following attestation bundles were made for nlq_validator-0.1.0.tar.gz:

Publisher: publish.yml on balajeekalyan/nlq-validator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file nlq_validator-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: nlq_validator-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 15.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for nlq_validator-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 27e2b622219c06c38408e6bc8bae382de0d20a4ecee6ce709213b5ed3d8b7fef
MD5 4f60f137ee4c6c9c11005fee8e24865d
BLAKE2b-256 b2aa55625e1f82e4961ad37c82c7faff5ce6f0e618cb7a8d51dc4830c85326fd


Provenance

The following attestation bundles were made for nlq_validator-0.1.0-py3-none-any.whl:

Publisher: publish.yml on balajeekalyan/nlq-validator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
