Lightweight Natural Language Query validator — keep your LLM assistant on-topic
nlq-validator
A lightweight Natural Language Query (NLQ) validator that keeps your LLM assistant on-topic. Train it on a handful of example questions, and it will accept in-domain queries while rejecting off-topic ones — no server, no API key required for the core functionality.
Features
- TF-IDF scoring out of the box — no model downloads needed
- Semantic embeddings via sentence-transformers for paraphrase-aware matching
- Threshold calibration — find the F1-optimal cutoff for your domain
- Incremental retraining — add examples without rebuilding from scratch
- LLM-powered question generation — auto-generate training data from your system prompt (Claude, ChatGPT, Gemini, Mistral, Grok, Perplexity)
- Async support for all LLM integrations
- Zero runtime dependencies beyond scikit-learn for the core validator
Installation
pip install nlq-validator
Optional extras
pip install 'nlq-validator[embeddings]' # sentence-transformers for semantic matching
pip install 'nlq-validator[anthropic]' # Claude integration
pip install 'nlq-validator[openai]' # ChatGPT, Grok, Perplexity integrations
pip install 'nlq-validator[gemini]' # Google Gemini integration
pip install 'nlq-validator[mistral]' # Mistral integration
pip install 'nlq-validator[all-llm]' # All LLM integrations
Quick start
from nlq_validator import NLQValidator
SYSTEM_PROMPT = (
    "You are a SQL assistant. You help users write queries, "
    "understand JOINs, indexes, and query optimization."
)
# Train from a plain-text file (one question per line)
v = NLQValidator.from_training_file("questions.txt", SYSTEM_PROMPT)
result = v.validate("How do I write a SELECT statement?")
print(result.is_valid) # True
result = v.validate("What is my horoscope today?")
print(result.is_valid) # False
print(result.errors) # ['Query appears off-topic (score=0.000, threshold=0.250)']
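The error message above reports a similarity score and a threshold. As a rough illustration of how such a score could be produced, the sketch below computes TF-IDF cosine similarity in plain Python and takes the best match against the training questions. It is not the library's implementation: the tokenizer, the small stop-word list, the smoothed IDF formula, and every function name here are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "what", "how", "do", "i", "my", "with", "to", "of"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_idf(docs):
    """Smoothed inverse document frequency for every term in the training set."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms never seen in training are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(query, training, idf):
    """Highest cosine similarity between the query and any training question."""
    q = tfidf_vector(query, idf)
    return max(cosine(q, tfidf_vector(doc, idf)) for doc in training)


training = [
    "How do I write a SELECT statement?",
    "What is a SQL JOIN?",
    "How do I filter rows with a WHERE clause?",
]
idf = build_idf(training)
THRESHOLD = 0.25  # the default threshold quoted in the error message

on_topic = score("How do I write a SELECT statement?", training, idf)
off_topic = score("What is my horoscope today?", training, idf)
print(on_topic >= THRESHOLD, off_topic >= THRESHOLD)
```

In this sketch the horoscope query shares no non-stop-word terms with the training set, so its vector is empty and its score is zero. Without stop-word removal, common words such as "what" and "is" would produce a spurious match, which is why the stop-word step matters for a pure lexical scorer.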
Training data format
Supported file formats: .txt (one question per line), .csv (first column), .json (list of strings or list of {"text": "..."} objects).
# questions.txt
How do I write a SELECT statement?
What is a SQL JOIN?
How do I filter rows with a WHERE clause?
What is the difference between INNER JOIN and LEFT JOIN?
...
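To make the three formats concrete, here is a minimal sketch of a loader that reads each of them into a plain list of strings. The `load_questions` helper is hypothetical and is not part of nlq-validator; skipping blank lines and `#` comment lines in `.txt` files is an assumption of this sketch, and CSV header rows are not handled.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_questions(path):
    """Return non-empty question strings from a .txt, .csv, or .json file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        lines = path.read_text(encoding="utf-8").splitlines()
        # Assumption: lines starting with '#' are comments, not questions.
        questions = [ln for ln in lines if not ln.lstrip().startswith("#")]
    elif suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as fh:
            # Only the first column of each non-empty row is used.
            questions = [row[0] for row in csv.reader(fh) if row]
    elif suffix == ".json":
        data = json.loads(path.read_text(encoding="utf-8"))
        # Accept plain strings as well as {"text": "..."} objects.
        questions = [item["text"] if isinstance(item, dict) else item for item in data]
    else:
        raise ValueError(f"Unsupported training file format: {suffix}")
    return [q.strip() for q in questions if q.strip()]


# Round-trip a small JSON file to show both accepted item shapes.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "questions.json"
    sample.write_text(
        json.dumps(["What is a SQL JOIN?", {"text": "How do I use GROUP BY?"}]),
        encoding="utf-8",
    )
    loaded = load_questions(sample)

print(loaded)
```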
Threshold calibration
The default threshold of 0.25 is a conservative starting point. Use calibrate() with labeled in-domain and off-topic examples to find the F1-optimal threshold for your domain:
in_domain = ["How do I use GROUP BY?", "What is a primary key?", ...]
off_domain = ["How do I bake bread?", "What is my horoscope?", ...]
result = v.calibrate(in_domain, off_domain)
result.summary() # prints precision/recall/F1 table
v.apply_calibration(result) # applies suggested threshold
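For intuition about what an F1-optimal cutoff means, the following sketch sweeps candidate thresholds over two lists of scores and keeps the one with the highest F1, treating in-domain queries as the positive class. It is an illustration only: the function name, the `score >= threshold` acceptance rule, and the sample scores are assumptions, and the library's own calibration may differ.

```python
def f1_optimal_threshold(in_scores, off_scores):
    """Return (threshold, f1) that maximizes F1 with in-domain as positive."""
    candidates = sorted(set(in_scores) | set(off_scores))
    best_threshold, best_f1 = 0.0, -1.0
    for t in candidates:
        tp = sum(s >= t for s in in_scores)   # in-domain queries accepted
        fp = sum(s >= t for s in off_scores)  # off-topic queries accepted
        fn = len(in_scores) - tp              # in-domain queries rejected
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Strict '>' keeps the lowest threshold when several tie on F1.
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Made-up similarity scores, for illustration only.
in_scores = [0.82, 0.64, 0.41, 0.33]
off_scores = [0.05, 0.12, 0.36, 0.00]
threshold, f1 = f1_optimal_threshold(in_scores, off_scores)
print(f"threshold={threshold:.2f} f1={f1:.3f}")
```

With these scores, a cutoff of 0.33 accepts all four in-domain queries and one off-topic query, which gives a better F1 than the stricter 0.41 cutoff that rejects every off-topic query but also loses one in-domain query.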
Incremental retraining
v.retrain(["How do I write a CTE?", "What is a window function?"])
# or from a file:
v.retrain_from_file("more_questions.txt")
Semantic embeddings
To match queries that use different words but mean the same thing, pass an embedding model:
v = NLQValidator.from_training_file(
    "questions.txt",
    SYSTEM_PROMPT,
    embedding_model="all-MiniLM-L6-v2",  # requires nlq-validator[embeddings]
)
LLM-powered question generation
Generate training data automatically from your system prompt:
from nlq_validator.integrations.claude import ClaudeIntegration
llm = ClaudeIntegration() # reads ANTHROPIC_API_KEY env var
v = NLQValidator.from_llm(llm, SYSTEM_PROMPT, count=50)
Async variant:
v = await NLQValidator.from_llm_async(llm, SYSTEM_PROMPT, count=50)
Supported LLM providers
| Provider | Extra | Class | Env vars |
|---|---|---|---|
| Claude | [anthropic] | ClaudeIntegration | ANTHROPIC_API_KEY, ANTHROPIC_MODEL |
| ChatGPT | [openai] | ChatGPTIntegration | OPENAI_API_KEY, OPENAI_MODEL |
| Gemini | [gemini] | GeminiIntegration | GEMINI_API_KEY, GEMINI_MODEL |
| Mistral | [mistral] | MistralIntegration | MISTRAL_API_KEY, MISTRAL_MODEL |
| Grok | [openai] | GrokIntegration | XAI_API_KEY, XAI_MODEL |
| Perplexity | [openai] | PerplexityIntegration | PERPLEXITY_API_KEY, PERPLEXITY_MODEL |
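Before constructing an integration, it can help to check which API keys are present in the environment. The sketch below uses only the variable names from the table; the `configured_providers` helper is hypothetical and not part of nlq-validator, and how the library itself reacts to a missing key is not covered here.

```python
import os

# API-key variable for each provider, copied from the table above.
PROVIDER_KEYS = {
    "Claude": "ANTHROPIC_API_KEY",
    "ChatGPT": "OPENAI_API_KEY",
    "Gemini": "GEMINI_API_KEY",
    "Mistral": "MISTRAL_API_KEY",
    "Grok": "XAI_API_KEY",
    "Perplexity": "PERPLEXITY_API_KEY",
}


def configured_providers(environ=None):
    """Return the providers whose API-key variable is set and non-empty."""
    environ = os.environ if environ is None else environ
    return [name for name, var in PROVIDER_KEYS.items() if environ.get(var)]


# A fake environment for demonstration; an empty value counts as unset.
example = configured_providers({"ANTHROPIC_API_KEY": "sk-placeholder", "XAI_API_KEY": ""})
print(example)
print(configured_providers())  # reads the real process environment
```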
Save and load
v.save("my_model.pkl")
v2 = NLQValidator.load("my_model.pkl")
License
MIT — see LICENSE.