LLM-free response quality scoring
Grade every response. No second LLM call. Zero cost. Deterministic.
Why llamella?
Teams deploying LLM agents and RAG systems can't manually review every response. Existing tools rely on LLM-as-judge (a second LLM call per evaluation), which costs $0.01–0.05 per evaluation, takes 2–5s, and gives non-deterministic results. llamella scores every response locally using NLI models and embedding similarity. No per-evaluation API cost. Deterministic. 100% coverage.
How it's different
| Feature | llamella | DeepEval | RAGAS | TruthScore |
|---|---|---|---|---|
| Cost per eval | $0.00 | $0.01–0.05 | $0.01–0.05 | Requires LLM |
| Latency (GPU) | 10–50ms | 2–5s | 2–5s | 2–5s+ |
| Latency (CPU) | 600ms–2s | 2–5s | 2–5s | 2–5s+ |
| LLM call required | No | Yes | Yes | Yes (claim decomposition) |
| Deterministic | Yes | No | No | No |
| Runs offline | Yes | No | No | Partial (Ollama) |
| Feedback loop | Yes | No | No | No |
| Metrics | 5 + composite | 50+ (LLM-judged) | 4 (LLM-judged) | 1 |
Quick start
pip install llamella
from llamella import Auditor

auditor = Auditor()
result = auditor.score(
    query="What is our refund policy?",
    response="We offer a 30-day full refund at no extra cost.",
    context=["All customers are eligible for a 30-day full refund at no extra cost."],
)
print(result.iqs)           # 0.93 - composite Information Quality Score
print(result.groundedness)  # 0.97
print(result.flags)         # [] - no issues
Two convenience functions for one-off scoring:
import llamella

# Returns full EntailmentResult
result = llamella.score(
    query="What is our refund policy?",
    response="We offer a 30-day full refund.",
    context=["30-day full refund at no extra cost."],
)

# Returns True if IQS >= threshold
passed = llamella.verify(
    query="What is our refund policy?",
    response="We offer a 30-day full refund.",
    context=["30-day full refund at no extra cost."],
    threshold=0.7,
)
For repeated scoring, instantiate Auditor once and reuse it - models are loaded once and cached.
How it works
Every response flows through the scoring engine, which checks five independent quality dimensions using NLI cross-encoders and embedding similarity - no LLM calls anywhere in the pipeline. Responses that score below the threshold enter the feedback loop, where corrections become guardrails for future responses.
Metrics
Five independent quality dimensions, each scored 0–1:
| Metric | What it measures | How | Typical CPU latency |
|---|---|---|---|
| Groundedness | Is the response faithful to source context? | NLI cross-encoder per claim, batched | ~800ms |
| Completeness | Did the response address all parts of the query? | Embedding similarity per query segment | ~150ms |
| Relevance | Is the response on-topic? | Cosine similarity query↔response | ~100ms |
| Consistency | Does the response contradict itself? | Pairwise NLI between sentences (capped at 25) | ~400ms |
| Confidence | How assertive vs hedged is the response? | Regex pattern matching | <1ms |
Note: Latencies are for CPU (DeBERTa-v3-base). Use device="cuda" for a 10–50× speedup on GPU. The first call also loads model weights (~5s).
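The cap on the Consistency metric matters because pairwise comparison grows quadratically with sentence count. The sketch below is not llamella's code: `split_sentences` and `sentence_pairs` are hypothetical helpers, the splitter is a deliberately naive regex, and it assumes unordered pairs. It only illustrates how a 25-sentence cap bounds the number of NLI comparisons.

```python
import re
from itertools import combinations

MAX_SENTENCES = 25  # mirrors the "capped at 25" note in the Metrics table


def split_sentences(text):
    """Naive regex splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def sentence_pairs(text, max_sentences=MAX_SENTENCES):
    """Return the unordered sentence pairs an NLI model would be asked to compare."""
    sentences = split_sentences(text)[:max_sentences]
    return list(combinations(sentences, 2))


response = "Refunds take 30 days. Refunds are instant. Contact support for help."
print(len(sentence_pairs(response)))  # 3 pairs for 3 sentences

long_response = " ".join(f"Sentence number {i}." for i in range(100))
print(len(sentence_pairs(long_response)))  # 300, i.e. 25 * 24 / 2
```

Without the cap, the 100-sentence response would need 4,950 comparisons instead of 300.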
IQS - the composite score
IQS (Information Quality Score) is the weighted harmonic mean of all five metrics. Harmonic mean penalizes low scores hard: a response with 0.95 groundedness but 0.1 completeness scores ~0.3, not 0.5.
Default weights:
groundedness 0.35 # most important - is it faithful?
completeness 0.25 # did it answer the full question?
relevance 0.20 # is it on topic?
consistency 0.15 # does it contradict itself?
confidence 0.05 # calibration check
When no context is provided, the groundedness weight is redistributed proportionally across the other four metrics.
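To make the arithmetic concrete, here is a minimal sketch of a weighted harmonic mean using the default weights listed above. It is an illustration, not llamella's implementation: the function name, the `eps` floor, and the choice to treat a missing metric as `None` are assumptions.

```python
DEFAULT_WEIGHTS = {
    "groundedness": 0.35,
    "completeness": 0.25,
    "relevance": 0.20,
    "consistency": 0.15,
    "confidence": 0.05,
}


def weighted_harmonic_mean(scores, weights=DEFAULT_WEIGHTS, eps=1e-6):
    """Weighted harmonic mean over the metrics that have a score.

    Metrics whose score is None are dropped and the remaining weights are
    rescaled to sum to 1 (proportional redistribution). eps is an
    illustrative floor that avoids division by zero.
    """
    active = {name: s for name, s in scores.items() if s is not None}
    total_weight = sum(weights[name] for name in active)
    denominator = sum(
        (weights[name] / total_weight) / max(score, eps)
        for name, score in active.items()
    )
    return 1.0 / denominator


# High groundedness, very low completeness, everything else perfect.
lopsided = {
    "groundedness": 0.95,
    "completeness": 0.10,
    "relevance": 1.0,
    "consistency": 1.0,
    "confidence": 1.0,
}
print(round(weighted_harmonic_mean(lopsided), 3))  # 0.306

# Weighted arithmetic mean of the same scores, for comparison.
arithmetic = sum(DEFAULT_WEIGHTS[k] * v for k, v in lopsided.items())
print(round(arithmetic, 2))  # 0.76

# No-context mode: groundedness is None, so its weight is redistributed.
no_context = {
    "groundedness": None,
    "completeness": 0.8,
    "relevance": 0.8,
    "consistency": 0.8,
    "confidence": 0.8,
}
print(round(weighted_harmonic_mean(no_context), 3))  # 0.8
```

A single weak metric pulls the harmonic mean down to about 0.31, whereas a weighted arithmetic mean of the same scores would still report about 0.76.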
Flags
llamella automatically flags specific quality issues:
| Flag | Condition |
|---|---|
| hallucination_risk | groundedness < 0.5 AND confidence > 0.7 |
| off_topic | relevance < 0.3 |
| self_contradictory | consistency < 0.7 |
| incomplete | completeness < 0.3 |
| ungrounded | groundedness < 0.3 |
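The flag conditions can be read as plain threshold checks on the metric scores. The sketch below encodes them directly; `derive_flags` is a hypothetical helper, not part of llamella's API, and skipping the groundedness-based flags when no context was supplied is an assumption.

```python
def derive_flags(scores):
    """Map metric scores (0-1) to flag names using the documented thresholds."""
    groundedness = scores.get("groundedness")  # assumed None in no-context mode
    flags = []
    if groundedness is not None and groundedness < 0.5 and scores["confidence"] > 0.7:
        flags.append("hallucination_risk")
    if scores["relevance"] < 0.3:
        flags.append("off_topic")
    if scores["consistency"] < 0.7:
        flags.append("self_contradictory")
    if scores["completeness"] < 0.3:
        flags.append("incomplete")
    if groundedness is not None and groundedness < 0.3:
        flags.append("ungrounded")
    return flags


# A confident answer that the context does not support.
example = {
    "groundedness": 0.2,
    "completeness": 0.6,
    "relevance": 0.8,
    "consistency": 0.9,
    "confidence": 0.9,
}
print(derive_flags(example))  # ['hallucination_risk', 'ungrounded']
```

Note that `hallucination_risk` combines two metrics: low groundedness alone yields only `ungrounded` once it drops below 0.3, while low groundedness paired with assertive wording is treated as the riskier case.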
No LLM anywhere
Unlike the LLM-as-judge tools compared above, llamella uses zero LLM calls:
- No LLM for judging - NLI cross-encoders evaluate entailment, not GPT-4
- No LLM for claim extraction - deterministic regex and sentence splitting, not a second model call
- No LLM for scoring - embedding similarity, not generated text
- No API key required - works offline, air-gapped, on a laptop
The only neural models used are a 350 MB NLI cross-encoder (DeBERTa-v3-base) and a 90 MB sentence embedding model (all-MiniLM-L6-v2). Both run locally on CPU or GPU.
Configuration
auditor = Auditor(
    nli_model="cross-encoder/nli-deberta-v3-base",
    embedding_model="all-MiniLM-L6-v2",
    device="cpu",  # or "cuda"
    weights={
        "groundedness": 0.40,
        "completeness": 0.20,
        "relevance": 0.20,
        "consistency": 0.15,
        "confidence": 0.05,
    },
    entailment_threshold=0.5,
    coverage_threshold=0.45,
    contradiction_threshold=0.7,
    max_sentences=25,
    max_query_length=10_000,
    max_response_length=50_000,
    max_context_items=50,
    max_context_item_length=10_000,
    max_batch_size=1_000,
)
Custom models
from llamella.models import trust_model
trust_model("myorg/fine-tuned-nli")
auditor = Auditor(nli_model="myorg/fine-tuned-nli")
No-context mode
Works without source context. Groundedness is skipped; IQS is computed from the remaining metrics.
result = auditor.score(
    query="Explain quantum computing",
    response="Quantum computing uses qubits that can be in superposition...",
)
print(result.groundedness)  # None
print(result.iqs)           # computed from remaining metrics
Batch scoring
results = auditor.score_batch([
    {"query": "...", "response": "...", "context": ["..."]},
    {"query": "...", "response": "..."},  # no context
])
Agent registry
Route scoring through per-agent configuration while sharing one model instance:
from llamella import Auditor, AgentRegistry
auditor = Auditor()
registry = AgentRegistry(auditor)
registry.register(
    "support_bot",
    weights={"groundedness": 0.45, "completeness": 0.20,
             "relevance": 0.15, "consistency": 0.15, "confidence": 0.05},
    iqs_threshold=0.8,
    context_required=True,
)
registry.register(
    "code_assistant",
    weights={"completeness": 0.40, "relevance": 0.30,
             "groundedness": 0.10, "consistency": 0.15, "confidence": 0.05},
    iqs_threshold=0.7,
)
result = registry.score(
    "support_bot",
    query="What is the refund policy?",
    response="We offer 30-day refunds.",
    context=["30-day refund policy..."],
)
stats = registry.get_stats("support_bot")
The registry is duck-type compatible with Auditor - pass it to
sample_and_score or DatabaseConnector directly.
Sampling
Score a statistically meaningful subset instead of every response:
from llamella import Auditor, sample_and_score
auditor = Auditor()
items = [
    {"query": "...", "response": "...", "context": ["..."]}
    for _ in range(50_000)
]

# Random sample
results = sample_and_score(
    auditor, items, strategy="random", sample_size=500, seed=42
)
print(results.summary())
# Example output for the random sample:
# Sampled 500/50000 (1.0%) using random strategy.
# Mean IQS: 0.872 (±0.041), 95% CI: [0.868, 0.876]
# Flags: hallucination_risk: 12 (2.4%), incomplete: 5 (1.0%)

# Auto-compute size for 95% confidence, ±3% margin
results = sample_and_score(
    auditor, items, strategy="confidence",
    confidence_level=0.95, margin_of_error=0.03, seed=42
)
Five strategies: random, percentage, stratified,
confidence, priority.
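For intuition about the confidence strategy, the sketch below computes a sample size with Cochran's formula plus a finite-population correction. Whether llamella uses exactly this formula is an assumption; `required_sample_size` and its `p` parameter are hypothetical names, and `p=0.5` is the usual worst-case proportion.

```python
import math
from statistics import NormalDist


def required_sample_size(population, confidence_level=0.95,
                         margin_of_error=0.03, p=0.5):
    """Cochran's sample size with finite-population correction."""
    # Two-sided z-score, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    # Shrink it when the population itself is small.
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))


print(required_sample_size(50_000))  # 1045
```

Under these assumptions, 95% confidence with a ±3% margin on 50,000 items needs roughly 1,045 scored responses, about 2% of the total.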
Database connector
Score responses directly from a SQL database:
pip install "llamella[database]"
from llamella import Auditor
from llamella.connectors import DatabaseConnector
auditor = Auditor()
connector = DatabaseConnector(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    source_table="llm_responses",
    column_map={
        "query": "user_query",
        "response": "agent_response",
        "context": "rag_chunks",
    },
    result_table="llamella_scores",
)
connector.score_all(auditor)
connector.score_incremental(auditor, cursor_column="created_at")
connector.score_sampled(auditor, strategy="random", sample_size=500, seed=42)
Supports PostgreSQL, MySQL, SQLite, BigQuery, and Snowflake.
Feedback loop
llamella doesn't just score - it learns. Flagged responses enter a correction pipeline. Human-reviewed corrections are stored and injected back into future prompts as guardrails, preventing the same mistake twice.
from llamella.feedback.store import FeedbackStore, CorrectionRecord
from llamella.feedback.injector import GuardrailInjector

store = FeedbackStore("corrections.jsonl")
injector = GuardrailInjector(store)

# query, response, and context come from your application
result = auditor.score(query=query, response=response, context=context)

# Record a human-reviewed correction for low-scoring responses
if result.iqs < 0.7:
    store.add(CorrectionRecord(
        id="abc123",
        timestamp="2026-06-02T00:00:00Z",
        query=query,
        response=response,
        scores=result.to_dict(),
        flags=result.flags,
        correction="The correct answer...",
        reason="Why the original was wrong",
        context_used=context,
        corrected_by="human",
    ))

# Inject relevant past corrections into the next system prompt
guardrails = injector.build_context(query=query, strategy="relevant")
system_prompt = f"You are a helpful agent.\n{guardrails}"
Encrypted storage
from cryptography.fernet import Fernet
key = Fernet.generate_key()
store = FeedbackStore("corrections.jsonl", encryption_key=key)
Data retention
store = FeedbackStore(
    "corrections.jsonl",
    max_records=10_000,
    ttl_days=90,
)
store.delete("record-id")
store.purge(before_date="2026-01-01")
store.validate_integrity()
Performance
All numbers measured on CPU (Intel i7, single thread).
Use device="cuda" for GPU acceleration.
| Operation | CPU latency | Notes |
|---|---|---|
| import llamella | ~250ms | No model loaded at import |
| First score() call | ~5–6s | Model weights downloaded and cached |
| Subsequent score(), no context | ~600ms | Embedding + regex only |
| score(), 1 context chunk | ~1s | +NLI inference |
| score(), 10 context chunks | ~2.5s | Batched NLI |
| score(), 50 context chunks | ~10s | Consider GPU for large context |
| score_batch(100) | ~60s | Sequential |
GPU latency is estimated at 10–50ms per response with preloaded models. Run python -m benchmarks.run_all --only speed on your hardware for actual numbers.
Improved sentence splitting
llamella uses a regex sentence splitter by default. For better accuracy on complex text, enable NLTK once after installation:
python -c "import llamella; llamella.setup_nltk()"
Security
- Model allowlist - only pre-approved model names are loaded. Use trust_model() to authorize custom models.
- Prompt injection protection - GuardrailInjector sanitizes all feedback fields before system-prompt interpolation.
- Encrypted feedback store - pass encryption_key= (Fernet) to encrypt records at rest.
- PII scrubbing - SSNs, emails, phone numbers, and credit cards are masked before guardrail injection.
- Input limits - configurable length limits on all inputs prevent memory exhaustion.
- Tamper detection - per-record SHA-256 hashing with sequential numbering.
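As an illustration of the tamper-detection idea, the sketch below seals each record with a sequence number and a SHA-256 digest, then re-checks both. The record layout and the `seal` and `validate` helpers are hypothetical; llamella's actual on-disk format is not documented here.

```python
import hashlib
import json


def _digest(seq, payload):
    """SHA-256 over a canonical JSON encoding of the sequence number and payload."""
    body = json.dumps({"seq": seq, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def seal(payloads):
    """Attach a sequence number and digest to each record."""
    return [
        {"seq": seq, "payload": payload, "sha256": _digest(seq, payload)}
        for seq, payload in enumerate(payloads)
    ]


def validate(records):
    """Return True if numbering is contiguous and every digest matches."""
    for expected_seq, record in enumerate(records):
        if record["seq"] != expected_seq:
            return False  # a record was removed, inserted, or reordered
        if _digest(record["seq"], record["payload"]) != record["sha256"]:
            return False  # a record's content was edited
    return True


log = seal([{"correction": "A"}, {"correction": "B"}, {"correction": "C"}])
print(validate(log))  # True

log[1]["payload"]["correction"] = "tampered"
print(validate(log))  # False
```

Sequential numbering catches deletions and reordering in the middle of the log, while the per-record hash catches edits. In this simple sketch, truncating records from the end would go unnoticed unless the expected record count is stored separately.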
Install
pip install llamella
Optional extras:
pip install "llamella[security]" # encrypted feedback storage
pip install "llamella[database]" # database connector (SQLAlchemy)
pip install "llamella[bench]" # benchmark suite dependencies
pip install "llamella[dev]" # development tools (pytest, ruff)
Contributing
See CONTRIBUTING.md for development setup, coding guidelines, and the PR process.
git clone https://github.com/sunnyguntuka/llamella
cd llamella
pip install -e ".[dev]"
pytest tests/ -v
ruff check src/ tests/
Changelog
See CHANGELOG.md.
License
Apache-2.0 - see LICENSE.