llamella - LLM-free response quality scoring

Grade every response. No second LLM call. Zero cost. Deterministic.

Features: NLI scoring, evidence mapping, five metrics, IQS composite score, feedback loop, quality gate

Why llamella?

Teams deploying LLM agents and RAG systems can't manually review every response. Existing tools use LLM-as-judge - a second LLM call per evaluation - which costs $0.01–0.05/eval, takes 2–5s, and gives non-deterministic results. llamella scores every response locally using NLI models and embedding similarity. Zero cost. Deterministic. 100% coverage.

How it's different

The LLM-as-judge approach sends each response to GPT for evaluation; llamella scores locally using NLI cross-encoders and embedding similarity, with no API call.

| Feature | llamella | DeepEval | RAGAS | TruthScore |
|---|---|---|---|---|
| Cost per eval | $0.00 | $0.01–0.05 | $0.01–0.05 | Requires LLM |
| Latency (GPU) | 10–50ms | 2–5s | 2–5s | 2–5s+ |
| Latency (CPU) | 600ms–2s | 2–5s | 2–5s | 2–5s+ |
| LLM call required | No | Yes | Yes | Yes (claim decomposition) |
| Deterministic | Yes | No | No | No |
| Runs offline | Yes | No | No | Partial (Ollama) |
| Feedback loop | Yes | No | No | No |
| Metrics | 5 + composite | 50+ (LLM-judged) | 4 (LLM-judged) | 1 |

Quick start

pip install llamella

from llamella import Auditor

auditor = Auditor()

result = auditor.score(
    query="What is our refund policy?",
    response="We offer a 30-day full refund at no extra cost.",
    context=["All customers are eligible for a 30-day full refund at no extra cost."]
)

print(result.iqs)           # 0.93 - composite Information Quality Score
print(result.groundedness)  # 0.97
print(result.flags)         # [] - no issues

Two convenience functions for one-off scoring:

import llamella

# Returns full EntailmentResult
result = llamella.score(
    query="What is our refund policy?",
    response="We offer a 30-day full refund.",
    context=["30-day full refund at no extra cost."]
)

# Returns True if IQS >= threshold
passed = llamella.verify(
    query="What is our refund policy?",
    response="We offer a 30-day full refund.",
    context=["30-day full refund at no extra cost."],
    threshold=0.7
)

For repeated scoring, instantiate Auditor once and reuse it: models are loaded once and cached.

How it works

Architecture diagram: user query and LLM response enter the llamella scoring engine, which runs five parallel metrics (groundedness via NLI, completeness via embeddings, relevance via cosine similarity, consistency via pairwise NLI, confidence via regex), combines them into an IQS score, raises quality flags, and feeds corrections back as guardrails

Every response flows through the scoring engine, which checks five independent quality dimensions using NLI cross-encoders, embedding similarity, and regex matching, with no LLM calls anywhere in the pipeline. Responses that score below the threshold enter the feedback loop, where human-reviewed corrections become guardrails for future responses.

Metrics

Five independent quality dimensions, each scored 0–1:

| Metric | What it measures | How | Typical CPU latency |
|---|---|---|---|
| Groundedness | Is the response faithful to source context? | NLI cross-encoder per claim, batched | ~800ms |
| Completeness | Did the response address all parts of the query? | Embedding similarity per query segment | ~150ms |
| Relevance | Is the response on-topic? | Cosine similarity query↔response | ~100ms |
| Consistency | Does the response contradict itself? | Pairwise NLI between sentences (capped at 25) | ~400ms |
| Confidence | How assertive vs hedged is the response? | Regex pattern matching | <1ms |

Note: Latencies are for CPU (DeBERTa-v3-base). Use device="cuda" for 10–50× speedup on GPU. First call also loads model weights (~5s).
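To illustrate why the consistency check is capped, the sketch below splits a response with a simple regex and enumerates the sentence pairs a pairwise NLI pass would have to score. The splitter pattern and the helper names (`split_sentences`, `sentence_pairs`) are assumptions for illustration and not llamella's internals; only the cap of 25 sentences comes from the table above.

```python
import re
from itertools import combinations

# Hypothetical splitter: break after '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences with a deterministic regex (illustrative only)."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def sentence_pairs(text: str, max_sentences: int = 25) -> list[tuple[str, str]]:
    """Return every unordered pair among the first max_sentences sentences."""
    sentences = split_sentences(text)[:max_sentences]
    return list(combinations(sentences, 2))


response = "Refunds take 30 days. Refunds are instant! Contact support for help."
pairs = sentence_pairs(response)
print(len(pairs))  # 3 pairs for 3 sentences

# n sentences yield n * (n - 1) / 2 pairs, so the cap bounds the NLI workload.
capped = sentence_pairs(" ".join(f"Sentence {i}." for i in range(100)))
print(len(capped))  # 300 pairs for the first 25 sentences
```

Without a cap, a 100-sentence response would produce 4,950 pairs, which is why bounding the sentence count keeps latency predictable.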

IQS - the composite score

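IQS (Information Quality Score) is the weighted harmonic mean of all five metrics. The harmonic mean penalizes low scores hard: with the default weights below and the other three metrics at 1.0, a response with 0.95 groundedness but 0.1 completeness scores about 0.31, where a weighted arithmetic mean would give about 0.76.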

Default weights:
  groundedness  0.35    # most important - is it faithful?
  completeness  0.25    # did it answer the full question?
  relevance     0.20    # is it on topic?
  consistency   0.15    # does it contradict itself?
  confidence    0.05    # calibration check

When no context is provided, the groundedness weight is redistributed proportionally across the other four metrics.
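The composite can be sketched in a few lines. This is a minimal illustration of a weighted harmonic mean using the default weights listed above, not llamella's actual implementation; the function name `weighted_harmonic_mean` and the handling of zero scores are assumptions.

```python
DEFAULT_WEIGHTS = {
    "groundedness": 0.35,
    "completeness": 0.25,
    "relevance": 0.20,
    "consistency": 0.15,
    "confidence": 0.05,
}


def weighted_harmonic_mean(scores, weights=DEFAULT_WEIGHTS):
    """Weighted harmonic mean over the metrics that have a score.

    Metrics scored None are skipped. Dividing by the sum of the remaining
    weights is equivalent to redistributing the skipped weight proportionally.
    """
    active = {name: s for name, s in scores.items() if s is not None}
    if not active:
        raise ValueError("at least one metric score is required")
    # Assumption: any zero score drives the composite to zero.
    if any(s <= 0 for s in active.values()):
        return 0.0
    total_weight = sum(weights[name] for name in active)
    return total_weight / sum(weights[name] / s for name, s in active.items())


scores = {
    "groundedness": 0.95, "completeness": 0.10,
    "relevance": 1.0, "consistency": 1.0, "confidence": 1.0,
}
iqs = weighted_harmonic_mean(scores)
print(round(iqs, 3))  # 0.306

# No-context mode: groundedness is None, so its weight is spread over the rest.
no_context_iqs = weighted_harmonic_mean({**scores, "groundedness": None})
print(round(no_context_iqs, 3))  # 0.224
```

Because the formula divides by the total active weight, the weights do not need to be renormalized by hand when a metric is skipped.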

Flags

llamella automatically flags specific quality issues:

| Flag | Condition |
|---|---|
| hallucination_risk | groundedness < 0.5 AND confidence > 0.7 |
| off_topic | relevance < 0.3 |
| self_contradictory | consistency < 0.7 |
| incomplete | completeness < 0.3 |
| ungrounded | groundedness < 0.3 |
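The flag conditions are plain threshold checks, so they can be expressed directly. The sketch below is an illustration of the table above, not llamella's code: the function name `quality_flags`, the order of the returned flags, and the choice to skip groundedness-based flags when groundedness is None are all assumptions.

```python
def quality_flags(scores):
    """Apply the threshold rules from the flags table to a dict of metric scores."""
    g = scores.get("groundedness")  # None when no context was supplied
    flags = []
    if g is not None and g < 0.5 and scores["confidence"] > 0.7:
        flags.append("hallucination_risk")
    if scores["relevance"] < 0.3:
        flags.append("off_topic")
    if scores["consistency"] < 0.7:
        flags.append("self_contradictory")
    if scores["completeness"] < 0.3:
        flags.append("incomplete")
    if g is not None and g < 0.3:
        flags.append("ungrounded")
    return flags


confident_but_unsupported = {
    "groundedness": 0.2, "completeness": 0.8, "relevance": 0.9,
    "consistency": 0.95, "confidence": 0.9,
}
flags = quality_flags(confident_but_unsupported)
print(flags)  # ['hallucination_risk', 'ungrounded']
```

Note that hallucination_risk combines two metrics: a poorly grounded response is only flagged this way when it is also stated assertively.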

No LLM anywhere

Unlike the LLM-as-judge tools compared above, llamella uses zero LLM calls:

  • No LLM for judging - NLI cross-encoders evaluate entailment, not GPT-4
  • No LLM for claim extraction - deterministic regex and sentence splitting, not a second model call
  • No LLM for scoring - embedding similarity, not generated text
  • No API key required - works offline, air-gapped, on a laptop

The only neural models used are a 350 MB NLI cross-encoder (DeBERTa-v3-base) and a 90 MB sentence embedding model (all-MiniLM-L6-v2). Both run locally on CPU or GPU.

Configuration

auditor = Auditor(
    nli_model="cross-encoder/nli-deberta-v3-base",
    embedding_model="all-MiniLM-L6-v2",
    device="cpu",                   # or "cuda"
    weights={
        "groundedness": 0.40,
        "completeness": 0.20,
        "relevance": 0.20,
        "consistency": 0.15,
        "confidence": 0.05,
    },
    entailment_threshold=0.5,
    coverage_threshold=0.45,
    contradiction_threshold=0.7,
    max_sentences=25,
    max_query_length=10_000,
    max_response_length=50_000,
    max_context_items=50,
    max_context_item_length=10_000,
    max_batch_size=1_000,
)

Custom models

from llamella import Auditor
from llamella.models import trust_model

# Add the custom model to the allowlist before loading it
trust_model("myorg/fine-tuned-nli")
auditor = Auditor(nli_model="myorg/fine-tuned-nli")

No-context mode

Works without source context. Groundedness is skipped; IQS is computed from the remaining metrics.

result = auditor.score(
    query="Explain quantum computing",
    response="Quantum computing uses qubits that can be in superposition..."
)
print(result.groundedness)  # None
print(result.iqs)           # computed from remaining metrics

Batch scoring

results = auditor.score_batch([
    {"query": "...", "response": "...", "context": ["..."]},
    {"query": "...", "response": "..."},  # no context
])

Agent registry

Route scoring through per-agent configuration while sharing one model instance:

from llamella import Auditor, AgentRegistry

auditor = Auditor()
registry = AgentRegistry(auditor)

registry.register("support_bot",
    weights={"groundedness": 0.45, "completeness": 0.20,
             "relevance": 0.15, "consistency": 0.15, "confidence": 0.05},
    iqs_threshold=0.8,
    context_required=True,
)

registry.register("code_assistant",
    weights={"completeness": 0.40, "relevance": 0.30,
             "groundedness": 0.10, "consistency": 0.15, "confidence": 0.05},
    iqs_threshold=0.7,
)

result = registry.score("support_bot",
    query="What is the refund policy?",
    response="We offer 30-day refunds.",
    context=["30-day refund policy..."],
)

stats = registry.get_stats("support_bot")

The registry is duck-type compatible with Auditor - pass it to sample_and_score or DatabaseConnector directly.

Sampling

Score a statistically meaningful subset instead of every response:

from llamella import Auditor, sample_and_score

auditor = Auditor()
items = [
    {"query": "...", "response": "...", "context": ["..."]}
    for _ in range(50_000)
]

# Random sample of a fixed size
results = sample_and_score(
    auditor, items, strategy="random", sample_size=500, seed=42
)

print(results.summary())
# Sampled 500/50000 (1.0%) using random strategy.
# Mean IQS: 0.872 (±0.041), 95% CI: [0.868, 0.876]
# Flags: hallucination_risk: 12 (2.4%), incomplete: 5 (1.0%)

# Auto-compute size for 95% confidence, ±3% margin
ci_results = sample_and_score(
    auditor, items, strategy="confidence",
    confidence_level=0.95, margin_of_error=0.03, seed=42
)

Five strategies: random, percentage, stratified, confidence, priority.
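For intuition about the confidence strategy, the sketch below computes a sample size with the standard Cochran formula plus a finite population correction, and a normal-approximation confidence interval for a mean. Whether llamella uses exactly these formulas is an assumption, as are the function names; reading the ±0.041 in the summary output as a standard deviation is also an assumption.

```python
import math
from statistics import NormalDist


def required_sample_size(population, confidence_level=0.95,
                         margin_of_error=0.03, proportion=0.5):
    """Cochran's sample size with finite population correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n0 = z ** 2 * proportion * (1 - proportion) / margin_of_error ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


def mean_confidence_interval(mean, std_dev, n, confidence_level=0.95):
    """Normal-approximation confidence interval for a sample mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    half_width = z * std_dev / math.sqrt(n)
    return mean - half_width, mean + half_width


sample_size = required_sample_size(50_000)
print(sample_size)  # 1045

low, high = mean_confidence_interval(0.872, 0.041, 500)
print(f"[{low:.3f}, {high:.3f}]")  # [0.868, 0.876]
```

Under these assumptions, roughly 2% of a 50,000-item population is enough for a ±3% margin at 95% confidence, and the interval reproduces the one in the example summary.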

Database connector

Score responses directly from a SQL database:

pip install "llamella[database]"

from llamella import Auditor
from llamella.connectors import DatabaseConnector

auditor = Auditor()

connector = DatabaseConnector(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    source_table="llm_responses",
    column_map={
        "query": "user_query",
        "response": "agent_response",
        "context": "rag_chunks",
    },
    result_table="llamella_scores",
)

connector.score_all(auditor)
connector.score_incremental(auditor, cursor_column="created_at")
connector.score_sampled(auditor, strategy="random", sample_size=500, seed=42)

Supports PostgreSQL, MySQL, SQLite, BigQuery, and Snowflake.
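The idea behind cursor-based incremental scoring can be sketched with the standard library alone. This is an illustration under stated assumptions, not the connector's implementation: the function `score_incremental` below, the `placeholder_score` stand-in, and the schema of the result table are hypothetical, while the source table and column names follow the example above.

```python
import sqlite3


def score_incremental(conn, score_fn, last_seen=""):
    """Score rows created after last_seen; return the newest created_at seen."""
    rows = conn.execute(
        "SELECT id, user_query, agent_response, created_at "
        "FROM llm_responses WHERE created_at > ? ORDER BY created_at",
        (last_seen,),
    ).fetchall()
    for row_id, query, response, created_at in rows:
        conn.execute(
            "INSERT INTO llamella_scores (response_id, iqs) VALUES (?, ?)",
            (row_id, score_fn(query, response)),
        )
        last_seen = created_at  # advance the cursor as rows are scored
    conn.commit()
    return last_seen


def placeholder_score(query, response):
    """Stand-in for a real scorer: 1.0 for any non-empty response."""
    return 1.0 if response.strip() else 0.0


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llm_responses (id INTEGER PRIMARY KEY, user_query TEXT, "
    "agent_response TEXT, created_at TEXT)"
)
conn.execute("CREATE TABLE llamella_scores (response_id INTEGER, iqs REAL)")
insert = (
    "INSERT INTO llm_responses (user_query, agent_response, created_at) "
    "VALUES (?, ?, ?)"
)
conn.executemany(insert, [
    ("Refund policy?", "30-day full refund.", "2026-01-01T10:00:00Z"),
    ("Shipping time?", "", "2026-01-01T11:00:00Z"),
])

cursor = score_incremental(conn, placeholder_score)          # scores both rows
conn.execute(insert, ("Warranty?", "One year.", "2026-01-02T09:00:00Z"))
cursor = score_incremental(conn, placeholder_score, cursor)  # scores only the new row

scored = conn.execute(
    "SELECT response_id, iqs FROM llamella_scores ORDER BY response_id"
).fetchall()
print(scored)  # [(1, 1.0), (2, 0.0), (3, 1.0)]
```

The sketch relies on ISO 8601 timestamps sorting lexicographically; rows that share the cursor's exact timestamp would be skipped on the next run, so a production cursor usually needs a tie-breaker such as the primary key.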

Feedback loop

llamella doesn't just score - it learns. Flagged responses enter a correction pipeline. Human-reviewed corrections are stored and injected back into future prompts as guardrails, preventing the same mistake twice.

from llamella import Auditor
from llamella.feedback.store import FeedbackStore, CorrectionRecord
from llamella.feedback.injector import GuardrailInjector

auditor = Auditor()
store = FeedbackStore("corrections.jsonl")
injector = GuardrailInjector(store)

# query, response, and context come from your application
result = auditor.score(query=query, response=response, context=context)

if result.iqs < 0.7:
    store.add(CorrectionRecord(
        id="abc123",
        timestamp="2026-06-02T00:00:00Z",
        query=query,
        response=response,
        scores=result.to_dict(),
        flags=result.flags,
        correction="The correct answer...",
        reason="Why the original was wrong",
        context_used=context,
        corrected_by="human",
    ))

guardrails = injector.build_context(query=query, strategy="relevant")
system_prompt = f"You are a helpful agent.\n{guardrails}"

Encrypted storage

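# Requires: pip install "llamella[security]"
from cryptography.fernet import Fernet
from llamella.feedback.store import FeedbackStore

# Keep the key somewhere safe: records cannot be decrypted without it
key = Fernet.generate_key()
store = FeedbackStore("corrections.jsonl", encryption_key=key)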

Data retention

store = FeedbackStore(
    "corrections.jsonl",
    max_records=10_000,
    ttl_days=90,
)
store.delete("record-id")
store.purge(before_date="2026-01-01")
store.validate_integrity()

Performance

All numbers measured on CPU (Intel i7, single thread). Use device="cuda" for GPU acceleration.

| Operation | CPU latency | Notes |
|---|---|---|
| import llamella | ~250ms | No model loaded at import |
| First score() call | ~5–6s | Model weights downloaded and cached |
| Subsequent score(), no context | ~600ms | Embedding + regex only |
| score(), 1 context chunk | ~1s | +NLI inference |
| score(), 10 context chunks | ~2.5s | Batched NLI |
| score(), 50 context chunks | ~10s | Consider GPU for large context |
| score_batch(100) | ~60s | Sequential |

GPU latency estimated at 10–50ms per response with preloaded models. Run python -m benchmarks.run_all --only speed on your hardware for actual numbers.

Improved sentence splitting

llamella uses a regex sentence splitter by default. For better accuracy on complex text, enable NLTK once after installation:

python -c "import llamella; llamella.setup_nltk()"

Security

  • Model allowlist - only pre-approved model names are loaded. Use trust_model() to authorize custom models.
  • Prompt injection protection - GuardrailInjector sanitizes all feedback fields before system-prompt interpolation.
  • Encrypted feedback store - pass encryption_key= (Fernet) to encrypt records at rest.
  • PII scrubbing - SSNs, emails, phone numbers, and credit cards are masked before guardrail injection.
  • Input limits - configurable length limits on all inputs prevent memory exhaustion.
  • Tamper detection - per-record SHA-256 hashing with sequential numbering.
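The tamper-detection idea in the last bullet can be sketched with `hashlib`. The record layout and the helper names (`record_hash`, `seal`, `verify_records`) are hypothetical and chosen for illustration; llamella's on-disk format may differ.

```python
import hashlib
import json


def record_hash(seq, payload):
    """SHA-256 over the sequence number and a canonical JSON form of the payload."""
    canonical = json.dumps({"seq": seq, "payload": payload}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def seal(payloads):
    """Attach a sequence number and hash to each payload."""
    return [
        {"seq": i, "payload": p, "sha256": record_hash(i, p)}
        for i, p in enumerate(payloads, start=1)
    ]


def verify_records(records):
    """Return True if numbering is gap-free and every hash matches its record."""
    for expected_seq, rec in enumerate(records, start=1):
        if rec["seq"] != expected_seq:
            return False  # a record was removed or reordered
        if rec["sha256"] != record_hash(rec["seq"], rec["payload"]):
            return False  # a record was edited after it was written
    return True


records = seal([
    {"correction": "Refunds take 30 days."},
    {"correction": "Shipping is free over $50."},
])
intact = verify_records(records)
print(intact)  # True

edited = [dict(r) for r in records]
edited[0]["payload"] = {"correction": "Refunds take 90 days."}
edited_ok = verify_records(edited)
print(edited_ok)  # False: hash no longer matches

missing_ok = verify_records(records[1:])
print(missing_ok)  # False: sequence starts at 2
```

This sketch catches edits and gaps, but not removal of the most recent records, and an unkeyed hash can be recomputed by anyone with write access; a record count stored elsewhere or a keyed hash would be needed to cover those cases.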

Install

pip install llamella

Optional extras:

pip install "llamella[security]"   # encrypted feedback storage
pip install "llamella[database]"   # database connector (SQLAlchemy)
pip install "llamella[bench]"      # benchmark suite dependencies
pip install "llamella[dev]"        # development tools (pytest, ruff)

Contributing

See CONTRIBUTING.md for development setup, coding guidelines, and the PR process.

git clone https://github.com/sunnyguntuka/llamella
cd llamella
pip install -e ".[dev]"
pytest tests/ -v
ruff check src/ tests/

Changelog

See CHANGELOG.md.

License

Apache-2.0 - see LICENSE.
