llamella

LLM-free response quality scoring. Grade every response. No second LLM call.

Maintainers

sguntuka

LLM-free response quality scoring

Grade every response. No second LLM call. Zero cost. Deterministic.

Features: NLI scoring, evidence mapping, five metrics, IQS composite score, feedback loop, quality gate

Why llamella?

Teams deploying LLM agents and RAG systems can't manually review every response. Existing tools use LLM-as-judge - a second LLM call per evaluation - which costs $0.01–0.05/eval, takes 2–5s, and gives non-deterministic results. llamella scores every response locally using NLI models and embedding similarity. Zero cost. Deterministic. 100% coverage.

How it's different

Comparison diagram: the LLM-as-judge approach sends each response to GPT for evaluation, while llamella scores locally using NLI cross-encoders and embedding similarity, with no API call

Feature	llamella	DeepEval	RAGAS	TruthScore
Cost per eval	$0.00	$0.01–0.05	$0.01–0.05	Requires LLM
Latency (GPU)	10–50ms (estimated)	2–5s	2–5s	2–5s+
Latency (CPU)	600ms–2s	2–5s	2–5s	2–5s+
LLM call required	No	Yes	Yes	Yes (claim decomposition)
Deterministic	Yes	No	No	No
Runs offline	Yes	No	No	Partial (Ollama)
Feedback loop	Yes	No	No	No
Metrics	5 + composite	50+ (LLM-judged)	4 (LLM-judged)	1

Quick start

pip install llamella

from llamella import Auditor

auditor = Auditor()

result = auditor.score(
    query="What is our refund policy?",
    response="We offer a 30-day full refund at no extra cost.",
    context=["All customers are eligible for a 30-day full refund at no extra cost."]
)

print(result.iqs)           # 0.93 - composite Information Quality Score
print(result.groundedness)  # 0.97
print(result.flags)         # [] - no issues

Two convenience functions for one-off scoring:

import llamella

# Returns full EntailmentResult
result = llamella.score(
    query="What is our refund policy?",
    response="We offer a 30-day full refund.",
    context=["30-day full refund at no extra cost."]
)

# Returns True if IQS >= threshold
passed = llamella.verify(
    query="What is our refund policy?",
    response="We offer a 30-day full refund.",
    context=["30-day full refund at no extra cost."],
    threshold=0.7
)

For repeated scoring, instantiate Auditor once and reuse it: the models are loaded once and cached.

How it works

Architecture diagram: user query and LLM response enter the llamella scoring engine, which runs five parallel metrics (groundedness via NLI, completeness via embeddings, relevance via cosine similarity, consistency via pairwise NLI, confidence via regex), combines them into an IQS score, raises quality flags, and feeds corrections back as guardrails

Every response flows through the scoring engine, which checks five independent quality dimensions using NLI cross-encoders, embedding similarity, and regex matching - no LLM calls anywhere in the pipeline. Responses that score below the threshold enter the feedback loop, where corrections become guardrails for future responses.

Metrics

Five independent quality dimensions, each scored 0–1:

Metric	What it measures	How	Typical CPU latency
Groundedness	Is the response faithful to source context?	NLI cross-encoder per claim, batched	~800ms
Completeness	Did the response address all parts of the query?	Embedding similarity per query segment	~150ms
Relevance	Is the response on-topic?	Cosine similarity query↔response	~100ms
Consistency	Does the response contradict itself?	Pairwise NLI between sentences (capped at 25)	~400ms
Confidence	How assertive vs hedged is the response?	Regex pattern matching	<1ms

Note: Latencies are for CPU (DeBERTa-v3-base). Use device="cuda" for 10–50× speedup on GPU. First call also loads model weights (~5s).
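The Relevance row above relies on cosine similarity between a query embedding and a response embedding. A minimal sketch of that computation, using tiny hand-made vectors in place of real all-MiniLM-L6-v2 embeddings. The function name `cosine_similarity` is illustrative, and how llamella maps the raw value onto its 0–1 score is not stated here, so treat this as an assumption rather than the library's implementation:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
query_vec = [1.0, 0.0, 1.0]
on_topic_vec = [0.9, 0.1, 1.1]
off_topic_vec = [0.0, 1.0, 0.0]

on_topic = cosine_similarity(query_vec, on_topic_vec)
off_topic = cosine_similarity(query_vec, off_topic_vec)
print(f"on-topic: {on_topic:.3f}, off-topic: {off_topic:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while orthogonal vectors score 0, which is why an unrelated response would fall under a low relevance threshold.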

IQS - the composite score

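IQS (Information Quality Score) is the weighted harmonic mean of all five metrics. The harmonic mean penalizes low scores hard: a response with 0.95 groundedness but 0.1 completeness scores ~0.3 under the default weights even if the other three metrics are perfect, far below the ~0.5 that a plain average of those two values would suggest.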

Default weights:
  groundedness  0.35    # most important - is it faithful?
  completeness  0.25    # did it answer the full question?
  relevance     0.20    # is it on topic?
  consistency   0.15    # does it contradict itself?
  confidence    0.05    # calibration check

When no context is provided, the groundedness weight is redistributed proportionally across the other four metrics.
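To make the arithmetic concrete, here is a minimal sketch of a weighted harmonic mean with proportional weight redistribution. It assumes the standard formula (sum of weights divided by the sum of weight/score); the function name `weighted_harmonic_mean` and the `eps` floor for zero scores are illustrative assumptions, not llamella's confirmed internals:

```python
DEFAULT_WEIGHTS = {
    "groundedness": 0.35,
    "completeness": 0.25,
    "relevance": 0.20,
    "consistency": 0.15,
    "confidence": 0.05,
}

def weighted_harmonic_mean(scores, weights, eps=1e-6):
    """Weighted harmonic mean over the metrics present in `scores`.

    Metrics whose score is None are dropped and the remaining weights are
    renormalized, which redistributes the missing weight proportionally.
    """
    active = {name: s for name, s in scores.items() if s is not None}
    total_weight = sum(weights[name] for name in active)
    # eps (an assumption) guards against division by zero for a score of 0.
    denominator = sum(
        (weights[name] / total_weight) / max(score, eps)
        for name, score in active.items()
    )
    return 1.0 / denominator

scores = {"groundedness": 0.95, "completeness": 0.10, "relevance": 1.0,
          "consistency": 1.0, "confidence": 1.0}
print(round(weighted_harmonic_mean(scores, DEFAULT_WEIGHTS), 3))      # 0.306

# No-context mode: groundedness is None, so its 0.35 weight is spread over the rest.
no_context = dict(scores, groundedness=None, completeness=0.9)
print(round(weighted_harmonic_mean(no_context, DEFAULT_WEIGHTS), 3))  # 0.959
```

In the second call the effective weights become 0.25/0.65, 0.20/0.65, 0.15/0.65, and 0.05/0.65, so the four remaining metrics keep their relative importance.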

Flags

llamella automatically flags specific quality issues:

Flag	Condition
`hallucination_risk`	groundedness < 0.5 AND confidence > 0.7
`off_topic`	relevance < 0.3
`self_contradictory`	consistency < 0.7
`incomplete`	completeness < 0.3
`ungrounded`	groundedness < 0.3
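The conditions in the table translate directly into threshold checks. A minimal sketch, assuming scores arrive as a plain dict and that a missing groundedness score (no-context mode) never raises a groundedness-based flag; `quality_flags` is a hypothetical helper, not part of the llamella API:

```python
def quality_flags(scores: dict) -> list[str]:
    """Return the flags whose conditions in the table above are met.

    A groundedness of None (no-context mode) triggers no flag in this sketch.
    """
    g = scores.get("groundedness")
    flags = []
    if g is not None and g < 0.5 and scores["confidence"] > 0.7:
        flags.append("hallucination_risk")
    if scores["relevance"] < 0.3:
        flags.append("off_topic")
    if scores["consistency"] < 0.7:
        flags.append("self_contradictory")
    if scores["completeness"] < 0.3:
        flags.append("incomplete")
    if g is not None and g < 0.3:
        flags.append("ungrounded")
    return flags

confident_but_unsupported = {
    "groundedness": 0.2, "completeness": 0.8, "relevance": 0.9,
    "consistency": 0.95, "confidence": 0.9,
}
print(quality_flags(confident_but_unsupported))
# ['hallucination_risk', 'ungrounded']
```

Note that one response can carry several flags at once: an assertive answer with very low groundedness is both a hallucination risk and ungrounded.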

No LLM anywhere

Unlike the tools compared above, llamella makes zero LLM calls:

No LLM for judging - NLI cross-encoders evaluate entailment, not GPT-4
No LLM for claim extraction - deterministic regex and sentence splitting, not a second model call
No LLM for scoring - embedding similarity, not generated text
No API key required - works offline, air-gapped, on a laptop

The only neural models used are a 350 MB NLI cross-encoder (DeBERTa-v3-base) and a 90 MB sentence embedding model (all-MiniLM-L6-v2). Both run locally on CPU or GPU.

Configuration

from llamella import Auditor

auditor = Auditor(
    nli_model="cross-encoder/nli-deberta-v3-base",
    embedding_model="all-MiniLM-L6-v2",
    device="cpu",                   # or "cuda"
    weights={
        "groundedness": 0.40,
        "completeness": 0.20,
        "relevance": 0.20,
        "consistency": 0.15,
        "confidence": 0.05,
    },
    entailment_threshold=0.5,
    coverage_threshold=0.45,
    contradiction_threshold=0.7,
    max_sentences=25,
    max_query_length=10_000,
    max_response_length=50_000,
    max_context_items=50,
    max_context_item_length=10_000,
    max_batch_size=1_000,
)

Custom models

from llamella import Auditor
from llamella.models import trust_model

# Add the custom model to the allowlist before loading it
trust_model("myorg/fine-tuned-nli")
auditor = Auditor(nli_model="myorg/fine-tuned-nli")

No-context mode

Works without source context. Groundedness is skipped; IQS is computed from the remaining metrics.

result = auditor.score(
    query="Explain quantum computing",
    response="Quantum computing uses qubits that can be in superposition..."
)
print(result.groundedness)  # None
print(result.iqs)           # computed from remaining metrics

Batch scoring

results = auditor.score_batch([
    {"query": "...", "response": "...", "context": ["..."]},
    {"query": "...", "response": "..."},  # no context
])

Agent registry

Route scoring through per-agent configuration while sharing one model instance:

from llamella import Auditor, AgentRegistry

auditor = Auditor()
registry = AgentRegistry(auditor)

registry.register("support_bot",
    weights={"groundedness": 0.45, "completeness": 0.20,
             "relevance": 0.15, "consistency": 0.15, "confidence": 0.05},
    iqs_threshold=0.8,
    context_required=True,
)

registry.register("code_assistant",
    weights={"completeness": 0.40, "relevance": 0.30,
             "groundedness": 0.10, "consistency": 0.15, "confidence": 0.05},
    iqs_threshold=0.7,
)

result = registry.score("support_bot",
    query="What is the refund policy?",
    response="We offer 30-day refunds.",
    context=["30-day refund policy..."],
)

stats = registry.get_stats("support_bot")

The registry is duck-type compatible with Auditor - pass it to sample_and_score or DatabaseConnector directly.

Sampling

Score a statistically meaningful subset instead of every response:

from llamella import Auditor, sample_and_score

auditor = Auditor()
items = [
    {"query": "...", "response": "...", "context": ["..."]}
    for _ in range(50_000)
]

# Random sample
results = sample_and_score(
    auditor, items, strategy="random", sample_size=500, seed=42
)

# Auto-compute size for 95% confidence, ±3% margin
# (stored separately so the summary below still describes the random sample)
ci_results = sample_and_score(
    auditor, items, strategy="confidence",
    confidence_level=0.95, margin_of_error=0.03, seed=42
)

print(results.summary())
# Sampled 500/50000 (1.0%) using random strategy.
# Mean IQS: 0.872 (±0.041), 95% CI: [0.868, 0.876]
# Flags: hallucination_risk: 12 (2.4%), incomplete: 5 (1.0%)

Five strategies: random, percentage, stratified, confidence, priority.
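For a sense of what the confidence strategy has to compute, here is a sketch of the textbook calculation: Cochran's sample-size formula with a finite-population correction. Whether llamella uses exactly this formula is an assumption; `required_sample_size` and the worst-case proportion `p=0.5` are illustrative choices:

```python
import math
from statistics import NormalDist

def required_sample_size(population: int, confidence_level: float = 0.95,
                         margin_of_error: float = 0.03, p: float = 0.5) -> int:
    """Cochran's sample-size formula with a finite-population correction."""
    # Two-sided z-score, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    # Shrink it for a finite population of known size.
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))

print(required_sample_size(50_000))  # 1045
```

Under these assumptions, roughly a thousand of the 50,000 items are enough for a ±3% margin at 95% confidence, and the required size grows only slowly as the population gets larger.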

Database connector

Score responses directly from a SQL database:

pip install "llamella[database]"

from llamella import Auditor
from llamella.connectors import DatabaseConnector

auditor = Auditor()

connector = DatabaseConnector(
    connection_string="postgresql://user:pass@localhost:5432/mydb",
    source_table="llm_responses",
    column_map={
        "query": "user_query",
        "response": "agent_response",
        "context": "rag_chunks",
    },
    result_table="llamella_scores",
)

connector.score_all(auditor)
connector.score_incremental(auditor, cursor_column="created_at")
connector.score_sampled(auditor, strategy="random", sample_size=500, seed=42)

Supports PostgreSQL, MySQL, SQLite, BigQuery, and Snowflake.

Feedback loop

llamella doesn't just score - it learns. Flagged responses enter a correction pipeline. Human-reviewed corrections are stored and injected back into future prompts as guardrails, preventing the same mistake twice.

from llamella import Auditor
from llamella.feedback.store import FeedbackStore, CorrectionRecord
from llamella.feedback.injector import GuardrailInjector

auditor = Auditor()
store = FeedbackStore("corrections.jsonl")
injector = GuardrailInjector(store)

# query, response, and context come from your application
result = auditor.score(query=query, response=response, context=context)

if result.iqs < 0.7:
    store.add(CorrectionRecord(
        id="abc123",
        timestamp="2026-06-02T00:00:00Z",
        query=query,
        response=response,
        scores=result.to_dict(),
        flags=result.flags,
        correction="The correct answer...",
        reason="Why the original was wrong",
        context_used=context,
        corrected_by="human",
    ))

guardrails = injector.build_context(query=query, strategy="relevant")
system_prompt = f"You are a helpful agent.\n{guardrails}"

Encrypted storage

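from cryptography.fernet import Fernet
from llamella.feedback.store import FeedbackStore

# Persist this key securely: records encrypted with it cannot be read without it
key = Fernet.generate_key()
store = FeedbackStore("corrections.jsonl", encryption_key=key)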

Data retention

store = FeedbackStore(
    "corrections.jsonl",
    max_records=10_000,
    ttl_days=90,
)
store.delete("record-id")
store.purge(before_date="2026-01-01")
store.validate_integrity()

Performance

All numbers measured on CPU (Intel i7, single thread). Use device="cuda" for GPU acceleration.

Operation	CPU latency	Notes
`import llamella`	~250ms	No model loaded at import
First `score()` call	~5–6s	Model weights downloaded and cached
Subsequent `score()`, no context	~600ms	Embeddings, consistency NLI, and regex; groundedness skipped
`score()`, 1 context chunk	~1s	+NLI inference
`score()`, 10 context chunks	~2.5s	Batched NLI
`score()`, 50 context chunks	~10s	Consider GPU for large context
`score_batch(100)`	~60s	Sequential

GPU latency estimated at 10–50ms per response with preloaded models. Run python -m benchmarks.run_all --only speed on your hardware for actual numbers.

Improved sentence splitting

llamella uses a regex sentence splitter by default. For better accuracy on complex text, enable NLTK once after installation:

python -c "import llamella; llamella.setup_nltk()"

Security

Model allowlist - only pre-approved model names are loaded. Use trust_model() to authorize custom models.
Prompt injection protection - GuardrailInjector sanitizes all feedback fields before system-prompt interpolation.
Encrypted feedback store - pass encryption_key= (Fernet) to encrypt records at rest.
PII scrubbing - SSNs, emails, phone numbers, and credit cards are masked before guardrail injection.
Input limits - configurable length limits on all inputs prevent memory exhaustion.
Tamper detection - per-record SHA-256 hashing with sequential numbering.
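As an illustration of the tamper-detection idea in the last item, the sketch below seals each record with a sequence number and a SHA-256 digest, then re-checks both. The record layout and the `seal`/`validate` helpers are assumptions made for this example, not FeedbackStore's actual on-disk format:

```python
import hashlib
import json

def seal(records: list[dict]) -> list[dict]:
    """Attach a sequence number and SHA-256 digest to each record."""
    sealed = []
    for seq, record in enumerate(records, start=1):
        payload = json.dumps({"seq": seq, "data": record}, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        sealed.append({"seq": seq, "data": record, "sha256": digest})
    return sealed

def validate(sealed: list[dict]) -> bool:
    """True if numbering is gapless and every digest still matches."""
    for expected_seq, entry in enumerate(sealed, start=1):
        if entry["seq"] != expected_seq:
            return False  # a record was deleted, inserted, or reordered
        payload = json.dumps({"seq": entry["seq"], "data": entry["data"]},
                             sort_keys=True)
        if hashlib.sha256(payload.encode("utf-8")).hexdigest() != entry["sha256"]:
            return False  # record contents were modified after sealing
    return True

log = seal([{"correction": "Refunds are 30 days"},
            {"correction": "No restocking fee"}])
print(validate(log))   # True
log[0]["data"]["correction"] = "Refunds are 90 days"
print(validate(log))   # False
```

A plain, unkeyed hash like this catches accidental corruption and naive edits, but not an attacker who can recompute digests or silently drop the newest records; that is one reason to combine it with the encrypted store.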

Install

pip install llamella

Optional extras:

pip install "llamella[security]"   # encrypted feedback storage
pip install "llamella[database]"   # database connector (SQLAlchemy)
pip install "llamella[bench]"      # benchmark suite dependencies
pip install "llamella[dev]"        # development tools (pytest, ruff)

Contributing

See CONTRIBUTING.md for development setup, coding guidelines, and the PR process.

git clone https://github.com/sunnyguntuka/llamella
cd llamella
pip install -e ".[dev]"
pytest tests/ -v
ruff check src/ tests/

Changelog

See CHANGELOG.md.

License

Apache-2.0 - see LICENSE.

Release history

This version: 0.1.0 (Jun 5, 2026)

Download files

Source Distribution

llamella-0.1.0.tar.gz (61.7 kB)

Uploaded Jun 5, 2026 Source

Built Distribution

llamella-0.1.0-py3-none-any.whl (46.0 kB)

Uploaded Jun 5, 2026 Python 3

File details

Details for the file llamella-0.1.0.tar.gz.

File metadata

Download URL: llamella-0.1.0.tar.gz
Upload date: Jun 5, 2026
Size: 61.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llamella-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`dbe23daff9f1b7d41ba8229e4f58e906221b7cbb2ad140b7bd84bb071167820a`
MD5	`549ac4b4caeec73c2f0db8c4c57de39e`
BLAKE2b-256	`79c2b33ceb843f0ca42bcb3a9b7527e6657e4b9f433fe54f8ea3adbcd358db13`
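To check a downloaded archive against the SHA256 digest listed above, a short standard-library sketch is enough. The local path is an assumption; point it at wherever the file was saved:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

EXPECTED = "dbe23daff9f1b7d41ba8229e4f58e906221b7cbb2ad140b7bd84bb071167820a"
archive = Path("llamella-0.1.0.tar.gz")  # assumed download location

if archive.exists():
    print("OK" if sha256_of(archive) == EXPECTED else "MISMATCH")
else:
    print(f"{archive} not found; download it first")
```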

See more details on using hashes here.

Provenance

The following attestation bundles were made for llamella-0.1.0.tar.gz:

Publisher: publish.yml on sunnyguntuka/llamella

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llamella-0.1.0.tar.gz
- Subject digest: dbe23daff9f1b7d41ba8229e4f58e906221b7cbb2ad140b7bd84bb071167820a
- Sigstore transparency entry: 1729586171
- Sigstore integration time: Jun 5, 2026
Source repository:
- Permalink: sunnyguntuka/llamella@0716b65ab57d8dde67daba750ce361631891e0bb
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/sunnyguntuka
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0716b65ab57d8dde67daba750ce361631891e0bb
- Trigger Event: release

File details

Details for the file llamella-0.1.0-py3-none-any.whl.

File metadata

Download URL: llamella-0.1.0-py3-none-any.whl
Upload date: Jun 5, 2026
Size: 46.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for llamella-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2c2c028a65f528ea82cdb9e6b0c5a6eae69eee8a0e17b7c42d1b7853c8157e24`
MD5	`f578004319bb3b2dbafd87cef80f8fbd`
BLAKE2b-256	`15e8c609569b4384c1b9ecfb7675e5179264c488605bbf713a1a1747ecb0d194`

See more details on using hashes here.

Provenance

The following attestation bundles were made for llamella-0.1.0-py3-none-any.whl:

Publisher: publish.yml on sunnyguntuka/llamella

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: llamella-0.1.0-py3-none-any.whl
- Subject digest: 2c2c028a65f528ea82cdb9e6b0c5a6eae69eee8a0e17b7c42d1b7853c8157e24
- Sigstore transparency entry: 1729586347
- Sigstore integration time: Jun 5, 2026
Source repository:
- Permalink: sunnyguntuka/llamella@0716b65ab57d8dde67daba750ce361631891e0bb
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/sunnyguntuka
- Access: private
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@0716b65ab57d8dde67daba750ce361631891e0bb
- Trigger Event: release
