SSEM — Standardized Scoring and Evaluation Metrics
Lightweight evaluation metrics for LLMs and AI agents. No platform. No API keys. Just scores.
SSEM provides 12 evaluation metrics covering text generation quality, factual consistency, hallucination detection, agentic AI evaluation, and safety — all with full scoring transparency and research citations.
Installation
```bash
# PyPI
pip install ssem

# uv
uv pip install ssem

# Latest from GitHub
pip install git+https://github.com/TechyNilesh/SSEM.git
```
Quick Start
```python
from SSEM import SSEM

evaluator = SSEM()

# BERTScore — one line
result = evaluator.bertscore(
    ["The cat sat on the mat."],
    ["A cat was sitting on a mat."],
)

print(result.score)      # 0.87
print(result.explain())  # Full transparency report
```
Why SSEM?
| SSEM | DeepEval / Ragas |
|---|---|
| 12 metrics in one lightweight package | Bundled with platforms, tracing, dashboards |
| No LLM-as-judge required — embedding + NLI based | Often requires GPT-4 API calls ($$$) |
| Agentic metrics built-in — tool accuracy, reasoning chains | Focused on RAG; agents are an afterthought |
| Every score is transparent — method, model, citations | Black-box scores |
| Runs offline on CPU — no API keys needed | Many require cloud API keys |
Available Metrics
Text Generation Quality
| Metric | Method | Score Range | Citation |
|---|---|---|---|
| `semantic_similarity` | Sentence-embedding cosine/Euclidean/Pearson similarity | [-1, 1] or [0, 1] | Beken Fikri et al. (2021) |
| `bertscore` | Token-level precision, recall, and F1 via contextual embeddings | [0, 1] | Zhang et al. (2020) |
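As a rough sketch of the embedding-similarity idea (not SSEM's actual implementation), cosine similarity between two sentence-embedding vectors reduces to a normalized dot product; the toy vectors below stand in for real model embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two sentence-embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real sentence embeddings.
emb_output = np.array([0.9, 0.1, 0.3])
emb_reference = np.array([0.8, 0.2, 0.4])
score = cosine_similarity(emb_output, emb_reference)  # close to 1.0: near-paraphrases
```

Euclidean and Pearson variants follow the same pattern with a different distance or correlation in place of the normalized dot product.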
Factual Consistency
| Metric | Method | Score Range | Citation |
|---|---|---|---|
| `faithfulness` | Claim extraction + NLI/embedding entailment checking | [0, 1] | Kryscinski et al. (2020) |
| `hallucination` | Fraction of output claims NOT grounded in source | [0, 1] | Kryscinski et al. (2020); Manakul et al. (2023) |
| `answer_relevancy` | Question-answer embedding similarity | [0, 1] | Es et al. (2024) |
Agentic AI Evaluation
| Metric | Method | Score Range | Citation |
|---|---|---|---|
| `reasoning_coherence` | Sequential + goal-aligned step similarity, contradiction detection | [0, 1] | Xia et al. (2024) |
| `tool_accuracy` | Tool selection + parameter + ordering accuracy (LCS) | [0, 1] | Liu et al. (2023) |
| `task_completion` | Checklist- or reference-based graded completion | [0, 1] | Liu et al. (2023) |
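The ordering component of `tool_accuracy` uses a longest common subsequence (LCS); a minimal sketch of that idea (a hedged illustration, not the library's actual code) scores how much of the expected call order the agent preserved:

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence (classic dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def ordering_accuracy(predicted: list, expected: list) -> float:
    """Fraction of the expected tool sequence preserved, in order."""
    if not expected:
        return 1.0
    return lcs_length(predicted, expected) / len(expected)

# An extra call in between does not break the expected ordering:
acc = ordering_accuracy(["search", "summarize", "email"], ["search", "email"])  # 1.0
```

Because LCS ignores insertions, a spurious extra tool call lowers selection accuracy but not ordering accuracy; swapping two expected calls does.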
Consistency & Safety
| Metric | Method | Score Range | Citation |
|---|---|---|---|
| `multi_turn_consistency` | Cross-turn semantic consistency + contradiction detection | [0, 1] | Zheng et al. (2023) |
| `selfcheck` | Sampling consistency for hallucination detection | [0, 1] | Manakul et al. (2023) |
| `toxicity` | Classifier-based toxicity scoring | [0, 1] | Gehman et al. (2020) |
Code Evaluation
| Metric | Method | Score Range | Citation |
|---|---|---|---|
| `code_correctness` | Execution-based Pass@k with unbiased estimator | [0, 1] | Chen et al. (2021) |
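The unbiased Pass@k estimator from Chen et al. (2021) can be written in a few lines: given n generated samples of which c pass the tests, the probability that at least one of k drawn samples passes is 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k (Chen et al., 2021): 1 - C(n-c, k) / C(n, k),
    where n = total samples and c = samples that pass the tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# 2 samples, 1 correct — mirrors a one-of-two outcome:
p1 = pass_at_k(n=2, c=1, k=1)  # 0.5
p2 = pass_at_k(n=2, c=1, k=2)  # 1.0
```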
Scoring Transparency
Every SSEM metric returns a `MetricResult` — never a bare number. Each result includes:
```python
result = evaluator.bertscore(outputs, references)

result.score           # 0.87 — the primary score
result.score_range     # (0.0, 1.0) — possible range
result.interpretation  # "Strong token-level overlap..."
result.method          # Step-by-step computation description
result.model_used      # "bert-base-multilingual-cased"
result.citations       # List of Citation objects
result.details         # Per-sample scores, intermediates
result.elapsed_sec     # Wall-clock time

# Full human-readable transparency report
print(result.explain())
```
Example output of `result.explain()`:

```text
Metric        : BERTScore
Score         : 0.8734
Score Range   : [0.0, 1.0]
Interpretation: Strong token-level overlap — output captures most reference content.
Model Used    : bert-base-multilingual-cased

How This Score Was Computed:
  1. Encoded 1 sentence pairs into per-token contextual embeddings using 'bert-base-multilingual-cased'.
  2. For each pair, built a cosine similarity matrix between output tokens and reference tokens.
  3. Precision = mean of row-wise max similarities (each output token's best reference match).
  4. Recall = mean of column-wise max similarities (each reference token's best output match).
  5. F1 = harmonic mean of precision and recall.
  6. Averaged across 1 pairs: P=0.8912, R=0.8561, F1=0.8734.

Research Citations:
  [1] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). BERTScore: Evaluating Text Generation with BERT. ICLR 2020. https://arxiv.org/abs/1904.09675
```
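The greedy-matching arithmetic in steps 2–5 of that report can be sketched directly on a small similarity matrix (toy numbers, not real token embeddings):

```python
import numpy as np

def greedy_match_f1(sim: np.ndarray) -> tuple:
    """P/R/F1 via greedy matching on an (output_tokens x ref_tokens)
    cosine-similarity matrix, mirroring BERTScore's aggregation."""
    precision = float(sim.max(axis=1).mean())  # each output token's best reference match
    recall = float(sim.max(axis=0).mean())     # each reference token's best output match
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 2x2 similarity matrix: token 0 matches token 0, token 1 matches token 1.
sim = np.array([[0.9, 0.2],
                [0.3, 0.8]])
p, r, f1 = greedy_match_f1(sim)  # p == r == f1 == 0.85 here
```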
Usage Examples
Text Generation Evaluation
```python
from SSEM import SSEM

evaluator = SSEM()

outputs = ["The cat sat on the mat.", "It was a sunny day."]
references = ["A cat was sitting on a mat.", "The weather was sunny."]

# Semantic similarity
result = evaluator.semantic_similarity(outputs, references)

# BERTScore (P/R/F1)
result = evaluator.bertscore(outputs, references)
print(result.details["precision"])  # 0.89
print(result.details["recall"])     # 0.85
print(result.details["f1"])         # 0.87
```
Faithfulness & Hallucination
```python
output = "Paris is the capital of France. The Eiffel Tower is in London."
source = "Paris is the capital of France. The Eiffel Tower is located in Paris."

# Faithfulness — are claims grounded?
result = evaluator.faithfulness(output, source)
print(result.score)    # 0.5 — one of two claims is unfaithful
print(result.details)  # Per-claim breakdown with individual scores

# Hallucination — what fraction is fabricated?
result = evaluator.hallucination(output, source)
print(result.score)    # 0.5 — half the claims are hallucinated
```
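The "fraction of claims not grounded" idea can be illustrated with a toy word-overlap similarity standing in for SSEM's NLI/embedding entailment check; both the overlap measure and the threshold here are illustrative assumptions, not the library's actual method:

```python
def jaccard(a: str, b: str) -> float:
    """Toy word-overlap similarity — a stand-in for NLI/embedding entailment."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def hallucination_rate(output: str, source: str, threshold: float = 0.7) -> float:
    """Fraction of output claims with no sufficiently similar source sentence."""
    claims = [s.strip() for s in output.split(".") if s.strip()]
    evidence = [s.strip() for s in source.split(".") if s.strip()]
    ungrounded = sum(
        1 for c in claims
        if max(jaccard(c, e) for e in evidence) < threshold
    )
    return ungrounded / len(claims)

rate = hallucination_rate(
    "Paris is the capital of France. The Eiffel Tower is in London.",
    "Paris is the capital of France. The Eiffel Tower is located in Paris.",
)  # 0.5 — the Eiffel Tower claim is not grounded
```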
Agentic AI Evaluation
```python
# Reasoning chain coherence
result = evaluator.reasoning_coherence(
    reasoning_steps=[
        "First, I need to find the user's order history.",
        "Next, I'll filter orders from the last 30 days.",
        "Then, I'll calculate the total spending.",
        "Finally, I'll generate a summary report.",
    ],
    goal="Generate a spending report for the last month.",
)
print(result.score)                      # 0.82
print(result.details["contradictions"])  # [] — no contradictions

# Tool call accuracy
result = evaluator.tool_accuracy(
    predicted_calls=[
        {"tool": "database_query", "params": {"table": "orders", "days": 30}},
        {"tool": "calculate_sum", "params": {"column": "amount"}},
    ],
    expected_calls=[
        {"tool": "database_query", "params": {"table": "orders", "days": 30}},
        {"tool": "calculate_sum", "params": {"column": "amount"}},
    ],
)
print(result.score)  # 1.0 — perfect tool usage

# Task completion (checklist mode)
result = evaluator.task_completion(
    agent_output="I queried the database and found 15 orders totaling $1,234.",
    expected_criteria=[
        "Query the order database",
        "Calculate total spending",
        "Report the number of orders",
    ],
)
print(result.score)  # 0.67 — 2 of 3 criteria met
```
Multi-Turn Consistency
```python
result = evaluator.multi_turn_consistency(
    responses=[
        "I recommend Python for this project.",
        "Python has great ML libraries like scikit-learn.",
        "Actually, you should use Java instead.",  # contradiction!
    ]
)
print(result.score)                      # 0.61
print(result.details["contradictions"])  # Flags the Java contradiction
```
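One way to picture the cross-turn similarity component (an assumed simplification, not SSEM's actual algorithm): embed each turn and average the cosine similarity of consecutive turns, so a turn that veers off drags the score down. The vectors below are toy stand-ins for real embeddings:

```python
import numpy as np

def turn_consistency(turn_embeddings: list) -> float:
    """Mean cosine similarity between consecutive turn embeddings —
    a toy stand-in for cross-turn consistency scoring."""
    sims = []
    for a, b in zip(turn_embeddings, turn_embeddings[1:]):
        a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
        sims.append(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))
    return sum(sims) / len(sims)

# Turns 1 and 2 agree; turn 3 points in a very different direction.
score = turn_consistency([[1.0, 0.0], [0.9, 0.1], [-0.2, 1.0]])
```

A full implementation would also run contradiction detection (e.g. NLI) across turn pairs, as the table above indicates.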
Code Correctness
```python
result = evaluator.code_correctness(
    code_samples=[
        "def factorial(n):\n    if n <= 1: return 1\n    return n * factorial(n-1)",
        "def factorial(n):\n    return n * n",  # wrong
    ],
    test_code="assert factorial(5) == 120\nassert factorial(0) == 1",
    k_values=[1, 2],
)
print(result.details["pass_at_k"])  # {"pass@1": 0.5, "pass@2": 1.0}
```
Full Evaluation Report
```python
report = evaluator.evaluate_all(
    output_sentences=["The cat sat on the mat."],
    reference_sentences=["A cat was sitting on a mat."],
    source_context="A cat was observed sitting on a mat in the room.",
    reasoning_steps=["Find the cat.", "Describe its position."],
)

print(report.summary())  # One-line-per-metric table
print(report.explain())  # Full transparency + bibliography
print(report.to_json())  # JSON export for pipelines
```
Research Citations
SSEM is grounded in peer-reviewed research. Every metric cites its origin:
| Metric | Paper | Venue |
|---|---|---|
| BERTScore | Zhang et al. "BERTScore: Evaluating Text Generation with BERT" | ICLR 2020 |
| Semantic Similarity | Beken Fikri et al. "Semantic Similarity Based Evaluation for Abstractive News Summarization" | GEM @ ACL 2021 |
| Faithfulness | Kryscinski et al. "Evaluating the Factual Consistency of Abstractive Text Summarization" | EMNLP 2020 |
| SelfCheck | Manakul et al. "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection" | EMNLP 2023 |
| Answer Relevancy | Es et al. "RAGAS: Automated Evaluation of Retrieval Augmented Generation" | EACL 2024 |
| Reasoning Coherence | Xia, Li, Liu, Wu & Liu. "ReasonEval: Evaluating Mathematical Reasoning Beyond Accuracy" | arXiv 2024 |
| AgentBench | Liu et al. "AgentBench: Evaluating LLMs as Agents" | ICLR 2024 |
| Pass@k | Chen et al. "Evaluating Large Language Models Trained on Code" | arXiv 2021 |
| MT-Bench | Zheng et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" | NeurIPS 2023 |
| Toxicity | Gehman et al. "RealToxicityPrompts" | EMNLP 2020 |
| LSI | Deerwester et al. "Indexing by Latent Semantic Analysis" | JASIS 1990 |
| NLI | Williams et al. "MultiNLI" | NAACL 2018 |
Access all citations programmatically:
```python
evaluator.list_citations()  # Returns dict of all citations
```
Architecture
```text
SSEM/
├── __init__.py      # Package exports
├── evaluator.py     # SSEM unified class — main entry point
├── core.py          # EmbeddingEngine, MetricResult, Citation, BaseMetric
├── semantic.py      # SemanticSimilarity, BERTScore
├── faithfulness.py  # Faithfulness, Hallucination
├── relevancy.py     # AnswerRelevancy
├── agentic.py       # ReasoningCoherence, ToolCallAccuracy, TaskCompletion
├── consistency.py   # MultiTurnConsistency, SelfCheckConsistency
├── safety.py        # Toxicity
├── code_eval.py     # CodeCorrectness (Pass@k)
├── report.py        # EvaluationReport
└── SSEM.py          # Legacy v1 (backward compatible)
```
Parameters
SSEM Constructor
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | `"bert-base-multilingual-cased"` | Any Hugging Face model name |
| `device` | `str` | Auto-detected | `"cpu"`, `"cuda"`, or `"mps"` |
Common Method Parameters
All methods accept the inputs specific to their metric and return a `MetricResult`. See `evaluator.list_metrics()` for the full list of available metrics.
Backward Compatibility
The original v1 API still works:
```python
from SSEM import SemanticSimilarity

ssem = SemanticSimilarity(model_name='bert-base-multilingual-cased', metric='cosine')
score = ssem.evaluate(output_sentences, reference_sentences, level='sentence', output_format='mean')
```
Core Contributors
Nilesh Verma · Website · GitHub
Citation
If you use SSEM in your research, please cite:
```bibtex
@misc{SSEM,
  author       = {Nilesh Verma},
  title        = {SSEM: Standardized Scoring and Evaluation Metrics},
  year         = {2023},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/TechyNilesh/SSEM}}
}
```
License
SSEM is released under the MIT License.