Open-source AI evaluation toolkit — hallucination detection, safety, industry-specific evals
syncreus-eval
Open-source AI evaluation toolkit that shows you exactly which claims are hallucinated and why. Not just a score -- per-claim forensics with evidence.
Quick Start
pip install syncreus-eval
from syncreus_eval import check
result = check(
    "Lisinopril is an ACE inhibitor used for hypertension. "
    "It should not be combined with potassium-sparing diuretics. "
    "The standard starting dose is 20mg daily.",
    context="Lisinopril is an ACE inhibitor indicated for hypertension. "
            "Concurrent use with potassium-sparing diuretics increases hyperkalemia risk. "
            "The recommended starting dose is 10mg once daily.",
)
print(result)
What You Get
Every AI output is decomposed into individual factual claims. Each claim gets a verdict and an evidence quote from your reference context:
CheckResult(passed=False, score=0.67, claims=3)
x UNSUPPORTED 'The standard starting dose is 20mg daily'
~ AMBIGUOUS 'It should not be combined with potassium-sparing diuretics'
v SUPPORTED 'Lisinopril is an ACE inhibitor used for hypertension'
The CheckResult object is designed for programmatic use:
result.passed # False -- at least one unsupported claim
result.score # 0.67 -- fraction of claims that are supported
result.claims # list[ClaimVerdict] -- per-claim details
result.unsupported # filtered list of unsupported claims only
result.supported # filtered list of supported claims only
if result:  # CheckResult is falsy when hallucinations are found
    deploy()

for claim in result.unsupported:
    print(claim.claim)     # "The standard starting dose is 20mg daily"
    print(claim.verdict)   # "UNSUPPORTED"
    print(claim.evidence)  # "not found" or a quote from context
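The truthiness behaviour above can be pictured with a small stand-in. This is a sketch, not the library's code: `FakeClaim` and `FakeCheckResult` are hypothetical names, and the only assumption taken from the comments above is that one unsupported claim is enough to fail the check.

```python
from dataclasses import dataclass, field


@dataclass
class FakeClaim:
    """Hypothetical stand-in for one per-claim verdict."""
    claim: str
    verdict: str  # "SUPPORTED", "AMBIGUOUS", or "UNSUPPORTED"
    evidence: str = "not found"


@dataclass
class FakeCheckResult:
    """Hypothetical stand-in that mimics the documented attributes."""
    claims: list[FakeClaim] = field(default_factory=list)

    @property
    def unsupported(self) -> list[FakeClaim]:
        return [c for c in self.claims if c.verdict == "UNSUPPORTED"]

    @property
    def supported(self) -> list[FakeClaim]:
        return [c for c in self.claims if c.verdict == "SUPPORTED"]

    @property
    def passed(self) -> bool:
        # Assumption: a single unsupported claim fails the whole check.
        return not self.unsupported

    def __bool__(self) -> bool:
        # Lets callers write `if result:` as a deployment gate.
        return self.passed


demo = FakeCheckResult([
    FakeClaim("Lisinopril is an ACE inhibitor used for hypertension", "SUPPORTED"),
    FakeClaim("It should not be combined with potassium-sparing diuretics", "AMBIGUOUS"),
    FakeClaim("The standard starting dose is 20mg daily", "UNSUPPORTED"),
])

# The object itself acts as the gate, so no explicit `.passed` check is needed.
gate = "deploy" if demo else "block"
```

Defining `__bool__` keeps CI gates short, at the cost that `if result:` hides which claims failed; iterate over `unsupported` when you need the detail.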
How It Works
A 3-tier pipeline calibrated across 109 ML experiments:
- Claim decomposition -- LLM extracts every atomic factual claim from the AI output
- NLI pre-filter -- Natural language inference catches obvious mismatches cheaply
- Chain-of-thought judge -- Gemini 2.5 Flash reasons through ambiguous cases with structured JSON output
Embedding-based methods (cosine similarity, BERTScore) fail on modern RLHF-aligned hallucinations because such hallucinations are semantically indistinguishable from the truth at the vector level. Claim decomposition plus reasoning-based judgment is the approach that holds up.
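The routing between the three tiers can be sketched with toy stand-ins. Everything here is an assumption made for illustration: sentence splitting replaces the LLM decomposition, token overlap replaces a real NLI model, the `low` and `high` thresholds are invented, and the judge is a stub rather than a Gemini call. Only the control flow (cheap filter first, judge for the unclear middle) reflects the description above.

```python
import re
from typing import Callable

Judge = Callable[[str, str], str]


def decompose(text: str) -> list[str]:
    """Toy stand-in for LLM claim decomposition: one claim per sentence."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def token_overlap(claim: str, context: str) -> float:
    """Toy stand-in for an NLI score: share of claim tokens found in context."""
    claim_tokens = set(re.findall(r"[a-z0-9]+", claim.lower()))
    context_tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & context_tokens) / len(claim_tokens)


def run_pipeline(output: str, context: str, judge: Judge,
                 low: float = 0.3, high: float = 0.9) -> list[tuple[str, str]]:
    """Decide clear cases with the cheap filter; send the rest to the judge."""
    verdicts = []
    for claim in decompose(output):
        overlap = token_overlap(claim, context)
        if overlap >= high:
            verdict = "SUPPORTED"
        elif overlap <= low:
            verdict = "UNSUPPORTED"
        else:
            verdict = judge(claim, context)  # expensive tier, ambiguous cases only
        verdicts.append((claim, verdict))
    return verdicts


judge_calls: list[str] = []


def stub_judge(claim: str, context: str) -> str:
    """Hypothetical judge that records what it was asked instead of calling an LLM."""
    judge_calls.append(claim)
    return "AMBIGUOUS"


verdicts = run_pipeline(
    "Paris is the capital of France. Bananas grow underwater. "
    "Paris is the largest city in Europe.",
    context="Paris is the capital and largest city of France.",
    judge=stub_judge,
)
```

In this run only the third sentence reaches the judge, which is the point of the pre-filter: the expensive tier is reserved for claims the cheap tier cannot settle.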
pytest Integration
syncreus-eval registers a pytest plugin automatically. Use the syncreus fixture to gate CI on evaluation thresholds:
# test_my_chatbot.py

def test_no_hallucination(syncreus):
    """Fails the test if any unsupported claims are found."""
    syncreus.check(
        "Paris is the capital of France.",
        context="Paris is the capital and largest city of France.",
    )


def test_medical_accuracy(syncreus):
    """Run a domain-specific evaluator and assert it passes."""
    syncreus.assert_eval(
        "healthcare",
        ai_input="Patient records and clinical guidelines...",
        ai_output="The recommended dosage is 10mg daily...",
    )


def test_custom_threshold(syncreus):
    """Run without asserting -- apply your own logic."""
    result = syncreus.run("hallucination", ai_input="...", ai_output="...")
    assert result.score >= 0.9, f"Score too low: {result.score}"
Set a global score threshold:
pytest --syncreus-threshold=0.95
Upload results to the Syncreus platform:
SYNCREUS_API_KEY=syn_... pytest --syncreus-upload
The terminal summary shows evaluation results alongside your test output:
======== Syncreus Evaluation Summary ========
3 passed, 1 failed, 0 errors (4 total evals)
FAILED evals:
test_chatbot.py::test_rag_accuracy [hallucination] score=0.67
All Evaluators
General Purpose
| Evaluator | What it checks | Requires |
|---|---|---|
| HALLUCINATION | Unsupported factual claims against reference context | Gemini API key |
| ACCURACY | Golden dataset comparison via semantic similarity | [accuracy] extra |
| CONSISTENCY | Pairwise similarity across repeated prompts | [accuracy] extra |
| PERFORMANCE | Latency, token counts, cost metrics from trace data | Nothing |
| AGENT_TASK | Whether an agent's completion claim matches reality | Gemini API key |
| REGRESSION | Baseline comparison against previous runs | Syncreus platform |
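To make "pairwise similarity across repeated prompts" concrete, here is a minimal sketch of a mean pairwise cosine score. It is not the CONSISTENCY evaluator's implementation: the vectors are hand-written placeholders for whatever embeddings the `[accuracy]` extra would produce, and returning 1.0 for fewer than two responses is an assumption.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(vectors: list[list[float]]) -> float:
    """Average cosine similarity over every unordered pair of responses."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # assumption: a single response is trivially consistent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Placeholder embeddings for three answers to the same prompt:
# two agree with each other, one points in an unrelated direction.
responses = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
consistency = mean_pairwise_similarity(responses)
```

One outlier among three responses drags the mean down sharply because it takes part in two of the three pairs.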
Safety and Compliance
| Evaluator | What it checks | Requires |
|---|---|---|
| SAFETY | PII/sensitive data detection + content safety | [safety] extra |
| BIAS | Demographic parity / EEOC four-fifths rule | Nothing |
| IDEOLOGY | Political neutrality (OMB M-26-04) | Gemini API key |
| PROMPT_INJECTION | Injection attempt detection | [prompt-injection] extra |
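For readers unfamiliar with the four-fifths rule named in the BIAS row: a group is flagged when its selection rate falls below 80% of the highest group's rate. The sketch below works through that arithmetic. The function name, the input shape, and the report keys are hypothetical and are not taken from the BIAS evaluator.

```python
def four_fifths_check(selected: dict[str, int], total: dict[str, int]) -> dict[str, dict]:
    """Compare each group's selection rate with the best-performing group's rate."""
    rates = {}
    for group, n in total.items():
        if n <= 0:
            raise ValueError(f"group {group!r} has no members")
        rates[group] = selected.get(group, 0) / n
    if not rates:
        return {}

    best = max(rates.values())
    report = {}
    for group, rate in rates.items():
        # If nobody was selected in any group, treat the ratios as equal.
        ratio = rate / best if best > 0 else 1.0
        report[group] = {
            "rate": rate,
            "impact_ratio": ratio,
            "flagged": ratio < 0.8,  # the four-fifths threshold
        }
    return report


# Illustrative numbers only: 48 of 80 selected versus 12 of 40 selected.
report = four_fifths_check(
    selected={"group_a": 48, "group_b": 12},
    total={"group_a": 80, "group_b": 40},
)
```

Here `group_b` is selected at half the rate of `group_a`, well under the 0.8 ratio, so it is the one flagged.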
Industry-Specific
| Evaluator | What it checks | Requires |
|---|---|---|
| HEALTHCARE | Medical accuracy, drug safety, PHI detection | Gemini API key |
| LEGAL | Citation validity, holding fidelity, fabricated case law | Gemini API key |
| FINANCE | Regulatory accuracy, numerical precision, fabricated data | Gemini API key |
| CODE_ACCURACY | API existence, function signatures, package validity | Gemini API key |
Each industry evaluator returns domain-specific claim types. For example, the healthcare evaluator categorizes claims as drug_interaction, dosage, contraindication, diagnosis, treatment, terminology, or phi_leak -- each with a severity level (critical, major, minor).
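One way to work with typed, severity-tagged findings is to tally them before deciding whether to block a release. In this sketch the claim types and severity levels come from the paragraph above, but the finding dictionaries (`claim_type`, `severity` keys) and the helper name are assumptions about shape, not the evaluator's documented output.

```python
from collections import Counter
from enum import Enum


class Severity(str, Enum):
    CRITICAL = "critical"
    MAJOR = "major"
    MINOR = "minor"


HEALTHCARE_CLAIM_TYPES = {
    "drug_interaction", "dosage", "contraindication",
    "diagnosis", "treatment", "terminology", "phi_leak",
}


def count_by_severity(findings: list[dict]) -> Counter:
    """Tally findings per severity, rejecting unknown claim types early."""
    counts: Counter = Counter()
    for finding in findings:
        if finding["claim_type"] not in HEALTHCARE_CLAIM_TYPES:
            raise ValueError(f"unknown claim type: {finding['claim_type']}")
        counts[Severity(finding["severity"])] += 1
    return counts


# Hypothetical findings, shaped only for this illustration.
findings = [
    {"claim_type": "dosage", "severity": "critical"},
    {"claim_type": "terminology", "severity": "minor"},
    {"claim_type": "phi_leak", "severity": "critical"},
]
counts = count_by_severity(findings)
block_release = counts[Severity.CRITICAL] > 0
```

Gating on the critical count alone keeps minor terminology findings from failing a build while still stopping dosage or PHI problems.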
Full Evaluator API
For evaluators beyond hallucination, use the evaluate() function:
from syncreus_eval import evaluate, EvalType
# Healthcare: drug safety and PHI detection
result = evaluate(
    EvalType.HEALTHCARE,
    ai_input="Clinical documentation and drug references...",
    ai_output="The AI assistant's medical response...",
)
print(result.passed)                     # True/False/None
print(result.details["critical_count"])  # number of critical findings
print(result.details["phi_detected"])    # whether PHI was leaked

# Legal: citation verification
result = evaluate(
    EvalType.LEGAL,
    ai_input="Case law and statute text...",
    ai_output="The court held in Smith v. Jones (2024)...",
)

# Run multiple evaluators at once
results = evaluate(
    [EvalType.HALLUCINATION, EvalType.SAFETY, EvalType.IDEOLOGY],
    ai_input="Context here",
    ai_output="Response here",
)
for r in results:
    print(f"{r.eval_type.value}: passed={r.passed}")
The EvalResult returned by evaluate():
class EvalResult:
    eval_type: EvalType
    passed: bool | None        # True/False/None (None = error or skipped)
    score: float | None        # Numeric score where applicable
    details: dict[str, Any]    # Evaluator-specific details
    error: bool                # Whether an error occurred
    error_message: str | None  # Error description
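Because `passed` has three states, a plain `if not result.passed` treats an errored or skipped evaluation as a failure. The sketch below shows one way to keep the three outcomes apart. `ResultStub` is a hypothetical stand-in with only the fields needed here, and the bucket names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResultStub:
    """Hypothetical stand-in carrying only the tri-state `passed` field."""
    name: str
    passed: Optional[bool]  # None = error or skipped, per the class above


def summarize(results: list[ResultStub]) -> dict[str, int]:
    """Count outcomes with identity checks so None is never read as a failure."""
    summary = {"passed": 0, "failed": 0, "inconclusive": 0}
    for r in results:
        if r.passed is True:
            summary["passed"] += 1
        elif r.passed is False:
            summary["failed"] += 1
        else:
            summary["inconclusive"] += 1
    return summary


summary = summarize([
    ResultStub("hallucination", True),
    ResultStub("safety", False),
    ResultStub("ideology", None),
])

# A naive truthiness check would report two failures here instead of one.
naive_failures = sum(1 for r in [True, False, None] if not r)
```

Whether an inconclusive result should fail a build is a policy choice; keeping it in its own bucket at least makes that choice explicit.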
Installation
# Core (hallucination detection, industry evaluators via Gemini)
pip install syncreus-eval
# With optional extras
pip install "syncreus-eval[accuracy]"          # fastembed for semantic similarity
pip install "syncreus-eval[safety]"            # Presidio PII scanning
pip install "syncreus-eval[prompt-injection]"  # LLM Guard injection detection
pip install "syncreus-eval[upload]"            # Upload results to Syncreus platform
pip install "syncreus-eval[all]"               # Everything
Requires Python 3.10+.
Configuration
The LLM-as-judge evaluators (hallucination, healthcare, legal, finance, code, ideology, agent task) require a Google Gemini API key. The free tier works.
Set it as an environment variable:
export GEMINI_API_KEY=your-key-here
Or pass it directly:
result = check(output, context=doc, gemini_key="your-key-here")
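If your own wrapper needs to accept a key from either source, the lookup can be sketched as follows. The precedence shown (explicit argument first, then `GEMINI_API_KEY`) is an assumption, the README does not state which one wins, and `resolve_gemini_key` is a hypothetical helper rather than part of the package.

```python
import os
from typing import Mapping, Optional


def resolve_gemini_key(explicit: Optional[str] = None,
                       env: Optional[Mapping[str, str]] = None) -> str:
    """Return the explicit key if given, otherwise fall back to the environment."""
    env = os.environ if env is None else env
    key = explicit or env.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError(
            "No Gemini key found: pass one explicitly or set GEMINI_API_KEY"
        )
    return key


# Passing a dict instead of os.environ keeps the example free of side effects.
from_env = resolve_gemini_key(env={"GEMINI_API_KEY": "env-key"})
from_arg = resolve_gemini_key("arg-key", env={"GEMINI_API_KEY": "env-key"})
```

Failing fast with a clear message is usually friendlier than letting a missing key surface later as an authentication error from the API.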
Upload Results (Optional)
Send evaluation results to the Syncreus platform for dashboards, trend tracking, and regression detection:
from syncreus_eval import upload_results
upload_results(
    results=result,        # EvalResult or list
    api_key="syn_...",     # Syncreus API key
    endpoint="https://api.syncreus.com",
    trace_id="trace-123",  # optional
)
Requires: pip install "syncreus-eval[upload]"
License
MIT