Open-source AI evaluation toolkit — hallucination detection, safety, industry-specific evals

syncreus-eval

Open-source AI evaluation toolkit that shows you exactly which claims are hallucinated and why. Not just a score -- per-claim forensics with evidence.

Quick Start

pip install syncreus-eval

from syncreus_eval import check

result = check(
    "Lisinopril is an ACE inhibitor used for hypertension. "
    "It should not be combined with potassium-sparing diuretics. "
    "The standard starting dose is 20mg daily.",
    context="Lisinopril is an ACE inhibitor indicated for hypertension. "
    "Concurrent use with potassium-sparing diuretics increases hyperkalemia risk. "
    "The recommended starting dose is 10mg once daily."
)
print(result)

What You Get

Every AI output is decomposed into individual factual claims. Each claim gets a verdict and an evidence quote from your reference context:

CheckResult(passed=False, score=0.67, claims=3)
  x UNSUPPORTED  'The standard starting dose is 20mg daily'
  ~ AMBIGUOUS    'It should not be combined with potassium-sparing diuretics'
  v SUPPORTED    'Lisinopril is an ACE inhibitor used for hypertension'

The CheckResult object is designed for programmatic use:

result.passed        # False -- at least one unsupported claim
result.score         # 0.67 -- here 2 of the 3 claims are not UNSUPPORTED
result.claims        # list[ClaimVerdict] -- per-claim details
result.unsupported   # filtered list of unsupported claims only
result.supported     # filtered list of supported claims only

if result:           # CheckResult is falsy when hallucinations are found
    deploy()

for claim in result.unsupported:
    print(claim.claim)     # "The standard starting dose is 20mg daily"
    print(claim.verdict)   # "UNSUPPORTED"
    print(claim.evidence)  # "not found" or a quote from context
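
As a minimal sketch of how these attributes might feed a CI report, the helper below groups claims by verdict. It relies only on the `.claim` and `.verdict` attributes shown above; `summarize_verdicts` and the `SimpleNamespace` stand-ins are hypothetical and not part of syncreus-eval.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_verdicts(claims):
    """Count claims per verdict label (e.g. SUPPORTED, UNSUPPORTED)."""
    return Counter(c.verdict for c in claims)


# Stand-in objects mimicking the ClaimVerdict attributes shown above.
demo_claims = [
    SimpleNamespace(claim="The standard starting dose is 20mg daily",
                    verdict="UNSUPPORTED"),
    SimpleNamespace(claim="It should not be combined with potassium-sparing diuretics",
                    verdict="AMBIGUOUS"),
    SimpleNamespace(claim="Lisinopril is an ACE inhibitor used for hypertension",
                    verdict="SUPPORTED"),
]

summary = summarize_verdicts(demo_claims)
for verdict, count in sorted(summary.items()):
    print(f"{verdict}: {count}")
```

In real use, `result.claims` from `check()` would presumably take the place of `demo_claims`.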

How It Works

A 3-tier pipeline calibrated across 109 ML experiments:

1. Claim decomposition -- LLM extracts every atomic factual claim from the AI output
2. NLI pre-filter -- Natural language inference catches obvious mismatches cheaply
3. Chain-of-thought judge -- Gemini 2.5 Flash reasons through ambiguous cases with structured JSON output

Embedding-based methods (cosine similarity, BERTScore) fail on hallucinations from modern RLHF-aligned models because a hallucinated claim can be semantically indistinguishable from a true one at the vector level. In the Quick Start example, the wrong dose differs from the reference by a single number. Claim decomposition + reasoning-based judgment is the approach that holds up.
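
The control flow of the three tiers can be sketched as follows. This is an illustration only, not the library's implementation: sentence splitting stands in for LLM claim decomposition, token overlap stands in for NLI, and `judge` is a stub where an LLM call would go. All function names and thresholds here are hypothetical.

```python
import re


def decompose(text):
    """Tier 1 stand-in: split into sentences (the real tier uses an LLM)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def prefilter(claim, context, low=0.3, high=0.9):
    """Tier 2 stand-in: token overlap instead of NLI.

    Returns a verdict for clear-cut cases, or None to escalate to the judge.
    """
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return None
    overlap = len(claim_tokens & _tokens(context)) / len(claim_tokens)
    if overlap >= high:
        return "SUPPORTED"
    if overlap <= low:
        return "UNSUPPORTED"
    return None


def judge(claim, context):
    """Tier 3 stand-in: a real judge would call an LLM here."""
    return "AMBIGUOUS"


def run_pipeline(output, context):
    """Run every claim through the cheap filter, escalating unclear ones."""
    verdicts = []
    for claim in decompose(output):
        verdict = prefilter(claim, context)
        if verdict is None:
            verdict = judge(claim, context)
        verdicts.append((claim, verdict))
    return verdicts


demo = run_pipeline(
    "Paris is the capital of France. Bananas grow on Mars.",
    context="Paris is the capital and largest city of France.",
)
for claim, verdict in demo:
    print(verdict, "-", claim)
```

The point of the layering is cost: only claims the cheap tier cannot settle reach the expensive judge. Token overlap is far too weak for real use and is here only to keep the sketch runnable without model dependencies.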

pytest Integration

syncreus-eval registers a pytest plugin automatically. Use the syncreus fixture to gate CI on evaluation thresholds:

# test_my_chatbot.py

def test_no_hallucination(syncreus):
    """Fails the test if any unsupported claims are found."""
    syncreus.check(
        "Paris is the capital of France.",
        context="Paris is the capital and largest city of France.",
    )

def test_medical_accuracy(syncreus):
    """Run a domain-specific evaluator and assert it passes."""
    syncreus.assert_eval(
        "healthcare",
        ai_input="Patient records and clinical guidelines...",
        ai_output="The recommended dosage is 10mg daily...",
    )

def test_custom_threshold(syncreus):
    """Run without asserting -- apply your own logic."""
    result = syncreus.run("hallucination", ai_input="...", ai_output="...")
    assert result.score >= 0.9, f"Score too low: {result.score}"

Set a global score threshold:

pytest --syncreus-threshold=0.95

Upload results to the Syncreus platform:

SYNCREUS_API_KEY=syn_... pytest --syncreus-upload

The terminal summary shows evaluation results alongside your test output:

======== Syncreus Evaluation Summary ========
  3 passed, 1 failed, 0 errors (4 total evals)
  FAILED evals:
    test_chatbot.py::test_rag_accuracy [hallucination] score=0.67

All Evaluators

General Purpose

Evaluator	What it checks	Requires
`HALLUCINATION`	Unsupported factual claims against reference context	Gemini API key
`ACCURACY`	Golden dataset comparison via semantic similarity	`[accuracy]` extra
`CONSISTENCY`	Pairwise similarity across repeated prompts	`[accuracy]` extra
`PERFORMANCE`	Latency, token counts, cost metrics from trace data	Nothing
`AGENT_TASK`	Whether an agent's completion claim matches reality	Gemini API key
`REGRESSION`	Baseline comparison against previous runs	Syncreus platform
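
To make "pairwise similarity across repeated prompts" concrete, here is a rough sketch that averages cosine similarity over every pair of responses. It uses bag-of-words counts as a stand-in for the embeddings the `[accuracy]` extra would provide; `cosine` and `consistency_score` are hypothetical names, not the library's API.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def consistency_score(responses):
    """Mean pairwise similarity across responses to the same prompt."""
    vectors = [Counter(r.lower().split()) for r in responses]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # a single response is trivially consistent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


score = consistency_score([
    "The capital of France is Paris",
    "Paris is the capital of France",
    "The capital of France is Lyon",
])
print(round(score, 2))
```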

Safety and Compliance

Evaluator	What it checks	Requires
`SAFETY`	PII/sensitive data detection + content safety	`[safety]` extra
`BIAS`	Demographic parity / EEOC four-fifths rule	Nothing
`IDEOLOGY`	Political neutrality (OMB M-26-04)	Gemini API key
`PROMPT_INJECTION`	Injection attempt detection	`[prompt-injection]` extra
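
For readers unfamiliar with the four-fifths rule referenced in the `BIAS` row: each group's selection rate is divided by the highest group's rate, and a ratio below 0.8 is commonly treated as evidence of adverse impact. The sketch below works through that arithmetic. It is not the evaluator's code; `four_fifths_check` and the group counts are made up for illustration, and every group is assumed to have a non-zero total.

```python
def four_fifths_check(selected, total, threshold=0.8):
    """Return per-group impact ratios and whether all meet the threshold.

    selected / total: dicts mapping group name -> counts.
    """
    rates = {g: selected[g] / total[g] for g in total}
    best = max(rates.values())
    ratios = {g: (r / best if best else 1.0) for g, r in rates.items()}
    return ratios, all(r >= threshold for r in ratios.values())


ratios, ok = four_fifths_check(
    selected={"group_a": 48, "group_b": 30},
    total={"group_a": 80, "group_b": 75},
)
# group_a rate = 48/80 = 0.60, group_b rate = 30/75 = 0.40
# group_b ratio = 0.40 / 0.60, which is below 0.8, so the check fails
print(ratios, ok)
```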

Industry-Specific

Evaluator	What it checks	Requires
`HEALTHCARE`	Medical accuracy, drug safety, PHI detection	Gemini API key
`LEGAL`	Citation validity, holding fidelity, fabricated case law	Gemini API key
`FINANCE`	Regulatory accuracy, numerical precision, fabricated data	Gemini API key
`CODE_ACCURACY`	API existence, function signatures, package validity	Gemini API key

Each industry evaluator returns domain-specific claim types. For example, the healthcare evaluator categorizes claims as drug_interaction, dosage, contraindication, diagnosis, treatment, terminology, or phi_leak -- each with a severity level (critical, major, minor).
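
A small sketch of how findings with these severity levels might be tallied. The list-of-dicts shape and the `type` / `severity` keys are assumptions made for illustration; the source only documents the category and severity names, not how findings are structured.

```python
from collections import Counter

SEVERITY_ORDER = ["critical", "major", "minor"]


def severity_counts(findings):
    """Count findings per severity level, including levels with zero hits."""
    counts = Counter(f["severity"] for f in findings)
    return {level: counts.get(level, 0) for level in SEVERITY_ORDER}


# Hypothetical findings using the claim types and severities named above.
demo_findings = [
    {"type": "dosage", "severity": "critical"},
    {"type": "terminology", "severity": "minor"},
    {"type": "drug_interaction", "severity": "critical"},
]

counts = severity_counts(demo_findings)
print(counts)
```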

Full Evaluator API

For evaluators beyond hallucination, use the evaluate() function:

from syncreus_eval import evaluate, EvalType

# Healthcare: drug safety and PHI detection
result = evaluate(
    EvalType.HEALTHCARE,
    ai_input="Clinical documentation and drug references...",
    ai_output="The AI assistant's medical response...",
)
print(result.passed)                     # True/False/None
print(result.details["critical_count"])  # number of critical findings
print(result.details["phi_detected"])    # whether PHI was leaked

# Legal: citation verification
result = evaluate(
    EvalType.LEGAL,
    ai_input="Case law and statute text...",
    ai_output="The court held in Smith v. Jones (2024)...",
)

# Run multiple evaluators at once
results = evaluate(
    [EvalType.HALLUCINATION, EvalType.SAFETY, EvalType.IDEOLOGY],
    ai_input="Context here",
    ai_output="Response here",
)
for r in results:
    print(f"{r.eval_type.value}: passed={r.passed}")

The EvalResult returned by evaluate():

class EvalResult:
    eval_type: EvalType
    passed: bool | None      # True/False/None (None = error or skipped)
    score: float | None       # Numeric score where applicable
    details: dict[str, Any]   # Evaluator-specific details
    error: bool               # Whether an error occurred
    error_message: str | None # Error description
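
Because `passed` is tri-state, a plain `if not result.passed` would lump errors and skips together with genuine failures. The sketch below separates the three cases. `FakeResult` is a stand-in that mirrors only the `passed`, `error`, and `error_message` fields listed above, and `triage` is a hypothetical helper, not part of the package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResult:
    """Stand-in mirroring a few of the EvalResult fields listed above."""
    name: str
    passed: Optional[bool]
    error: bool = False
    error_message: Optional[str] = None


def triage(results):
    """Bucket results by the three possible values of `passed`."""
    buckets = {"passed": [], "failed": [], "inconclusive": []}
    for r in results:
        if r.passed is True:
            buckets["passed"].append(r.name)
        elif r.passed is False:
            buckets["failed"].append(r.name)
        else:  # None: error or skipped
            buckets["inconclusive"].append(r.name)
    return buckets


buckets = triage([
    FakeResult("hallucination", True),
    FakeResult("safety", False),
    FakeResult("ideology", None, error=True, error_message="timeout"),
])
print(buckets)
```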

Installation

# Core (hallucination detection, industry evaluators via Gemini)
pip install syncreus-eval

# With optional extras
# Quote the requirement so shells such as zsh do not expand the brackets
pip install "syncreus-eval[accuracy]"          # fastembed for semantic similarity
pip install "syncreus-eval[safety]"            # Presidio PII scanning
pip install "syncreus-eval[prompt-injection]"  # LLM Guard injection detection
pip install "syncreus-eval[upload]"            # Upload results to Syncreus platform
pip install "syncreus-eval[all]"               # Everything

Requires Python 3.10+.

Configuration

The LLM-as-judge evaluators (hallucination, healthcare, legal, finance, code accuracy, ideology, agent task) require a Google Gemini API key. The free tier works.

Set it as an environment variable:

export GEMINI_API_KEY=your-key-here

Or pass it directly:

result = check(output, context=doc, gemini_key="your-key-here")
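
If you wrap these calls in your own code, a helper along the following lines can surface a missing key early. This is a sketch under an assumption: it gives an explicitly passed key precedence over `GEMINI_API_KEY`, which the source does not state. `resolve_gemini_key` is a hypothetical name.

```python
import os


def resolve_gemini_key(explicit_key=None, env=None):
    """Pick the explicit key if given, else fall back to GEMINI_API_KEY.

    The precedence order is an assumption, not documented behavior.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("Set GEMINI_API_KEY or pass a key explicitly")
    return key


# The env mapping is injectable so the helper can be tested without
# touching the real environment.
print(resolve_gemini_key(env={"GEMINI_API_KEY": "your-key-here"}))
```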

Upload Results (Optional)

Send evaluation results to the Syncreus platform for dashboards, trend tracking, and regression detection:

from syncreus_eval import upload_results

upload_results(
    results=result,           # EvalResult or list
    api_key="syn_...",        # Syncreus API key
    endpoint="https://api.syncreus.com",
    trace_id="trace-123",     # optional
)

Requires: pip install "syncreus-eval[upload]"

License

MIT

Release history

0.2.0 (this version) -- Mar 30, 2026
0.1.0 -- Mar 29, 2026

Download files

Source Distribution

syncreus_eval-0.2.0.tar.gz (43.6 kB), uploaded Mar 30, 2026

Built Distribution

syncreus_eval-0.2.0-py3-none-any.whl (40.2 kB), uploaded Mar 30, 2026, Python 3

File details

Details for the file syncreus_eval-0.2.0.tar.gz.

File metadata

Download URL: syncreus_eval-0.2.0.tar.gz
Upload date: Mar 30, 2026
Size: 43.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for syncreus_eval-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`9b8e90767c801467d997c1a55646c563abea2e7244738afa43b8fde3e87841b9`
MD5	`2ad860228dda735e199410a71e6cc91e`
BLAKE2b-256	`6a7e9736674df5bfa7daf088428b4215925cb1150efcbf21064d27c4ab27f6ba`

File details

Details for the file syncreus_eval-0.2.0-py3-none-any.whl.

File metadata

Download URL: syncreus_eval-0.2.0-py3-none-any.whl
Upload date: Mar 30, 2026
Size: 40.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for syncreus_eval-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0e2cd712411c946cecfd0f66415e03984adaa4b1657318d4482c821f535407e1`
MD5	`5891bb080f87b4eaa97ca2ca77ae9554`
BLAKE2b-256	`41f06222503f276c5dd7829b11993b03595ecd7b30b7ee821fbf6b01129bdc83`
