Open-source AI evaluation toolkit — hallucination detection, safety, industry-specific evals

syncreus-eval

Open-source AI evaluation toolkit that shows you exactly which claims are hallucinated and why. Not just a score -- per-claim forensics with evidence.

Quick Start

pip install syncreus-eval

from syncreus_eval import check

result = check(
    "Lisinopril is an ACE inhibitor used for hypertension. "
    "It should not be combined with potassium-sparing diuretics. "
    "The standard starting dose is 20mg daily.",
    context="Lisinopril is an ACE inhibitor indicated for hypertension. "
    "Concurrent use with potassium-sparing diuretics increases hyperkalemia risk. "
    "The recommended starting dose is 10mg once daily."
)
print(result)

What You Get

Every AI output is decomposed into individual factual claims. Each claim gets a verdict and an evidence quote from your reference context:

CheckResult(passed=False, score=0.67, claims=3)
  x UNSUPPORTED  'The standard starting dose is 20mg daily'
  ~ AMBIGUOUS    'It should not be combined with potassium-sparing diuretics'
  v SUPPORTED    'Lisinopril is an ACE inhibitor used for hypertension'

The CheckResult object is designed for programmatic use:

result.passed        # False -- at least one unsupported claim
result.score         # 0.67 -- fraction of claims not flagged unsupported (2 of 3 here)
result.claims        # list[ClaimVerdict] -- per-claim details
result.unsupported   # filtered list of unsupported claims only
result.supported     # filtered list of supported claims only

if result:           # CheckResult is falsy when hallucinations are found
    deploy()

for claim in result.unsupported:
    print(claim.claim)     # "The standard starting dose is 20mg daily"
    print(claim.verdict)   # "UNSUPPORTED"
    print(claim.evidence)  # "not found" or a quote from context
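
The behaviour described above (truthiness, filtered lists, a fractional score) can be pictured with a small stand-in class. This is only a sketch: CheckResultSketch and its scoring rule are assumptions made for illustration, not the library's implementation. The rule used here, the share of claims not flagged UNSUPPORTED, was chosen because it reproduces the 0.67 in the sample output.

```python
from dataclasses import dataclass, field


@dataclass
class ClaimVerdict:
    claim: str
    verdict: str                 # "SUPPORTED", "AMBIGUOUS", or "UNSUPPORTED"
    evidence: str = "not found"


@dataclass
class CheckResultSketch:
    """Hypothetical stand-in for illustration; not the real CheckResult."""
    claims: list = field(default_factory=list)

    @property
    def unsupported(self):
        return [c for c in self.claims if c.verdict == "UNSUPPORTED"]

    @property
    def supported(self):
        return [c for c in self.claims if c.verdict == "SUPPORTED"]

    @property
    def score(self) -> float:
        # Assumption: score is the share of claims not flagged UNSUPPORTED.
        if not self.claims:
            return 1.0
        return round(1 - len(self.unsupported) / len(self.claims), 2)

    @property
    def passed(self) -> bool:
        return not self.unsupported

    def __bool__(self) -> bool:
        # Lets callers write `if result:` as a deployment gate.
        return self.passed


result_sketch = CheckResultSketch([
    ClaimVerdict("The standard starting dose is 20mg daily", "UNSUPPORTED"),
    ClaimVerdict("It should not be combined with potassium-sparing diuretics", "AMBIGUOUS"),
    ClaimVerdict("Lisinopril is an ACE inhibitor used for hypertension", "SUPPORTED"),
])
print(result_sketch.passed, result_sketch.score)
```

Defining __bool__ in terms of passed is what keeps `if result:` and `result.passed` from ever disagreeing.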

How It Works

A 3-tier pipeline calibrated across 109 ML experiments:

  1. Claim decomposition -- LLM extracts every atomic factual claim from the AI output
  2. NLI pre-filter -- Natural language inference catches obvious mismatches cheaply
  3. Chain-of-thought judge -- Gemini 2.5 Flash reasons through ambiguous cases with structured JSON output
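
The control flow of the three stages can be sketched as follows. Every function here is a placeholder written for this example: decompose() splits on sentence boundaries instead of calling an LLM, cheap_filter() uses substring matching instead of an NLI model, and the judge is passed in as a callable. The only point is to show how a cheap stage can settle easy claims so the expensive judge sees just the remainder.

```python
import re
from typing import Callable, Optional


def decompose(text: str) -> list:
    # Stage 1 placeholder: the real stage uses an LLM to extract atomic claims.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cheap_filter(claim: str, context: str) -> Optional[str]:
    # Stage 2 placeholder: return a verdict when the case is obvious,
    # or None to escalate. A real NLI model would replace this check.
    if claim.rstrip(".").lower() in context.lower():
        return "SUPPORTED"
    return None


def run_pipeline(output: str, context: str,
                 judge: Callable[[str, str], str]) -> list:
    verdicts = []
    for claim in decompose(output):
        verdict = cheap_filter(claim, context)
        if verdict is None:
            # Stage 3: only unresolved claims reach the expensive judge.
            verdict = judge(claim, context)
        verdicts.append((claim, verdict))
    return verdicts


judge_calls = []

def fake_judge(claim: str, context: str) -> str:
    # Stand-in for the LLM judge; records which claims were escalated.
    judge_calls.append(claim)
    return "UNSUPPORTED"


verdicts = run_pipeline(
    "Paris is the capital of France. Paris has 40 million residents.",
    "Paris is the capital of France and its largest city.",
    fake_judge,
)
```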

Embedding-based methods (cosine similarity, BERTScore) fail on hallucinations from modern RLHF-aligned models, because a fluent false claim and its true counterpart are nearly indistinguishable at the vector level. Claim decomposition plus reasoning-based judgment is the approach that holds up.

pytest Integration

syncreus-eval registers a pytest plugin automatically. Use the syncreus fixture to gate CI on evaluation thresholds:

# test_my_chatbot.py

def test_no_hallucination(syncreus):
    """Fails the test if any unsupported claims are found."""
    syncreus.check(
        "Paris is the capital of France.",
        context="Paris is the capital and largest city of France.",
    )

def test_medical_accuracy(syncreus):
    """Run a domain-specific evaluator and assert it passes."""
    syncreus.assert_eval(
        "healthcare",
        ai_input="Patient records and clinical guidelines...",
        ai_output="The recommended dosage is 10mg daily...",
    )

def test_custom_threshold(syncreus):
    """Run without asserting -- apply your own logic."""
    result = syncreus.run("hallucination", ai_input="...", ai_output="...")
    assert result.score >= 0.9, f"Score too low: {result.score}"

Set a global score threshold:

pytest --syncreus-threshold=0.95

Upload results to the Syncreus platform:

SYNCREUS_API_KEY=syn_... pytest --syncreus-upload

The terminal summary shows evaluation results alongside your test output:

======== Syncreus Evaluation Summary ========
  3 passed, 1 failed, 0 errors (4 total evals)
  FAILED evals:
    test_chatbot.py::test_rag_accuracy [hallucination] score=0.67

All Evaluators

General Purpose

Evaluator      What it checks                                        Requires
HALLUCINATION  Unsupported factual claims against reference context  Gemini API key
ACCURACY       Golden dataset comparison via semantic similarity     [accuracy] extra
CONSISTENCY    Pairwise similarity across repeated prompts           [accuracy] extra
PERFORMANCE    Latency, token counts, cost metrics from trace data   Nothing
AGENT_TASK     Whether an agent's completion claim matches reality   Gemini API key
REGRESSION     Baseline comparison against previous runs             Syncreus platform

Safety and Compliance

Evaluator         What it checks                                 Requires
SAFETY            PII/sensitive data detection + content safety  [safety] extra
BIAS              Demographic parity / EEOC four-fifths rule     Nothing
IDEOLOGY          Political neutrality (OMB M-26-04)             Gemini API key
PROMPT_INJECTION  Injection attempt detection                    [prompt-injection] extra
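
For readers unfamiliar with the four-fifths rule named in the BIAS row: it compares each group's selection rate with the highest group's rate and treats a ratio below 0.8 as a sign of adverse impact. The sketch below shows only that arithmetic. The function name, input shape, and group labels are made up for this example, and how the BIAS evaluator itself computes or reports the ratio is not documented here.

```python
def impact_ratios(selected: dict, total: dict) -> dict:
    """Ratio of each group's selection rate to the highest group's rate."""
    rates = {group: selected[group] / total[group] for group in total}
    best = max(rates.values())
    if best == 0:
        # Nobody was selected in any group, so there is no disparity to measure.
        return {group: 1.0 for group in rates}
    return {group: rate / best for group, rate in rates.items()}


# Hypothetical counts: favourable outcomes per group, and group sizes.
ratios = impact_ratios(
    selected={"group_a": 48, "group_b": 30},
    total={"group_a": 80, "group_b": 75},
)
# Groups whose ratio falls below four-fifths of the best-treated group.
flagged = [group for group, ratio in ratios.items() if ratio < 0.8]
print(flagged)
```

Here group_a is selected at a rate of 0.60 and group_b at 0.40, so group_b's ratio is about 0.67 and it falls under the 0.8 threshold.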

Industry-Specific

Evaluator      What it checks                                             Requires
HEALTHCARE     Medical accuracy, drug safety, PHI detection               Gemini API key
LEGAL          Citation validity, holding fidelity, fabricated case law   Gemini API key
FINANCE        Regulatory accuracy, numerical precision, fabricated data  Gemini API key
CODE_ACCURACY  API existence, function signatures, package validity       Gemini API key

Each industry evaluator returns domain-specific claim types. For example, the healthcare evaluator categorizes claims as drug_interaction, dosage, contraindication, diagnosis, treatment, terminology, or phi_leak -- each with a severity level (critical, major, minor).
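
Typed, severity-tagged claims lend themselves to simple triage. The sketch below assumes each finding is a dict with "type", "severity", and "claim" keys. That shape is an assumption made for illustration (the actual structure inside result.details may differ), and the second and third findings are invented sample data.

```python
from collections import Counter

# Severity levels named in the text, ordered most to least urgent.
SEVERITY_ORDER = {"critical": 0, "major": 1, "minor": 2}

# Hypothetical findings; the dict shape is an assumption, not the library's schema.
findings = [
    {"type": "terminology", "severity": "minor",
     "claim": "Lisinopril is a beta blocker"},
    {"type": "dosage", "severity": "critical",
     "claim": "The standard starting dose is 20mg daily"},
    {"type": "drug_interaction", "severity": "major",
     "claim": "It is safe to combine with any diuretic"},
]


def triage(items: list) -> tuple:
    """Sort findings by severity and count how many fall in each level."""
    ordered = sorted(items, key=lambda f: SEVERITY_ORDER[f["severity"]])
    counts = Counter(f["severity"] for f in items)
    return ordered, counts


ordered, counts = triage(findings)
for finding in ordered:
    print(f'{finding["severity"]:<8} {finding["type"]:<16} {finding["claim"]}')
```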

Full Evaluator API

For evaluators beyond hallucination, use the evaluate() function:

from syncreus_eval import evaluate, EvalType

# Healthcare: drug safety and PHI detection
result = evaluate(
    EvalType.HEALTHCARE,
    ai_input="Clinical documentation and drug references...",
    ai_output="The AI assistant's medical response...",
)
print(result.passed)                     # True/False/None
print(result.details["critical_count"])  # number of critical findings
print(result.details["phi_detected"])    # whether PHI was leaked

# Legal: citation verification
result = evaluate(
    EvalType.LEGAL,
    ai_input="Case law and statute text...",
    ai_output="The court held in Smith v. Jones (2024)...",
)

# Run multiple evaluators at once
results = evaluate(
    [EvalType.HALLUCINATION, EvalType.SAFETY, EvalType.IDEOLOGY],
    ai_input="Context here",
    ai_output="Response here",
)
for r in results:
    print(f"{r.eval_type.value}: passed={r.passed}")

The EvalResult returned by evaluate():

class EvalResult:
    eval_type: EvalType
    passed: bool | None      # True/False/None (None = error or skipped)
    score: float | None       # Numeric score where applicable
    details: dict[str, Any]   # Evaluator-specific details
    error: bool               # Whether an error occurred
    error_message: str | None # Error description
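
Because passed can be None as well as True or False, a plain `if not r.passed` check lumps errors and skips together with real failures. One way to keep them apart is sketched below. ResultSketch is a stand-in that mirrors a few of the fields listed above so the example runs on its own, and summarize() is a hypothetical helper, not part of the package.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResultSketch:
    """Stand-in mirroring a few EvalResult fields; for illustration only."""
    eval_type: str
    passed: Optional[bool]
    error: bool = False
    error_message: Optional[str] = None


def summarize(results: list) -> dict:
    # `is False` and `is None` keep real failures apart from errors or skips.
    failed = [r.eval_type for r in results if r.passed is False]
    inconclusive = [r.eval_type for r in results if r.passed is None]
    return {
        "ok": not failed and not inconclusive,
        "failed": failed,
        "inconclusive": inconclusive,
    }


summary = summarize([
    ResultSketch("hallucination", passed=True),
    ResultSketch("safety", passed=False),
    ResultSketch("ideology", passed=None, error=True, error_message="timeout"),
])
print(summary)
```

Whether an inconclusive result should block a release is a policy choice. This sketch treats it as blocking, which is the conservative default.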

Installation

# Core (hallucination detection, industry evaluators via Gemini)
pip install syncreus-eval

# With optional extras (quoted so shells such as zsh do not expand the brackets)
pip install "syncreus-eval[accuracy]"          # fastembed for semantic similarity
pip install "syncreus-eval[safety]"            # Presidio PII scanning
pip install "syncreus-eval[prompt-injection]"  # LLM Guard injection detection
pip install "syncreus-eval[upload]"            # Upload results to Syncreus platform
pip install "syncreus-eval[all]"               # Everything

Requires Python 3.10+.

Configuration

The LLM-as-judge evaluators (hallucination, healthcare, legal, finance, code accuracy, ideology, agent task) require a Google Gemini API key. The free tier works.

Set it as an environment variable:

export GEMINI_API_KEY=your-key-here

Or pass it directly:

result = check(output, context=doc, gemini_key="your-key-here")
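
In application code it can help to resolve the key in one place and fail early with a clear message. The helper below is a hypothetical sketch using only the standard library. It assumes an explicitly passed key should take precedence over GEMINI_API_KEY, which this README does not confirm.

```python
import os
from typing import Mapping, Optional


def resolve_gemini_key(explicit: Optional[str] = None,
                       env: Mapping[str, str] = os.environ) -> str:
    """Return the key to use, preferring an explicit argument (assumed precedence)."""
    key = explicit or env.get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("Set GEMINI_API_KEY or pass a key explicitly.")
    return key


# Usage with an injected environment, so the example does not depend on the shell.
key = resolve_gemini_key(env={"GEMINI_API_KEY": "your-key-here"})
```

The resolved value could then be passed on as gemini_key=key in the call shown above.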

Upload Results (Optional)

Send evaluation results to the Syncreus platform for dashboards, trend tracking, and regression detection:

from syncreus_eval import upload_results

upload_results(
    results=result,           # EvalResult or list
    api_key="syn_...",        # Syncreus API key
    endpoint="https://api.syncreus.com",
    trace_id="trace-123",     # optional
)

Requires: pip install "syncreus-eval[upload]"

License

MIT
