Production-ready token optimization and hallucination detection for LLM applications. Works with Groq, Gemini, Ollama, and HuggingFace. Includes a full local inference runtime with context window management, session persistence, KV caching, and mathematical hallucination risk scoring.

Hallutok

Token optimization and hallucination detection for LLM applications.

Python 3.10+ · License: MIT

Hallutok is a Python library that wraps LLM calls with two things most production apps need but rarely have built-in: prompt compression to reduce token spend, and response scoring to catch hallucinations before they reach your users. It works with Groq, Gemini, Ollama, and HuggingFace.

Installation

# Groq support
pip install hallutok[groq]

# Gemini support
pip install hallutok[gemini]

# Both API providers
pip install hallutok[all]

For local model support via Ollama or HuggingFace, install the additional dependencies:

pip install ollama                          # for Ollama
pip install transformers torch             # for HuggingFace

Quick Start

from hallutok import HallutokClient

client = HallutokClient.with_groq(
    api_key="gsk_your_key",
    model="llama3-8b-8192",
    temperature=0.3,
)

result = client.chat("Explain what black holes are.")

print(result.response)
print(result.token_report)
# {'tokens_before': 12, 'tokens_after': 9, 'tokens_saved': 3, 'percent_saved': 25.0}

if result.validation.is_likely_hallucination:
    print("Flags:", result.validation.flags)

API Providers — Groq and Gemini

Groq

from hallutok import HallutokClient

client = HallutokClient.with_groq(
    api_key="gsk_your_groq_key",
    model="llama3-8b-8192",
    temperature=0.3,
    max_response_tokens=1024,
    system_prompt="You are a factual assistant. Cite sources when possible.",
)

result = client.chat(
    "Please note that I would like you to explain in order to help me "
    "understand what black holes are and how they work in detail."
)

print(result.response)
print(result.token_report)
# {'tokens_before': 34, 'tokens_after': 13, 'tokens_saved': 21, 'percent_saved': 61.8}

if result.validation.is_likely_hallucination:
    print("Risk:", result.validation.risk_level)
    print("Flags:", result.validation.flags)
    print("Suggestions:", result.validation.suggestions)
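One way to act on a flagged result is to retry the call a bounded number of times. The sketch below is a hypothetical helper, not part of Hallutok: it assumes only the `chat()` method and the `validation.is_likely_hallucination` field shown above, and uses a stub client so it runs without an API key. If prompt caching is enabled, a retry may return the same cached answer; that behavior is not documented here.

```python
from types import SimpleNamespace


def chat_with_retry(client, prompt, max_attempts=3):
    """Call client.chat() until a response is not flagged, up to max_attempts.

    Hypothetical helper, not a Hallutok API. Returns the last result seen.
    """
    result = None
    for _ in range(max_attempts):
        result = client.chat(prompt)
        if not result.validation.is_likely_hallucination:
            break
    return result


class StubClient:
    """Stand-in for a real client so the sketch runs offline."""

    def __init__(self, flagged_sequence):
        self._flags = list(flagged_sequence)
        self.calls = 0

    def chat(self, prompt):
        # Replay the scripted flags; repeat the last one if calls run past it.
        flagged = self._flags[min(self.calls, len(self._flags) - 1)]
        self.calls += 1
        return SimpleNamespace(
            response=f"answer #{self.calls}",
            validation=SimpleNamespace(is_likely_hallucination=flagged),
        )


stub = StubClient([True, False])  # first answer flagged, second clean
final = chat_with_retry(stub, "What causes inflation?")
print(final.response, stub.calls)
```

With a real client, pass the `HallutokClient` instance in place of `StubClient`.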

Gemini

from hallutok import HallutokClient

client = HallutokClient.with_gemini(
    api_key="AIza_your_gemini_key",
    model="gemini-1.5-flash",
    temperature=0.4,
)

result = client.chat("Explain quantum entanglement to a 10-year-old.")
print(result.response)
print(result.token_report)

Custom provider setup

from hallutok import HallutokClient
from hallutok.providers import GroqProvider, GeminiProvider

provider = GroqProvider(api_key="gsk_...", model="mixtral-8x7b-32768")
# provider = GeminiProvider(api_key="AIza_...", model="gemini-1.5-pro")

client = HallutokClient(
    provider=provider,
    optimize_tokens=True,
    validate_responses=True,
    max_prompt_tokens=512,
    temperature=0.4,
    max_response_tokens=1024,
    system_prompt="You are a factual assistant.",
    cache_enabled=True,
)

result = client.chat("What causes inflation?")

Pre-flight token estimation

Check how many tokens a prompt will use before sending it:

estimate = client.estimate_cost_tokens(
    "Please note that I would like you to in order to help me explain "
    "how machine learning works and what it does."
)
print(estimate)
# {'tokens_before': 28, 'tokens_after': 11, 'tokens_saved': 17, 'percent_saved': 60.7}
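The four report fields are related by simple arithmetic. The function below is an illustrative sketch, not the library's code: the one-decimal rounding is inferred from the sample outputs in this README.

```python
def savings_report(tokens_before, tokens_after):
    """Build a report dict with the same keys as the examples in this README."""
    saved = tokens_before - tokens_after
    # Guard against an empty prompt to avoid dividing by zero.
    percent = round(100 * saved / tokens_before, 1) if tokens_before else 0.0
    return {
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "tokens_saved": saved,
        "percent_saved": percent,
    }


print(savings_report(28, 11))
# {'tokens_before': 28, 'tokens_after': 11, 'tokens_saved': 17, 'percent_saved': 60.7}
```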

Runtime Engine — Local Models

The HallutokEngine brings the full Hallutok pipeline to local models. Load any model from Ollama or HuggingFace and get token optimization, hallucination scoring, context window management, session persistence, and latency optimization out of the box — no API key required.

Loading a model

from hallutok.runtime import HallutokEngine

# From Ollama (requires Ollama running at localhost:11434)
engine = HallutokEngine.from_ollama("llama3")

# From HuggingFace Hub
engine = HallutokEngine.from_huggingface(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device="auto",    # auto-detects cuda / mps / cpu
    quantize=True,    # 4-bit quantization to reduce memory
)

# From a local model directory
engine = HallutokEngine.from_local("/path/to/model")

Engine configuration options

engine = HallutokEngine.from_ollama(
    model="llama3",
    max_tokens=4096,              # total context window token budget
    trim_strategy="sliding",      # how to handle context overflow
    kv_cache=True,                # cache identical prompts
    warm_up=True,                 # pre-warm model to cut first-call latency
    stream=False,
    system_prompt="You are a concise, factual assistant.",
)

Complete Runtime Example

This single script demonstrates every runtime feature — context management, session tracking, latency optimization, hallucination detection, export, and engine stats. Copy and run it against any Ollama model.

from hallutok.runtime import HallutokEngine

# ── 1. Load the engine ────────────────────────────────────────────────────────
engine = HallutokEngine.from_ollama(
    model="llama3",
    max_tokens=4096,
    trim_strategy="sliding",
    kv_cache=True,
    warm_up=True,
    system_prompt="You are a factual assistant. Keep answers concise.",
)

# ── 2. Create a session ───────────────────────────────────────────────────────
session = engine.create_session(
    name="demo-session",
    system_prompt="You are a factual assistant.",
    max_tokens=4096,
    trim_strategy="sliding",
)

# ── 3. Multi-turn conversation ────────────────────────────────────────────────
questions = [
    "What are black holes?",
    "Please note that I would like you to explain how Hawking radiation works.",
    "How does the event horizon relate to the singularity?",
    "What would happen to a person falling into a black hole?",
]

for question in questions:
    result = session.chat(question, temperature=0.4, max_tokens=512)

    print(f"\nQ: {question}")
    print(f"A: {result.response[:200]}...")
    print(f"   Tokens saved   : {result.tokens_saved} ({result.tokens_saved_pct}%)")
    print(f"   HRS score      : {result.hallucination_score:.3f}")
    print(f"   Risk level     : {result.hallucination_risk}")
    print(f"   Latency        : {result.latency_ms:.0f}ms")
    print(f"   Cache hit      : {result.cache_hit}")
    print(f"   Context used   : {result.context_tokens_used} / {result.context_tokens_used + result.context_tokens_available} tokens")

    if result.is_hallucination:
        print(f"   Flags          : {result.hallucination_flags}")
        print(f"   Suggestions    : {result.suggestions}")

    # Math score breakdown
    print(f"   HRS breakdown  : {result.math_scores}")

# ── 4. Flag an important turn (never trimmed from context) ────────────────────
result = session.chat(
    "Summarize everything we discussed.",
    flag_turn=True,
    temperature=0.3,
)
print(f"\nSummary: {result.response[:300]}")

# ── 5. Session analytics ──────────────────────────────────────────────────────
stats = session.get_stats()
print("\n--- Session Stats ---")
print(f"Total turns           : {stats['total_turns']}")
print(f"Total tokens saved    : {stats['total_tokens_saved']}")
print(f"Avg tokens saved      : {stats['avg_tokens_saved_pct']}%")
print(f"Hallucinations caught : {stats['total_hallucinations_caught']}")
print(f"Avg HRS score         : {stats['avg_hallucination_score']}")
print(f"Avg latency           : {stats['avg_latency_ms']}ms")
print(f"Session duration      : {stats['session_duration_s']}s")
print(f"Context trims         : {stats['context_trims']}")

# ── 6. Engine-wide stats ──────────────────────────────────────────────────────
engine_stats = engine.get_stats()
print("\n--- Engine Stats ---")
print(f"Model          : {engine_stats['model']}")
print(f"Source         : {engine_stats['source']}")
print(f"Device         : {engine_stats['device']}")
print(f"Total sessions : {engine_stats['total_sessions']}")
print(f"Uptime         : {engine_stats['uptime_s']}s")
print(f"Latency stats  : {engine_stats['latency']}")

# ── 7. Export session ─────────────────────────────────────────────────────────
session.save("my_session.json")
session.export_markdown("chat_log.md")
session.export_csv("analytics.csv")

# ── 8. Load a saved session ───────────────────────────────────────────────────
restored = engine.load_session("my_session.json")
print(f"\nRestored session: {restored.name}")
print(f"Last response: {restored.last_response()[:100]}")

# ── 9. Clear caches ───────────────────────────────────────────────────────────
engine.clear_cache()

Components Reference

HallutokClient

The main entry point for Groq and Gemini API usage.

from hallutok import HallutokClient

client = HallutokClient(
    provider=provider,
    optimize_tokens=True,       # compress prompts before sending
    validate_responses=True,    # score responses for hallucination
    max_prompt_tokens=512,      # hard cap on prompt size (None = no cap)
    temperature=0.5,
    max_response_tokens=1024,
    system_prompt=None,
    cache_enabled=True,
)

| Method | Description |
| --- | --- |
| chat(prompt, ...) | Send a prompt through the full pipeline |
| estimate_cost_tokens(prompt) | Preview token savings before sending |
| clear_cache() | Flush the optimizer prompt cache |
| HallutokClient.with_groq(api_key, model, **kwargs) | Factory for Groq |
| HallutokClient.with_gemini(api_key, model, **kwargs) | Factory for Gemini |

TokenOptimizer

Compresses prompts before they are sent to any model.

from hallutok.optimizer import TokenOptimizer

opt = TokenOptimizer(cache_enabled=True)

raw = """
Please note that I would like you to, in order to be helpful,
can you please explain, it is important to note that, machine learning
is a subset of AI. Machine learning is a subset of AI.
"""

compressed = opt.optimize(raw, max_tokens=100)
report = opt.savings_report(raw, compressed)
print(report)
# {'tokens_before': 54, 'tokens_after': 12, 'tokens_saved': 42, 'percent_saved': 77.8}

The optimizer applies these steps in order:

| Step | What it does |
| --- | --- |
| Whitespace normalization | Collapses spaces, trims blank lines |
| Boilerplate stripping | Removes "Please note that", "I would like you to", "It is important to note", etc. |
| Deduplication | Removes repeated sentences |
| Phrase compression | "in order to" -> "to", "due to the fact that" -> "because" |
| Truncation | Cuts to max_tokens at a sentence boundary |
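The five steps can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not the TokenOptimizer implementation: the phrase lists are tiny samples, tokens are approximated by whitespace-separated words, and every function name here is hypothetical.

```python
import re

# Small sample lists; the real optimizer's lists are not documented here.
BOILERPLATE = [
    r"please note that",
    r"i would like you to",
    r"it is important to note(?: that)?",
]
PHRASES = {"in order to": "to", "due to the fact that": "because"}
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def normalize_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()


def strip_boilerplate(text):
    for pattern in BOILERPLATE:
        text = re.sub(pattern + r",?\s*", "", text, flags=re.IGNORECASE)
    return text


def deduplicate(text):
    seen, kept = set(), []
    for sentence in SENTENCE_SPLIT.split(text):
        key = sentence.lower().strip()
        if key and key not in seen:  # compare case-insensitively
            seen.add(key)
            kept.append(sentence.strip())
    return " ".join(kept)


def compress_phrases(text):
    for verbose, short in PHRASES.items():
        text = re.sub(re.escape(verbose), short, text, flags=re.IGNORECASE)
    return text


def truncate(text, max_tokens):
    kept, used = [], 0
    for sentence in SENTENCE_SPLIT.split(text):
        count = len(sentence.split())  # word count stands in for tokens
        if used + count > max_tokens:
            break  # stop at the last sentence boundary that fits
        kept.append(sentence)
        used += count
    return " ".join(kept)


def optimize(text, max_tokens=100):
    text = normalize_whitespace(text)
    text = strip_boilerplate(text)
    text = deduplicate(text)
    text = compress_phrases(text)
    return truncate(text, max_tokens)


raw = ("Please note that  machine learning is a subset of AI. "
       "Machine learning is a subset of AI. Explain it in order to help me.")
print(optimize(raw))
# machine learning is a subset of AI. Explain it to help me.
```

Running deduplication before phrase compression follows the order in the table; sentences that differ only in a compressible phrase would therefore not be merged in this sketch.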

HallucinationValidator

Scores any text for hallucination risk using the Hallucination Risk Score (HRS), a composite of four mathematical sub-scores.

from hallutok.antihallucination import HallucinationValidator

validator = HallucinationValidator()

response = "I think maybe studies show that eating chocolate probably cures cancer."
result = validator.validate(response)

print(result.confidence_score)         # 0.0–1.0, higher = more confident
print(result.risk_level)               # "LOW" | "MEDIUM" | "HIGH"
print(result.is_likely_hallucination)  # True / False
print(result.flags)                    # list of detected issues
print(result.warnings)                 # human-readable descriptions
print(result.suggestions)             # recommended actions
print(result.cleaned_response)         # response with disclaimer appended if flagged
print(result.math_scores)             # SCS, ECS, CDS, FGS, HRS breakdown

HRS scoring breakdown:

| Score | Name | What it measures |
| --- | --- | --- |
| SCS | Semantic Confidence Score | Hedging language ("I think", "maybe", "probably") |
| ECS | Evidence Consistency Score | Ungrounded claims ("Studies show", "Research suggests") |
| CDS | Contradiction Detection Score | Internal contradictions ("always" + "never" in same text) |
| FGS | Factual Grounding Score | Numeric anomalies, implausible figures |
| HRS | Hallucination Risk Score | Composite of all four |

Detection layers:

| Layer | Examples caught |
| --- | --- |
| Hedging | "I think", "maybe", "perhaps", "I'm not sure", "I believe" |
| Ungrounded claims | "Studies show", "Research suggests", "Experts say" |
| Numeric anomalies | Percentages over 100%, implausible statistics |
| Contradictions | Contradictory absolute terms in the same response |
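To make the idea of a composite score concrete, here is a toy heuristic built on the four detection layers. It is not the HRS formula, which this README does not specify: the saturating count, the equal weighting, and all names below are assumptions chosen for illustration.

```python
import re

HEDGES = ["i think", "maybe", "perhaps", "probably", "i'm not sure", "i believe"]
UNGROUNDED = ["studies show", "research suggests", "experts say"]


def count_phrases(text, phrases):
    lowered = text.lower()
    return sum(lowered.count(phrase) for phrase in phrases)


def saturating(hits, cap=3):
    """Map a hit count to [0, 1]; `cap` or more hits give 1.0 (assumed scale)."""
    return min(hits, cap) / cap


def risk_scores(text):
    lowered = text.lower()
    hedging = saturating(count_phrases(text, HEDGES))
    ungrounded = saturating(count_phrases(text, UNGROUNDED))

    # Contradiction layer: both absolute terms occur in the same text.
    words = set(re.findall(r"[a-z']+", lowered))
    contradiction = 1.0 if {"always", "never"} <= words else 0.0

    # Numeric layer: any percentage above 100.
    percents = [float(p) for p in re.findall(r"(\d+(?:\.\d+)?)\s*%", text)]
    numeric = 1.0 if any(p > 100 for p in percents) else 0.0

    # Equal weights are an assumption; the library's weighting is not documented.
    composite = (hedging + ungrounded + contradiction + numeric) / 4
    return {
        "hedging": hedging,
        "ungrounded": ungrounded,
        "contradiction": contradiction,
        "numeric": numeric,
        "composite": composite,
    }


scores = risk_scores(
    "I think maybe studies show that eating chocolate probably cures cancer."
)
print(scores)
```

For the sample sentence, three hedges saturate the hedging score and one ungrounded claim contributes a third of its scale, so the composite lands at one third under these assumed weights.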

HallutokEngine

The runtime engine for local model inference with the full Hallutok pipeline.

from hallutok.runtime import HallutokEngine

# Factory methods
engine = HallutokEngine.from_ollama(model, host, **kwargs)
engine = HallutokEngine.from_huggingface(model_id, device, quantize, token, **kwargs)
engine = HallutokEngine.from_local(path, device, **kwargs)

Constructor parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| max_tokens | int | 4096 | Context window token budget |
| trim_strategy | str | "sliding" | Context overflow strategy |
| kv_cache | bool | True | Cache identical prompt responses |
| warm_up | bool | True | Pre-warm model on load |
| stream | bool | False | Enable streaming responses |
| system_prompt | str | None | Default system instruction |

Methods:

| Method | Description |
| --- | --- |
| create_session(name, system_prompt, max_tokens, trim_strategy) | Create a new chat session |
| load_session(path, max_tokens, trim_strategy) | Restore session from JSON |
| get_stats() | Engine-wide performance stats |
| clear_cache() | Flush KV and optimizer caches |

ContextWindowManager

Manages the token budget for a conversation and automatically trims messages when the budget is exceeded.

from hallutok.runtime.context_manager import ContextWindowManager

ctx = ContextWindowManager(
    max_tokens=4096,
    trim_strategy="sliding",
    reserve_tokens=512,
)

ctx.add_message("system", "You are a helpful assistant.", flagged=True)
ctx.add_message("user", "What are black holes?")
ctx.add_message("assistant", "Black holes are regions of extremely strong gravity.")

print(ctx.stats())
# {
#   'messages': 3,
#   'total_tokens': 28,
#   'available_tokens': 3556,
#   'budget': 4096,
#   'usage_percent': 0.7,
#   'trim_count': 0,
#   'strategy': 'sliding'
# }
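The figures in the sample output are consistent with the following arithmetic. This is inferred from the numbers shown rather than taken from the library's source, so treat the formula as an assumption.

```python
def context_stats(total_tokens, budget=4096, reserve=512):
    """Reproduce the budget arithmetic implied by the sample output."""
    # Reserved tokens are held back (for example for the response),
    # so they are not counted as available.
    available = budget - reserve - total_tokens
    usage_percent = round(100 * total_tokens / budget, 1)
    return {
        "total_tokens": total_tokens,
        "available_tokens": available,
        "budget": budget,
        "usage_percent": usage_percent,
    }


print(context_stats(28))
# {'total_tokens': 28, 'available_tokens': 3556, 'budget': 4096, 'usage_percent': 0.7}
```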

Trim strategies:

| Strategy | Behavior |
| --- | --- |
| sliding | Keep system messages and the last N conversation turns |
| drop_oldest | Remove oldest non-system, non-flagged messages first |
| summarize | Compress older messages into an extractive summary note |
| priority | Keep system messages, flagged turns, and the last 6 messages |

Messages added with flagged=True are never removed by any trim strategy.
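A minimal sketch of two of these strategies shows how flagged messages survive trimming. It is an illustration, not the ContextWindowManager code: the `Message` class and both functions are hypothetical, tokens are approximated by word count, and the sliding window is measured in messages rather than turns.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str
    flagged: bool = False

    @property
    def tokens(self):
        # Whitespace word count stands in for a real tokenizer.
        return len(self.content.split())


def total_tokens(messages):
    return sum(m.tokens for m in messages)


def is_protected(message):
    return message.role == "system" or message.flagged


def trim_drop_oldest(messages, budget):
    """Drop the oldest unprotected messages until the total fits the budget."""
    kept = list(messages)
    while total_tokens(kept) > budget:
        for index, message in enumerate(kept):
            if not is_protected(message):
                del kept[index]
                break
        else:
            break  # only protected messages remain; nothing more to drop
    return kept


def trim_sliding(messages, last_n):
    """Keep protected messages plus the last `last_n` unprotected ones."""
    others = [m for m in messages if not is_protected(m)]
    recent = {id(m) for m in others[-last_n:]} if last_n > 0 else set()
    return [m for m in messages if is_protected(m) or id(m) in recent]


history = [
    Message("system", "You are a helpful assistant.", flagged=True),
    Message("user", "What are black holes?"),
    Message("assistant", "Black holes are regions of extremely strong gravity."),
    Message("user", "Remember this key fact.", flagged=True),
    Message("user", "And Hawking radiation?"),
]

print([m.content for m in trim_drop_oldest(history, budget=16)])
print([m.content for m in trim_sliding(history, last_n=1)])
```

Both calls keep the system message and the flagged turn, and discard the two oldest unprotected messages.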


SessionManager

Tracks conversation history, computes per-session analytics, and handles persistence and export.

from hallutok.runtime.session_manager import SessionManager
from hallutok.runtime.context_manager import ContextWindowManager

ctx = ContextWindowManager(max_tokens=4096)
session = SessionManager(name="my-session", context_manager=ctx)

Methods:

| Method | Description |
| --- | --- |
| record_turn(prompt, optimized_prompt, response, token_report, validation_result, latency_ms) | Record a completed turn |
| get_stats() | Return aggregated session analytics |
| save(path) | Save session to JSON |
| SessionManager.load(path, context_manager) | Load session from JSON |
| export_markdown(path) | Export readable chat log as Markdown |
| export_csv(path) | Export per-turn analytics as CSV |
| last_response() | Return the most recent assistant response |
| clear() | Clear history and context |
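For readers who want a feel for a per-turn CSV export, the sketch below writes one row per turn with the standard `csv` module. The column names and the shape of the turn records are assumptions; the actual file layout produced by export_csv() is not documented in this README.

```python
import csv
import io

# Hypothetical column set; the library's real columns may differ.
FIELDS = ["turn", "tokens_before", "tokens_after", "tokens_saved", "latency_ms"]


def turns_to_csv(turns):
    """Serialize per-turn records to CSV text, one row per turn."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for number, turn in enumerate(turns, start=1):
        writer.writerow({
            "turn": number,
            "tokens_before": turn["tokens_before"],
            "tokens_after": turn["tokens_after"],
            "tokens_saved": turn["tokens_before"] - turn["tokens_after"],
            "latency_ms": turn["latency_ms"],
        })
    return buffer.getvalue()


csv_text = turns_to_csv([
    {"tokens_before": 34, "tokens_after": 13, "latency_ms": 120.0},
    {"tokens_before": 28, "tokens_after": 11, "latency_ms": 180.0},
])
print(csv_text)
```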

SessionStats fields:

stats = session.get_stats()

stats.session_name
stats.total_turns
stats.total_tokens_before
stats.total_tokens_after
stats.total_tokens_saved
stats.avg_tokens_saved_pct
stats.total_hallucinations_caught
stats.avg_hallucination_score
stats.avg_latency_ms
stats.session_duration_s
stats.context_trims
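These fields are aggregates over recorded turns. The sketch below shows one plausible way to derive several of them from per-turn records. It assumes a hypothetical record shape, and it assumes the average saving is total saved over total before rather than a mean of per-turn percentages; the library may compute it differently.

```python
from statistics import mean


def aggregate_turns(turns):
    """Aggregate per-turn records into session-level figures."""
    if not turns:
        return {"total_turns": 0}
    before = sum(t["tokens_before"] for t in turns)
    after = sum(t["tokens_after"] for t in turns)
    saved = before - after
    return {
        "total_turns": len(turns),
        "total_tokens_before": before,
        "total_tokens_after": after,
        "total_tokens_saved": saved,
        "avg_tokens_saved_pct": round(100 * saved / before, 1) if before else 0.0,
        "total_hallucinations_caught": sum(1 for t in turns if t["is_hallucination"]),
        "avg_hallucination_score": round(
            mean(t["hallucination_score"] for t in turns), 3
        ),
        "avg_latency_ms": round(mean(t["latency_ms"] for t in turns), 1),
    }


turns = [
    {"tokens_before": 34, "tokens_after": 13, "is_hallucination": False,
     "hallucination_score": 0.1, "latency_ms": 120.0},
    {"tokens_before": 28, "tokens_after": 11, "is_hallucination": True,
     "hallucination_score": 0.7, "latency_ms": 180.0},
]
summary = aggregate_turns(turns)
print(summary)
```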

LatencyOptimizer

Manages KV caching, warm-up, and latency tracking for the runtime engine.

from hallutok.runtime.latency_optimizer import LatencyOptimizer

lat = LatencyOptimizer(
    kv_cache_enabled=True,
    kv_cache_size=64,
    stream=False,
    warm_up=True,
)

# Cache operations
lat.store_cache("What is AI?", "AI is artificial intelligence.")
cached = lat.get_cached("What is AI?")  # returns response or None

# Latency stats
print(lat.latency_stats())
# {
#   'calls': 12,
#   'avg_ms': 134.2,
#   'min_ms': 98.1,
#   'max_ms': 312.4,
#   'p95_ms': 280.0,
#   'cache_hits': 3,
#   'stream_mode': False
# }
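A bounded prompt-to-response cache and a P95 figure can both be built from the standard library. The sketch below is not the LatencyOptimizer implementation: least-recently-used eviction and the nearest-rank percentile method are assumptions, and the class and function names are hypothetical.

```python
from collections import OrderedDict


class PromptCache:
    """Bounded cache keyed by the exact prompt string, with LRU eviction."""

    def __init__(self, max_size=64):
        self.max_size = max_size
        self.hits = 0
        self._entries = OrderedDict()

    def get(self, prompt):
        if prompt not in self._entries:
            return None
        self._entries.move_to_end(prompt)  # mark as most recently used
        self.hits += 1
        return self._entries[prompt]

    def store(self, prompt, response):
        self._entries[prompt] = response
        self._entries.move_to_end(prompt)
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used


def p95_ms(latencies):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


cache = PromptCache(max_size=2)
cache.store("What is AI?", "AI is artificial intelligence.")
cache.store("What is ML?", "ML is machine learning.")
cache.get("What is AI?")                     # refreshes this entry
cache.store("What is DL?", "DL is deep learning.")  # evicts "What is ML?"

print(cache.get("What is ML?"))  # None
print(p95_ms(list(range(1, 21))))  # 19
```

Because the key is the exact prompt string, any change in wording is a cache miss in this sketch.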

Result Objects

ChatResult (API providers)

Returned by HallutokClient.chat().

| Field | Type | Description |
| --- | --- | --- |
| response | str | Final model response (with disclaimer if flagged) |
| original_prompt | str | The prompt as you wrote it |
| optimized_prompt | str | The prompt after token optimization |
| token_report | dict | tokens_before, tokens_after, tokens_saved, percent_saved |
| validation | ValidationResult | Full hallucination validation result |
| provider | str | "groq" or "gemini" |
| warnings | list[str] | Aggregated warnings from optimizer and validator |

EngineResult (Runtime Engine)

Returned by session.chat().

| Field | Type | Description |
| --- | --- | --- |
| response | str | Final model response |
| original_prompt | str | Raw input prompt |
| optimized_prompt | str | Prompt after optimization |
| tokens_before | int | Token count before optimization |
| tokens_after | int | Token count after optimization |
| tokens_saved | int | Tokens saved |
| tokens_saved_pct | float | Percentage saved |
| hallucination_score | float | HRS composite score (0.0–1.0) |
| hallucination_risk | str | "LOW", "MEDIUM", or "HIGH" |
| is_hallucination | bool | Whether response is flagged |
| hallucination_flags | list[str] | Detected issues |
| math_scores | dict | SCS, ECS, CDS, FGS, HRS sub-scores |
| latency_ms | float | End-to-end latency in milliseconds |
| cache_hit | bool | True if served from KV cache |
| context_tokens_used | int | Tokens currently in context window |
| context_tokens_available | int | Tokens remaining in budget |
| suggestions | list[str] | Recommendations if hallucination detected |

Roadmap

  • Token optimization pipeline
  • Hallucination detection with mathematical HRS scoring
  • Groq and Gemini provider adapters
  • Runtime Engine with Ollama and HuggingFace support
  • Context Window Manager with four trim strategies
  • Session Manager with history, analytics, and export
  • Latency Optimizer with KV cache and P95 tracking
  • Async support via achat()
  • Streaming responses
  • OpenAI and Together AI provider adapters
  • Self-consistency hallucination verification
  • Per-call token budget enforcement

License

MIT License — see LICENSE for details.
