Anti-Hallucination & Token Optimization library for Groq and Gemini APIs

These details have not been verified by PyPI

Project links

Project description

🛡️ Hallutok

Anti-Hallucination & Token Optimization for Groq, Gemini, Ollama, and HuggingFace

Hallutok solves real problems that kill your API quota and reliability:

Problem	Hallutok's Solution
Long prompts burning through tokens	`TokenOptimizer` compresses prompts before sending
LLM making up facts / hedging	`HallucinationValidator` scores and flags sketchy responses
Running local/offline LLMs without guardrails	`HallutokEngine` wraps any local model with the full pipeline
Multi-turn context blowing past model limits	`ContextWindowManager` auto-trims with smart strategies

✨ Features

Token Optimization — whitespace cleanup, filler-phrase compression, deduplication, smart truncation, in-memory caching
Anti-Hallucination — mathematical HRS scoring, detects hedging, ungrounded claims, numeric anomalies, contradictions
Groq + Gemini — works with both APIs via thin, swappable provider adapters
Runtime Engine — load any Ollama or HuggingFace model and get the full optimization pipeline locally
Context Window Manager — smart token budget with 4 trim strategies (sliding, drop_oldest, summarize, priority)
Session Manager — multi-turn history, save/load JSON, export Markdown + CSV
Latency Optimizer — KV cache, warm-up pings, per-call latency tracking with P95
Zero hard dependencies — core library is pure Python; providers and runtime are optional extras
Savings reporting — see exactly how many tokens you saved per call

📦 Installation

# With Groq support
pip install hallutok[groq]

# With Gemini support
pip install hallutok[gemini]

# Both
pip install hallutok[all]

# With local model support (Ollama / HuggingFace)
pip install hallutok[local]

🚀 Quick Start

Using Groq

from hallutok import HallutokClient

client = HallutokClient.with_groq(
    api_key="gsk_your_groq_key",
    model="llama3-8b-8192",
    temperature=0.3,
)

result = client.chat(
    "Please note that I would like you to explain in order to help me "
    "understand what black holes are and how they work."
)

print(result.response)
print(result.token_report)
# {'tokens_before': 48, 'tokens_after': 19, 'tokens_saved': 29, 'percent_saved': 60.4}

if result.validation.is_likely_hallucination:
    print("⚠️  Flags:", result.validation.flags)

Using Gemini

from hallutok import HallutokClient

client = HallutokClient.with_gemini(
    api_key="AIza_your_gemini_key",
    model="gemini-1.5-flash",
)

result = client.chat("Explain quantum entanglement to a 10-year-old.")
print(result.response)

Using providers directly

from hallutok import HallutokClient
from hallutok.providers import GroqProvider, GeminiProvider

provider = GroqProvider(api_key="gsk_...", model="mixtral-8x7b-32768")

client = HallutokClient(
    provider=provider,
    optimize_tokens=True,
    validate_responses=True,
    max_prompt_tokens=512,
    temperature=0.4,
    max_response_tokens=1024,
    system_prompt="You are a factual assistant. Cite sources when possible.",
)

result = client.chat("What causes inflation?")

🖥️ Runtime Engine (v0.1.1 — New)

The HallutokEngine lets you run any local model (via Ollama or HuggingFace) with the full Hallutok pipeline — token optimization, hallucination detection, context management, session tracking, and latency optimization — all built in.

Loading a model

from hallutok.runtime import HallutokEngine

# From Ollama (requires Ollama running locally)
engine = HallutokEngine.from_ollama("llama3")

# From HuggingFace Hub
engine = HallutokEngine.from_huggingface(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device="auto",      # auto-detects cuda / mps / cpu
    quantize=True,      # 4-bit quantization to save memory
)

# From a local model directory
engine = HallutokEngine.from_local("/path/to/my-model")

Creating sessions and chatting

session = engine.create_session(
    name="research-chat",
    system_prompt="You are a factual assistant. Always cite sources.",
    max_tokens=4096,
    trim_strategy="sliding",   # sliding | drop_oldest | summarize | priority
)

result = session.chat("Explain black holes")

print(result.response)
print(result.hallucination_score)   # HRS 0.0–1.0
print(result.hallucination_risk)    # LOW / MEDIUM / HIGH
print(result.tokens_saved)
print(result.latency_ms)
print(result.cache_hit)             # True if served from KV cache

EngineResult fields

result.response                 # final (possibly cleaned) response
result.original_prompt          # your raw input
result.optimized_prompt         # what was actually sent to the model
result.tokens_before            # tokens in original prompt
result.tokens_after             # tokens in optimized prompt
result.tokens_saved             # tokens saved
result.tokens_saved_pct         # percentage saved
result.hallucination_score      # HRS score 0.0–1.0
result.hallucination_risk       # "LOW" | "MEDIUM" | "HIGH"
result.is_hallucination         # bool
result.hallucination_flags      # list of detected issues
result.math_scores              # {"SCS": ..., "ECS": ..., "CDS": ..., "FGS": ..., "HRS": ...}
result.latency_ms               # end-to-end latency in milliseconds
result.cache_hit                # True if response came from KV cache
result.context_tokens_used      # tokens currently in context window
result.context_tokens_available # tokens remaining in budget
result.suggestions              # recommendations if hallucination detected

Multi-turn conversations

session = engine.create_session("science-qa")

r1 = session.chat("What are black holes?")
r2 = session.chat("How does Hawking radiation work?")
r3 = session.chat("What is the event horizon?", flag_turn=True)  # never trimmed

Session persistence and export

# Save and reload a session
session.save("my_session.json")
restored = engine.load_session("my_session.json")

# Export in multiple formats
session.export_markdown("chat_log.md")   # human-readable chat log
session.export_csv("analytics.csv")     # per-turn analytics

Session analytics

stats = session.get_stats()

print(stats["total_turns"])
print(stats["total_tokens_saved"])
print(stats["avg_tokens_saved_pct"])   # e.g. 42.3
print(stats["total_hallucinations_caught"])
print(stats["avg_hallucination_score"])
print(stats["avg_latency_ms"])
print(stats["session_duration_s"])
print(stats["context_trims"])          # how many times context was auto-trimmed

Engine-wide stats

print(engine.get_stats())
# {
#   "model": "llama3",
#   "source": "ollama",
#   "device": "cpu",
#   "quantized": True,
#   "memory_mb": 4000.0,
#   "total_sessions": 3,
#   "uptime_s": 142.7,
#   "latency": {"calls": 12, "avg_ms": 134.2, "min_ms": 98.1, "max_ms": 312.4, "p95_ms": 280.0},
#   "context_budget": 4096,
#   "trim_strategy": "sliding"
# }

engine.clear_cache()  # flush KV cache and optimizer cache

Advanced engine configuration

engine = HallutokEngine.from_ollama(
    model="mistral",
    max_tokens=8192,             # context window budget
    trim_strategy="priority",    # keep system + flagged + last 6 turns
    kv_cache=True,               # cache identical prompts
    warm_up=True,                # pre-warm model to reduce first-call latency
    stream=False,
    system_prompt="You are a concise, factual assistant.",
)

🔧 Components

TokenOptimizer

from hallutok.optimizer import TokenOptimizer

opt = TokenOptimizer()

raw = """
Please note that I would like you to, in order to be helpful,
can you please explain, it is important to note that, machine learning
is a subset of AI. Machine learning is a subset of AI.
"""

compressed = opt.optimize(raw, max_tokens=100)
report = opt.savings_report(raw, compressed)
# {'tokens_before': 54, 'tokens_after': 12, 'tokens_saved': 42, 'percent_saved': 77.8}

What the optimizer does, in order:

Normalize whitespace (collapse spaces, trim blank lines)
Strip boilerplate ("Please note that", "I would like you to", etc.)
Deduplicate repeated sentences
Replace verbose phrases ("in order to" → "to", "due to the fact that" → "because", …)
Truncate to max_tokens at a sentence boundary

HallucinationValidator

from hallutok.antihallucination import HallucinationValidator

validator = HallucinationValidator()

response = "I think maybe studies show that eating chocolate probably cures cancer."
result = validator.validate(response)

print(result.confidence_score)          # e.g. 0.72
print(result.is_likely_hallucination)   # True / False
print(result.risk_level)                # "LOW" | "MEDIUM" | "HIGH"
print(result.flags)                     # list of issues found
print(result.warnings)                  # human-readable descriptions
print(result.suggestions)              # what to do about it
print(result.cleaned_response)          # response + disclaimer if flagged
print(result.math_scores)              # SCS, ECS, CDS, FGS, HRS breakdown

Detection layers:

Layer	What it catches
Hedging	"I think", "maybe", "perhaps", "I'm not sure", etc.
Ungrounded claims	"Studies show…", "Research suggests…" without citations
Numeric anomalies	Percentages over 100%, other implausible numbers
Contradictions	"always" + "never", "increases" + "decreases" in same text

ContextWindowManager

from hallutok.runtime.context_manager import ContextWindowManager

ctx = ContextWindowManager(
    max_tokens=4096,
    trim_strategy="sliding",   # sliding | drop_oldest | summarize | priority
    reserve_tokens=512,        # tokens reserved for the response
)

ctx.add_message("system", "You are a helpful assistant.", flagged=True)
ctx.add_message("user", "What are black holes?")
ctx.add_message("assistant", "Black holes are regions of extremely strong gravity.")

print(ctx.stats())
# {'messages': 3, 'total_tokens': 28, 'available_tokens': 3556, 'budget': 4096,
#  'usage_percent': 0.7, 'trim_count': 0, 'strategy': 'sliding'}

Trim strategies:

Strategy	Behavior
`sliding`	Keep system messages + last N conversation turns
`drop_oldest`	Remove oldest non-system, non-flagged messages first
`summarize`	Compress older messages into an extractive summary
`priority`	Keep system + flagged turns + last 6 messages

LatencyOptimizer

from hallutok.runtime.latency_optimizer import LatencyOptimizer

lat = LatencyOptimizer(kv_cache_enabled=True, kv_cache_size=64)

# KV cache
lat.store_cache("What is AI?", "AI is artificial intelligence.")
cached = lat.get_cached("What is AI?")   # returns cached response

# Latency stats
print(lat.latency_stats())
# {'calls': 12, 'avg_ms': 134.2, 'min_ms': 98.1, 'max_ms': 312.4,
#  'p95_ms': 280.0, 'cache_hits': 3, 'stream_mode': False}

💡 Tips to Maximize Token Savings

Avoid filler openers — "Can you please", "I would like you to", "It is important that"
Don't repeat yourself — Hallutok deduplicates, but it's faster to not duplicate at all
Use max_prompt_tokens — set a hard cap so you never accidentally send a 4k-token prompt
Lower the temperature — temperature=0.3 reduces hallucination risk significantly
Use a system prompt — instruct the model to cite sources and avoid speculation
Check token_report per call — it tells you exactly what was saved
Use flag_turn=True for important turns you never want trimmed from context

📊 ChatResult Fields (API Providers)

result.response           # final (possibly cleaned) text
result.original_prompt    # your original input
result.optimized_prompt   # what was actually sent to the API
result.token_report       # {tokens_before, tokens_after, tokens_saved, percent_saved}
result.validation         # ValidationResult object
result.provider           # "groq" or "gemini"
result.warnings           # list of human-readable warnings

🗺️ Roadmap

Token optimization pipeline
Hallucination detection with HRS scoring
Groq + Gemini provider adapters
Runtime Engine (Ollama + HuggingFace support)
Context Window Manager with smart trimming
Session Manager with history and export
Latency Optimizer with KV cache
Async support (achat())
Streaming responses
OpenAI / Together AI provider adapters
Per-call token budget enforcement
Self-consistency hallucination verification

📄 License

MIT License — see LICENSE for details.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.1.3

Jun 5, 2026

0.1.2

Jun 5, 2026

This version

0.1.1

Jun 5, 2026

0.1.0

Jun 4, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hallutok-0.1.1.tar.gz (34.4 kB view details)

Uploaded Jun 5, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

hallutok-0.1.1-py3-none-any.whl (33.5 kB view details)

Uploaded Jun 5, 2026 Python 3

File details

Details for the file hallutok-0.1.1.tar.gz.

File metadata

Download URL: hallutok-0.1.1.tar.gz
Upload date: Jun 5, 2026
Size: 34.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for hallutok-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`3300dd022935da65fce9765afcda5afe9392bf873dc02e9cafc7a6c07add440c`
MD5	`4db939bb93bf5e6070e66a4ac0b99c63`
BLAKE2b-256	`f24bf11742853c1667cc1585e36e2302fd958f2d2b7afa7b9a36469e5e1f01a4`

See more details on using hashes here.

File details

Details for the file hallutok-0.1.1-py3-none-any.whl.

File metadata

Download URL: hallutok-0.1.1-py3-none-any.whl
Upload date: Jun 5, 2026
Size: 33.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.3

File hashes

Hashes for hallutok-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`14f304592ebf0904505610905c23944e9d4658aa4e34c84da37129e386e243c2`
MD5	`ca43635726b907d8e5c6a88a3ec28462`
BLAKE2b-256	`f4a645e535810bfd3ea8ac567ae6047d783db471b7a476e0da7d83330c10d239`

See more details on using hashes here.

hallutok 0.1.1

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

🛡️ Hallutok

✨ Features

📦 Installation

🚀 Quick Start

Using Groq

Using Gemini

Using providers directly

🖥️ Runtime Engine (v0.1.1 — New)

Loading a model

Creating sessions and chatting

EngineResult fields

Multi-turn conversations

Session persistence and export

Session analytics

Engine-wide stats

Advanced engine configuration

🔧 Components

TokenOptimizer

HallucinationValidator

ContextWindowManager

LatencyOptimizer

💡 Tips to Maximize Token Savings

📊 ChatResult Fields (API Providers)

🗺️ Roadmap

📄 License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes