Anti-Hallucination & Token Optimization library for Groq and Gemini APIs
🛡️ Hallutok
Anti-Hallucination & Token Optimization for Groq, Gemini, Ollama, and HuggingFace
Hallutok targets the problems that drain your API quota and hurt reliability:
| Problem | Hallutok's Solution |
|---|---|
| Long prompts burning through tokens | TokenOptimizer compresses prompts before sending |
| LLM making up facts / hedging | HallucinationValidator scores and flags sketchy responses |
| Running local/offline LLMs without guardrails | HallutokEngine wraps any local model with the full pipeline |
| Multi-turn context blowing past model limits | ContextWindowManager auto-trims with smart strategies |
✨ Features
- Token Optimization — whitespace cleanup, filler-phrase compression, deduplication, smart truncation, in-memory caching
- Anti-Hallucination — mathematical HRS scoring, detects hedging, ungrounded claims, numeric anomalies, contradictions
- Groq + Gemini — works with both APIs via thin, swappable provider adapters
- Runtime Engine — load any Ollama or HuggingFace model and get the full optimization pipeline locally
- Context Window Manager — smart token budget with 4 trim strategies (sliding, drop_oldest, summarize, priority)
- Session Manager — multi-turn history, save/load JSON, export Markdown + CSV
- Latency Optimizer — KV cache, warm-up pings, per-call latency tracking with P95
- Zero hard dependencies — core library is pure Python; providers and runtime are optional extras
- Savings reporting — see exactly how many tokens you saved per call
📦 Installation
# Quotes keep shells such as zsh from expanding the square brackets
# With Groq support
pip install "hallutok[groq]"
# With Gemini support
pip install "hallutok[gemini]"
# Both Groq and Gemini
pip install "hallutok[all]"
# With local model support (Ollama / HuggingFace)
pip install "hallutok[local]"
🚀 Quick Start
Using Groq
from hallutok import HallutokClient
client = HallutokClient.with_groq(
    api_key="gsk_your_groq_key",
    model="llama3-8b-8192",
    temperature=0.3,
)
result = client.chat(
    "Please note that I would like you to explain in order to help me "
    "understand what black holes are and how they work."
)
print(result.response)
print(result.token_report)
# {'tokens_before': 48, 'tokens_after': 19, 'tokens_saved': 29, 'percent_saved': 60.4}
if result.validation.is_likely_hallucination:
    print("⚠️ Flags:", result.validation.flags)
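The fields in token_report are related by simple arithmetic. The sketch below assumes percent_saved is tokens_saved divided by tokens_before, rounded to one decimal place; that is inferred from the sample output above, not confirmed library behavior, and the function name savings_report_sketch is hypothetical.

```python
def savings_report_sketch(tokens_before: int, tokens_after: int) -> dict:
    """Rebuild a token_report-style dict from two token counts (illustrative only)."""
    tokens_saved = tokens_before - tokens_after
    # Guard against division by zero for an empty prompt.
    percent_saved = round(100 * tokens_saved / tokens_before, 1) if tokens_before else 0.0
    return {
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "tokens_saved": tokens_saved,
        "percent_saved": percent_saved,
    }


print(savings_report_sketch(48, 19))
# {'tokens_before': 48, 'tokens_after': 19, 'tokens_saved': 29, 'percent_saved': 60.4}
```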
Using Gemini
from hallutok import HallutokClient
client = HallutokClient.with_gemini(
    api_key="AIza_your_gemini_key",
    model="gemini-1.5-flash",
)
result = client.chat("Explain quantum entanglement to a 10-year-old.")
print(result.response)
Using providers directly
from hallutok import HallutokClient
from hallutok.providers import GroqProvider, GeminiProvider
provider = GroqProvider(api_key="gsk_...", model="mixtral-8x7b-32768")
client = HallutokClient(
    provider=provider,
    optimize_tokens=True,
    validate_responses=True,
    max_prompt_tokens=512,
    temperature=0.4,
    max_response_tokens=1024,
    system_prompt="You are a factual assistant. Cite sources when possible.",
)
result = client.chat("What causes inflation?")
🖥️ Runtime Engine (v0.1.1 — New)
The HallutokEngine lets you run any local model (via Ollama or HuggingFace) with the full Hallutok pipeline — token optimization, hallucination detection, context management, session tracking, and latency optimization — all built in.
Loading a model
from hallutok.runtime import HallutokEngine
# From Ollama (requires Ollama running locally)
engine = HallutokEngine.from_ollama("llama3")
# From HuggingFace Hub
engine = HallutokEngine.from_huggingface(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device="auto",  # auto-detects cuda / mps / cpu
    quantize=True,  # 4-bit quantization to save memory
)
# From a local model directory
engine = HallutokEngine.from_local("/path/to/my-model")
Creating sessions and chatting
session = engine.create_session(
    name="research-chat",
    system_prompt="You are a factual assistant. Always cite sources.",
    max_tokens=4096,
    trim_strategy="sliding",  # sliding | drop_oldest | summarize | priority
)
result = session.chat("Explain black holes")
print(result.response)
print(result.hallucination_score) # HRS 0.0–1.0
print(result.hallucination_risk) # LOW / MEDIUM / HIGH
print(result.tokens_saved)
print(result.latency_ms)
print(result.cache_hit) # True if served from KV cache
EngineResult fields
result.response # final (possibly cleaned) response
result.original_prompt # your raw input
result.optimized_prompt # what was actually sent to the model
result.tokens_before # tokens in original prompt
result.tokens_after # tokens in optimized prompt
result.tokens_saved # tokens saved
result.tokens_saved_pct # percentage saved
result.hallucination_score # HRS score 0.0–1.0
result.hallucination_risk # "LOW" | "MEDIUM" | "HIGH"
result.is_hallucination # bool
result.hallucination_flags # list of detected issues
result.math_scores # {"SCS": ..., "ECS": ..., "CDS": ..., "FGS": ..., "HRS": ...}
result.latency_ms # end-to-end latency in milliseconds
result.cache_hit # True if response came from KV cache
result.context_tokens_used # tokens currently in context window
result.context_tokens_available # tokens remaining in budget
result.suggestions # recommendations if hallucination detected
Multi-turn conversations
session = engine.create_session("science-qa")
r1 = session.chat("What are black holes?")
r2 = session.chat("How does Hawking radiation work?")
r3 = session.chat("What is the event horizon?", flag_turn=True) # never trimmed
Session persistence and export
# Save and reload a session
session.save("my_session.json")
restored = engine.load_session("my_session.json")
# Export in multiple formats
session.export_markdown("chat_log.md") # human-readable chat log
session.export_csv("analytics.csv") # per-turn analytics
Session analytics
stats = session.get_stats()
print(stats["total_turns"])
print(stats["total_tokens_saved"])
print(stats["avg_tokens_saved_pct"]) # e.g. 42.3
print(stats["total_hallucinations_caught"])
print(stats["avg_hallucination_score"])
print(stats["avg_latency_ms"])
print(stats["session_duration_s"])
print(stats["context_trims"]) # how many times context was auto-trimmed
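For intuition, session-level figures like these can be derived from per-turn records. The sketch below is not the library's implementation: the record keys mirror the EngineResult field names, the function name session_stats_sketch is hypothetical, and it assumes the averages are plain means over turns.

```python
from statistics import mean


def session_stats_sketch(turns: list[dict]) -> dict:
    """Aggregate per-turn records into session totals and averages (illustrative only)."""
    if not turns:
        return {"total_turns": 0, "total_tokens_saved": 0}
    return {
        "total_turns": len(turns),
        "total_tokens_saved": sum(t["tokens_saved"] for t in turns),
        # Assumption: a plain mean of the per-turn percentages.
        "avg_tokens_saved_pct": round(mean(t["tokens_saved_pct"] for t in turns), 1),
        "total_hallucinations_caught": sum(1 for t in turns if t["is_hallucination"]),
        "avg_hallucination_score": round(mean(t["hallucination_score"] for t in turns), 3),
        "avg_latency_ms": round(mean(t["latency_ms"] for t in turns), 1),
    }


# Made-up records, used only to exercise the function.
example_turns = [
    {"tokens_saved": 29, "tokens_saved_pct": 60.4, "is_hallucination": False,
     "hallucination_score": 0.2, "latency_ms": 100.0},
    {"tokens_saved": 42, "tokens_saved_pct": 77.8, "is_hallucination": True,
     "hallucination_score": 0.8, "latency_ms": 150.0},
]
print(session_stats_sketch(example_turns))
```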
Engine-wide stats
print(engine.get_stats())
# {
# "model": "llama3",
# "source": "ollama",
# "device": "cpu",
# "quantized": True,
# "memory_mb": 4000.0,
# "total_sessions": 3,
# "uptime_s": 142.7,
# "latency": {"calls": 12, "avg_ms": 134.2, "min_ms": 98.1, "max_ms": 312.4, "p95_ms": 280.0},
# "context_budget": 4096,
# "trim_strategy": "sliding"
# }
engine.clear_cache() # flush KV cache and optimizer cache
Advanced engine configuration
engine = HallutokEngine.from_ollama(
    model="mistral",
    max_tokens=8192,           # context window budget
    trim_strategy="priority",  # keep system + flagged + last 6 turns
    kv_cache=True,             # cache identical prompts
    warm_up=True,              # pre-warm model to reduce first-call latency
    stream=False,
    system_prompt="You are a concise, factual assistant.",
)
🔧 Components
TokenOptimizer
from hallutok.optimizer import TokenOptimizer
opt = TokenOptimizer()
raw = """
Please note that I would like you to, in order to be helpful,
can you please explain, it is important to note that, machine learning
is a subset of AI. Machine learning is a subset of AI.
"""
compressed = opt.optimize(raw, max_tokens=100)
report = opt.savings_report(raw, compressed)
# {'tokens_before': 54, 'tokens_after': 12, 'tokens_saved': 42, 'percent_saved': 77.8}
What the optimizer does, in order:
- Normalize whitespace (collapse spaces, trim blank lines)
- Strip boilerplate ("Please note that", "I would like you to", etc.)
- Deduplicate repeated sentences
- Replace verbose phrases ("in order to" → "to", "due to the fact that" → "because", …)
- Truncate to max_tokens at a sentence boundary
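The five steps above can be sketched in plain Python. This is a rough illustration, not the library's code: the phrase lists are hypothetical stand-ins, the function name optimize_sketch is invented, and tokens are approximated as whitespace-separated words, which a real tokenizer will not match.

```python
import re

# Hypothetical phrase lists; the library's actual lists are not shown in this README.
BOILERPLATE = ["please note that", "i would like you to", "it is important to note that"]
VERBOSE = {"in order to": "to", "due to the fact that": "because"}
SENTENCE_SPLIT = r"(?<=[.!?])\s+"


def optimize_sketch(text: str, max_tokens: int) -> str:
    # 1. Normalize whitespace: collapse every run of spaces and newlines to one space.
    text = re.sub(r"\s+", " ", text).strip()
    # 2. Strip boilerplate phrases (case-insensitive).
    for phrase in BOILERPLATE:
        text = re.sub(re.escape(phrase), "", text, flags=re.IGNORECASE)
    # 3. Deduplicate repeated sentences, keeping the first occurrence.
    seen, kept = set(), []
    for sentence in re.split(SENTENCE_SPLIT, text):
        key = sentence.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence.strip())
    text = " ".join(kept)
    # 4. Replace verbose phrases with shorter equivalents.
    for long_form, short_form in VERBOSE.items():
        text = re.sub(re.escape(long_form), short_form, text, flags=re.IGNORECASE)
    # 5. Truncate at a sentence boundary (tokens approximated as words).
    out, used = [], 0
    for sentence in re.split(SENTENCE_SPLIT, text):
        n = len(sentence.split())
        if used + n > max_tokens:
            break
        out.append(sentence)
        used += n
    return " ".join(out)


raw = "Please note that   cats sleep in order to rest. Cats sleep in order to rest. Dogs bark."
print(optimize_sketch(raw, max_tokens=100))
# cats sleep to rest. Dogs bark.
```

With max_tokens=4, only the first sentence fits, so the sketch returns "cats sleep to rest." instead of cutting mid-sentence.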
HallucinationValidator
from hallutok.antihallucination import HallucinationValidator
validator = HallucinationValidator()
response = "I think maybe studies show that eating chocolate probably cures cancer."
result = validator.validate(response)
print(result.confidence_score) # e.g. 0.72
print(result.is_likely_hallucination) # True / False
print(result.risk_level) # "LOW" | "MEDIUM" | "HIGH"
print(result.flags) # list of issues found
print(result.warnings) # human-readable descriptions
print(result.suggestions) # what to do about it
print(result.cleaned_response) # response + disclaimer if flagged
print(result.math_scores) # SCS, ECS, CDS, FGS, HRS breakdown
Detection layers:
| Layer | What it catches |
|---|---|
| Hedging | "I think", "maybe", "perhaps", "I'm not sure", etc. |
| Ungrounded claims | "Studies show…", "Research suggests…" without citations |
| Numeric anomalies | Percentages over 100%, other implausible numbers |
| Contradictions | "always" + "never", "increases" + "decreases" in same text |
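A rule-based pass over these four layers might look like the sketch below. The pattern lists, the citation heuristic, and the name flag_sketch are all assumptions made for illustration; the library's real rules and its HRS formula are not documented here, so this sketch only produces flags, not scores.

```python
import re

# Hypothetical patterns; taken from the examples in the table, not from the library.
HEDGES = ["i think", "maybe", "perhaps", "i'm not sure"]
UNGROUNDED = ["studies show", "research suggests"]
CONTRADICTIONS = [("always", "never"), ("increases", "decreases")]


def flag_sketch(text: str) -> list[str]:
    lowered = text.lower()
    flags = []
    # Layer 1: hedging phrases.
    if any(h in lowered for h in HEDGES):
        flags.append("hedging")
    # Layer 2: appeal to research with no bracketed citation or URL anywhere in the text.
    has_citation = bool(re.search(r"\[\d+\]|https?://", text))
    if any(u in lowered for u in UNGROUNDED) and not has_citation:
        flags.append("ungrounded_claim")
    # Layer 3: percentages above 100.
    if any(float(p) > 100 for p in re.findall(r"(\d+(?:\.\d+)?)\s*%", text)):
        flags.append("numeric_anomaly")
    # Layer 4: opposing terms appearing in the same text.
    words = set(re.findall(r"[a-z']+", lowered))
    if any(a in words and b in words for a, b in CONTRADICTIONS):
        flags.append("contradiction")
    return flags


print(flag_sketch("I think maybe studies show that eating chocolate probably cures cancer."))
# ['hedging', 'ungrounded_claim']
```

Keyword matching of this kind is cheap but blunt: a sentence that legitimately contrasts "always" and "never" would still be flagged, so a real validator would need to weigh such flags and not treat each one as conclusive.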
ContextWindowManager
from hallutok.runtime.context_manager import ContextWindowManager
ctx = ContextWindowManager(
    max_tokens=4096,
    trim_strategy="sliding",  # sliding | drop_oldest | summarize | priority
    reserve_tokens=512,       # tokens reserved for the response
)
ctx.add_message("system", "You are a helpful assistant.", flagged=True)
ctx.add_message("user", "What are black holes?")
ctx.add_message("assistant", "Black holes are regions of extremely strong gravity.")
print(ctx.stats())
# {'messages': 3, 'total_tokens': 28, 'available_tokens': 3556, 'budget': 4096,
# 'usage_percent': 0.7, 'trim_count': 0, 'strategy': 'sliding'}
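The numbers in that sample output are consistent with a simple budget calculation: 4096 - 512 - 28 = 3556 available tokens, and 28 / 4096 is about 0.7%. The sketch below reproduces it under the assumption that the response reserve is subtracted from the budget and that usage is measured against the full budget; both are inferred from the sample, and context_stats_sketch is a hypothetical name.

```python
def context_stats_sketch(total_tokens: int, budget: int, reserve_tokens: int) -> dict:
    """Recompute budget figures from token counts (illustrative only)."""
    # Assumption: reserved response tokens are not available for context.
    available = max(budget - reserve_tokens - total_tokens, 0)
    return {
        "total_tokens": total_tokens,
        "available_tokens": available,
        "budget": budget,
        "usage_percent": round(100 * total_tokens / budget, 1),
    }


print(context_stats_sketch(total_tokens=28, budget=4096, reserve_tokens=512))
# {'total_tokens': 28, 'available_tokens': 3556, 'budget': 4096, 'usage_percent': 0.7}
```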
Trim strategies:
| Strategy | Behavior |
|---|---|
| sliding | Keep system messages + last N conversation turns |
| drop_oldest | Remove oldest non-system, non-flagged messages first |
| summarize | Compress older messages into an extractive summary |
| priority | Keep system + flagged turns + last 6 messages |
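Two of these strategies, sliding and drop_oldest, are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the library's implementation: the Msg class and function names are hypothetical, the window is counted in individual messages where the library may count turns, and tokens are approximated as whitespace-separated words.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    role: str
    text: str
    flagged: bool = False


def count_tokens(msg: Msg) -> int:
    # Assumption: one token per word; a real tokenizer will differ.
    return len(msg.text.split())


def trim_sliding(messages: list[Msg], last_n: int) -> list[Msg]:
    """Keep system messages plus the last_n non-system messages, in original order."""
    non_system = [m for m in messages if m.role != "system"]
    keep = {id(m) for m in non_system[-last_n:]} if last_n > 0 else set()
    return [m for m in messages if m.role == "system" or id(m) in keep]


def trim_drop_oldest(messages: list[Msg], budget: int) -> list[Msg]:
    """Drop the oldest non-system, non-flagged messages until the total fits the budget."""
    kept = list(messages)
    while sum(count_tokens(m) for m in kept) > budget:
        for i, m in enumerate(kept):
            if m.role != "system" and not m.flagged:
                del kept[i]
                break
        else:
            break  # only protected messages remain; nothing more can be dropped
    return kept


history = [
    Msg("system", "You are helpful.", flagged=True),
    Msg("user", "one two three"),
    Msg("assistant", "four five"),
    Msg("user", "six seven eight nine", flagged=True),
    Msg("assistant", "ten"),
]
print([m.text for m in trim_sliding(history, last_n=2)])
# ['You are helpful.', 'six seven eight nine', 'ten']
print([m.text for m in trim_drop_oldest(history, budget=9)])
# ['You are helpful.', 'six seven eight nine', 'ten']
```

Note that drop_oldest can end over budget when only system and flagged messages remain; the sketch stops there and leaves the protected messages in place.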
LatencyOptimizer
from hallutok.runtime.latency_optimizer import LatencyOptimizer
lat = LatencyOptimizer(kv_cache_enabled=True, kv_cache_size=64)
# KV cache
lat.store_cache("What is AI?", "AI is artificial intelligence.")
cached = lat.get_cached("What is AI?") # returns cached response
# Latency stats
print(lat.latency_stats())
# {'calls': 12, 'avg_ms': 134.2, 'min_ms': 98.1, 'max_ms': 312.4,
# 'p95_ms': 280.0, 'cache_hits': 3, 'stream_mode': False}
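The two ideas here, a bounded prompt-to-response cache and a P95 latency figure, can be sketched with the standard library. Everything below is an assumption for illustration: the class and function names are hypothetical, least-recently-used eviction is one plausible policy for a cache of size 64, and the nearest-rank percentile is only one of several common definitions. The library may use an interpolating method, which would explain why the sample output above shows a P95 below the maximum for only 12 calls.

```python
from collections import OrderedDict


class PromptCacheSketch:
    """Bounded prompt -> response cache with least-recently-used eviction."""

    def __init__(self, max_size: int = 64):
        self.max_size = max_size
        self._items = OrderedDict()

    def store(self, prompt: str, response: str) -> None:
        self._items[prompt] = response
        self._items.move_to_end(prompt)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used entry

    def get(self, prompt: str):
        if prompt not in self._items:
            return None
        self._items.move_to_end(prompt)  # mark as recently used
        return self._items[prompt]


def p95_nearest_rank(latencies_ms: list[float]) -> float:
    """Smallest sample with at least 95% of all samples at or below it."""
    if not latencies_ms:
        raise ValueError("no latency samples recorded")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


cache = PromptCacheSketch(max_size=2)
cache.store("What is AI?", "AI is artificial intelligence.")
print(cache.get("What is AI?"))
print(p95_nearest_rank([float(ms) for ms in range(1, 101)]))
```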
💡 Tips to Maximize Token Savings
- Avoid filler openers: "Can you please", "I would like you to", "It is important that"
- Don't repeat yourself: Hallutok deduplicates, but it's faster not to duplicate at all
- Use max_prompt_tokens: set a hard cap so you never accidentally send a 4k-token prompt
- Lower the temperature: a low value such as temperature=0.3 generally reduces hallucination risk
- Use a system prompt: instruct the model to cite sources and avoid speculation
- Check token_report per call: it tells you exactly what was saved
- Use flag_turn=True for important turns you never want trimmed from context
📊 ChatResult Fields (API Providers)
result.response # final (possibly cleaned) text
result.original_prompt # your original input
result.optimized_prompt # what was actually sent to the API
result.token_report # {tokens_before, tokens_after, tokens_saved, percent_saved}
result.validation # ValidationResult object
result.provider # "groq" or "gemini"
result.warnings # list of human-readable warnings
🗺️ Roadmap
- Token optimization pipeline
- Hallucination detection with HRS scoring
- Groq + Gemini provider adapters
- Runtime Engine (Ollama + HuggingFace support)
- Context Window Manager with smart trimming
- Session Manager with history and export
- Latency Optimizer with KV cache
- Async support (achat())
- Streaming responses
- OpenAI / Together AI provider adapters
- Per-call token budget enforcement
- Self-consistency hallucination verification
📄 License
MIT License — see LICENSE for details.
File details
Details for the file hallutok-0.1.1.tar.gz.
File metadata
- Download URL: hallutok-0.1.1.tar.gz
- Upload date:
- Size: 34.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3300dd022935da65fce9765afcda5afe9392bf873dc02e9cafc7a6c07add440c |
| MD5 | 4db939bb93bf5e6070e66a4ac0b99c63 |
| BLAKE2b-256 | f24bf11742853c1667cc1585e36e2302fd958f2d2b7afa7b9a36469e5e1f01a4 |
File details
Details for the file hallutok-0.1.1-py3-none-any.whl.
File metadata
- Download URL: hallutok-0.1.1-py3-none-any.whl
- Upload date:
- Size: 33.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 14f304592ebf0904505610905c23944e9d4658aa4e34c84da37129e386e243c2 |
| MD5 | ca43635726b907d8e5c6a88a3ec28462 |
| BLAKE2b-256 | f4a645e535810bfd3ea8ac567ae6047d783db471b7a476e0da7d83330c10d239 |