A confidence-aware routing system for LLM hallucination detection using multi-signal approach
HalluNox
Confidence-Aware Routing for Large Language Model Reliability Enhancement
A Python package implementing a multi-signal approach to pre-generation hallucination mitigation for Large Language Models. HalluNox combines semantic alignment measurement, internal convergence analysis, and learned confidence estimation to produce unified confidence scores for proactive routing decisions.
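For intuition, here is a toy illustration of how three such signals could be combined into a single score. The signal names follow the description above, but the weights and the combination rule are purely illustrative assumptions; the actual combination used by HalluNox may differ.

def unified_confidence(semantic_alignment: float,
                       internal_convergence: float,
                       learned_confidence: float) -> float:
    """Toy weighted combination of the three signals (each in [0, 1])."""
    weights = (0.4, 0.3, 0.3)  # hypothetical weights, for illustration only
    signals = (semantic_alignment, internal_convergence, learned_confidence)
    return sum(w * s for w, s in zip(weights, signals))

print(unified_confidence(0.92, 0.88, 0.85))  # ~0.89: strong, well-supported answer
print(unified_confidence(0.40, 0.35, 0.30))  # ~0.36: weak signals, likely escalation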
Features
- Pre-generation Hallucination Detection: Assess model reliability before generation begins
- Confidence-Aware Routing: Automatically route queries based on estimated confidence
- Multi-Signal Approach: Combines semantic alignment, internal convergence, and learned confidence
- Multi-Model Support: Llama-3.2-3B-Instruct and MedGemma-4B-IT architectures
- Medical Domain Specialization: Enhanced MedGemma 4B-IT support with medical-grade confidence thresholds
- Multimodal Capabilities: Image analysis and response generation for MedGemma models
- Comprehensive Evaluation: Built-in metrics and routing strategy analysis
- Easy Integration: Simple API for both training and inference
- Performance Optimizations: Optional LLM loading for faster initialization and lower memory usage
- Enhanced Query-Context: Improved accuracy with structured prompt formatting
- Adaptive Thresholds: Dynamic confidence thresholds based on model type (0.62 for medical, 0.65 for general)
- Response Generation: Built-in response generation with confidence-gated output
- Automatic Model Management: Auto-download and configuration for supported models
Research Foundation
Based on the research paper "Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation" by Nandakishor M (Convai Innovations).
The approach implements deterministic routing to appropriate response pathways (a minimal routing sketch follows the threshold lists below):
General Models (Llama-3.2-3B)
- High Confidence (≥0.65): Local generation
- Medium Confidence (0.60-0.65): Retrieval-augmented generation
- Low Confidence (0.40-0.60): Route to larger models
- Very Low Confidence (<0.40): Human review required
Medical Models (MedGemma-4B-IT)
- High Medical Confidence (≥0.60): Local generation with medical validation
- Medium Medical Confidence (0.55-0.60): Medical literature verification required
- Low Medical Confidence (0.50-0.55): Professional medical verification required
- Very Low Medical Confidence (<0.50): Seek professional medical advice
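The threshold bands above translate directly into a routing function. Here is a minimal sketch for the general-model bands; the two upper action labels (LOCAL_GENERATION, RAG_RETRIEVAL) appear in the response formats later in this README, while the lower two are illustrative names rather than the package's internal identifiers.

def route_query(confidence: float) -> str:
    """Map a unified confidence score to a routing action (general-model bands)."""
    if confidence >= 0.65:
        return "LOCAL_GENERATION"          # high confidence: answer locally
    if confidence >= 0.60:
        return "RAG_RETRIEVAL"             # medium: retrieval-augmented generation
    if confidence >= 0.40:
        return "ROUTE_TO_LARGER_MODEL"     # low: escalate to a larger model
    return "HUMAN_REVIEW"                  # very low: human review required

print(route_query(0.72))  # LOCAL_GENERATION
print(route_query(0.35))  # HUMAN_REVIEW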
What's New in v0.6.3
Enhanced Query-Context Support
- Query-Context Pairs: Full support for query_context_pairs in MedGemma models for enhanced context-aware responses
- Improved Accuracy: Better confidence scoring when context is provided
- Enhanced Response Generation: Context-aware prompt formatting for more accurate medical responses
- Batch Processing: New generate_response_with_context() method for processing multiple query-context pairs
Medical Domain Enhancements
- Context Integration: Medical queries now benefit from patient context and clinical background
- Better Confidence: Context helps improve confidence scoring for medical scenarios
- Flexible Usage: Works with existing methods while providing new convenience functions
- Example Implementation: New query_context_example.py demonstrates usage patterns
Simplified Architecture
- Removed Dashboard: Eliminated dashboard dependencies for a cleaner core package
- Streamlined Installation: Faster installation without unnecessary web components
- Focused Functionality: Core hallucination detection without UI overhead
- Lightweight: Reduced package size and dependencies
Technical Improvements
- Enhanced Prompt Formatting: Context is properly integrated into medical prompts
- Backward Compatibility: All existing code continues to work unchanged
- Better Documentation: Comprehensive examples for query-context usage
- Stable Performance: Maintains all stability improvements from earlier releases
Installation
Requirements
- Python 3.8+
- PyTorch 1.13+
- CUDA-compatible GPU (recommended)
- At least 8GB GPU memory for inference (improved efficiency in v0.6.3+)
- 16GB RAM minimum (32GB recommended for training)
Install from PyPI
pip install hallunox
Install from Source
git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e .
MedGemma Model Access
HalluNox uses the open-access convaiinnovations/gemma-finetuned-4b-it model, which doesn't require authentication. The model will be automatically downloaded on first use.
Core Dependencies
HalluNox automatically installs:
- torch>=1.13.0 - PyTorch framework
- transformers>=4.21.0 - Hugging Face Transformers
- FlagEmbedding>=1.2.0 - BGE-M3 embedding model
- datasets>=2.0.0 - Dataset loading utilities
- scikit-learn>=1.0.0 - Evaluation metrics
- numpy>=1.21.0 - Numerical computations
- tqdm>=4.64.0 - Progress bars
- Pillow>=8.0.0 - Image processing for multimodal capabilities
- bitsandbytes>=0.41.0 - 4-bit quantization for memory optimization
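After installation, a quick sanity check can confirm that the core dependencies import cleanly and that a GPU is visible. This is a minimal sketch; it relies only on the packages listed above and standard PyTorch calls.

import torch
import transformers
import hallunox  # the import alone confirms the package is installed

print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")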
Quick Start
Basic Usage (Llama-3.2-3B)
from hallunox import HallucinationDetector

# Initialize detector (downloads pre-trained model automatically)
detector = HallucinationDetector()

# Analyze text for hallucination risk
results = detector.predict([
    "The capital of France is Paris.",  # High confidence
    "Your password is 12345678.",       # Low confidence
    "The Moon is made of cheese."       # Very low confidence
])

# View results
for pred in results["predictions"]:
    print(f"Text: {pred['text']}")
    print(f"Confidence: {pred['confidence_score']:.3f}")
    print(f"Risk Level: {pred['risk_level']}")
    print(f"Routing Action: {pred['routing_action']}")
    print()
MedGemma Medical Domain Usage
For medical applications using MedGemma 4B-IT with multimodal capabilities:
from hallunox import HallucinationDetector
from PIL import Image

# Initialize MedGemma detector (auto-downloads medical model)
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    confidence_threshold=0.60,        # Medical-grade threshold
    enable_response_generation=True,  # Enable response generation
    enable_inference=True,
    mode="text"                       # Text-only mode (default)
)

# Medical text analysis
medical_results = detector.predict([
    "Aspirin can help reduce heart attack risk when prescribed by a doctor.",
    "Drinking bleach will cure COVID-19.",  # Dangerous misinformation
    "Type 2 diabetes requires insulin injections in all cases.",  # Partially incorrect
])

for pred in medical_results["predictions"]:
    print(f"Medical Text: {pred['text'][:60]}...")
    print(f"Confidence: {pred['confidence_score']:.3f}")
    print(f"Risk Level: {pred['risk_level']}")
    print(f"Medical Action: {pred['routing_action']}")
    print(f"Description: {pred['description']}")
    print("-" * 50)

# Response generation with confidence checking
question = "What are the symptoms of pneumonia?"
response = detector.generate_response(question, check_confidence=True)

if response["should_generate"]:
    print(f"Medical response generated (confidence: {response['confidence_score']:.3f})")
    print(f"Response: {response['response']}")
    print(f"Meets threshold: {response['meets_threshold']}")
    if response.get('forced_generation'):
        print("Note: Response was generated despite low confidence")
else:
    print(f"Response blocked (confidence: {response['confidence_score']:.3f})")
    print(f"Reason: {response['reason']}")
    print(f"Recommendation: {response['recommendation']}")

# Force generation for reference regardless of confidence
forced_response = detector.generate_response(
    question,
    check_confidence=True,
    force_generate=True  # Generate even if confidence is low
)
print(f"Reference Response (forced): {forced_response['response']}")
print(f"Confidence: {forced_response['confidence_score']:.3f}")
print(f"Forced Generation: {forced_response['forced_generation']}")

# Multimodal image analysis (MedGemma 4B-IT only)
if detector.is_multimodal:
    print("\nMultimodal Image Analysis")

    # Load medical image (replace with actual medical image)
    try:
        image = Image.open("chest_xray.jpg")
    except Exception:
        # Create demo image for testing
        image = Image.new('RGB', (224, 224), color='lightgray')

    # Analyze image confidence
    image_results = detector.predict_images([image], ["Chest X-ray"])
    for pred in image_results["predictions"]:
        print(f"Image: {pred['image_description']}")
        print(f"Confidence: {pred['confidence_score']:.3f}")
        print(f"Interpretation: {pred['interpretation']}")
        print(f"Risk Level: {pred['risk_level']}")

    # Generate image description
    description = detector.generate_image_response(
        image,
        "Describe the findings in this chest X-ray."
    )
    print(f"Generated Description: {description}")
Advanced Configuration
from hallunox import HallucinationDetector
# Full configuration example
detector = HallucinationDetector(
# Model selection
llm_model_id="convaiinnovations/gemma-finetuned-4b-it", # or "unsloth/Llama-3.2-3B-Instruct"
embed_model_id="BAAI/bge-m3",
# Custom model weights (optional)
model_path="/path/to/custom/model.pt", # None = auto-download
# Hardware configuration
device="cuda", # or "cpu"
use_fp16=True, # Mixed precision for faster inference
# Sequence lengths
max_length=512, # LLM context length
bge_max_length=512, # BGE-M3 context length
# Feature toggles
load_llm=True, # Load LLM for embeddings
enable_inference=True, # Enable LLM inference
enable_response_generation=True, # Enable response generation
# Confidence settings
confidence_threshold=0.60, # Custom threshold (auto-detected by model type)
# Operation mode
mode="text", # "text", "image", "both", or "auto"
)
# Check model capabilities
print(f"Model type: {'Medical' if detector.is_medgemma_4b else 'General'}")
print(f"Multimodal support: {detector.is_multimodal}")
print(f"Operation mode: {detector.effective_mode} (requested: {detector.mode})")
print(f"Confidence threshold: {detector.confidence_threshold}")
Operation Mode Configuration
The mode parameter controls what types of input the detector can process:
from hallunox import HallucinationDetector
# Text mode (default) - processes text inputs only
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="text" # Text-only processing (default)
)
# Auto mode - detects capabilities from model
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="auto" # Auto: detects based on model capabilities
)
# Image-only mode - processes images only (requires multimodal model)
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="image" # Image processing only
)
# Both mode - processes text and images (requires multimodal model)
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
mode="both" # Explicit multimodal mode
)
Mode Validation
- Text mode: Available for all models (default)
- Image mode: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)
- Both mode: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)
- Auto mode: Automatically selects based on model capabilities
  - Multimodal models → effective_mode = "both"
  - Text-only models → effective_mode = "text"
Error Examples
# This will raise an error - image mode requires a multimodal model
detector = HallucinationDetector(
    llm_model_id="unsloth/Llama-3.2-3B-Instruct",
    mode="image"  # Error: image mode requires a multimodal model
)

# This will raise an error - calling image methods in text mode
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="text"
)
detector.predict_images([image])  # Error: current mode is 'text'
Performance Optimized Usage
For faster initialization when only doing embedding comparisons:
from hallunox import HallucinationDetector
# Option 1: Factory method for embedding-only usage
detector = HallucinationDetector.for_embedding_only(
device="cuda",
use_fp16=True
)
# Option 2: Explicit parameter control
detector = HallucinationDetector(
load_llm=False, # Skip expensive LLM loading
enable_inference=False, # Disable inference capabilities
use_fp16=True # Use mixed precision
)
# Note: This configuration cannot perform predictions
# Use for preprocessing or embedding extraction only
Memory Optimization with Quantization
For GPUs with limited VRAM (8-16GB), use 4-bit quantization:
from hallunox import HallucinationDetector
# Option 1: Auto-optimized for low memory (recommended)
detector = HallucinationDetector.for_low_memory(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it", # Or any supported model
device="cuda",
enable_response_generation=True, # Enable response generation for evaluation
verbose=True # Show loading progress (optional)
)
# Option 2: Manual quantization configuration
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
use_quantization=True, # Enable 4-bit quantization
enable_response_generation=True,
device="cuda"
)
# Option 3: Custom quantization settings
from transformers import BitsAndBytesConfig
import torch
custom_quant_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # NF4 quantization type
bnb_4bit_use_double_quant=True, # Double quantization for extra savings
bnb_4bit_compute_dtype=torch.bfloat16 # Compute in bfloat16
)
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
quantization_config=custom_quant_config,
device="cuda"
)
print(f"โ
Memory optimized: {detector.use_quantization}")
print(f"๐ง Quantization: 4-bit NF4 with double quantization")
Response Generation & Evaluation
Enabling Response Generation
When enable_response_generation=True, HalluNox can generate responses for evaluation and display the model's actual output alongside confidence scores:
from hallunox import HallucinationDetector
# Enable response generation for evaluation
detector = HallucinationDetector.for_low_memory(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
device="cuda",
enable_response_generation=True, # Enable response generation
verbose=False # Clean logs for evaluation
)
# Test questions for evaluation
test_questions = [
"What are the symptoms of diabetes?",
"Drinking bleach will cure COVID-19.", # Dangerous misinformation
"How does aspirin help prevent heart attacks?",
"All vaccines cause autism in children.", # Medical misinformation
]
# Analyze with response generation
for question in test_questions:
    # The model will generate a response and analyze it
    results = detector.predict([question])
    prediction = results["predictions"][0]

    print(f"Question: {question}")
    print(f"Confidence: {prediction['confidence_score']:.3f}")
    print(f"Risk Level: {prediction['risk_level']}")
    print(f"Action: {prediction['medical_action']}")
    print(f"Description: {prediction['description']}")
    print("-" * 50)
Response Generation Modes
# Generate and analyze responses with confidence checking
response = detector.generate_response(
"What are the side effects of ibuprofen?",
check_confidence=True
)
if response["should_generate"]:
    print(f"Generated Response: {response['response']}")
    print(f"Confidence: {response['confidence_score']:.3f}")
    print(f"Meets threshold: {response['meets_threshold']}")
else:
    print(f"Response blocked (confidence: {response['confidence_score']:.3f})")
    print(f"Reason: {response['reason']}")
    print(f"Recommendation: {response['recommendation']}")

# Force generation for reference (useful for evaluation)
forced_response = detector.generate_response(
    "What are the side effects of ibuprofen?",
    check_confidence=True,
    force_generate=True
)
print(f"Reference Response: {forced_response['response']}")
print(f"Confidence: {forced_response['confidence_score']:.3f}")
print(f"Forced Generation: {forced_response['forced_generation']}")
Evaluation Output Example
Question: What are the symptoms of diabetes?
Generated Response: Common symptoms of diabetes include increased thirst, frequent urination, excessive hunger, unexplained weight loss, fatigue, and blurred vision. It's important to consult a healthcare provider for proper diagnosis.
Confidence: 0.857
Risk Level: LOW_MEDICAL_RISK
Action: Information can be used as reference
--------------------------------------------------
Question: Drinking bleach will cure COVID-19.
Generated Response: [Response blocked - confidence too low]
Confidence: 0.123
Risk Level: VERY_HIGH_MEDICAL_RISK
Action: Do not use - seek professional medical advice
--------------------------------------------------
Memory Usage Comparison
| Configuration | Model Size | VRAM Usage | Performance |
|---|---|---|---|
| Full Precision | ~16GB | ~14GB | 100% speed |
| FP16 Mixed Precision | ~8GB | ~7GB | 95% speed |
| 4-bit Quantization | ~4GB | ~3.5GB | 85-90% speed |
| 4-bit + Double Quant | ~3.5GB | ~3GB | 85-90% speed |
Recommendation: Use HallucinationDetector.for_low_memory() for GPUs with 8GB or less VRAM.
Enhanced Query-Context Support (NEW in v0.6.3!)
HalluNox now provides comprehensive support for query-context pairs, especially beneficial for medical applications:
from hallunox import HallucinationDetector
# Initialize MedGemma detector for context-aware medical responses
detector = HallucinationDetector(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
enable_response_generation=True
)
# Medical query-context pairs for enhanced accuracy
medical_query_context_pairs = [
{
"query": "Is it safe to take ibuprofen daily?",
"context": "Patient has a history of gastric ulcers and is currently taking blood thinners for atrial fibrillation."
},
{
"query": "What's the recommended exercise routine?",
"context": "28-year-old pregnant patient at 30 weeks, previously sedentary, no complications."
},
{
"query": "How should I manage my diabetes medication?",
"context": "Type 2 diabetes patient, HbA1c 8.2%, currently on metformin 1000mg twice daily."
}
]
# Method 1: Confidence analysis with context
results = detector.predict_with_query_context(medical_query_context_pairs)
for pred in results["predictions"]:
    print(f"Query: {pred['text']}")
    print(f"Context-Enhanced Confidence: {pred['confidence_score']:.3f}")
    print(f"Medical Risk Level: {pred['risk_level']}")
    print(f"Recommendation: {pred['routing_action']}")
# Method 2: Response generation with context
responses = detector.generate_response_with_context(
medical_query_context_pairs,
max_length=300,
check_confidence=True
)
for i, response in enumerate(responses):
    pair = medical_query_context_pairs[i]
    print(f"\nQuery: {pair['query']}")
    print(f"Context: {pair['context'][:60]}...")

    if isinstance(response, dict) and "should_generate" in response:
        if response["should_generate"]:
            print(f"Context-Aware Response: {response['response']}")
            print(f"Confidence: {response['confidence_score']:.3f}")
        else:
            print(f"Blocked (confidence: {response['confidence_score']:.3f})")
            print(f"Recommendation: {response['recommendation']}")
# Method 3: Individual response with context
single_response = detector.generate_response(
prompt="Should I adjust my medication?",
query_context_pairs=[{
"query": "Should I adjust my medication?",
"context": "Patient experiencing mild side effects from current dosage"
}],
check_confidence=True
)
Context Impact Analysis
# Compare confidence with and without context
query = "Is this medication safe during pregnancy?"
# Without context
no_context = detector.predict([query])
print(f"Without context: {no_context['predictions'][0]['confidence_score']:.3f}")
# With context
with_context = detector.predict([query], query_context_pairs=[{
"query": query,
"context": "Patient is 12 weeks pregnant, no previous complications, taking prenatal vitamins"
}])
print(f"With context: {with_context['predictions'][0]['confidence_score']:.3f}")
# Context benefit
improvement = with_context['predictions'][0]['confidence_score'] - no_context['predictions'][0]['confidence_score']
print(f"Context improvement: {improvement:+.3f}")
Command Line Interface
HalluNox provides a comprehensive CLI for various use cases:
Interactive Mode
# General model interactive mode
hallunox-infer --interactive
# MedGemma medical interactive mode
hallunox-infer --llm_model_id convaiinnovations/gemma-finetuned-4b-it --interactive --show_generated_text
Batch Processing
# Process file with general model
hallunox-infer --input_file medical_texts.txt --output_file results.json
# Process with MedGemma and medical settings
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--input_file medical_texts.txt \
--output_file medical_results.json \
--show_routing \
--show_generated_text
Image Analysis (Multimodal models only)
# Single image analysis
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--image_path chest_xray.jpg \
--show_generated_text
# Batch image analysis
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--image_folder /path/to/medical/images \
--output_file image_analysis.json
Demo Mode
# General demo
hallunox-infer --demo --show_routing
# Medical demo with MedGemma
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--demo \
--mode both \
--show_routing
# Text-only demo (faster initialization)
hallunox-infer \
--llm_model_id convaiinnovations/gemma-finetuned-4b-it \
--demo \
--mode text \
--show_routing
Training Your Own Model
Quick Training
from hallunox import Trainer, TrainingConfig
# Configure training
config = TrainingConfig(
# Model selection
model_id="convaiinnovations/gemma-finetuned-4b-it", # or "unsloth/Llama-3.2-3B-Instruct"
embed_model_id="BAAI/bge-m3",
# Training parameters
batch_size=8,
learning_rate=5e-4,
max_epochs=6,
warmup_steps=300,
# Dataset configuration
use_truthfulqa=True,
use_halueval=True,
use_fever=True,
max_samples_per_dataset=3000,
# Output
output_dir="./models/my_medical_model"
)
# Train model
trainer = Trainer(config)
trainer.train()
Command Line Training
# Train general model
hallunox-train --batch_size 8 --learning_rate 5e-4 --max_epochs 6
# Train medical model
hallunox-train \
--model_id convaiinnovations/gemma-finetuned-4b-it \
--batch_size 4 \
--learning_rate 3e-4 \
--max_epochs 8 \
--output_dir ./models/custom_medgemma
Model Architecture
HalluNox supports two main architectures:
General Architecture (Llama-3.2-3B)
- LLM Component: Llama-3.2-3B-Instruct
  - Extracts internal hidden representations (3072D)
  - Supports any Llama-architecture model
- Embedding Model: BGE-M3 (fixed)
  - Provides reference semantic embeddings
  - 1024-dimensional dense vectors
- Projection Network: Standard ProjectionHead (a minimal sketch follows this list)
  - Maps LLM hidden states to embedding space
  - 3-layer MLP with ReLU activations and dropout
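For orientation, here is a minimal sketch of what a projection head of this shape can look like. The 3072-dimensional input and 1024-dimensional output follow the description above; the intermediate width, dropout rate, and exact layer ordering are assumptions and may differ from the trained ProjectionHead shipped with HalluNox.

import torch
import torch.nn as nn

class ProjectionHeadSketch(nn.Module):
    """Illustrative 3-layer MLP mapping pooled LLM hidden states (3072D)
    into the BGE-M3 embedding space (1024D). Layer sizes are assumptions."""

    def __init__(self, in_dim=3072, hidden_dim=2048, out_dim=1024, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, 3072) pooled representation from the LLM
        return self.net(hidden_states)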
Medical Architecture (MedGemma-4B-IT)
- Unified Multimodal Model:
  - Single Model: AutoModelForImageTextToText handles both text and images
  - Memory Optimized: Avoids double loading (saves ~8GB VRAM)
  - Fallback Support: Graceful degradation to text-only if needed
- Embedding Model: BGE-M3 (same as general)
  - Enhanced with medical context formatting
- Projection Network: UltraStableProjectionHead
  - Ultra-stable architecture with heavy normalization
  - Conservative weight initialization for medical precision
  - Tanh activations for stability
  - Enhanced dropout and layer normalization
- Multimodal Processor: AutoProcessor
  - Handles image + text inputs
  - Supports chat template formatting
- Quantization Support: 4-bit NF4 with double quantization
  - Reduces memory usage by ~75%
  - Maintains 85-90% performance
  - Automatic fallback for CPU
API Reference
HallucinationDetector
Constructor Parameters
HallucinationDetector(
model_path: str = None, # Path to trained model (None = auto-download)
llm_model_id: str = "unsloth/Llama-3.2-3B-Instruct", # LLM model ID
embed_model_id: str = "BAAI/bge-m3", # Embedding model ID
device: str = None, # Device (None = auto-detect)
max_length: int = 512, # LLM sequence length
bge_max_length: int = 512, # BGE-M3 sequence length
use_fp16: bool = True, # Mixed precision
load_llm: bool = True, # Load LLM
enable_inference: bool = False, # Enable LLM inference
confidence_threshold: float = None, # Custom threshold (auto-detected)
enable_response_generation: bool = False, # Enable response generation
use_quantization: bool = False, # Enable 4-bit quantization for memory savings
quantization_config: BitsAndBytesConfig = None, # Custom quantization config
mode: str = "text", # Operation mode: "text", "image", "both", "auto" (default: "text")
)
Core Methods
Text Analysis:
- predict(texts, query_context_pairs=None) - Analyze texts for hallucination confidence
- predict_with_query_context(query_context_pairs) - Query-context prediction
- batch_predict(texts, batch_size=16) - Efficient batch processing
Response Generation:
- generate_response(prompt, max_length=512, check_confidence=True, force_generate=False, query_context_pairs=None) - Generate responses with confidence checking and optional context
- generate_response_with_context(query_context_pairs, max_length=512, check_confidence=True, force_generate=False) - Generate responses for multiple query-context pairs
Multimodal (MedGemma only):
- predict_images(images, image_descriptions=None) - Analyze image confidence
- generate_image_response(image, prompt, max_length=200) - Generate image descriptions
Analysis:
- evaluate_routing_strategy(texts) - Analyze routing decisions
Factory Methods:
- for_embedding_only() - Create embedding-only detector
- for_low_memory() - Create memory-optimized detector with 4-bit quantization
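A short sketch tying the analysis methods together. The method names and parameters come from the list above; the fields read from batch_predict assume it returns the same "predictions" structure as predict, and the return value of evaluate_routing_strategy is simply printed because its exact shape is not documented here.

from hallunox import HallucinationDetector

detector = HallucinationDetector(enable_inference=True)
texts = [
    "The capital of France is Paris.",
    "The Moon is made of cheese.",
]

# Batch scoring (assumed to mirror predict()'s output format)
batch_results = detector.batch_predict(texts, batch_size=16)
for pred in batch_results["predictions"]:
    print(pred["confidence_score"], pred["routing_action"])

# Aggregate view of routing decisions for the same inputs
print(detector.evaluate_routing_strategy(texts))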
Response Format
{
"predictions": [
{
"text": "input text",
"confidence_score": 0.85, # 0.0 to 1.0
"similarity_score": 0.92, # Cosine similarity
"interpretation": "HIGH_CONFIDENCE", # or HIGH_MEDICAL_CONFIDENCE
"risk_level": "LOW_RISK", # or LOW_MEDICAL_RISK
"routing_action": "LOCAL_GENERATION",
"description": "This response appears to be factual and reliable."
}
],
"summary": {
"total_texts": 1,
"avg_confidence": 0.85,
"high_confidence_count": 1,
"medium_confidence_count": 0,
"low_confidence_count": 0,
"very_low_confidence_count": 0
}
}
Response Generation Format
{
"response": "Generated response text",
"confidence_score": 0.85,
"should_generate": True,
"meets_threshold": True,
"forced_generation": False, # True if generated despite low confidence
# Or when blocked:
"reason": "Confidence 0.45 below threshold 0.60",
"recommendation": "RAG_RETRIEVAL"
}
Training Classes
- TrainingConfig: Configuration dataclass for training parameters
- Trainer: Main training class with dataset loading and model training
- MultiDatasetLoader: Loads and combines multiple hallucination detection datasets
Utility Functions
- download_model(): Download the general pre-trained model
- download_medgemma_model(model_name): Download a MedGemma medical model
- setup_logging(level): Configure logging
- check_gpu_availability(): Check CUDA compatibility
- validate_model_requirements(): Verify dependencies
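A brief sketch of these helpers in use before constructing a detector. The function names come from the list above; it is assumed here that they are importable from the package root, and their return values are simply printed because their exact formats are not documented in this README.

from hallunox import setup_logging, check_gpu_availability, validate_model_requirements

setup_logging("INFO")                  # configure package logging
print(check_gpu_availability())        # report CUDA availability
print(validate_model_requirements())   # verify required dependencies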
Performance
Our confidence-aware routing system demonstrates:
- 74% hallucination detection rate (vs 42% baseline)
- 9% false positive rate (vs 15% baseline)
- 40% reduction in computational cost vs post-hoc methods
- 1.6x cost multiplier, versus 4.2x when always using the most expensive operations
Medical Domain Performance (MedGemma)
- Enhanced medical accuracy with 0.62 confidence threshold
- Multimodal capability for medical image analysis
- Safety-first approach with conservative thresholds
- Professional verification workflow for low-confidence cases
Hardware Requirements
Minimum (Inference Only)
- CPU: Modern multi-core processor
- RAM: 16GB system memory
- GPU: 8GB VRAM (RTX 3070, RTX 4060 Ti+)
- Storage: 15GB free space
- Models: ~5GB each (Llama/MedGemma)
Recommended (Inference)
- CPU: Intel i7/AMD Ryzen 7+
- RAM: 32GB system memory
- GPU: 12GB+ VRAM (RTX 4070, RTX 3080+)
- Storage: NVMe SSD, 25GB+ free
- CUDA: 11.8+ compatible driver
Training Requirements
- CPU: High-performance multi-core (i9/Ryzen 9)
- RAM: 64GB+ system memory
- GPU: 24GB+ VRAM (RTX 4090, A100, H100)
- Storage: 200GB+ NVMe SSD
- Model checkpoints: ~10GB per epoch
- Training datasets: ~30GB
- Logs and outputs: ~50GB
- Network: High-speed internet for downloads
MedGemma Specific
- Additional storage: +10GB for multimodal models
- Image processing: PIL/Pillow for image capabilities
- Memory: +4GB RAM for image processing pipeline
CPU-Only Mode
- RAM: 32GB minimum (64GB recommended)
- Performance: 10-50x slower than GPU
- Not recommended: For production medical applications
Safety Considerations
Medical Applications
- Professional oversight required: HalluNox is a research tool, not medical advice
- Validation needed: All medical outputs should be verified by qualified professionals
- Conservative thresholds: 0.62 threshold ensures high precision for medical content
- Clear disclaimers: Always include appropriate medical disclaimers in applications
General Use
- Confidence-based routing: Use routing recommendations for appropriate escalation
- Human oversight: Very low confidence predictions require human review
- Regular evaluation: Monitor performance on your specific use cases
Troubleshooting
Common Issues and Solutions
CUDA Out of Memory Error
OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB...
Solution: Use 4-bit quantization
detector = HallucinationDetector.for_low_memory()
Deprecated torch_dtype Warning
`torch_dtype` is deprecated! Use `dtype` instead!
Solution: Already fixed in HalluNox v0.3.2+ - the package now uses the correct dtype parameter.
Double Model Loading (MedGemma)
Loading checkpoint shards: 100% 2/2 [00:37<00:00, 18.20s/it]
Loading checkpoint shards: 100% 2/2 [00:36<00:00, 17.88s/it]
Solution: Already optimized in HalluNox v0.3.2+ - MedGemma now uses a unified model approach that avoids double loading.
Accelerate Warning
WARNING:accelerate.big_modeling:Some parameters are on the meta device...
Solution: This is normal with quantization - parameters are automatically moved to GPU during inference.
Dependency Version Conflict (AutoProcessor)
Could not load AutoProcessor: module 'requests' has no attribute 'exceptions'
AttributeError: module 'requests' has no attribute 'exceptions'
Solution: This is a compatibility issue between transformers and requests versions.
pip install --upgrade transformers requests huggingface_hub
# Or force reinstall
pip install --force-reinstall "transformers>=4.45.0" "requests>=2.31.0"
Fallback: HalluNox automatically falls back to text-only mode when this occurs.
Model Hidden States NaN/Inf Issues (RESOLVED)
Warning: NaN/Inf detected in model hidden states
Hidden shape: torch.Size([3, 16, 2560])
NaN count: 122880
FIXED in HalluNox v0.6.3+: This issue has been completely resolved by adopting the proven approach from our working inference pipeline.
Root Cause: 4-bit quantization was causing numerical instabilities with certain model architectures.
Solution Applied:
- Disabled Quantization: Removed the 4-bit quantization that was causing NaN issues
- Simplified Model Loading: Now uses the same approach as our proven inference_gemma.py
- Clean Architecture: Removed complex stability measures that were interfering
- Stable Precision: Uses torch.bfloat16 for optimal performance without instabilities
Repetitive Text and Unwanted Artifacts (RESOLVED)
Reference Response (forced): I am programmed to be a harmless AI assistant...
g
I am programmed to be a harmless AI assistant...
g
[repetitive output continues...]
FIXED in HalluNox v0.6.3+: Repetitive text generation and unwanted artifacts have been completely resolved.
Root Cause: Improper message formatting and sampling parameters caused the model to not understand conversation boundaries.
Solution Applied:
- Deterministic Generation: Changed from do_sample=True to do_sample=False, matching the Jupyter notebook approach
- Proper Chat Templates: Adopted the exact message formatting from the working Jupyter notebook implementation
- Removed Sampling Parameters: Eliminated temperature, top_p, and repetition_penalty, which were causing issues
- Clean Tokenization: Uses tokenizer.apply_chat_template() with proper parameters for conversation structure
Current Recommended Usage (v0.6.3+):
# Standard usage - now stable by default
detector = HallucinationDetector.for_low_memory(
llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
device="cuda"
)
# Both NaN issues and repetitive text are now automatically resolved
Migration from v0.4.9 and earlier: No code changes needed - existing code will automatically use the stable approach.
Environment Optimization
For better memory management, set:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Memory Requirements by Configuration
| GPU VRAM | Recommended Configuration | Expected Performance |
|---|---|---|
| 4-6GB | for_low_memory() + reduce batch size | Basic functionality |
| 8-12GB | for_low_memory() | Full functionality |
| 16GB+ | Standard configuration | Optimal performance |
| 24GB+ | Multiple models + training | Development/research |
License
This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
Citation
If you use HalluNox in your research, please cite:
@article{nandakishor2024hallunox,
title={Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation},
author={Nandakishor M},
journal={AI Safety Research},
year={2024},
organization={Convai Innovations}
}
Contributing
We welcome contributions! Please see our contributing guidelines and submit pull requests to our repository.
Development Setup
git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e ".[dev]"
Support
For technical support and questions:
- Email: support@convaiinnovations.com
- Issues: GitHub Issues
- Documentation: Full API docs available online
Author
Nandakishor M
AI Safety Research
Convai Innovations Pvt. Ltd.
Email: support@convaiinnovations.com
Disclaimer: HalluNox is a research tool for hallucination detection and should not be used as the sole basis for critical decisions, especially in medical contexts. Always seek professional advice for medical applications.
File details
Details for the file hallunox-0.6.3.tar.gz.
File metadata
- Download URL: hallunox-0.6.3.tar.gz
- Upload date:
- Size: 84.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 75755724d9845caff0cb8f9dc39ee5e5e8c53d266f686ed35056167dce9cd6a7 |
| MD5 | 4c8b49e98068611d02d576b20b6cbf92 |
| BLAKE2b-256 | 2c76f15afa53d796d647f9a651381152717bc7951aa9cb638e9e4fedbcccfc9b |
File details
Details for the file hallunox-0.6.3-py3-none-any.whl.
File metadata
- Download URL: hallunox-0.6.3-py3-none-any.whl
- Upload date:
- Size: 63.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6261ab99835db50428edb43953c1b26c8abf052cccd6853c86d8142a6746e8ec |
| MD5 | 533c4b5190880399506fd14d2c0fa1a3 |
| BLAKE2b-256 | 3d1715a5b7b1a88a22dfa7cfe9637e558e2f3f6add8c1d850235f59dd87c6cb7 |