
A confidence-aware routing system for LLM hallucination detection using a multi-signal approach


HalluNox

Confidence-Aware Routing for Large Language Model Reliability Enhancement

A Python package implementing a multi-signal approach to pre-generation hallucination mitigation for Large Language Models. HalluNox combines semantic alignment measurement, internal convergence analysis, and learned confidence estimation to produce unified confidence scores for proactive routing decisions.

✨ Features

  • 🎯 Pre-generation Hallucination Detection: Assess model reliability before generation begins
  • 🔄 Confidence-Aware Routing: Automatically route queries based on estimated confidence
  • 🧠 Multi-Signal Approach: Combines semantic alignment, internal convergence, and learned confidence
  • ⚡ Multi-Model Support: Llama-3.2-3B-Instruct and MedGemma-4B-IT architectures
  • 🏥 Medical Domain Specialization: Enhanced MedGemma 4B-IT support with medical-grade confidence thresholds
  • 🖼️ Multimodal Capabilities: Image analysis and response generation for MedGemma models
  • 📊 Comprehensive Evaluation: Built-in metrics and routing strategy analysis
  • 🚀 Easy Integration: Simple API for both training and inference
  • 🏃‍♂️ Performance Optimizations: Optional LLM loading for faster initialization and lower memory usage
  • 📝 Enhanced Query-Context: Improved accuracy with structured prompt formatting
  • 🎛️ Adaptive Thresholds: Dynamic confidence thresholds based on model type (0.62 for medical, 0.65 for general)
  • 💬 Response Generation: Built-in response generation with confidence-gated output
  • 🔧 Automatic Model Management: Auto-download and configuration for supported models

🔬 Research Foundation

Based on the research paper "Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation" by Nandakishor M (Convai Innovations).

The approach implements deterministic routing to appropriate response pathways:

General Models (Llama-3.2-3B)

  • High Confidence (≥0.65): Local generation
  • Medium Confidence (0.60-0.65): Retrieval-augmented generation
  • Low Confidence (0.40-0.60): Route to larger models
  • Very Low Confidence (<0.40): Human review required

Medical Models (MedGemma-4B-IT)

  • High Medical Confidence (≥0.60): Local generation with medical validation
  • Medium Medical Confidence (0.55-0.60): Medical literature verification required
  • Low Medical Confidence (0.50-0.55): Professional medical verification required
  • Very Low Medical Confidence (<0.50): Seek professional medical advice
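
As an illustration, the sketch below maps a confidence score onto these bands. The thresholds mirror the lists above; the returned action strings are illustrative placeholders (the packaged detector reports its own routing_action values).

def route_query(confidence: float, medical: bool = False) -> str:
    """Illustrative mapping from a confidence score to a routing action."""
    if medical:
        # MedGemma-style thresholds
        if confidence >= 0.60:
            return "LOCAL_GENERATION_WITH_MEDICAL_VALIDATION"
        if confidence >= 0.55:
            return "MEDICAL_LITERATURE_VERIFICATION"
        if confidence >= 0.50:
            return "PROFESSIONAL_MEDICAL_VERIFICATION"
        return "SEEK_PROFESSIONAL_MEDICAL_ADVICE"
    # General (Llama-3.2-3B) thresholds
    if confidence >= 0.65:
        return "LOCAL_GENERATION"
    if confidence >= 0.60:
        return "RAG_RETRIEVAL"
    if confidence >= 0.40:
        return "ROUTE_TO_LARGER_MODEL"
    return "HUMAN_REVIEW"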

🆕 What's New in v0.6.3

✨ Enhanced Query-Context Support

  • 🔗 Query-Context Pairs: Full support for query_context_pairs in MedGemma models for enhanced context-aware responses
  • 🎯 Improved Accuracy: Better confidence scoring when context is provided
  • 📝 Enhanced Response Generation: Context-aware prompt formatting for more accurate medical responses
  • 🔄 Batch Processing: New generate_response_with_context() method for processing multiple query-context pairs

🏥 Medical Domain Enhancements

  • 🩺 Context Integration: Medical queries now benefit from patient context and clinical background
  • 📊 Better Confidence: Context helps improve confidence scoring for medical scenarios
  • 🎛️ Flexible Usage: Works with existing methods while providing new convenience functions
  • 🔍 Example Implementation: New query_context_example.py demonstrates usage patterns

🧹 Simplified Architecture

  • 📱 Removed Dashboard: Eliminated dashboard dependencies for a cleaner core package
  • ⚡ Streamlined Installation: Faster installation without unnecessary web components
  • 🎯 Focused Functionality: Core hallucination detection without UI overhead
  • 📦 Lightweight: Reduced package size and dependencies

🔧 Technical Improvements

  • 🔗 Enhanced Prompt Formatting: Context is properly integrated into medical prompts
  • 🎯 Backward Compatibility: All existing code continues to work unchanged
  • 📝 Better Documentation: Comprehensive examples for query-context usage
  • 🛡️ Stable Performance: Maintains the stability improvements introduced in earlier releases

🚀 Installation

Requirements

  • Python 3.8+
  • PyTorch 1.13+
  • CUDA-compatible GPU (recommended)
  • At least 8GB GPU memory for inference (improved efficiency in v0.6.3+)
  • 16GB RAM minimum (32GB recommended for training)

Install from PyPI

pip install hallunox

Install from Source

git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e .

MedGemma Model Access

HalluNox uses the open-access convaiinnovations/gemma-finetuned-4b-it model, which doesn't require authentication. The model will be automatically downloaded on first use.

Core Dependencies

HalluNox automatically installs:

  • torch>=1.13.0 - PyTorch framework
  • transformers>=4.21.0 - Hugging Face Transformers
  • FlagEmbedding>=1.2.0 - BGE-M3 embedding model
  • datasets>=2.0.0 - Dataset loading utilities
  • scikit-learn>=1.0.0 - Evaluation metrics
  • numpy>=1.21.0 - Numerical computations
  • tqdm>=4.64.0 - Progress bars
  • Pillow>=8.0.0 - Image processing for multimodal capabilities
  • bitsandbytes>=0.41.0 - 4-bit quantization for memory optimization

📖 Quick Start

Basic Usage (Llama-3.2-3B)

from hallunox import HallucinationDetector

# Initialize detector (downloads pre-trained model automatically)
detector = HallucinationDetector()

# Analyze text for hallucination risk
results = detector.predict([
    "The capital of France is Paris.",  # High confidence
    "Your password is 12345678.",       # Low confidence  
    "The Moon is made of cheese."       # Very low confidence
])

# View results
for pred in results["predictions"]:
    print(f"Text: {pred['text']}")
    print(f"Confidence: {pred['confidence_score']:.3f}")
    print(f"Risk Level: {pred['risk_level']}")
    print(f"Routing Action: {pred['routing_action']}")
    print()

🏥 MedGemma Medical Domain Usage

For medical applications using MedGemma 4B-IT with multimodal capabilities:

from hallunox import HallucinationDetector
from PIL import Image

# Initialize MedGemma detector (auto-downloads medical model)
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    confidence_threshold=0.60,  # Medical-grade threshold
    enable_response_generation=True,  # Enable response generation
    enable_inference=True,
    mode="text"  # Text-only mode (default)
)

# Medical text analysis
medical_results = detector.predict([
    "Aspirin can help reduce heart attack risk when prescribed by a doctor.",
    "Drinking bleach will cure COVID-19.",  # Dangerous misinformation
    "Type 2 diabetes requires insulin injections in all cases.",  # Partially incorrect
])

for pred in medical_results["predictions"]:
    print(f"Medical Text: {pred['text'][:60]}...")
    print(f"Confidence: {pred['confidence_score']:.3f}")
    print(f"Risk Level: {pred['risk_level']}")
    print(f"Medical Action: {pred['routing_action']}")
    print(f"Description: {pred['description']}")
    print("-" * 50)

# Response generation with confidence checking
question = "What are the symptoms of pneumonia?"
response = detector.generate_response(question, check_confidence=True)

if response["should_generate"]:
    print(f"✅ Medical Response Generated (confidence: {response['confidence_score']:.3f})")
    print(f"Response: {response['response']}")
    print(f"Meets threshold: {response['meets_threshold']}")
    if response.get('forced_generation'):
        print("⚠️ Note: Response was generated despite low confidence")
else:
    print(f"⚠️ Response blocked (confidence: {response['confidence_score']:.3f})")
    print(f"Reason: {response['reason']}")
    print(f"Recommendation: {response['recommendation']}")

# Force generation for reference regardless of confidence
forced_response = detector.generate_response(
    question, 
    check_confidence=True, 
    force_generate=True  # Generate even if confidence is low
)
print(f"🔬 Reference Response (forced): {forced_response['response']}")
print(f"📊 Confidence: {forced_response['confidence_score']:.3f}")
print(f"🎯 Forced Generation: {forced_response['forced_generation']}")

# Multimodal image analysis (MedGemma 4B-IT only)
if detector.is_multimodal:
    print("\n🖼️ Multimodal Image Analysis")
    
    # Load medical image (replace with actual medical image)
    try:
        image = Image.open("chest_xray.jpg")
    except OSError:
        # Create demo image for testing
        image = Image.new('RGB', (224, 224), color='lightgray')
    
    # Analyze image confidence
    image_results = detector.predict_images([image], ["Chest X-ray"])
    
    for pred in image_results["predictions"]:
        print(f"Image: {pred['image_description']}")
        print(f"Confidence: {pred['confidence_score']:.3f}")
        print(f"Interpretation: {pred['interpretation']}")
        print(f"Risk Level: {pred['risk_level']}")
    
    # Generate image description
    description = detector.generate_image_response(
        image, 
        "Describe the findings in this chest X-ray."
    )
    print(f"Generated Description: {description}")

🔧 Advanced Configuration

from hallunox import HallucinationDetector

# Full configuration example
detector = HallucinationDetector(
    # Model selection
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",  # or "unsloth/Llama-3.2-3B-Instruct"
    embed_model_id="BAAI/bge-m3",
    
    # Custom model weights (optional)
    model_path="/path/to/custom/model.pt",  # None = auto-download
    
    # Hardware configuration
    device="cuda",  # or "cpu"
    use_fp16=True,  # Mixed precision for faster inference
    
    # Sequence lengths
    max_length=512,      # LLM context length
    bge_max_length=512,  # BGE-M3 context length
    
    # Feature toggles
    load_llm=True,                    # Load LLM for embeddings
    enable_inference=True,            # Enable LLM inference
    enable_response_generation=True,  # Enable response generation
    
    # Confidence settings
    confidence_threshold=0.60,  # Custom threshold (auto-detected by model type)
    
    # Operation mode
    mode="text",  # "text", "image", "both", or "auto"
)

# Check model capabilities
print(f"Model type: {'Medical' if detector.is_medgemma_4b else 'General'}")
print(f"Multimodal support: {detector.is_multimodal}")
print(f"Operation mode: {detector.effective_mode} (requested: {detector.mode})")
print(f"Confidence threshold: {detector.confidence_threshold}")

🎛️ Operation Mode Configuration

The mode parameter controls what types of input the detector can process:

from hallunox import HallucinationDetector

# Text mode (default) - processes text inputs only
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="text"  # Text-only processing (default)
)

# Auto mode - detects capabilities from model
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="auto"  # Auto: detects based on model capabilities
)

# Image-only mode - processes images only (requires multimodal model)
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="image"  # Image processing only
)

# Both mode - processes text and images (requires multimodal model)
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="both"  # Explicit multimodal mode
)

Mode Validation

  • Text mode: Available for all models (default)
  • Image mode: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)
  • Both mode: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)
  • Auto mode: Automatically selects based on model capabilities
    • Multimodal models → effective_mode = "both"
    • Text-only models → effective_mode = "text"

Error Examples

# This will raise an error - image mode requires multimodal model
detector = HallucinationDetector(
    llm_model_id="unsloth/Llama-3.2-3B-Instruct",
    mode="image"  # ❌ Error: Image mode requires multimodal model
)

# This will raise an error - calling image methods in text mode
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="text"
)
detector.predict_images([image])  # ❌ Error: Current mode is 'text'

⚡ Performance Optimized Usage

For faster initialization when only doing embedding comparisons:

from hallunox import HallucinationDetector

# Option 1: Factory method for embedding-only usage
detector = HallucinationDetector.for_embedding_only(
    device="cuda",
    use_fp16=True
)

# Option 2: Explicit parameter control
detector = HallucinationDetector(
    load_llm=False,         # Skip expensive LLM loading
    enable_inference=False, # Disable inference capabilities
    use_fp16=True          # Use mixed precision
)

# Note: This configuration cannot perform predictions
# Use for preprocessing or embedding extraction only

🧠 Memory Optimization with Quantization

For GPUs with limited VRAM (8-16GB), use 4-bit quantization:

from hallunox import HallucinationDetector

# Option 1: Auto-optimized for low memory (recommended)
detector = HallucinationDetector.for_low_memory(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",  # Or any supported model
    device="cuda",
    enable_response_generation=True,  # Enable response generation for evaluation
    verbose=True  # Show loading progress (optional)
)

# Option 2: Manual quantization configuration
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    use_quantization=True,  # Enable 4-bit quantization
    enable_response_generation=True,
    device="cuda"
)

# Option 3: Custom quantization settings
from transformers import BitsAndBytesConfig
import torch

custom_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 quantization type
    bnb_4bit_use_double_quant=True,     # Double quantization for extra savings
    bnb_4bit_compute_dtype=torch.bfloat16  # Compute in bfloat16
)

detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    quantization_config=custom_quant_config,
    device="cuda"
)

print(f"✅ Memory optimized: {detector.use_quantization}")
print("🔧 Quantization: 4-bit NF4 with double quantization")

🤖 Response Generation & Evaluation

Enabling Response Generation

When enable_response_generation=True, HalluNox can generate responses for evaluation and display the model's actual output alongside confidence scores:

from hallunox import HallucinationDetector

# Enable response generation for evaluation
detector = HallucinationDetector.for_low_memory(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    device="cuda",
    enable_response_generation=True,  # Enable response generation
    verbose=False  # Clean logs for evaluation
)

# Test questions for evaluation
test_questions = [
    "What are the symptoms of diabetes?",
    "Drinking bleach will cure COVID-19.",  # Dangerous misinformation
    "How does aspirin help prevent heart attacks?",
    "All vaccines cause autism in children.",  # Medical misinformation
]

# Analyze with response generation
for question in test_questions:
    # The model will generate a response and analyze it
    results = detector.predict([question])
    prediction = results["predictions"][0]
    
    print(f"Question: {question}")
    print(f"Confidence: {prediction['confidence_score']:.3f}")
    print(f"Risk Level: {prediction['risk_level']}")
    print(f"Action: {prediction['routing_action']}")
    print(f"Description: {prediction['description']}")
    print("-" * 50)

Response Generation Modes

# Generate and analyze responses with confidence checking
response = detector.generate_response(
    "What are the side effects of ibuprofen?", 
    check_confidence=True
)

if response["should_generate"]:
    print(f"✅ Generated Response: {response['response']}")
    print(f"Confidence: {response['confidence_score']:.3f}")
    print(f"Meets threshold: {response['meets_threshold']}")
else:
    print(f"⚠️ Response blocked (confidence: {response['confidence_score']:.3f})")
    print(f"Reason: {response['reason']}")
    print(f"Recommendation: {response['recommendation']}")

# Force generation for reference (useful for evaluation)
forced_response = detector.generate_response(
    "What are the side effects of ibuprofen?", 
    check_confidence=True, 
    force_generate=True
)
print(f"🔬 Reference Response: {forced_response['response']}")
print(f"📊 Confidence: {forced_response['confidence_score']:.3f}")
print(f"🎯 Forced Generation: {forced_response['forced_generation']}")

Evaluation Output Example

Question: What are the symptoms of diabetes?
Generated Response: Common symptoms of diabetes include increased thirst, frequent urination, excessive hunger, unexplained weight loss, fatigue, and blurred vision. It's important to consult a healthcare provider for proper diagnosis.
Confidence: 0.857
Risk Level: LOW_MEDICAL_RISK
Action: ✅ Information can be used as reference
--------------------------------------------------
Question: Drinking bleach will cure COVID-19.
Generated Response: [Response blocked - confidence too low]
Confidence: 0.123
Risk Level: VERY_HIGH_MEDICAL_RISK
Action: ⛔ Do not use - seek professional medical advice
--------------------------------------------------

💾 Memory Usage Comparison

Configuration           Model Size   VRAM Usage   Performance
Full Precision          ~16GB        ~14GB        100% speed
FP16 Mixed Precision    ~8GB         ~7GB         95% speed
4-bit Quantization      ~4GB         ~3.5GB       85-90% speed
4-bit + Double Quant    ~3.5GB       ~3GB         85-90% speed

Recommendation: Use HallucinationDetector.for_low_memory() for GPUs with 8GB or less VRAM.
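
These figures follow roughly from parameter count times bytes per parameter. A back-of-the-envelope sketch for a ~4B-parameter model (actual quantized footprints are higher because quantization constants and some non-quantized layers remain at full precision):

def approx_weight_size_gb(num_params: float, bits_per_param: int) -> float:
    """Rough weight-only size: parameters x bits per parameter, converted to GB."""
    return num_params * bits_per_param / 8 / 1e9

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{approx_weight_size_gb(4e9, bits):.1f} GB of weights")
# fp32: ~16.0 GB, fp16/bf16: ~8.0 GB, 4-bit: ~2.0 GB (before quantization overhead)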

📝 Enhanced Query-Context Support (NEW in v0.6.3!)

HalluNox now provides comprehensive support for query-context pairs, especially beneficial for medical applications:

from hallunox import HallucinationDetector

# Initialize MedGemma detector for context-aware medical responses
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    enable_response_generation=True
)

# Medical query-context pairs for enhanced accuracy
medical_query_context_pairs = [
    {
        "query": "Is it safe to take ibuprofen daily?",
        "context": "Patient has a history of gastric ulcers and is currently taking blood thinners for atrial fibrillation."
    },
    {
        "query": "What's the recommended exercise routine?",
        "context": "28-year-old pregnant patient at 30 weeks, previously sedentary, no complications."
    },
    {
        "query": "How should I manage my diabetes medication?",
        "context": "Type 2 diabetes patient, HbA1c 8.2%, currently on metformin 1000mg twice daily."
    }
]

# Method 1: Confidence analysis with context
results = detector.predict_with_query_context(medical_query_context_pairs)
for pred in results["predictions"]:
    print(f"Query: {pred['text']}")
    print(f"Context-Enhanced Confidence: {pred['confidence_score']:.3f}")
    print(f"Medical Risk Level: {pred['risk_level']}")
    print(f"Recommendation: {pred['routing_action']}")

# Method 2: Response generation with context
responses = detector.generate_response_with_context(
    medical_query_context_pairs,
    max_length=300,
    check_confidence=True
)

for i, response in enumerate(responses):
    pair = medical_query_context_pairs[i]
    print(f"\nQuery: {pair['query']}")
    print(f"Context: {pair['context'][:60]}...")
    
    if isinstance(response, dict) and "should_generate" in response:
        if response["should_generate"]:
            print(f"✅ Context-Aware Response: {response['response']}")
            print(f"Confidence: {response['confidence_score']:.3f}")
        else:
            print(f"⚠️ Blocked (confidence: {response['confidence_score']:.3f})")
            print(f"Recommendation: {response['recommendation']}")

# Method 3: Individual response with context
single_response = detector.generate_response(
    prompt="Should I adjust my medication?",
    query_context_pairs=[{
        "query": "Should I adjust my medication?", 
        "context": "Patient experiencing mild side effects from current dosage"
    }],
    check_confidence=True
)

Context Impact Analysis

# Compare confidence with and without context
query = "Is this medication safe during pregnancy?"

# Without context
no_context = detector.predict([query])
print(f"Without context: {no_context['predictions'][0]['confidence_score']:.3f}")

# With context
with_context = detector.predict([query], query_context_pairs=[{
    "query": query,
    "context": "Patient is 12 weeks pregnant, no previous complications, taking prenatal vitamins"
}])
print(f"With context: {with_context['predictions'][0]['confidence_score']:.3f}")

# Context benefit
improvement = with_context['predictions'][0]['confidence_score'] - no_context['predictions'][0]['confidence_score']
print(f"Context improvement: {improvement:+.3f}")

🖥️ Command Line Interface

HalluNox provides a comprehensive CLI for various use cases:

Interactive Mode

# General model interactive mode
hallunox-infer --interactive

# MedGemma medical interactive mode
hallunox-infer --llm_model_id convaiinnovations/gemma-finetuned-4b-it --interactive --show_generated_text

Batch Processing

# Process file with general model
hallunox-infer --input_file medical_texts.txt --output_file results.json

# Process with MedGemma and medical settings
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --input_file medical_texts.txt \
    --output_file medical_results.json \
    --show_routing \
    --show_generated_text

Image Analysis (Multimodal models only)

# Single image analysis
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --image_path chest_xray.jpg \
    --show_generated_text

# Batch image analysis
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --image_folder /path/to/medical/images \
    --output_file image_analysis.json

Demo Mode

# General demo
hallunox-infer --demo --show_routing

# Medical demo with MedGemma
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --demo \
    --mode both \
    --show_routing

# Text-only demo (faster initialization)
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --demo \
    --mode text \
    --show_routing

🔨 Training Your Own Model

Quick Training

from hallunox import Trainer, TrainingConfig

# Configure training
config = TrainingConfig(
    # Model selection
    model_id="convaiinnovations/gemma-finetuned-4b-it",  # or "unsloth/Llama-3.2-3B-Instruct"
    embed_model_id="BAAI/bge-m3",
    
    # Training parameters
    batch_size=8,
    learning_rate=5e-4,
    max_epochs=6,
    warmup_steps=300,
    
    # Dataset configuration
    use_truthfulqa=True,
    use_halueval=True,
    use_fever=True,
    max_samples_per_dataset=3000,
    
    # Output
    output_dir="./models/my_medical_model"
)

# Train model
trainer = Trainer(config)
trainer.train()

Command Line Training

# Train general model
hallunox-train --batch_size 8 --learning_rate 5e-4 --max_epochs 6

# Train medical model
hallunox-train \
    --model_id convaiinnovations/gemma-finetuned-4b-it \
    --batch_size 4 \
    --learning_rate 3e-4 \
    --max_epochs 8 \
    --output_dir ./models/custom_medgemma

🏗️ Model Architecture

HalluNox supports two main architectures:

General Architecture (Llama-3.2-3B)

  1. LLM Component: Llama-3.2-3B-Instruct

    • Extracts internal hidden representations (3072D)
    • Supports any Llama-architecture model
  2. Embedding Model: BGE-M3 (fixed)

    • Provides reference semantic embeddings
    • 1024-dimensional dense vectors
  3. Projection Network: Standard ProjectionHead

    • Maps LLM hidden states to embedding space
    • 3-layer MLP with ReLU activations and dropout
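
For orientation, here is a minimal sketch of such a projection head; the intermediate width, dropout rate, and exact layer ordering are assumptions rather than the package's exact architecture.

import torch.nn as nn

class ProjectionHeadSketch(nn.Module):
    """Minimal sketch: 3-layer MLP mapping LLM hidden states (3072D) to BGE-M3 space (1024D)."""
    def __init__(self, in_dim: int = 3072, hidden_dim: int = 2048, out_dim: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, hidden_states):
        # Project pooled LLM hidden states into the reference embedding space.
        return self.net(hidden_states)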

Medical Architecture (MedGemma-4B-IT)

  1. Unified Multimodal Model:

    • Single Model: AutoModelForImageTextToText handles both text and images
    • Memory Optimized: Avoids double loading (saves ~8GB VRAM)
    • Fallback Support: Graceful degradation to text-only if needed
  2. Embedding Model: BGE-M3 (same as general)

    • Enhanced with medical context formatting
  3. Projection Network: UltraStableProjectionHead

    • Ultra-stable architecture with heavy normalization
    • Conservative weight initialization for medical precision
    • Tanh activations for stability
    • Enhanced dropout and layer normalization
  4. Multimodal Processor: AutoProcessor

    • Handles image + text inputs
    • Supports chat template formatting
  5. Quantization Support: 4-bit NF4 with double quantization

    • Reduces memory usage by ~75%
    • Maintains 85-90% performance
    • Automatic fallback for CPU
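
To make the semantic-alignment signal concrete, the sketch below scores a projected hidden state against a BGE-M3 reference embedding with cosine similarity. The pooling strategy and the rescaling of similarity into a 0-1 confidence-style score are assumptions, not the package's exact implementation.

import torch
import torch.nn.functional as F

def semantic_alignment_score(llm_hidden: torch.Tensor,
                             bge_embedding: torch.Tensor,
                             projection: torch.nn.Module) -> float:
    """Cosine similarity between the projected LLM state and the reference embedding.

    llm_hidden:    pooled last-layer hidden state from the LLM, shape (hidden_dim,)
    bge_embedding: BGE-M3 dense embedding of the same text, shape (1024,)
    projection:    trained projection head mapping hidden_dim -> 1024
    """
    projected = projection(llm_hidden)
    similarity = F.cosine_similarity(projected.unsqueeze(0), bge_embedding.unsqueeze(0)).item()
    # Rescale from [-1, 1] to [0, 1] so it reads like a confidence score (assumed rescaling).
    return (similarity + 1.0) / 2.0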

📊 API Reference

HallucinationDetector

Constructor Parameters

HallucinationDetector(
    model_path: str = None,                    # Path to trained model (None = auto-download)
    llm_model_id: str = "unsloth/Llama-3.2-3B-Instruct",  # LLM model ID
    embed_model_id: str = "BAAI/bge-m3",      # Embedding model ID
    device: str = None,                        # Device (None = auto-detect)
    max_length: int = 512,                     # LLM sequence length
    bge_max_length: int = 512,                # BGE-M3 sequence length
    use_fp16: bool = True,                     # Mixed precision
    load_llm: bool = True,                     # Load LLM
    enable_inference: bool = False,            # Enable LLM inference
    confidence_threshold: float = None,        # Custom threshold (auto-detected)
    enable_response_generation: bool = False,  # Enable response generation
    use_quantization: bool = False,            # Enable 4-bit quantization for memory savings
    quantization_config: BitsAndBytesConfig = None,  # Custom quantization config
    mode: str = "text",                        # Operation mode: "text", "image", "both", "auto" (default: "text")
)

Core Methods

Text Analysis:

  • predict(texts, query_context_pairs=None) - Analyze texts for hallucination confidence
  • predict_with_query_context(query_context_pairs) - Query-context prediction
  • batch_predict(texts, batch_size=16) - Efficient batch processing

Response Generation:

  • generate_response(prompt, max_length=512, check_confidence=True, force_generate=False, query_context_pairs=None) - Generate responses with confidence checking and optional context
  • generate_response_with_context(query_context_pairs, max_length=512, check_confidence=True, force_generate=False) - Generate responses for multiple query-context pairs

Multimodal (MedGemma only):

  • predict_images(images, image_descriptions=None) - Analyze image confidence
  • generate_image_response(image, prompt, max_length=200) - Generate image descriptions

Analysis:

  • evaluate_routing_strategy(texts) - Analyze routing decisions

Factory Methods:

  • for_embedding_only() - Create embedding-only detector
  • for_low_memory() - Create memory-optimized detector with 4-bit quantization

Response Format

{
    "predictions": [
        {
            "text": "input text",
            "confidence_score": 0.85,           # 0.0 to 1.0
            "similarity_score": 0.92,          # Cosine similarity
            "interpretation": "HIGH_CONFIDENCE", # or HIGH_MEDICAL_CONFIDENCE
            "risk_level": "LOW_RISK",          # or LOW_MEDICAL_RISK
            "routing_action": "LOCAL_GENERATION",
            "description": "This response appears to be factual and reliable."
        }
    ],
    "summary": {
        "total_texts": 1,
        "avg_confidence": 0.85,
        "high_confidence_count": 1,
        "medium_confidence_count": 0,
        "low_confidence_count": 0,
        "very_low_confidence_count": 0
    }
}

Response Generation Format

{
    "response": "Generated response text",
    "confidence_score": 0.85,
    "should_generate": True,
    "meets_threshold": True,
    "forced_generation": False,  # True if generated despite low confidence
    # Or when blocked:
    "reason": "Confidence 0.45 below threshold 0.60",
    "recommendation": "RAG_RETRIEVAL"
}
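
A minimal sketch of consuming this structure, assuming a detector created with enable_response_generation=True; the escalation step is an application-level choice, not part of the package:

from hallunox import HallucinationDetector

detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    enable_response_generation=True,
)

result = detector.generate_response(
    "What are the side effects of ibuprofen?",
    check_confidence=True,
)

if result["should_generate"]:
    print(result["response"])
else:
    print(f"Blocked: {result['reason']}")
    if result.get("recommendation") == "RAG_RETRIEVAL":
        pass  # hand the query to a retrieval-augmented pipeline instead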

Training Classes

  • TrainingConfig: Configuration dataclass for training parameters
  • Trainer: Main training class with dataset loading and model training
  • MultiDatasetLoader: Loads and combines multiple hallucination detection datasets

Utility Functions

  • download_model(): Download general pre-trained model
  • download_medgemma_model(model_name): Download MedGemma medical model
  • setup_logging(level): Configure logging
  • check_gpu_availability(): Check CUDA compatibility
  • validate_model_requirements(): Verify dependencies
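
A short usage sketch of these helpers; return values are not documented above, so the boolean check and returned path below are assumptions:

from hallunox import setup_logging, check_gpu_availability, validate_model_requirements, download_model

setup_logging("INFO")              # configure package logging
validate_model_requirements()      # verify required dependencies are installed
if check_gpu_availability():       # assumed to report whether a CUDA device is usable
    model_path = download_model()  # download the general pre-trained model (assumed to return a path)
    print(f"Model available at: {model_path}")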

📈 Performance

Our confidence-aware routing system demonstrates:

  • 74% hallucination detection rate (vs 42% baseline)
  • 9% false positive rate (vs 15% baseline)
  • 40% reduction in computational cost vs post-hoc methods
  • 1.6x cost multiplier, versus 4.2x when always routing to the most expensive operations

Medical Domain Performance (MedGemma)

  • Enhanced medical accuracy with 0.62 confidence threshold
  • Multimodal capability for medical image analysis
  • Safety-first approach with conservative thresholds
  • Professional verification workflow for low-confidence cases

🖥️ Hardware Requirements

Minimum (Inference Only)

  • CPU: Modern multi-core processor
  • RAM: 16GB system memory
  • GPU: 8GB VRAM (RTX 3070, RTX 4060 Ti+)
  • Storage: 15GB free space
  • Models: ~5GB each (Llama/MedGemma)

Recommended (Inference)

  • CPU: Intel i7/AMD Ryzen 7+
  • RAM: 32GB system memory
  • GPU: 12GB+ VRAM (RTX 4070, RTX 3080+)
  • Storage: NVMe SSD, 25GB+ free
  • CUDA: 11.8+ compatible driver

Training Requirements

  • CPU: High-performance multi-core (i9/Ryzen 9)
  • RAM: 64GB+ system memory
  • GPU: 24GB+ VRAM (RTX 4090, A100, H100)
  • Storage: 200GB+ NVMe SSD
    • Model checkpoints: ~10GB per epoch
    • Training datasets: ~30GB
    • Logs and outputs: ~50GB
  • Network: High-speed internet for downloads

MedGemma Specific

  • Additional storage: +10GB for multimodal models
  • Image processing: PIL/Pillow for image capabilities
  • Memory: +4GB RAM for image processing pipeline

CPU-Only Mode

  • RAM: 32GB minimum (64GB recommended)
  • Performance: 10-50x slower than GPU
  • Not recommended: For production medical applications

🔒 Safety Considerations

Medical Applications

  • Professional oversight required: HalluNox is a research tool, not medical advice
  • Validation needed: All medical outputs should be verified by qualified professionals
  • Conservative thresholds: 0.62 threshold ensures high precision for medical content
  • Clear disclaimers: Always include appropriate medical disclaimers in applications

General Use

  • Confidence-based routing: Use routing recommendations for appropriate escalation
  • Human oversight: Very low confidence predictions require human review
  • Regular evaluation: Monitor performance on your specific use cases

🛠️ Troubleshooting

Common Issues and Solutions

CUDA Out of Memory Error

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB...

Solution: Use 4-bit quantization

detector = HallucinationDetector.for_low_memory()

Deprecated torch_dtype Warning

`torch_dtype` is deprecated! Use `dtype` instead!

Solution: Already fixed in HalluNox v0.3.2+ - the package now uses the correct dtype parameter.

Double Model Loading (MedGemma)

Loading checkpoint shards: 100% 2/2 [00:37<00:00, 18.20s/it]
Loading checkpoint shards: 100% 2/2 [00:36<00:00, 17.88s/it]

Solution: Already optimized in HalluNox v0.3.2+ - MedGemma now uses a unified model approach that avoids double loading.

Accelerate Warning

WARNING:accelerate.big_modeling:Some parameters are on the meta device...

Solution: This is normal with quantization - parameters are automatically moved to GPU during inference.

Dependency Version Conflict (AutoProcessor)

⚠️ Could not load AutoProcessor: module 'requests' has no attribute 'exceptions'
AttributeError: module 'requests' has no attribute 'exceptions'

Solution: This is a compatibility issue between transformers and requests versions.

pip install --upgrade transformers requests huggingface_hub
# Or force reinstall
pip install --force-reinstall transformers>=4.45.0 requests>=2.31.0

Fallback: HalluNox automatically falls back to text-only mode when this occurs.

Model Hidden States NaN/Inf Issues ✅ RESOLVED

⚠️ Warning: NaN/Inf detected in model hidden states
   Hidden shape: torch.Size([3, 16, 2560])
   NaN count: 122880

✅ FIXED in HalluNox v0.6.3+: This issue has been completely resolved by adopting the proven approach from our working inference pipeline:

Root Cause: 4-bit quantization was causing numerical instabilities with certain model architectures.

Solution Applied:

  • Disabled Quantization: Removed 4-bit quantization that was causing NaN issues
  • Simplified Model Loading: Now uses the same approach as our proven inference_gemma.py
  • Clean Architecture: Removed complex stability measures that were interfering
  • Stable Precision: Uses torch.bfloat16 for optimal performance without instabilities

Repetitive Text and Unwanted Artifacts ✅ RESOLVED

🔬 Reference Response (forced): I am programmed to be a harmless AI assistant...
g
I am programmed to be a harmless AI assistant...
g
[repetitive output continues...]

✅ FIXED in HalluNox v0.6.3+: Repetitive text generation and unwanted artifacts have been completely resolved:

Root Cause: Improper message formatting and sampling parameters causing the model to not understand conversation boundaries.

Solution Applied:

  • Deterministic Generation: Changed from do_sample=True to do_sample=False matching Jupyter notebook approach
  • Proper Chat Templates: Adopted exact message formatting from working Jupyter notebook implementation
  • Removed Sampling Parameters: Eliminated temperature, top_p, repetition_penalty that were causing issues
  • Clean Tokenization: Uses tokenizer.apply_chat_template() with proper parameters for conversation structure

Current Recommended Usage (v0.6.3+):

# Standard usage - now stable by default
detector = HallucinationDetector.for_low_memory(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    device="cuda"
)

# Both NaN issues and repetitive text are now automatically resolved

Migration from v0.4.9 and earlier: No code changes needed - existing code will automatically use the stable approach.

Environment Optimization

For better memory management, set:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
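
The same setting can be applied from Python, as long as it is set before torch (or hallunox) is imported; a minimal sketch:

import os

# Must be set before torch is imported for the allocator to pick it up.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from hallunox import HallucinationDetector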

Memory Requirements by Configuration

GPU VRAM   Recommended Configuration               Expected Performance
4-6GB      for_low_memory() + reduce batch size    Basic functionality
8-12GB     for_low_memory()                        Full functionality
16GB+      Standard configuration                  Optimal performance
24GB+      Multiple models + training              Development/research

📄 License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

📚 Citation

If you use HalluNox in your research, please cite:

@article{nandakishor2024hallunox,
    title={Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation},
    author={Nandakishor M},
    journal={AI Safety Research},
    year={2024},
    organization={Convai Innovations}
}

🤝 Contributing

We welcome contributions! Please see our contributing guidelines and submit pull requests to our repository.

Development Setup

git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e ".[dev]"

📞 Support

For technical support and questions:

👨‍💻 Author

Nandakishor M
AI Safety Research
Convai Innovations Pvt. Ltd.
Email: support@convaiinnovations.com


Disclaimer: HalluNox is a research tool for hallucination detection and should not be used as the sole basis for critical decisions, especially in medical contexts. Always seek professional advice for medical applications.
