
A confidence-aware routing system for LLM hallucination detection using a multi-signal approach


HalluNox

Confidence-Aware Routing for Large Language Model Reliability Enhancement

A Python package implementing a multi-signal approach to pre-generation hallucination mitigation for Large Language Models. HalluNox combines semantic alignment measurement, internal convergence analysis, and learned confidence estimation to produce unified confidence scores for proactive routing decisions.

✨ Features

  • 🎯 Pre-generation Hallucination Detection: Assess model reliability before generation begins
  • 🔄 Confidence-Aware Routing: Automatically route queries based on estimated confidence
  • 🧠 Multi-Signal Approach: Combines semantic alignment, internal convergence, and learned confidence
  • ⚡ Multi-Model Support: Llama-3.2-3B-Instruct and MedGemma-4B-IT architectures
  • 🏥 Medical Domain Specialization: Enhanced MedGemma 4B-IT support with medical-grade confidence thresholds
  • 🖼️ Multimodal Capabilities: Image analysis and response generation for MedGemma models
  • 📊 Comprehensive Evaluation: Built-in metrics and routing strategy analysis
  • 🚀 Easy Integration: Simple API for both training and inference
  • 🏃‍♂️ Performance Optimizations: Optional LLM loading for faster initialization and lower memory usage
  • 📝 Enhanced Query-Context: Improved accuracy with structured prompt formatting
  • 🎛️ Adaptive Thresholds: Dynamic confidence thresholds based on model type (0.62 for medical, 0.65 for general)
  • 💬 Response Generation: Built-in response generation with confidence-gated output
  • 🔧 Automatic Model Management: Auto-download and configuration for supported models

🔬 Research Foundation

Based on the research paper "Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation" by Nandakishor M (Convai Innovations).

The approach implements deterministic routing to appropriate response pathways:

General Models (Llama-3.2-3B)

  • High Confidence (≥0.65): Local generation
  • Medium Confidence (0.60-0.65): Retrieval-augmented generation
  • Low Confidence (0.40-0.60): Route to larger models
  • Very Low Confidence (<0.40): Human review required

Medical Models (MedGemma-4B-IT)

  • High Medical Confidence (≥0.60): Local generation with medical validation
  • Medium Medical Confidence (0.55-0.60): Medical literature verification required
  • Low Medical Confidence (0.50-0.55): Professional medical verification required
  • Very Low Medical Confidence (<0.50): Seek professional medical advice
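
As an illustration, the sketch below maps a confidence score onto these bands. The thresholds mirror the lists above; the returned action strings are illustrative placeholders (the packaged detector reports its own routing_action values).

def route_query(confidence: float, medical: bool = False) -> str:
    """Illustrative mapping from a confidence score to a routing action."""
    if medical:
        # MedGemma-style thresholds
        if confidence >= 0.60:
            return "LOCAL_GENERATION_WITH_MEDICAL_VALIDATION"
        if confidence >= 0.55:
            return "MEDICAL_LITERATURE_VERIFICATION"
        if confidence >= 0.50:
            return "PROFESSIONAL_MEDICAL_VERIFICATION"
        return "SEEK_PROFESSIONAL_MEDICAL_ADVICE"
    # General (Llama-3.2-3B) thresholds
    if confidence >= 0.65:
        return "LOCAL_GENERATION"
    if confidence >= 0.60:
        return "RAG_RETRIEVAL"
    if confidence >= 0.40:
        return "ROUTE_TO_LARGER_MODEL"
    return "HUMAN_REVIEW"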

🆕 What's New in v0.6.3

✨ Enhanced Query-Context Support

  • 🔗 Query-Context Pairs: Full support for query_context_pairs in MedGemma models for enhanced context-aware responses
  • 🎯 Improved Accuracy: Better confidence scoring when context is provided
  • 📝 Enhanced Response Generation: Context-aware prompt formatting for more accurate medical responses
  • 🔄 Batch Processing: New generate_response_with_context() method for processing multiple query-context pairs

🏥 Medical Domain Enhancements

  • 🩺 Context Integration: Medical queries now benefit from patient context and clinical background
  • 📊 Better Confidence: Context helps improve confidence scoring for medical scenarios
  • 🎛️ Flexible Usage: Works with existing methods while providing new convenience functions
  • 🔍 Example Implementation: New query_context_example.py demonstrates usage patterns

🧹 Simplified Architecture

  • 📱 Removed Dashboard: Eliminated dashboard dependencies for a cleaner core package
  • ⚡ Streamlined Installation: Faster installation without unnecessary web components
  • 🎯 Focused Functionality: Core hallucination detection without UI overhead
  • 📦 Lightweight: Reduced package size and dependencies

🔧 Technical Improvements

  • 🔗 Enhanced Prompt Formatting: Context is properly integrated into medical prompts
  • 🎯 Backward Compatibility: All existing code continues to work unchanged
  • 📝 Better Documentation: Comprehensive examples for query-context usage
  • 🛡️ Stable Performance: Maintains the stability improvements introduced in earlier releases

🚀 Installation

Requirements

  • Python 3.8+
  • PyTorch 1.13+
  • CUDA-compatible GPU (recommended)
  • At least 8GB GPU memory for inference (improved efficiency in v0.6.3+)
  • 16GB RAM minimum (32GB recommended for training)

Install from PyPI

pip install hallunox

Install from Source

git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e .

MedGemma Model Access

HalluNox uses the open-access convaiinnovations/gemma-finetuned-4b-it model, which doesn't require authentication. The model will be automatically downloaded on first use.

Core Dependencies

HalluNox automatically installs:

  • torch>=1.13.0 - PyTorch framework
  • transformers>=4.21.0 - Hugging Face Transformers
  • FlagEmbedding>=1.2.0 - BGE-M3 embedding model
  • datasets>=2.0.0 - Dataset loading utilities
  • scikit-learn>=1.0.0 - Evaluation metrics
  • numpy>=1.21.0 - Numerical computations
  • tqdm>=4.64.0 - Progress bars
  • Pillow>=8.0.0 - Image processing for multimodal capabilities
  • bitsandbytes>=0.41.0 - 4-bit quantization for memory optimization

📖 Quick Start

Basic Usage (Llama-3.2-3B)

from hallunox import HallucinationDetector

# Initialize detector (downloads pre-trained model automatically)
detector = HallucinationDetector()

# Analyze text for hallucination risk
results = detector.predict([
    "The capital of France is Paris.",  # High confidence
    "Your password is 12345678.",       # Low confidence  
    "The Moon is made of cheese."       # Very low confidence
])

# View results
for pred in results["predictions"]:
    print(f"Text: {pred['text']}")
    print(f"Confidence: {pred['confidence_score']:.3f}")
    print(f"Risk Level: {pred['risk_level']}")
    print(f"Routing Action: {pred['routing_action']}")
    print()

🏥 MedGemma Medical Domain Usage

For medical applications using MedGemma 4B-IT with multimodal capabilities:

from hallunox import HallucinationDetector
from PIL import Image

# Initialize MedGemma detector (auto-downloads medical model)
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    confidence_threshold=0.60,  # Medical-grade threshold
    enable_response_generation=True,  # Enable response generation
    enable_inference=True,
    mode="text"  # Text-only mode (default)
)

# Medical text analysis
medical_results = detector.predict([
    "Aspirin can help reduce heart attack risk when prescribed by a doctor.",
    "Drinking bleach will cure COVID-19.",  # Dangerous misinformation
    "Type 2 diabetes requires insulin injections in all cases.",  # Partially incorrect
])

for pred in medical_results["predictions"]:
    print(f"Medical Text: {pred['text'][:60]}...")
    print(f"Confidence: {pred['confidence_score']:.3f}")
    print(f"Risk Level: {pred['risk_level']}")
    print(f"Medical Action: {pred['routing_action']}")
    print(f"Description: {pred['description']}")
    print("-" * 50)

# Response generation with confidence checking
question = "What are the symptoms of pneumonia?"
response = detector.generate_response(question, check_confidence=True)

if response["should_generate"]:
    print(f"✅ Medical Response Generated (confidence: {response['confidence_score']:.3f})")
    print(f"Response: {response['response']}")
    print(f"Meets threshold: {response['meets_threshold']}")
    if response.get('forced_generation'):
        print("⚠️ Note: Response was generated despite low confidence")
else:
    print(f"⚠️ Response blocked (confidence: {response['confidence_score']:.3f})")
    print(f"Reason: {response['reason']}")
    print(f"Recommendation: {response['recommendation']}")

# Force generation for reference regardless of confidence
forced_response = detector.generate_response(
    question, 
    check_confidence=True, 
    force_generate=True  # Generate even if confidence is low
)
print(f"🔬 Reference Response (forced): {forced_response['response']}")
print(f"📊 Confidence: {forced_response['confidence_score']:.3f}")
print(f"🎯 Forced Generation: {forced_response['forced_generation']}")

# Multimodal image analysis (MedGemma 4B-IT only)
if detector.is_multimodal:
    print("\n🖼️ Multimodal Image Analysis")
    
    # Load medical image (replace with actual medical image)
    try:
        image = Image.open("chest_xray.jpg")
    except OSError:
        # Create demo image for testing
        image = Image.new('RGB', (224, 224), color='lightgray')
    
    # Analyze image confidence
    image_results = detector.predict_images([image], ["Chest X-ray"])
    
    for pred in image_results["predictions"]:
        print(f"Image: {pred['image_description']}")
        print(f"Confidence: {pred['confidence_score']:.3f}")
        print(f"Interpretation: {pred['interpretation']}")
        print(f"Risk Level: {pred['risk_level']}")
    
    # Generate image description
    description = detector.generate_image_response(
        image, 
        "Describe the findings in this chest X-ray."
    )
    print(f"Generated Description: {description}")

🔧 Advanced Configuration

from hallunox import HallucinationDetector

# Full configuration example
detector = HallucinationDetector(
    # Model selection
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",  # or "unsloth/Llama-3.2-3B-Instruct"
    embed_model_id="BAAI/bge-m3",
    
    # Custom model weights (optional)
    model_path="/path/to/custom/model.pt",  # None = auto-download
    
    # Hardware configuration
    device="cuda",  # or "cpu"
    use_fp16=True,  # Mixed precision for faster inference
    
    # Sequence lengths
    max_length=512,      # LLM context length
    bge_max_length=512,  # BGE-M3 context length
    
    # Feature toggles
    load_llm=True,                    # Load LLM for embeddings
    enable_inference=True,            # Enable LLM inference
    enable_response_generation=True,  # Enable response generation
    
    # Confidence settings
    confidence_threshold=0.60,  # Custom threshold (auto-detected by model type)
    
    # Operation mode
    mode="text",  # "text", "image", "both", or "auto"
)

# Check model capabilities
print(f"Model type: {'Medical' if detector.is_medgemma_4b else 'General'}")
print(f"Multimodal support: {detector.is_multimodal}")
print(f"Operation mode: {detector.effective_mode} (requested: {detector.mode})")
print(f"Confidence threshold: {detector.confidence_threshold}")

🎛️ Operation Mode Configuration

The mode parameter controls what types of input the detector can process:

from hallunox import HallucinationDetector

# Text mode (default) - processes text inputs only
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="text"  # Text-only processing (default)
)

# Auto mode - detects capabilities from model
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="auto"  # Auto: detects based on model capabilities
)

# Image-only mode - processes images only (requires multimodal model)
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="image"  # Image processing only
)

# Both mode - processes text and images (requires multimodal model)
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="both"  # Explicit multimodal mode
)

Mode Validation

  • Text mode: Available for all models (default)
  • Image mode: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)
  • Both mode: Requires multimodal model (e.g., convaiinnovations/gemma-finetuned-4b-it)
  • Auto mode: Automatically selects based on model capabilities
    • Multimodal models → effective_mode = "both"
    • Text-only models → effective_mode = "text"

Error Examples

# This will raise an error - image mode requires multimodal model
detector = HallucinationDetector(
    llm_model_id="unsloth/Llama-3.2-3B-Instruct",
    mode="image"  # ❌ Error: Image mode requires multimodal model
)

# This will raise an error - calling image methods in text mode
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    mode="text"
)
detector.predict_images([image])  # ❌ Error: Current mode is 'text'

⚡ Performance Optimized Usage

For faster initialization when only doing embedding comparisons:

from hallunox import HallucinationDetector

# Option 1: Factory method for embedding-only usage
detector = HallucinationDetector.for_embedding_only(
    device="cuda",
    use_fp16=True
)

# Option 2: Explicit parameter control
detector = HallucinationDetector(
    load_llm=False,         # Skip expensive LLM loading
    enable_inference=False, # Disable inference capabilities
    use_fp16=True          # Use mixed precision
)

# Note: This configuration cannot perform predictions
# Use for preprocessing or embedding extraction only

🧠 Memory Optimization with Quantization

For GPUs with limited VRAM (8-16GB), use 4-bit quantization:

from hallunox import HallucinationDetector

# Option 1: Auto-optimized for low memory (recommended)
detector = HallucinationDetector.for_low_memory(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",  # Or any supported model
    device="cuda",
    enable_response_generation=True,  # Enable response generation for evaluation
    verbose=True  # Show loading progress (optional)
)

# Option 2: Manual quantization configuration
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    use_quantization=True,  # Enable 4-bit quantization
    enable_response_generation=True,
    device="cuda"
)

# Option 3: Custom quantization settings
from transformers import BitsAndBytesConfig
import torch

custom_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 quantization type
    bnb_4bit_use_double_quant=True,     # Double quantization for extra savings
    bnb_4bit_compute_dtype=torch.bfloat16  # Compute in bfloat16
)

detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    quantization_config=custom_quant_config,
    device="cuda"
)

print(f"✅ Memory optimized: {detector.use_quantization}")
print("🔧 Quantization: 4-bit NF4 with double quantization")

🤖 Response Generation & Evaluation

Enabling Response Generation

When enable_response_generation=True, HalluNox can generate responses for evaluation and display the model's actual output alongside confidence scores:

from hallunox import HallucinationDetector

# Enable response generation for evaluation
detector = HallucinationDetector.for_low_memory(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    device="cuda",
    enable_response_generation=True,  # Enable response generation
    verbose=False  # Clean logs for evaluation
)

# Test questions for evaluation
test_questions = [
    "What are the symptoms of diabetes?",
    "Drinking bleach will cure COVID-19.",  # Dangerous misinformation
    "How does aspirin help prevent heart attacks?",
    "All vaccines cause autism in children.",  # Medical misinformation
]

# Analyze with response generation
for question in test_questions:
    # The model will generate a response and analyze it
    results = detector.predict([question])
    prediction = results["predictions"][0]
    
    print(f"Question: {question}")
    print(f"Confidence: {prediction['confidence_score']:.3f}")
    print(f"Risk Level: {prediction['risk_level']}")
    print(f"Action: {prediction['routing_action']}")
    print(f"Description: {prediction['description']}")
    print("-" * 50)

Response Generation Modes

# Generate and analyze responses with confidence checking
response = detector.generate_response(
    "What are the side effects of ibuprofen?", 
    check_confidence=True
)

if response["should_generate"]:
    print(f"✅ Generated Response: {response['response']}")
    print(f"Confidence: {response['confidence_score']:.3f}")
    print(f"Meets threshold: {response['meets_threshold']}")
else:
    print(f"⚠️ Response blocked (confidence: {response['confidence_score']:.3f})")
    print(f"Reason: {response['reason']}")
    print(f"Recommendation: {response['recommendation']}")

# Force generation for reference (useful for evaluation)
forced_response = detector.generate_response(
    "What are the side effects of ibuprofen?", 
    check_confidence=True, 
    force_generate=True
)
print(f"🔬 Reference Response: {forced_response['response']}")
print(f"📊 Confidence: {forced_response['confidence_score']:.3f}")
print(f"🎯 Forced Generation: {forced_response['forced_generation']}")

Evaluation Output Example

Question: What are the symptoms of diabetes?
Generated Response: Common symptoms of diabetes include increased thirst, frequent urination, excessive hunger, unexplained weight loss, fatigue, and blurred vision. It's important to consult a healthcare provider for proper diagnosis.
Confidence: 0.857
Risk Level: LOW_MEDICAL_RISK
Action: ✅ Information can be used as reference
--------------------------------------------------
Question: Drinking bleach will cure COVID-19.
Generated Response: [Response blocked - confidence too low]
Confidence: 0.123
Risk Level: VERY_HIGH_MEDICAL_RISK
Action: ⛔ Do not use - seek professional medical advice
--------------------------------------------------

💾 Memory Usage Comparison

Configuration           Model Size   VRAM Usage   Performance
Full Precision          ~16GB        ~14GB        100% speed
FP16 Mixed Precision    ~8GB         ~7GB         95% speed
4-bit Quantization      ~4GB         ~3.5GB       85-90% speed
4-bit + Double Quant    ~3.5GB       ~3GB         85-90% speed

Recommendation: Use HallucinationDetector.for_low_memory() for GPUs with 8GB or less VRAM.
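
These figures follow roughly from parameter count times bytes per parameter. A back-of-the-envelope sketch for a ~4B-parameter model (actual quantized footprints are higher because quantization constants and some non-quantized layers remain at full precision):

def approx_weight_size_gb(num_params: float, bits_per_param: int) -> float:
    """Rough weight-only size: parameters x bits per parameter, converted to GB."""
    return num_params * bits_per_param / 8 / 1e9

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{approx_weight_size_gb(4e9, bits):.1f} GB of weights")
# fp32: ~16.0 GB, fp16/bf16: ~8.0 GB, 4-bit: ~2.0 GB (before quantization overhead)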

📝 Enhanced Query-Context Support (NEW in v0.6.3!)

HalluNox now provides comprehensive support for query-context pairs, especially beneficial for medical applications:

from hallunox import HallucinationDetector

# Initialize MedGemma detector for context-aware medical responses
detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    enable_response_generation=True
)

# Medical query-context pairs for enhanced accuracy
medical_query_context_pairs = [
    {
        "query": "Is it safe to take ibuprofen daily?",
        "context": "Patient has a history of gastric ulcers and is currently taking blood thinners for atrial fibrillation."
    },
    {
        "query": "What's the recommended exercise routine?",
        "context": "28-year-old pregnant patient at 30 weeks, previously sedentary, no complications."
    },
    {
        "query": "How should I manage my diabetes medication?",
        "context": "Type 2 diabetes patient, HbA1c 8.2%, currently on metformin 1000mg twice daily."
    }
]

# Method 1: Confidence analysis with context
results = detector.predict_with_query_context(medical_query_context_pairs)
for pred in results["predictions"]:
    print(f"Query: {pred['text']}")
    print(f"Context-Enhanced Confidence: {pred['confidence_score']:.3f}")
    print(f"Medical Risk Level: {pred['risk_level']}")
    print(f"Recommendation: {pred['routing_action']}")

# Method 2: Response generation with context
responses = detector.generate_response_with_context(
    medical_query_context_pairs,
    max_length=300,
    check_confidence=True
)

for i, response in enumerate(responses):
    pair = medical_query_context_pairs[i]
    print(f"\nQuery: {pair['query']}")
    print(f"Context: {pair['context'][:60]}...")
    
    if isinstance(response, dict) and "should_generate" in response:
        if response["should_generate"]:
            print(f"✅ Context-Aware Response: {response['response']}")
            print(f"Confidence: {response['confidence_score']:.3f}")
        else:
            print(f"⚠️ Blocked (confidence: {response['confidence_score']:.3f})")
            print(f"Recommendation: {response['recommendation']}")

# Method 3: Individual response with context
single_response = detector.generate_response(
    prompt="Should I adjust my medication?",
    query_context_pairs=[{
        "query": "Should I adjust my medication?", 
        "context": "Patient experiencing mild side effects from current dosage"
    }],
    check_confidence=True
)

Context Impact Analysis

# Compare confidence with and without context
query = "Is this medication safe during pregnancy?"

# Without context
no_context = detector.predict([query])
print(f"Without context: {no_context['predictions'][0]['confidence_score']:.3f}")

# With context
with_context = detector.predict([query], query_context_pairs=[{
    "query": query,
    "context": "Patient is 12 weeks pregnant, no previous complications, taking prenatal vitamins"
}])
print(f"With context: {with_context['predictions'][0]['confidence_score']:.3f}")

# Context benefit
improvement = with_context['predictions'][0]['confidence_score'] - no_context['predictions'][0]['confidence_score']
print(f"Context improvement: {improvement:+.3f}")

🖥️ Command Line Interface

HalluNox provides a comprehensive CLI for various use cases:

Interactive Mode

# General model interactive mode
hallunox-infer --interactive

# MedGemma medical interactive mode
hallunox-infer --llm_model_id convaiinnovations/gemma-finetuned-4b-it --interactive --show_generated_text

Batch Processing

# Process file with general model
hallunox-infer --input_file medical_texts.txt --output_file results.json

# Process with MedGemma and medical settings
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --input_file medical_texts.txt \
    --output_file medical_results.json \
    --show_routing \
    --show_generated_text

Image Analysis (Multimodal models only)

# Single image analysis
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --image_path chest_xray.jpg \
    --show_generated_text

# Batch image analysis
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --image_folder /path/to/medical/images \
    --output_file image_analysis.json

Demo Mode

# General demo
hallunox-infer --demo --show_routing

# Medical demo with MedGemma
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --demo \
    --mode both \
    --show_routing

# Text-only demo (faster initialization)
hallunox-infer \
    --llm_model_id convaiinnovations/gemma-finetuned-4b-it \
    --demo \
    --mode text \
    --show_routing

🔨 Training Your Own Model

Quick Training

from hallunox import Trainer, TrainingConfig

# Configure training
config = TrainingConfig(
    # Model selection
    model_id="convaiinnovations/gemma-finetuned-4b-it",  # or "unsloth/Llama-3.2-3B-Instruct"
    embed_model_id="BAAI/bge-m3",
    
    # Training parameters
    batch_size=8,
    learning_rate=5e-4,
    max_epochs=6,
    warmup_steps=300,
    
    # Dataset configuration
    use_truthfulqa=True,
    use_halueval=True,
    use_fever=True,
    max_samples_per_dataset=3000,
    
    # Output
    output_dir="./models/my_medical_model"
)

# Train model
trainer = Trainer(config)
trainer.train()

Command Line Training

# Train general model
hallunox-train --batch_size 8 --learning_rate 5e-4 --max_epochs 6

# Train medical model
hallunox-train \
    --model_id convaiinnovations/gemma-finetuned-4b-it \
    --batch_size 4 \
    --learning_rate 3e-4 \
    --max_epochs 8 \
    --output_dir ./models/custom_medgemma

🏗️ Model Architecture

HalluNox supports two main architectures:

General Architecture (Llama-3.2-3B)

  1. LLM Component: Llama-3.2-3B-Instruct

    • Extracts internal hidden representations (3072D)
    • Supports any Llama-architecture model
  2. Embedding Model: BGE-M3 (fixed)

    • Provides reference semantic embeddings
    • 1024-dimensional dense vectors
  3. Projection Network: Standard ProjectionHead

    • Maps LLM hidden states to embedding space
    • 3-layer MLP with ReLU activations and dropout
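
For orientation, here is a minimal sketch of such a projection head; the intermediate width, dropout rate, and exact layer ordering are assumptions rather than the package's exact architecture.

import torch.nn as nn

class ProjectionHeadSketch(nn.Module):
    """Minimal sketch: 3-layer MLP mapping LLM hidden states (3072D) to BGE-M3 space (1024D)."""
    def __init__(self, in_dim: int = 3072, hidden_dim: int = 2048, out_dim: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, hidden_states):
        # Project pooled LLM hidden states into the reference embedding space.
        return self.net(hidden_states)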

Medical Architecture (MedGemma-4B-IT)

  1. Unified Multimodal Model:

    • Single Model: AutoModelForImageTextToText handles both text and images
    • Memory Optimized: Avoids double loading (saves ~8GB VRAM)
    • Fallback Support: Graceful degradation to text-only if needed
  2. Embedding Model: BGE-M3 (same as general)

    • Enhanced with medical context formatting
  3. Projection Network: UltraStableProjectionHead

    • Ultra-stable architecture with heavy normalization
    • Conservative weight initialization for medical precision
    • Tanh activations for stability
    • Enhanced dropout and layer normalization
  4. Multimodal Processor: AutoProcessor

    • Handles image + text inputs
    • Supports chat template formatting
  5. Quantization Support: 4-bit NF4 with double quantization

    • Reduces memory usage by ~75%
    • Maintains 85-90% performance
    • Automatic fallback for CPU
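
To make the semantic-alignment signal concrete, the sketch below scores a projected hidden state against a BGE-M3 reference embedding with cosine similarity. The pooling strategy and the rescaling of similarity into a 0-1 confidence-style score are assumptions, not the package's exact implementation.

import torch
import torch.nn.functional as F

def semantic_alignment_score(llm_hidden: torch.Tensor,
                             bge_embedding: torch.Tensor,
                             projection: torch.nn.Module) -> float:
    """Cosine similarity between the projected LLM state and the reference embedding.

    llm_hidden:    pooled last-layer hidden state from the LLM, shape (hidden_dim,)
    bge_embedding: BGE-M3 dense embedding of the same text, shape (1024,)
    projection:    trained projection head mapping hidden_dim -> 1024
    """
    projected = projection(llm_hidden)
    similarity = F.cosine_similarity(projected.unsqueeze(0), bge_embedding.unsqueeze(0)).item()
    # Rescale from [-1, 1] to [0, 1] so it reads like a confidence score (assumed rescaling).
    return (similarity + 1.0) / 2.0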

📊 API Reference

HallucinationDetector

Constructor Parameters

HallucinationDetector(
    model_path: str = None,                    # Path to trained model (None = auto-download)
    llm_model_id: str = "unsloth/Llama-3.2-3B-Instruct",  # LLM model ID
    embed_model_id: str = "BAAI/bge-m3",      # Embedding model ID
    device: str = None,                        # Device (None = auto-detect)
    max_length: int = 512,                     # LLM sequence length
    bge_max_length: int = 512,                # BGE-M3 sequence length
    use_fp16: bool = True,                     # Mixed precision
    load_llm: bool = True,                     # Load LLM
    enable_inference: bool = False,            # Enable LLM inference
    confidence_threshold: float = None,        # Custom threshold (auto-detected)
    enable_response_generation: bool = False,  # Enable response generation
    use_quantization: bool = False,            # Enable 4-bit quantization for memory savings
    quantization_config: BitsAndBytesConfig = None,  # Custom quantization config
    mode: str = "text",                        # Operation mode: "text", "image", "both", "auto" (default: "text")
)

Core Methods

Text Analysis:

  • predict(texts, query_context_pairs=None) - Analyze texts for hallucination confidence
  • predict_with_query_context(query_context_pairs) - Query-context prediction
  • batch_predict(texts, batch_size=16) - Efficient batch processing

Response Generation:

  • generate_response(prompt, max_length=512, check_confidence=True, force_generate=False, query_context_pairs=None) - Generate responses with confidence checking and optional context
  • generate_response_with_context(query_context_pairs, max_length=512, check_confidence=True, force_generate=False) - Generate responses for multiple query-context pairs

Multimodal (MedGemma only):

  • predict_images(images, image_descriptions=None) - Analyze image confidence
  • generate_image_response(image, prompt, max_length=200) - Generate image descriptions

Analysis:

  • evaluate_routing_strategy(texts) - Analyze routing decisions

Factory Methods:

  • for_embedding_only() - Create embedding-only detector
  • for_low_memory() - Create memory-optimized detector with 4-bit quantization

Response Format

{
    "predictions": [
        {
            "text": "input text",
            "confidence_score": 0.85,           # 0.0 to 1.0
            "similarity_score": 0.92,          # Cosine similarity
            "interpretation": "HIGH_CONFIDENCE", # or HIGH_MEDICAL_CONFIDENCE
            "risk_level": "LOW_RISK",          # or LOW_MEDICAL_RISK
            "routing_action": "LOCAL_GENERATION",
            "description": "This response appears to be factual and reliable."
        }
    ],
    "summary": {
        "total_texts": 1,
        "avg_confidence": 0.85,
        "high_confidence_count": 1,
        "medium_confidence_count": 0,
        "low_confidence_count": 0,
        "very_low_confidence_count": 0
    }
}

Response Generation Format

{
    "response": "Generated response text",
    "confidence_score": 0.85,
    "should_generate": True,
    "meets_threshold": True,
    "forced_generation": False,  # True if generated despite low confidence
    # Or when blocked:
    "reason": "Confidence 0.45 below threshold 0.60",
    "recommendation": "RAG_RETRIEVAL"
}
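
A minimal sketch of consuming this structure, assuming a detector created with enable_response_generation=True; the escalation step is an application-level choice, not part of the package:

from hallunox import HallucinationDetector

detector = HallucinationDetector(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    enable_response_generation=True,
)

result = detector.generate_response(
    "What are the side effects of ibuprofen?",
    check_confidence=True,
)

if result["should_generate"]:
    print(result["response"])
else:
    print(f"Blocked: {result['reason']}")
    if result.get("recommendation") == "RAG_RETRIEVAL":
        pass  # hand the query to a retrieval-augmented pipeline instead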

Training Classes

  • TrainingConfig: Configuration dataclass for training parameters
  • Trainer: Main training class with dataset loading and model training
  • MultiDatasetLoader: Loads and combines multiple hallucination detection datasets

Utility Functions

  • download_model(): Download general pre-trained model
  • download_medgemma_model(model_name): Download MedGemma medical model
  • setup_logging(level): Configure logging
  • check_gpu_availability(): Check CUDA compatibility
  • validate_model_requirements(): Verify dependencies
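
A short usage sketch of these helpers; return values are not documented above, so the boolean check and returned path below are assumptions:

from hallunox import setup_logging, check_gpu_availability, validate_model_requirements, download_model

setup_logging("INFO")              # configure package logging
validate_model_requirements()      # verify required dependencies are installed
if check_gpu_availability():       # assumed to report whether a CUDA device is usable
    model_path = download_model()  # download the general pre-trained model (assumed to return a path)
    print(f"Model available at: {model_path}")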

📈 Performance

Our confidence-aware routing system demonstrates:

  • 74% hallucination detection rate (vs 42% baseline)
  • 9% false positive rate (vs 15% baseline)
  • 40% reduction in computational cost vs post-hoc methods
  • 1.6x cost multiplier, versus 4.2x when always routing to the most expensive operations

Medical Domain Performance (MedGemma)

  • Enhanced medical accuracy with 0.62 confidence threshold
  • Multimodal capability for medical image analysis
  • Safety-first approach with conservative thresholds
  • Professional verification workflow for low-confidence cases

🖥️ Hardware Requirements

Minimum (Inference Only)

  • CPU: Modern multi-core processor
  • RAM: 16GB system memory
  • GPU: 8GB VRAM (RTX 3070, RTX 4060 Ti+)
  • Storage: 15GB free space
  • Models: ~5GB each (Llama/MedGemma)

Recommended (Inference)

  • CPU: Intel i7/AMD Ryzen 7+
  • RAM: 32GB system memory
  • GPU: 12GB+ VRAM (RTX 4070, RTX 3080+)
  • Storage: NVMe SSD, 25GB+ free
  • CUDA: 11.8+ compatible driver

Training Requirements

  • CPU: High-performance multi-core (i9/Ryzen 9)
  • RAM: 64GB+ system memory
  • GPU: 24GB+ VRAM (RTX 4090, A100, H100)
  • Storage: 200GB+ NVMe SSD
    • Model checkpoints: ~10GB per epoch
    • Training datasets: ~30GB
    • Logs and outputs: ~50GB
  • Network: High-speed internet for downloads

MedGemma Specific

  • Additional storage: +10GB for multimodal models
  • Image processing: PIL/Pillow for image capabilities
  • Memory: +4GB RAM for image processing pipeline

CPU-Only Mode

  • RAM: 32GB minimum (64GB recommended)
  • Performance: 10-50x slower than GPU
  • Not recommended: For production medical applications

🔒 Safety Considerations

Medical Applications

  • Professional oversight required: HalluNox is a research tool, not medical advice
  • Validation needed: All medical outputs should be verified by qualified professionals
  • Conservative thresholds: 0.62 threshold ensures high precision for medical content
  • Clear disclaimers: Always include appropriate medical disclaimers in applications

General Use

  • Confidence-based routing: Use routing recommendations for appropriate escalation
  • Human oversight: Very low confidence predictions require human review
  • Regular evaluation: Monitor performance on your specific use cases

🛠️ Troubleshooting

Common Issues and Solutions

CUDA Out of Memory Error

OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB...

Solution: Use 4-bit quantization

detector = HallucinationDetector.for_low_memory()

Deprecated torch_dtype Warning

`torch_dtype` is deprecated! Use `dtype` instead!

Solution: Already fixed in HalluNox v0.3.2+ - the package now uses the correct dtype parameter.

Double Model Loading (MedGemma)

Loading checkpoint shards: 100% 2/2 [00:37<00:00, 18.20s/it]
Loading checkpoint shards: 100% 2/2 [00:36<00:00, 17.88s/it]

Solution: Already optimized in HalluNox v0.3.2+ - MedGemma now uses a unified model approach that avoids double loading.

Accelerate Warning

WARNING:accelerate.big_modeling:Some parameters are on the meta device...

Solution: This is normal with quantization - parameters are automatically moved to GPU during inference.

Dependency Version Conflict (AutoProcessor)

⚠️ Could not load AutoProcessor: module 'requests' has no attribute 'exceptions'
AttributeError: module 'requests' has no attribute 'exceptions'

Solution: This is a compatibility issue between transformers and requests versions.

pip install --upgrade transformers requests huggingface_hub
# Or force reinstall
pip install --force-reinstall transformers>=4.45.0 requests>=2.31.0

Fallback: HalluNox automatically falls back to text-only mode when this occurs.

Model Hidden States NaN/Inf Issues ✅ RESOLVED

⚠️ Warning: NaN/Inf detected in model hidden states
   Hidden shape: torch.Size([3, 16, 2560])
   NaN count: 122880

✅ FIXED in HalluNox v0.6.3+: This issue has been completely resolved by adopting the proven approach from our working inference pipeline:

Root Cause: 4-bit quantization was causing numerical instabilities with certain model architectures.

Solution Applied:

  • Disabled Quantization: Removed 4-bit quantization that was causing NaN issues
  • Simplified Model Loading: Now uses the same approach as our proven inference_gemma.py
  • Clean Architecture: Removed complex stability measures that were interfering
  • Stable Precision: Uses torch.bfloat16 for optimal performance without instabilities

Repetitive Text and Unwanted Artifacts ✅ RESOLVED

🔬 Reference Response (forced): I am programmed to be a harmless AI assistant...
g
I am programmed to be a harmless AI assistant...
g
[repetitive output continues...]

✅ FIXED in HalluNox v0.6.3+: Repetitive text generation and unwanted artifacts have been completely resolved:

Root Cause: Improper message formatting and sampling parameters causing the model to not understand conversation boundaries.

Solution Applied:

  • Deterministic Generation: Changed from do_sample=True to do_sample=False matching Jupyter notebook approach
  • Proper Chat Templates: Adopted exact message formatting from working Jupyter notebook implementation
  • Removed Sampling Parameters: Eliminated temperature, top_p, repetition_penalty that were causing issues
  • Clean Tokenization: Uses tokenizer.apply_chat_template() with proper parameters for conversation structure

Current Recommended Usage (v0.6.3+):

# Standard usage - now stable by default
detector = HallucinationDetector.for_low_memory(
    llm_model_id="convaiinnovations/gemma-finetuned-4b-it",
    device="cuda"
)

# Both NaN issues and repetitive text are now automatically resolved

Migration from v0.4.9 and earlier: No code changes needed - existing code will automatically use the stable approach.

Environment Optimization

For better memory management, set:

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
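
The same setting can be applied from Python, as long as it is set before torch (or hallunox) is imported; a minimal sketch:

import os

# Must be set before torch is imported for the allocator to pick it up.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from hallunox import HallucinationDetector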

Memory Requirements by Configuration

GPU VRAM   Recommended Configuration               Expected Performance
4-6GB      for_low_memory() + reduce batch size    Basic functionality
8-12GB     for_low_memory()                        Full functionality
16GB+      Standard configuration                  Optimal performance
24GB+      Multiple models + training              Development/research

📄 License

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

📚 Citation

If you use HalluNox in your research, please cite:

@article{nandakishor2024hallunox,
    title={Confidence-Aware Routing for Large Language Model Reliability Enhancement: A Multi-Signal Approach to Pre-Generation Hallucination Mitigation},
    author={Nandakishor M},
    journal={AI Safety Research},
    year={2024},
    organization={Convai Innovations}
}

🤝 Contributing

We welcome contributions! Please see our contributing guidelines and submit pull requests to our repository.

Development Setup

git clone https://github.com/convai-innovations/hallunox.git
cd hallunox
pip install -e ".[dev]"

📞 Support

For technical support and questions:

👨‍💻 Author

Nandakishor M
AI Safety Research
Convai Innovations Pvt. Ltd.
Email: support@convaiinnovations.com


Disclaimer: HalluNox is a research tool for hallucination detection and should not be used as the sole basis for critical decisions, especially in medical contexts. Always seek professional advice for medical applications.
