A Python library for generating high-quality question-answer pairs from PDF, DOCX, MD, and TXT files

SemanticQAGen

Intelligent Question-Answer Generation with Advanced Semantic Understanding

Overview • Installation • Quickstart • Feature Overview • Core Capabilities • Advanced Features • Project File Organization • Architecture • Configuration • API Reference • CLI Reference • Usage Examples • Extension • Troubleshooting • License

Beta (v0.5.1): SemanticQAGen is functional and actively maintained. The public API is largely stable, but some details may still change ahead of the 1.0 release.

Overview

SemanticQAGen is a Python library for generating high-quality question-answer pairs from text documents. It splits each document into semantically coherent chunks, scores every chunk for information density, and generates diverse questions at several cognitive levels (factual, inferential, and conceptual).

SemanticQAGen features enhanced semantic chunking, dynamic question generation, validation of questions and answers, and flexible LLM routing capabilities. You can run all tasks locally on an OpenAI-compatible server, run them via a remote API, or split specific tasks (e.g., validation, analysis, generation) between local and remote servers. The library is designed with a "for Humans" philosophy - simple for basic use cases while providing advanced capabilities for power users.

Installation

Basic Installation

pip install semantic-qa-gen

The core install is enough to process plain-text and Markdown documents and to talk to any OpenAI-compatible LLM endpoint — OpenAI, Azure OpenAI, OpenRouter, or a local server such as Ollama — over the library's built-in HTTP client.

Optional Dependencies

Install extras for additional document formats and capabilities:

# Quote the package spec so shells such as zsh do not treat the brackets as glob patterns

# PDF reading
pip install "semantic-qa-gen[pdf]"

# DOCX reading
pip install "semantic-qa-gen[docx]"

# OCR for scanned PDFs (also requires the Tesseract binary; see System Dependencies)
pip install "semantic-qa-gen[ocr]"

# Advanced NLP analysis (scikit-learn, numpy)
pip install "semantic-qa-gen[advanced]"

# All document-format support (PDF + DOCX + OCR + advanced analysis)
pip install "semantic-qa-gen[formats]"

# Sentence-aware chunking helpers (NLTK)
pip install "semantic-qa-gen[nlp]"

# Everything above, plus the official OpenAI SDK and Rich console output
pip install "semantic-qa-gen[full]"

# Development tools (tests, linting, docs)
pip install "semantic-qa-gen[dev]"

A note on LLM providers: You do not need a provider-specific extra to use hosted or local OpenAI-compatible APIs. The library communicates with them directly over its built-in HTTP client, so OpenAI, Azure OpenAI, OpenRouter, and local servers all work with the core install. The [openai] extra only installs the official openai SDK, which SemanticQAGen itself does not use — install it only if your own code depends on that SDK.

System Dependencies

A couple of features rely on libraries outside of pip:

File-type detection uses python-magic, which requires the system libmagic library. On Debian/Ubuntu: sudo apt-get install libmagic1. On macOS: brew install libmagic. On Windows, install python-magic-bin instead of relying on the system library.
OCR ([ocr]) requires the Tesseract OCR engine on your PATH. On Debian/Ubuntu: sudo apt-get install tesseract-ocr. On macOS: brew install tesseract.

Requirements

Python 3.10 or higher
Core Python dependencies install automatically; the system libraries above are only needed for their corresponding features.

Quickstart

Basic Usage

from semantic_qa_gen import SemanticQAGen

# Initialize with default settings
qa_gen = SemanticQAGen()

# Process a document
result = qa_gen.process_document("path/to/document.txt")

# Save the questions to a JSON file
qa_gen.save_questions(result, "output")

CLI Usage

# Generate questions from a document with default settings
semantic-qa-gen process document.pdf -o questions_output

# Create a config file interactively 
semantic-qa-gen init-config config.yml --interactive

# Process with a specific configuration
semantic-qa-gen process document.txt --config config.yml --format json

Feature Overview

SemanticQAGen offers a comprehensive set of features designed to produce high-quality question and answer sets:

Feature Category	Capability	Status
Document Processing	Document format support: TXT, PDF, DOCX, MD	✅
	Automatic document type detection	✅
	Cross-page content handling	✅
	Header/footer detection and removal	✅
Content Analysis	Semantic document chunking	✅
	Information density analysis	✅
	Topic coherence evaluation	✅
	Key concept extraction	✅
Question Generation	Multi-level cognitive questions (factual, inferential, conceptual)	✅
	Adaptive generation based on content quality	✅
	Question diversity enforcement	✅
	Custom question categories	✅
Answer Validation	Faithfulness verification (accuracy + completeness)	✅
	Standalone / decontextualization rewriting	✅
	Answer-leakage filtering	✅
	Diversity filtering	✅
LLM Integration	OpenAI-compatible API support (OpenAI, Azure, OpenRouter)	✅
	Local LLM support (Ollama, etc.)	✅
	Hybrid task routing	✅
	Automatic fallback mechanisms	✅
Processing Control	Checkpoint and resume capability	✅
	Concurrent processing	✅
	Progress tracking and reporting	✅
Output Options	Multiple export formats (JSON, CSV, JSONL)	✅
	Metadata inclusion	✅
	Statistics and analytics	✅
Extensibility	Custom document loaders	✅
	Custom chunking strategies	✅
	Custom validators	✅

Core Capabilities

Document Processing

Multiple Format Support

SemanticQAGen can read and process a variety of document formats including plain text, PDF, Markdown, and DOCX. Each format is handled by specialized loaders that extract content while preserving document structure. (PDF support requires the [pdf] extra and DOCX support requires the [docx] extra; TXT and Markdown work with the core install.)

# Process different file types the same way
result_txt = qa_gen.process_document("document.txt")
result_pdf = qa_gen.process_document("document.pdf")
result_md = qa_gen.process_document("document.md")
result_docx = qa_gen.process_document("document.docx")

Batch Processing

Process multiple files from a directory:

# Process all files in a directory
batch_results = qa_gen.process_input_directory()

Automatic Document Type Detection

The system automatically detects document types using both file extensions and content analysis, ensuring the correct loader is used even when file extensions are missing or incorrect.
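As a rough illustration of content-based detection, the sketch below checks a file's leading bytes before falling back to its extension. It is not the library's implementation (which relies on python-magic, as noted under System Dependencies); the function name `sniff_document_type` and the returned labels are hypothetical, and only the standard library is used.

```python
import zipfile
from pathlib import Path


def sniff_document_type(path: str) -> str:
    """Guess a document type from content first, then from the extension."""
    p = Path(path)
    with p.open("rb") as fh:
        head = fh.read(8)
    # PDF files start with the "%PDF-" signature regardless of their name.
    if head.startswith(b"%PDF-"):
        return "pdf"
    # DOCX files are ZIP archives that contain word/document.xml.
    if head.startswith(b"PK") and zipfile.is_zipfile(p):
        with zipfile.ZipFile(p) as zf:
            if "word/document.xml" in zf.namelist():
                return "docx"
    # Text-like formats have no signature, so fall back to the extension.
    if p.suffix.lower() in {".md", ".markdown"}:
        return "markdown"
    return "text"
```

With this approach a PDF saved as `report.txt` is still classified as `pdf`, because the signature check runs before the extension check.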

Cross-Page Content Handling

For PDF documents, the system intelligently handles sentences and paragraphs that span across page boundaries, creating a seamless text flow for better semantic analysis.
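One simple way to think about cross-page handling is shown below: a minimal sketch, assuming each page has already been extracted to a string. The helper `join_pages` is hypothetical and uses plain punctuation heuristics; the library's own logic may differ.

```python
def join_pages(pages: list[str]) -> str:
    """Merge page texts, repairing sentences split across page breaks."""
    merged = ""
    for page in pages:
        text = page.strip()
        if not text:
            continue
        if not merged:
            merged = text
        elif merged.endswith("-") and text[0].islower():
            # A word hyphenated across the page break: drop the hyphen and join.
            merged = merged[:-1] + text
        elif merged[-1] not in ".!?:\"'":
            # The previous page ended mid-sentence: continue on the same line.
            merged += " " + text
        else:
            # The previous page ended cleanly: start a new paragraph.
            merged += "\n\n" + text
    return merged
```

For example, the pages `"The model was eval-"` and `"uated on two sets."` are joined into a single sentence instead of two fragments.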

Header/Footer Detection

Repeating headers and footers in PDF documents are detected automatically and can optionally be removed, which keeps them out of the generated questions.
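The idea can be sketched as a frequency count over the first and last line of every page. This is an illustrative heuristic only, not the library's detector: `strip_repeated_lines` and its `min_share` parameter are hypothetical, and the sketch only catches lines that repeat verbatim (so footers with changing page numbers would need extra normalization).

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_share: float = 0.6) -> list[str]:
    """Remove first/last lines that repeat on at least `min_share` of pages."""
    first: Counter = Counter()
    last: Counter = Counter()
    split_pages = [p.strip().splitlines() for p in pages]
    for lines in split_pages:
        if lines:
            first[lines[0].strip()] += 1
            last[lines[-1].strip()] += 1
    # Require at least two occurrences so a one-page document is left alone.
    cutoff = max(2, int(min_share * len(pages)))
    headers = {line for line, n in first.items() if n >= cutoff}
    footers = {line for line, n in last.items() if n >= cutoff}
    cleaned = []
    for lines in split_pages:
        if lines and lines[0].strip() in headers:
            lines = lines[1:]
        if lines and lines[-1].strip() in footers:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned
```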

Content Analysis

Semantic Document Chunking

Documents are intelligently broken down into semantically coherent chunks based on content structure rather than arbitrary size limits. This preserves context and produces more meaningful question-answer pairs.

# Configure chunking strategy
config = {
    "chunking": {
        "strategy": "semantic",  # Options: semantic, fixed_size
        "target_chunk_size": 1500,
        "preserve_headings": True
    }
}
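To make the difference from fixed-size splitting concrete, here is a minimal sketch of structure-aware chunking. It uses Markdown-style headings and paragraph breaks as a stand-in for semantic boundaries, which is a simplification; the function `chunk_by_paragraphs` is hypothetical and is not the library's `semantic` strategy.

```python
def chunk_by_paragraphs(text: str, target_size: int = 1500) -> list[str]:
    """Pack paragraphs into chunks near target_size, breaking at headings."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        is_heading = para.startswith("#")
        # Close the current chunk at a heading or when the size budget is spent.
        if current and (is_heading or size + len(para) > target_size):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because a heading always opens a new chunk, each chunk keeps its heading together with the text it introduces, which mirrors the intent of `preserve_headings`.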

Information Density Analysis

Each chunk is analyzed for information density - how rich in facts and teachable content it is. This analysis guides question generation to focus on content-rich sections.

Topic Coherence Evaluation

The system evaluates how well each chunk maintains a coherent topic or theme, which helps ensure generated questions relate to a consistent subject area.

Key Concept Extraction

Important concepts, terms, and ideas are automatically identified in each chunk, forming the basis for targeted question generation.

Question Generation

Multi-level Cognitive Questions

The system generates questions across three cognitive domains:

Factual: Direct recall of information stated in the content
Inferential: Questions requiring connecting multiple pieces of information
Conceptual: Higher-order questions about principles, implications, or broader understanding

# Configure question categories
config = {
    "question_generation": {
        "categories": {
            "factual": {"min_questions": 3, "weight": 1.0},
            "inferential": {"min_questions": 2, "weight": 1.2},
            "conceptual": {"min_questions": 1, "weight": 1.5}
        }
    }
}

Adaptive Generation

The number and types of questions generated adapt automatically based on content quality. Information-dense chunks yield more questions, while sparse chunks yield fewer.
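A possible way to picture adaptive generation: every category keeps its `min_questions`, and the remaining budget is scaled by a density score and shared out by `weight`. The sketch below is an assumption about how such settings could interact, not the library's actual formula; `plan_question_counts`, the 0 to 1 `density` score, and `max_total` are hypothetical names.

```python
def plan_question_counts(categories: dict, density: float, max_total: int = 10) -> dict:
    """Scale per-category question counts by a 0..1 density score."""
    density = min(max(density, 0.0), 1.0)
    minimum = sum(c["min_questions"] for c in categories.values())
    # Budget left after the minimums, shrunk for sparse chunks.
    extra = round(max(max_total - minimum, 0) * density)
    total_weight = sum(c["weight"] for c in categories.values())
    plan = {}
    for name, c in categories.items():
        share = int(extra * c["weight"] / total_weight)  # round down to stay under budget
        plan[name] = c["min_questions"] + share
    return plan


categories = {
    "factual": {"min_questions": 3, "weight": 1.0},
    "inferential": {"min_questions": 2, "weight": 1.2},
    "conceptual": {"min_questions": 1, "weight": 1.5},
}
sparse = plan_question_counts(categories, density=0.0)
dense = plan_question_counts(categories, density=1.0)
```

Here `sparse` stays at the configured minimums, while `dense` adds one question per category without exceeding `max_total`.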

Question Diversity Enforcement

To avoid repetitive or overly similar questions, the system enforces diversity by comparing newly generated questions with existing ones and filtering out duplicates.
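The comparison step can be illustrated with a plain string-similarity check. This sketch uses `difflib.SequenceMatcher` from the standard library as a stand-in for whatever similarity measure the library applies; `filter_near_duplicates` is a hypothetical helper, and the 0.85 default simply echoes the diversity threshold shown in the configuration example.

```python
from difflib import SequenceMatcher


def filter_near_duplicates(questions: list[str], threshold: float = 0.85) -> list[str]:
    """Keep a question only if it is not too similar to one already kept."""
    kept: list[str] = []
    for question in questions:
        norm = question.lower().strip()
        is_new = all(
            SequenceMatcher(None, norm, other.lower().strip()).ratio() < threshold
            for other in kept
        )
        if is_new:
            kept.append(question)
    return kept
```

Character-level similarity only catches rewordings that share most of their text; paraphrases with different vocabulary would need an embedding-based comparison.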

Custom Question Categories

Users can define custom question categories beyond the standard factual/inferential/conceptual to target specific learning objectives.

Answer Validation

Faithfulness Verification

Generated answers are scored for faithfulness against the source content — both factual accuracy and completeness — to ensure they do not contain errors or hallucinations and that they fully address the question.

Standalone / Decontextualization Rewriting

Question-answer pairs are evaluated for whether they make sense without the source passage, and can be rewritten to stand alone while preserving their grounded meaning. This produces self-contained pairs suitable for fine-tuning datasets.

Answer-Leakage Filtering

A dedicated filter detects and removes questions that inadvertently reveal their answer (or that restate the source verbatim), so that every retained question still requires the reader to supply the answer.
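A minimal sketch of the leakage check, assuming the simplest possible signal: the answer's wording appears verbatim inside the question. The helper `leaks_answer` and its `min_words` parameter are hypothetical; the library's filter may use additional or different criteria.

```python
import re


def leaks_answer(question: str, answer: str, min_words: int = 2) -> bool:
    """Return True when the answer's wording appears verbatim in the question."""

    def norm(text: str) -> str:
        # Lowercase and keep only alphanumeric tokens so punctuation is ignored.
        return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

    q, a = norm(question), norm(answer)
    if len(a.split()) < min_words:
        # Very short answers match too easily to be a useful signal.
        return False
    # Pad with spaces so only whole-token matches count.
    return f" {a} " in f" {q} "
```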

Diversity Filtering

Newly generated questions are compared against the existing set so near-duplicates are filtered out, keeping the output varied.

Advanced Features

LLM Integration

OpenAI-Compatible API Support

SemanticQAGen talks to OpenAI-compatible chat-completion endpoints over its own built-in HTTP client, with optimized prompting strategies for each task in the pipeline. This covers OpenAI, Azure OpenAI, OpenRouter, and any other service that exposes an OpenAI-compatible API — no provider SDK is required.

Local LLM Support

Local LLMs are supported through Ollama and similar OpenAI-compatible servers, so models such as Mistral can run on your own hardware without any external API access.

Hybrid Task Routing

Intelligently route different tasks to the most appropriate LLM based on task complexity and model capability. For example, use a strong remote model for complex question generation but a local model for simple validation tasks.

config = {
    "llm_services": {
        "local": {
            "enabled": True,
            "url": "http://localhost:11434/api",
            "model": "mistral:7b",
            "preferred_tasks": ["validation"]
        },
        "remote": {
            "enabled": True,
            "provider": "openai",
            "model": "gpt-4o",
            "preferred_tasks": ["analysis", "generation"]
        }
    }
}
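The routing decision implied by `preferred_tasks` can be sketched as a lookup over the enabled services. This is an illustration of the concept only: `pick_service` is a hypothetical helper, not part of the library's router, and the fallback to "any enabled service" is an assumption.

```python
def pick_service(llm_services: dict, task: str) -> str:
    """Return the name of the first enabled service that prefers `task`."""
    enabled = {name: svc for name, svc in llm_services.items() if svc.get("enabled")}
    for name, svc in enabled.items():
        if task in svc.get("preferred_tasks", []):
            return name
    if enabled:
        # No service declared a preference for this task: use any enabled one.
        return next(iter(enabled))
    raise LookupError(f"No enabled LLM service for task {task!r}")


services = {
    "local": {"enabled": True, "preferred_tasks": ["validation"]},
    "remote": {"enabled": True, "preferred_tasks": ["analysis", "generation"]},
}
```

With the `services` mapping above, `pick_service(services, "validation")` selects `local`, and `pick_service(services, "generation")` selects `remote`.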

Automatic Fallback Mechanisms

If a primary LLM service fails, the system automatically tries fallback services, ensuring robustness in production environments.
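The fallback pattern itself is straightforward: try each service in priority order and return the first successful response. The sketch below assumes each service is a plain callable that takes a prompt and returns text; `call_with_fallback` is a hypothetical name and does not reflect the library's adapter interface.

```python
from typing import Callable, Sequence


def call_with_fallback(services: Sequence[Callable[[str], str]], prompt: str) -> str:
    """Try each service in order and return the first successful response."""
    errors = []
    for service in services:
        try:
            return service(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(exc)
    raise RuntimeError(f"All {len(services)} services failed: {errors}")
```

A production version would usually add per-service timeouts and retry limits before moving on to the next service.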

Processing Control

Checkpoint and Resume Capability

Processing can be interrupted and resumed later using a checkpoint system. This is essential for large documents or when processing must be paused.

config = {
    "processing": {
        "enable_checkpoints": True,
        "checkpoint_dir": "./checkpoints",
        "checkpoint_interval": 10  # Save every 10 chunks
    }
}
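To show what interval-based checkpointing buys you, here is a small self-contained sketch that persists progress to a JSON file and skips already-processed chunks on the next run. The function `process_with_checkpoints` and the file layout are hypothetical; the library manages its own checkpoint format under `checkpoint_dir`.

```python
import json
from pathlib import Path


def process_with_checkpoints(chunks, handle, checkpoint_file, interval=10):
    """Process chunks in order, saving progress every `interval` chunks."""
    path = Path(checkpoint_file)
    # Resume from a previous run if a checkpoint exists.
    state = json.loads(path.read_text()) if path.exists() else {"done": 0, "results": []}
    for index in range(state["done"], len(chunks)):
        state["results"].append(handle(chunks[index]))
        state["done"] = index + 1
        # Write at each interval boundary and once more at the very end.
        if state["done"] % interval == 0 or state["done"] == len(chunks):
            path.write_text(json.dumps(state))
    return state["results"]
```

A smaller interval loses less work after an interruption but writes to disk more often.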

Concurrent Processing

Chunks are processed concurrently across multiple threads, and the concurrency level is configurable so you can balance throughput against API rate limits and local resources.
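As a generic illustration of bounded concurrency, the sketch below fans chunks out to a thread pool whose size plays the role of the `concurrency` setting. `process_chunks` is a hypothetical helper built on the standard library; it is not the library's scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunks(chunks, handle, concurrency=3):
    """Run `handle` over chunks with at most `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map yields results in input order, even if work finishes out of order.
        return list(pool.map(handle, chunks))
```

Threads suit this workload because most of the time is spent waiting on LLM responses rather than on CPU-bound work.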

Progress Tracking and Reporting

Progress is reported in detail during processing, either as simple console output or as a rich interactive display (available when Rich is installed, e.g. via the [full] extra).

Output Options

Multiple Export Formats

Export question-answer pairs in various formats including JSON, CSV, and JSONL with customizable formatting options.

# Save questions in different formats
qa_gen.save_questions(result, "questions_output", format_name="json")
qa_gen.save_questions(result, "questions_output", format_name="csv")
qa_gen.save_questions(result, "questions_output", format_name="jsonl")

Metadata Inclusion

Include rich metadata about source documents, generation parameters, and validation results with the generated questions.

Statistics and Analytics

Comprehensive statistics about generated questions, including category distribution, validation success rates, and content coverage.
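The kind of summary described here can be computed with a few lines of Python. In this sketch each question is assumed to be a dict with `category` and `is_valid` keys; those field names and the `summarize` helper are hypothetical and may not match the structure the library returns.

```python
from collections import Counter


def summarize(questions: list[dict]) -> dict:
    """Compute category distribution and validation success rate."""
    categories = Counter(q["category"] for q in questions)
    valid = sum(1 for q in questions if q.get("is_valid", False))
    return {
        "total": len(questions),
        "categories": dict(categories),
        "validation_success_rate": valid / len(questions) if questions else 0.0,
    }
```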

Project File Organization

.
├── pyproject.toml
├── tox.ini
├── README.md
└── src
    ├── main.py
    └── semantic_qa_gen
        ├── __init__.py
        ├── version.py
        ├── semantic_qa_gen.py
        ├── chunking
        │   ├── __init__.py
        │   ├── analyzer.py
        │   ├── engine.py
        │   └── strategies
        │       ├── __init__.py
        │       ├── base.py
        │       ├── fixed_size.py
        │       ├── nlp_helpers.py
        │       └── semantic.py
        ├── cli
        │   ├── __init__.py
        │   └── commands.py
        ├── config
        │   ├── __init__.py
        │   ├── manager.py
        │   └── schema.py
        ├── document
        │   ├── __init__.py
        │   ├── models.py
        │   ├── processor.py
        │   └── loaders
        │       ├── __init__.py
        │       ├── base.py
        │       ├── docx.py
        │       ├── markdown.py
        │       ├── pdf.py
        │       └── text.py
        ├── llm
        │   ├── __init__.py
        │   ├── router.py
        │   ├── adapters
        │   │   ├── __init__.py
        │   │   ├── base.py
        │   │   └── openai_adapter.py
        │   └── prompts
        │       ├── __init__.py
        │       ├── manager.py
        │       └── templates
        │           ├── analysis_prompts.yaml
        │           ├── generation_prompts.yaml
        │           ├── validation_prompts.yaml
        │           └── decontextualize_prompts.yaml
        ├── output
        │   ├── __init__.py
        │   ├── formatter.py
        │   └── adapters
        │       ├── __init__.py
        │       ├── csv.py
        │       ├── json.py
        │       └── jsonl.py
        ├── pipeline
        │   ├── __init__.py
        │   └── semantic.py
        ├── question
        │   ├── __init__.py
        │   ├── generator.py
        │   ├── processor.py
        │   ├── filters
        │   │   ├── __init__.py
        │   │   └── leak_filter.py
        │   └── validation
        │       ├── __init__.py
        │       ├── base.py
        │       ├── decontextualizer.py
        │       ├── diversity.py
        │       ├── engine.py
        │       └── factual.py
        └── utils
            ├── __init__.py
            ├── checkpoint.py
            ├── error.py
            ├── logging.py
            ├── progress.py
            └── project.py

Architecture

SemanticQAGen implements a modular pipeline architecture with clearly defined components and interfaces:

                              ARCHITECTURE OVERVIEW
┌───────────────────────────────────────────────────────────────────────────────┐
│                                                                               │
│                       ┌─────────────────────────────────┐                     │
│                       │       SemanticQAGen Class       │                     │
│                       │      (Main User Interface)      │                     │
│                       └─────────────┬───────────────────┘                     │
│                                     │                                         │
│                                     ▼                                         │
│           ┌────────────────────────────────────────────────────┐              │
│           │              SemanticPipeline Orchestrator         │              │
│           └┬────────────────┬───────────────────┬─────────────┬┘              │
│            │                │                   │             │               │
│  ┌─────────▼──────────┐     │    ┌──────────────▼───────────┐ │               │
│  │  Document Manager  │     │    │   Chunking & Analysis    │ │               │
│  └┬────────────────┬──┘     │    └┬─────────────────────┬───┘ │               │
│   │                │        │     │                     │     │               │
│┌──▼───────┐   ┌────▼────┐   │  ┌──▼───────┐       ┌────▼────┐ │               │
││ Document │   │Document │   │  │ Chunking │       │Semantic │ │               │
││ Loaders  │   │Processor│   │  │ Engine   │       │Analyzer │ │               │
│└──────────┘   └─────────┘   │  └──────────┘       └─────────┘ │               │
│                             │                                 │               │
│      ┌─────────────────────────────────────────────────┐      │               │
│      │               LLM Service Router                │      │               │
│      │                                                 │      │               │
│      │  ┌────────────────┐         ┌────────────────┐  │      │               │
│      │  │ Remote LLM     │         │ Local LLM      │  │      │               │
│      │  │ (OpenAI, etc.) │         │ (Ollama, etc.) │  │      │               │
│      │  └────────────────┘         └────────────────┘  │      │               │
│      └─────────────────────────────────────────────────┘      │               │
│                             │                                 │               │
│  ┌────────────────────────┐ │ ┌───────────────────────────────▼────────┐      │
│  │  Question Generator    │ │ │         Validation Engine              │      │
│  │                        │◄┼─┼────┐                                   │      │
│  │  ┌──────────────────┐  │ │ │    │                                   │      │
│  │  │Category: Factual │  │ │ │    │  ┌─────────────┐                  │      │
│  │  └──────────────────┘  │ │ │    ├─►│ Traditional │                  │      │
│  │  ┌──────────────────┐  │ │ │    │  │ Validators  │                  │      │
│  │  │Cat: Inferential  │  │ │ │    │  └─────────────┘                  │      │
│  │  └──────────────────┘  │ │ │    │                                   │      │
│  │  ┌──────────────────┐  │ │ │    │                                   │      │
│  │  │Cat: Conceptual   │  │ │ │    │                                   │      │
│  │  └──────────────────┘  │ │ │    │                                   │      │
│  └─────────┬──────────────┘ │ │    │                                   │      │
│            │                │ │    │                                   │      │
│            └────────────────┼─┼────┘                                   │      │
│                             │ └───────────────────────────────────────┬┘      │
│                             │                                         │       │
│                             │                                         │       │
│          ┌──────────────────▼─────────────────────┐                   │       │
│          │           Output Formatter             │                   │       │
│          │                                        │                   │       │
│          │  ┌─────────────┐    ┌────────────────┐ │                   │       │
│          │  │ JSON Adapter│    │  CSV Adapter   │ │                   │       │
│          │  └─────────────┘    └────────────────┘ │                   │       │
│          └────────────────────────────────────────┘                   │       │
│                             │                                         │       │
│           ┌─────────────────▼────────────────────┐                    │       │
│           │           Output Results             │                    │       │
│           │  • Questions & Answers               │                    │       │
│           │  • Document Metadata                 │                    │       │
│           │  • Statistics                        │                    │       │
│           └──────────────────────────────────────┘                    │       │
│                                                                       │       │
│                      ┌─────────────────────────────┐                  │       │
│                      │     Checkpoint Manager      │◄─────────────────┘       │
│                      │   (Resume Capabilities)     │                          │
│                      └─────────────────────────────┘                          │
│                                                                               │
│                      ┌─────────────────────────────┐                          │
│                      │     Progress Reporter       │                          │
│                      │   (Processing Feedback)     │                          │
│                      └─────────────────────────────┘                          │
└───────────────────────────────────────────────────────────────────────────────┘

Core Components

Document Processor: Handles document loading and preprocessing
Chunking Engine: Splits documents into semantically coherent chunks
Semantic Analyzer: Evaluates information density and question potential
Question Generator: Creates diverse questions based on content analysis
Validation Engine: Ensures question quality and diversity
Output Formatter: Formats and exports the generated Q&A pairs

Processing Pipeline

Document → Chunks → Analysis → Questions → Validation → Output

The pipeline implements a two-phase approach:

Analysis Phase: Document is processed, chunked, and analyzed for content quality
Generation Phase: Questions are generated, validated, and formatted based on analysis

Configuration

SemanticQAGen uses a hierarchical YAML configuration system with schema validation.

Configuration File Example

# SemanticQAGen configuration
version: 1.0

# Document processing settings
document:
  loaders:
    text:
      enabled: true
      encoding: utf-8
    pdf:
      enabled: true
      extract_images: false
      ocr_enabled: false
      detect_headers_footers: true
    markdown:
      enabled: true
      extract_metadata: true
    docx:
      enabled: true
      extract_images: false

# Chunking settings
chunking:
  strategy: semantic
  target_chunk_size: 1500
  overlap_size: 150
  preserve_headings: true
  min_chunk_size: 500
  max_chunk_size: 2500

# LLM services configuration
llm_services:
  local:
    enabled: true
    url: "http://localhost:11434/api"
    model: "mistral:7b"
    preferred_tasks: [validation]
    timeout: 60
  remote:
    enabled: true
    provider: openai
    model: gpt-4o
    api_key: ${OPENAI_API_KEY}
    preferred_tasks: [analysis, generation]
    timeout: 120
    rate_limit_tokens: 90000
    rate_limit_requests: 100

# Question generation settings
question_generation:
  max_questions_per_chunk: 10
  adaptive_generation: true
  categories:
    factual:
      min_questions: 2
      weight: 1.0
    inferential:
      min_questions: 2
      weight: 1.2
    conceptual:
      min_questions: 1
      weight: 1.5

# Validation settings
validation:
  factual_accuracy:
    enabled: true
    threshold: 0.7
  answer_completeness:
    enabled: true
    threshold: 0.7
  question_clarity:
    enabled: true
    threshold: 0.7
  diversity:
    enabled: true
    threshold: 0.85

# Processing settings
processing:
  concurrency: 3
  enable_checkpoints: true
  checkpoint_interval: 10
  checkpoint_dir: "./checkpoints"
  log_level: "INFO"
  debug_mode: false

# Output settings
output:
  format: json
  include_metadata: true
  include_statistics: true
  output_dir: "./output"
  fine_tuning_format: "default"
  json_indent: 2
  json_ensure_ascii: false
  csv_delimiter: ","
  csv_quotechar: "\""

Environment Variables

Configuration values can reference environment variables with ${VAR} syntax, which keeps secrets such as API keys out of the configuration file:

llm_services:
  remote:
    api_key: ${OPENAI_API_KEY}
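For readers curious how `${VAR}` placeholders are typically resolved, the following sketch walks a loaded configuration and substitutes values from the environment. It is an illustration under assumptions: `expand_env` is a hypothetical helper, and leaving unset variables untouched is a design choice of this sketch, not documented library behavior.

```python
import os
import re

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=os.environ):
    """Recursively replace ${NAME} placeholders with values from `env`."""
    if isinstance(value, dict):
        return {k: expand_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v, env) for v in value]
    if isinstance(value, str):
        # Unknown variables are left as-is so the problem stays visible.
        return _VAR.sub(lambda m: env.get(m.group(1), m.group(0)), value)
    return value
```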

Configuration Layering

Configuration is resolved in the following order, with each later source taking precedence over the ones before it:

1. Default values
2. Configuration file
3. Environment variables
4. Command-line arguments
5. Programmatic overrides
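Layered resolution is commonly implemented as a recursive merge in which later layers override earlier ones key by key. The sketch below assumes that behavior; `merge_layers` is a hypothetical helper and not the library's configuration manager.

```python
def merge_layers(*layers: dict) -> dict:
    """Deep-merge dictionaries; later layers override earlier ones."""
    merged: dict = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Merge nested sections instead of replacing them wholesale.
                merged[key] = merge_layers(merged[key], value)
            else:
                merged[key] = value
    return merged


defaults = {"chunking": {"strategy": "semantic", "target_chunk_size": 1500}}
from_file = {"chunking": {"target_chunk_size": 1000}}
overrides = {"processing": {"concurrency": 5}}
resolved = merge_layers(defaults, from_file, overrides)
```

In `resolved`, the file's `target_chunk_size` replaces the default while the untouched `strategy` key survives, which is the practical benefit of merging per key rather than per section.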

API Reference

Main Class: `SemanticQAGen`

from typing import Any, Dict, Optional

class SemanticQAGen:
    """Main interface for generating question-answer pairs from text documents."""
    
    def __init__(self, config_path: Optional[str] = None, 
                config_dict: Optional[Dict[str, Any]] = None,
                verbose: bool = False,
                project_path: Optional[str] = None):
        """Initialize SemanticQAGen with optional configuration."""
        
    def process_document(self, document_path: str) -> Dict[str, Any]:
        """
        Process a document to generate question-answer pairs.
        
        Args:
            document_path: Path to the document file.
            
        Returns:
            Dictionary containing questions, statistics, and metadata.
        """
        
    def process_input_directory(self, output_format: Optional[str] = None) -> Dict[str, Any]:
        """
        Processes all readable files in the project's input directory.
        
        Args:
            output_format: Optional output format to override config.
            
        Returns:
            A dictionary summarizing the batch processing results.
        """
        
    def save_questions(self, result: Dict[str, Any], 
                      output_path: str,
                      format_name: Optional[str] = None) -> str:
        """
        Save generated questions to a file.
        
        Args:
            result: Results from process_document.
            output_path: Path where to save the output.
            format_name: Format to save in (json, csv, jsonl).
            
        Returns:
            Path to the saved file.
        """
        
    def create_default_config_file(self, output_path: str, include_comments: bool = True) -> None:
        """Create a default configuration file."""
        
    def dump_failed_chunks(self, output_path: Optional[str] = None) -> int:
        """
        Generate a detailed report of failed chunks for debugging.
        
        Args:
            output_path: Optional path to write the report.
            
        Returns:
            Number of failed chunks reported.
        """

CLI Reference

SemanticQAGen provides a command-line interface:

Main Commands

semantic-qa-gen process <document> [-o OUTPUT] [-f {json,csv,jsonl}] [-c CONFIG] [-p PROJECT] [-v]
semantic-qa-gen create-project [path]
semantic-qa-gen init-config <output> [-i] [-p PROJECT]
semantic-qa-gen interactive
semantic-qa-gen formats
semantic-qa-gen info
semantic-qa-gen version

Command Details

process             Process a document and generate questions
  document          Path to the document file
  -o, --output      Path for output file
  -f, --format      Output format (json, csv, jsonl)
  -c, --config      Path to config file
  -p, --project     Path to QAGenProject directory
  -v, --verbose     Enable verbose output

create-project      Create a new QAGenProject structure
  path              Path for the new project (default: current directory)

init-config         Create a default configuration file
  output            Path for the config file
  -i, --interactive Create config interactively
  -p, --project     Path to QAGenProject directory

interactive         Run in interactive mode
formats             List supported file formats
info                Show system information
version             Show the version and exit

Examples

# Process a PDF document
semantic-qa-gen process document.pdf -o questions_output

# Create a new project
semantic-qa-gen create-project my_qa_project

# Create a default configuration file
semantic-qa-gen init-config config.yml

# Create a configuration file interactively
semantic-qa-gen init-config config.yml --interactive

Usage Examples

Basic Document Processing

from semantic_qa_gen import SemanticQAGen

# Initialize with default settings
qa_gen = SemanticQAGen()

# Process a document
result = qa_gen.process_document("path/to/document.txt")

# Save the questions to a JSON file
qa_gen.save_questions(result, "qa_pairs")

# Display stats
print(f"Generated {len(result['questions'])} questions")
print(f"Factual questions: {result['statistics']['categories'].get('factual', 0)}")
print(f"Inferential questions: {result['statistics']['categories'].get('inferential', 0)}")
print(f"Conceptual questions: {result['statistics']['categories'].get('conceptual', 0)}")

Using a Project Structure

from semantic_qa_gen import SemanticQAGen

# Create or use an existing project structure
qa_gen = SemanticQAGen(project_path="my_qa_project")

# Process a document (can be in project's input directory)
result = qa_gen.process_document("input/document.txt")

# Save questions (will save to project's output directory)
qa_gen.save_questions(result, "questions_output")

# Process all documents in the input directory
batch_results = qa_gen.process_input_directory()

Using Local and Remote LLMs Together

from semantic_qa_gen import SemanticQAGen

# Configuration for hybrid LLM setup
config = {
    "llm_services": {
        "local": {
            "enabled": True,
            "url": "http://localhost:11434/api",
            "model": "mistral:7b",
            "preferred_tasks": ["validation"]
        },
        "remote": {
            "enabled": True,
            "provider": "openai",
            "model": "gpt-4o",
            "api_key": "YOUR_API_KEY",
            "preferred_tasks": ["analysis", "generation"]
        }
    }
}

# Initialize with hybrid LLM config
qa_gen = SemanticQAGen(config_dict=config)

# Process document using hybrid approach
# - Local model will handle validation
# - Remote model will handle analysis and question generation
result = qa_gen.process_document("document.pdf")

Custom Question Categories

from semantic_qa_gen import SemanticQAGen

config = {
    "question_generation": {
        "max_questions_per_chunk": 12,
        "categories": {
            "factual": {
                "min_questions": 4,  # Prefer more factual questions
                "weight": 1.5
            },
            "inferential": {
                "min_questions": 3,
                "weight": 1.2
            },
            "conceptual": {
                "min_questions": 2,
                "weight": 1.0
            },
            "applied": {  # Custom category - practical applications
                "min_questions": 3,
                "weight": 1.3
            }
        }
    }
}

qa_gen = SemanticQAGen(config_dict=config)

Processing with Checkpoints

from semantic_qa_gen import SemanticQAGen

config = {
    "processing": {
        "enable_checkpoints": True,
        "checkpoint_interval": 5  # Save checkpoints every 5 chunks
    }
}

qa_gen = SemanticQAGen(config_dict=config)
result = qa_gen.process_document("large_document.pdf")

Extension

SemanticQAGen is designed to be easily extended with custom components.

Creating a Custom Document Loader

import os
from typing import Any, Dict, Optional

from semantic_qa_gen.document.loaders.base import BaseLoader
from semantic_qa_gen.document.models import Document, DocumentType, DocumentMetadata
from semantic_qa_gen.utils.error import DocumentError

class CustomFileLoader(BaseLoader):
    """Loader for custom file format."""
    
    def __init__(self, config: Optional[Dict[str, Any]] = None):
        super().__init__(config)
        
    def load(self, path: str) -> Document:
        """Load a document from a custom file format."""
        if not self.supports_type(path):
            raise DocumentError(f"Unsupported file type: {path}")
            
        # Implementation for loading custom format
        with open(path, 'r', encoding='utf-8') as file:
            content = file.read()
            
        # Create and return document
        return Document(
            content=content,
            doc_type=DocumentType.TEXT,
            path=path,
            metadata=self.extract_metadata(path)
        )
        
    def supports_type(self, file_path: str) -> bool:
        """Check if this loader supports the given file type."""
        _, ext = os.path.splitext(file_path.lower())
        return ext == '.custom'
        
    def extract_metadata(self, path: str) -> DocumentMetadata:
        """Extract metadata from the custom file."""
        # Implementation for extracting metadata
        return DocumentMetadata(
            title=os.path.basename(path),
            source=path
        )
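
The supports_type check above matches on the lowercased file extension. The sketch below isolates that behavior with stand-in classes so it runs without the library. ExtensionLoader and pick_loader are hypothetical, and whether SemanticQAGen selects loaders with a first-match rule like this one is an assumption.

```python
import os


class ExtensionLoader:
    """Stand-in loader that matches files by extension (not the library's BaseLoader)."""

    def __init__(self, extension: str):
        self.extension = extension

    def supports_type(self, file_path: str) -> bool:
        _, ext = os.path.splitext(file_path.lower())
        return ext == self.extension


def pick_loader(loaders, file_path):
    """Return the first loader that claims the file, or None."""
    for loader in loaders:
        if loader.supports_type(file_path):
            return loader
    return None


loaders = [ExtensionLoader(".txt"), ExtensionLoader(".custom")]
chosen = pick_loader(loaders, "reports/Q3.CUSTOM")
print(chosen.extension)  # .custom (matching is case-insensitive)
print(pick_loader(loaders, "archive.tar.gz"))  # None: splitext only sees ".gz"
```

Note that os.path.splitext returns only the final suffix, so a loader for a compound extension such as .tar.gz would need its own check.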

Creating a Custom Validator

from typing import Any, Dict, Optional

from semantic_qa_gen.question.validation.base import BaseValidator, ValidationResult
from semantic_qa_gen.document.models import Question, Chunk

class CustomValidator(BaseValidator):
    """Custom validator for specialized validation logic."""
    
    def __init__(self, config: Optional[Dict[str, Any]] = None):
        super().__init__(config)
        self.threshold = self.config.get('threshold', 0.7)
    
    async def validate(
        self,
        question: Question,
        chunk: Chunk,
        llm_validation_data: Optional[Dict[str, Any]] = None,
    ) -> ValidationResult:
        """Implement custom validation logic."""
        # Custom validation implementation
        score = 0.8  # Example score
        
        return ValidationResult(
            question_id=question.id,
            validator_name=self.name,
            is_valid=score >= self.threshold,
            scores={"custom_score": score},
            reasons=[f"Custom validation: {score:.2f}"],
            suggested_improvements=None if score >= self.threshold else "Suggestion for improvement"
        )
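
The placeholder score in the validator could be replaced by any heuristic that returns a number between 0 and 1. As one possibility, the sketch below measures how much of an answer's vocabulary also appears in the source text. grounding_score is a hypothetical helper, and it takes plain strings because the attribute names on Question and Chunk are not shown in this document.

```python
import re


def grounding_score(answer: str, chunk_text: str) -> float:
    """Fraction of distinct answer words that also occur in the chunk text."""
    answer_words = set(re.findall(r"[a-z0-9]+", answer.lower()))
    if not answer_words:
        return 0.0
    chunk_words = set(re.findall(r"[a-z0-9]+", chunk_text.lower()))
    return len(answer_words & chunk_words) / len(answer_words)


chunk_text = "Photosynthesis converts light energy into chemical energy in chloroplasts."
print(grounding_score("Light energy becomes chemical energy", chunk_text))  # 0.75
print(grounding_score("Mitochondria produce ATP", chunk_text))  # 0.0
```

Word overlap is a crude signal: it rewards copied phrasing and penalizes correct paraphrases, so it suits a cheap pre-filter better than a final verdict.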

Creating a Custom Chunking Strategy

from typing import Any, Dict, List, Optional

from semantic_qa_gen.chunking.strategies.base import BaseChunkingStrategy
from semantic_qa_gen.document.models import Document, Section, Chunk

class CustomChunkingStrategy(BaseChunkingStrategy):
    """Custom strategy for document chunking."""
    
    def __init__(self, config: Optional[Dict[str, Any]] = None):
        super().__init__(config)
        self.target_size = self.config.get('target_chunk_size', 1500)
        
    def chunk_document(self, document: Document, sections: List[Section]) -> List[Chunk]:
        """Break a document into chunks using a custom strategy."""
        chunks = []
        
        # Custom implementation of chunking algorithm
        
        return chunks
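
One simple algorithm that could fill in the body of chunk_document is greedy paragraph packing: keep appending paragraphs until the next one would push the chunk past the target size. The sketch below works on plain strings rather than the library's Document, Section, and Chunk models, whose constructors are not shown here. pack_paragraphs is a hypothetical name.

```python
from typing import List


def pack_paragraphs(text: str, target_size: int = 1500) -> List[str]:
    """Greedily group paragraphs into chunks of roughly target_size characters."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        # Start a new chunk when adding this paragraph would overshoot the target.
        if current and current_len + len(paragraph) > target_size:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += len(paragraph)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "\n\n".join(["a" * 600, "b" * 600, "c" * 600, "d" * 100])
print([len(c) for c in pack_paragraphs(sample, target_size=1500)])  # [1202, 702]
```

Paragraphs are never split, so a single paragraph longer than target_size becomes an oversized chunk of its own. A real strategy would likely need a fallback for that case.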

Troubleshooting

Common Issues and Solutions

Installation Problems

Issue: Missing optional dependencies after installing
Solution: Install the extra for the feature you need (for example [pdf] or [docx]), or install every optional dependency at once:

pip install semantic-qa-gen[full]

Issue: ImportError or "failed to find libmagic" on startup
Solution: python-magic needs the system libmagic library. On Linux or macOS, install it with sudo apt-get install libmagic1 or brew install libmagic. On Windows, install the python-magic-bin package instead.

Issue: OCR produces no text from scanned PDFs
Solution: Install the [ocr] extra and the Tesseract engine itself (sudo apt-get install tesseract-ocr or brew install tesseract), then set ocr_enabled: true in the PDF loader config.

Issue: Conflicts with existing packages
Solution: Use a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install semantic-qa-gen

Processing Issues

Issue: Out-of-memory errors with large documents
Solution: Use smaller chunks, reduce concurrency, and checkpoint more often:

config = {
    "chunking": {
        "target_chunk_size": 1000,  # Smaller chunks
        "max_chunk_size": 1500
    },
    "processing": {
        "concurrency": 1,  # Reduce concurrency
        "enable_checkpoints": True,
        "checkpoint_interval": 3  # More frequent checkpoints
    }
}

Issue: Slow processing with PDF documents
Solution: Disable PDF features you do not need:

config = {
    "document": {
        "loaders": {
            "pdf": {
                "extract_images": False,
                "ocr_enabled": False,
                "use_advanced_reading_order": False
            }
        }
    }
}

LLM Service Issues

Issue: OpenAI rate limits
Solution: Lower the rate limiting settings:

config = {
    "llm_services": {
        "remote": {
            "rate_limit_tokens": 60000,  # Reduce token usage
            "rate_limit_requests": 50  # Reduce requests per minute
        }
    }
}

Issue: Local LLM not responding
Solution: Check the connection settings and increase the timeout:

config = {
    "llm_services": {
        "local": {
            "url": "http://localhost:11434/api",  # Verify URL
            "timeout": 120  # Increase timeout
        }
    }
}
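
Before raising the timeout, it is worth confirming that something is listening at the configured address. The snippet below is a small standard-library check, independent of SemanticQAGen. host_and_port and is_reachable are hypothetical helper names, and a successful TCP connection only shows that the port is open, not that the server speaks the expected API.

```python
import socket
from urllib.parse import urlsplit


def host_and_port(url: str):
    """Split a service URL into (host, port), defaulting the port by scheme."""
    parts = urlsplit(url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    host, port = host_and_port(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


print(host_and_port("http://localhost:11434/api"))  # ('localhost', 11434)
print(is_reachable("http://localhost:11434/api"))   # True only if a server is listening
```

If the check fails, the problem is with the server or the URL rather than the timeout.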

Logging and Debugging

To enable detailed logging for troubleshooting:

from semantic_qa_gen import SemanticQAGen
import logging

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)

# Or enable verbose mode
qa_gen = SemanticQAGen(verbose=True)
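
Root-level DEBUG logging also turns on debug output from every other library in the process. If that is too noisy, the standard logging module lets you raise the level for a single logger namespace. The sketch below assumes the library's loggers are named after its modules, under "semantic_qa_gen", which is the usual convention but is not confirmed here.

```python
import logging

# Keep third-party libraries at WARNING so their output does not drown the log.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

# Assumption: the library's loggers live under the "semantic_qa_gen" namespace.
library_logger = logging.getLogger("semantic_qa_gen")
library_logger.setLevel(logging.DEBUG)
```

Child loggers such as "semantic_qa_gen.chunking" inherit the DEBUG level unless they set their own.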

For CLI usage:

semantic-qa-gen process document.pdf -o output --verbose

License

SemanticQAGen is released under the MIT License.

Release history

0.5.2 (this version): May 30, 2026
0.5.1: May 30, 2026
0.1.0: Apr 20, 2025

