
analysis-framework-base

Lightweight foundation package defining shared interfaces and contracts for document analysis frameworks.


Overview

analysis-framework-base provides abstract base classes and data models that establish a consistent interface across multiple specialized document analysis frameworks. This package serves as the foundation for:

  • xml-analysis-framework - XML and S1000D technical documentation
  • docling-analysis-framework - PDF, Word, PowerPoint via Docling
  • document-analysis-framework - General document processing
  • data-analysis-framework - Structured data analysis

Key Features

  • Zero Dependencies - Pure Python standard library only
  • Type Hints - Full typing support for better IDE integration
  • Multiple Access Patterns - Dict-style and attribute-style access
  • Extensible - Easy to implement new frameworks
  • Well Documented - Comprehensive docstrings and examples

Installation

pip install analysis-framework-base

For Development

pip install "analysis-framework-base[dev]"

Quick Start

Implementing a Framework Analyzer

from analysis_framework_base import (
    BaseAnalyzer,
    UnifiedAnalysisResult,
    UnsupportedFormatError,
    AnalysisError
)

class MyDocumentAnalyzer(BaseAnalyzer):
    """Custom analyzer for my document format."""

    def analyze_unified(self, file_path: str, **kwargs) -> UnifiedAnalysisResult:
        """Analyze a document and return unified results."""
        # Check the file format against the supported extensions
        if not any(file_path.endswith(ext) for ext in self.get_supported_formats()):
            raise UnsupportedFormatError(f"Unsupported format: {file_path}")

        try:
            # Perform analysis
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()

            return UnifiedAnalysisResult(
                document_type="MyDoc Technical Document",
                confidence=0.95,
                framework="my-doc-analyzer",
                metadata={
                    "version": "1.0",
                    "word_count": len(content.split())
                },
                content=content,
                ai_opportunities=[
                    "Document summarization",
                    "Question answering",
                    "Entity extraction"
                ]
            )
        except Exception as e:
            raise AnalysisError(f"Analysis failed: {e}") from e

    def get_supported_formats(self) -> list[str]:
        """Return supported file extensions."""
        return ['.mydoc', '.md']

Implementing a Document Chunker

from analysis_framework_base import (
    BaseChunker,
    ChunkInfo,
    UnifiedAnalysisResult,
    ChunkingError
)

class MyDocumentChunker(BaseChunker):
    """Custom chunker for my document format."""

    def chunk_document(
        self,
        file_path: str,
        analysis: UnifiedAnalysisResult,
        strategy: str = "auto",
        **kwargs
    ) -> list[ChunkInfo]:
        """Split document into chunks."""
        if strategy not in self.get_supported_strategies():
            raise ChunkingError(f"Unknown strategy: {strategy}")

        # Implement chunking logic
        chunks = []
        content = analysis.content or ""

        # Simple paragraph-based chunking; skip empty paragraphs
        paragraphs = [p.strip() for p in content.split('\n\n') if p.strip()]
        for i, para in enumerate(paragraphs):
            chunk = ChunkInfo(
                chunk_id=f"{file_path}_chunk_{i:04d}",
                content=para,
                metadata={
                    "paragraph_index": i,
                    "source_file": file_path
                },
                token_count=int(len(para.split()) * 1.3),  # Rough estimate
                chunk_type="paragraph"
            )
            chunks.append(chunk)

        return chunks

    def get_supported_strategies(self) -> list[str]:
        """Return supported chunking strategies."""
        return ['auto', 'paragraph', 'sliding_window']

Using the Framework

# Initialize analyzer
analyzer = MyDocumentAnalyzer()

# Analyze a document
result = analyzer.analyze_unified('document.mydoc')

# Access results (multiple patterns supported)
print(result.document_type)           # Attribute access
print(result['confidence'])           # Dict-style access
print(result.get('framework'))        # Dict get method

# Convert to dict for JSON serialization
result_dict = result.to_dict()

# Initialize chunker
chunker = MyDocumentChunker()

# Chunk the document
chunks = chunker.chunk_document('document.mydoc', result, strategy='paragraph')

# Process chunks
for chunk in chunks:
    print(f"Chunk {chunk.chunk_id}: {chunk.token_count} tokens")
    print(chunk.content[:100])  # First 100 chars

Core Interfaces

BaseAnalyzer

Abstract base class for document analyzers. Requires implementation of:

  • analyze_unified(file_path: str, **kwargs) -> UnifiedAnalysisResult
  • get_supported_formats() -> List[str]

BaseChunker

Abstract base class for document chunkers. Requires implementation of:

  • chunk_document(file_path: str, analysis: UnifiedAnalysisResult, strategy: str, **kwargs) -> List[ChunkInfo]
  • get_supported_strategies() -> List[str]
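The two contracts above can be sketched with the standard-library `abc` module. This is a hypothetical rendering of the documented signatures, not the package's actual source:

```python
from abc import ABC, abstractmethod
from typing import Any

class BaseAnalyzer(ABC):
    """Sketch of the analyzer contract described above."""

    @abstractmethod
    def analyze_unified(self, file_path: str, **kwargs: Any) -> Any:
        """Analyze a document and return a UnifiedAnalysisResult."""

    @abstractmethod
    def get_supported_formats(self) -> list[str]:
        """Return supported file extensions."""

class BaseChunker(ABC):
    """Sketch of the chunker contract described above."""

    @abstractmethod
    def chunk_document(self, file_path: str, analysis: Any,
                       strategy: str = "auto", **kwargs: Any) -> list[Any]:
        """Split a document into ChunkInfo objects."""

    @abstractmethod
    def get_supported_strategies(self) -> list[str]:
        """Return supported chunking strategies."""
```

Because the methods are marked `@abstractmethod`, instantiating a subclass that has not implemented both of them raises `TypeError`, which is how the package can enforce the interface at runtime.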

Data Models

UnifiedAnalysisResult

Standard result structure with:

  • document_type: str - Human-readable document type
  • confidence: float - Confidence score (0.0-1.0)
  • framework: str - Framework identifier
  • metadata: Dict[str, Any] - Framework-specific metadata
  • content: Optional[str] - Extracted text content
  • ai_opportunities: List[str] - Suggested AI use cases
  • raw_analysis: Dict[str, Any] - Complete framework results

Supports both attribute and dict-style access:

result.document_type        # Attribute
result['document_type']     # Dict-style
result.get('document_type') # Get method
'document_type' in result   # Contains check
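The dual access pattern can be sketched as a thin wrapper around a plain dict. This is a hypothetical illustration of how such a class might work, not the package's actual implementation:

```python
from typing import Any

class DualAccess:
    """Minimal sketch: dict-backed object with attribute access."""

    def __init__(self, **fields: Any) -> None:
        self._data = dict(fields)

    def __getattr__(self, name: str) -> Any:
        # Called only when normal attribute lookup fails
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)

    def __contains__(self, key: str) -> bool:
        return key in self._data

r = DualAccess(document_type="Report", confidence=0.9)
print(r.document_type)   # Report
print(r["confidence"])   # 0.9
print("framework" in r)  # False
```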

ChunkInfo

Standard chunk structure with:

  • chunk_id: str - Unique identifier
  • content: str - Chunk text content
  • metadata: Dict[str, Any] - Chunk metadata
  • token_count: int - Estimated token count
  • chunk_type: str - Type (text, code, table, etc.)
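The fields above map naturally onto a dataclass. A hedged sketch (the package's actual class may differ in defaults and behavior):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ChunkInfoSketch:
    """Hypothetical dataclass mirroring the documented ChunkInfo fields."""
    chunk_id: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    token_count: int = 0
    chunk_type: str = "text"

chunk = ChunkInfoSketch(chunk_id="doc_chunk_0001",
                        content="Hello world",
                        token_count=2)
print(chunk.chunk_type)  # text
```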

Exception Hierarchy

FrameworkError                  # Base exception
├── UnsupportedFormatError     # File format not supported
├── AnalysisError              # Analysis failed
└── ChunkingError              # Chunking failed
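The hierarchy above translates directly into Python; catching the base `FrameworkError` handles any of the subclasses. A minimal rendering (names taken from the documented hierarchy):

```python
class FrameworkError(Exception):
    """Base exception for all framework errors."""

class UnsupportedFormatError(FrameworkError):
    """Raised when a file format is not supported."""

class AnalysisError(FrameworkError):
    """Raised when analysis fails."""

class ChunkingError(FrameworkError):
    """Raised when chunking fails."""

# One except clause covers every framework-specific failure:
try:
    raise UnsupportedFormatError("bad format")
except FrameworkError as e:
    print(type(e).__name__)  # UnsupportedFormatError
```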

Constants

ChunkStrategy Enum

Standard chunking strategy names:

  • AUTO - Framework auto-selects best strategy
  • HIERARCHICAL - Structure-based (sections, headings)
  • SLIDING_WINDOW - Fixed-size overlapping chunks
  • CONTENT_AWARE - Semantic boundary detection
  • STRUCTURAL - Element-based (paragraphs, tables)
  • TABLE_AWARE - Special table handling
  • PAGE_AWARE - Page-boundary chunking

from analysis_framework_base import ChunkStrategy

strategy = ChunkStrategy.HIERARCHICAL
print(strategy.value)  # 'hierarchical'
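Assuming lowercase string values throughout (only 'hierarchical' is confirmed by the example above), the enum could be defined like this:

```python
from enum import Enum

class ChunkStrategy(str, Enum):
    """Sketch of the strategy enum; member values are assumptions."""
    AUTO = "auto"
    HIERARCHICAL = "hierarchical"
    SLIDING_WINDOW = "sliding_window"
    CONTENT_AWARE = "content_aware"
    STRUCTURAL = "structural"
    TABLE_AWARE = "table_aware"
    PAGE_AWARE = "page_aware"

print(ChunkStrategy.HIERARCHICAL.value)  # hierarchical
```

Mixing in `str` lets the members compare equal to plain strings, so a strategy passed as `"paragraph"`-style text can be checked against the enum without explicit conversion.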

Framework Suite

This package is part of a suite of analysis frameworks:

  • xml-analysis-framework - 29+ specialized XML handlers, S1000D support, hierarchical chunking
  • docling-analysis-framework - PDF, DOCX, PPTX via Docling, table-aware chunking
  • document-analysis-framework - General document processing, format detection
  • data-analysis-framework - CSV, JSON, Parquet with query paradigm

Each framework implements the interfaces defined in this package for consistent usage.

Development

Setup Development Environment

# Clone repository
git clone https://github.com/redhat-ai-americas/analysis-framework-base.git
cd analysis-framework-base

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install with dev dependencies
pip install -e ".[dev]"

Running Tests

# Run all tests with coverage
pytest

# Run specific test file
pytest tests/test_models.py

# Run with verbose output
pytest -v

# Generate HTML coverage report
pytest --cov-report=html

Code Quality

# Format code
black src/ tests/

# Check formatting
black --check src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests and quality checks
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Code Standards

  • Follow PEP 8 style guidelines
  • Add type hints to all functions
  • Write comprehensive docstrings
  • Include examples in docstrings
  • Maintain test coverage above 80%
  • Keep the package dependency-free (stdlib only)

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Authors

Red Hat AI Americas Team

Changelog

See CHANGELOG.md for version history and changes.
