
analysis-framework-base

Lightweight foundation package defining shared interfaces and contracts for document analysis frameworks.


Overview

analysis-framework-base provides abstract base classes and data models that establish a consistent interface across multiple specialized document analysis frameworks. This package serves as the foundation for:

  • xml-analysis-framework - XML and S1000D technical documentation
  • docling-analysis-framework - PDF, Word, PowerPoint via Docling
  • document-analysis-framework - General document processing
  • data-analysis-framework - Structured data analysis

Key Features

  • Zero Dependencies - Pure Python standard library only
  • Type Hints - Full typing support for better IDE integration
  • Multiple Access Patterns - Dict-style and attribute-style access
  • Extensible - Easy to implement new frameworks
  • Well Documented - Comprehensive docstrings and examples

Installation

pip install analysis-framework-base

For Development

pip install "analysis-framework-base[dev]"

Quick Start

Implementing a Framework Analyzer

from analysis_framework_base import (
    BaseAnalyzer,
    UnifiedAnalysisResult,
    UnsupportedFormatError,
    AnalysisError
)

class MyDocumentAnalyzer(BaseAnalyzer):
    """Custom analyzer for my document format."""

    def analyze_unified(self, file_path: str, **kwargs) -> UnifiedAnalysisResult:
        """Analyze a document and return unified results."""
        # Check the file format against the supported extensions
        if not any(file_path.endswith(ext) for ext in self.get_supported_formats()):
            raise UnsupportedFormatError(f"Unsupported format: {file_path}")

        try:
            # Perform analysis
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()

            return UnifiedAnalysisResult(
                document_type="MyDoc Technical Document",
                confidence=0.95,
                framework="my-doc-analyzer",
                metadata={
                    "version": "1.0",
                    "word_count": len(content.split())
                },
                content=content,
                ai_opportunities=[
                    "Document summarization",
                    "Question answering",
                    "Entity extraction"
                ]
            )
        except Exception as e:
            raise AnalysisError(f"Analysis failed: {e}") from e

    def get_supported_formats(self) -> list[str]:
        """Return supported file extensions."""
        return ['.mydoc', '.md']

Implementing a Document Chunker

from analysis_framework_base import (
    BaseChunker,
    ChunkInfo,
    UnifiedAnalysisResult,
    ChunkingError
)

class MyDocumentChunker(BaseChunker):
    """Custom chunker for my document format."""

    def chunk_document(
        self,
        file_path: str,
        analysis: UnifiedAnalysisResult,
        strategy: str = "auto",
        **kwargs
    ) -> list[ChunkInfo]:
        """Split document into chunks."""
        if strategy not in self.get_supported_strategies():
            raise ChunkingError(f"Unknown strategy: {strategy}")

        # Implement chunking logic
        chunks = []
        content = analysis.content or ""

        # Simple paragraph-based chunking; skip empty paragraphs
        paragraphs = [p.strip() for p in content.split('\n\n') if p.strip()]
        for i, para in enumerate(paragraphs):
            chunk = ChunkInfo(
                chunk_id=f"{file_path}_chunk_{i:04d}",
                content=para,
                metadata={
                    "paragraph_index": i,
                    "source_file": file_path
                },
                token_count=int(len(para.split()) * 1.3),  # Rough estimate
                chunk_type="paragraph"
            )
            chunks.append(chunk)

        return chunks

    def get_supported_strategies(self) -> list[str]:
        """Return supported chunking strategies."""
        return ['auto', 'paragraph', 'sliding_window']

Using the Framework

# Initialize analyzer
analyzer = MyDocumentAnalyzer()

# Analyze a document
result = analyzer.analyze_unified('document.mydoc')

# Access results (multiple patterns supported)
print(result.document_type)           # Attribute access
print(result['confidence'])           # Dict-style access
print(result.get('framework'))        # Dict get method

# Convert to dict for JSON serialization
result_dict = result.to_dict()

# Initialize chunker
chunker = MyDocumentChunker()

# Chunk the document
chunks = chunker.chunk_document('document.mydoc', result, strategy='paragraph')

# Process chunks
for chunk in chunks:
    print(f"Chunk {chunk.chunk_id}: {chunk.token_count} tokens")
    print(chunk.content[:100])  # First 100 chars

Core Interfaces

BaseAnalyzer

Abstract base class for document analyzers. Requires implementation of:

  • analyze_unified(file_path: str, **kwargs) -> UnifiedAnalysisResult
  • get_supported_formats() -> List[str]

BaseChunker

Abstract base class for document chunkers. Requires implementation of:

  • chunk_document(file_path: str, analysis: UnifiedAnalysisResult, strategy: str, **kwargs) -> List[ChunkInfo]
  • get_supported_strategies() -> List[str]
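The two contracts above can be sketched with the standard-library `abc` module. This is a hypothetical rendering of the documented signatures, not the package's actual source:

```python
from abc import ABC, abstractmethod
from typing import Any

class BaseAnalyzer(ABC):
    """Sketch of the analyzer contract described above."""

    @abstractmethod
    def analyze_unified(self, file_path: str, **kwargs: Any) -> Any:
        """Analyze a document and return a UnifiedAnalysisResult."""

    @abstractmethod
    def get_supported_formats(self) -> list[str]:
        """Return supported file extensions."""

class BaseChunker(ABC):
    """Sketch of the chunker contract described above."""

    @abstractmethod
    def chunk_document(self, file_path: str, analysis: Any,
                       strategy: str = "auto", **kwargs: Any) -> list[Any]:
        """Split a document into ChunkInfo objects."""

    @abstractmethod
    def get_supported_strategies(self) -> list[str]:
        """Return supported chunking strategies."""
```

Because the methods are marked `@abstractmethod`, instantiating a subclass that has not implemented both of them raises `TypeError`, which is how the package can enforce the interface at runtime.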

Data Models

UnifiedAnalysisResult

Standard result structure with:

  • document_type: str - Human-readable document type
  • confidence: float - Confidence score (0.0-1.0)
  • framework: str - Framework identifier
  • metadata: Dict[str, Any] - Framework-specific metadata
  • content: Optional[str] - Extracted text content
  • ai_opportunities: List[str] - Suggested AI use cases
  • raw_analysis: Dict[str, Any] - Complete framework results

Supports both attribute and dict-style access:

result.document_type        # Attribute
result['document_type']     # Dict-style
result.get('document_type') # Get method
'document_type' in result   # Contains check
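The dual access pattern can be sketched as a thin wrapper around a plain dict. This is a hypothetical illustration of how such a class might work, not the package's actual implementation:

```python
from typing import Any

class DualAccess:
    """Minimal sketch: dict-backed object with attribute access."""

    def __init__(self, **fields: Any) -> None:
        self._data = dict(fields)

    def __getattr__(self, name: str) -> Any:
        # Called only when normal attribute lookup fails
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name)

    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)

    def __contains__(self, key: str) -> bool:
        return key in self._data

r = DualAccess(document_type="Report", confidence=0.9)
print(r.document_type)   # Report
print(r["confidence"])   # 0.9
print("framework" in r)  # False
```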

ChunkInfo

Standard chunk structure with:

  • chunk_id: str - Unique identifier
  • content: str - Chunk text content
  • metadata: Dict[str, Any] - Chunk metadata
  • token_count: int - Estimated token count
  • chunk_type: str - Type (text, code, table, etc.)
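The fields above map naturally onto a dataclass. A hedged sketch (the package's actual class may differ in defaults and behavior):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ChunkInfoSketch:
    """Hypothetical dataclass mirroring the documented ChunkInfo fields."""
    chunk_id: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)
    token_count: int = 0
    chunk_type: str = "text"

chunk = ChunkInfoSketch(chunk_id="doc_chunk_0001",
                        content="Hello world",
                        token_count=2)
print(chunk.chunk_type)  # text
```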

Exception Hierarchy

FrameworkError                  # Base exception
├── UnsupportedFormatError     # File format not supported
├── AnalysisError              # Analysis failed
└── ChunkingError              # Chunking failed
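The hierarchy above translates directly into Python; catching the base `FrameworkError` handles any of the subclasses. A minimal rendering (names taken from the documented hierarchy):

```python
class FrameworkError(Exception):
    """Base exception for all framework errors."""

class UnsupportedFormatError(FrameworkError):
    """Raised when a file format is not supported."""

class AnalysisError(FrameworkError):
    """Raised when analysis fails."""

class ChunkingError(FrameworkError):
    """Raised when chunking fails."""

# One except clause covers every framework-specific failure:
try:
    raise UnsupportedFormatError("bad format")
except FrameworkError as e:
    print(type(e).__name__)  # UnsupportedFormatError
```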

Constants

ChunkStrategy Enum

Standard chunking strategy names:

  • AUTO - Framework auto-selects best strategy
  • HIERARCHICAL - Structure-based (sections, headings)
  • SLIDING_WINDOW - Fixed-size overlapping chunks
  • CONTENT_AWARE - Semantic boundary detection
  • STRUCTURAL - Element-based (paragraphs, tables)
  • TABLE_AWARE - Special table handling
  • PAGE_AWARE - Page-boundary chunking

from analysis_framework_base import ChunkStrategy

strategy = ChunkStrategy.HIERARCHICAL
print(strategy.value)  # 'hierarchical'
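Assuming lowercase string values throughout (only 'hierarchical' is confirmed by the example above), the enum could be defined like this:

```python
from enum import Enum

class ChunkStrategy(str, Enum):
    """Sketch of the strategy enum; member values are assumptions."""
    AUTO = "auto"
    HIERARCHICAL = "hierarchical"
    SLIDING_WINDOW = "sliding_window"
    CONTENT_AWARE = "content_aware"
    STRUCTURAL = "structural"
    TABLE_AWARE = "table_aware"
    PAGE_AWARE = "page_aware"

print(ChunkStrategy.HIERARCHICAL.value)  # hierarchical
```

Mixing in `str` lets the members compare equal to plain strings, so a strategy passed as `"paragraph"`-style text can be checked against the enum without explicit conversion.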

Framework Suite

This package is part of a suite of analysis frameworks:

  • xml-analysis-framework - 29+ specialized XML handlers, S1000D support, hierarchical chunking
  • docling-analysis-framework - PDF, DOCX, PPTX via Docling, table-aware chunking
  • document-analysis-framework - General document processing, format detection
  • data-analysis-framework - CSV, JSON, Parquet with query paradigm

Each framework implements the interfaces defined in this package for consistent usage.

Development

Setup Development Environment

# Clone repository
git clone https://github.com/redhat-ai-americas/analysis-framework-base.git
cd analysis-framework-base

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install with dev dependencies
pip install -e ".[dev]"

Running Tests

# Run all tests with coverage
pytest

# Run specific test file
pytest tests/test_models.py

# Run with verbose output
pytest -v

# Generate HTML coverage report
pytest --cov-report=html

Code Quality

# Format code
black src/ tests/

# Check formatting
black --check src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Run tests and quality checks
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

Code Standards

  • Follow PEP 8 style guidelines
  • Add type hints to all functions
  • Write comprehensive docstrings
  • Include examples in docstrings
  • Maintain test coverage above 80%
  • Keep the package dependency-free (stdlib only)

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Authors

Red Hat AI Americas Team

Changelog

See CHANGELOG.md for version history and changes.
