AI-ready analysis framework for PDF and Office documents using Docling for content extraction - part of the unified analysis framework suite

Docling Analysis Framework

Python 3.8+ | License: MIT | Docling Powered | AI Ready

AI-ready analysis framework for PDF and Office documents using Docling for content extraction. Transforms Docling's output into optimized chunks and structured analysis for AI/ML pipelines.

🔗 Part of Analysis Framework Suite

This framework is part of a unified suite of document analysis tools that share a consistent interface.

All frameworks implement the same BaseAnalyzer and BaseChunker interfaces from analysis-framework-base, enabling:

  • Consistent API across document types
  • Easy framework switching with minimal code changes
  • Unified result format for downstream processing
  • Shared tooling and utilities
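
To illustrate what a shared interface makes possible, here is a minimal sketch. Only the name `BaseAnalyzer` comes from the text above; the `analyze` method, the result keys, and the two toy analyzers are assumptions made for this example and are not the actual definitions in analysis-framework-base.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseAnalyzer(ABC):
    """Hypothetical stand-in for the shared analyzer interface."""

    @abstractmethod
    def analyze(self, file_path: str) -> Dict[str, Any]:
        """Return a result dict for the given file."""


class ToyPdfAnalyzer(BaseAnalyzer):
    def analyze(self, file_path: str) -> Dict[str, Any]:
        return {"framework": "toy-pdf", "file_path": file_path}


class ToyXmlAnalyzer(BaseAnalyzer):
    def analyze(self, file_path: str) -> Dict[str, Any]:
        return {"framework": "toy-xml", "file_path": file_path}


def analyze_all(analyzer: BaseAnalyzer, paths: List[str]) -> List[Dict[str, Any]]:
    # Downstream code depends only on the interface, so swapping
    # one analyzer for another requires no changes here.
    return [analyzer.analyze(path) for path in paths]


results = analyze_all(ToyPdfAnalyzer(), ["a.pdf", "b.pdf"])
```

Passing `ToyXmlAnalyzer()` instead changes which framework handles the files while `analyze_all` stays the same.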

๐Ÿ“ For text, code, and configuration files, use our companion document-analysis-framework which uses only Python standard library.

🚀 Quick Start

Simple API - Get Started in Seconds

import docling_analysis_framework as daf

# 🎯 One-line analysis with Docling extraction
result = daf.analyze("document.pdf")
print(f"Document type: {result['document_type'].type_name}")
print(f"Pages: {result['document_type'].pages}")
print(f"Handler used: {result['handler_used']}")

# ✂️ Smart chunking for AI/ML
chunks = daf.chunk("document.pdf", strategy="auto")
print(f"Created {len(chunks)} optimized chunks")

# 🚀 Enhanced analysis with both analysis and chunking
enhanced = daf.analyze_enhanced("document.pdf", chunking_strategy="structural")
print(f"Document: {enhanced['analysis']['document_type'].type_name}")
print(f"Chunks: {len(enhanced['chunks'])}")

# 💾 Save chunks to JSON
daf.save_chunks_to_json(chunks, "chunks_output.json")

# 💾 Save analysis to JSON
daf.save_analysis_to_json(result, "analysis_output.json")

🔄 Unified Interface Support

This framework now supports the unified interface standard, providing consistent access patterns across all analysis frameworks:

import docling_analysis_framework as daf

# Use the unified interface
result = daf.analyze_unified("document.pdf")

# All access patterns work consistently
doc_type = result['document_type']        # Dict access ✓
doc_type = result.document_type           # Attribute access ✓
doc_type = result.get('document_type')    # get() method ✓
as_dict = result.to_dict()                # Full dict conversion ✓

# Works the same across all frameworks
print(f"Framework: {result.framework}")   # 'docling-analysis-framework'
print(f"Type: {result.document_type}")
print(f"Confidence: {result.confidence}")
print(f"AI opportunities: {result.ai_opportunities}")

The unified interface ensures compatibility when switching between frameworks or using multiple frameworks together.
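
As a rough illustration of how one object can support all four access patterns, consider the sketch below. `UnifiedResult` is a hypothetical class written for this example; the framework's real result type may be implemented differently.

```python
from typing import Any, Dict


class UnifiedResult:
    """Hypothetical result wrapper supporting dict, attribute, and get() access."""

    def __init__(self, **fields: Any) -> None:
        self._fields: Dict[str, Any] = dict(fields)

    def __getitem__(self, key: str) -> Any:
        # result['document_type']
        return self._fields[key]

    def __getattr__(self, name: str) -> Any:
        # result.document_type; called only when normal attribute lookup fails.
        try:
            return self.__dict__["_fields"][name]
        except KeyError:
            raise AttributeError(name) from None

    def get(self, key: str, default: Any = None) -> Any:
        # result.get('document_type')
        return self._fields.get(key, default)

    def to_dict(self) -> Dict[str, Any]:
        # Return a copy so callers cannot mutate internal state.
        return dict(self._fields)


result = UnifiedResult(
    framework="docling-analysis-framework",
    document_type="PDF",
    confidence=0.95,
)
```

All three lookups (`result["document_type"]`, `result.document_type`, `result.get("document_type")`) read from the same underlying dictionary, which is what keeps them consistent.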

Advanced Usage

import docling_analysis_framework as daf

# Enhanced analysis with full results
enhanced = daf.analyze_enhanced("research_paper.pdf")

print(f"Type: {enhanced['analysis']['document_type'].type_name}")
print(f"Confidence: {enhanced['analysis']['confidence']:.2f}")
print(f"AI use cases: {len(enhanced['analysis']['analysis'].ai_use_cases)}")
if enhanced['analysis']['analysis'].quality_metrics:
    for metric, score in enhanced['analysis']['analysis'].quality_metrics.items():
        print(f"{metric}: {score:.2f}")

# Different chunking strategies
hierarchical_chunks = daf.chunk("document.pdf", strategy="structural")
table_aware_chunks = daf.chunk("document.pdf", strategy="table_aware") 
page_aware_chunks = daf.chunk("document.pdf", strategy="page_aware")

# Process chunks
for chunk in hierarchical_chunks:
    print(f"Chunk {chunk.chunk_id}: {chunk.token_count} tokens")
    print(f"Type: {chunk.chunk_type}")
    # Send to your AI model...

# 💾 Save different strategies to separate files
strategies = {
    "structural": hierarchical_chunks,
    "table_aware": table_aware_chunks,
    "page_aware": page_aware_chunks
}

for strategy_name, chunks in strategies.items():
    daf.save_chunks_to_json(chunks, f"chunks_{strategy_name}.json")
    print(f"Saved {len(chunks)} chunks to chunks_{strategy_name}.json")

Expert Usage - Direct Class Access

# For advanced customization, use the classes directly
from docling_analysis_framework import DoclingAnalyzer, DoclingChunkingOrchestrator, ChunkingConfig

analyzer = DoclingAnalyzer(max_file_size_mb=100)
result = analyzer.analyze_document("document.pdf")

# Custom chunking with config
config = ChunkingConfig(
    max_chunk_size=1500,
    min_chunk_size=200,
    overlap_size=100,
    preserve_structure=True
)

orchestrator = DoclingChunkingOrchestrator(config=config)
# Note: In advanced usage, you'd pass the actual docling_result object
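
To make the size and overlap parameters concrete, here is a simplified sketch of overlapping, size-bounded chunking. It is not the framework's algorithm: `ToyChunkingConfig` and `sliding_window_chunks` are hypothetical names, sizes are measured in characters for simplicity, and `preserve_structure` is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyChunkingConfig:
    """Hypothetical config mirroring three of the parameter names above."""
    max_chunk_size: int = 1500
    min_chunk_size: int = 200
    overlap_size: int = 100


def sliding_window_chunks(text: str, config: ToyChunkingConfig) -> List[str]:
    """Cut text into windows of max_chunk_size that overlap by overlap_size."""
    if config.overlap_size >= config.max_chunk_size:
        raise ValueError("overlap_size must be smaller than max_chunk_size")

    step = config.max_chunk_size - config.overlap_size
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + config.max_chunk_size])
        if start + config.max_chunk_size >= len(text):
            break  # this window already reaches the end of the text

    # Fold a too-short final chunk into its predecessor, skipping the
    # overlapping prefix that the predecessor already contains.
    if len(chunks) > 1 and len(chunks[-1]) < config.min_chunk_size:
        tail = chunks.pop()
        chunks[-1] += tail[config.overlap_size:]
    return chunks


demo_config = ToyChunkingConfig(max_chunk_size=10, min_chunk_size=3, overlap_size=2)
demo_chunks = sliding_window_chunks("abcdefghijklmnopqrstuvwxy", demo_config)
```

Each chunk in `demo_chunks` begins with the last two characters of the previous one, which is the role `overlap_size` plays in keeping context across chunk boundaries.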

📋 Supported Document Types

Format | Extensions | Confidence | Powered By
📄 PDF Documents | .pdf | 95% | Docling extraction
📝 Word Documents | .docx | 95% | Docling extraction
📊 Spreadsheets | .xlsx | 70% | Docling extraction
📅 Presentations | .pptx | 70% | Docling extraction
🖼️ Images with Text | .png, .jpg, .tiff | 70% | Docling OCR

🎯 Key Features

🔍 Docling-Powered Extraction

  • PDF text extraction - High-quality content extraction
  • Table detection - Preserves table structure in markdown
  • Figure references - Maintains image/figure relationships
  • Header hierarchy - Document structure preservation

🤖 AI Preparation Layer

  • Quality assessment - Extraction quality scoring
  • Structure analysis - Document type detection and analysis
  • Chunk optimization - AI-ready segmentation strategies
  • Rich metadata - Page counts, figures, tables, quality metrics

⚡ Smart Chunking Strategies

  • Structural chunking - Respects document hierarchy (headers, sections)
  • Table-aware chunking - Separates tables from text content
  • Page-aware chunking - Considers original page boundaries
  • Auto-selection - Document-type-aware strategy selection
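
The idea behind structural chunking can be sketched with a few lines of Markdown parsing. This is an illustration only, assuming the extracted content is Markdown with `#`-style headings; `split_by_headings` is a hypothetical helper, not the framework's implementation.

```python
import re
from typing import Dict, List, Optional

# Matches ATX-style Markdown headings such as "## Methods".
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_by_headings(markdown: str) -> List[Dict[str, object]]:
    """Split Markdown into sections, one per heading."""
    sections: List[Dict[str, object]] = []
    title: Optional[str] = None
    level = 0
    lines: List[str] = []

    def flush() -> None:
        content = "\n".join(lines).strip()
        # Skip an empty preamble before the first heading.
        if title is not None or content:
            sections.append({"title": title, "level": level, "content": content})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the section that was being collected
            title, level, lines = match.group(2), len(match.group(1)), []
        else:
            lines.append(line)
    flush()
    return sections


sample = "Preamble text\n# Intro\nHello\n\n## Methods\nWe did things.\n"
sections = split_by_headings(sample)
```

Each resulting section keeps its heading level, so a later step could merge small subsections into their parent or split oversized ones.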

📊 Extraction Quality Analysis

  • Text coverage - How much text was successfully extracted
  • Structure preservation - Whether headers, lists, tables are maintained
  • Overall confidence - Combined quality score for AI processing
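
One simple way to combine such metrics into a single score is a weighted average, sketched below. The weighting is an assumption chosen for illustration; the framework's actual scoring formula is not documented here, and `combine_quality` is a hypothetical function.

```python
from typing import Dict


def combine_quality(
    text_coverage: float,
    structure_score: float,
    coverage_weight: float = 0.6,  # assumed weight, not taken from the framework
) -> Dict[str, float]:
    """Blend two scores in [0, 1] into an overall score in [0, 1]."""
    checks = {
        "text_coverage": text_coverage,
        "structure_score": structure_score,
        "coverage_weight": coverage_weight,
    }
    for name, value in checks.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    overall = coverage_weight * text_coverage + (1.0 - coverage_weight) * structure_score
    return {
        "text_coverage": text_coverage,
        "structure_score": structure_score,
        "overall_score": overall,
    }


quality = combine_quality(text_coverage=1.0, structure_score=0.5, coverage_weight=0.5)
```

A downstream pipeline could then compare `quality["overall_score"]` against a threshold of its own choosing before sending a document to a model.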

🎉 Framework Status - COMPLETED

The Docling Analysis Framework is now fully functional and follows the same successful patterns as the XML Analysis Framework:

✅ Completed Features

  • 🎯 Simple API: One-line functions for analyze(), chunk(), analyze_enhanced()
  • 🔧 Advanced API: Direct class access for customization with DoclingAnalyzer, DoclingChunkingOrchestrator
  • ⚙️ Configurable Chunking: ChunkingConfig class for fine-tuning chunk parameters
  • 📦 Multiple Strategies: Structural, table-aware, page-aware, and auto-selection chunking
  • 💾 JSON Export: Easy export of analysis results and chunks to JSON files
  • 🛡️ Enhanced Error Handling: Comprehensive logging and error reporting
  • 📊 Quality Metrics: Extraction quality assessment and content analysis
  • 🧪 Testing Framework: Complete Jupyter notebook for validation
  • 📚 Documentation: Comprehensive README with examples and usage patterns

🚀 Ready for AI/ML Integration

The framework provides everything needed for AI/ML pipelines:

  • Token-optimized chunks sized for LLM context windows
  • Rich metadata with document structure and quality metrics
  • JSON export for vector database ingestion
  • Multiple chunking strategies for different document types
  • Quality assessment to determine AI suitability

🔄 Comparison with XML Framework

Feature | XML Framework | Docling Framework | Status
Simple API | ✅ | ✅ | Complete
Advanced Classes | ✅ | ✅ | Complete
Multiple Strategies | ✅ | ✅ | Complete
Configuration | ✅ | ✅ | Complete
JSON Export | ✅ | ✅ | Complete
Error Handling | ✅ | ✅ | Complete
Testing Notebooks | ✅ | ✅ | Complete
Quality Metrics | ✅ | ✅ | Complete

Both frameworks now provide identical API patterns and functionality!

🔧 Installation

# Install Docling first
pip install docling

# Install framework
git clone https://github.com/rdwj/docling-analysis-framework.git
cd docling-analysis-framework  
pip install -e .
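
After installing, a quick way to confirm that both packages are importable is a small check like the one below. The helper `missing_packages` is a hypothetical name introduced for this sketch; it relies only on the standard library.

```python
import importlib.util
from typing import List


def missing_packages(names: List[str]) -> List[str]:
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["docling", "docling_analysis_framework"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("Docling and the framework are both importable.")
```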

📖 Usage Examples

Document Analysis with Quality Assessment

from docling_analysis_framework import DoclingAnalyzer

analyzer = DoclingAnalyzer()
result = analyzer.analyze_document("contract.pdf")

# Access extraction quality
quality = result['analysis'].extraction_quality
print(f"Text coverage: {quality['text_coverage']:.2f}")
print(f"Structure score: {quality['structure_score']:.2f}")
print(f"Overall quality: {quality['overall_score']:.2f}")

# Access document insights
findings = result['analysis'].key_findings
print(f"Pages: {findings['page_count']}")
print(f"Table rows: {findings['table_rows']}")
print(f"Figures: {findings['figure_count']}")

Advanced Chunking for Academic Papers

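# Well suited to research papers with complex structure.
# Assumes `orchestrator` is a DoclingChunkingOrchestrator (see Expert Usage) and that
# `markdown_content` and `docling_result` come from a prior Docling conversion.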
chunks = orchestrator.chunk_document(
    "paper.pdf",
    markdown_content,
    docling_result,
    strategy='structural'
)

# Chunks respect paper structure
for chunk in chunks:
    if chunk.chunk_type == 'section':
        print(f"Section: {chunk.metadata['section_title']}")
    elif chunk.chunk_type == 'table':
        print(f"Table data: {len(chunk.content)} chars")
    elif chunk.chunk_type == 'figure':
        print("Figure reference preserved")

File Size Management

from docling_analysis_framework import DoclingAnalyzer

# Large PDF processing with limits
analyzer = DoclingAnalyzer(max_file_size_mb=100.0)

result = analyzer.analyze_document("large_manual.pdf")
if 'error' in result:
    print(f"File too large: {result['error']}")
else:
    print(f"Successfully processed {result['file_size']} bytes")
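
If you would rather reject oversized files before handing them to the analyzer, a pre-check along these lines is one option. This is a sketch under assumptions: `within_size_limit` is a hypothetical helper, and it treats one megabyte as 1024 * 1024 bytes, which may not match how the framework interprets `max_file_size_mb`.

```python
import os
import tempfile


def within_size_limit(path: str, max_file_size_mb: float = 100.0) -> bool:
    """Return True if the file at `path` is no larger than the limit."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return size_mb <= max_file_size_mb


# Demonstrate with a throwaway 2 KiB file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"x" * 2048)
    demo_path = handle.name

try:
    fits_default = within_size_limit(demo_path)                         # 2 KiB vs 100 MB
    fits_tiny = within_size_limit(demo_path, max_file_size_mb=0.001)    # 2 KiB vs ~1 KiB
finally:
    os.remove(demo_path)
```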

🧪 Framework Ecosystem

This framework is part of a larger document analysis ecosystem:

  • xml-analysis-framework - Specialized XML document analysis
  • docling-analysis-framework - PDF and Office documents (this package)
  • document-analysis-framework - Text, code, and configuration files
  • unified-analysis-orchestrator - Routes documents to appropriate frameworks
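
Routing by file type can be pictured as a lookup from extension to framework. The sketch below is illustrative only: the package names come from the list above and the Docling extensions from the supported-types table, but the `ROUTES` mapping and the `route` function are assumptions, not the orchestrator's actual logic.

```python
from pathlib import Path

# Hypothetical extension-to-framework table.
ROUTES = {
    ".xml": "xml-analysis-framework",
    ".pdf": "docling-analysis-framework",
    ".docx": "docling-analysis-framework",
    ".xlsx": "docling-analysis-framework",
    ".pptx": "docling-analysis-framework",
    ".png": "docling-analysis-framework",
    ".jpg": "docling-analysis-framework",
    ".tiff": "docling-analysis-framework",
}

# Assumed fallback for text, code, and configuration files.
DEFAULT_FRAMEWORK = "document-analysis-framework"


def route(path: str) -> str:
    """Pick a framework name based on the file extension (case-insensitive)."""
    return ROUTES.get(Path(path).suffix.lower(), DEFAULT_FRAMEWORK)


chosen = route("quarterly_report.PDF")
```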

🔗 Integration with Docling

This framework acts as an AI preparation layer on top of Docling:

  1. Docling handles the heavy lifting of document extraction
  2. Our framework adds AI-specific analysis and chunking
  3. Result is AI-ready structured data and optimized chunks
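
# What Docling provides:
from docling.document_converter import DocumentConverter

docling_result = DocumentConverter().convert("document.pdf")
markdown_content = docling_result.document.export_to_markdown()

# What we add (conceptual: our_analyzer, our_chunker, and
# our_quality_assessor are placeholders, not real objects):
ai_analysis = our_analyzer.analyze(docling_result)
ai_chunks = our_chunker.chunk(markdown_content, ai_analysis)
quality_scores = our_quality_assessor.assess(docling_result)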

📄 License

MIT License - see LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Powered by Docling for document extraction
  • Built for modern AI/ML development workflows
  • Part of the AI Building Blocks initiative
