Docling Analysis Framework

Python 3.8+ | License: MIT | Docling Powered | AI Ready

AI-ready analysis framework for PDF and Office documents using Docling for content extraction. Transforms Docling's output into optimized chunks and structured analysis for AI/ML pipelines.

📝 For text, code, and configuration files, use our companion document-analysis-framework, which uses only the Python standard library.

🚀 Quick Start

Simple API - Get Started in Seconds

import docling_analysis_framework as daf

# 🎯 One-line analysis with Docling extraction
result = daf.analyze("document.pdf")
print(f"Document type: {result['document_type'].type_name}")
print(f"Pages: {result['document_type'].pages}")
print(f"Handler used: {result['handler_used']}")

# ✂️ Smart chunking for AI/ML
chunks = daf.chunk("document.pdf", strategy="auto")
print(f"Created {len(chunks)} optimized chunks")

# 🚀 Enhanced analysis with both analysis and chunking
enhanced = daf.analyze_enhanced("document.pdf", chunking_strategy="structural")
print(f"Document: {enhanced['analysis']['document_type'].type_name}")
print(f"Chunks: {len(enhanced['chunks'])}")

# 💾 Save chunks to JSON
daf.save_chunks_to_json(chunks, "chunks_output.json")

# 💾 Save analysis to JSON
daf.save_analysis_to_json(result, "analysis_output.json")

Advanced Usage

import docling_analysis_framework as daf

# Enhanced analysis with full results
enhanced = daf.analyze_enhanced("research_paper.pdf")

print(f"Type: {enhanced['analysis']['document_type'].type_name}")
print(f"Confidence: {enhanced['analysis']['confidence']:.2f}")
print(f"AI use cases: {len(enhanced['analysis']['analysis'].ai_use_cases)}")
if enhanced['analysis']['analysis'].quality_metrics:
    for metric, score in enhanced['analysis']['analysis'].quality_metrics.items():
        print(f"{metric}: {score:.2f}")

# Different chunking strategies
hierarchical_chunks = daf.chunk("document.pdf", strategy="structural")
table_aware_chunks = daf.chunk("document.pdf", strategy="table_aware") 
page_aware_chunks = daf.chunk("document.pdf", strategy="page_aware")

# Process chunks
for chunk in hierarchical_chunks:
    print(f"Chunk {chunk.chunk_id}: {chunk.token_count} tokens")
    print(f"Type: {chunk.chunk_type}")
    # Send to your AI model...

# 💾 Save different strategies to separate files
strategies = {
    "structural": hierarchical_chunks,
    "table_aware": table_aware_chunks,
    "page_aware": page_aware_chunks
}

for strategy_name, chunks in strategies.items():
    daf.save_chunks_to_json(chunks, f"chunks_{strategy_name}.json")
    print(f"Saved {len(chunks)} chunks to chunks_{strategy_name}.json")
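Before settling on a strategy, it can help to compare the token distribution of each result. The sketch below is a hypothetical helper (`summarize_chunks` is not part of the framework). It relies only on the `token_count` and `chunk_type` attributes used in the loop above, and `SimpleNamespace` objects stand in for real chunks.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_chunks(chunks):
    """Return basic size statistics for a list of chunk-like objects.

    Assumes each chunk exposes `token_count` and `chunk_type`.
    Hypothetical helper, not part of the framework.
    """
    if not chunks:
        return {"count": 0, "total_tokens": 0, "avg_tokens": 0.0, "by_type": {}}
    total = sum(c.token_count for c in chunks)
    return {
        "count": len(chunks),
        "total_tokens": total,
        "avg_tokens": total / len(chunks),
        "by_type": dict(Counter(c.chunk_type for c in chunks)),
    }


# Stand-in chunks; in practice, pass the list returned by daf.chunk(...).
demo_chunks = [
    SimpleNamespace(chunk_id="c1", token_count=120, chunk_type="section"),
    SimpleNamespace(chunk_id="c2", token_count=80, chunk_type="table"),
    SimpleNamespace(chunk_id="c3", token_count=100, chunk_type="section"),
]
summary = summarize_chunks(demo_chunks)
print(summary)
```

Running the same helper over each entry in `strategies` gives a quick side-by-side view of how aggressively each strategy splits a document.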

Expert Usage - Direct Class Access

# For advanced customization, use the classes directly
from docling_analysis_framework import DoclingAnalyzer, DoclingChunkingOrchestrator, ChunkingConfig

analyzer = DoclingAnalyzer(max_file_size_mb=100)
result = analyzer.analyze_document("document.pdf")

# Custom chunking with config
config = ChunkingConfig(
    max_chunk_size=1500,
    min_chunk_size=200,
    overlap_size=100,
    preserve_structure=True
)

orchestrator = DoclingChunkingOrchestrator(config=config)
# Note: In advanced usage, you'd pass the actual docling_result object

📋 Supported Document Types

| Format | Extensions | Confidence | Powered By |
|---|---|---|---|
| 📄 PDF Documents | .pdf | 95% | Docling extraction |
| 📝 Word Documents | .docx | 95% | Docling extraction |
| 📊 Spreadsheets | .xlsx | 70% | Docling extraction |
| 📅 Presentations | .pptx | 70% | Docling extraction |
| 🖼️ Images with Text | .png, .jpg, .tiff | 70% | Docling OCR |

🎯 Key Features

🔍 Docling-Powered Extraction

  • PDF text extraction - High-quality content extraction
  • Table detection - Preserves table structure in markdown
  • Figure references - Maintains image/figure relationships
  • Header hierarchy - Document structure preservation
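Because tables survive extraction as Markdown, downstream code can separate them from prose with plain string handling. The sketch below is a simplified illustration, not the framework's detector: `split_tables` is a hypothetical helper, and it treats any run of consecutive lines starting with `|` as one table.

```python
def split_tables(markdown):
    """Split Markdown into ("text", body) and ("table", body) blocks.

    A table is taken to be a run of consecutive lines that start with "|".
    This is a simplification of real Markdown table syntax.
    """
    blocks = []
    current_kind, current_lines = None, []
    for line in markdown.splitlines():
        kind = "table" if line.lstrip().startswith("|") else "text"
        if kind != current_kind and current_lines:
            blocks.append((current_kind, "\n".join(current_lines).strip()))
            current_lines = []
        current_kind = kind
        current_lines.append(line)
    if current_lines:
        blocks.append((current_kind, "\n".join(current_lines).strip()))
    # Drop blocks that contained only blank lines.
    return [(kind, body) for kind, body in blocks if body]


sample = "Intro text.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\nClosing text."
for kind, body in split_tables(sample):
    print(kind, repr(body))
```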

🤖 AI Preparation Layer

  • Quality assessment - Extraction quality scoring
  • Structure analysis - Document type detection and analysis
  • Chunk optimization - AI-ready segmentation strategies
  • Rich metadata - Page counts, figures, tables, quality metrics

⚡ Smart Chunking Strategies

  • Structural chunking - Respects document hierarchy (headers, sections)
  • Table-aware chunking - Separates tables from text content
  • Page-aware chunking - Considers original page boundaries
  • Auto-selection - Document-type-aware strategy selection
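As a rough picture of what structural chunking means, the following sketch groups Markdown lines under their nearest header. It is a minimal illustration under assumptions, not the framework's algorithm: `chunk_by_headers` is a hypothetical name, only ATX-style `#` headers are recognised, and `#` lines inside code blocks would be misread as headers.

```python
import re

# ATX headers: one to six "#" characters, whitespace, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.+)$")


def chunk_by_headers(markdown):
    """Return one dict per section with its title, header level, and content."""
    sections = []
    title, level, lines = None, 0, []

    def flush():
        content = "\n".join(lines).strip()
        # Skip an empty preamble, but keep headed sections even if empty.
        if title is not None or content:
            sections.append({"title": title, "level": level, "content": content})

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            title, level, lines = match.group(2).strip(), len(match.group(1)), []
        else:
            lines.append(line)
    flush()
    return sections


doc = "# Title\nIntro\n\n## Methods\nStep one\nStep two"
for section in chunk_by_headers(doc):
    print(section["level"], section["title"], len(section["content"]))
```

Keeping the header level alongside each section makes it possible to rebuild the hierarchy later, for example to attach a parent title to each chunk's metadata.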

📊 Extraction Quality Analysis

  • Text coverage - How much text was successfully extracted
  • Structure preservation - Whether headers, lists, tables are maintained
  • Overall confidence - Combined quality score for AI processing
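One plausible way to combine the individual scores into an overall score is a weighted average. The sketch below is only an assumption about how such a combination could work: the 0.6 / 0.4 weighting and the `overall_quality` helper are hypothetical, and the framework's actual formula is not documented here.

```python
def overall_quality(text_coverage, structure_score, coverage_weight=0.6):
    """Combine two scores in [0, 1] into a weighted average.

    The default weighting is an illustrative assumption, not the
    framework's formula.
    """
    for name, value in (("text_coverage", text_coverage),
                        ("structure_score", structure_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return coverage_weight * text_coverage + (1.0 - coverage_weight) * structure_score


score = overall_quality(text_coverage=0.9, structure_score=0.5)
print(f"Overall quality: {score:.2f}")
```

A pipeline could use a threshold on this kind of score to decide whether a document should be sent to a model or flagged for manual review.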

🎉 Framework Status - COMPLETED

The Docling Analysis Framework is now fully functional and follows the same successful patterns as the XML Analysis Framework:

✅ Completed Features

  • 🎯 Simple API: One-line functions for analyze(), chunk(), analyze_enhanced()
  • 🔧 Advanced API: Direct class access for customization with DoclingAnalyzer, DoclingChunkingOrchestrator
  • ⚙️ Configurable Chunking: ChunkingConfig class for fine-tuning chunk parameters
  • 📦 Multiple Strategies: Structural, table-aware, page-aware, and auto-selection chunking
  • 💾 JSON Export: Easy export of analysis results and chunks to JSON files
  • 🛡️ Enhanced Error Handling: Comprehensive logging and error reporting
  • 📊 Quality Metrics: Extraction quality assessment and content analysis
  • 🧪 Testing Framework: Complete Jupyter notebook for validation
  • 📚 Documentation: Comprehensive README with examples and usage patterns

🚀 Ready for AI/ML Integration

The framework provides everything needed for AI/ML pipelines:

  • Token-optimized chunks sized for LLM context windows
  • Rich metadata with document structure and quality metrics
  • JSON export for vector database ingestion
  • Multiple chunking strategies for different document types
  • Quality assessment to determine AI suitability
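When chunks are sent to a model, several small chunks can often share one request as long as their combined size fits the context window. The sketch below shows a greedy way to do that. It is a hypothetical helper (`batch_by_token_budget` is not part of the framework) and takes plain `(chunk_id, token_count)` pairs rather than real chunk objects.

```python
def batch_by_token_budget(chunks, max_tokens):
    """Greedily group (chunk_id, token_count) pairs into ordered batches.

    Each batch stays within max_tokens, except that a single chunk larger
    than the budget is placed in a batch of its own.
    """
    batches, current, used = [], [], 0
    for chunk_id, token_count in chunks:
        # Start a new batch when adding this chunk would exceed the budget.
        if current and used + token_count > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(chunk_id)
        used += token_count
    if current:
        batches.append(current)
    return batches


pairs = [("a", 400), ("b", 500), ("c", 300), ("d", 1200), ("e", 100)]
batches = batch_by_token_budget(pairs, max_tokens=1000)
print(batches)
```

With real chunks, the pairs could be built as `[(c.chunk_id, c.token_count) for c in chunks]`, using the attributes shown in the Advanced Usage example.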

🔄 Comparison with XML Framework

| Feature | XML Framework | Docling Framework | Status |
|---|---|---|---|
| Simple API | ✅ | ✅ | Complete |
| Advanced Classes | ✅ | ✅ | Complete |
| Multiple Strategies | ✅ | ✅ | Complete |
| Configuration | ✅ | ✅ | Complete |
| JSON Export | ✅ | ✅ | Complete |
| Error Handling | ✅ | ✅ | Complete |
| Testing Notebooks | ✅ | ✅ | Complete |
| Quality Metrics | ✅ | ✅ | Complete |

Both frameworks now provide identical API patterns and functionality!

🔧 Installation

# Install Docling first
pip install docling

# Install framework
git clone https://github.com/redhat-ai-americas/docling-analysis-framework.git
cd docling-analysis-framework  
pip install -e .

📖 Usage Examples

Document Analysis with Quality Assessment

from docling_analysis_framework import DoclingAnalyzer

analyzer = DoclingAnalyzer()
result = analyzer.analyze_document("contract.pdf")

# Access extraction quality
quality = result['analysis'].extraction_quality
print(f"Text coverage: {quality['text_coverage']:.2f}")
print(f"Structure score: {quality['structure_score']:.2f}")
print(f"Overall quality: {quality['overall_score']:.2f}")

# Access document insights
findings = result['analysis'].key_findings
print(f"Pages: {findings['page_count']}")
print(f"Table rows: {findings['table_rows']}")
print(f"Figures: {findings['figure_count']}")

Advanced Chunking for Academic Papers

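# Well suited to research papers with complex structure.
# Assumes `orchestrator` is a DoclingChunkingOrchestrator (see Expert Usage) and
# that `markdown_content` and `docling_result` come from a prior Docling conversion.
chunks = orchestrator.chunk_document(
    "paper.pdf",
    markdown_content,
    docling_result,
    strategy='structural'
)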

# Chunks respect paper structure
for chunk in chunks:
    if chunk.chunk_type == 'section':
        print(f"Section: {chunk.metadata['section_title']}")
    elif chunk.chunk_type == 'table':
        print(f"Table data: {len(chunk.content)} chars")
    elif chunk.chunk_type == 'figure':
        print("Figure reference preserved")

File Size Management

from docling_analysis_framework import DoclingAnalyzer

# Large PDF processing with limits
analyzer = DoclingAnalyzer(max_file_size_mb=100.0)

result = analyzer.analyze_document("large_manual.pdf")
if 'error' in result:
    # The error may be a size-limit violation or another processing failure
    print(f"Analysis failed: {result['error']}")
else:
    print(f"Successfully processed {result['file_size']} bytes")
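If many files are queued, it can be cheaper to check sizes up front than to let each oversized file fail during analysis. The sketch below is a standalone pre-check using only the standard library. `within_size_limit` is a hypothetical helper, and treating one megabyte as 1024 * 1024 bytes is an assumption that may not match how the framework interprets `max_file_size_mb`.

```python
import os
import tempfile


def within_size_limit(path, max_file_size_mb=100.0):
    """Return True if the file at `path` is no larger than the limit.

    Assumes 1 MB = 1024 * 1024 bytes; the framework may define it differently.
    """
    size_bytes = os.path.getsize(path)
    return size_bytes <= max_file_size_mb * 1024 * 1024


# Demo with a 2 KiB temporary file standing in for a real document.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2048)
    demo_path = tmp.name
try:
    small_ok = within_size_limit(demo_path, max_file_size_mb=1.0)
    tiny_limit_ok = within_size_limit(demo_path, max_file_size_mb=0.001)
finally:
    os.remove(demo_path)

print(small_ok, tiny_limit_ok)
```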

🧪 Framework Ecosystem

This framework is part of a larger document analysis ecosystem:

  • xml-analysis-framework - Specialized XML document analysis
  • docling-analysis-framework - PDF and Office documents (this package)
  • document-analysis-framework - Text, code, and configuration files
  • unified-analysis-orchestrator - Routes documents to appropriate frameworks
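The routing idea behind the ecosystem can be pictured as a lookup from file extension to framework name. The sketch below is purely illustrative and does not describe how unified-analysis-orchestrator works internally. The Docling extensions follow the Supported Document Types section, while the extension lists for the other two frameworks and the `route` helper are assumptions.

```python
from pathlib import Path

# Docling extensions come from this README; the other lists are assumed examples.
ROUTES = {
    "xml-analysis-framework": {".xml"},
    "docling-analysis-framework": {".pdf", ".docx", ".xlsx", ".pptx", ".png", ".jpg", ".tiff"},
    "document-analysis-framework": {".txt", ".md", ".py", ".json", ".yaml", ".toml"},
}


def route(path):
    """Return the name of the framework that should handle `path`, or None."""
    suffix = Path(path).suffix.lower()
    for framework, extensions in ROUTES.items():
        if suffix in extensions:
            return framework
    return None


for name in ("report.PDF", "feed.xml", "settings.toml", "archive.bin"):
    print(name, "->", route(name))
```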

🔗 Integration with Docling

This framework acts as an AI preparation layer on top of Docling:

  1. Docling handles the heavy lifting of document extraction
  2. Our framework adds AI-specific analysis and chunking
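  3. The result is AI-ready structured data and optimized chunks

# What Docling provides:
from docling.document_converter import DocumentConverter

docling_result = DocumentConverter().convert("document.pdf")
markdown_content = docling_result.document.export_to_markdown()

# What we add (conceptual: our_analyzer, our_chunker, and our_quality_assessor
# are illustrative placeholders, not importable names):
ai_analysis = our_analyzer.analyze(docling_result)
ai_chunks = our_chunker.chunk(markdown_content, ai_analysis)
quality_scores = our_quality_assessor.assess(docling_result)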

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • Powered by Docling for document extraction
  • Built for modern AI/ML development workflows
  • Part of the AI Building Blocks initiative
