Docling Analysis Framework

Python 3.8+ | License: MIT | Docling Powered | AI Ready

AI-ready analysis framework for PDF and Office documents using Docling for content extraction. Transforms Docling's output into optimized chunks and structured analysis for AI/ML pipelines.

📝 For text, code, and configuration files, use our companion document-analysis-framework, which uses only the Python standard library.

🚀 Quick Start

Simple API - Get Started in Seconds

import docling_analysis_framework as daf

# 🎯 One-line analysis with Docling extraction
result = daf.analyze("document.pdf")
print(f"Document type: {result['document_type'].type_name}")
print(f"Pages: {result['document_type'].pages}")
print(f"Handler used: {result['handler_used']}")

# ✂️ Smart chunking for AI/ML
chunks = daf.chunk("document.pdf", strategy="auto")
print(f"Created {len(chunks)} optimized chunks")

# 🚀 Enhanced analysis with both analysis and chunking
enhanced = daf.analyze_enhanced("document.pdf", chunking_strategy="structural")
print(f"Document: {enhanced['analysis']['document_type'].type_name}")
print(f"Chunks: {len(enhanced['chunks'])}")

# 💾 Save chunks to JSON
daf.save_chunks_to_json(chunks, "chunks_output.json")

# 💾 Save analysis to JSON
daf.save_analysis_to_json(result, "analysis_output.json")

Advanced Usage

import docling_analysis_framework as daf

# Enhanced analysis with full results
enhanced = daf.analyze_enhanced("research_paper.pdf")

print(f"Type: {enhanced['analysis']['document_type'].type_name}")
print(f"Confidence: {enhanced['analysis']['confidence']:.2f}")
print(f"AI use cases: {len(enhanced['analysis']['analysis'].ai_use_cases)}")
if enhanced['analysis']['analysis'].quality_metrics:
    for metric, score in enhanced['analysis']['analysis'].quality_metrics.items():
        print(f"{metric}: {score:.2f}")

# Different chunking strategies
hierarchical_chunks = daf.chunk("document.pdf", strategy="structural")
table_aware_chunks = daf.chunk("document.pdf", strategy="table_aware") 
page_aware_chunks = daf.chunk("document.pdf", strategy="page_aware")

# Process chunks
for chunk in hierarchical_chunks:
    print(f"Chunk {chunk.chunk_id}: {chunk.token_count} tokens")
    print(f"Type: {chunk.chunk_type}")
    # Send to your AI model...

# 💾 Save different strategies to separate files
strategies = {
    "structural": hierarchical_chunks,
    "table_aware": table_aware_chunks,
    "page_aware": page_aware_chunks
}

for strategy_name, chunks in strategies.items():
    daf.save_chunks_to_json(chunks, f"chunks_{strategy_name}.json")
    print(f"Saved {len(chunks)} chunks to chunks_{strategy_name}.json")

Expert Usage - Direct Class Access

# For advanced customization, use the classes directly
from docling_analysis_framework import DoclingAnalyzer, DoclingChunkingOrchestrator, ChunkingConfig

analyzer = DoclingAnalyzer(max_file_size_mb=100)
result = analyzer.analyze_document("document.pdf")

# Custom chunking with config
config = ChunkingConfig(
    max_chunk_size=1500,
    min_chunk_size=200,
    overlap_size=100,
    preserve_structure=True
)

orchestrator = DoclingChunkingOrchestrator(config=config)
# Note: In advanced usage, you'd pass the actual docling_result object
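
To make the size parameters above more concrete, here is a minimal sketch of how `max_chunk_size`, `min_chunk_size`, and `overlap_size` could interact in a simple sliding-window splitter. It is not the framework's implementation: `SketchConfig` and `window_chunks` are hypothetical names, and the sketch assumes the sizes are measured in characters, which the framework may or may not do.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchConfig:
    """Hypothetical stand-in that mirrors the ChunkingConfig field names."""
    max_chunk_size: int = 1500
    min_chunk_size: int = 200
    overlap_size: int = 100


def window_chunks(text: str, cfg: SketchConfig) -> List[str]:
    """Split text into character windows that overlap by cfg.overlap_size."""
    if cfg.overlap_size >= cfg.max_chunk_size:
        raise ValueError("overlap_size must be smaller than max_chunk_size")
    # Each window starts this many characters after the previous one.
    step = cfg.max_chunk_size - cfg.overlap_size
    chunks = [text[i:i + cfg.max_chunk_size] for i in range(0, len(text), step)]
    # Fold a too-short trailing window into the previous chunk,
    # skipping the part the previous chunk already contains.
    if len(chunks) > 1 and len(chunks[-1]) < cfg.min_chunk_size:
        tail = chunks.pop()
        chunks[-1] += tail[cfg.overlap_size:]
    return chunks
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence cut at a boundary still appears whole in at least one chunk more often than with hard cuts.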

📋 Supported Document Types

| Format | Extensions | Confidence | Powered By |
|---|---|---|---|
| 📄 PDF Documents | .pdf | 95% | Docling extraction |
| 📝 Word Documents | .docx | 95% | Docling extraction |
| 📊 Spreadsheets | .xlsx | 70% | Docling extraction |
| 📅 Presentations | .pptx | 70% | Docling extraction |
| 🖼️ Images with Text | .png, .jpg, .tiff | 70% | Docling OCR |

🎯 Key Features

🔍 Docling-Powered Extraction

  • PDF text extraction - High-quality content extraction
  • Table detection - Preserves table structure in markdown
  • Figure references - Maintains image/figure relationships
  • Header hierarchy - Document structure preservation

🤖 AI Preparation Layer

  • Quality assessment - Extraction quality scoring
  • Structure analysis - Document type detection and analysis
  • Chunk optimization - AI-ready segmentation strategies
  • Rich metadata - Page counts, figures, tables, quality metrics

⚡ Smart Chunking Strategies

  • Structural chunking - Respects document hierarchy (headers, sections)
  • Table-aware chunking - Separates tables from text content
  • Page-aware chunking - Considers original page boundaries
  • Auto-selection - Document-type-aware strategy selection
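
As an illustration of the structural and table-aware ideas listed above, the following sketch splits Markdown (such as Docling's Markdown export) into header-delimited sections and pulls pipe tables out as their own chunks. It is a simplified stand-in written with the standard library only, not the framework's chunker: `structural_chunks` and the dictionary layout are hypothetical, although the `chunk_type` and `section_title` names echo the fields used in the usage examples.

```python
import re
from typing import Dict, List

# Matches ATX-style Markdown headers such as "## Methods".
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def structural_chunks(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into header-delimited sections, with tables pulled out."""
    chunks: List[Dict[str, str]] = []
    title = ""
    body: List[str] = []
    table: List[str] = []

    def flush_body() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"chunk_type": "section", "section_title": title, "content": text})
        body.clear()

    def flush_table() -> None:
        if table:
            chunks.append({"chunk_type": "table", "section_title": title, "content": "\n".join(table)})
            table.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if line.lstrip().startswith("|"):
            # A table row: close any pending prose, then collect the row.
            flush_body()
            table.append(line)
            continue
        flush_table()
        if match:
            # A new header closes the previous section.
            flush_body()
            title = match.group(2).strip()
        else:
            body.append(line)
    flush_body()
    flush_table()
    return chunks
```

Keeping tables as separate chunks lets a downstream pipeline treat them differently from prose, for example by sending them to a table-specific prompt.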

📊 Extraction Quality Analysis

  • Text coverage - How much text was successfully extracted
  • Structure preservation - Whether headers, lists, tables are maintained
  • Overall confidence - Combined quality score for AI processing
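
One way to picture how these three metrics could fit together is a weighted combination. The sketch below is an assumption for illustration only: the function `quality_scores`, its inputs, and the 0.6 weighting are hypothetical and do not describe how the framework computes its scores. Only the key names `text_coverage`, `structure_score`, and `overall_score` are taken from the usage examples.

```python
from typing import Dict


def quality_scores(
    source_chars: int,
    extracted_chars: int,
    expected_elements: int,
    preserved_elements: int,
    coverage_weight: float = 0.6,  # hypothetical weighting, not the framework's
) -> Dict[str, float]:
    """Combine text coverage and structure preservation into one score."""
    # Fraction of the source text that was extracted, capped at 1.0.
    text_coverage = min(extracted_chars / source_chars, 1.0) if source_chars else 0.0
    # Fraction of headers, lists, and tables that survived extraction.
    structure_score = preserved_elements / expected_elements if expected_elements else 1.0
    overall = coverage_weight * text_coverage + (1.0 - coverage_weight) * structure_score
    return {
        "text_coverage": text_coverage,
        "structure_score": structure_score,
        "overall_score": overall,
    }
```

A pipeline could then skip or flag documents whose overall score falls below a threshold it chooses.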

🎉 Framework Status - COMPLETED

The Docling Analysis Framework is now fully functional and follows the same successful patterns as the XML Analysis Framework:

✅ Completed Features

  • 🎯 Simple API: One-line functions for analyze(), chunk(), analyze_enhanced()
  • 🔧 Advanced API: Direct class access for customization with DoclingAnalyzer, DoclingChunkingOrchestrator
  • ⚙️ Configurable Chunking: ChunkingConfig class for fine-tuning chunk parameters
  • 📦 Multiple Strategies: Structural, table-aware, page-aware, and auto-selection chunking
  • 💾 JSON Export: Easy export of analysis results and chunks to JSON files
  • 🛡️ Enhanced Error Handling: Comprehensive logging and error reporting
  • 📊 Quality Metrics: Extraction quality assessment and content analysis
  • 🧪 Testing Framework: Complete Jupyter notebook for validation
  • 📚 Documentation: Comprehensive README with examples and usage patterns

🚀 Ready for AI/ML Integration

The framework provides everything needed for AI/ML pipelines:

  • Token-optimized chunks sized for LLM context windows
  • Rich metadata with document structure and quality metrics
  • JSON export for vector database ingestion
  • Multiple chunking strategies for different document types
  • Quality assessment to determine AI suitability
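
To show what a JSON export of chunks with token counts might look like before ingestion into a vector database, here is a small standard-library sketch. `SketchChunk` and `chunks_to_json` are hypothetical names, the record layout is an assumption rather than the format written by `save_chunks_to_json`, and the token count uses a rough four-characters-per-token heuristic instead of a real tokenizer.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class SketchChunk:
    """Hypothetical chunk record; field names echo the examples in this README."""
    chunk_id: str
    content: str
    chunk_type: str = "section"
    metadata: Dict[str, Any] = field(default_factory=dict)

    @property
    def token_count(self) -> int:
        # Rough heuristic: about four characters per token.
        return max(1, len(self.content) // 4)


def chunks_to_json(chunks: List[SketchChunk]) -> str:
    """Serialize chunks to a JSON array, including the estimated token count."""
    records = []
    for chunk in chunks:
        record = asdict(chunk)  # dataclass fields only; properties are added below
        record["token_count"] = chunk.token_count
        records.append(record)
    return json.dumps(records, indent=2, ensure_ascii=False)
```

Each record carries its own metadata, so it can be embedded and stored independently of the others.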

🔄 Comparison with XML Framework

| Feature | XML Framework | Docling Framework | Status |
|---|---|---|---|
| Simple API | ✅ | ✅ | Complete |
| Advanced Classes | ✅ | ✅ | Complete |
| Multiple Strategies | ✅ | ✅ | Complete |
| Configuration | ✅ | ✅ | Complete |
| JSON Export | ✅ | ✅ | Complete |
| Error Handling | ✅ | ✅ | Complete |
| Testing Notebooks | ✅ | ✅ | Complete |
| Quality Metrics | ✅ | ✅ | Complete |

Both frameworks now provide identical API patterns and functionality!

🔧 Installation

# Install Docling first
pip install docling

# Install framework
git clone https://github.com/redhat-ai-americas/docling-analysis-framework.git
cd docling-analysis-framework  
pip install -e .

📖 Usage Examples

Document Analysis with Quality Assessment

from docling_analysis_framework import DoclingAnalyzer

analyzer = DoclingAnalyzer()
result = analyzer.analyze_document("contract.pdf")

# Access extraction quality
quality = result['analysis'].extraction_quality
print(f"Text coverage: {quality['text_coverage']:.2f}")
print(f"Structure score: {quality['structure_score']:.2f}")
print(f"Overall quality: {quality['overall_score']:.2f}")

# Access document insights
findings = result['analysis'].key_findings
print(f"Pages: {findings['page_count']}")
print(f"Table rows: {findings['table_rows']}")
print(f"Figures: {findings['figure_count']}")

Advanced Chunking for Academic Papers

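# Well suited to research papers with complex structure
from docling.document_converter import DocumentConverter
from docling_analysis_framework import DoclingChunkingOrchestrator

# Extract with Docling first; the orchestrator works on its output
docling_result = DocumentConverter().convert("paper.pdf")
markdown_content = docling_result.document.export_to_markdown()

# Assumption: the default constructor is acceptable here;
# pass config=ChunkingConfig(...) to tune chunk sizes
orchestrator = DoclingChunkingOrchestrator()
chunks = orchestrator.chunk_document(
    "paper.pdf",
    markdown_content,
    docling_result,
    strategy='structural'
)

# Chunks respect paper structure
for chunk in chunks:
    if chunk.chunk_type == 'section':
        print(f"Section: {chunk.metadata['section_title']}")
    elif chunk.chunk_type == 'table':
        print(f"Table data: {len(chunk.content)} chars")
    elif chunk.chunk_type == 'figure':
        print("Figure reference preserved")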

File Size Management

from docling_analysis_framework import DoclingAnalyzer

# Large PDF processing with limits
analyzer = DoclingAnalyzer(max_file_size_mb=100.0)

result = analyzer.analyze_document("large_manual.pdf")
if 'error' in result:
    print(f"File too large: {result['error']}")
else:
    print(f"Successfully processed {result['file_size']} bytes")
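
For readers curious what a size guard of this kind can look like, the sketch below checks a file against a megabyte limit and returns an `error` entry when the limit is exceeded. It is an illustration under assumptions: `check_file_size` is a hypothetical helper, and the framework's own check and error message may differ.

```python
import os
from typing import Dict, Union


def check_file_size(path: str, max_file_size_mb: float = 100.0) -> Dict[str, Union[str, int]]:
    """Return the file size, plus an 'error' entry if it exceeds the limit."""
    size_bytes = os.path.getsize(path)
    limit_bytes = int(max_file_size_mb * 1024 * 1024)
    if size_bytes > limit_bytes:
        return {
            "error": f"{path} is {size_bytes} bytes, over the {max_file_size_mb} MB limit",
            "file_size": size_bytes,
        }
    return {"file_size": size_bytes}
```

Checking the size before extraction avoids spending time and memory on a document that would be rejected anyway.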

🧪 Framework Ecosystem

This framework is part of a larger document analysis ecosystem:

  • xml-analysis-framework - Specialized XML document analysis
  • docling-analysis-framework - PDF and Office documents (this package)
  • document-analysis-framework - Text, code, and configuration files
  • unified-analysis-orchestrator - Routes documents to appropriate frameworks
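
The division of labor in the list above can be sketched as a simple extension-based router. How unified-analysis-orchestrator actually routes documents is not described here, so `pick_framework` is a hypothetical helper; the Docling extensions are taken from the supported document types listed earlier.

```python
from pathlib import Path

# Extensions listed under "Supported Document Types" for this package.
DOCLING_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".pptx", ".png", ".jpg", ".tiff"}


def pick_framework(filename: str) -> str:
    """Choose a framework name from the file extension (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".xml":
        return "xml-analysis-framework"
    if suffix in DOCLING_EXTENSIONS:
        return "docling-analysis-framework"
    # Text, code, and configuration files fall through to the companion package.
    return "document-analysis-framework"
```

For example, `pick_framework("report.PDF")` selects this package, while a YAML configuration file falls through to document-analysis-framework.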

🔗 Integration with Docling

This framework acts as an AI preparation layer on top of Docling:

  1. Docling handles the heavy lifting of document extraction
  2. Our framework adds AI-specific analysis and chunking
  3. The result is AI-ready structured data and optimized chunks

# What Docling provides:
from docling.document_converter import DocumentConverter

docling_result = DocumentConverter().convert("document.pdf")
markdown_content = docling_result.document.export_to_markdown()

# What we add (conceptual; our_analyzer, our_chunker, and
# our_quality_assessor are placeholder names, not real objects):
ai_analysis = our_analyzer.analyze(docling_result)
ai_chunks = our_chunker.chunk(markdown_content, ai_analysis)
quality_scores = our_quality_assessor.assess(docling_result)

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

  • Powered by Docling for document extraction
  • Built for modern AI/ML development workflows
  • Part of the AI Building Blocks initiative
