# Docling Analysis Framework
AI-ready analysis framework for PDF and Office documents using Docling for content extraction. Transforms Docling's output into optimized chunks and structured analysis for AI/ML pipelines.
## Part of the Analysis Framework Suite
This framework is part of a unified suite of document analysis tools that share a consistent interface:
- analysis-framework-base - Base interfaces and shared models
- xml-analysis-framework - XML document analysis
- docling-analysis-framework - PDF/Office documents via Docling (this package)
- document-analysis-framework - Text, code, config files
- data-analysis-framework - Structured data analysis
All frameworks implement the same BaseAnalyzer and BaseChunker interfaces from analysis-framework-base, enabling:
- Consistent API across document types
- Easy framework switching with minimal code changes
- Unified result format for downstream processing
- Shared tooling and utilities
**Note:** For text, code, and configuration files, use our companion document-analysis-framework, which uses only the Python standard library.
## Quick Start

### Simple API: Get Started in Seconds
```python
import docling_analysis_framework as daf

# One-line analysis with Docling extraction
result = daf.analyze("document.pdf")
print(f"Document type: {result['document_type'].type_name}")
print(f"Pages: {result['document_type'].pages}")
print(f"Handler used: {result['handler_used']}")

# Smart chunking for AI/ML
chunks = daf.chunk("document.pdf", strategy="auto")
print(f"Created {len(chunks)} optimized chunks")

# Enhanced analysis combining analysis and chunking
enhanced = daf.analyze_enhanced("document.pdf", chunking_strategy="structural")
print(f"Document: {enhanced['analysis']['document_type'].type_name}")
print(f"Chunks: {len(enhanced['chunks'])}")

# Save chunks to JSON
daf.save_chunks_to_json(chunks, "chunks_output.json")

# Save analysis to JSON
daf.save_analysis_to_json(result, "analysis_output.json")
```
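Under the hood, saving chunks is presumably a straightforward JSON serialization. The following is a minimal stdlib sketch of the idea; the function body and the chunk field names (`chunk_id`, `content`, `token_count`) are hypothetical, not the framework's actual schema:

```python
import json

def save_chunks_sketch(chunks, path):
    """Write a list of chunk dicts to a JSON file (hypothetical schema)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"chunk_count": len(chunks), "chunks": chunks}, f, indent=2)

# Example with a hand-made chunk record
chunks = [{"chunk_id": "c1", "content": "Intro...", "token_count": 42}]
save_chunks_sketch(chunks, "chunks_output.json")
```

The actual `save_chunks_to_json` may record additional metadata; treat this only as a mental model for what lands on disk.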
## Unified Interface Support
This framework now supports the unified interface standard, providing consistent access patterns across all analysis frameworks:
```python
import docling_analysis_framework as daf

# Use the unified interface
result = daf.analyze_unified("document.pdf")

# All access patterns work consistently
doc_type = result['document_type']      # Dict access
doc_type = result.document_type         # Attribute access
doc_type = result.get('document_type')  # get() method
as_dict = result.to_dict()              # Full dict conversion

# Works the same across all frameworks
print(f"Framework: {result.framework}")  # 'docling-analysis-framework'
print(f"Type: {result.document_type}")
print(f"Confidence: {result.confidence}")
print(f"AI opportunities: {result.ai_opportunities}")
```
The unified interface ensures compatibility when switching between frameworks or using multiple frameworks together.
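The dual dict/attribute access pattern above can be sketched with a small wrapper class. This is an illustrative stand-in, assuming the unified result is essentially a dict with attribute forwarding; it is not the framework's actual class:

```python
class UnifiedResultSketch:
    """Hypothetical result object supporting dict access, attribute
    access, .get(), and .to_dict(), as the unified interface describes."""

    def __init__(self, framework: str, **fields):
        self._data = {"framework": framework, **fields}

    def __getitem__(self, key):           # result['document_type']
        return self._data[key]

    def __getattr__(self, name):          # result.document_type
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name)

    def get(self, key, default=None):     # result.get('document_type')
        return self._data.get(key, default)

    def to_dict(self):                    # full dict conversion
        return dict(self._data)


result = UnifiedResultSketch("docling-analysis-framework",
                             document_type="PDF", confidence=0.95)
print(result["document_type"], result.document_type, result.get("confidence"))
```

A wrapper like this is one common way to keep a single backing dict while exposing both access styles, so switching frameworks does not force call sites to change.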
### Advanced Usage
```python
import docling_analysis_framework as daf

# Enhanced analysis with full results
enhanced = daf.analyze_enhanced("research_paper.pdf")
print(f"Type: {enhanced['analysis']['document_type'].type_name}")
print(f"Confidence: {enhanced['analysis']['confidence']:.2f}")
print(f"AI use cases: {len(enhanced['analysis']['analysis'].ai_use_cases)}")

if enhanced['analysis']['analysis'].quality_metrics:
    for metric, score in enhanced['analysis']['analysis'].quality_metrics.items():
        print(f"{metric}: {score:.2f}")

# Different chunking strategies
hierarchical_chunks = daf.chunk("document.pdf", strategy="structural")
table_aware_chunks = daf.chunk("document.pdf", strategy="table_aware")
page_aware_chunks = daf.chunk("document.pdf", strategy="page_aware")

# Process chunks
for chunk in hierarchical_chunks:
    print(f"Chunk {chunk.chunk_id}: {chunk.token_count} tokens")
    print(f"Type: {chunk.chunk_type}")
    # Send to your AI model...

# Save different strategies to separate files
strategies = {
    "structural": hierarchical_chunks,
    "table_aware": table_aware_chunks,
    "page_aware": page_aware_chunks,
}
for strategy_name, chunks in strategies.items():
    daf.save_chunks_to_json(chunks, f"chunks_{strategy_name}.json")
    print(f"Saved {len(chunks)} chunks to chunks_{strategy_name}.json")
```
### Expert Usage: Direct Class Access
```python
# For advanced customization, use the classes directly
from docling_analysis_framework import DoclingAnalyzer, DoclingChunkingOrchestrator, ChunkingConfig

analyzer = DoclingAnalyzer(max_file_size_mb=100)
result = analyzer.analyze_document("document.pdf")

# Custom chunking with config
config = ChunkingConfig(
    max_chunk_size=1500,
    min_chunk_size=200,
    overlap_size=100,
    preserve_structure=True,
)
orchestrator = DoclingChunkingOrchestrator(config=config)
# Note: in advanced usage, you'd pass the actual docling_result object
```
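To make the roles of these parameters concrete, here is a self-contained sketch of a character-based splitter that honors a maximum chunk size, a minimum tail size, and an overlap window. It is illustrative only, not the orchestrator's actual algorithm (the real chunkers work on document structure, not raw characters):

```python
from dataclasses import dataclass

@dataclass
class SimpleChunkingConfig:
    # Illustrative stand-in for ChunkingConfig; units are characters here.
    max_chunk_size: int = 1500
    min_chunk_size: int = 200
    overlap_size: int = 100

def split_with_overlap(text: str, cfg: SimpleChunkingConfig) -> list[str]:
    """Greedy fixed-size split with a trailing overlap between chunks."""
    chunks, start = [], 0
    step = cfg.max_chunk_size - cfg.overlap_size
    while start < len(text):
        chunk = text[start:start + cfg.max_chunk_size]
        # Merge a too-small tail into the previous chunk instead of
        # emitting a fragment below min_chunk_size.
        if chunks and len(chunk) < cfg.min_chunk_size:
            chunks[-1] += chunk[cfg.overlap_size:]
            break
        chunks.append(chunk)
        start += step
    return chunks

parts = split_with_overlap("x" * 3200, SimpleChunkingConfig())
print([len(p) for p in parts])  # [1500, 1500, 400]
```

Each chunk repeats the last `overlap_size` characters of its predecessor, which is the usual way to keep cross-boundary context available to an LLM.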
## Supported Document Types
| Format | Extensions | Confidence | Powered By |
|---|---|---|---|
| PDF Documents | .pdf | 95% | Docling extraction |
| Word Documents | .docx | 95% | Docling extraction |
| Spreadsheets | .xlsx | 70% | Docling extraction |
| Presentations | .pptx | 70% | Docling extraction |
| Images with Text | .png, .jpg, .tiff | 70% | Docling OCR |
## Key Features
### Docling-Powered Extraction
- PDF text extraction - High-quality content extraction
- Table detection - Preserves table structure in markdown
- Figure references - Maintains image/figure relationships
- Header hierarchy - Document structure preservation
### AI Preparation Layer
- Quality assessment - Extraction quality scoring
- Structure analysis - Document type detection and analysis
- Chunk optimization - AI-ready segmentation strategies
- Rich metadata - Page counts, figures, tables, quality metrics
### Smart Chunking Strategies
- Structural chunking - Respects document hierarchy (headers, sections)
- Table-aware chunking - Separates tables from text content
- Page-aware chunking - Considers original page boundaries
- Auto-selection - Document-type-aware strategy selection
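The auto-selection heuristics are internal to the framework; as a rough illustration, a strategy picker might inspect document traits like this (the specific rules below are assumptions, not the framework's actual logic):

```python
def pick_strategy(doc_type: str, table_count: int, has_headings: bool) -> str:
    """Hypothetical auto-selection rules for a chunking strategy."""
    if table_count > 0 and doc_type in {"xlsx", "pdf"}:
        return "table_aware"      # keep tables separate from prose
    if has_headings:
        return "structural"       # respect header/section hierarchy
    return "page_aware"           # fall back to page boundaries

print(pick_strategy("pdf", table_count=3, has_headings=True))  # table_aware
```

Whatever the real rules are, the point of `strategy="auto"` is that callers get a sensible default without knowing them.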
### Extraction Quality Analysis
- Text coverage - How much text was successfully extracted
- Structure preservation - Whether headers, lists, tables are maintained
- Overall confidence - Combined quality score for AI processing
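A combined confidence score of this kind is typically a weighted average of the component metrics. A minimal sketch, assuming weights of 0.6 for text coverage and 0.4 for structure (these weights are an assumption, not the framework's actual formula):

```python
def overall_quality(text_coverage: float, structure_score: float,
                    w_text: float = 0.6, w_structure: float = 0.4) -> float:
    """Weighted combination of extraction-quality components
    (hypothetical weights; both inputs are in [0, 1])."""
    return round(w_text * text_coverage + w_structure * structure_score, 2)

print(overall_quality(0.9, 0.8))  # 0.86
```

A downstream pipeline can then gate on this score, for example skipping AI processing for documents below some threshold.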
## Framework Status: Completed
The Docling Analysis Framework is now fully functional and follows the same successful patterns as the XML Analysis Framework:
### Completed Features
- **Simple API**: one-line functions `analyze()`, `chunk()`, `analyze_enhanced()`
- **Advanced API**: direct class access for customization with `DoclingAnalyzer` and `DoclingChunkingOrchestrator`
- **Configurable chunking**: `ChunkingConfig` class for fine-tuning chunk parameters
- **Multiple strategies**: structural, table-aware, page-aware, and auto-selection chunking
- **JSON export**: easy export of analysis results and chunks to JSON files
- **Enhanced error handling**: comprehensive logging and error reporting
- **Quality metrics**: extraction quality assessment and content analysis
- **Testing framework**: complete Jupyter notebook for validation
- **Documentation**: comprehensive README with examples and usage patterns
### Ready for AI/ML Integration
The framework provides everything needed for AI/ML pipelines:
- Token-optimized chunks sized for LLM context windows
- Rich metadata with document structure and quality metrics
- JSON export for vector database ingestion
- Multiple chunking strategies for different document types
- Quality assessment to determine AI suitability
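For instance, JSON-exported chunks can be mapped to the record shape most vector stores expect: an id, the text, and a metadata payload. The chunk field names below are assumptions about the schema, not the framework's documented format:

```python
def to_vector_records(chunks: list[dict]) -> list[dict]:
    """Map chunk dicts (assumed fields) to a generic vector-store record shape."""
    return [
        {
            "id": c["chunk_id"],
            "text": c["content"],
            "metadata": {
                "chunk_type": c.get("chunk_type", "text"),
                "token_count": c.get("token_count"),
            },
        }
        for c in chunks
    ]

records = to_vector_records(
    [{"chunk_id": "c1", "content": "Section 1 ...", "token_count": 42}]
)
print(records[0]["id"])  # c1
```

The embedding step itself is out of scope here; the point is that the exported chunk metadata carries through to the store unchanged.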
## Comparison with XML Framework
| Feature | XML Framework | Docling Framework | Status |
|---|---|---|---|
| Simple API | ✅ | ✅ | Complete |
| Advanced Classes | ✅ | ✅ | Complete |
| Multiple Strategies | ✅ | ✅ | Complete |
| Configuration | ✅ | ✅ | Complete |
| JSON Export | ✅ | ✅ | Complete |
| Error Handling | ✅ | ✅ | Complete |
| Testing Notebooks | ✅ | ✅ | Complete |
| Quality Metrics | ✅ | ✅ | Complete |
Both frameworks now provide identical API patterns and functionality!
## Installation
```bash
# Install Docling first
pip install docling

# Install the framework
git clone https://github.com/rdwj/docling-analysis-framework.git
cd docling-analysis-framework
pip install -e .
```
## Usage Examples

### Document Analysis with Quality Assessment
```python
from docling_analysis_framework import DoclingAnalyzer

analyzer = DoclingAnalyzer()
result = analyzer.analyze_document("contract.pdf")

# Access extraction quality
quality = result['analysis'].extraction_quality
print(f"Text coverage: {quality['text_coverage']:.2f}")
print(f"Structure score: {quality['structure_score']:.2f}")
print(f"Overall quality: {quality['overall_score']:.2f}")

# Access document insights
findings = result['analysis'].key_findings
print(f"Pages: {findings['page_count']}")
print(f"Tables: {findings['table_rows']}")
print(f"Figures: {findings['figure_count']}")
```
### Advanced Chunking for Academic Papers
```python
# Perfect for research papers with complex structure
# (orchestrator is a DoclingChunkingOrchestrator as created above)
chunks = orchestrator.chunk_document(
    "paper.pdf",
    markdown_content,
    docling_result,
    strategy='structural',
)

# Chunks respect the paper's structure
for chunk in chunks:
    if chunk.chunk_type == 'section':
        print(f"Section: {chunk.metadata['section_title']}")
    elif chunk.chunk_type == 'table':
        print(f"Table data: {len(chunk.content)} chars")
    elif chunk.chunk_type == 'figure':
        print("Figure reference preserved")
```
### File Size Management
```python
from docling_analysis_framework import DoclingAnalyzer

# Large PDF processing with limits
analyzer = DoclingAnalyzer(max_file_size_mb=100.0)
result = analyzer.analyze_document("large_manual.pdf")

if 'error' in result:
    print(f"File too large: {result['error']}")
else:
    print(f"Successfully processed {result['file_size']} bytes")
```
## Framework Ecosystem
This framework is part of a larger document analysis ecosystem:
- `xml-analysis-framework` - Specialized XML document analysis
- `docling-analysis-framework` - PDF and Office documents (this package)
- `document-analysis-framework` - Text, code, and configuration files
- `unified-analysis-orchestrator` - Routes documents to appropriate frameworks
## Integration with Docling
This framework acts as an AI preparation layer on top of Docling:
- Docling handles the heavy lifting of document extraction
- Our framework adds AI-specific analysis and chunking
- Result is AI-ready structured data and optimized chunks
```python
from docling.document_converter import DocumentConverter

# What Docling provides:
docling_result = DocumentConverter().convert("document.pdf")
markdown_content = docling_result.document.export_to_markdown()

# What we add:
ai_analysis = our_analyzer.analyze(docling_result)
ai_chunks = our_chunker.chunk(markdown_content, ai_analysis)
quality_scores = our_quality_assessor.assess(docling_result)
```
## License
MIT License - see LICENSE file for details.
## Acknowledgments
- Powered by Docling for document extraction
- Built for modern AI/ML development workflows
- Part of the AI Building Blocks initiative