# Unified Document Analysis
A thin wrapper providing a single entry point for all document analysis frameworks with automatic routing based on file type.
## Overview
This package provides a unified interface for analyzing any document type by automatically routing files to the appropriate specialized framework:
- xml-analysis-framework - For XML files (S1000D, DITA, etc.)
- docling-analysis-framework - For PDFs, Office documents, and images
- document-analysis-framework - For code, text, and configuration files
- data-analysis-framework - For CSV, Parquet, databases, and data files
## Why Use This?
Instead of learning and managing 4 different frameworks, you get:
- **Single API** - One `analyze()` and one `chunk()` function for everything
- **Automatic routing** - File type detection happens automatically
- **Lazy loading** - Only loads frameworks when needed
- **Minimal dependencies** - Install only what you need via optional dependencies
- **Helpful errors** - Clear messages when frameworks are missing
## Installation

Install the base package:

```shell
pip install unified-document-analysis
```
Then install the frameworks you need:
```shell
# Individual frameworks
pip install unified-document-analysis[xml]
pip install unified-document-analysis[docling]
pip install unified-document-analysis[document]
pip install unified-document-analysis[data]

# Lightweight set (xml + document, no heavy dependencies)
pip install unified-document-analysis[lightweight]

# Everything
pip install unified-document-analysis[all]
```
## Quick Start

### Basic Usage

```python
from unified_document_analysis import analyze, chunk

# Analyze any file - automatically detects the type and routes to the correct framework
result = analyze('technical_manual.xml')
result = analyze('report.pdf')
result = analyze('data.csv')
result = analyze('config.py')

# Chunk the file
chunks = chunk('report.pdf', result, strategy="semantic")
```
### Using the UnifiedAnalyzer Class

```python
from unified_document_analysis import UnifiedAnalyzer

analyzer = UnifiedAnalyzer()

# Check what's installed
print(analyzer.get_available_frameworks())
# Output: ['xml', 'document', 'data']

# Detect the framework for a file (without analyzing)
info = analyzer.detect_framework_for_file('document.json')
print(f"Framework: {info['framework']}, Confidence: {info['confidence']}")

# Analyze with an optional framework hint (for ambiguous files)
result = analyzer.analyze('ambiguous.json', framework_hint='data')

# Get framework information
info = analyzer.get_framework_info('xml')
print(f"Installed: {info['installed']}")
print(f"Extensions: {info['extensions']}")
```
## Advanced Features

```python
from unified_document_analysis import (
    UnifiedAnalyzer,
    get_available_frameworks,
    get_supported_extensions,
    detect_framework_for_file,
)

# Check which frameworks are installed
frameworks = get_available_frameworks()
print(f"Available: {frameworks}")

# Get all supported extensions
all_exts = get_supported_extensions()
print(f"XML extensions: {all_exts['xml']}")
print(f"Docling extensions: {all_exts['docling']}")

# Detect the framework for a file
info = detect_framework_for_file('document.yaml')
if info['is_ambiguous']:
    print(f"Ambiguous file. Alternatives: {info['alternatives']}")
```
## Framework Routing Table

| File Type | Extensions | Framework Used |
|---|---|---|
| XML | `.xml` | xml-analysis-framework |
| PDFs | `.pdf` | docling-analysis-framework |
| Office | `.docx`, `.pptx`, `.xlsx` | docling-analysis-framework |
| Images | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.bmp` | docling-analysis-framework |
| Data | `.csv`, `.parquet`, `.db`, `.sqlite` | data-analysis-framework |
| Code | `.py`, `.js`, `.ts`, `.java`, `.c`, etc. | document-analysis-framework |
| Text | `.md`, `.txt`, `.rst`, `.tex` | document-analysis-framework |
| Config | `.json`, `.yaml`, `.toml`, `.ini` | document-analysis-framework |
### Ambiguous File Types

Some extensions could belong to multiple frameworks. By default:

- `.json` → document-analysis-framework (confidence: 0.7)
- `.yaml`/`.yml` → document-analysis-framework (confidence: 0.7)
Use `framework_hint` to override:

```python
# Treat JSON as data instead of document
result = analyze('data.json', framework_hint='data')
```
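The default-versus-hint behavior can be pictured as a small lookup. The sketch below is illustrative only, not the package's actual code: the `AMBIGUOUS_DEFAULTS` table and `resolve_framework` helper are hypothetical names, with default routes and confidences taken from the list above.

```python
# Hypothetical sketch of hint resolution for ambiguous extensions.
# (framework, confidence, alternatives) per the defaults documented above.
AMBIGUOUS_DEFAULTS = {
    '.json': ('document', 0.7, ['data']),
    '.yaml': ('document', 0.7, ['data']),
    '.yml':  ('document', 0.7, ['data']),
}

def resolve_framework(extension, hint=None):
    """Return (framework, confidence) for an ambiguous extension."""
    default, confidence, alternatives = AMBIGUOUS_DEFAULTS[extension]
    if hint is None:
        return default, confidence          # fall back to the default route
    if hint == default or hint in alternatives:
        return hint, 1.0                    # an explicit hint is taken at face value
    raise ValueError(f"{hint!r} cannot handle {extension}")

print(resolve_framework('.json'))           # default routing
print(resolve_framework('.json', 'data'))   # hint overrides the default
```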
## API Reference

### Main Functions

#### `analyze(file_path, framework_hint=None, **kwargs)`
Analyze any supported file type.
**Args:**
- `file_path` (str): Path to the file to analyze
- `framework_hint` (str, optional): Force a specific framework (`'xml'`, `'docling'`, `'document'`, `'data'`)
- `**kwargs`: Additional arguments passed to the framework's analyze method

**Returns:** Analysis result from the appropriate framework

**Raises:**
- `FrameworkNotInstalledError`: If the required framework is not installed
- `UnsupportedFileTypeError`: If the file type is not supported
- `AnalysisError`: If analysis fails
#### `chunk(file_path, analysis_result, strategy="auto", framework_hint=None, **kwargs)`
Chunk a file based on its analysis result.
**Args:**
- `file_path` (str): Path to the file to chunk
- `analysis_result`: Analysis result from `analyze()`
- `strategy` (str): Chunking strategy (framework-specific)
- `framework_hint` (str, optional): Force a specific framework
- `**kwargs`: Additional arguments passed to the framework's chunk method

**Returns:** List of chunks from the appropriate framework
#### `get_available_frameworks()`

Get the list of installed frameworks.

**Returns:** List of framework names (e.g., `['xml', 'document']`)
#### `detect_framework_for_file(file_path, hint=None)`
Detect which framework would be used for a file.
**Returns:** Dictionary with:
- `framework`: Detected framework name
- `confidence`: Confidence score (0.0-1.0)
- `is_ambiguous`: Whether the file type is ambiguous
- `alternatives`: List of alternative frameworks for ambiguous types
- `installed`: Whether the framework is installed
### UnifiedAnalyzer Class

**Methods:**
- `analyze(file_path, framework_hint=None, **kwargs)` - Analyze a file
- `chunk(file_path, analysis_result, strategy="auto", **kwargs)` - Chunk a file
- `get_available_frameworks()` - Get installed frameworks
- `get_framework_info(framework_name)` - Get framework details
- `detect_framework_for_file(file_path, hint=None)` - Detect the framework for a file
- `get_supported_extensions(framework=None)` - Get supported extensions
## Error Handling

The package provides helpful error messages.

### Framework Not Installed
```python
from unified_document_analysis import analyze, FrameworkNotInstalledError

# If you try to analyze a PDF but the docling framework is not installed
try:
    result = analyze('document.pdf')
except FrameworkNotInstalledError as e:
    print(e)
    # Output:
    # The 'docling' framework is required to process 'document.pdf'
    # but is not installed.
    #
    # Install it with:
    #     pip install unified-document-analysis[docling]
    #
    # Or install all frameworks:
    #     pip install unified-document-analysis[all]
```
### Unsupported File Type

```python
from unified_document_analysis import analyze, UnsupportedFileTypeError

try:
    result = analyze('document.unknown')
except UnsupportedFileTypeError as e:
    print(e)  # Prints a list of all supported file types
```
## When to Use What
Use unified-document-analysis when:
- You're building an application that needs to handle multiple file types
- You want a simple API that "just works"
- You want to minimize dependencies by installing only needed frameworks
- You're prototyping and want quick file analysis
Use individual frameworks when:
- You only need one framework (e.g., only XML files)
- You need advanced framework-specific features
- You want maximum control over configuration
## How It Works

### Lazy Loading
Frameworks are only imported when needed:
```python
# Only imports the base package (lightweight)
from unified_document_analysis import analyze

# The framework is only loaded when analyze() is called
result = analyze('document.pdf')  # Now docling is imported
```
### Smart Routing
The router examines file extensions and routes to the appropriate framework:
1. Get the file extension
2. Look up the extension in the routing table
3. Dynamically import the framework module
4. Call the framework's analyze/chunk methods
5. Return the results
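The lookup in steps 1-2 can be sketched in plain Python. This is an illustrative stand-in, not the package's actual router: the `ROUTING_TABLE` dict and `route` function are hypothetical names, the table is abbreviated from the full one above, and the real step 3 would dynamically import the chosen framework (e.g. via `importlib`) rather than just return its name.

```python
from pathlib import Path

# Hypothetical routing table, abbreviated from the extensions documented above.
ROUTING_TABLE = {
    '.xml': 'xml',
    '.pdf': 'docling', '.docx': 'docling', '.png': 'docling',
    '.csv': 'data', '.parquet': 'data',
    '.py': 'document', '.md': 'document', '.json': 'document',
}

def route(file_path):
    """Map a file path to a framework name via its extension."""
    ext = Path(file_path).suffix.lower()
    try:
        return ROUTING_TABLE[ext]
    except KeyError:
        raise ValueError(f"Unsupported file type: {ext}")

print(route('report.pdf'))   # docling
print(route('manual.xml'))   # xml
```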
### Optional Dependencies

```toml
[project.optional-dependencies]
xml = ["xml-analysis-framework>=2.0.0"]
docling = ["docling-analysis-framework>=2.0.0"]
document = ["document-analysis-framework>=2.0.0"]
data = ["data-analysis-framework>=2.0.0"]
```
## Examples

### Multi-Format Document Pipeline

```python
from unified_document_analysis import analyze, chunk

def process_documents(file_paths):
    """Process multiple document types in a single pipeline."""
    results = []
    for path in file_paths:
        # Analyze (auto-routes to the correct framework)
        analysis = analyze(path)
        # Chunk (uses the same framework)
        chunks = chunk(path, analysis, strategy="semantic")
        results.append({
            'path': path,
            'analysis': analysis,
            'chunks': chunks,
        })
    return results

# Works with any mix of file types
files = [
    'manual.xml',
    'report.pdf',
    'data.csv',
    'config.py',
]
results = process_documents(files)
```
### Check Before Processing

```python
from unified_document_analysis import analyze, detect_framework_for_file

def can_process_file(file_path):
    """Check if a file can be processed with the installed frameworks."""
    info = detect_framework_for_file(file_path)
    if not info['installed']:
        print(f"Cannot process {file_path}: {info['framework']} framework not installed")
        return False
    if info['is_ambiguous']:
        print(f"Ambiguous file type. Will use {info['framework']} framework.")
        print(f"Alternatives: {info['alternatives']}")
    return True

# Check before processing
if can_process_file('document.json'):
    result = analyze('document.json')
```
### Custom Framework Selection

```python
from unified_document_analysis import UnifiedAnalyzer

analyzer = UnifiedAnalyzer()

# Override automatic detection for JSON data files
json_files = ['data1.json', 'data2.json']
for file_path in json_files:
    # Force the data framework instead of the document framework
    result = analyzer.analyze(file_path, framework_hint='data')
    print(f"Processed {file_path} as data")
```
## Contributing
Contributions welcome! Please submit issues and pull requests on GitHub.
Repository: https://github.com/rdwj/unified-document-analysis
## License
Apache License 2.0
## Related Projects
- analysis-framework-base - Base interfaces and types
- xml-analysis-framework - XML document analysis
- docling-analysis-framework - PDF/Office analysis
- document-analysis-framework - Code/text analysis
- data-analysis-framework - Data file analysis
## Support
For issues, questions, or contributions, please visit: https://github.com/rdwj/unified-document-analysis/issues