# Document Analysis Framework v2.0.0

A lightweight document analysis framework for text, code, configuration, and other text-based files, designed for AI/ML data pipelines. Part of the unified analysis framework suite built on analysis-framework-base.
## 🎯 When to Use This Framework

This is a fallback framework for text-based files not handled by our specialized frameworks:

### Specialized Frameworks (Use These First!)
**xml-analysis-framework**
- For: XML documents of all types
- Includes: 29+ specialized XML handlers (SCAP, RSS, Maven POM, Spring configs, etc.)
- Install: `pip install xml-analysis-framework`

**docling-analysis-framework**
- For: PDF, Word, Excel, PowerPoint, and images with text
- Features: Docling-powered extraction with OCR support
- Install: `pip install docling-analysis-framework`

**data-analysis-framework**
- For: Structured data that needs AI agent interaction
- Features: Safe query interface for AI agents to analyze data
- Install: `pip install data-analysis-framework`
### Use This Framework For:
- Code files: Python, JavaScript, TypeScript, Go, Rust, etc.
- Config files: Dockerfile, package.json, .env, INI files, etc.
- Text/Markup: Markdown, plain text, LaTeX, AsciiDoc, etc.
- Data files: CSV, JSON, YAML, TOML, TSV, etc.
- Other text-based formats not covered above
Note: Some file types (like CSV and JSON) can be handled by multiple frameworks. Choose based on your use case:

- Use `data-analysis-framework` for AI agent querying of structured data
- Use `document-analysis-framework` for chunking and document analysis
## 🚀 Quick Start
### Document Analysis

```python
from core.analyzer import DocumentAnalyzer

analyzer = DocumentAnalyzer()
result = analyzer.analyze_document("path/to/file.py")

print(f"Document Type: {result['document_type'].type_name}")
print(f"Language: {result['document_type'].language}")
print(f"AI Use Cases: {result['analysis'].ai_use_cases}")
```
### Smart Chunking

```python
from core.analyzer import DocumentAnalyzer
from core.chunking import ChunkingOrchestrator

# Analyze the document first
analyzer = DocumentAnalyzer()
analysis = analyzer.analyze_document("file.py")

# Convert the analysis into the format the chunker expects
chunking_analysis = {
    'document_type': {
        'type_name': analysis['document_type'].type_name,
        'confidence': analysis['document_type'].confidence,
        'category': analysis['document_type'].category
    },
    'analysis': analysis['analysis']
}

# Generate AI-optimized chunks
orchestrator = ChunkingOrchestrator()
chunks = orchestrator.chunk_document("file.py", chunking_analysis, strategy='auto')
```
## 🔄 Unified Interface Support
This framework now supports the unified interface standard, providing consistent access patterns across all analysis frameworks:
```python
import src as daf  # or import the installed package

# Use the unified interface
result = daf.analyze_unified("config.yaml")

# All access patterns work consistently
doc_type = result['document_type']      # Dict access
doc_type = result.document_type         # Attribute access
doc_type = result.get('document_type')  # get() method
as_dict = result.to_dict()              # Full dict conversion

# Works the same across all frameworks
print(f"Framework: {result.framework}")  # 'document-analysis-framework'
print(f"Type: {result.document_type}")
print(f"Confidence: {result.confidence}")
print(f"AI opportunities: {result.ai_opportunities}")
```
The unified interface ensures compatibility when switching between frameworks or using multiple frameworks together.
### Chunking for RAG
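As a minimal sketch of feeding chunks into a RAG pipeline, the helper below turns chunks into id/text/metadata records ready for embedding and indexing. It assumes each chunk is a dict with `content` and `metadata` keys, which is a hypothetical shape; adapt it to the framework's actual chunk objects.

```python
import hashlib

def chunks_to_rag_records(chunks, source_path):
    """Build id/text/metadata records ready for embedding and indexing.

    Assumes each chunk is a dict with 'content' and 'metadata' keys
    (hypothetical shape; adapt to the framework's actual chunk objects).
    """
    records = []
    for i, chunk in enumerate(chunks):
        text = chunk["content"]
        # Stable id derived from source path, position, and content
        chunk_id = hashlib.sha256(f"{source_path}:{i}:{text}".encode()).hexdigest()[:16]
        records.append({
            "id": chunk_id,
            "text": text,
            "metadata": {"source": source_path, "position": i, **chunk.get("metadata", {})},
        })
    return records

sample_chunks = [
    {"content": "def add(a, b):\n    return a + b", "metadata": {"type": "function"}},
    {"content": "# Utilities module docstring", "metadata": {"type": "comment"}},
]
records = chunks_to_rag_records(sample_chunks, "file.py")
```

The stable content-derived ids make re-indexing idempotent: re-running the pipeline over unchanged files produces the same ids, so a vector store can upsert instead of duplicating.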
## 📋 Currently Supported File Types
| Category | File Types | Extensions | Confidence |
|---|---|---|---|
| 📝 Text & Data | Markdown, CSV, JSON, YAML, TOML, Plain Text | .md, .csv, .json, .yaml, .toml, .txt | 90-95% |
| 💻 Code Files | Python, JavaScript, Java, C++, SQL | .py, .js, .java, .cpp, .sql | 90-95% |
| ⚙️ Configuration | Dockerfile, package.json, requirements.txt, Makefile | Various | 95% |
Coming Soon:
- TypeScript, Go, Rust, Ruby, PHP, Swift, Kotlin
- Shell scripts, PowerShell, R, MATLAB
- INI files, .env files, Apache/Nginx configs
- LaTeX, AsciiDoc, reStructuredText
- Log files, CSS/SCSS, Vue/Svelte components
## 🎯 Key Features
- 🔍 Intelligent Document Detection - Content-based recognition with confidence scoring
- 🤖 AI-Ready Output - Structured analysis with quality metrics and use case recommendations
- ⚡ Smart Chunking - Document-type-aware segmentation strategies
- 🔒 Security & Reliability - File size limits, safe handling, pure Python stdlib
- 🔄 Extensible - Easy to add new handlers for additional file types
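To illustrate what content-based detection with confidence scoring can look like, here is a self-contained sketch. It is not the framework's implementation, just a stand-alone example of the technique the feature list describes:

```python
import json

def detect_document_type(text):
    """Return (type_name, confidence) based on simple content signatures."""
    stripped = text.lstrip()
    # JSON: must actually parse, so confidence is high
    if stripped.startswith(("{", "[")):
        try:
            json.loads(text)
            return ("json", 0.95)
        except ValueError:
            pass
    # Python: count characteristic tokens and scale confidence with them
    signals = sum(tok in text for tok in ("def ", "import ", "class ", "self."))
    if signals >= 2:
        return ("python", min(0.9, 0.5 + 0.1 * signals))
    # Markdown: heading or list markers at the start of a line
    if any(line.startswith(("# ", "- ", "* ")) for line in text.splitlines()):
        return ("markdown", 0.7)
    return ("plain_text", 0.5)
```

Inspecting content rather than trusting extensions is what lets a detector assign a `.txt` file containing Python source to the right handler; the confidence score then tells downstream chunking how much to trust that assignment.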
## 🔧 Installation

```bash
pip install document-analysis-framework
```

Or from source:

```bash
git clone https://github.com/rdwj/document-analysis-framework.git
cd document-analysis-framework
pip install -e .
```
## 🧪 Framework Ecosystem
This framework is part of the unified analysis framework suite built on analysis-framework-base:
| Framework | Purpose | Key Features |
|---|---|---|
| analysis-framework-base | Base interfaces & standards | Common API, unified interface, framework integration |
| xml-analysis-framework | XML document specialist | 29+ handlers, security-focused, enterprise configs |
| docling-analysis-framework | PDF & Office documents | OCR support, table extraction, figure handling |
| data-analysis-framework | Structured data AI agent | Safe queries, natural language interface |
| document-analysis-framework | Everything else (text-based) | Fallback handler, pure Python, extensible |
### Choosing the Right Framework

```mermaid
graph TD
    A[Have a document to analyze?] --> B{What type?}
    B -->|XML| C[xml-analysis-framework]
    B -->|PDF/Word/Excel/PPT| D[docling-analysis-framework]
    B -->|Need AI to query data| E[data-analysis-framework]
    B -->|Text/Code/Config/Other| F[document-analysis-framework]
```
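The decision tree above can be sketched as a small dispatch helper. The framework names are the real packages, but the helper itself and its extension sets are illustrative assumptions, not part of any of the frameworks:

```python
# Illustrative dispatcher mirroring the decision tree above.
XML_EXTS = {".xml"}
OFFICE_EXTS = {".pdf", ".docx", ".xlsx", ".pptx"}

def choose_framework(path, needs_agent_queries=False):
    """Pick an analysis framework for a file, falling back to this one."""
    ext = "." + path.rsplit(".", 1)[-1].lower() if "." in path else ""
    if ext in XML_EXTS:
        return "xml-analysis-framework"
    if ext in OFFICE_EXTS:
        return "docling-analysis-framework"
    if needs_agent_queries:
        return "data-analysis-framework"
    return "document-analysis-framework"
```

This mirrors the fallback role described earlier: anything not claimed by a specialist framework lands in document-analysis-framework.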
## 📄 License
MIT License - see LICENSE file for details.
## File details

### document_analysis_framework-2.0.0.tar.gz

- Download URL: document_analysis_framework-2.0.0.tar.gz
- Size: 114.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `9f9a7e22064ef9ccf0ac9339cd4bd175763451d9e5fe2eb1decfd6b16fea2206` |
| MD5 | `a87286dae1cbbf53754f528a5b692bcd` |
| BLAKE2b-256 | `7f0f7c3e119ff34cdca29eb9fd061a66491dbfd4c68b036d7420706abf364ff7` |
### document_analysis_framework-2.0.0-py3-none-any.whl

- Download URL: document_analysis_framework-2.0.0-py3-none-any.whl
- Size: 116.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `513937005579052924640646ed6b59c58ffcd5d8669e43235659743051bef8bb` |
| MD5 | `0f2b6205f9b1d78f4ccdef144bd6fdb7` |
| BLAKE2b-256 | `5484f7058c52f4edd1f02a36e66d7dbf0fbb71a8249a6b7fc1378e741e928063` |