# Document Analysis Framework v2.0.0

A lightweight document analysis framework for text, code, configuration, and other text-based files, designed for AI/ML data pipelines. Part of the unified analysis framework suite built on analysis-framework-base.
## 🎯 When to Use This Framework

This is a fallback framework for text-based files not handled by our specialized frameworks:

### Specialized Frameworks (Use These First!)
**xml-analysis-framework**
- For: XML documents of all types
- Includes: 29+ specialized XML handlers (SCAP, RSS, Maven POM, Spring configs, etc.)
- Install: `pip install xml-analysis-framework`

**docling-analysis-framework**
- For: PDF, Word, Excel, PowerPoint, and images with text
- Features: Docling-powered extraction with OCR support
- Install: `pip install docling-analysis-framework`

**data-analysis-framework**
- For: Structured data that needs AI agent interaction
- Features: Safe query interface for AI agents to analyze data
- Install: `pip install data-analysis-framework`
### Use This Framework For:
- Code files: Python, JavaScript, TypeScript, Go, Rust, etc.
- Config files: Dockerfile, package.json, .env, INI files, etc.
- Text/Markup: Markdown, plain text, LaTeX, AsciiDoc, etc.
- Data files: CSV, JSON, YAML, TOML, TSV, etc.
- Other text-based formats not covered above
Note: Some file types (like CSV and JSON) can be handled by multiple frameworks. Choose based on your use case:

- Use `data-analysis-framework` for AI agent querying of structured data
- Use `document-analysis-framework` for chunking and document analysis
## 🚀 Quick Start
### Document Analysis

```python
from core.analyzer import DocumentAnalyzer

analyzer = DocumentAnalyzer()
result = analyzer.analyze_document("path/to/file.py")

print(f"Document Type: {result['document_type'].type_name}")
print(f"Language: {result['document_type'].language}")
print(f"AI Use Cases: {result['analysis'].ai_use_cases}")
```
### Smart Chunking

```python
from core.analyzer import DocumentAnalyzer
from core.chunking import ChunkingOrchestrator

# Analyze the document first
analyzer = DocumentAnalyzer()
analysis = analyzer.analyze_document("file.py")

# Convert the analysis into the format the chunker expects
chunking_analysis = {
    'document_type': {
        'type_name': analysis['document_type'].type_name,
        'confidence': analysis['document_type'].confidence,
        'category': analysis['document_type'].category
    },
    'analysis': analysis['analysis']
}

# Generate AI-optimized chunks
orchestrator = ChunkingOrchestrator()
chunks = orchestrator.chunk_document("file.py", chunking_analysis, strategy='auto')
```
## 🔄 Unified Interface Support
This framework now supports the unified interface standard, providing consistent access patterns across all analysis frameworks:
```python
import src as daf  # or import the installed package

# Use the unified interface
result = daf.analyze_unified("config.yaml")

# All access patterns work consistently
doc_type = result['document_type']      # Dict access
doc_type = result.document_type         # Attribute access
doc_type = result.get('document_type')  # get() method
as_dict = result.to_dict()              # Full dict conversion

# Works the same across all frameworks
print(f"Framework: {result.framework}")  # 'document-analysis-framework'
print(f"Type: {result.document_type}")
print(f"Confidence: {result.confidence}")
print(f"AI opportunities: {result.ai_opportunities}")
```
The unified interface ensures compatibility when switching between frameworks or using multiple frameworks together.
### Chunking for RAG
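As a minimal sketch of feeding chunks into a RAG pipeline, the helper below turns chunks into id/text/metadata records ready for embedding and indexing. It assumes each chunk is a dict with `content` and `metadata` keys, which is a hypothetical shape; adapt it to the framework's actual chunk objects.

```python
import hashlib

def chunks_to_rag_records(chunks, source_path):
    """Build id/text/metadata records ready for embedding and indexing.

    Assumes each chunk is a dict with 'content' and 'metadata' keys
    (hypothetical shape; adapt to the framework's actual chunk objects).
    """
    records = []
    for i, chunk in enumerate(chunks):
        text = chunk["content"]
        # Stable id derived from source path, position, and content
        chunk_id = hashlib.sha256(f"{source_path}:{i}:{text}".encode()).hexdigest()[:16]
        records.append({
            "id": chunk_id,
            "text": text,
            "metadata": {"source": source_path, "position": i, **chunk.get("metadata", {})},
        })
    return records

sample_chunks = [
    {"content": "def add(a, b):\n    return a + b", "metadata": {"type": "function"}},
    {"content": "# Utilities module docstring", "metadata": {"type": "comment"}},
]
records = chunks_to_rag_records(sample_chunks, "file.py")
```

The stable content-derived ids make re-indexing idempotent: re-running the pipeline over unchanged files produces the same ids, so a vector store can upsert instead of duplicating.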
## 📋 Currently Supported File Types
| Category | File Types | Extensions | Confidence |
|---|---|---|---|
| 📝 Text & Data | Markdown, CSV, JSON, YAML, TOML, Plain Text | .md, .csv, .json, .yaml, .toml, .txt | 90-95% |
| 💻 Code Files | Python, JavaScript, Java, C++, SQL | .py, .js, .java, .cpp, .sql | 90-95% |
| ⚙️ Configuration | Dockerfile, package.json, requirements.txt, Makefile | Various | 95% |
Coming Soon:
- TypeScript, Go, Rust, Ruby, PHP, Swift, Kotlin
- Shell scripts, PowerShell, R, MATLAB
- INI files, .env files, Apache/Nginx configs
- LaTeX, AsciiDoc, reStructuredText
- Log files, CSS/SCSS, Vue/Svelte components
## 🎯 Key Features
- 🔍 Intelligent Document Detection - Content-based recognition with confidence scoring
- 🤖 AI-Ready Output - Structured analysis with quality metrics and use case recommendations
- ⚡ Smart Chunking - Document-type-aware segmentation strategies
- 🔒 Security & Reliability - File size limits, safe handling, pure Python stdlib
- 🔄 Extensible - Easy to add new handlers for additional file types
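To illustrate what content-based detection with confidence scoring can look like, here is a self-contained sketch. It is not the framework's implementation, just a stand-alone example of the technique the feature list describes:

```python
import json

def detect_document_type(text):
    """Return (type_name, confidence) based on simple content signatures."""
    stripped = text.lstrip()
    # JSON: must actually parse, so confidence is high
    if stripped.startswith(("{", "[")):
        try:
            json.loads(text)
            return ("json", 0.95)
        except ValueError:
            pass
    # Python: count characteristic tokens and scale confidence with them
    signals = sum(tok in text for tok in ("def ", "import ", "class ", "self."))
    if signals >= 2:
        return ("python", min(0.9, 0.5 + 0.1 * signals))
    # Markdown: heading or list markers at the start of a line
    if any(line.startswith(("# ", "- ", "* ")) for line in text.splitlines()):
        return ("markdown", 0.7)
    return ("plain_text", 0.5)
```

Inspecting content rather than trusting extensions is what lets a detector assign a `.txt` file containing Python source to the right handler; the confidence score then tells downstream chunking how much to trust that assignment.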
## 🔧 Installation

```bash
pip install document-analysis-framework
```

Or from source:

```bash
git clone https://github.com/rdwj/document-analysis-framework.git
cd document-analysis-framework
pip install -e .
```
## 🧪 Framework Ecosystem
This framework is part of the unified analysis framework suite built on analysis-framework-base:
| Framework | Purpose | Key Features |
|---|---|---|
| analysis-framework-base | Base interfaces & standards | Common API, unified interface, framework integration |
| xml-analysis-framework | XML document specialist | 29+ handlers, security-focused, enterprise configs |
| docling-analysis-framework | PDF & Office documents | OCR support, table extraction, figure handling |
| data-analysis-framework | Structured data AI agent | Safe queries, natural language interface |
| document-analysis-framework | Everything else (text-based) | Fallback handler, pure Python, extensible |
### Choosing the Right Framework

```mermaid
graph TD
    A[Have a document to analyze?] --> B{What type?}
    B -->|XML| C[xml-analysis-framework]
    B -->|PDF/Word/Excel/PPT| D[docling-analysis-framework]
    B -->|Need AI to query data| E[data-analysis-framework]
    B -->|Text/Code/Config/Other| F[document-analysis-framework]
```
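The decision tree above can be sketched as a small dispatch helper. The framework names are the real packages, but the helper itself and its extension sets are illustrative assumptions, not part of any of the frameworks:

```python
# Illustrative dispatcher mirroring the decision tree above.
XML_EXTS = {".xml"}
OFFICE_EXTS = {".pdf", ".docx", ".xlsx", ".pptx"}

def choose_framework(path, needs_agent_queries=False):
    """Pick an analysis framework for a file, falling back to this one."""
    ext = "." + path.rsplit(".", 1)[-1].lower() if "." in path else ""
    if ext in XML_EXTS:
        return "xml-analysis-framework"
    if ext in OFFICE_EXTS:
        return "docling-analysis-framework"
    if needs_agent_queries:
        return "data-analysis-framework"
    return "document-analysis-framework"
```

This mirrors the fallback role described earlier: anything not claimed by a specialist framework lands in document-analysis-framework.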
## 📄 License
MIT License - see LICENSE file for details.
## File details

### document_analysis_framework-2.0.0.tar.gz

- Download URL: document_analysis_framework-2.0.0.tar.gz
- Size: 114.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `9f9a7e22064ef9ccf0ac9339cd4bd175763451d9e5fe2eb1decfd6b16fea2206` |
| MD5 | `a87286dae1cbbf53754f528a5b692bcd` |
| BLAKE2b-256 | `7f0f7c3e119ff34cdca29eb9fd061a66491dbfd4c68b036d7420706abf364ff7` |
### document_analysis_framework-2.0.0-py3-none-any.whl

- Download URL: document_analysis_framework-2.0.0-py3-none-any.whl
- Size: 116.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12
File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `513937005579052924640646ed6b59c58ffcd5d8669e43235659743051bef8bb` |
| MD5 | `0f2b6205f9b1d78f4ccdef144bd6fdb7` |
| BLAKE2b-256 | `5484f7058c52f4edd1f02a36e66d7dbf0fbb71a8249a6b7fc1378e741e928063` |