Text, code, and configuration file analysis framework for AI/ML data pipelines - part of the unified analysis framework suite

Document Analysis Framework v2.0.0

Python 3.8+ | License: MIT | AI Ready | Framework Suite

A lightweight document analysis framework for text, code, configuration, and other text-based files designed for AI/ML data pipelines. Part of the unified analysis framework suite built on analysis-framework-base.

🎯 When to Use This Framework

This is a fallback framework for text-based files not handled by our specialized frameworks:

Specialized Frameworks (Use These First!)

  1. xml-analysis-framework 📑

    • For: XML documents of all types
    • Includes: 29+ specialized XML handlers (SCAP, RSS, Maven POM, Spring configs, etc.)
    • Install: pip install xml-analysis-framework
  2. docling-analysis-framework 📄

    • For: PDF, Word, Excel, PowerPoint, and images with text
    • Features: Docling-powered extraction with OCR support
    • Install: pip install docling-analysis-framework
  3. data-analysis-framework 📊

    • For: Structured data that needs AI agent interaction
    • Features: Safe query interface for AI agents to analyze data
    • Install: pip install data-analysis-framework

Use This Framework For:

  • Code files: Python, JavaScript, TypeScript, Go, Rust, etc.
  • Config files: Dockerfile, package.json, .env, INI files, etc.
  • Text/Markup: Markdown, plain text, LaTeX, AsciiDoc, etc.
  • Data files: CSV, JSON, YAML, TOML, TSV, etc.
  • Other text-based formats not covered above

Note: Some file types (like CSV, JSON) can be handled by multiple frameworks. Choose based on your use case:

  • Use data-analysis-framework for AI agent querying of structured data
  • Use document-analysis-framework for chunking and document analysis

🚀 Quick Start

Document Analysis

from core.analyzer import DocumentAnalyzer

analyzer = DocumentAnalyzer()
result = analyzer.analyze_document("path/to/file.py")

print(f"Document Type: {result['document_type'].type_name}")
print(f"Language: {result['document_type'].language}")
print(f"AI Use Cases: {result['analysis'].ai_use_cases}")

Smart Chunking

from core.analyzer import DocumentAnalyzer
from core.chunking import ChunkingOrchestrator

# Analyze document
analyzer = DocumentAnalyzer()
analysis = analyzer.analyze_document("file.py")

# Convert format for chunking
chunking_analysis = {
    'document_type': {
        'type_name': analysis['document_type'].type_name,
        'confidence': analysis['document_type'].confidence,
        'category': analysis['document_type'].category
    },
    'analysis': analysis['analysis']
}

# Generate AI-optimized chunks
orchestrator = ChunkingOrchestrator()
chunks = orchestrator.chunk_document("file.py", chunking_analysis, strategy='auto')
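
With `strategy='auto'`, the orchestrator picks a segmentation strategy based on the detected document type. As a rough illustration of what "document-type-aware" chunking means (a minimal stdlib sketch, not the framework's actual implementation; `chunk_by_structure` is a hypothetical name):

```python
import re

def chunk_by_structure(text, max_lines=40):
    """Split source text at top-level def/class boundaries; fall back
    to fixed-size line windows when no structure is found (sketch)."""
    lines = text.splitlines()
    starts = [i for i, ln in enumerate(lines)
              if re.match(r"^(def |class )", ln)]
    if not starts:
        # No recognizable structure: plain fixed-size windows
        return ["\n".join(lines[i:i + max_lines])
                for i in range(0, len(lines), max_lines)]
    if starts[0] != 0:
        starts.insert(0, 0)  # keep the module header as its own chunk
    return ["\n".join(lines[a:b])
            for a, b in zip(starts, starts[1:] + [len(lines)])]

src = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
print(len(chunk_by_structure(src)))  # header, a(), b() -> 3 chunks
```

Structure-aware splits like this keep each function or class intact in one chunk, which tends to produce better retrieval units for RAG than blind fixed-size windows.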

🔄 Unified Interface Support

This framework now supports the unified interface standard, providing consistent access patterns across all analysis frameworks:

import src as daf  # or from the installed package

# Use the unified interface
result = daf.analyze_unified("config.yaml")

# All access patterns work consistently
doc_type = result['document_type']        # Dict access ✓
doc_type = result.document_type           # Attribute access ✓
doc_type = result.get('document_type')    # get() method ✓
as_dict = result.to_dict()                # Full dict conversion ✓

# Works the same across all frameworks
print(f"Framework: {result.framework}")   # 'document-analysis-framework'
print(f"Type: {result.document_type}")
print(f"Confidence: {result.confidence}")
print(f"AI opportunities: {result.ai_opportunities}")

The unified interface ensures compatibility when switching between frameworks or using multiple frameworks together.
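
The dual dict/attribute access shown above can be implemented with very little machinery. A minimal sketch of such a result object (illustrative only, not the framework's actual `analyze_unified` return type):

```python
class UnifiedResult(dict):
    """Mapping that also supports attribute access and to_dict() (sketch)."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)

    def to_dict(self):
        return dict(self)

r = UnifiedResult(framework="document-analysis-framework",
                  document_type="yaml", confidence=0.95)
print(r["document_type"], r.document_type, r.get("confidence"))
```

Because the class subclasses `dict`, the `[]`, `.get()`, and iteration behaviors come for free; only attribute access needs the `__getattr__` shim.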

📋 Currently Supported File Types

| Category | File Types | Extensions | Confidence |
| --- | --- | --- | --- |
| 📝 Text & Data | Markdown, CSV, JSON, YAML, TOML, Plain Text | .md, .csv, .json, .yaml, .toml, .txt | 90-95% |
| 💻 Code Files | Python, JavaScript, Java, C++, SQL | .py, .js, .java, .cpp, .sql | 90-95% |
| ⚙️ Configuration | Dockerfile, package.json, requirements.txt, Makefile | Various | 95% |

Coming Soon:

  • TypeScript, Go, Rust, Ruby, PHP, Swift, Kotlin
  • Shell scripts, PowerShell, R, MATLAB
  • INI files, .env files, Apache/Nginx configs
  • LaTeX, AsciiDoc, reStructuredText
  • Log files, CSS/SCSS, Vue/Svelte components

🎯 Key Features

  • 🔍 Intelligent Document Detection - Content-based recognition with confidence scoring
  • 🤖 AI-Ready Output - Structured analysis with quality metrics and use case recommendations
  • ⚡ Smart Chunking - Document-type-aware segmentation strategies
  • 🔒 Security & Reliability - File size limits, safe handling, pure Python stdlib
  • 🔄 Extensible - Easy to add new handlers for additional file types
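
To make "content-based recognition with confidence scoring" concrete, here is a minimal stdlib sketch of the idea: score several candidate types from content cues and return the best match. This is illustrative only; the framework's real handlers are far richer, and `detect_type` is a hypothetical name:

```python
import json
import re

def detect_type(text):
    """Guess a text file's type from its content, with a confidence
    score (sketch; not the framework's actual detection logic)."""
    candidates = [("plain_text", 0.5)]  # always-available fallback
    if text.lstrip().startswith(("{", "[")):
        try:
            json.loads(text)  # only claim JSON if it actually parses
            candidates.append(("json", 0.95))
        except ValueError:
            pass
    if re.search(r"^(def |class |import )", text, re.M):
        candidates.append(("python", 0.90))
    if re.search(r"^#{1,6} ", text, re.M):
        candidates.append(("markdown", 0.90))
    return max(candidates, key=lambda c: c[1])

print(detect_type('{"a": 1}'))  # ('json', 0.95)
```

Scoring by content rather than extension alone is what lets a detector handle, say, a JSON payload saved as `.txt`.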

🔧 Installation

pip install document-analysis-framework

Or from source:

git clone https://github.com/rdwj/document-analysis-framework.git
cd document-analysis-framework
pip install -e .

🧪 Framework Ecosystem

This framework is part of the unified analysis framework suite built on analysis-framework-base:

| Framework | Purpose | Key Features |
| --- | --- | --- |
| analysis-framework-base | Base interfaces & standards | Common API, unified interface, framework integration |
| xml-analysis-framework | XML document specialist | 29+ handlers, security-focused, enterprise configs |
| docling-analysis-framework | PDF & Office documents | OCR support, table extraction, figure handling |
| data-analysis-framework | Structured data AI agent | Safe queries, natural language interface |
| document-analysis-framework | Everything else (text-based) | Fallback handler, pure Python, extensible |

Choosing the Right Framework

graph TD
    A[Have a document to analyze?] --> B{What type?}
    B -->|XML| C[xml-analysis-framework]
    B -->|PDF/Word/Excel/PPT| D[docling-analysis-framework]
    B -->|Need AI to query data| E[data-analysis-framework]
    B -->|Text/Code/Config/Other| F[document-analysis-framework]
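
In a pipeline, the decision tree above can be a small dispatch function keyed on file extension (a sketch; `pick_framework` is a hypothetical helper, and the data-analysis-framework branch is use-case-driven rather than extension-driven, so it is not shown):

```python
from pathlib import Path

def pick_framework(path):
    """Route a file to an analysis framework by extension (sketch)."""
    ext = Path(path).suffix.lower()
    if ext == ".xml":
        return "xml-analysis-framework"
    if ext in {".pdf", ".docx", ".xlsx", ".pptx"}:
        return "docling-analysis-framework"
    # Text, code, config, and anything else falls through here
    return "document-analysis-framework"

print(pick_framework("pom.xml"))  # xml-analysis-framework
```

Anything the specialized frameworks don't claim falls through to document-analysis-framework, matching its role as the suite's fallback handler.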

📄 License

MIT License - see LICENSE file for details.
