Comprehensive framework for analyzing XML documents with AI/ML processing support

XML Analysis Framework

A production-ready framework for analyzing XML documents with 29 specialized handlers and intelligent AI integration capabilities. It transforms XML documents into structured, AI-ready data; all 71 files in the project's diverse test set were processed successfully.

🚀 Quick Start

Document Analysis

from core.analyzer import XMLDocumentAnalyzer

analyzer = XMLDocumentAnalyzer()
analysis = analyzer.analyze_document("path/to/file.xml")

# Analysis result structure (illustrative; "..." marks omitted fields):
{
    "file_path": "path/to/file.xml",
    "document_type": DocumentTypeInfo(type_name="Apache Ant Build", confidence=0.95, ...),
    "handler_used": "AntBuildHandler", 
    "confidence": 0.95,
    "analysis": SpecializedAnalysis(...),
    "namespaces": {...},
    "file_size": 1234
}

Smart Chunking

from core.chunking import ChunkingOrchestrator, ChunkingConfig

orchestrator = ChunkingOrchestrator()

# Convert analysis for chunking (required format)
chunking_analysis = {
    "document_type": {
        "type_name": analysis["document_type"].type_name,
        "confidence": analysis["document_type"].confidence
    },
    "analysis": analysis["analysis"]
}

# Perform chunking
chunks = orchestrator.chunk_document(
    file_path="path/to/file.xml",
    specialized_analysis=chunking_analysis,
    strategy='auto'  # or 'hierarchical', 'sliding_window', 'content_aware'
)
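
The conversion step above is repeated wherever chunking is used, so it may be convenient to wrap it in a small helper. The following is a minimal sketch, not part of the framework: `to_chunking_format` is a hypothetical name, and the `SimpleNamespace` object only stands in for the `DocumentTypeInfo` value that a real analysis would contain.

```python
from types import SimpleNamespace


def to_chunking_format(analysis):
    """Build the dict shape that the Quick Start passes to chunk_document."""
    doc_type = analysis["document_type"]
    return {
        "document_type": {
            "type_name": doc_type.type_name,
            "confidence": doc_type.confidence,
        },
        "analysis": analysis["analysis"],
    }


# Stand-in for a real analyzer result (made-up values, for illustration only).
fake_analysis = {
    "document_type": SimpleNamespace(type_name="Apache Ant Build", confidence=0.95),
    "analysis": {"placeholder": True},
}

converted = to_chunking_format(fake_analysis)
print(converted["document_type"])
```

With the framework installed, the same helper would presumably be called on the dict returned by `analyzer.analyze_document(...)`.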

Complete Workflow

from core.analyzer import XMLDocumentAnalyzer
from core.chunking import ChunkingOrchestrator

# 1. Analyze document
analyzer = XMLDocumentAnalyzer()
analysis = analyzer.analyze_document("file.xml")

# 2. Convert to chunking format
chunking_analysis = {
    "document_type": {
        "type_name": analysis["document_type"].type_name,
        "confidence": analysis["document_type"].confidence
    },
    "analysis": analysis["analysis"]
}

# 3. Generate optimal chunks
orchestrator = ChunkingOrchestrator()
chunks = orchestrator.chunk_document("file.xml", chunking_analysis)

# 4. Process results
for chunk in chunks:
    print(f"Chunk {chunk.chunk_id}: {chunk.token_estimate} tokens")
    print(f"Content: {chunk.content[:100]}...")
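
Once chunks are available, a common next step is to group them so that each group fits a model's context budget. The sketch below is one possible approach under stated assumptions: `Chunk` is a stand-in dataclass that only mirrors the three attributes used in the workflow above (`chunk_id`, `content`, `token_estimate`), and `batch_by_token_budget` is a hypothetical helper, not a framework function.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Stand-in with the three attributes used in the workflow above."""
    chunk_id: str
    content: str
    token_estimate: int


def batch_by_token_budget(chunks, max_tokens):
    """Greedily pack consecutive chunks into batches of at most max_tokens.

    A single chunk larger than the budget is kept in a batch of its own.
    """
    batches, current, used = [], [], 0
    for chunk in chunks:
        # Close the current batch if adding this chunk would exceed the budget.
        if current and used + chunk.token_estimate > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(chunk)
        used += chunk.token_estimate
    if current:
        batches.append(current)
    return batches


demo = [
    Chunk("c1", "a", 300),
    Chunk("c2", "b", 500),
    Chunk("c3", "c", 400),
    Chunk("c4", "d", 1200),
]
batches = batch_by_token_budget(demo, max_tokens=1000)
for batch in batches:
    print([c.chunk_id for c in batch], sum(c.token_estimate for c in batch))
```

Keeping chunks in document order, rather than sorting by size, preserves the local context that adjacent chunks share.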

🎯 Key Features

1. 🏆 Production Proven Results

100% Success Rate: All 71 test files processed successfully
2,752 Chunks Generated: Average 38.8 optimized chunks per file
54 Document Types Detected: Comprehensive XML format coverage
Zero Dependencies: Pure Python stdlib implementation

2. 🧠 29 Specialized XML Handlers

Enterprise-grade document intelligence:

Security & Compliance: SCAP, SAML, SOAP (90-100% confidence)
DevOps & Build: Maven POM, Ant, Ivy, Spring, Log4j (95-100% confidence)
Content & Documentation: RSS/Atom, DocBook, XHTML, SVG
Enterprise Systems: ServiceNow, Hibernate, Struts configurations
Data & APIs: GPX, KML, GraphML, WADL/WSDL, XML Schemas

3. ⚡ Intelligent Processing Pipeline

Smart Document Detection: Confidence scoring with graceful fallbacks
Semantic Chunking: Document-type-aware optimal segmentation
Token Optimization: LLM context window optimized chunks
Quality Assessment: Automated data quality metrics
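
Confidence scoring with a graceful fallback can be sketched as follows. This is an illustration of the general idea rather than the framework's detection code: the two handler classes are simplified stand-ins, `GenericXMLHandler` is a hypothetical name for the fallback, and the only detail taken from this README is that `can_handle(root, namespaces)` returns a `(matched, confidence)` pair.

```python
import xml.etree.ElementTree as ET


class AntHandler:
    """Simplified stand-in: treats <project> with a <target> child as Ant."""
    name = "AntBuildHandler"

    def can_handle(self, root, namespaces):
        matched = root.tag == "project" and root.find("target") is not None
        return matched, 0.95


class RssHandler:
    """Simplified stand-in: recognizes RSS feeds by the root tag."""
    name = "RssHandler"

    def can_handle(self, root, namespaces):
        return root.tag == "rss", 1.0


HANDLERS = [AntHandler(), RssHandler()]


def detect(xml_text, handlers, min_confidence=0.5):
    """Return (handler_name, confidence) for the best match, or the fallback."""
    root = ET.fromstring(xml_text)
    best_name, best_conf = "GenericXMLHandler", 0.0  # hypothetical fallback name
    for handler in handlers:
        matched, confidence = handler.can_handle(root, {})
        if matched and confidence >= min_confidence and confidence > best_conf:
            best_name, best_conf = handler.name, confidence
    return best_name, best_conf


print(detect('<project><target name="build"/></project>', HANDLERS))
print(detect("<unknown/>", HANDLERS))
```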

4. 🤖 AI-Ready Integration

Vector Store Ready: Structured embeddings with rich metadata
Graph Database Compatible: Relationship and dependency mapping
LLM Agent Optimized: Context-aware, actionable insights
Complete AI Workflows: See AI Integration Guide
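
To make "relationship and dependency mapping" concrete, here is a minimal sketch that turns a Maven POM into graph edges using only the standard library. It does not use the framework's handlers; the POM content is a made-up sample, and `coordinates`, `dependency_edges`, and the `DEPENDS_ON` label are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample POM, for illustration only.
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <groupId>com.example</groupId>
  <artifactId>demo-app</artifactId>
  <version>1.0.0</version>
  <dependencies>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>2.0.9</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.13.2</version>
    </dependency>
  </dependencies>
</project>"""

NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def coordinates(element):
    """Return 'groupId:artifactId:version' for a project or dependency element."""
    parts = (
        element.findtext(f"m:{tag}", default="?", namespaces=NS)
        for tag in ("groupId", "artifactId", "version")
    )
    return ":".join(parts)


def dependency_edges(pom_text):
    """Return (source, relation, target) triples suitable for a graph store."""
    root = ET.fromstring(pom_text)
    source = coordinates(root)
    return [
        (source, "DEPENDS_ON", coordinates(dep))
        for dep in root.findall("m:dependencies/m:dependency", NS)
    ]


edges = dependency_edges(POM)
for edge in edges:
    print(edge)
```

Each triple can be loaded into a graph database as one relationship between two artifact nodes.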

📋 Supported Document Types (29 Handlers)

Category	Handlers	Confidence	Use Cases
🔐 Security & Compliance	SCAP, SAML, SOAP	90-100%	Vulnerability assessment, compliance monitoring, security posture analysis
⚙️ DevOps & Build Tools	Maven POM, Ant, Ivy	95-100%	Dependency analysis, build optimization, technical debt assessment
🏢 Enterprise Configuration	Spring, Hibernate, Struts, Log4j	95-100%	Configuration validation, security scanning, modernization planning
📄 Content & Documentation	RSS, DocBook, XHTML, SVG	90-100%	Content intelligence, documentation search, knowledge management
🗂️ Enterprise Systems	ServiceNow, XML Sitemap	95-100%	Incident analysis, process automation, system integration
🌍 Geospatial & Data	GPX, KML, GraphML	85-95%	Route optimization, geographic analysis, network intelligence
🔌 API & Integration	WADL, WSDL, XLIFF	90-95%	Service discovery, integration planning, translation workflows
📐 Schemas & Standards	XML Schema (XSD)	100%	Schema validation, data modeling, API documentation

🏗️ Architecture

xml-analysis-framework/
├── README.md                    # Project overview
├── LICENSE                      # MIT license
├── requirements.txt            # Dependencies (Python stdlib only)
├── setup.py                    # Package installation
├── .gitignore                  # Git ignore patterns
├── .github/workflows/          # CI/CD pipelines
│
├── src/                        # Source code
│   ├── core/                   # Core framework
│   │   ├── analyzer.py         # Main analysis engine
│   │   ├── schema_analyzer.py  # XML schema analysis
│   │   └── chunking.py         # Chunking strategies
│   ├── handlers/              # 29 specialized handlers
│   └── utils/                 # Utility functions
│
├── tests/                      # Comprehensive test suite
│   ├── unit/                  # Handler unit tests (16 files)
│   ├── integration/           # Integration tests (11 files)
│   ├── comprehensive/         # Full system tests (4 files)
│   └── run_all_tests.py      # Master test runner
│
├── examples/                   # Usage examples
│   ├── basic_analysis.py      # Simple analysis
│   ├── enhanced_analysis.py   # Full featured analysis
│   └── framework_demo.py      # Complete demonstration
│
├── scripts/                    # Utility scripts
│   ├── collect_test_files.py  # Test data collection
│   └── debug/                 # Debug utilities
│
├── docs/                       # Documentation
│   ├── architecture/          # Design documents
│   ├── guides/                # User guides
│   └── api/                   # API documentation
│
├── sample_data/               # Test XML files (99+ examples)
│   ├── test_files/           # Real-world examples
│   └── test_files_synthetic/ # Generated test cases
│
└── artifacts/                 # Build artifacts, results
    ├── analysis_results/     # JSON analysis outputs
    └── reports/             # Generated reports

🔧 Installation

# Install from source
git clone <repository-url>
cd xml-analysis-framework
pip install -e .

# Or install with development dependencies (quoted so shells such as zsh do not expand the brackets)
pip install -e ".[dev]"

No external dependencies are required: the framework uses only the Python standard library (Python 3.7+).

📖 Usage Examples

Basic Analysis

from src.core.schema_analyzer import XMLSchemaAnalyzer

analyzer = XMLSchemaAnalyzer()
schema = analyzer.analyze_file('document.xml')
print(analyzer.generate_llm_description(schema))

Enhanced Analysis with Specialized Handlers

from src.core.analyzer import XMLDocumentAnalyzer

analyzer = XMLDocumentAnalyzer()
result = analyzer.analyze_document('maven-project.xml')

# analyze_document returns a dict (see the result structure in Quick Start)
print(f"Document Type: {result['document_type'].type_name}")
print(f"Confidence: {result['document_type'].confidence:.2f}")
print(f"AI Use Cases: {result['analysis'].ai_use_cases}")

Intelligent Chunking

from src.core.chunking import ChunkingOrchestrator

orchestrator = ChunkingOrchestrator()
chunks = orchestrator.chunk_document(
    'large_document.xml',
    strategy='auto'  # Automatically selects best strategy
)

for chunk in chunks:
    print(f"Chunk {chunk.chunk_id}: ~{chunk.token_estimate} tokens")
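
One of the strategy names listed in the Quick Start is 'sliding_window'. The general technique can be sketched as below; this is a generic illustration and not the framework's implementation. The function name and the `window_size` and `overlap` parameters are assumptions made for the example.

```python
def sliding_window(tokens, window_size, overlap):
    """Split a token list into windows of window_size sharing `overlap` tokens."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be >= 0 and smaller than window_size")
    step = window_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


words = "a b c d e f g h i j".split()
windows = sliding_window(words, window_size=4, overlap=1)
for window in windows:
    print(window)
```

The overlap means that text near a window boundary appears in both neighbouring windows, which helps when a sentence or element straddles the cut.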

🧪 Testing & Validation

Production-Tested Performance

✅ 100% Success Rate: All 71 XML files processed successfully
✅ 2,752 Chunks Generated: Optimal segmentation across diverse document types
✅ 54 Document Types: Comprehensive coverage from ServiceNow to SCAP to Maven
✅ Zero Dependencies: Pure Python stdlib implementation

Test Coverage

# Run comprehensive end-to-end test
python test_end_to_end_workflow.py

# Run individual component tests  
python test_all_chunking.py        # Chunking strategies
python test_servicenow_analysis.py # ServiceNow handler validation
python test_scap_analysis.py       # Security document analysis

Real-World Test Data

Enterprise Systems: ServiceNow incident exports (8 files)
Security Documents: SCAP/XCCDF compliance reports (4 files)
Build Configurations: Maven, Ant, Ivy projects (12 files)
Enterprise Config: Spring, Hibernate, Log4j (15 files)
Content & APIs: DocBook, RSS, WADL, Sitemaps (32 files)

🤖 AI Integration & Use Cases

AI Workflow Overview

graph LR
    A[XML Documents] --> B[XML Analysis Framework]
    B --> C[Document Analysis<br/>29 Specialized Handlers]
    B --> D[Smart Chunking<br/>Token-Optimized]
    B --> E[AI-Ready Output<br/>Structured JSON]
    
    E --> F[Vector Store<br/>Semantic Search]
    E --> G[Graph Database<br/>Relationships]
    E --> H[LLM Agent<br/>Intelligence]
    
    F --> I[Security Intelligence]
    G --> J[DevOps Automation] 
    H --> K[Knowledge Management]
    
    style B fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style E fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    style I fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    style J fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    style K fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px

See Complete AI Integration Guide for detailed workflows, implementation examples, and advanced use cases.

🔐 Security Intelligence Applications

SCAP Compliance Monitoring: Automated vulnerability assessment and risk scoring
SAML Security Analysis: Authentication flow security validation and threat detection
Log4j Vulnerability Detection: CVE scanning and automated remediation guidance
SOAP Security Assessment: Web service configuration security review

⚙️ DevOps & Configuration Intelligence

Dependency Risk Analysis: Maven/Ant/Ivy vulnerability scanning and upgrade planning
Configuration Drift Detection: Spring/Hibernate consistency monitoring
Build Optimization: Performance analysis and security hardening recommendations
Technical Debt Assessment: Legacy system modernization planning

🏢 Enterprise System Intelligence

ServiceNow Process Mining: Incident pattern analysis and workflow optimization
Cross-System Correlation: Configuration impact analysis and change management
Compliance Automation: Regulatory requirement mapping and validation

📚 Knowledge Management Applications

Technical Documentation Search: Semantic search across DocBook, API documentation
Content Intelligence: RSS/Atom trend analysis and topic extraction
API Discovery: WADL/WSDL service catalog and integration recommendations

🔬 Production Metrics & Performance

Framework Statistics

✅ 100% Success Rate: 71/71 files processed without errors
📊 2,752 Chunks Generated: Optimal 38.8 avg chunks per document
🎯 54 Document Types: Comprehensive XML format coverage
⚡ High Performance: 0.015s average processing time per document
🏗️ Zero Dependencies: Pure Python standard library implementation

Handler Confidence Levels

100% Confidence: XML Schema (XSD), Maven POM, Log4j, RSS/Atom, Sitemaps
95% Confidence: ServiceNow, Apache Ant, Ivy, Spring, Hibernate, SAML, SOAP
90% Confidence: SCAP/XCCDF, DocBook, WADL/WSDL
Intelligent Fallback: Generic XML handler for unknown formats

🚀 Extending the Framework

Adding New Handlers

from src.core.analyzer import XMLHandler, SpecializedAnalysis

class CustomHandler(XMLHandler):
    def can_handle(self, root, namespaces):
        return root.tag == 'custom-format', 1.0
    
    def analyze(self, root, file_path):
        return SpecializedAnalysis(
            document_type="Custom Format",
            key_findings={...},
            ai_use_cases=["Custom AI application"]
        )
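
One detail worth keeping in mind when writing `can_handle`: if the root element is parsed with `xml.etree.ElementTree` (an assumption here, based on the framework being stdlib-only), a document that declares a default namespace has a root tag of the form `{namespace-uri}name`, so a plain string comparison against `'custom-format'` would not match. A small helper, hypothetical and shown only as a sketch, can strip the namespace before comparing:

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace-uri}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


plain = ET.fromstring("<custom-format/>")
namespaced = ET.fromstring('<custom-format xmlns="urn:example:custom"/>')

print(plain.tag)
print(namespaced.tag)
print(local_name(namespaced.tag) == "custom-format")
```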

Custom Chunking Strategies

from src.core.chunking import XMLChunkingStrategy

class CustomChunking(XMLChunkingStrategy):
    def chunk_document(self, file_path, analysis_result):
        # Custom chunking logic goes here; collect the resulting chunks in a list
        chunks = []
        return chunks
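
As a concrete idea of what the custom logic might contain, the sketch below emits one chunk per top-level child element. It is standalone and does not subclass `XMLChunkingStrategy`; `SimpleChunk` is a stand-in that mirrors the chunk attributes used elsewhere in this README, and the characters-divided-by-four token estimate is a rough heuristic assumed for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class SimpleChunk:
    """Stand-in mirroring the chunk attributes used in this README."""
    chunk_id: str
    content: str
    token_estimate: int


def chunk_by_top_level_children(xml_text):
    """Emit one chunk per direct child of the root element."""
    root = ET.fromstring(xml_text)
    chunks = []
    for index, child in enumerate(root):
        # Serialize the child subtree; strip() drops trailing inter-element whitespace.
        content = ET.tostring(child, encoding="unicode").strip()
        chunks.append(SimpleChunk(
            chunk_id=f"{root.tag}_{index}",
            content=content,
            token_estimate=max(1, len(content) // 4),  # rough heuristic, an assumption
        ))
    return chunks


sample_chunks = chunk_by_top_level_children("<root><a>one</a><b>two</b></root>")
for chunk in sample_chunks:
    print(chunk.chunk_id, chunk.token_estimate, chunk.content)
```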

📊 Real Production Output Examples

ServiceNow Incident Analysis

{
  "document_summary": {
    "document_type": "ServiceNow Incident",
    "type_confidence": 0.95,
    "handler_used": "ServiceNowHandler",
    "file_size_mb": 0.029
  },
  "key_insights": {
    "data_highlights": {
      "state": "7", "priority": "4", "impact": "3",
      "assignment_group": "REDACTED_GROUP",
      "resolution_time": "240 days, 0:45:51",
      "journal_analysis": {
        "total_entries": 9,
        "unique_contributors": 1
      }
    },
    "ai_applications": [
      "Incident pattern analysis",
      "Resolution time prediction", 
      "Workload optimization"
    ]
  },
  "structured_content": {
    "chunking_strategy": "content_aware_medium",
    "total_chunks": 75,
    "quality_metrics": {
      "overall_readiness": 0.87
    }
  }
}

Log4j Security Analysis

{
  "document_summary": {
    "document_type": "Log4j Configuration",
    "type_confidence": 1.0,
    "handler_used": "Log4jConfigHandler"
  },
  "key_insights": {
    "data_highlights": {
      "security_concerns": {
        "security_risks": ["External socket appender detected"],
        "log4shell_vulnerable": false,
        "external_connections": [{"host": "log-server.example.com"}]
      },
      "performance": {
        "async_appenders": 1,
        "performance_risks": ["Location info impacts performance"]
      }
    },
    "ai_applications": [
      "Vulnerability assessment",
      "Performance optimization",
      "Security hardening"
    ]
  },
  "structured_content": {
    "total_chunks": 19,
    "chunking_strategy": "hierarchical_small"
  }
}
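
Output in this shape is straightforward to consume downstream. As a minimal sketch, the snippet below loads a trimmed copy of the Log4j example and collects its security findings. `security_findings` is a hypothetical helper, and the "Log4Shell vulnerable" label is wording chosen for this example rather than framework output.

```python
import json

# Trimmed copy of the Log4j example output shown above.
report = json.loads("""
{
  "document_summary": {"document_type": "Log4j Configuration", "type_confidence": 1.0},
  "key_insights": {
    "data_highlights": {
      "security_concerns": {
        "security_risks": ["External socket appender detected"],
        "log4shell_vulnerable": false
      }
    }
  }
}
""")


def security_findings(report):
    """Collect security findings, tolerating missing sections."""
    concerns = (
        report.get("key_insights", {})
        .get("data_highlights", {})
        .get("security_concerns", {})
    )
    findings = list(concerns.get("security_risks", []))
    if concerns.get("log4shell_vulnerable"):
        findings.append("Log4Shell vulnerable")
    return findings


print(security_findings(report))
```

Using `.get` with empty defaults keeps the helper usable on reports from other handlers that lack a `security_concerns` section.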

🤝 Contributing

We welcome contributions! Whether you're adding new XML handlers, improving chunking algorithms, or enhancing AI integrations, your contributions help make XML analysis more accessible and powerful.

Priority contribution areas:

🎯 New XML format handlers (ERP, CRM, healthcare, government)
⚡ Enhanced chunking algorithms and strategies
🚀 Performance optimizations for large files
🤖 Advanced AI/ML integration examples
📝 Documentation and usage examples

👉 See CONTRIBUTING.md for complete guidelines, development setup, and submission process.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Designed as part of the AI Building Blocks initiative
Built for the modern AI/ML ecosystem
Community-driven XML format support

Release history

2.0.0: Oct 28, 2025
1.4.4: Aug 12, 2025
1.4.3: Aug 12, 2025
1.4.2: Aug 12, 2025
1.4.1: Aug 12, 2025
1.4.0: Aug 12, 2025
1.3.1: Jul 31, 2025
1.3.0: Jul 29, 2025
1.2.13: Jul 28, 2025
1.2.12: Jul 27, 2025
1.2.11: Jul 27, 2025
1.2.10: Jul 27, 2025
1.2.9: Jul 27, 2025
1.2.8: Jul 27, 2025
1.2.7: Jul 27, 2025
1.2.6: Jul 27, 2025
1.2.5: Jul 26, 2025
1.2.4: Jul 26, 2025
1.2.3: Jul 26, 2025
1.2.2: Jul 26, 2025
1.2.1: Jul 26, 2025
1.2.0: Jul 25, 2025
1.1.0: Jul 25, 2025
1.0.0: Jul 25, 2025 (this version)
