XML document analysis and preprocessing framework designed for AI/ML data pipelines

XML Analysis Framework

A production-ready XML document analysis and preprocessing framework with 29 specialized handlers designed for AI/ML data pipelines. It transforms XML documents into structured, AI-ready data and optimized chunks, and it processed all 71 files in its diverse test set successfully.

🚀 Quick Start

Simple API - Get Started in Seconds

import xml_analysis_framework as xaf

# 🎯 One-line analysis with specialized handlers
result = xaf.analyze("path/to/file.xml")
print(f"Document type: {result['document_type'].type_name}")
print(f"Handler used: {result['handler_used']}")

# 📊 Basic schema analysis
schema = xaf.analyze_schema("path/to/file.xml")
print(f"Elements: {schema.total_elements}, Depth: {schema.max_depth}")

# ✂️ Smart chunking for AI/ML
chunks = xaf.chunk("path/to/file.xml", strategy="auto")
print(f"Created {len(chunks)} optimized chunks")

Advanced Usage

import xml_analysis_framework as xaf

# Enhanced analysis with full results
analysis = xaf.analyze_enhanced("document.xml")
doc_type = analysis['document_type']
specialized = analysis['analysis']

print(f"Type: {doc_type.type_name} (confidence: {doc_type.confidence:.2f})")
print(f"AI use cases: {len(specialized.ai_use_cases)}")
print(f"Quality score: {specialized.quality_metrics.get('completeness_score')}")

# Different chunking strategies
hierarchical_chunks = xaf.chunk("document.xml", strategy="hierarchical")
sliding_chunks = xaf.chunk("document.xml", strategy="sliding_window") 
content_chunks = xaf.chunk("document.xml", strategy="content_aware")

# Process chunks
for chunk in hierarchical_chunks:
    print(f"Chunk {chunk.chunk_id}: {len(chunk.content)} chars")
    print(f"Type: {chunk.chunk_type}, Elements: {len(chunk.elements_included)}")
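
The `sliding_window` strategy name suggests fixed-size windows that overlap so context is not lost at chunk boundaries. The sketch below illustrates that general technique on plain text using only the standard library. It is not the framework's implementation, and the `window` and `overlap` parameters are assumptions made for illustration.

```python
from typing import List

def sliding_window_chunks(text: str, window: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size windows that overlap by `overlap` characters."""
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + window])
        if start + window >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "<item>" + "x" * 500 + "</item>"
pieces = sliding_window_chunks(sample, window=200, overlap=50)
print(len(pieces), [len(p) for p in pieces])
```

Each window repeats the last `overlap` characters of the previous one, so dropping that prefix from every window after the first reconstructs the original text.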

Expert Usage - Direct Class Access

# For advanced customization, use the classes directly
from xml_analysis_framework import XMLDocumentAnalyzer, ChunkingOrchestrator

analyzer = XMLDocumentAnalyzer(max_file_size_mb=500)
orchestrator = ChunkingOrchestrator(max_file_size_mb=1000)

# Custom analysis
result = analyzer.analyze_document("file.xml")

# Custom chunking with config
from xml_analysis_framework import XMLChunkingStrategy
config = XMLChunkingStrategy.ChunkingConfig(max_chunk_size=2000)
chunks = orchestrator.chunk_document("file.xml", result, config=config)

🎯 Key Features

1. 🏆 Production Proven Results

  • 100% Success Rate: All 71 test files processed successfully
  • 2,752 Chunks Generated: Average 38.8 optimized chunks per file
  • 54 Document Types Detected: Comprehensive XML format coverage
  • Minimal Dependencies: Only defusedxml for security + Python stdlib

2. 🧠 29 Specialized XML Handlers

Enterprise-grade document intelligence:

  • Security & Compliance: SCAP, SAML, SOAP (90-100% confidence)
  • DevOps & Build: Maven POM, Ant, Ivy, Spring, Log4j (95-100% confidence)
  • Content & Documentation: RSS/Atom, DocBook, XHTML, SVG
  • Enterprise Systems: ServiceNow, Hibernate, Struts configurations
  • Data & APIs: GPX, KML, GraphML, WADL/WSDL, XML Schemas

3. ⚡ Intelligent Processing Pipeline

  • Smart Document Detection: Confidence scoring with graceful fallbacks
  • Semantic Chunking: Document-type-aware optimal segmentation
  • Token Optimization: LLM context window optimized chunks
  • Quality Assessment: Automated data quality metrics
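
Confidence scoring with a graceful fallback can be pictured as: score the document against every known type, keep the best score, and fall back to a generic type when nothing scores high enough. The sketch below is a minimal illustration of that idea, not the framework's code. The detector functions, the scores, and the `threshold` parameter are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Tuple

def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]

def maven_score(root: ET.Element) -> float:
    has_model_version = any(local_name(child.tag) == "modelVersion" for child in root)
    return 0.95 if local_name(root.tag) == "project" and has_model_version else 0.0

def rss_score(root: ET.Element) -> float:
    return 1.0 if local_name(root.tag) == "rss" else 0.0

# (type name, scoring function) pairs; names and scores are illustrative only.
DETECTORS: List[Tuple[str, Callable[[ET.Element], float]]] = [
    ("Maven POM", maven_score),
    ("RSS", rss_score),
]

def detect_type(xml_text: str, threshold: float = 0.5) -> Tuple[str, float]:
    """Return the best-scoring type, or a generic fallback below the threshold."""
    # Parsing a trusted literal here; untrusted input should go through defusedxml.
    root = ET.fromstring(xml_text)
    name, score = max(((n, fn(root)) for n, fn in DETECTORS), key=lambda pair: pair[1])
    return (name, score) if score >= threshold else ("Generic XML", score)

print(detect_type("<rss version='2.0'><channel/></rss>"))
print(detect_type("<inventory><item/></inventory>"))
```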

4. 🤖 AI-Ready Integration

  • Vector Store Ready: Structured embeddings with rich metadata
  • Graph Database Compatible: Relationship and dependency mapping
  • LLM Agent Optimized: Context-aware, actionable insights
  • Complete AI Workflows: See AI Integration Guide
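
As a rough picture of what "vector store ready" could mean in practice, the sketch below turns a chunk into a JSON-serializable record with text plus metadata. The `Chunk` dataclass is a hypothetical stand-in that mirrors the chunk attributes used in this README (`chunk_id`, `content`, `chunk_type`, `elements_included`). The record layout is an assumption, not a format the framework is known to emit.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Chunk:
    """Hypothetical stand-in for a framework chunk object."""
    chunk_id: str
    content: str
    chunk_type: str
    elements_included: List[str] = field(default_factory=list)

def to_record(chunk: Chunk, source: str, document_type: str) -> Dict[str, Any]:
    """Build a record that most vector stores can ingest: id, text, metadata."""
    return {
        "id": f"{source}#{chunk.chunk_id}",
        "text": chunk.content,
        "metadata": {
            "source": source,
            "document_type": document_type,
            "chunk_type": chunk.chunk_type,
            "elements": chunk.elements_included,
            "char_count": len(chunk.content),
        },
    }

chunk = Chunk("c1", "<dependency>junit</dependency>", "hierarchical", ["dependency"])
record = to_record(chunk, source="pom.xml", document_type="Maven POM")
print(json.dumps(record, indent=2))
```

The embedding step itself is left out on purpose, since it depends on whichever model and store you use.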

📋 Supported Document Types (29 Handlers)

| Category | Handlers | Confidence | Use Cases |
|---|---|---|---|
| 🔐 Security & Compliance | SCAP, SAML, SOAP | 90-100% | Vulnerability assessment, compliance monitoring, security posture analysis |
| ⚙️ DevOps & Build Tools | Maven POM, Ant, Ivy | 95-100% | Dependency analysis, build optimization, technical debt assessment |
| 🏢 Enterprise Configuration | Spring, Hibernate, Struts, Log4j | 95-100% | Configuration validation, security scanning, modernization planning |
| 📄 Content & Documentation | RSS, DocBook, XHTML, SVG | 90-100% | Content intelligence, documentation search, knowledge management |
| 🗂️ Enterprise Systems | ServiceNow, XML Sitemap | 95-100% | Incident analysis, process automation, system integration |
| 🌍 Geospatial & Data | GPX, KML, GraphML | 85-95% | Route optimization, geographic analysis, network intelligence |
| 🔌 API & Integration | WADL, WSDL, XLIFF | 90-95% | Service discovery, integration planning, translation workflows |
| 📐 Schemas & Standards | XML Schema (XSD) | 100% | Schema validation, data modeling, API documentation |

🏗️ Architecture

xml-analysis-framework/
├── README.md                   # Project overview
├── LICENSE                     # MIT license
├── requirements.txt            # Dependencies (defusedxml + Python stdlib)
├── setup.py                    # Package installation
├── .gitignore                  # Git ignore patterns
├── .github/workflows/          # CI/CD pipelines
│
├── src/                        # Source code
│   ├── core/                   # Core framework
│   │   ├── analyzer.py         # Main analysis engine
│   │   ├── schema_analyzer.py  # XML schema analysis
│   │   └── chunking.py         # Chunking strategies
│   ├── handlers/               # 29 specialized handlers
│   └── utils/                  # Utility functions
│
├── tests/                      # Comprehensive test suite
│   ├── unit/                   # Handler unit tests (16 files)
│   ├── integration/            # Integration tests (11 files)
│   ├── comprehensive/          # Full system tests (4 files)
│   └── run_all_tests.py        # Master test runner
│
├── examples/                   # Usage examples
│   ├── basic_analysis.py       # Simple analysis
│   └── enhanced_analysis.py    # Full featured analysis
│
├── scripts/                    # Utility scripts
│   ├── collect_test_files.py   # Test data collection
│   └── debug/                  # Debug utilities
│
├── docs/                       # Documentation
│   ├── architecture/           # Design documents
│   ├── guides/                 # User guides
│   └── api/                    # API documentation
│
├── sample_data/                # Test XML files (99+ examples)
│   ├── test_files/             # Real-world examples
│   └── test_files_synthetic/   # Generated test cases
│
└── artifacts/                  # Build artifacts, results
    ├── analysis_results/       # JSON analysis outputs
    └── reports/                # Generated reports

🔒 Security

XML Security Protection

This framework uses defusedxml to protect against common XML security vulnerabilities:

  • XXE (XML External Entity) attacks: Prevents reading local files or making network requests
  • Billion Laughs attack: Prevents exponential entity expansion DoS attacks
  • DTD retrieval: Blocks external DTD fetching to prevent data exfiltration
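
To make the entity-expansion threat concrete, the sketch below uses the standard library's expat bindings to abort parsing as soon as a document declares an entity, which is the point where a Billion Laughs payload starts. This only illustrates the general idea behind such protection. It is not how defusedxml or this framework is implemented, and the names `reject_unsafe_xml` and `ForbiddenXMLError` are hypothetical.

```python
import xml.parsers.expat

class ForbiddenXMLError(ValueError):
    """Raised when a document declares entities (or, optionally, any DTD)."""

def reject_unsafe_xml(data: bytes, forbid_dtd: bool = False) -> None:
    """Parse `data` with expat and raise as soon as an unsafe declaration appears."""
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        if forbid_dtd:
            raise ForbiddenXMLError(f"DOCTYPE not allowed: {name}")

    def on_entity(name, is_parameter_entity, value, base,
                  system_id, public_id, notation_name):
        raise ForbiddenXMLError(f"entity declaration not allowed: {name}")

    parser.StartDoctypeDeclHandler = on_doctype
    parser.EntityDeclHandler = on_entity
    parser.Parse(data, True)  # malformed XML raises xml.parsers.expat.ExpatError

BILLION_LAUGHS = b"""<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol2 "&lol;&lol;&lol;&lol;">
]>
<lolz>&lol2;</lolz>"""

reject_unsafe_xml(b"<doc><item>ok</item></doc>")  # passes silently
try:
    reject_unsafe_xml(BILLION_LAUGHS)
except ForbiddenXMLError as exc:
    print(f"blocked: {exc}")
```

For real workloads, rely on defusedxml (as the framework does) rather than a hand-rolled check like this one.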

Security Features

# All XML parsing is automatically protected
from core.analyzer import XMLDocumentAnalyzer

analyzer = XMLDocumentAnalyzer()
# Safe parsing - malicious XML will be rejected
result = analyzer.analyze_document("potentially_malicious.xml")

if result.get('security_issue'):
    print(f"Security threat detected: {result['error']}")

Best Practices

  1. Always use the framework's parsers - Never use xml.etree.ElementTree directly
  2. Validate file sizes - Set reasonable limits for your use case
  3. Sanitize file paths - Ensure input paths are properly validated
  4. Monitor for security exceptions - Log and alert on security-blocked parsing attempts
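
Path sanitization (practice 3) is left to the caller. One common approach, sketched here with `pathlib` only, is to resolve every user-supplied path against a fixed base directory and reject anything that lands outside it. The function name `resolve_xml_path`, the `uploads` directory, and the `.xml` suffix check are assumptions for illustration.

```python
from pathlib import Path

def resolve_xml_path(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting traversal and non-.xml names."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    try:
        candidate.relative_to(base)  # raises ValueError if candidate is outside base
    except ValueError:
        raise ValueError(f"path escapes base directory: {user_path}") from None
    if candidate.suffix.lower() != ".xml":
        raise ValueError(f"not an .xml file: {user_path}")
    return candidate

for attempt in ("reports/scan.xml", "../../etc/passwd", "notes.txt"):
    try:
        print("ok:", resolve_xml_path("uploads", attempt).name)
    except ValueError as exc:
        print("rejected:", exc)
```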

File Size Limits

The framework includes built-in file size limits to prevent memory exhaustion:

# Built-in size limits in analyzer and chunking
from core.analyzer import XMLDocumentAnalyzer
from core.chunking import ChunkingOrchestrator

# Create analyzer with 50MB limit
analyzer = XMLDocumentAnalyzer(max_file_size_mb=50.0)

# Create chunking orchestrator with 100MB limit  
orchestrator = ChunkingOrchestrator(max_file_size_mb=100.0)

# Utility functions for easy setup
from utils import create_analyzer_with_limits, safe_analyze_document, FileSizeLimits

# Use predefined limits
analyzer = create_analyzer_with_limits(FileSizeLimits.PRODUCTION_MEDIUM)  # 50MB
safe_result = safe_analyze_document("file.xml", FileSizeLimits.REAL_TIME)  # 5MB
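
Conceptually, a size limit is a check made before any parsing starts. The sketch below shows that idea with the standard library. It is not the framework's code: `check_file_size` and `FileTooLargeError` are hypothetical names, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
import os
import tempfile

BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes

class FileTooLargeError(ValueError):
    """Raised when a file exceeds the configured size limit."""

def check_file_size(path: str, max_file_size_mb: float) -> float:
    """Return the file size in MB, or raise before any parsing is attempted."""
    size_mb = os.path.getsize(path) / BYTES_PER_MB
    if size_mb > max_file_size_mb:
        raise FileTooLargeError(
            f"{path} is {size_mb:.3f} MB, limit is {max_file_size_mb} MB"
        )
    return size_mb

# Demo with a small temporary file and a deliberately tiny limit.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write("<doc>" + "x" * 2048 + "</doc>")
    demo_path = handle.name
try:
    print(f"within 1 MB: {check_file_size(demo_path, 1.0):.4f} MB")
    check_file_size(demo_path, 0.001)
except FileTooLargeError as exc:
    print(f"rejected: {exc}")
finally:
    os.remove(demo_path)
```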

🔧 Installation

# Install from source
git clone https://github.com/redhat-ai-americas/xml-analysis-framework.git
cd xml-analysis-framework
pip install -e .

# Or install development dependencies
pip install -e .[dev]

Dependencies

  • defusedxml (0.7.1+): For secure XML parsing protection
  • Python standard library (3.8+) for all other functionality

📖 Usage Examples

Basic Analysis

from core.schema_analyzer import XMLSchemaAnalyzer

analyzer = XMLSchemaAnalyzer()
schema = analyzer.analyze_file('document.xml')

# Access schema properties
print(f"Root element: {schema.root_element}")
print(f"Total elements: {schema.total_elements}")
print(f"Namespaces: {schema.namespaces}")
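
To show what properties such as `total_elements`, `max_depth`, and `namespaces` could mean, here is a standard-library sketch that computes comparable numbers in one streaming pass. It is an illustration only. The framework's exact definitions (for example, whether the root counts as depth 1) are not stated in this README, so those choices are assumptions.

```python
import io
import xml.etree.ElementTree as ET
from typing import Any, Dict

def summarize_structure(xml_text: str) -> Dict[str, Any]:
    """Count elements, track nesting depth, and collect namespace declarations."""
    namespaces: Dict[str, str] = {}
    depth = max_depth = total = 0
    root_tag = None
    stream = io.BytesIO(xml_text.encode("utf-8"))
    # Trusted input only; untrusted XML should be parsed via defusedxml.
    for event, payload in ET.iterparse(stream, events=("start", "end", "start-ns")):
        if event == "start-ns":
            prefix, uri = payload
            namespaces[prefix] = uri
        elif event == "start":
            if root_tag is None:
                root_tag = payload.tag
            total += 1
            depth += 1  # assumption: the root element is depth 1
            max_depth = max(max_depth, depth)
        else:  # "end"
            depth -= 1
    return {"root_element": root_tag, "total_elements": total,
            "max_depth": max_depth, "namespaces": namespaces}

print(summarize_structure('<a xmlns:x="urn:x"><b><x:c/></b><b/></a>'))
```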

Enhanced Analysis with Specialized Handlers

from core.analyzer import XMLDocumentAnalyzer

analyzer = XMLDocumentAnalyzer()
result = analyzer.analyze_document('maven-project.xml')

print(f"Document Type: {result['document_type'].type_name}")
print(f"Confidence: {result['confidence']:.2f}")
print(f"Handler Used: {result['handler_used']}")
print(f"AI Use Cases: {result['analysis'].ai_use_cases}")

Safe Analysis with File Validation

from utils import safe_analyze_document, FileSizeLimits

# Safe analysis with comprehensive validation
result = safe_analyze_document(
    'document.xml', 
    max_size_mb=FileSizeLimits.PRODUCTION_MEDIUM
)

if result.get('error'):
    print(f"Analysis failed: {result['error']}")
else:
    print(f"Success: {result['document_type'].type_name}")

Intelligent Chunking

from core.chunking import ChunkingOrchestrator, XMLChunkingStrategy

orchestrator = ChunkingOrchestrator()
chunks = orchestrator.chunk_document(
    'large_document.xml',
    specialized_analysis={},  # Analysis result from XMLDocumentAnalyzer
    strategy='auto'
)

# Token estimation
token_estimator = XMLChunkingStrategy()
for chunk in chunks:
    token_count = token_estimator.estimate_tokens(chunk.content)
    print(f"Chunk {chunk.chunk_id}: ~{token_count} tokens")
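
How `estimate_tokens` works internally is not shown in this README. A common rough heuristic, assumed here, is about four characters per token. The sketch below applies that heuristic and then greedily packs chunks into batches under a token budget. The function names and the 4.0 ratio are assumptions, and a real tokenizer will give different counts.

```python
import math
from typing import List

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate: character count divided by an assumed ratio."""
    return math.ceil(len(text) / chars_per_token) if text else 0

def pack_into_batches(chunks: List[str], budget: int) -> List[List[str]]:
    """Greedily group consecutive chunks so each batch stays within `budget` tokens."""
    batches: List[List[str]] = []
    current: List[str] = []
    used = 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(chunk)  # an oversized chunk still gets its own batch
        used += cost
    if current:
        batches.append(current)
    return batches

texts = ["a" * 40, "b" * 40, "c" * 40]
print([len(batch) for batch in pack_into_batches(texts, budget=25)])
```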

🧪 Testing & Validation

Production-Tested Performance

  • ✅ 100% Success Rate: All 71 XML files processed successfully
  • ✅ 2,752 Chunks Generated: Optimal segmentation across diverse document types
  • ✅ 54 Document Types: Comprehensive coverage from ServiceNow to SCAP to Maven
  • ✅ Secure by Default: Protected against XXE and billion laughs attacks

Test Coverage

# Run comprehensive end-to-end test
python test_end_to_end_workflow.py

# Run individual component tests  
python test_all_chunking.py        # Chunking strategies
python test_servicenow_analysis.py # ServiceNow handler validation
python test_scap_analysis.py       # Security document analysis

Real-World Test Data

  • Enterprise Systems: ServiceNow incident exports (8 files)
  • Security Documents: SCAP/XCCDF compliance reports (4 files)
  • Build Configurations: Maven, Ant, Ivy projects (12 files)
  • Enterprise Config: Spring, Hibernate, Log4j (15 files)
  • Content & APIs: DocBook, RSS, WADL, Sitemaps (32 files)
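
The per-category counts above can be cross-checked against the headline figures quoted elsewhere in this README (71 files, 2,752 chunks, about 38.8 chunks per file). A quick arithmetic check, using only numbers stated in this document:

```python
# File counts per category, as listed under "Real-World Test Data".
test_files = {
    "Enterprise Systems": 8,
    "Security Documents": 4,
    "Build Configurations": 12,
    "Enterprise Config": 15,
    "Content & APIs": 32,
}

total_files = sum(test_files.values())
total_chunks = 2752  # headline figure from this README
average_chunks = total_chunks / total_files

print(total_files)               # 71
print(round(average_chunks, 1))  # 38.8
```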

🤖 AI Integration & Use Cases

AI Workflow Overview

graph LR
    A[XML Documents] --> B[XML Analysis Framework]
    B --> C[Document Analysis<br/>29 Specialized Handlers]
    B --> D[Smart Chunking<br/>Token-Optimized]
    B --> E[AI-Ready Output<br/>Structured JSON]
  
    E --> F[Vector Store<br/>Semantic Search]
    E --> G[Graph Database<br/>Relationships]
    E --> H[LLM Agent<br/>Intelligence]
  
    F --> I[Security Intelligence]
    G --> J[DevOps Automation] 
    H --> K[Knowledge Management]
  
    style B fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style E fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    style I fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    style J fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    style K fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px

See the Complete AI Integration Guide for detailed workflows, implementation examples, and advanced use cases.

🔐 Security Intelligence Applications

  • SCAP Compliance Monitoring: Automated vulnerability assessment and risk scoring
  • SAML Security Analysis: Authentication flow security validation and threat detection
  • Log4j Vulnerability Detection: CVE scanning and automated remediation guidance
  • SOAP Security Assessment: Web service configuration security review

⚙️ DevOps & Configuration Intelligence

  • Dependency Risk Analysis: Maven/Ant/Ivy vulnerability scanning and upgrade planning
  • Configuration Drift Detection: Spring/Hibernate consistency monitoring
  • Build Optimization: Performance analysis and security hardening recommendations
  • Technical Debt Assessment: Legacy system modernization planning

🏢 Enterprise System Intelligence

  • ServiceNow Process Mining: Incident pattern analysis and workflow optimization
  • Cross-System Correlation: Configuration impact analysis and change management
  • Compliance Automation: Regulatory requirement mapping and validation

📚 Knowledge Management Applications

  • Technical Documentation Search: Semantic search across DocBook, API documentation
  • Content Intelligence: RSS/Atom trend analysis and topic extraction
  • API Discovery: WADL/WSDL service catalog and integration recommendations

🔬 Production Metrics & Performance

Framework Statistics

  • ✅ 100% Success Rate: 71/71 files processed without errors
  • 📊 2,752 Chunks Generated: Optimal 38.8 avg chunks per document
  • 🎯 54 Document Types: Comprehensive XML format coverage
  • ⚡ High Performance: 0.015s average processing time per document
  • 🔒 Secure Parsing: defusedxml protection against XML attacks

Handler Confidence Levels

  • 100% Confidence: XML Schema (XSD), Maven POM, Log4j, RSS/Atom, Sitemaps
  • 95% Confidence: ServiceNow, Apache Ant, Ivy, Spring, Hibernate, SAML, SOAP
  • 90% Confidence: SCAP/XCCDF, DocBook, WADL/WSDL
  • Intelligent Fallback: Generic XML handler for unknown formats

🚀 Extending the Framework

Adding New Handlers

from core.analyzer import XMLHandler, SpecializedAnalysis, DocumentTypeInfo

class CustomHandler(XMLHandler):
    def can_handle(self, root, namespaces):
        # Returns a (matches, confidence) tuple
        return root.tag == 'custom-format', 1.0
  
    def detect_type(self, root, namespaces):
        return DocumentTypeInfo(
            type_name="Custom Format",
            confidence=1.0,
            version="1.0"
        )
  
    def analyze(self, root, file_path):
        return SpecializedAnalysis(
            document_type="Custom Format",
            key_findings={"custom_data": "value"},
            ai_use_cases=["Custom AI application"],
            structured_data={"extracted": "data"}
        )

Custom Chunking Strategies

from core.chunking import XMLChunkingStrategy, ChunkingOrchestrator

class CustomChunking(XMLChunkingStrategy):
    def chunk_document(self, file_path, specialized_analysis=None):
        # Custom chunking logic: build and return a list of chunk objects
        chunks = []
        return chunks

# Register custom strategy
orchestrator = ChunkingOrchestrator()
orchestrator.strategies['custom'] = CustomChunking
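
As one idea for what custom chunking logic might do, the sketch below groups the root's direct children into chunks that stay under a character limit. It is standalone standard-library code, not a subclass of `XMLChunkingStrategy`, and it returns plain strings rather than the framework's chunk objects. The `max_chars` parameter and the function name are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List

def chunk_by_children(xml_text: str, max_chars: int = 2000) -> List[str]:
    """Group serialized top-level children into chunks of at most max_chars."""
    # Trusted input only; untrusted XML should be parsed via defusedxml.
    root = ET.fromstring(xml_text)
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for child in root:
        piece = ET.tostring(child, encoding="unicode")
        if current and size + len(piece) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(piece)  # a single oversized child becomes its own chunk
        size += len(piece)
    if current:
        chunks.append("".join(current))
    return chunks

doc = "<r><a>1</a><b>2</b><c>3</c></r>"
print(chunk_by_children(doc, max_chars=16))
```

Splitting on element boundaries keeps every chunk well-formed on its own, which a plain character window cannot guarantee.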

📊 Real Production Output Examples

ServiceNow Incident Analysis

{
  "document_summary": {
    "document_type": "ServiceNow Incident",
    "type_confidence": 0.95,
    "handler_used": "ServiceNowHandler",
    "file_size_mb": 0.029
  },
  "key_insights": {
    "data_highlights": {
      "state": "7", "priority": "4", "impact": "3",
      "assignment_group": "REDACTED_GROUP",
      "resolution_time": "240 days, 0:45:51",
      "journal_analysis": {
        "total_entries": 9,
        "unique_contributors": 1
      }
    },
    "ai_applications": [
      "Incident pattern analysis",
      "Resolution time prediction", 
      "Workload optimization"
    ]
  },
  "structured_content": {
    "chunking_strategy": "content_aware_medium",
    "total_chunks": 75,
    "quality_metrics": {
      "overall_readiness": 0.87
    }
  }
}

Log4j Security Analysis

{
  "document_summary": {
    "document_type": "Log4j Configuration",
    "type_confidence": 1.0,
    "handler_used": "Log4jConfigHandler"
  },
  "key_insights": {
    "data_highlights": {
      "security_concerns": {
        "security_risks": ["External socket appender detected"],
        "log4shell_vulnerable": false,
        "external_connections": [{"host": "log-server.example.com"}]
      },
      "performance": {
        "async_appenders": 1,
        "performance_risks": ["Location info impacts performance"]
      }
    },
    "ai_applications": [
      "Vulnerability assessment",
      "Performance optimization",
      "Security hardening"
    ]
  },
  "structured_content": {
    "total_chunks": 19,
    "chunking_strategy": "hierarchical_small"
  }
}
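
Output in this shape is straightforward to consume downstream. The sketch below loads an abbreviated copy of the Log4j example above and pulls out the security findings. The field names come from that example. The helper `security_findings` is hypothetical, and real reports may contain fields not shown here.

```python
import json
from typing import Any, Dict

# Abbreviated copy of the Log4j example output shown above.
REPORT_JSON = """
{
  "document_summary": {"document_type": "Log4j Configuration", "type_confidence": 1.0},
  "key_insights": {
    "data_highlights": {
      "security_concerns": {
        "security_risks": ["External socket appender detected"],
        "log4shell_vulnerable": false,
        "external_connections": [{"host": "log-server.example.com"}]
      }
    }
  },
  "structured_content": {"total_chunks": 19, "chunking_strategy": "hierarchical_small"}
}
"""

def security_findings(report: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the security-related fields, tolerating missing sections."""
    concerns = (report.get("key_insights", {})
                      .get("data_highlights", {})
                      .get("security_concerns", {}))
    return {
        "log4shell_vulnerable": bool(concerns.get("log4shell_vulnerable", False)),
        "risks": list(concerns.get("security_risks", [])),
        "hosts": [c["host"] for c in concerns.get("external_connections", []) if "host" in c],
    }

findings = security_findings(json.loads(REPORT_JSON))
print(findings)
```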

🤝 Contributing

We welcome contributions! Whether you're adding new XML handlers, improving chunking algorithms, or enhancing AI integrations, your contributions help make XML analysis more accessible and powerful.

Priority contribution areas:

  • 🎯 New XML format handlers (ERP, CRM, healthcare, government)
  • ⚡ Enhanced chunking algorithms and strategies
  • 🚀 Performance optimizations for large files
  • 🤖 Advanced AI/ML integration examples
  • 📝 Documentation and usage examples

👉 See CONTRIBUTING.md for complete guidelines, development setup, and submission process.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Designed as part of the AI Building Blocks initiative
  • Built for the modern AI/ML ecosystem
  • Community-driven XML format support
