# XML Analysis Framework
A production-ready XML document analysis and preprocessing framework with 29 specialized handlers designed for AI/ML data pipelines. Transform any XML document into structured, AI-ready data and optimized chunks with 100% success rate across 71 diverse test files.
## Quick Start

### Simple API - Get Started in Seconds

```python
import xml_analysis_framework as xaf

# One-line analysis with specialized handlers
result = xaf.analyze("path/to/file.xml")
print(f"Document type: {result['document_type'].type_name}")
print(f"Handler used: {result['handler_used']}")

# Basic schema analysis
schema = xaf.analyze_schema("path/to/file.xml")
print(f"Elements: {schema.total_elements}, Depth: {schema.max_depth}")

# Smart chunking for AI/ML
chunks = xaf.chunk("path/to/file.xml", strategy="auto")
print(f"Created {len(chunks)} optimized chunks")
```
### Advanced Usage

```python
import xml_analysis_framework as xaf

# Enhanced analysis with full results
analysis = xaf.analyze_enhanced("document.xml")
doc_type = analysis['document_type']
specialized = analysis['analysis']  # the SpecializedAnalysis object

print(f"Type: {doc_type.type_name} (confidence: {doc_type.confidence:.2f})")
if specialized:
    print(f"AI use cases: {len(specialized.ai_use_cases)}")
    if specialized.quality_metrics:
        print(f"Quality score: {specialized.quality_metrics.get('completeness_score')}")
    else:
        print("Quality metrics: Not available")

# Different chunking strategies
hierarchical_chunks = xaf.chunk("document.xml", strategy="hierarchical")
sliding_chunks = xaf.chunk("document.xml", strategy="sliding_window")
content_chunks = xaf.chunk("document.xml", strategy="content_aware")

# Process chunks
for chunk in hierarchical_chunks:
    print(f"Chunk {chunk.chunk_id}: {len(chunk.content)} chars")
    print(f"Type: {chunk.chunk_type}, Elements: {len(chunk.elements_included)}")
```
### Expert Usage - Direct Class Access

```python
# For advanced customization, use the classes directly
from xml_analysis_framework import XMLDocumentAnalyzer, ChunkingOrchestrator

analyzer = XMLDocumentAnalyzer(max_file_size_mb=500)
orchestrator = ChunkingOrchestrator(max_file_size_mb=1000)

# Custom analysis
result = analyzer.analyze_document("file.xml")

# Custom chunking with config (the analysis result can be passed directly)
from xml_analysis_framework.core.chunking import ChunkingConfig

config = ChunkingConfig(
    max_chunk_size=2000,
    min_chunk_size=300,
    overlap_size=150,
    preserve_hierarchy=True
)
chunks = orchestrator.chunk_document("file.xml", result, strategy="auto", config=config)
```
## Key Features

### 1. Production-Proven Results
- 100% Success Rate: All 71 test files processed successfully
- 2,752 Chunks Generated: Average 38.8 optimized chunks per file
- 54 Document Types Detected: Comprehensive XML format coverage
- Minimal Dependencies: Only defusedxml for security + Python stdlib
### 2. 29 Specialized XML Handlers
Enterprise-grade document intelligence:
- Security & Compliance: SCAP, SAML, SOAP (90-100% confidence)
- DevOps & Build: Maven POM, Ant, Ivy, Spring, Log4j (95-100% confidence)
- Content & Documentation: RSS/Atom, DocBook, XHTML, SVG
- Enterprise Systems: ServiceNow, Hibernate, Struts configurations
- Data & APIs: GPX, KML, GraphML, WADL/WSDL, XML Schemas
### 3. Intelligent Processing Pipeline
- Smart Document Detection: Confidence scoring with graceful fallbacks
- Semantic Chunking: Document-type-aware optimal segmentation
- Token Optimization: LLM context window optimized chunks
- Quality Assessment: Automated data quality metrics
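The token-optimization step can be illustrated with a minimal stand-alone sketch. The `rough_token_count` heuristic (roughly four characters per token) and the `fits_context` helper below are invented for illustration and are not the framework's actual estimator:

```python
# Illustrative sketch of token-aware chunk sizing (not the framework's
# internal estimator): approximate tokens as ~4 characters each and check
# whether a chunk leaves room inside a target LLM context window.

def rough_token_count(text: str) -> int:
    """Crude heuristic: ~4 characters per token for English/XML text."""
    return max(1, len(text) // 4)

def fits_context(chunk_text: str, context_tokens: int = 4096,
                 reserved_tokens: int = 512) -> bool:
    """True if the chunk leaves a prompt/response budget of reserved_tokens."""
    return rough_token_count(chunk_text) <= context_tokens - reserved_tokens

chunk = "<record><id>42</id><status>resolved</status></record>" * 50
print(rough_token_count(chunk), fits_context(chunk))
```

A real pipeline would use the tokenizer of the target model instead of a character heuristic, but the budget check stays the same shape.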
### 4. AI-Ready Integration
- Vector Store Ready: Structured embeddings with rich metadata
- Graph Database Compatible: Relationship and dependency mapping
- LLM Agent Optimized: Context-aware, actionable insights
- Complete AI Workflows: See AI Integration Guide
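As a sketch of what "vector store ready" can mean in practice, the snippet below flattens chunk objects into embedding-ready records. The `Chunk` dataclass here only mimics the chunk attributes shown elsewhere in this README (`chunk_id`, `content`, `chunk_type`, `elements_included`); the framework's real classes may differ, and `to_vector_records` is an invented helper:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    # Minimal stand-in for the framework's chunk objects
    chunk_id: str
    content: str
    chunk_type: str
    elements_included: list = field(default_factory=list)

def to_vector_records(chunks, source_file):
    """Flatten chunks into dicts ready for an embedding/vector-store pipeline."""
    return [
        {
            "id": f"{source_file}::{c.chunk_id}",
            "text": c.content,
            "metadata": {
                "source": source_file,
                "chunk_type": c.chunk_type,
                "element_count": len(c.elements_included),
            },
        }
        for c in chunks
    ]

records = to_vector_records(
    [Chunk("c1", "<incident>...</incident>", "hierarchical", ["incident"])],
    "incidents.xml",
)
print(records[0]["id"])
```

Each record keeps the text separate from the metadata so the metadata can be used for filtering at query time without being embedded.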
## Supported Document Types (29 Handlers)
| Category | Handlers | Confidence | Use Cases |
|---|---|---|---|
| Security & Compliance | SCAP, SAML, SOAP | 90-100% | Vulnerability assessment, compliance monitoring, security posture analysis |
| DevOps & Build Tools | Maven POM, Ant, Ivy | 95-100% | Dependency analysis, build optimization, technical debt assessment |
| Enterprise Configuration | Spring, Hibernate, Struts, Log4j | 95-100% | Configuration validation, security scanning, modernization planning |
| Content & Documentation | RSS, DocBook, XHTML, SVG | 90-100% | Content intelligence, documentation search, knowledge management |
| Enterprise Systems | ServiceNow, XML Sitemap | 95-100% | Incident analysis, process automation, system integration |
| Geospatial & Data | GPX, KML, GraphML | 85-95% | Route optimization, geographic analysis, network intelligence |
| API & Integration | WADL, WSDL, XLIFF | 90-95% | Service discovery, integration planning, translation workflows |
| Schemas & Standards | XML Schema (XSD) | 100% | Schema validation, data modeling, API documentation |
## Architecture

```
xml-analysis-framework/
├── README.md                  # Project overview
├── LICENSE                    # MIT license
├── requirements.txt           # Dependencies (defusedxml)
├── setup.py                   # Package installation
├── .gitignore                 # Git ignore patterns
├── .github/workflows/         # CI/CD pipelines
│
├── src/                       # Source code
│   ├── core/                  # Core framework
│   │   ├── analyzer.py        # Main analysis engine
│   │   ├── schema_analyzer.py # XML schema analysis
│   │   └── chunking.py        # Chunking strategies
│   ├── handlers/              # 29 specialized handlers
│   └── utils/                 # Utility functions
│
├── tests/                     # Comprehensive test suite
│   ├── unit/                  # Handler unit tests (16 files)
│   ├── integration/           # Integration tests (11 files)
│   ├── comprehensive/         # Full system tests (4 files)
│   └── run_all_tests.py       # Master test runner
│
├── examples/                  # Usage examples
│   ├── basic_analysis.py      # Simple analysis
│   └── enhanced_analysis.py   # Full featured analysis
│
├── scripts/                   # Utility scripts
│   ├── collect_test_files.py  # Test data collection
│   └── debug/                 # Debug utilities
│
├── docs/                      # Documentation
│   ├── architecture/          # Design documents
│   ├── guides/                # User guides
│   └── api/                   # API documentation
│
├── sample_data/               # Test XML files (99+ examples)
│   ├── test_files/            # Real-world examples
│   └── test_files_synthetic/  # Generated test cases
│
└── artifacts/                 # Build artifacts, results
    ├── analysis_results/      # JSON analysis outputs
    └── reports/               # Generated reports
```
## Security

### XML Security Protection
This framework uses defusedxml to protect against common XML security vulnerabilities:
- XXE (XML External Entity) attacks: Prevents reading local files or making network requests
- Billion Laughs attack: Prevents exponential entity expansion DoS attacks
- DTD retrieval: Blocks external DTD fetching to prevent data exfiltration
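To see this protection in action outside the framework, you can feed `defusedxml` a document that declares internal entities; parsing is refused before any expansion happens. A minimal demonstration, assuming `defusedxml` is installed:

```python
import defusedxml.ElementTree as safe_ET
from defusedxml import EntitiesForbidden

# A "billion laughs"-style payload: nested entity definitions that
# would expand explosively under a naive parser.
payload = """<?xml version="1.0"?>
<!DOCTYPE bomb [
  <!ENTITY a "ha">
  <!ENTITY b "&a;&a;&a;&a;&a;&a;&a;&a;">
]>
<bomb>&b;</bomb>"""

try:
    safe_ET.fromstring(payload)
    print("parsed (unexpected)")
except EntitiesForbidden:
    # defusedxml rejects entity declarations by default
    print("blocked: entity declarations are forbidden")
```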
### Security Features

```python
# All XML parsing is automatically protected
from core.analyzer import XMLDocumentAnalyzer

analyzer = XMLDocumentAnalyzer()

# Safe parsing - malicious XML will be rejected
result = analyzer.analyze_document("potentially_malicious.xml")
if result.get('security_issue'):
    print(f"Security threat detected: {result['error']}")
```
### Best Practices

- Always use the framework's parsers - never use `xml.etree.ElementTree` directly
- Validate file sizes - set reasonable limits for your use case
- Sanitize file paths - ensure input paths are properly validated
- Monitor for security exceptions - log and alert on security-blocked parsing attempts
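The size- and path-validation practices above can be sketched as a small pre-flight check. This is an illustrative helper, not part of the framework's API; the names `validate_xml_input` and `base_dir` are invented here:

```python
import os

def validate_xml_input(path: str, base_dir: str, max_size_mb: float = 50.0) -> str:
    """Pre-flight checks before handing a file to the XML analyzer."""
    # Sanitize the path: resolve symlinks/".." and require it to stay
    # inside the allowed base directory.
    resolved = os.path.realpath(path)
    allowed_root = os.path.realpath(base_dir)
    if os.path.commonpath([resolved, allowed_root]) != allowed_root:
        raise ValueError(f"Path escapes allowed directory: {path}")

    # Enforce a size limit before parsing to avoid memory exhaustion.
    size_mb = os.path.getsize(resolved) / (1024 * 1024)
    if size_mb > max_size_mb:
        raise ValueError(f"File too large: {size_mb:.1f} MB > {max_size_mb} MB")
    return resolved
```

Calling this before analysis means a rejected file never reaches the XML parser at all.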
### File Size Limits

The framework includes built-in file size limits to prevent memory exhaustion:

```python
# Built-in size limits in analyzer and chunking
from core.analyzer import XMLDocumentAnalyzer
from core.chunking import ChunkingOrchestrator

# Create analyzer with 50MB limit
analyzer = XMLDocumentAnalyzer(max_file_size_mb=50.0)

# Create chunking orchestrator with 100MB limit
orchestrator = ChunkingOrchestrator(max_file_size_mb=100.0)

# Utility functions for easy setup
from utils import create_analyzer_with_limits, safe_analyze_document, FileSizeLimits

# Use predefined limits
analyzer = create_analyzer_with_limits(FileSizeLimits.PRODUCTION_MEDIUM)  # 50MB
safe_result = safe_analyze_document("file.xml", FileSizeLimits.REAL_TIME)  # 5MB
```
## Installation

```bash
# Install from source
git clone https://github.com/redhat-ai-americas/xml-analysis-framework.git
cd xml-analysis-framework
pip install -e .

# Or install development dependencies
pip install -e ".[dev]"
```
### Dependencies
- defusedxml (0.7.1+): For secure XML parsing protection
- Python standard library (3.8+) for all other functionality
## Usage Examples

### Basic Analysis

```python
from core.schema_analyzer import XMLSchemaAnalyzer

analyzer = XMLSchemaAnalyzer()
schema = analyzer.analyze_file('document.xml')

# Access schema properties
print(f"Root element: {schema.root_element}")
print(f"Total elements: {schema.total_elements}")
print(f"Namespaces: {schema.namespaces}")
```
### Enhanced Analysis with Specialized Handlers

```python
from core.analyzer import XMLDocumentAnalyzer

analyzer = XMLDocumentAnalyzer()
result = analyzer.analyze_document('maven-project.xml')

print(f"Document Type: {result['document_type'].type_name}")
print(f"Confidence: {result['confidence']:.2f}")
print(f"Handler Used: {result['handler_used']}")
print(f"AI Use Cases: {result['analysis'].ai_use_cases}")
```
### Safe Analysis with File Validation

```python
from utils import safe_analyze_document, FileSizeLimits

# Safe analysis with comprehensive validation
result = safe_analyze_document(
    'document.xml',
    max_size_mb=FileSizeLimits.PRODUCTION_MEDIUM
)

if result.get('error'):
    print(f"Analysis failed: {result['error']}")
else:
    print(f"Success: {result['document_type'].type_name}")
```
### Intelligent Chunking

```python
from core.chunking import ChunkingOrchestrator, XMLChunkingStrategy

orchestrator = ChunkingOrchestrator()
chunks = orchestrator.chunk_document(
    'large_document.xml',
    specialized_analysis={},  # analysis result from XMLDocumentAnalyzer
    strategy='auto'
)

# Token estimation
token_estimator = XMLChunkingStrategy()
for chunk in chunks:
    token_count = token_estimator.estimate_tokens(chunk.content)
    print(f"Chunk {chunk.chunk_id}: ~{token_count} tokens")
```
## Testing & Validation

### Production-Tested Performance

- 100% Success Rate: All 71 XML files processed successfully
- 2,752 Chunks Generated: Optimal segmentation across diverse document types
- 54 Document Types: Comprehensive coverage from ServiceNow to SCAP to Maven
- Secure by Default: Protected against XXE and billion laughs attacks
### Test Coverage

```bash
# Run comprehensive end-to-end test
python test_end_to_end_workflow.py

# Run individual component tests
python test_all_chunking.py          # Chunking strategies
python test_servicenow_analysis.py   # ServiceNow handler validation
python test_scap_analysis.py         # Security document analysis
```
### Real-World Test Data
- Enterprise Systems: ServiceNow incident exports (8 files)
- Security Documents: SCAP/XCCDF compliance reports (4 files)
- Build Configurations: Maven, Ant, Ivy projects (12 files)
- Enterprise Config: Spring, Hibernate, Log4j (15 files)
- Content & APIs: DocBook, RSS, WADL, Sitemaps (32 files)
## AI Integration & Use Cases

### AI Workflow Overview

```mermaid
graph LR
    A[XML Documents] --> B[XML Analysis Framework]
    B --> C[Document Analysis<br/>29 Specialized Handlers]
    B --> D[Smart Chunking<br/>Token-Optimized]
    B --> E[AI-Ready Output<br/>Structured JSON]
    E --> F[Vector Store<br/>Semantic Search]
    E --> G[Graph Database<br/>Relationships]
    E --> H[LLM Agent<br/>Intelligence]
    F --> I[Security Intelligence]
    G --> J[DevOps Automation]
    H --> K[Knowledge Management]

    style B fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    style E fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    style I fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    style J fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    style K fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
```
See Complete AI Integration Guide for detailed workflows, implementation examples, and advanced use cases.
### Security Intelligence Applications
- SCAP Compliance Monitoring: Automated vulnerability assessment and risk scoring
- SAML Security Analysis: Authentication flow security validation and threat detection
- Log4j Vulnerability Detection: CVE scanning and automated remediation guidance
- SOAP Security Assessment: Web service configuration security review
### DevOps & Configuration Intelligence
- Dependency Risk Analysis: Maven/Ant/Ivy vulnerability scanning and upgrade planning
- Configuration Drift Detection: Spring/Hibernate consistency monitoring
- Build Optimization: Performance analysis and security hardening recommendations
- Technical Debt Assessment: Legacy system modernization planning
### Enterprise System Intelligence
- ServiceNow Process Mining: Incident pattern analysis and workflow optimization
- Cross-System Correlation: Configuration impact analysis and change management
- Compliance Automation: Regulatory requirement mapping and validation
### Knowledge Management Applications
- Technical Documentation Search: Semantic search across DocBook, API documentation
- Content Intelligence: RSS/Atom trend analysis and topic extraction
- API Discovery: WADL/WSDL service catalog and integration recommendations
## Production Metrics & Performance

### Framework Statistics

- 100% Success Rate: 71/71 files processed without errors
- 2,752 Chunks Generated: Optimal 38.8 avg chunks per document
- 54 Document Types: Comprehensive XML format coverage
- High Performance: 0.015s average processing time per document
- Secure Parsing: defusedxml protection against XML attacks
### Handler Confidence Levels
- 100% Confidence: XML Schema (XSD), Maven POM, Log4j, RSS/Atom, Sitemaps
- 95% Confidence: ServiceNow, Apache Ant, Ivy, Spring, Hibernate, SAML, SOAP
- 90% Confidence: SCAP/XCCDF, DocBook, WADL/WSDL
- Intelligent Fallback: Generic XML handler for unknown formats
## Extending the Framework

### Adding New Handlers

```python
from core.analyzer import XMLHandler, SpecializedAnalysis, DocumentTypeInfo

class CustomHandler(XMLHandler):
    def can_handle(self, root, namespaces):
        return root.tag == 'custom-format', 1.0

    def detect_type(self, root, namespaces):
        return DocumentTypeInfo(
            type_name="Custom Format",
            confidence=1.0,
            version="1.0"
        )

    def analyze(self, root, file_path):
        return SpecializedAnalysis(
            document_type="Custom Format",
            key_findings={"custom_data": "value"},
            ai_use_cases=["Custom AI application"],
            structured_data={"extracted": "data"}
        )
```
### Custom Chunking Strategies

```python
from core.chunking import XMLChunkingStrategy, ChunkingOrchestrator

class CustomChunking(XMLChunkingStrategy):
    def chunk_document(self, file_path, specialized_analysis=None):
        chunks = []  # build chunk objects with custom logic here
        return chunks

# Register the custom strategy
orchestrator = ChunkingOrchestrator()
orchestrator.strategies['custom'] = CustomChunking
```
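The registration line relies on a simple name-to-class mapping. Below is a self-contained sketch of that dispatch pattern using invented stand-in classes (`MiniOrchestrator`, `LineStrategy`), not the framework's real ones:

```python
# Stand-alone sketch of strategy registration and dispatch, mirroring
# the orchestrator.strategies['custom'] = CustomChunking idiom.

class BaseStrategy:
    def chunk_document(self, file_path):
        raise NotImplementedError

class LineStrategy(BaseStrategy):
    def chunk_document(self, file_path):
        return [f"{file_path}#chunk-0"]

class MiniOrchestrator:
    def __init__(self):
        self.strategies = {}  # maps strategy name -> strategy class

    def chunk_document(self, file_path, strategy):
        cls = self.strategies[strategy]         # look the class up by name
        return cls().chunk_document(file_path)  # instantiate and dispatch

orch = MiniOrchestrator()
orch.strategies["line"] = LineStrategy
print(orch.chunk_document("doc.xml", strategy="line"))
```

Because the registry stores classes rather than instances, each call gets a fresh strategy object, which keeps strategies stateless between documents.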
## Real Production Output Examples

### ServiceNow Incident Analysis

```json
{
  "document_summary": {
    "document_type": "ServiceNow Incident",
    "type_confidence": 0.95,
    "handler_used": "ServiceNowHandler",
    "file_size_mb": 0.029
  },
  "key_insights": {
    "data_highlights": {
      "state": "7", "priority": "4", "impact": "3",
      "assignment_group": "REDACTED_GROUP",
      "resolution_time": "240 days, 0:45:51",
      "journal_analysis": {
        "total_entries": 9,
        "unique_contributors": 1
      }
    },
    "ai_applications": [
      "Incident pattern analysis",
      "Resolution time prediction",
      "Workload optimization"
    ]
  },
  "structured_content": {
    "chunking_strategy": "content_aware_medium",
    "total_chunks": 75,
    "quality_metrics": {
      "overall_readiness": 0.87
    }
  }
}
```
### Log4j Security Analysis

```json
{
  "document_summary": {
    "document_type": "Log4j Configuration",
    "type_confidence": 1.0,
    "handler_used": "Log4jConfigHandler"
  },
  "key_insights": {
    "data_highlights": {
      "security_concerns": {
        "security_risks": ["External socket appender detected"],
        "log4shell_vulnerable": false,
        "external_connections": [{"host": "log-server.example.com"}]
      },
      "performance": {
        "async_appenders": 1,
        "performance_risks": ["Location info impacts performance"]
      }
    },
    "ai_applications": [
      "Vulnerability assessment",
      "Performance optimization",
      "Security hardening"
    ]
  },
  "structured_content": {
    "total_chunks": 19,
    "chunking_strategy": "hierarchical_small"
  }
}
```
## Contributing
We welcome contributions! Whether you're adding new XML handlers, improving chunking algorithms, or enhancing AI integrations, your contributions help make XML analysis more accessible and powerful.
Priority contribution areas:
- New XML format handlers (ERP, CRM, healthcare, government)
- Enhanced chunking algorithms and strategies
- Performance optimizations for large files
- Advanced AI/ML integration examples
- Documentation and usage examples
See CONTRIBUTING.md for complete guidelines, development setup, and submission process.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Designed as part of the AI Building Blocks initiative
- Built for the modern AI/ML ecosystem
- Community-driven XML format support
## Download files
### xml_analysis_framework-1.2.6.tar.gz

File metadata:

- Download URL: xml_analysis_framework-1.2.6.tar.gz
- Upload date:
- Size: 9.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `4e7f7e74bacf78373ce7c7c0435650df21210f39cd4f4f4e82ca50c2a4bf8f0b` |
| MD5 | `99cd500493425c4ffd5d2dc860752d78` |
| BLAKE2b-256 | `09debe281f8dbb1a1645fdbf133f4f4fd539c127a277c6de1300739f1673ac0b` |
### xml_analysis_framework-1.2.6-py3-none-any.whl

File metadata:

- Download URL: xml_analysis_framework-1.2.6-py3-none-any.whl
- Upload date:
- Size: 201.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.12

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | `c8e0ad4c0bf2cb2f7996c4958c61d3e8e9db33c24e5c5fd57b19806fb6b27196` |
| MD5 | `9b1385ea9186577c3260811ec5fa30b5` |
| BLAKE2b-256 | `8eeff31965c196be7aa8bce618f0bf9d7bea106e7885b63bc35193145a3ce05f` |