A sloppy XML parser for Python designed to be used with LLMs

A sloppy XML parser for Python designed to handle malformed XML gracefully.

Sloppy XML is a single-file XML parser library that prioritizes resilience over strict XML compliance. In fact, it tries not to be XML compliant at all. It is specifically designed to handle the malformed XML commonly generated by LLMs, automated systems, and other sources where a well-formed XML structure cannot be guaranteed.

The parser provides both streaming and tree-building capabilities with robust error recovery mechanisms, making it ideal for parsing XML from unreliable sources while maintaining reasonable performance.

Note: this library was 100% AI generated with Claude Code and is used experimentally for some evals I'm doing. I will try to fix it up as well as possible as I run into issues, but I cannot vouch for its quality.

Goals

  • Graceful Error Recovery: Handle malformed XML without crashing
  • Dual API: Both streaming events and ElementTree construction
  • Zero Dependencies: Single file with only standard library dependencies
  • LLM-Friendly: Specifically designed for XML generated by language models
  • Detailed Diagnostics: Rich error reporting with line/column information

Quick Example

import sloppy_xml

# Streaming API - handle malformed XML gracefully
xml_content = '''
<root>
    <item name="test" broken-attr=>
        Some text with <unclosed-tag>
        <!-- Malformed comment --
    </item>
</root>
'''

# Stream parsing with error recovery
for event in sloppy_xml.stream_parse(xml_content):
    if isinstance(event, sloppy_xml.StartElement):
        print(f"Start: {event.name}, attrs: {event.attrs}")
    elif isinstance(event, sloppy_xml.EndElement):
        print(f"End: {event.name}")
    elif isinstance(event, sloppy_xml.Text):
        print(f"Text: {repr(event.content)}")
    elif isinstance(event, sloppy_xml.ParseError):
        print(f"Error recovered: {event.message} at {event.line}:{event.column}")

# Tree parsing - get an ElementTree despite malformed input
root = sloppy_xml.tree_parse(xml_content)
print(f"Parsed tree with root: {root.tag}")

Event Types

The streaming parser emits these event types:

  • StartElement - Opening tags with attributes and position info
  • EndElement - Closing tags (including auto-closed mismatched tags)
  • Text - Text content with CDATA detection
  • Comment - XML comments
  • ProcessingInstruction - Processing instructions like <?xml?>
  • EntityRef - Entity references with automatic resolution
  • ParseError - Recoverable parsing errors with diagnostic information
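
As a rough illustration of consuming such an event stream, the sketch below turns a sequence of events into an indented outline and a list of error messages. The event classes here are hypothetical stand-ins so the example runs without the library installed. Only the class names and the `name`, `attrs`, `content` and `message` attributes are taken from the Quick Example, and `outline` is a made-up helper, not part of sloppy_xml.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for the library's event classes. Only the class names
# and the attribute names used below come from the Quick Example.
@dataclass
class StartElement:
    name: str
    attrs: dict = field(default_factory=dict)


@dataclass
class EndElement:
    name: str


@dataclass
class Text:
    content: str


@dataclass
class ParseError:
    message: str


def outline(events):
    """Return (indented outline lines, error messages) for an event stream."""
    lines, errors, depth = [], [], 0
    for event in events:
        kind = type(event).__name__  # dispatch on the class name
        if kind == "StartElement":
            lines.append("  " * depth + event.name)
            depth += 1
        elif kind == "EndElement":
            depth = max(depth - 1, 0)
        elif kind == "Text" and event.content.strip():
            lines.append("  " * depth + repr(event.content.strip()))
        elif kind == "ParseError":
            errors.append(event.message)
    return lines, errors


events = [
    StartElement("root"),
    StartElement("item", {"name": "test"}),
    Text("  hello "),
    ParseError("example error"),
    EndElement("item"),
    EndElement("root"),
]
lines, errors = outline(events)
print("\n".join(lines))
print(errors)
```

Because the helper dispatches on the class name only, it should in principle also accept the iterator returned by `sloppy_xml.stream_parse(...)`, assuming the real events expose the same attribute names as in the Quick Example.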

Error Recovery Features

  • Tag Stack Management: Automatically closes mismatched opening tags
  • Malformed Attribute Handling: Recovers from broken attribute syntax
  • Entity Resolution: Handles standard HTML entities and numeric references
  • CDATA Fallback: Treats malformed CDATA as regular text
  • Comment Recovery: Handles unclosed comments gracefully
  • State Recovery: Returns parser to valid state after errors
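
To make the tag stack idea above more concrete, here is a toy sketch of how mismatched tags can be closed automatically. It is an illustration of the general technique only, not the library's actual implementation. The function name `close_mismatched` and the `("open" | "close", name)` token format are assumptions made for this example.

```python
def close_mismatched(tags):
    """Balance a flat sequence of ("open" | "close", name) pairs.

    Toy illustration of tag-stack recovery, not sloppy_xml's actual code.
    """
    stack, out = [], []
    for kind, name in tags:
        if kind == "open":
            stack.append(name)
            out.append(("open", name))
        elif name in stack:
            # Auto-close everything opened after the matching tag,
            # then close the matching tag itself.
            while stack:
                top = stack.pop()
                out.append(("close", top))
                if top == name:
                    break
        # A closing tag with no matching open tag is silently dropped.
    # Close whatever is still open at the end of the input.
    while stack:
        out.append(("close", stack.pop()))
    return out


tokens = [
    ("open", "root"),
    ("open", "item"),
    ("open", "b"),       # never closed explicitly
    ("close", "item"),   # forces "b" to be closed first
    ("close", "oops"),   # stray closing tag
]
balanced = close_mismatched(tokens)
print(balanced)
```

A real parser would presumably also emit a diagnostic for each forced close and each dropped tag. This sketch leaves that out to keep the stack logic visible.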

API Functions

Streaming API

sloppy_xml.stream_parse(xml_input)

Returns an iterator of events for streaming XML processing.

Tree API

sloppy_xml.tree_parse(xml_input)

Returns an xml.etree.ElementTree.Element root node.
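
Since the result is a regular `xml.etree.ElementTree.Element`, the usual stdlib methods should apply to it. The sketch below uses `ET.fromstring` on a well-formed document as a stand-in for the return value of `tree_parse`, so that it runs without the library installed. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for the result of sloppy_xml.tree_parse(...): a well-formed
# document parsed with the stdlib parser.
root = ET.fromstring(
    '<root><item name="a">one</item><item name="b">two</item></root>'
)

# Standard Element API: iterate descendants, read attributes and text.
names = [item.get("name") for item in root.iter("item")]
first_text = root.findtext("item")
print(root.tag, names, first_text)

# Serialize the tree back to a well-formed XML string.
print(ET.tostring(root, encoding="unicode"))
```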

Installation

uv add sloppy-xml

Development

This project uses uv for dependency management:

# Setup
uv sync

# Run tests
uv run pytest

# Format code
uv run ruff format

# Check code quality
uv run ruff check

# Build package
uv build

Similar Projects

  • xml.etree.ElementTree (stdlib) - Strict XML parsing, no error recovery
  • lxml - Fast XML parsing with some error tolerance
  • BeautifulSoup - HTML/XML parsing with tag soup handling

Sloppy XML fills the gap for applications that need structured XML parsing with aggressive error recovery, particularly for machine-generated content.
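
For comparison, the snippet below shows the strict behaviour of the stdlib parser on input with an unclosed tag, similar to the Quick Example. It only uses `xml.etree.ElementTree`. The `malformed` string is a made-up sample.

```python
import xml.etree.ElementTree as ET

malformed = "<root><item name='test'>Some text with <unclosed-tag></item></root>"

try:
    ET.fromstring(malformed)
    strict_ok = True
except ET.ParseError as exc:
    # The strict parser stops at the first well-formedness error.
    strict_ok = False
    print(f"stdlib parser rejected the input: {exc}")
```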

Sponsor

If you like the project and find it useful, you can become a sponsor.

License and Links
