Pretty-print XML and HTML files in a light, YAML-like, readable format

Unxml

Simplify and "flatten" XML files into a YAML-like readable format.

This is a Rust clone of the original unxml F# tool.

See it in action: a gallery of real-world XML documents, schemas, stylesheets, and Schematron rules rendered with unxml, with original-vs-rendered size comparisons.

Installation

Using uv (Easiest)

Install the published wheel from PyPI as a standalone tool:

uv tool install unxml-rs

This puts the unxml command on your PATH. To try it without installing anything:

uvx --from unxml-rs unxml <xml_file>

Pre-built Binaries (Recommended)

Download the latest release for your platform from the GitHub Releases page:

  • Linux (x86_64): unxml-linux-x86_64.tar.gz
  • Windows (x86_64): unxml-windows-x86_64.zip
  • macOS (Intel): unxml-macos-x86_64.tar.gz
  • macOS (Apple Silicon): unxml-macos-arm64.tar.gz

Extract the archive and place the unxml binary in a directory on your PATH.

From Source

git clone https://github.com/vivainio/unxml-rs
cd unxml-rs
cargo install --path .

Using Cargo

cargo install unxml

Usage

unxml <xml_file>

By default every file is processed as generic XML. Pass --auto to pick the processing mode from each file's extension:

Extension      Mode applied
.xsl, .xslt    --xslt
.sch           --schematron
.xsd           --xsd

An explicit mode flag (--xslt, --schematron, --xsd, --special) always overrides autodetection.
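
The selection rule above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the tool's actual code: the function name pick_mode and its parameters are hypothetical, and exact (case-sensitive) extension matching is an assumption.

```python
from pathlib import Path

# Extension-to-mode table copied from the README text above.
AUTO_MODES = {
    ".xsl": "--xslt",
    ".xslt": "--xslt",
    ".sch": "--schematron",
    ".xsd": "--xsd",
}


def pick_mode(filename, explicit=None, auto=False):
    """Return the mode flag to apply, or None for generic XML (hypothetical helper)."""
    if explicit is not None:
        return explicit  # an explicit mode flag always overrides autodetection
    if auto:
        # Assumption: the extension is matched exactly as written.
        return AUTO_MODES.get(Path(filename).suffix)
    return None  # without --auto, files are treated as generic XML


print(pick_mode("schema.xsd", auto=True))
print(pick_mode("schema.xsd", explicit="--xslt", auto=True))
```

The first call picks --xsd from the extension; the second shows the explicit flag winning.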

Each mode rewrites its vocabulary into a terser pseudocode. The full set of transformations, with side-by-side samples, is documented per format (XSLT, XSD, Schematron).

Syntax-highlighted output (--bat)

unxml --bat some.xsd      # implies --auto (detects --xsd), pipes through `bat -l unxml`

--bat renders the output through bat using the bundled unxml grammar (see editor/) for paged, colourised display. If bat is not installed it falls back to plain stdout.

Claude Code skill (--install-skills)

unxml --install-skills      # writes ~/.claude/skills/unxml/SKILL.md

Installs a Claude Code skill for unxml. It doesn't auto-activate; invoke it with /unxml.

Hiding noisy namespace prefixes (--hide-ns)

Vocabularies like UBL bury the signal under repeated prefixes (cbc:, cac:). --hide-ns drops the named prefixes from element and attribute names — and their xmlns: declarations — so the output reads as bare local names:

unxml --hide-ns cbc,cac invoice.xml   # repeatable and comma-separated

Signal-carrying prefixes you don't list (e.g. ext:, bim:) are kept, so an extension subtree still stands out.

The special value --hide-ns ALL hides every prefix, reducing all element and attribute names to their bare local form. Useful when you don't know the prefixes up front — e.g. fingerprinting or clustering documents of unknown vocabularies with --paths:

unxml --paths --hide-ns ALL unknown.xml   # prefix-free structural signature

Under --auto/--bat, unxml also sniffs the document type and hides a sensible set automatically. Currently it recognises UBL instance documents (an unprefixed root such as <Invoice> in a UBL namespace) and hides whichever prefixes are bound to the Common Basic/Aggregate Components namespaces. A stylesheet or schema that merely references UBL (e.g. an xsl:stylesheet translating to UBL) is left untouched, since there the prefixes are real syntax.
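
As a rough mental model of prefix hiding, the following Python sketch strips a prefix from a qualified name when it is listed, or when the special value ALL is given. The helper name hide_prefix is hypothetical, and the sketch covers names only; dropping the matching xmlns: declarations and the UBL sniffing are not modelled.

```python
def hide_prefix(name, hidden):
    """Drop the namespace prefix from a qualified name if it is hidden.

    `hidden` is a set of prefixes, or contains "ALL" to hide every prefix.
    Hypothetical helper for illustration; not the tool's implementation.
    """
    prefix, sep, local = name.partition(":")
    if not sep:
        return name  # unprefixed names are left alone
    if "ALL" in hidden or prefix in hidden:
        return local
    return name  # prefixes that were not listed are kept


hidden = {"cbc", "cac"}
print(hide_prefix("cbc:ID", hidden))             # ID
print(hide_prefix("ext:UBLExtensions", hidden))  # ext:UBLExtensions
print(hide_prefix("ext:UBLExtensions", {"ALL"})) # UBLExtensions
```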

Canonicalising for diffs (--canonical)

Two documents can mean the same thing yet differ byte-for-byte over things that carry no meaning: namespace prefixes are arbitrary local aliases for a URI, and sibling order is often incidental. --canonical removes both so the rendered output of equivalent documents diffs cleanly:

  • Prefixes are rebound to stable names. Recognised vocabularies keep their conventional prefix (xsl, xs, cac, ram, …); everything else becomes ns1, ns2, … in sorted-URI order. A default namespace (xmlns="…") is rewritten to the same explicit prefix, so <a:Foo> and <Foo xmlns="…"> for one URI collapse to the identical name. All xmlns:* declarations are re-emitted, sorted, on the root.
  • Sibling elements are sorted by a recursive signature, so order-only differences vanish. Mixed content (prose) keeps document order.

diff <(unxml --canonical a.xml) <(unxml --canonical b.xml)

Two documents differing only in prefix spelling, default-vs-explicit namespace, and sibling order produce byte-identical output:

<a:Order xmlns:a="urn:shop:order" xmlns:c="urn:shop:cust">
  <a:Line sku="X1"><a:Qty>2</a:Qty></a:Line>
  <c:Customer id="42">Acme</c:Customer>
</a:Order>

renders with unxml --canonical as:

ns2:Order(xmlns:ns1="urn:shop:cust", xmlns:ns2="urn:shop:order")
  ns1:Customer(id="42") = Acme
  ns2:Line(sku="X1")
    ns2:Qty = 2

Sibling sorting applies only to plain XML. Element order is significant in stylesheets and schemas (xsl:* control flow, xs:sequence, Schematron rule order), so in a dialect/--special mode (--xslt, --xsd, --wsdl, --schematron) --canonical normalises prefixes only and preserves document order.
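
The prefix rebinding can be pictured with a short Python sketch. It is an approximation under stated assumptions: the function name stable_prefixes is hypothetical, the CONVENTIONAL table lists only two well-known URIs rather than the tool's full set, and it assumes that recognised namespaces do not consume an nsN number.

```python
# Two well-known namespace URIs with their conventional prefixes.
# Assumption: the tool's real table is larger; this subset is illustrative.
CONVENTIONAL = {
    "http://www.w3.org/1999/XSL/Transform": "xsl",
    "http://www.w3.org/2001/XMLSchema": "xs",
}


def stable_prefixes(uris):
    """Map each namespace URI to a prefix independent of the input's spelling."""
    mapping = {}
    counter = 0
    for uri in sorted(set(uris)):  # sorted-URI order makes the numbering stable
        if uri in CONVENTIONAL:
            mapping[uri] = CONVENTIONAL[uri]
        else:
            counter += 1
            mapping[uri] = f"ns{counter}"
    return mapping


print(stable_prefixes(["urn:shop:order", "urn:shop:cust"]))
```

For the order document above this yields ns1 for urn:shop:cust and ns2 for urn:shop:order, regardless of whether the input spelled them a: and c: or used a default namespace.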

Listing document paths (--paths)

--paths dumps a compact structural summary instead of the full document: the set of distinct element paths as an indented tree, each node shown once (repeated siblings collapse) and annotated with the union of attribute names ever seen at that path. A leading // legend explains the namespace prefixes (recognised vocabularies on their conventional prefix are omitted as self-explanatory):

unxml --paths invoice.xml

produces a tree such as:

order(xmlns="urn:shop:order")
  customer(id)
  line(discount, sku)
    qty(unit)

Prefixed namespaces (xmlns:ext) go into a leading // legend; the default namespace (xmlns) is shown inline on the element that sets it, since several nested redefinitions would collide under one (default) legend key.

It answers "what shapes exist in this document" and is handy for understanding or comparing document shapes. It composes with --select (subtree under a match), --hide-ns (shorter segments), and --canonical (the legend resolves the generated ns1/ns2 names).
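
The idea behind the path summary (each distinct path once, with the union of attribute names seen there) can be sketched in Python with the standard library. The function names collect_paths and render_paths are hypothetical, and the sketch ignores namespaces and the // legend entirely.

```python
import xml.etree.ElementTree as ET


def collect_paths(elem, tree=None):
    """Build {tag: (attr_names, children)} with repeated siblings merged."""
    if tree is None:
        tree = {}
    attrs, children = tree.setdefault(elem.tag, (set(), {}))
    attrs.update(elem.attrib)  # union of attribute names seen at this path
    for child in elem:
        collect_paths(child, children)
    return tree


def render_paths(tree, indent=0):
    """Render the merged tree as indented lines, one per distinct path."""
    lines = []
    for tag, (attrs, children) in tree.items():
        suffix = f"({', '.join(sorted(attrs))})" if attrs else ""
        lines.append("  " * indent + tag + suffix)
        lines.extend(render_paths(children, indent + 1))
    return lines


doc = ET.fromstring(
    "<order>"
    '<customer id="42"/>'
    '<line sku="X1"><qty unit="pcs">2</qty></line>'
    '<line sku="X2" discount="5"><qty>1</qty></line>'
    "</order>"
)
print("\n".join(render_paths(collect_paths(doc))))
```

The two line elements collapse into one node, and discount appears in its attribute list even though only one of them carries it.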

Two further knobs turn --paths into a fuzzy fingerprint for clustering files by structure. Both coarsen the signature so that documents of the same format collapse together despite incidental differences:

  • --depth N limits the tree to N nesting levels (root = level 1), dropping deeper subtrees. Lower N → coarser.
  • --no-attrs drops ordinary attribute names from each node, keeping only namespaces. Incidental per-document attributes (schemaLocation, version, timestamps) stop fragmenting otherwise-identical formats.

Combined with --hide-ns ALL, --paths --depth 1 --no-attrs reduces each file to a single root-element + namespace line — a format census signature: run it over a directory and sort | uniq -c to see how many distinct formats are present and how many files use each. Raise --depth to cluster by finer structural variants instead.
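
For those who prefer a script over sort | uniq -c, the census could look roughly like this in Python. It assumes unxml is on PATH and uses only the flags named above; the function names run_unxml and format_census are hypothetical, and the signature function is injectable so the counting logic can be tried without the tool installed.

```python
import subprocess
from collections import Counter

# Flags taken from the README: coarsest possible signature per file.
CENSUS_FLAGS = ["--paths", "--depth", "1", "--no-attrs", "--hide-ns", "ALL"]


def run_unxml(path):
    """Run unxml on one file and return its stdout (assumes unxml is on PATH)."""
    result = subprocess.run(
        ["unxml", *CENSUS_FLAGS, str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def format_census(paths, signature=run_unxml):
    """Count how many files share each signature."""
    return Counter(signature(p) for p in paths)


# Usage sketch (needs unxml installed):
#   from pathlib import Path
#   for sig, count in format_census(sorted(Path(".").glob("*.xml"))).most_common():
#       print(f"{count:6d}  {sig}")
```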

Introduction

This command line application was developed for comparing XML files (e.g. database/application state dumps). It takes an XML file and converts it to a YAML-like syntax that is easier to read and compare.

Example

Take an excerpt of the standard UBL 2.1 invoice example:

<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
	xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2"
	xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
	<cbc:UBLVersionID>2.1</cbc:UBLVersionID>
	<cbc:ID>TOSL108</cbc:ID>
	<cbc:IssueDate>2009-12-15</cbc:IssueDate>
	<cbc:InvoiceTypeCode listID="UN/ECE 1001 Subset" listAgencyID="6">380</cbc:InvoiceTypeCode>
	<cbc:DocumentCurrencyCode listID="ISO 4217 Alpha" listAgencyID="6">EUR</cbc:DocumentCurrencyCode>
	<cac:AccountingSupplierParty>
		<cac:Party>
			<cac:PartyName>
				<cbc:Name>Salescompany ltd.</cbc:Name>
			</cac:PartyName>
			<cac:PostalAddress>
				<cbc:StreetName>Main street</cbc:StreetName>
				<cbc:CityName>Big city</cbc:CityName>
				<cbc:PostalZone>54321</cbc:PostalZone>
			</cac:PostalAddress>
		</cac:Party>
	</cac:AccountingSupplierParty>
</Invoice>

unxml invoice.xml flattens it into:

Invoice(
    xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2",
    xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2",
    xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2")
  cbc:UBLVersionID = 2.1
  cbc:ID = TOSL108
  cbc:IssueDate = 2009-12-15
  cbc:InvoiceTypeCode(listAgencyID="6", listID="UN/ECE 1001 Subset") = 380
  cbc:DocumentCurrencyCode(listAgencyID="6", listID="ISO 4217 Alpha") = EUR
  cac:AccountingSupplierParty
    cac:Party
      cac:PartyName
        cbc:Name = Salescompany ltd.
      cac:PostalAddress
        cbc:StreetName = Main street
        cbc:CityName = Big city
        cbc:PostalZone = 54321

With --auto, unxml sniffs the UBL instance and hides the noisy cbc:/cac: prefixes (along with their xmlns: declarations), leaving just the signal:

Invoice(xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2")
  UBLVersionID = 2.1
  ID = TOSL108
  IssueDate = 2009-12-15
  InvoiceTypeCode(listAgencyID="6", listID="UN/ECE 1001 Subset") = 380
  DocumentCurrencyCode(listAgencyID="6", listID="ISO 4217 Alpha") = EUR
  AccountingSupplierParty
    Party
      PartyName
        Name = Salescompany ltd.
      PostalAddress
        StreetName = Main street
        CityName = Big city
        PostalZone = 54321

Mode example: XSLT

Beyond flattening, each mode rewrites its vocabulary into terser pseudocode. A small XSLT stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
  <table border="1">
    <xsl:for-each select="catalog/cd">
    <tr>
      <td><xsl:value-of select="title"/></td>
      <td><xsl:value-of select="artist"/></td>
    </tr>
    </xsl:for-each>
  </table>
</xsl:template>
</xsl:stylesheet>

renders with unxml --xslt as:

xsl:stylesheet(version="1.0", xmlns:xsl="http://www.w3.org/1999/XSL/Transform")
  match /:
    table(border="1")
      foreach catalog/cd:
        tr
          td
            <- title
          td
            <- artist

match, foreach and <- (for xsl:value-of) read like the control flow the stylesheet actually expresses. See XSLT transformations for the full vocabulary, and XSD / Schematron for the other modes.
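
To make the rewriting concrete, here is a small Python sketch that reproduces the three rewrites shown above (match, foreach, <-) for the template part of the example. It is not how unxml is implemented; the function name render_xslt is hypothetical, only these three xsl: elements are handled, and everything else falls back to the generic element(attr="value") form.

```python
import xml.etree.ElementTree as ET

# ElementTree expands prefixed names to {namespace-uri}local-name.
XSL = "{http://www.w3.org/1999/XSL/Transform}"


def render_xslt(elem, indent=0):
    """Render a fragment of XSLT as terse pseudocode lines (illustrative only)."""
    if elem.tag == XSL + "template":
        head = f"match {elem.get('match')}:"
    elif elem.tag == XSL + "for-each":
        head = f"foreach {elem.get('select')}:"
    elif elem.tag == XSL + "value-of":
        head = f"<- {elem.get('select')}"
    else:
        attrs = ", ".join(f'{k}="{v}"' for k, v in sorted(elem.attrib.items()))
        head = elem.tag + (f"({attrs})" if attrs else "")
    lines = ["  " * indent + head]
    for child in elem:
        lines.extend(render_xslt(child, indent + 1))
    return lines


template = ET.fromstring(
    '<xsl:template match="/" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    '<table border="1"><xsl:for-each select="catalog/cd">'
    '<tr><td><xsl:value-of select="title"/></td>'
    '<td><xsl:value-of select="artist"/></td></tr>'
    "</xsl:for-each></table></xsl:template>"
)
print("\n".join(render_xslt(template)))
```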

Key Features

  • Attributes in Parentheses: Element attributes are displayed Pug-style as element(attr="value")
  • Text Content with Equals: Element text content is shown as ElementName = text content
  • Hierarchical Indentation: Nested elements are properly indented
  • Clean Format: Easy to read and compare, great for diffing
  • Inline mixed content: Prose interleaved with short inline elements stays on one readable line
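
The first three rules can be approximated in a few lines of Python using the standard library. This is a minimal sketch under assumptions, not the tool's code: the function name flatten is hypothetical, attributes are sorted alphabetically (inferred from the sample output above), namespaces are left out, and mixed content is not handled.

```python
import xml.etree.ElementTree as ET


def flatten(elem, indent=0):
    """Render an element tree in the element(attr="v") = text style."""
    line = "  " * indent + elem.tag
    if elem.attrib:
        # Assumption: attributes are emitted in alphabetical order.
        attrs = ", ".join(f'{k}="{v}"' for k, v in sorted(elem.attrib.items()))
        line += f"({attrs})"
    text = (elem.text or "").strip()
    if text:
        line += f" = {text}"
    lines = [line]
    for child in elem:
        lines.extend(flatten(child, indent + 1))  # nested elements indent further
    return lines


doc = ET.fromstring(
    "<Party><PartyName><Name>Salescompany ltd.</Name></PartyName>"
    '<TypeCode listID="UN/ECE 1001 Subset" listAgencyID="6">380</TypeCode></Party>'
)
print("\n".join(flatten(doc)))
```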

Mixed content (prose with inline spans)

Document-style XML interleaves text with small inline elements — a paragraph containing a <command> or a <link>. Flattening every run onto its own line makes such prose hard to read, so unxml keeps it inline as one line of verbatim XML:

<para>The <command>widget</command> daemon keeps its
  <link href="recovery.html">recoverable</link> state in one database.</para>

renders as:

para = The <command>widget</command> daemon keeps its <link href="recovery.html">recoverable</link> state in one database.

An element flows inline when its whole subtree is inline-safe — text interleaved with elements that are themselves inline-safe. A leaf with significant (multi-line) text, such as <programlisting> or <screen>, is not inline-safe, so its parent stays in the flattened block form and the listing keeps its line breaks. Nested inline markup (e.g. <emphasis> wrapping a <command>) collapses all the way up. This applies to the generic XML render; the --xslt/--xsd/--wsdl/--schematron modes use their own formatting.

Technical Details

  • Built with Rust for performance and safety
  • Uses quick-xml for fast XML parsing
  • Uses clap for command-line argument parsing
  • Proper error handling with anyhow

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Creating Releases

The version lives in the git tag, not in Cargo.toml (which stays at the 0.0.0-dev placeholder; the release workflow injects the real version with cargo set-version). Do not bump Cargo.toml or create tags by hand.

To cut a release, let gh create the tag:

gh release create vX.Y.Z --title "Release vX.Y.Z" --notes "…"

The pushed tag triggers the GitHub Actions workflow, which builds binaries and the PyPI wheel for all platforms and attaches them to the release.

The CI workflow runs on every push to ensure code quality with formatting checks, linting, and tests.

Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • unxml_rs-1.5.0-py3-none-win_amd64.whl (859.5 kB): Python 3, Windows x86-64
  • unxml_rs-1.5.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB): Python 3, manylinux (glibc 2.17+) x86-64
  • unxml_rs-1.5.0-py3-none-macosx_11_0_arm64.whl (953.9 kB): Python 3, macOS 11.0+ ARM64
  • unxml_rs-1.5.0-py3-none-macosx_10_12_x86_64.whl (976.8 kB): Python 3, macOS 10.12+ x86-64

File details

Details for the file unxml_rs-1.5.0-py3-none-win_amd64.whl.

File metadata

  • Download URL: unxml_rs-1.5.0-py3-none-win_amd64.whl
  • Upload date:
  • Size: 859.5 kB
  • Tags: Python 3, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for unxml_rs-1.5.0-py3-none-win_amd64.whl
Algorithm Hash digest
SHA256 054eaf8442765e7c8a398c6f550647589d9f88dbe61d3e1e38c31bc628c03209
MD5 b59fb50e8ed6c01181b0dd53fddd2593
BLAKE2b-256 dccd61a93a8103f0fea5ecad488399f4119853d34f109ab3b6d2d094c27766b2

See more details on using hashes here.

Provenance

The following attestation bundles were made for unxml_rs-1.5.0-py3-none-win_amd64.whl:

Publisher: release.yml on vivainio/unxml-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file unxml_rs-1.5.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File hashes

Hashes for unxml_rs-1.5.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 d5651222b17b31ff66ae64a3f0412fb8692891f37e33bf2872555b71b03a041b
MD5 a0d18428e20029a9db6d5073a3f2be15
BLAKE2b-256 a8b13a6e7e410600f7208e9ca51a71643396e0e92e84b153c3424253d52cafc0

Provenance

The following attestation bundles were made for unxml_rs-1.5.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on vivainio/unxml-rs

File details

Details for the file unxml_rs-1.5.0-py3-none-macosx_11_0_arm64.whl.

File hashes

Hashes for unxml_rs-1.5.0-py3-none-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 411075a146219ca5f208e93e839e730d8255ad376c5b08c619a31a38fd264dc6
MD5 73f2ac348acab362e52efca3d09e58ae
BLAKE2b-256 ce4c8783b9f970ffcd691727217673f1213fbda1c033ef20abc0ad85afec8416

Provenance

The following attestation bundles were made for unxml_rs-1.5.0-py3-none-macosx_11_0_arm64.whl:

Publisher: release.yml on vivainio/unxml-rs

File details

Details for the file unxml_rs-1.5.0-py3-none-macosx_10_12_x86_64.whl.

File hashes

Hashes for unxml_rs-1.5.0-py3-none-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 2d776b0bd2fff848dee505e7a2b160637fc381a86c0407a46a7220ff227c4eb2
MD5 cc640e6db7cb0615791b7984ec1f2da0
BLAKE2b-256 6ee8465b8753ad315846a7fa92bc492ae8f95ba9a8edb538f4a0f1291bd988ed

Provenance

The following attestation bundles were made for unxml_rs-1.5.0-py3-none-macosx_10_12_x86_64.whl:

Publisher: release.yml on vivainio/unxml-rs
