lmxml

A serialization format readable for both LLMs and humans.

I've found that structured prompting yields great results. Instead of feeding the model a wall of text (possibly formatted with Markdown or sprinkled with pseudo-XML tags), you can build your prompt in memory as an object (anything JSON-like), serialize it, and use the result as a system or user prompt.

In my (non-benchmarked, but battle-tested) experience, models understand your intent much better that way — especially the smaller ones.


Why lmxml?

What do you use to serialize the prompts, though?

JSON is a first-class citizen in the world of tool-capable models, so why not that?

Because you might end up with a single-line prompt that’s hard to read and even harder to debug. You can pretty-print it, but once you introduce multi-line text fields, you’re parsing \n in your own head — and mine started hurting the first time I tried.
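
To see the problem for yourself, here is a tiny sketch using only the standard json module. The prompt dict is made up for illustration; the point is that pretty-printing indents the structure, while the multi-line field still ends up on one line with escaped \n sequences.

```python
import json

# Made-up prompt object with one multi-line text field.
prompt = {
    "role": "reviewer",
    "instructions": "Read the diff.\nList the bugs.\nSuggest fixes.",
}

compact = json.dumps(prompt)           # everything on a single line
pretty = json.dumps(prompt, indent=2)  # indented, but newlines stay escaped

print(compact)
print(pretty)
```

Even in the pretty-printed version the instructions occupy a single line, so you are still reading \n by eye.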

So, maybe YAML?

It is nicely formatted for human ingestion and supports |-style multi-line strings.
Well… yes, but have fun configuring YAML serializers to reliably emit that flavor.

There’s TOON as well, but it’s optimized for token usage, not human readability.
Let’s not even talk about INI or TOML — I think you already see how that won’t do us any good here.

But hey — every prompting guide tells you that XML-like <tags> make models understand structure better.
So… maybe XML?

I liked that idea best, but the eXtensible part turns out to be a curse, not a feature, for this use case. There’s no simple xml.dumps(...) in major XML libraries, so you’re forced to decide:

  • which values become attributes
  • how lists are represented
  • whether to use CDATA
  • how much whitespace matters

So I made those decisions once, and lmxml is the result.
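
To make those decisions concrete, here is a small sketch with the standard library's xml.etree.ElementTree. The user dict and both layouts are made up for illustration; neither is what lmxml emits. It only shows that the same data can legitimately turn into quite different XML depending on what you pick.

```python
import xml.etree.ElementTree as ET

# Illustrative data, not taken from lmxml.
user = {"id": 42, "name": "Ada", "langs": ["en", "pl"]}

# Choice A: scalars become attributes, list entries become repeated child tags.
a = ET.Element("user", id=str(user["id"]), name=user["name"])
for lang in user["langs"]:
    ET.SubElement(a, "lang").text = lang

# Choice B: everything becomes a child element, and the list gets a wrapper tag.
b = ET.Element("user")
ET.SubElement(b, "id").text = str(user["id"])
ET.SubElement(b, "name").text = user["name"]
langs = ET.SubElement(b, "langs")
for lang in user["langs"]:
    ET.SubElement(langs, "item").text = lang

xml_a = ET.tostring(a, encoding="unicode")
xml_b = ET.tostring(b, encoding="unicode")
print(xml_a)
print(xml_b)
```

Both results are valid XML for the same object, and neither is indented unless you ask for it separately.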

lmxml stands for Language Model XML and is an opinionated way to produce XML-like text from JSON-ish data.

Yes, I know that if you expand the acronym, you technically get
Language Model eXtensible Markup Language.
I figure that “XML” is a proper noun these days (like YAML), so I’m willing to pay this silly price for a cute name.


Core design principles

  • Indentation is mandatory and deterministic
  • No attributes, except one permitted attribute: index on <item> inside <list>
  • Everything is a tag — most leaves are single-line: <tag>value</tag>
  • Multiline strings are supported, but not indented:
    • opening and closing tags are indented
    • inner lines are raw (no leading spaces)
  • Collections: only lists are first-class (tuples and sets are silently converted to lists)
  • Top-level primitives are emitted as raw values (no wrapping <None> tag)
  • Primitives are serialized using Python str() semantics (True, False, 42, 3.14)

The goal is not expressiveness — it’s predictability.
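
For the curious, the principles above fit into a few dozen lines. What follows is a hypothetical re-implementation written for illustration, not lmxml's actual code: the names sketch_dumps, _lines and _tagged are made up, and escaping text with html.escape is an assumption on my side.

```python
from html import escape

INDENT = "  "
CONTAINERS = (dict, list, tuple, set)


def sketch_dumps(value):
    """Illustrative sketch of the listed principles; NOT lmxml's implementation."""
    return "\n".join(_lines(value, 0))


def _lines(value, level):
    if isinstance(value, dict):
        out = []
        for key, child in value.items():
            out += _tagged(f"<{key}>", f"</{key}>", child, level)
        return out
    if isinstance(value, (list, tuple, set)):
        # Tuples and sets are treated exactly like lists.
        pad = INDENT * level
        out = [f"{pad}<list>"]
        for i, child in enumerate(value):
            out += _tagged(f'<item index="{i}">', "</item>", child, level + 1)
        out.append(f"{pad}</list>")
        return out
    # Only reached for a top-level primitive: emit the raw str() value.
    return [str(value)]


def _tagged(open_tag, close_tag, child, level):
    pad = INDENT * level
    if isinstance(child, CONTAINERS):
        return [pad + open_tag, *_lines(child, level + 1), pad + close_tag]
    text = escape(str(child), quote=False)  # assumption: escape &, < and >
    if "\n" in text:
        # Multiline leaf: tags are indented, inner lines are emitted raw.
        return [pad + open_tag, *text.split("\n"), pad + close_tag]
    return [f"{pad}{open_tag}{text}{close_tag}"]


data = {
    "user": {"id": 42, "name": "Ada", "bio": "Researcher\nLoves coffee"},
    "tags": ["ml", "nlp"],
}
result = sketch_dumps(data)
print(result)
```

Note how every decision is hard-coded: there is nothing to configure, which is the whole point.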


Usage

As simple as this (which is the whole API surface btw):

import lmxml

data = {
    "user": {
        "id": 42,
        "name": "Ada",
        "bio": "Researcher\nLoves coffee"
    },
    "tags": ["ml", "nlp"]
}

print(lmxml.dumps(data))

That snippet prints:

<user>
  <id>42</id>
  <name>Ada</name>
  <bio>
Researcher
Loves coffee
  </bio>
</user>
<tags>
  <list>
    <item index="0">ml</item>
    <item index="1">nlp</item>
  </list>
</tags>

Pydantic support

If pydantic is importable, you can pass any BaseModel instance to lmxml.dumps. The following invariant holds:

x: pydantic.BaseModel
lmxml.dumps(x) == lmxml.dumps(x.model_dump(mode="json"))

Pydantic is not a dependency, not even an optional one. lmxml just detects whether it is installed and enables that tweak (including typing) if it is.
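
This kind of detection is usually done with a guarded import. The sketch below shows the general pattern, not lmxml's actual code: to_plain is a hypothetical helper name, and it assumes pydantic v2, since model_dump(mode="json") is the call used in the invariant above.

```python
try:
    # Only picked up if pydantic is already installed in the environment.
    from pydantic import BaseModel
except ImportError:
    BaseModel = None


def to_plain(value):
    """Hypothetical helper: convert a pydantic model to JSON-like data,
    pass anything else through untouched."""
    if BaseModel is not None and isinstance(value, BaseModel):
        return value.model_dump(mode="json")
    return value


print(to_plain({"id": 42, "name": "Ada"}))
```

Plain dicts and lists pass through unchanged whether or not pydantic is present, so the rest of the serializer never has to know about it.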


Where's the deserializer?

There isn’t one.

This is not a data transport format. It’s a way to take structured concepts and feed them to a model reliably, while preserving both human readability and structural cues.

You can run the output through a standard XML parser, but you won’t get the original structure back out-of-the-box. That’s intentional.

Unless you serialized a primitive (which is emitted as raw str()), parsing should always succeed — as in no exceptions should be raised.
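
Here is a rough sketch of what reading the output back looks like with the standard library's xml.etree.ElementTree. The text is the output from the Usage section, pasted in as a literal. ElementTree accepts only a single root element, so the sketch wraps the text in a made-up <root> tag; that wrapper is my assumption and not something lmxml adds.

```python
import xml.etree.ElementTree as ET

# Output copied from the Usage section.
text = """<user>
  <id>42</id>
  <name>Ada</name>
  <bio>
Researcher
Loves coffee
  </bio>
</user>
<tags>
  <list>
    <item index="0">ml</item>
    <item index="1">nlp</item>
  </list>
</tags>"""

# Assumption: wrap in a hypothetical <root> so there is exactly one top-level element.
root = ET.fromstring(f"<root>{text}</root>")

user_id = root.find("user/id").text        # comes back as a string, not an int
bio = root.find("user/bio").text.strip()   # surrounding whitespace has to be stripped
tags = [item.text for item in root.findall("tags/list/item")]

print(user_id, bio, tags)
```

The values are all there, but types and the original list shape have to be rebuilt by hand, which is why this is not meant as a transport format.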

There are some minor gotchas (HTML escaping, no CDATA for multi-line strings, XML character restrictions), but you’re unlikely to hit them in normal prompting scenarios. If you do, open an issue — we’ll figure out whether it’s a bug or a feature.


lmxml is intentionally boring.

If you need schemas, validation, round-tripping, or extensibility — use something else.

If you want prompts that are easy to read, hard to break, and easy for models to follow — lmxml is for you.


Disclaimer for the AI age

I admit, this has been vibe-slapped together. ChatGPT can do a surprisingly good job of writing code, although I worked by chatting with it and copying its snippets. Anyway, even though most of the code was conjured via LLM magic, it has all been reviewed by yours truly (not that there was a lot to review).
