Python library for manipulating, creating and editing tmx files

These details have not been verified by PyPI

Project links

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3.13
- Python :: 3.14

Project description

Hypomnema

A type-safe, dependency-free Python library for TMX 1.4B.

Python 3.13+ | MIT License | mypy --strict clean

[!warning] Hypomnema is still in active development and considered alpha quality. The API is subject to change without warning and no backwards compatibility is guaranteed until 1.0.

Why this exists

TMX is 25 years old. Every CAT tool produces it, but the Python ecosystem has no library that gives you typed, validated, round-trippable TMX without forcing you to think about XML internals. Hypomnema fixes that.

It is an infrastructure library — the pandas of TMX, if you will. It gives you a typed domain model you can reason about, operations you can compose, and backends you can swap. It does not validate your semantics (if your "English" variant contains French, that's your problem), but it does ensure structural validity: every node enforces its invariants at construction time, and unknown elements are preserved as opaque payloads rather than silently dropped.

What it is not: a segmentation engine, a corpus manager, an alignment tool, or an MT pipeline. Those are things you build with Hypomnema, not things it does for you.

Philosophy

Dependency-free by default. The standard library is all you need. lxml is an opt-in extra for performance. The aim is for this to stay true forever — optional dependencies for things like language-code validation may appear, but pip install hypomnema will always give you a working library with no transitive deps.

Type safety is non-negotiable. Everything passes mypy --strict. We lean hard on modern Python — generic syntax (E instead of TypeVar), match/case, union types as X | Y, dataclass(frozen=True, slots=True). The floor is 3.13 and will probably become 3.14 before 1.0. There is no reason a new project should carry compatibility baggage for EOL Python releases.

Round-trip fidelity. Anything Hypomnema doesn't model explicitly is captured verbatim as UnknownNode / UnknownInlineNode payloads (raw bytes). Loading a TMX file and dumping it back produces structurally equivalent output regardless of which backend you pick.

Backend parity. StandardBackend and LxmlBackend are first-class citizens. They share all namespace logic through pure functions. Every test that touches a backend is parametrized over both. If one can do something the other can't, that's a bug.

Architecture

┌─────────────────────────────────────────────────────┐
│                    OPERATIONS                       │
│         (ops/ — walk, text, transform, normalize)   │
│         Pure functions on domain nodes. No I/O.     │
└──────────────────────┬──────────────────────────────┘
                       │
              domain nodes only
              (no backend, no XML)
                       │
┌──────────────────────┴──────────────────────────────┐
│                     DOMAIN                          │
│        (domain/ — nodes, attributes, value types)   │
│        Immutable dataclasses. The IR.               │
└──────────────────────┬──────────────────────────────┘
                       │
        domain nodes in, domain nodes out
                       │
┌──────────────────────┴──────────────────────────────┐
│              LOADERS  ←→  DUMPERS                   │
│            (format-specific entry points)           │
│                                                     │
│   XML ── XmlLoader → domain → XmlDumper ── XML      │
│   JSON ─ JsonLoader → domain → JsonDumper ── JSON   │
│   CSV  ─ CsvLoader  → domain → CsvDumper  ── CSV    │
└──────────────────────┬──────────────────────────────┘
                       │
backend elements (et.Element, lxml._Element, dict, …)
                       │
┌──────────────────────┴──────────────────────────────┐
│                    BACKENDS                         │
│    (backends/xml/ — standard, lxml, namespace)      │
│    XmlBackendLike[E] protocol, XmlBackend[E] ABC    │
└─────────────────────────────────────────────────────┘

The domain layer is the intermediate representation. Loaders convert serialized content into domain nodes. Dumpers convert domain nodes back into serialized content. Backends handle the low-level serialization details. Operations run on domain nodes and never touch XML.

This separation means you can add a JSON backend (for FastAPI interop), a CSV backend (for spreadsheet workflows), or a Polars backend (for dataframe-scale analysis) without touching a single line of the domain model. The domain is format-agnostic — it's an IR.

Domain model

Every TMX element is a dataclass(slots=True) with two distinct attribute regions:

spec_attributes — a separate slotted dataclass holding the fields defined by the TMX 1.4B specification. Typed, validated at construction, IDE-autocompletable.
extra_attributes — a plain dict[str, str] for vendor extensions, non-standard attributes, or anything not in the spec. Preserved through round-trips.

This split makes it immediately obvious what's spec and what's vendor contamination. No guessing.

Attribute renaming

TMX attribute names are mapped to readable Python. Hyphens aren't valid identifiers, and abbreviations hide intent:

TMX attribute	Domain name	Notes
`creationdate`	`created_at`	`datetime` — not a string
`changedate`	`last_modified_at`	`datetime`
`creationid`	`created_by`	It's a user, not an ID
`changeid`	`last_modified_by`	Same
`creationtool`	`creation_tool`	Readable
`creationtoolversion`	`creation_tool_version`	Readable
`o-encoding`	`original_encoding`	Hyphens → underscores
`o-tmf`	`original_tm_format`	Abbreviation expanded
`srclang`	`source_language`	Readable
`adminlang`	`admin_language`	Readable
`segtype`	`segmentation_type`	Readable
`datatype`	`original_data_type`	It's "original data type" per spec
`tuid`	`translation_unit_id`	Readable
`usagecount`	`usage_count`	Underscore
`lastusagedate`	`last_used_at`	`datetime`
`version`	`version`	Kept as-is
`lang`	`language`	It's a language code

Datetime attributes (created_at, last_modified_at, last_used_at) are stored as datetime objects, not strings — they're parsed on load and formatted on dump. TMX recommends the YYYYMMDDTHHMMSSZ format so that's what Hypomnema dumps to by default.

[!note] Hypomnema parses datetime via datetime.fromisoformat() and formats to YYYYMMDDTHHMMSSZ regardless of the original format. If you need a different output format, you'll need to override the dumpers for the relevant elements.

Inline content

Segment content isn't a string — it's a mixed list of str, the allowed type of content node for that element (so Sub or Bpt | Ept | It | Ph | Hi | Sub) and UnknownInlineNode. This preserves markup structure so you can query, transform, and round-trip it.

[!important] The structure of the TMX format means it can technically be infinitely recursive and Hypomnema is the same. Be careful when using recursion and infinite loops. For memory efficiency, Hypomnema uses a stack-based approach to recursion internally.

for item in variant.segment:
    match item:
        case str():
            ...
        case Bpt(spec_attributes=a):
            print(a.internal_id)
        case UnknownInlineNode(payload=bytes):
            # preserved for round-trip, never touched
            ...

Type safety

The entire codebase is mypy --strict clean. We use modern Python features throughout:

PEP 695 generics — XmlBackend[E] instead of XmlBackend(Generic[E]), XmlLoader[T] instead of TypeVar boilerplate
match/case — for namespace resolution (namespace.py), event dispatch in iterparse, and tag dispatch in loaders
Union types as X | Y — no Optional, no Union
dataclass(frozen=True, slots=True) — immutable, memory-efficient, hashable
Protocol-based backend contract — XmlBackendLike[E] is a Protocol so you can mock it, proxy it, or implement it without inheriting. XmlBackend[E] is the shared ABC that both concrete backends inherit from.
NamedTuples for resolution results — ResolveResult(prefix, uri, localname, clark) is explicit and typed, not a 4-tuple

The generic E parameter threads the element type through the entire stack — XmlLoader[et.Element] vs XmlLoader[et._Element] produce the same domain nodes but are statically distinguished at the backend layer.

Xml Backend details

Namespace handling

All namespace logic lives in pure functions in namespace.py. No class, no state — just functions that take maps and return results:

resolve("ns:item", global_nsmap={...}, nsmap={...}) → ResolveResult
format_notation(result, "local", global_nsmap={...}) → str
register_namespace(nsmap, "ns", "http://...")  # mutates dict in-place

Resolution uses successive lookups: per-call nsmap first, then global_nsmap. No dict merging, no copies. The xml prefix is built-in and always maps to http://www.w3.org/XML/1998/namespace.

Clark notation ({uri}local) is the internal representation. Prefixed names and default-namespace names are resolved at the boundary. format_notation with "prefixed" raises MultiplePrefixesError if the URI maps to ambiguity in one map.

LxmlBackend and element.nsmap

Lxml elements expose element.nsmap — in-scope namespace declarations. Methods that resolve names (get_tag, get_attribute, etc.) merge this into a fresh dict alongside the caller's nsmap for the resolution call. The caller's dict is never mutated.

Streaming

iterparse yields elements whose closing tag is reached. Unmatched elements are cleared immediately to bound memory. iterwrite writes in batches with configurable buffer size, optional root wrapper, XML declaration, and doctype.

Parsing

Both parse() methods use single-pass iterparse with start and start-ns events. from_bytes and from_string wrap the input in a BytesIO/StringIO and delegate to parse(). No double-scanning.

What's not here yet

Language code validation. lang attributes are stored as strings. Proper BCP 47 validation would require an external dependency, which contradicts the zero-dep default. This will likely become an optional extra (hypomnema[langcode]) using langcodes or similar.

<map> and <ude> tags. These relate to custom character encodings and mapping tables. Supporting them properly would add significant complexity for negligible benefit since virtually no modern TMX file uses them. If broad demand emerges, they can be added.

<ut> tag. Will be modeled in a coming update.

TMX versions other than 1.4B. This is the de facto standard — every tool produces it, virtually no tool produces anything else. TMX 2.0 (which was only ever a Committee Draft) will be considered if it ever becomes real.

Development

The project uses uv for everything — dependency management, virtual environments, testing, linting, type checking, publishing.

uv run pytest                  # run tests
uv run ruff check src/ tests/  # lint
uv run mypy --strict src/      # type check

License

MIT. See LICENSE.

Project details

These details have not been verified by PyPI

Project links

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3.13
- Python :: 3.14

Release history Release notifications | RSS feed

This version

0.8

Apr 9, 2026

0.7

Feb 25, 2026

0.6

Jan 28, 2026

0.5.0

Jan 15, 2026

0.4.4

Dec 19, 2025

0.4.3

Dec 15, 2025

0.4.2

Dec 4, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hypomnema-0.8.tar.gz (44.0 kB view details)

Uploaded Apr 9, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

hypomnema-0.8-py3-none-any.whl (48.4 kB view details)

Uploaded Apr 9, 2026 Python 3

File details

Details for the file hypomnema-0.8.tar.gz.

File metadata

Download URL: hypomnema-0.8.tar.gz
Upload date: Apr 9, 2026
Size: 44.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.5 {"installer":{"name":"uv","version":"0.11.5","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for hypomnema-0.8.tar.gz
Algorithm	Hash digest
SHA256	`1c857ff701d36860dd71830529f0dda077225361e5e8a9e47fd60557385afcfa`
MD5	`0cd75e593065792076de2ec68a37a573`
BLAKE2b-256	`9e4878eae0ffce622b02a1913f529ebbebea71462a6fb3e7c6e923e00bf9db83`

See more details on using hashes here.

File details

Details for the file hypomnema-0.8-py3-none-any.whl.

File metadata

Download URL: hypomnema-0.8-py3-none-any.whl
Upload date: Apr 9, 2026
Size: 48.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.11.5 {"installer":{"name":"uv","version":"0.11.5","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for hypomnema-0.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f9257a866b2f788a3834cc33849f5ba74a9a68ceeacaf1ec0e438b2cbdcb13b1`
MD5	`bb4cc4e963a179d414098185e17fdbac`
BLAKE2b-256	`8a557126380b746fd2beddd7a976c870e5456a62e171c957901dbe96ef244242`

See more details on using hashes here.

hypomnema 0.8

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Hypomnema

Why this exists

Philosophy

Architecture

Domain model

Attribute renaming

Inline content

Type safety

Xml Backend details

Namespace handling

LxmlBackend and element.nsmap

Streaming

Parsing

What's not here yet

Development

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes