Skip to main content

Python library for manipulating, creating and editing tmx files

Project description

Hypomnema

A type-safe, dependency-free Python library for TMX 1.4B.

Python 3.13+ | MIT License | mypy --strict clean

[!warning] Hypomnema is still in active development and considered alpha quality. The API is subject to change without warning and no backwards compatibility is guaranteed until 1.0.

Why this exists

TMX is 25 years old. Every CAT tool produces it, but the Python ecosystem has no library that gives you typed, validated, round-trippable TMX without forcing you to think about XML internals. Hypomnema fixes that.

It is an infrastructure library — the pandas of TMX, if you will. It gives you a typed domain model you can reason about, operations you can compose, and backends you can swap. It does not validate your semantics (if your "English" variant contains French, that's your problem), but it does ensure structural validity: every node enforces its invariants at construction time, and unknown elements are preserved as opaque payloads rather than silently dropped.

What it is not: a segmentation engine, a corpus manager, an alignment tool, or an MT pipeline. Those are things you build with Hypomnema, not things it does for you.

Philosophy

Dependency-free by default. The standard library is all you need. lxml is an opt-in extra for performance. The aim is for this to stay true forever — optional dependencies for things like language-code validation may appear, but pip install hypomnema will always give you a working library with no transitive deps.

Type safety is non-negotiable. Everything passes mypy --strict. We lean hard on modern Python — generic syntax (E instead of TypeVar), match/case, union types as X | Y, dataclass(frozen=True, slots=True). The floor is 3.13 and will probably become 3.14 before 1.0. There is no reason a new project should carry compatibility baggage for EOL Python releases.

Round-trip fidelity. Anything Hypomnema doesn't model explicitly is captured verbatim as UnknownNode / UnknownInlineNode payloads (raw bytes). Loading a TMX file and dumping it back produces structurally equivalent output regardless of which backend you pick.

Backend parity. StandardBackend and LxmlBackend are first-class citizens. They share all namespace logic through pure functions. Every test that touches a backend is parametrized over both. If one can do something the other can't, that's a bug.

Architecture

┌─────────────────────────────────────────────────────┐
│                    OPERATIONS                       │
│         (ops/ — walk, text, transform, normalize)   │
│         Pure functions on domain nodes. No I/O.     │
└──────────────────────┬──────────────────────────────┘
                       │
              domain nodes only
              (no backend, no XML)
                       │
┌──────────────────────┴──────────────────────────────┐
│                     DOMAIN                          │
│        (domain/ — nodes, attributes, value types)   │
│        Immutable dataclasses. The IR.               │
└──────────────────────┬──────────────────────────────┘
                       │
        domain nodes in, domain nodes out
                       │
┌──────────────────────┴──────────────────────────────┐
│              LOADERS  ←→  DUMPERS                   │
│            (format-specific entry points)           │
│                                                     │
│   XML ── XmlLoader → domain → XmlDumper ── XML      │
│   JSON ─ JsonLoader → domain → JsonDumper ── JSON   │
│   CSV  ─ CsvLoader  → domain → CsvDumper  ── CSV    │
└──────────────────────┬──────────────────────────────┘
                       │
backend elements (et.Element, lxml._Element, dict, …)
                       │
┌──────────────────────┴──────────────────────────────┐
│                    BACKENDS                         │
│    (backends/xml/ — standard, lxml, namespace)      │
│    XmlBackendLike[E] protocol, XmlBackend[E] ABC    │
└─────────────────────────────────────────────────────┘

The domain layer is the intermediate representation. Loaders convert serialized content into domain nodes. Dumpers convert domain nodes back into serialized content. Backends handle the low-level serialization details. Operations run on domain nodes and never touch XML.

This separation means you can add a JSON backend (for FastAPI interop), a CSV backend (for spreadsheet workflows), or a Polars backend (for dataframe-scale analysis) without touching a single line of the domain model. The domain is format-agnostic — it's an IR.

Domain model

Every TMX element is a dataclass(slots=True) with two distinct attribute regions:

  • spec_attributes — a separate slotted dataclass holding the fields defined by the TMX 1.4B specification. Typed, validated at construction, IDE-autocompletable.
  • extra_attributes — a plain dict[str, str] for vendor extensions, non-standard attributes, or anything not in the spec. Preserved through round-trips.

This split makes it immediately obvious what's spec and what's vendor contamination. No guessing.

Attribute renaming

TMX attribute names are mapped to readable Python. Hyphens aren't valid identifiers, and abbreviations hide intent:

TMX attribute Domain name Notes
creationdate created_at datetime — not a string
changedate last_modified_at datetime
creationid created_by It's a user, not an ID
changeid last_modified_by Same
creationtool creation_tool Readable
creationtoolversion creation_tool_version Readable
o-encoding original_encoding Hyphens → underscores
o-tmf original_tm_format Abbreviation expanded
srclang source_language Readable
adminlang admin_language Readable
segtype segmentation_type Readable
datatype original_data_type It's "original data type" per spec
tuid translation_unit_id Readable
usagecount usage_count Underscore
lastusagedate last_used_at datetime
version version Kept as-is
lang language It's a language code

Datetime attributes (created_at, last_modified_at, last_used_at) are stored as datetime objects, not strings — they're parsed on load and formatted on dump. TMX recommends the YYYYMMDDTHHMMSSZ format so that's what Hypomnema dumps to by default.

[!note] Hypomnema parses datetime via datetime.fromisoformat() and formats to YYYYMMDDTHHMMSSZ regardless of the original format. If you need a different output format, you'll need to override the dumpers for the relevant elements.

Inline content

Segment content isn't a string — it's a mixed list of str, the allowed type of content node for that element (so Sub or Bpt | Ept | It | Ph | Hi | Sub) and UnknownInlineNode. This preserves markup structure so you can query, transform, and round-trip it.

[!important] The structure of the TMX format means it can technically be infinitely recursive and Hypomnema is the same. Be careful when using recursion and infinite loops. For memory efficiency, Hypomnema uses a stack-based approach to recursion internally.

for item in variant.segment:
    match item:
        case str():
            ...
        case Bpt(spec_attributes=a):
            print(a.internal_id)
        case UnknownInlineNode(payload=bytes):
            # preserved for round-trip, never touched
            ...

Type safety

The entire codebase is mypy --strict clean. We use modern Python features throughout:

  • PEP 695 genericsXmlBackend[E] instead of XmlBackend(Generic[E]), XmlLoader[T] instead of TypeVar boilerplate
  • match/case — for namespace resolution (namespace.py), event dispatch in iterparse, and tag dispatch in loaders
  • Union types as X | Y — no Optional, no Union
  • dataclass(frozen=True, slots=True) — immutable, memory-efficient, hashable
  • Protocol-based backend contractXmlBackendLike[E] is a Protocol so you can mock it, proxy it, or implement it without inheriting. XmlBackend[E] is the shared ABC that both concrete backends inherit from.
  • NamedTuples for resolution resultsResolveResult(prefix, uri, localname, clark) is explicit and typed, not a 4-tuple

The generic E parameter threads the element type through the entire stack — XmlLoader[et.Element] vs XmlLoader[et._Element] produce the same domain nodes but are statically distinguished at the backend layer.

Xml Backend details

Namespace handling

All namespace logic lives in pure functions in namespace.py. No class, no state — just functions that take maps and return results:

resolve("ns:item", global_nsmap={...}, nsmap={...}) → ResolveResult
format_notation(result, "local", global_nsmap={...}) → str
register_namespace(nsmap, "ns", "http://...")  # mutates dict in-place

Resolution uses successive lookups: per-call nsmap first, then global_nsmap. No dict merging, no copies. The xml prefix is built-in and always maps to http://www.w3.org/XML/1998/namespace.

Clark notation ({uri}local) is the internal representation. Prefixed names and default-namespace names are resolved at the boundary. format_notation with "prefixed" raises MultiplePrefixesError if the URI maps to ambiguity in one map.

LxmlBackend and element.nsmap

Lxml elements expose element.nsmap — in-scope namespace declarations. Methods that resolve names (get_tag, get_attribute, etc.) merge this into a fresh dict alongside the caller's nsmap for the resolution call. The caller's dict is never mutated.

Streaming

iterparse yields elements whose closing tag is reached. Unmatched elements are cleared immediately to bound memory. iterwrite writes in batches with configurable buffer size, optional root wrapper, XML declaration, and doctype.

Parsing

Both parse() methods use single-pass iterparse with start and start-ns events. from_bytes and from_string wrap the input in a BytesIO/StringIO and delegate to parse(). No double-scanning.

What's not here yet

Language code validation. lang attributes are stored as strings. Proper BCP 47 validation would require an external dependency, which contradicts the zero-dep default. This will likely become an optional extra (hypomnema[langcode]) using langcodes or similar.

<map> and <ude> tags. These relate to custom character encodings and mapping tables. Supporting them properly would add significant complexity for negligible benefit since virtually no modern TMX file uses them. If broad demand emerges, they can be added.

<ut> tag. Will be modeled in a coming update.

TMX versions other than 1.4B. This is the de facto standard — every tool produces it, virtually no tool produces anything else. TMX 2.0 (which was only ever a Committee Draft) will be considered if it ever becomes real.

Development

The project uses uv for everything — dependency management, virtual environments, testing, linting, type checking, publishing.

uv run pytest                  # run tests
uv run ruff check src/ tests/  # lint
uv run mypy --strict src/      # type check

License

MIT. See LICENSE.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hypomnema-0.8.tar.gz (44.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

hypomnema-0.8-py3-none-any.whl (48.4 kB view details)

Uploaded Python 3

File details

Details for the file hypomnema-0.8.tar.gz.

File metadata

  • Download URL: hypomnema-0.8.tar.gz
  • Upload date:
  • Size: 44.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.5 {"installer":{"name":"uv","version":"0.11.5","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for hypomnema-0.8.tar.gz
Algorithm Hash digest
SHA256 1c857ff701d36860dd71830529f0dda077225361e5e8a9e47fd60557385afcfa
MD5 0cd75e593065792076de2ec68a37a573
BLAKE2b-256 9e4878eae0ffce622b02a1913f529ebbebea71462a6fb3e7c6e923e00bf9db83

See more details on using hashes here.

File details

Details for the file hypomnema-0.8-py3-none-any.whl.

File metadata

  • Download URL: hypomnema-0.8-py3-none-any.whl
  • Upload date:
  • Size: 48.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.5 {"installer":{"name":"uv","version":"0.11.5","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for hypomnema-0.8-py3-none-any.whl
Algorithm Hash digest
SHA256 f9257a866b2f788a3834cc33849f5ba74a9a68ceeacaf1ec0e438b2cbdcb13b1
MD5 bb4cc4e963a179d414098185e17fdbac
BLAKE2b-256 8a557126380b746fd2beddd7a976c870e5456a62e171c957901dbe96ef244242

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page