Python library for creating, editing, and manipulating TMX files
Hypomnema
A type-safe, dependency-free Python library for TMX 1.4B.
Python 3.13+ | MIT License | mypy --strict clean
[!warning] Hypomnema is still in active development and considered alpha quality. The API is subject to change without warning and no backwards compatibility is guaranteed until 1.0.
Why this exists
TMX is 25 years old. Every CAT tool produces it, but the Python ecosystem has no library that gives you typed, validated, round-trippable TMX without forcing you to think about XML internals. Hypomnema fixes that.
It is an infrastructure library — the pandas of TMX, if you will. It gives you a typed domain model you can reason about, operations you can compose, and backends you can swap. It does not validate your semantics (if your "English" variant contains French, that's your problem), but it does ensure structural validity: every node enforces its invariants at construction time, and unknown elements are preserved as opaque payloads rather than silently dropped.
What it is not: a segmentation engine, a corpus manager, an alignment tool, or an MT pipeline. Those are things you build with Hypomnema, not things it does for you.
Philosophy
Dependency-free by default. The standard library is all you need. lxml is an opt-in extra for performance. The aim is for this to stay true forever — optional dependencies for things like language-code validation may appear, but pip install hypomnema will always give you a working library with no transitive deps.
Type safety is non-negotiable. Everything passes mypy --strict. We lean hard on modern Python: PEP 695 generic syntax ([E] instead of TypeVar), match/case, union types as X | Y, dataclass(frozen=True, slots=True). The floor is 3.13 and will probably become 3.14 before 1.0. There is no reason a new project should carry compatibility baggage for EOL Python releases.
Round-trip fidelity. Anything Hypomnema doesn't model explicitly is captured verbatim as UnknownNode / UnknownInlineNode payloads (raw bytes). Loading a TMX file and dumping it back produces structurally equivalent output regardless of which backend you pick.
Backend parity. StandardBackend and LxmlBackend are first-class citizens. They share all namespace logic through pure functions. Every test that touches a backend is parametrized over both. If one can do something the other can't, that's a bug.
Architecture
┌─────────────────────────────────────────────────────┐
│  OPERATIONS                                         │
│  (ops/: walk, text, transform, normalize)           │
│  Pure functions on domain nodes. No I/O.            │
└──────────────────────┬──────────────────────────────┘
                       │
               domain nodes only
             (no backend, no XML)
                       │
┌──────────────────────┴──────────────────────────────┐
│  DOMAIN                                             │
│  (domain/: nodes, attributes, value types)          │
│  Immutable dataclasses. The IR.                     │
└──────────────────────┬──────────────────────────────┘
                       │
       domain nodes in, domain nodes out
                       │
┌──────────────────────┴──────────────────────────────┐
│  LOADERS ←→ DUMPERS                                 │
│  (format-specific entry points)                     │
│                                                     │
│  XML ── XmlLoader → domain → XmlDumper ── XML       │
│  JSON ─ JsonLoader → domain → JsonDumper ── JSON    │
│  CSV ─ CsvLoader → domain → CsvDumper ── CSV        │
└──────────────────────┬──────────────────────────────┘
                       │
 backend elements (et.Element, lxml._Element, dict, …)
                       │
┌──────────────────────┴──────────────────────────────┐
│  BACKENDS                                           │
│  (backends/xml/: standard, lxml, namespace)         │
│  XmlBackendLike[E] protocol, XmlBackend[E] ABC      │
└─────────────────────────────────────────────────────┘
The domain layer is the intermediate representation. Loaders convert serialized content into domain nodes. Dumpers convert domain nodes back into serialized content. Backends handle the low-level serialization details. Operations run on domain nodes and never touch XML.
This separation means you can add a JSON backend (for FastAPI interop), a CSV backend (for spreadsheet workflows), or a Polars backend (for dataframe-scale analysis) without touching a single line of the domain model. The domain is format-agnostic — it's an IR.
Domain model
Every TMX element is a dataclass(slots=True) with two distinct attribute regions:
- spec_attributes: a separate slotted dataclass holding the fields defined by the TMX 1.4B specification. Typed, validated at construction, IDE-autocompletable.
- extra_attributes: a plain dict[str, str] for vendor extensions, non-standard attributes, or anything not in the spec. Preserved through round-trips.
This split makes it immediately obvious what's spec and what's vendor contamination. No guessing.
Attribute renaming
TMX attribute names are mapped to readable Python. Hyphens aren't valid identifiers, and abbreviations hide intent:
| TMX attribute | Domain name | Notes |
|---|---|---|
| creationdate | created_at | datetime, not a string |
| changedate | last_modified_at | datetime |
| creationid | created_by | It's a user, not an ID |
| changeid | last_modified_by | Same |
| creationtool | creation_tool | Readable |
| creationtoolversion | creation_tool_version | Readable |
| o-encoding | original_encoding | Hyphens → underscores |
| o-tmf | original_tm_format | Abbreviation expanded |
| srclang | source_language | Readable |
| adminlang | admin_language | Readable |
| segtype | segmentation_type | Readable |
| datatype | original_data_type | It's "original data type" per spec |
| tuid | translation_unit_id | Readable |
| usagecount | usage_count | Underscore |
| lastusagedate | last_used_at | datetime |
| version | version | Kept as-is |
| lang | language | It's a language code |
Datetime attributes (created_at, last_modified_at, last_used_at) are stored as datetime objects, not strings — they're parsed on load and formatted on dump. TMX recommends the YYYYMMDDTHHMMSSZ format so that's what Hypomnema dumps to by default.
[!note] Hypomnema parses datetime via datetime.fromisoformat() and formats to YYYYMMDDTHHMMSSZ regardless of the original format. If you need a different output format, you'll need to override the dumpers for the relevant elements.
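The parse-then-normalize behaviour described in the note can be sketched with the standard library alone. The helper names (parse_timestamp, format_timestamp) are hypothetical, and treating a timestamp without an offset as UTC is an assumption of this sketch rather than documented Hypomnema behaviour.

```python
from datetime import datetime, timezone

# strftime pattern for the YYYYMMDDTHHMMSSZ layout used on output.
TMX_TIMESTAMP = "%Y%m%dT%H%M%SZ"


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 string into an aware datetime."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: a timestamp without an offset is taken to be UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def format_timestamp(value: datetime) -> str:
    """Convert to UTC and render in the compact TMX layout."""
    return value.astimezone(timezone.utc).strftime(TMX_TIMESTAMP)


same_instant = [
    format_timestamp(parse_timestamp("2024-01-02T03:04:05+00:00")),
    format_timestamp(parse_timestamp("2024-01-02T05:04:05+02:00")),
]
```

Both inputs name the same instant, so both entries in same_instant come out as the same string. Whatever layout the input used is lost on output, which is the behaviour the note warns about.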
Inline content
Segment content isn't a string. It is a mixed list of str, the inline node types allowed for that element (Sub, or Bpt | Ept | It | Ph | Hi | Sub), and UnknownInlineNode. This preserves markup structure so you can query, transform, and round-trip it.
[!important] TMX inline markup can nest to arbitrary depth, and Hypomnema's model mirrors that. Be careful with recursive traversals and unbounded loops over nested content. For memory efficiency, Hypomnema walks nested content internally with an explicit stack rather than recursive calls.
for item in variant.segment:
    match item:
        case str():
            ...
        case Bpt(spec_attributes=a):
            print(a.internal_id)
        case UnknownInlineNode(payload=bytes()):
            # preserved for round-trip, never touched
            ...
Type safety
The entire codebase is mypy --strict clean. We use modern Python features throughout:
- PEP 695 generics: XmlBackend[E] instead of XmlBackend(Generic[E]), XmlLoader[T] instead of TypeVar boilerplate
- match/case: for namespace resolution (namespace.py), event dispatch in iterparse, and tag dispatch in loaders
- Union types as X | Y: no Optional, no Union
- dataclass(frozen=True, slots=True): immutable, memory-efficient, hashable
- Protocol-based backend contract: XmlBackendLike[E] is a Protocol so you can mock it, proxy it, or implement it without inheriting. XmlBackend[E] is the shared ABC that both concrete backends inherit from.
- NamedTuples for resolution results: ResolveResult(prefix, uri, localname, clark) is explicit and typed, not a 4-tuple
The generic E parameter threads the element type through the entire stack: XmlLoader[et.Element] (standard library) and XmlLoader[et._Element] (lxml) produce the same domain nodes but are statically distinguished at the backend layer.
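The following sketch shows how a Protocol parameterized by an element type lets unrelated classes satisfy the same contract. ElementBackend, StdlibBackend, DictBackend, make_element, and round_trip_tag are hypothetical names for this example; only get_tag appears in the text. It uses a classic TypeVar so it also runs on interpreters older than 3.12, whereas the PEP 695 spelling would be class ElementBackend[E](Protocol).

```python
import xml.etree.ElementTree as ET
from typing import Protocol, TypeVar

E = TypeVar("E")


class ElementBackend(Protocol[E]):
    """Hypothetical, much smaller stand-in for a backend contract."""

    def make_element(self, tag: str) -> E: ...

    def get_tag(self, element: E) -> str: ...


class StdlibBackend:
    """Satisfies ElementBackend[ET.Element] without inheriting from it."""

    def make_element(self, tag: str) -> ET.Element:
        return ET.Element(tag)

    def get_tag(self, element: ET.Element) -> str:
        return element.tag


class DictBackend:
    """A test double whose elements are plain dicts."""

    def make_element(self, tag: str) -> dict[str, str]:
        return {"tag": tag}

    def get_tag(self, element: dict[str, str]) -> str:
        return element["tag"]


def round_trip_tag(backend: ElementBackend[E], tag: str) -> str:
    # A type checker binds E to the backend's element type for each call.
    return backend.get_tag(backend.make_element(tag))


tags = [round_trip_tag(StdlibBackend(), "tu"), round_trip_tag(DictBackend(), "tu")]
```

Neither concrete class names the protocol, which is what makes mocking or proxying a backend cheap: any object with matching method signatures is accepted by the checker.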
XML backend details
Namespace handling
All namespace logic lives in pure functions in namespace.py. No class, no state — just functions that take maps and return results:
resolve("ns:item", global_nsmap={...}, nsmap={...}) → ResolveResult
format_notation(result, "local", global_nsmap={...}) → str
register_namespace(nsmap, "ns", "http://...") # mutates dict in-place
Resolution uses successive lookups: per-call nsmap first, then global_nsmap. No dict merging, no copies. The xml prefix is built-in and always maps to http://www.w3.org/XML/1998/namespace.
Clark notation ({uri}local) is the internal representation. Prefixed names and default-namespace names are resolved at the boundary. format_notation with "prefixed" raises MultiplePrefixesError if the URI maps to more than one prefix within a single map.
LxmlBackend and element.nsmap
lxml elements expose element.nsmap, the in-scope namespace declarations. Methods that resolve names (get_tag, get_attribute, etc.) merge it with the caller's nsmap into a fresh dict for the resolution call. The caller's dict is never mutated.
Streaming
iterparse yields each matching element once its closing tag is reached. Unmatched elements are cleared immediately to bound memory. iterwrite writes in batches with configurable buffer size, optional root wrapper, XML declaration, and doctype.
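A minimal sketch of the yield-and-clear pattern, written against the standard library's xml.etree.ElementTree.iterparse rather than Hypomnema's own iterparse. The function name iter_elements and the depth-tracking strategy are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET
from collections.abc import Iterator
from typing import BinaryIO


def iter_elements(source: BinaryIO, tag: str) -> Iterator[ET.Element]:
    depth = 0  # how deep we are inside a matching element (0 = outside)
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if depth or elem.tag == tag:
                depth += 1
        elif depth:
            depth -= 1
            if depth == 0:
                # The matching element is complete: hand it out, then free it.
                yield elem
                elem.clear()
        else:
            # Closed outside any match: drop its text, attributes and children.
            elem.clear()


SAMPLE = b"""<tmx version="1.4"><header/><body>
<tu tuid="1"><tuv xml:lang="en"><seg>Hello</seg></tuv></tu>
<tu tuid="2"><tuv xml:lang="en"><seg>World</seg></tuv></tu>
</body></tmx>"""

pairs = [
    (tu.get("tuid"), tu.findtext("tuv/seg"))
    for tu in iter_elements(io.BytesIO(SAMPLE), "tu")
]
```

Tracking depth matters: clearing every non-matching element on its end event would also empty the children of a matching element before it is yielded. Each yielded element is cleared as soon as the consumer asks for the next one, so callers should copy out what they need during iteration.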
Parsing
Both parse() methods use single-pass iterparse with start and start-ns events. from_bytes and from_string wrap the input in a BytesIO/StringIO and delegate to parse(). No double-scanning.
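The single-pass idea can be illustrated with the standard library: one iterparse run with start and start-ns events both builds the tree and collects namespace declarations. The function names mirror the text, but the bodies and return type below are a sketch under that assumption, not Hypomnema's implementation.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO


def parse(source: BinaryIO) -> tuple[ET.Element, dict[str, str]]:
    nsmap: dict[str, str] = {}
    root: ET.Element | None = None
    for event, payload in ET.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            # payload is a (prefix, uri) pair for a namespace declaration.
            prefix, uri = payload
            nsmap[prefix] = uri
        elif root is None:
            # The first start event belongs to the document root.
            root = payload
    if root is None:
        raise ValueError("document has no root element")
    return root, nsmap


def from_bytes(data: bytes) -> tuple[ET.Element, dict[str, str]]:
    # Wrap the bytes in a file-like object and reuse the same code path.
    return parse(io.BytesIO(data))


root, nsmap = from_bytes(
    b'<tmx xmlns:x="http://example.com/x" version="1.4">'
    b"<body><x:note>hi</x:note></body></tmx>"
)
```

Once the loop has consumed every event the tree under root is complete, so no second scan of the input is needed to recover the namespace map.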
What's not here yet
Language code validation. lang attributes are stored as strings. Proper BCP 47 validation would require an external dependency, which contradicts the zero-dep default. This will likely become an optional extra (hypomnema[langcode]) using langcodes or similar.
<map> and <ude> tags. These relate to custom character encodings and mapping tables. Supporting them properly would add significant complexity for negligible benefit since virtually no modern TMX file uses them. If broad demand emerges, they can be added.
<ut> tag. Will be modeled in a coming update.
TMX versions other than 1.4B. This is the de facto standard — every tool produces it, virtually no tool produces anything else. TMX 2.0 (which was only ever a Committee Draft) will be considered if it ever becomes real.
Development
The project uses uv for everything — dependency management, virtual environments, testing, linting, type checking, publishing.
uv run pytest # run tests
uv run ruff check src/ tests/ # lint
uv run mypy --strict src/ # type check
License
MIT. See LICENSE.