Python library for manipulating, creating and editing tmx files
Hypomnema
Industrial-grade TMX 1.4b parsing and serialization for Python.
Hypomnema is a strictly typed infrastructure library for working with TMX 1.4b (Translation Memory eXchange) files. It is designed as a foundation for building localization tools, CAT software, and NLP pipelines, focusing on correctness, type safety, and memory efficiency when handling large datasets.
Warning
This project is currently in Alpha. It is a work in progress and should not be used for full production workflows until the 1.0 version is released. API changes may occur.
Why Hypomnema?
While other TMX libraries exist, Hypomnema is built with modern Python engineering standards to address common pain points:
- Strict Type Safety: Every TMX element is modeled as a typed Python dataclass. Editors can offer accurate autocompletion, and static type checkers catch many errors before runtime.
- Policy-Driven Error Handling: Real-world TMX files are often messy. Instead of crashing on a single malformed date or missing attribute, Hypomnema uses a granular Policy System. You define exactly how to handle specific errors (raise, ignore, use default, or keep raw value) without compromising the integrity of the rest of the file.
- Full TMX 1.4b Level 2 Compliance: Supports arbitrary inline element nesting depth and complete attribute modeling.
- Memory Efficient: Supports streaming processing for large TMX files.
- Backend Agnostic: Works with the standard library xml module or lxml (for performance).
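As a rough illustration of what "backend agnostic" can mean in practice, the sketch below picks lxml when it is installed and falls back to the standard library otherwise. It is not Hypomnema's own selection logic; the function name pick_etree_backend is hypothetical and only standard import machinery is used.

```python
import importlib


def pick_etree_backend():
    """Return (name, module) for an available ElementTree-compatible backend."""
    try:
        # lxml is optional; lxml.etree exposes an ElementTree-compatible API.
        return "lxml", importlib.import_module("lxml.etree")
    except ImportError:
        # The standard library implementation is always available.
        return "stdlib", importlib.import_module("xml.etree.ElementTree")


backend_name, etree = pick_etree_backend()
root = etree.fromstring("<tmx version='1.4'><body/></tmx>")
print(backend_name, root.tag, root.get("version"))
```

Because both modules share the same basic parsing calls, code written against one usually runs against the other for simple cases like this.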
Installation
Install using uv (recommended):
uv add hypomnema
Or using pip:
pip install hypomnema
For maximum performance with large files (enables lxml backend):
uv add "hypomnema[lxml]"
# or
pip install "hypomnema[lxml]"
Quick Start
import hypomnema as hm
# Load a TMX file
tmx = hm.load("translations.tmx")
# Inspect the content
print(f"Source language: {tmx.header.srclang}")
print(f"Translation units: {len(tmx.body)}")
# Find a specific translation unit
for tu in tmx.body:
    for tuv in tu.variants:
        if tuv.lang == "fr":
            print(f"French: {tuv.content}")
# Save changes
hm.dump(tmx, "output.tmx")
Advanced Usage
Streaming Large Files
For large translation memories, use the streaming API to process units one by one without loading the whole file into RAM:
import hypomnema as hm
# Stream translation units ('tu') only
for tu in hm.load("massive_memory.tmx", filter="tu"):
    print(f"Processing TU: {tu.tuid}")
    # Process units here...
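To show the general technique behind streaming (not Hypomnema's internals), here is a minimal sketch using the standard library's xml.etree.ElementTree.iterparse. The sample document and the iter_tuids helper are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="demo" creationtoolversion="0" segtype="sentence"
          o-tmf="demo" adminlang="en-US" srclang="en-US" datatype="plaintext"/>
  <body>
    <tu tuid="1"><tuv xml:lang="en-US"><seg>Hello</seg></tuv></tu>
    <tu tuid="2"><tuv xml:lang="en-US"><seg>World</seg></tuv></tu>
  </body>
</tmx>
"""


def iter_tuids(source):
    """Yield the tuid of each <tu>, clearing each element once it is handled."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            yield elem.get("tuid")
            elem.clear()  # drop children and text so they can be freed


tuids = list(iter_tuids(io.BytesIO(SAMPLE)))
print(tuids)
```

The key idea is that each unit is handled when its closing tag is seen and then cleared, so its content does not stay in memory while the rest of the file is parsed.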
Creating and Saving TMX Files
You can programmatically create TMX files using the helper factory functions:
import hypomnema as hm
from hypomnema import helpers
# 1. Create a Header
header = helpers.create_header(
    creationtool="hypomnema",
    segtype="sentence",
    srclang="en-US",
    adminlang="en-US"
)
# 2. Create a Translation Unit (TU) with variants
# TUVs can contain plain text or mixed content with inline tags
tuv_en = helpers.create_tuv("en-US", content="Hello world")
tuv_fr = helpers.create_tuv("fr-FR", content=["Bonjour ", helpers.create_ph(x=1, type="lb"), "le monde"])
tu = helpers.create_tu(
    tuid="1",
    srclang="en-US",
    variants=[tuv_en, tuv_fr]
)
# 3. Create the TMX object
tmx = helpers.create_tmx(header=header, body=[tu])
# 4. Save to disk
hm.dump(tmx, "output.tmx")
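For orientation, the sketch below builds the same kind of document by hand with the standard library, to show the TMX 1.4b element structure (tmx, header, body, tu, tuv, seg) that such objects map onto. It is not Hypomnema's serializer, and the header attribute values are placeholders chosen for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

tmx = ET.Element("tmx", version="1.4")
ET.SubElement(tmx, "header", {
    "creationtool": "demo", "creationtoolversion": "0.0",
    "segtype": "sentence", "o-tmf": "unknown", "adminlang": "en-US",
    "srclang": "en-US", "datatype": "plaintext",
})
body = ET.SubElement(tmx, "body")
tu = ET.SubElement(body, "tu", tuid="1")

# Plain-text segment.
seg_en = ET.SubElement(ET.SubElement(tu, "tuv", {XML_LANG: "en-US"}), "seg")
seg_en.text = "Hello world"

# Mixed content: text, an inline <ph> placeholder, then more text as its tail.
seg_fr = ET.SubElement(ET.SubElement(tu, "tuv", {XML_LANG: "fr-FR"}), "seg")
seg_fr.text = "Bonjour "
ph = ET.SubElement(seg_fr, "ph", x="1", type="lb")
ph.tail = "le monde"

xml_text = ET.tostring(tmx, encoding="unicode")
print(xml_text)
```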
Policy Configuration
Real-world TMX files are often imperfect. Policies let you configure how Hypomnema handles validation errors:
import logging
import hypomnema as hm
from hypomnema.xml.policy import Behavior, XmlDeserializationPolicy
policy = XmlDeserializationPolicy(
    missing_seg=Behavior("ignore", logging.WARNING),
    extra_text=Behavior("ignore", logging.INFO),
)
tmx = hm.load("messy.tmx", deserializer_policy=policy)
Available Policy Keys
Deserialization:
- invalid_child_tag: Action for unexpected child elements.
- missing_text_content: Action for elements missing required text.
- invalid_tag: Action for unexpected element tags.
- extra_text: Action for unexpected text content.
- required_attribute_missing: Action for missing required attributes.
- multiple_seg: Action for multiple <seg> elements in <tuv>.
- multiple_headers: Action for multiple <header> elements.
- invalid_datetime_value: Action for unparsable datetime values.
- invalid_enum_value: Action for invalid enum values.
- invalid_int_value: Action for unparsable integer values.
- missing_deserialization_handler: Action for missing element handlers.
- missing_seg: Action for <tuv> elements without <seg>.
- multiple_body: Action for multiple <body> elements.
- missing_header: Action for <tmx> elements without <header>.
- missing_body: Action for <tmx> elements without <body>.
Serialization:
- invalid_element_type: Action for unexpected object types.
- missing_text_content: Action for objects missing required text.
- required_attribute_missing: Action for missing required attributes.
- invalid_child_element: Action for invalid child element types.
- invalid_attribute_type: Action for attributes with wrong types.
- missing_serialization_handler: Action for missing element handlers.
Namespace:
- existing_namespace: Action when registering an already-existing prefix.
- inexistent_namespace: Action when resolving an unregistered prefix.
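The idea of a per-error policy can be sketched independently of the library. The example below applies the four actions mentioned earlier (raise, ignore, use default, keep raw value) to a single case, an unparsable TMX timestamp. DemoBehavior and parse_tmx_datetime are hypothetical names invented for this sketch; they are not Hypomnema's Behavior class or its parsing code.

```python
import logging
from dataclasses import dataclass
from datetime import datetime

logger = logging.getLogger("tmx_policy_demo")


@dataclass(frozen=True)
class DemoBehavior:
    """Hypothetical stand-in: what to do on an error and at which level to log it."""
    action: str  # "raise", "ignore", "default" or "keep"
    log_level: int = logging.WARNING


def parse_tmx_datetime(raw, behavior, default=None):
    """Parse a TMX timestamp (YYYYMMDDThhmmssZ), applying `behavior` on failure."""
    try:
        return datetime.strptime(raw, "%Y%m%dT%H%M%SZ")
    except ValueError:
        logger.log(behavior.log_level, "invalid datetime value: %r", raw)
        if behavior.action == "raise":
            raise
        if behavior.action == "keep":
            return raw  # keep the raw string for later inspection
        if behavior.action == "default":
            return default
        return None  # "ignore": drop the value and carry on


good = parse_tmx_datetime("20240131T120000Z", DemoBehavior("raise"))
kept = parse_tmx_datetime("31/01/2024", DemoBehavior("keep", logging.INFO))
dropped = parse_tmx_datetime("31/01/2024", DemoBehavior("ignore"))
print(good, kept, dropped)
```

Keeping the decision in a small value object means one malformed attribute can be logged and skipped while the rest of the file is still parsed strictly.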
Text Extraction
Extract plain text content from elements, skipping inline markup:
from hypomnema import helpers, Bpt
tuv = helpers.create_tuv(
    "en",
    content=[
        "Hello ",
        helpers.create_bpt(i=1, content="Bpt text"),
        "World",
        helpers.create_ept(i=1, content="Ept text")
    ],
)
# Quick access via text helper
print(helpers.text(tuv)) # "Hello World"
# Iterate over text segments
for text in helpers.iter_text(tuv):
    print(text) # "Hello " then "Bpt text" then "World" then "Ept text"
# Ignore specific element types
for text in helpers.iter_text(tuv, ignore=Bpt):
    print(text) # "Hello " then "World" then "Ept text"
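The same distinction, text directly in the segment versus text inside inline elements, can be reproduced on raw XML with the standard library. This is a sketch of the traversal idea only; outer_text and all_text are hypothetical helpers, not the functions from hypomnema.helpers.

```python
import xml.etree.ElementTree as ET

seg = ET.fromstring(
    '<seg>Hello <bpt i="1">Bpt text</bpt>World<ept i="1">Ept text</ept></seg>'
)


def outer_text(elem):
    """Join only the text that sits directly in `elem`, skipping inline children."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(child.tail or "")  # text after the child, not inside it
    return "".join(parts)


def all_text(elem, ignore=()):
    """Yield every text run in document order, skipping subtrees with ignored tags."""
    if elem.text:
        yield elem.text
    for child in elem:
        if child.tag not in ignore:
            yield from all_text(child, ignore)
        if child.tail:
            yield child.tail


print(outer_text(seg))
print(list(all_text(seg)))
print(list(all_text(seg, ignore=("bpt",))))
```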
TMX 1.4b Level 2 Compliance
Hypomnema is the only Python library that fully implements the TMX 1.4b Level 2 specification:
- Arbitrary Nesting Depth: No limits on inline element nesting. <bpt>/<ept> pairs, <ph> placeholders, and <sub> elements can nest to any depth.
- Complete Inline Element Support: All six inline markup elements (<bpt>, <ept>, <it>, <ph>, <hi>, <sub>) with proper mixed content handling.
- Full Attribute Modeling: Every TMX attribute is typed, including enumerations for segtype, pos, and assoc.
- Metadata Preservation: Properties and notes supported at all valid nesting levels.
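To make "arbitrary nesting depth" concrete, here is a small sketch of how mutually recursive dataclasses can model a <ph> that contains a <sub>, which in turn contains another <ph>. InlinePh, InlineSub and nesting_depth are hypothetical names for this illustration and are not Hypomnema's actual classes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Union


@dataclass
class InlineSub:
    """Hypothetical model of <sub>: text mixed with further inline elements."""
    content: list[Union[str, InlinePh]] = field(default_factory=list)


@dataclass
class InlinePh:
    """Hypothetical model of <ph>: text mixed with <sub> elements."""
    x: int | None = None
    content: list[Union[str, InlineSub]] = field(default_factory=list)


def nesting_depth(node):
    """Return how many element levels deep `node` goes (plain strings count as 0)."""
    if isinstance(node, str):
        return 0
    return 1 + max((nesting_depth(child) for child in node.content), default=0)


nested = InlinePh(x=1, content=["a", InlineSub(content=["b", InlinePh(x=2, content=["c"])])])
print(nesting_depth(nested))
```

Because each type refers to the other, the model places no fixed limit on depth; recursion in the data structure mirrors recursion in the markup.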
Development
To contribute or run tests locally:
- Clone the repository.
- Install dependencies using uv: uv sync
- Run the test suite: uv run pytest
License
MIT