Tool to create, augment, filter and generally work with TMX files.

tmxutil

tmxutil.py allows you to add domain groups to your Europat tmx files, or filter on them.

Installation & Requirements

To install tmxutil.py, download it from GitHub and place it somewhere you can reach from the command line. Besides Python 3.5 or newer, it has no external dependencies.

Examples

Example tmx file: DE-EN-2001-Abstract.tmx.gz, ipc domain group file: ipc-groups.tab

The provided IPC grouping has the following high-level categories:

Group  Description
I      General / Default
II     Computing, Science and Tech (science, photography, optics, cryptography, communications)
III    Biotechnology and Chemical (food, biotech, nanotech, chemistry)
IV     Engineering and Manufacturing (engines, nuclear physics, agriculture, forestry, aviation)
V      Daily life (household, music, arts, clothing, jewelry, sports and decorating)

Filtering by IPC code: Select only the sentence pairs that come from patents with certain IPC codes.

gzip -cd DE-EN-2001-Abstract.tmx.gz \
| ./tmxutil.py -o tmx --with-ipc D06M15/59 D06P005/02 \
> selection.tmx
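
To make the selection rule concrete, here is a minimal Python sketch of this kind of filter. It is not tmxutil's implementation. It assumes that IPC codes are stored as `<prop type="ipc">` elements inside each translation unit, that codes are compared exactly, and that a unit is kept when it carries any one of the requested codes. The sample TMX and the `units_with_ipc` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written TMX sample; the "ipc" property name is an assumption.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu tuid="1"><prop type="ipc">D06M15/59</prop>
<tuv xml:lang="de"><seg>Eins</seg></tuv>
<tuv xml:lang="en"><seg>One</seg></tuv></tu>
<tu tuid="2"><prop type="ipc">G06F17/28</prop>
<tuv xml:lang="de"><seg>Zwei</seg></tuv>
<tuv xml:lang="en"><seg>Two</seg></tuv></tu>
</body></tmx>"""


def units_with_ipc(root, wanted):
    """Yield the <tu> elements that carry at least one of the wanted IPC codes."""
    wanted = set(wanted)
    for tu in root.iter("tu"):
        # Collect IPC props at any depth below the unit (tu or tuv level).
        codes = {p.text for p in tu.iter("prop") if p.get("type") == "ipc"}
        if codes & wanted:
            yield tu


root = ET.fromstring(SAMPLE)
kept = [tu.get("tuid") for tu in units_with_ipc(root, ["D06M15/59", "D06P005/02"])]
```

With the sample above, only the first translation unit is kept, because it is the only one whose IPC code appears in the requested list.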

Export selection as tab-separated sentence pairs: By changing the output format of tmxutil you can get the sentence pairs as plain text separated by tabs.

This option can be combined with data augmentation and filter options, although only the first source document per sentence pair is exported. You'll also have to tell it in what order you want the languages to be exported.

gzip -cd DE-EN-2001-Abstract.tmx.gz \
| ./tmxutil.py \
    -o tab \
    --output-languages en de \
    --with-ipc D06M15/59 \
> selection-en-de.tsv
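
For downstream processing, the exported file can be read with Python's csv module. The sketch below assumes the four-column layout this README gives for the tab format (source1, source2, sentence1, sentence2); the `read_pairs` helper and the sample row are illustrative only.

```python
import csv
import io


def read_pairs(handle):
    """Yield (source1, source2, sentence1, sentence2) tuples from a tab export."""
    # QUOTE_NONE keeps quote characters inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # skip lines that do not have the assumed four columns
        yield tuple(row)


# In-memory stand-in for a file such as selection-en-de.tsv.
sample = io.StringIO(
    "doc-en-1\tdoc-de-1\tA textile finish.\tEine Textilveredelung.\n"
    "this line is malformed\n"
)
pairs = list(read_pairs(sample))
```

To read a real export, pass `open("selection-en-de.tsv", encoding="utf-8", newline="")` instead of the in-memory sample.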

Adding ipc groups to tmx file: To make more coarse-grained selections you can add ipc groups (cf. domains) to the sentence pairs, based on the IPC codes already in the tmx file. You can then use those ipc groups to make a selection using --with-ipc-group, which works just like --with-ipc.

The ipc-groups.tab file used here should have an IPC code prefix and a group name on each line, separated by a tab, as the first two columns. You can get the ipc-groups.tab file from the project's releases page.

gzip -cd DE-EN-2001-Abstract.tmx.gz \
| ./tmxutil.py \
	-o tmx \
	--ipc-group ipc-groups.tab \
| gzip > DE-EN-2001-Abstract-with-groups.tmx.gz

Only the tmx output format will maintain the ipc-group metadata by adding ipc-group properties. Other output formats won't maintain it, but you can still use --with-ipc-group directly to make a selection.
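
The prefix-to-group mapping described above can be sketched in Python as follows. This is an illustration, not tmxutil's code. It assumes that the longest matching prefix wins when several prefixes match, which this README does not state. The three example lines are invented and do not reflect the contents of the real ipc-groups.tab.

```python
def load_groups(lines):
    """Parse 'prefix<TAB>group' lines; any further columns are ignored."""
    groups = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[0]:
            groups.append((fields[0], fields[1]))
    # Longest prefixes first so the most specific entry wins (an assumption).
    groups.sort(key=lambda item: len(item[0]), reverse=True)
    return groups


def group_for(ipc_code, groups, default=None):
    """Return the group of the first (longest) prefix that matches the code."""
    for prefix, group in groups:
        if ipc_code.startswith(prefix):
            return group
    return default


# Invented example lines, not taken from the real ipc-groups.tab.
example_lines = ["D06\tV\n", "D\tIV\n", "G06\tII\tcomputing\n"]
groups = load_groups(example_lines)
```

With these invented lines, `group_for("D06M15/59", groups)` resolves through the more specific `D06` entry rather than the shorter `D` entry.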

Converting tsv to tmx: tmxutil can also be used to generate tmx files from sentence pairs. The input format is the same as the tab output format, that is source1 \t source2 \t sentence1 \t sentence2.

To also add the IPC codes from metadata, use the --ipc option. The format of this file should be l1_id \t _ \t _ \t _ \t l1_lang \t l1_ipcs \t l2_id \t _ \t _ \t _ \t l2_lang \t l2_ipcs where id is the document identifier, and l1_ipcs and l2_ipcs are comma-separated lists of all IPC codes for the respective document.

cat DE-EN-2001-Abstract-aligned.tsv \
| ./tmxutil.py \
    -o tmx \
    -l de en \
    -d \
    --ipc DE-EN-2001-Metadata.tab \
| gzip -9c > DE-EN-2001-Abstract.tmx.gz
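
To check a metadata file before passing it to --ipc, the twelve-column layout described above can be parsed as in the Python sketch below. The `parse_metadata_line` helper and the sample line are illustrative, and the columns marked `_` are simply skipped.

```python
def parse_metadata_line(line):
    """Split one metadata line into a (l1, l2) pair of dicts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected 12 tab-separated columns, got %d" % len(fields))

    def side(offset):
        # Within each side: column 0 is the id, 4 the language, 5 the IPC list.
        ipcs = [c.strip() for c in fields[offset + 5].split(",") if c.strip()]
        return {"id": fields[offset], "lang": fields[offset + 4], "ipcs": ipcs}

    return side(0), side(6)


# Invented sample line in the documented column order.
sample_line = (
    "DE-1\t_\t_\t_\tde\tD06M15/59,D06P005/02\t"
    "EN-1\t_\t_\t_\ten\tD06M15/59\n"
)
l1, l2 = parse_metadata_line(sample_line)
```

A line with fewer than twelve columns raises a ValueError, which makes truncated metadata visible early.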

Parameters

  • -i tmx|tab, --input-format tmx|tab input format; auto-detected if not given. Possible values: tmx, tab.
    • In case of tab you'll have to specify which languages are in there using --input-languages l1 l2.
  • -o tmx|tab|txt, --output-format tmx|tab|txt output format, either tmx, tab or txt.
    • In case of tab you'll have to specify the languages, e.g. --output-languages l1 l2.
    • When using txt, you'll have to select which language you want the plain text for, i.e. --output-languages en.
  • -l L1 L2, --input-languages L1 L2 languages in the input file and their order. Only necessary when reading tab files.
  • --output-languages L1 [L2] language or order of languages in the output file. Not used if tmx is the output.
  • -d, --deduplicate groups sentence pairs with the same text or hash together.
  • --drop PROP [PROP ...] drop properties from the sentence pairs while writing output.
  • --renumber-output causes all translation unit ids to be reset. Enabled by default when multiple input files are given.
  • --ipc FILE adds IPC metadata to each sentence pair.
  • --with PROP=VALUE [PROP=VALUE ...] filters sentence pairs on their text or properties. Supported operators are =, >, <, >=, <= and =~ for regular expressions. Use multiple PROP=VALUE pairs in one --with option to require all of the conditions (i.e. AND), or use multiple --with options to accept any of them (i.e. OR).
  • --without PROP=VALUE [PROP=VALUE ...] same as --with, but negated, for excluding instead of including sentence pairs.
  • --verbose enables progress updates.
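
The AND/OR behaviour of --with described above can be illustrated with a small Python sketch. This is not tmxutil's implementation. It assumes that properties are available as a mapping from name to a list of string values, that a condition holds when any value of the property satisfies it, and that the ordering operators compare numerically. The property names `ipc` and `score` in the sample are hypothetical.

```python
import operator
import re

# Two-character operators come first so ">=" is not read as ">" followed by "=".
CONDITION = re.compile(r"^([^=<>]+)(=~|>=|<=|=|>|<)(.*)$")
NUMERIC = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def parse_condition(text):
    """Split 'PROP<op>VALUE' into a (prop, op, value) tuple."""
    match = CONDITION.match(text)
    if match is None:
        raise ValueError("not a PROP<op>VALUE condition: %r" % text)
    return match.groups()


def holds(condition, props):
    """True if any value of the named property satisfies the condition."""
    name, op, expected = parse_condition(condition)
    for value in props.get(name, []):
        if op == "=" and value == expected:
            return True
        if op == "=~" and re.search(expected, value):
            return True
        if op in NUMERIC and NUMERIC[op](float(value), float(expected)):
            return True
    return False


def selected(with_options, props):
    """with_options holds one list of conditions per --with option."""
    if not with_options:
        return True  # no filter given: keep everything
    # AND within one option, OR across options.
    return any(all(holds(c, props) for c in option) for option in with_options)


# Hypothetical sentence pair properties.
pair = {"ipc": ["D06M15/59"], "score": ["0.8"]}
both_in_one_option = selected([["ipc=D06M15/59", "score>=0.9"]], pair)
split_over_two_options = selected([["ipc=D06M15/59"], ["score>=0.9"]], pair)
```

Here the same two conditions reject the pair when given in a single option, because the score condition fails, but accept it when split over two options, because the IPC condition alone is enough.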


Source Distribution

tmxutil-1.2.tar.gz (14.1 kB)

Uploaded Source

Built Distribution

tmxutil-1.2-py3-none-any.whl (13.2 kB)

Uploaded Python 3

File details

Details for the file tmxutil-1.2.tar.gz.

File metadata

  • Download URL: tmxutil-1.2.tar.gz
  • Upload date:
  • Size: 14.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.13

File hashes

Hashes for tmxutil-1.2.tar.gz
Algorithm Hash digest
SHA256 6050631352edac2c1750fd7b3181e68c4f16c816ffbf9044462ab90a8a2f289f
MD5 8037570e89685e963af55a197f589477
BLAKE2b-256 96f99e21289c547ae973f6dff858d594fc3914498be45a72138e63aa1856e87a

File details

Details for the file tmxutil-1.2-py3-none-any.whl.

File metadata

  • Download URL: tmxutil-1.2-py3-none-any.whl
  • Upload date:
  • Size: 13.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.13

File hashes

Hashes for tmxutil-1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 c2e0d205cd6d5e9cec4b5cc7b8e4e7995431ee7a0cb4363f93d1edab192a2eb6
MD5 2e2bbfd0b4fd6ef334b65ef606838eb8
BLAKE2b-256 3f02e4bb48025b87b75c3cd8bbf5b7314c3ab9dfef1bf011055b180ddd7e0b94
