Tool to create, augment, filter and generally work with TMX files.

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

tmxutil

tmxutil.py allows you to add domain groups to your Europat tmx files, or filter on them.

Installation & Requirements

To install tmxutil.py, download it from GitHub and place it somewhere you can reach from the command line. It requires Python 3.5 or newer and has no external dependencies.

Examples

Example tmx file: DE-EN-2001-Abstract.tmx.gz, ipc domain group file: ipc-groups.tab

The provided IPC grouping has the following high-level categories:

Group	Description
I	General / Default
II	Computing, Science and Tech (Science, photography, optics, cryptography, communications)
III	Biotechnology and Chemical (food, biotech, nanotech, chemistry)
IV	Engineering and Manufacturing (Engines, nuclear physics, agriculture, forestry, aviation)
V	Daily life (Household, music, arts, clothing, jewelry, sports and decorating)

Filtering by IPC code: Select only the sentence pairs that come from patents with certain IPC codes.

gzip -cd DE-EN-2001-Abstract.tmx.gz \
| ./tmxutil.py -o tmx --with-ipc D06M15/59 D06P005/02 \
> selection.tmx
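
To make the selection step more concrete, here is a minimal Python sketch of the same idea. It is not tmxutil's implementation. It assumes that each translation unit stores its IPC codes in `<prop type="ipc">` elements and that a unit is kept when it carries any one of the requested codes; the sample document and the function name `select_by_ipc` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-unit TMX document. The property name "ipc" is an assumption.
SAMPLE_TMX = """<tmx version="1.4">
  <header/>
  <body>
    <tu tuid="1">
      <prop type="ipc">D06M15/59</prop>
      <tuv xml:lang="de"><seg>Hallo Welt</seg></tuv>
      <tuv xml:lang="en"><seg>Hello world</seg></tuv>
    </tu>
    <tu tuid="2">
      <prop type="ipc">A01B1/00</prop>
      <tuv xml:lang="de"><seg>Guten Tag</seg></tuv>
      <tuv xml:lang="en"><seg>Good day</seg></tuv>
    </tu>
  </body>
</tmx>"""


def select_by_ipc(tmx_text, wanted_codes):
    """Return the <tu> elements that carry at least one of the wanted IPC codes."""
    root = ET.fromstring(tmx_text)
    wanted = set(wanted_codes)
    selected = []
    for tu in root.iter("tu"):
        # Collect all IPC codes attached directly to this translation unit.
        codes = {prop.text for prop in tu.findall("prop") if prop.get("type") == "ipc"}
        if codes & wanted:
            selected.append(tu)
    return selected


selected_ids = [tu.get("tuid") for tu in select_by_ipc(SAMPLE_TMX, ["D06M15/59", "D06P005/02"])]
print(selected_ids)
```

For real files the command-line tool is the better choice; the sketch loads the whole document into memory, which will not scale to large TMX files.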

Export selection as tab-separated sentence pairs: By changing the output format of tmxutil you can get the sentence pairs as plain text separated by tabs.

This option can be combined with data augmentation and filter options, although only the first source document per sentence pair is exported. You'll also have to tell it in what order you want the languages to be exported.

gzip -cd DE-EN-2001-Abstract.tmx.gz \
| ./tmxutil.py \
    -o tab \
    --output-languages en de \
    --with-ipc D06M15/59 \
> selection-en-de.tsv
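
The exported file can be read back with a few lines of Python. The sketch below assumes the four-column layout that this document gives for the tab format (source1, source2, sentence1, sentence2) and the `en de` language order from the command above. The sample row and the function name `read_pairs` are invented for illustration.

```python
import csv
import io

# Hypothetical single row in the assumed layout: source1, source2, sentence1, sentence2.
SAMPLE_TSV = "doc-en-1\tdoc-de-1\tHello world\tHallo Welt\n"


def read_pairs(handle):
    """Yield one dict per sentence pair from a tab-separated export."""
    # QUOTE_NONE keeps quote characters inside sentences untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        source_en, source_de, sentence_en, sentence_de = row[:4]
        yield {
            "sources": (source_en, source_de),
            "en": sentence_en,
            "de": sentence_de,
        }


pairs = list(read_pairs(io.StringIO(SAMPLE_TSV)))
print(pairs[0]["en"], "|", pairs[0]["de"])
```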

Adding ipc groups to tmx file: To make more coarse-grained selections you can add ipc groups (i.e. domains) to the sentence pairs, based on the IPC codes already in the tmx file. You can then use those ipc groups to make a selection using --with-ipc-group, which works just like --with-ipc.

The ipc-groups.tab file used here should have an IPC code prefix and a group name on each line, separated by a tab, as the first two columns. You can get the ipc-groups.tab file from the project's releases page.

gzip -cd DE-EN-2001-Abstract.tmx.gz \
| ./tmxutil.py \
	-o tmx \
	--ipc-group ipc-groups.tab \
| gzip > DE-EN-2001-Abstract-with-groups.tmx.gz

Only the tmx output format will maintain the ipc-group metadata by adding ipc-group properties. Other output formats won't maintain it, but you can still use --with-ipc-group directly to make a selection.
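
The prefix-based grouping can be pictured with a short Python sketch. It is an illustration only: the sample prefixes are invented, and the rule that the longest matching prefix wins is an assumption, not something the tool's documentation states.

```python
import csv
import io

# Hypothetical excerpt of a groups file: IPC code prefix, tab, group name.
SAMPLE_GROUPS_TAB = "D06\tIII\nG06\tII\nG06Q\tI\n"


def load_groups(handle):
    """Read prefix -> group from the first two tab-separated columns."""
    groups = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            groups[row[0]] = row[1]
    return groups


def group_for(ipc_code, groups):
    """Return the group of the longest prefix of ipc_code found in groups, or None."""
    # Assumption: when several prefixes match, the most specific one wins.
    for end in range(len(ipc_code), 0, -1):
        group = groups.get(ipc_code[:end])
        if group is not None:
            return group
    return None


groups = load_groups(io.StringIO(SAMPLE_GROUPS_TAB))
print(group_for("D06M15/59", groups))
print(group_for("G06Q10/00", groups))
```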

Converting tsv to tmx: tmxutil can also be used to generate tmx files from sentence pairs. The input format is the same as the tab output format, that is source1 \t source2 \t sentence1 \t sentence2.

To also add the IPC codes from metadata, use the --ipc option. The format of this file should be l1_id \t _ \t _ \t _ \t l1_lang \t l1_ipcs \t l2_id \t _ \t _ \t _ \t l2_lang \t l2_ipcs where l1_id and l2_id are the document identifiers, and l1_ipcs and l2_ipcs are comma-separated lists of all ipc codes for the respective document.

cat DE-EN-2001-Abstract-aligned.tsv \
| ./tab2tmx.py \
    -o tmx \
    -l de en \
    -d \
    --ipc DE-EN-2001-Metadata.tab \
| gzip -9c > DE-EN-2001-Abstract.tmx.gz
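
As a reading aid for the metadata format, the following Python sketch parses one line with the twelve columns described above. The sample identifiers and the function name `parse_metadata_line` are hypothetical, and the columns shown as `_` are simply skipped.

```python
# Hypothetical metadata line in the 12-column layout described above.
SAMPLE_LINE = "DE-1\t_\t_\t_\tde\tD06M15/59,D06P5/02\tEN-1\t_\t_\t_\ten\tD06M15/59\n"


def parse_metadata_line(line):
    """Map each document id on the line to its language and list of IPC codes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError("expected 12 tab-separated columns, got %d" % len(cols))
    documents = {}
    # Each language occupies six columns: id, three ignored columns, lang, ipcs.
    for offset in (0, 6):
        doc_id = cols[offset]
        lang = cols[offset + 4]
        ipcs = [code.strip() for code in cols[offset + 5].split(",") if code.strip()]
        documents[doc_id] = {"lang": lang, "ipcs": ipcs}
    return documents


metadata = parse_metadata_line(SAMPLE_LINE)
print(metadata["DE-1"]["ipcs"])
```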

Parameters

-i tmx|tab, --input-format tmx|tab input format, if not given will be auto-detected. Possible values: tmx, tab.
- In case of tab you'll have to specify which languages are in there using --input-languages l1 l2.
-o tmx|tab|txt, --output-format tmx|tab|txt output format, either tmx, tab or txt.
- In case of tab you'll have to specify the languages, e.g. --output-languages l1 l2.
- When using txt, you'll have to select which language you want the plain text for, i.e. --output-languages en.
-l L1 L2, --input-languages L1 L2 languages in the input file and their order. Only necessary when reading tab files.
--output-languages L1 [L2] language or order of languages in the output file. Not used if tmx is the output.
-d, --deduplicate groups sentence pairs with the same text or hash together.
--drop PROP [PROP ...] drop properties from the sentence pairs while writing output.
--renumber-output causes all translation unit ids to be reset. Enabled by default when multiple input files are given.
--ipc FILE adds IPC metadata to each sentence pair.
--with PROP=VALUE [PROP=VALUE ...] filters sentence pairs on their text or properties. Supported operators are =, >, <, >=, <= and =~ for regular expressions. Use multiple PROP=VALUE pairs in a --with option to combine the conditions (i.e. AND). Or use multiple --with options for separate conditions (i.e. OR).
--without PROP=VALUE [PROP=VALUE ...] same as --with, but negated, for excluding instead of including sentence pairs.
--verbose enables progress updates.
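
The AND/OR behaviour described for --with can be sketched in Python as follows. This is not the tool's code. It assumes each sentence pair is a flat mapping from property name to a single string value, that the ordering operators compare numerically, and that =~ does a regular-expression search. The property names `score` and `text` and all function names are hypothetical.

```python
import operator
import re

# Two-character operators are listed first so ">=" is not read as ">".
CONDITION = re.compile(r"^(.+?)(>=|<=|=~|=|>|<)(.*)$")
ORDERING = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def parse_condition(expr):
    """Split 'PROP<op>VALUE' into (prop, op, value)."""
    match = CONDITION.match(expr)
    if match is None:
        raise ValueError("cannot parse condition: %r" % expr)
    return match.groups()


def matches(pair, expr):
    """Test a single condition against one sentence pair."""
    prop, op, value = parse_condition(expr)
    actual = pair.get(prop)
    if actual is None:
        return False
    if op == "=~":
        return re.search(value, actual) is not None
    if op == "=":
        return actual == value
    # Assumption: ordering operators compare numeric values.
    return ORDERING[op](float(actual), float(value))


def keep(pair, with_options):
    """with_options holds one list of conditions per --with option.

    Conditions inside one list are ANDed; the lists themselves are ORed.
    """
    return any(all(matches(pair, expr) for expr in group) for group in with_options)


pair = {"ipc-group": "II", "score": "0.9", "text": "Hello world"}
print(keep(pair, [["ipc-group=II", "score>=0.8"]]))
print(keep(pair, [["ipc-group=III"], ["text=~^Hello"]]))
print(keep(pair, [["ipc-group=III", "text=~^Hello"]]))
```

Under these assumptions, --without would be the negation of the same test.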

Release history

1.3	Dec 6, 2022 (this version)
1.2	Dec 6, 2022

Download files

Source Distribution

tmxutil-1.3.tar.gz (20.7 kB)

Uploaded Dec 6, 2022 Source

Built Distribution

tmxutil-1.3-py3-none-any.whl (23.0 kB)

Uploaded Dec 6, 2022 Python 3

File details

Details for the file tmxutil-1.3.tar.gz.

File metadata

Download URL: tmxutil-1.3.tar.gz
Upload date: Dec 6, 2022
Size: 20.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.15

File hashes

Hashes for tmxutil-1.3.tar.gz
Algorithm	Hash digest
SHA256	`7282ed5582f31feaa7103cd1b6ba064b7b871fd03a041b9a9fff10d4ad9fbc12`
MD5	`c48a19f0bf67212f92609149057b9737`
BLAKE2b-256	`2594923fd751cc5226ab769688cc21a614d990cb917121c7e74379a82d24a472`

File details

Details for the file tmxutil-1.3-py3-none-any.whl.

File metadata

Download URL: tmxutil-1.3-py3-none-any.whl
Upload date: Dec 6, 2022
Size: 23.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.2 CPython/3.9.15

File hashes

Hashes for tmxutil-1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`cec4d365fc42f9684f90b2ed4d799898fac2bf0ff0104b3d867de98784c7c720`
MD5	`ecf8ab5dffd36ab97721d530ee92bfc7`
BLAKE2b-256	`2a0d8c8f7df0921bdd73f77100f86d68fdd615f04cb2dd4a1ab8474dc8413389`
