Converts TMX files to CSV files and/or stores them in a HANA table

TMX Converter

tmxconverter reads TMX files from an input folder and either

  • saves the result as CSV files to an output folder, or
  • stores it in a database table

The language code is mapped to the 2-character code based on the given file 'language_code_mapping.csv' (specified in 'config.yaml')
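
The mapping step could be sketched roughly as follows. The column layout of 'language_code_mapping.csv' is not documented here, so this assumes two columns (the TMX language code, then the 2-character code) and no header row; `load_lang_map` and `map_lang` are hypothetical names, not part of tmxconverter.

```python
import csv
import io


def load_lang_map(handle):
    """Read (tmx_code, two_char_code) rows into a dict.

    Assumed layout: two columns, no header row.
    """
    return {
        row[0].strip(): row[1].strip()
        for row in csv.reader(handle)
        if len(row) >= 2
    }


def map_lang(code, lang_map):
    """Return the mapped 2-character code, or None if the code is unknown."""
    return lang_map.get(code)


# Hypothetical content standing in for language_code_mapping.csv.
sample = io.StringIO("en-US,en\nde-DE,de\n")
lang_map = load_lang_map(sample)
```

With a real file, the `io.StringIO` object would be replaced by `open(path, newline="")`.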

The application reads the YAML configuration file config.yaml from the working directory to control its behaviour.

Command line options

`tmxconverter -log [loglevel]` where loglevel is one of 'warning', 'info' or 'debug'
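
For illustration, an option of this shape can be parsed with `argparse` and translated to a `logging` level as below. This is only a sketch of how such a flag might be handled, not the tool's actual code; the function name `parse_log_level` and the default of 'warning' are assumptions.

```python
import argparse
import logging

# The three level names listed above, mapped to the logging module's constants.
LOG_LEVELS = {
    "warning": logging.WARNING,
    "info": logging.INFO,
    "debug": logging.DEBUG,
}


def parse_log_level(argv):
    """Parse '-log <level>' and return the matching logging level."""
    parser = argparse.ArgumentParser(prog="tmxconverter")
    # Default of 'warning' is an assumption; the README does not state one.
    parser.add_argument("-log", choices=sorted(LOG_LEVELS), default="warning")
    args = parser.parse_args(argv)
    return LOG_LEVELS[args.log]
```

Calling `logging.basicConfig(level=parse_log_level(sys.argv[1:]))` would then apply the chosen level.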

Mapping

  • <tmx><header srclang="en-US"> : source_lang
  • <body><tu creationdate : created
  • <body><tu creationid : creation_id
  • <body><tu changeid : change_id
  • <body><tu changedate : changed
  • <body><tu lastusagedate : lastusage
  • From filename substring until '_' : domain
  • Filename : origin
  • <body><tu><tuv xml:lang : target_lang if different from source_lang using the language mapping
  • <body><tu><tuv><seg> : source_text or target_text, depending on the xml:lang attribute
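
The mapping above can be sketched with the standard library's `xml.etree.ElementTree`. This is a minimal illustration under stated assumptions, not the package's implementation: `tmx_to_records` and the sample document are hypothetical, one record is emitted per target-language `tuv`, and the domain is taken as the filename part before the first '_'.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample document for illustration.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en-US"/>
  <body>
    <tu creationdate="20200101T120000Z" creationid="alice"
        changedate="20200202T120000Z" changeid="bob">
      <tuv xml:lang="en-US"><seg>Hello</seg></tuv>
      <tuv xml:lang="de-DE"><seg>Hallo</seg></tuv>
    </tu>
  </body>
</tmx>"""


def tmx_to_records(xml_text, filename, lang_map):
    """Turn each translation unit into one record per target language."""
    root = ET.fromstring(xml_text)
    srclang = root.find("header").get("srclang")
    records = []
    for tu in root.iter("tu"):
        base = {
            "source_lang": lang_map.get(srclang),
            "created": tu.get("creationdate"),
            "creation_id": tu.get("creationid"),
            "change_id": tu.get("changeid"),
            "changed": tu.get("changedate"),
            "lastusage": tu.get("lastusagedate"),
            "domain": filename.split("_")[0],  # filename part before first '_'
            "origin": filename,
        }
        source_text, targets = None, []
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            text = "".join(seg.itertext()) if seg is not None else ""
            if tuv.get(XML_LANG) == srclang:
                source_text = text
            else:
                targets.append((lang_map.get(tuv.get(XML_LANG)), text))
        for target_lang, target_text in targets:
            records.append({**base, "source_text": source_text,
                            "target_lang": target_lang,
                            "target_text": target_text})
    return records
```

For example, `tmx_to_records(SAMPLE_TMX, "legal_contracts.tmx", {"en-US": "en", "de-DE": "de"})` yields a single record with domain 'legal' and origin 'legal_contracts.tmx'.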

Regular Expression

As a first basic filter, a list of regular expressions can be supplied in a text file, one expression per line.

Examples:

  • \s*$
  • \s*\d+\s*$
  • \s*\d*\.\d+\s*$
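
How the expressions are applied is not spelled out here. The sketch below assumes that each pattern is tried with `re.match` (anchored at the start of the segment) and that a segment matching any pattern is dropped; both are assumptions, and `load_patterns` and `is_filtered` are hypothetical helper names.

```python
import re

# The three example expressions, one per line as they would appear in the file.
PATTERN_FILE_TEXT = "\n".join([r"\s*$", r"\s*\d+\s*$", r"\s*\d*\.\d+\s*$"])


def load_patterns(text):
    """Compile one regular expression per non-empty line."""
    return [re.compile(line) for line in text.splitlines() if line.strip()]


def is_filtered(segment, patterns):
    """True if any pattern matches from the start of the segment.

    Assumed semantics: a matching segment is excluded from the output.
    """
    return any(p.match(segment) for p in patterns)


patterns = load_patterns(PATTERN_FILE_TEXT)
```

Under these assumptions the examples would exclude segments that are empty or whitespace-only, plain integers, and decimal numbers, while ordinary text such as "Hello 42" passes through.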

Files Output

If the parameter OUTPUT_FILES (as named in the example config.yaml) is true, each tmx-file is written to the OUTPUT_FOLDER under the same filename with the suffix replaced. The output uses a comma separator and double-quoted strings (written with pandas.to_csv).
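
To make the naming and quoting concrete, here is a small sketch that uses the standard library's `csv` module as a stand-in for the pandas call the tool itself uses. It assumes the new suffix is '.csv' and that all non-numeric fields are quoted; `output_path` and `write_csv` are hypothetical names.

```python
import csv
import io
from pathlib import Path


def output_path(tmx_path, output_folder):
    """Same filename as the input, with the suffix replaced by '.csv' (assumed)."""
    return Path(output_folder) / Path(tmx_path).with_suffix(".csv").name


def write_csv(records, handle):
    """Write records comma-separated, double-quoting every non-numeric field."""
    if not records:
        return
    writer = csv.DictWriter(handle, fieldnames=list(records[0]),
                            quoting=csv.QUOTE_NONNUMERIC)
    writer.writeheader()
    writer.writerows(records)


# Write a hypothetical record to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_csv([{"source_lang": "en", "source_text": "Hello", "usage_count": 3}], buffer)
```

With pandas, a comparable result would come from `DataFrame.to_csv(path, index=False, quoting=csv.QUOTE_NONNUMERIC)`; whether the tool passes these exact arguments is not stated here.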

Database Output

If the parameter OUTPUT_HDB (as named in the example config.yaml) is true, the data is stored in the HANA database whose connection details are given in config.yaml.

The current table structure:

CREATE COLUMN TABLE "TMX"."DATA"(
	"SOURCE_LANG" NVARCHAR(2),
	"SOURCE_TEXT" NVARCHAR(5000),
	"TARGET_LANG" NVARCHAR(2),
	"TARGET_TEXT" NVARCHAR(5000),
	"DOMAIN" NVARCHAR(15),
	"ORIGIN" NVARCHAR(30),
	"CREATION_ID" NVARCHAR(30),
	"CREATED" LONGDATE,
	"CHANGE_ID" NVARCHAR(30),
	"CHANGED" LONGDATE,
	"LAST_USAGE" LONGDATE,
	"USAGE_COUNT" INTEGER
)

Example config.yaml

# input folder
input_folder : /Users/Shared/data/tmx/input

#language coding map
lang_map_file : language_code_mapping.csv

# output files
OUTPUT_FILES : true # save to output folder
OUTPUT_FOLDER : /Users/Shared/data/tmx/output

# HANA DB
OUTPUT_HDB : false  # Save to db
HDB_HOST : 'xxx.com'
HDB_USER : 'TMXUSER'
HDB_PWD : 'PassWord'
HDB_PORT : 111

# Test Parameter
TEST : true
MAX_NUMBER_FILES : 100  # max number of files processed. NOT used when EXCLUSIVE_FILE given
EXCLUSIVE_FILE : reviews.tmx  # If not used leave empty
#EXCLUSIVE_FILE :
