Skip to main content

Converts TMX files to CSV-files and/or stores to HANA table

Project description

TMX Converter

tmxconverter reads tmx-files from an input folder and saves the outcome either

  • as csv-files to an output folder or
  • stores them into a database table

The language code is mapped to the 2-character code based on the given file 'language_code_mapping.csv' (specified in 'config.yaml')

The application is using a yaml-configuration file config.yaml to control the behaviour read from the working directory.

Command line options

``tmxconverter -log [loglevel]``` with 'warning','info' and 'debug'

Mapping

  • <tmx><header srclang="en-US"> : source_lang
  • <body><tu creationdate : created
  • <body><tu creationid : creation_id
  • <body><tu changeid : change_id
  • <body><tu changedate : changed
  • <body><tu lastusagedate : lastusage
  • From filename substring until '_' : domain
  • Filename : origin
  • <body><tu><tuv xml:lang : target_lang if different from source_lang using the language mapping
  • <body><tu><tuv><seg>: source_text or target_text depending lang-attribute`

Regular Expression

As a first basic filter a list of regular expressions separated by a 'line separator' can be passed that are stored in a text-file.

Examples:

  • \s*$
  • \s*\d+\s*$
  • \s*\d*\.\d+\s*$

Files Output

If the parameter FILES_OUTPUT is true all tmx-files are written to the OUTPUT_FOLDER taking the same filename but replacing the suffix. The output is using a comma-separator and double quotes strings (pandas.to_csv used)

Database Output

If the parameter HDB_OUTPUT is True then the data is stored to the HANA Database for which the details are given in the config.yaml-file.

The current table structure:

CREATE COLUMN TABLE "TMX"."DATA"(
	"SOURCE_LANG" NVARCHAR(2),
	"SOURCE_TEXT" NVARCHAR(5000),
	"TARGET_LANG" NVARCHAR(2),
	"TARGET_TEXT" NVARCHAR(5000),
	"DOMAIN" NVARCHAR(15),
	"ORIGIN" NVARCHAR(30),
	"CREATION_ID" NVARCHAR(30),
	"CREATED" LONGDATE,
	"CHANGE_ID" NVARCHAR(30),
	"CHANGED" LONGDATE,
	"LAST_USAGE" LONGDATE,
	"USAGE_COUNT" INTEGER
)

Example Config.YAML

# input folder
input_folder : /Users/Shared/data/tmx/input

#language coding map
lang_map_file : language_code_mapping.csv

# output files
OUTPUT_FILES : true # save to output folder
OUTPUT_FOLDER : /Users/Shared/data/tmx/output

# HANA DB
OUTPUT_HDB : false  # Save to db
HDB_HOST : 'xxx.com'
HDB_USER : 'TMXUSER'
HDB_PWD : 'PassWord'
HDB_PORT : 111

# Test Parameter
TEST : true
MAX_NUMBER_FILES : 100  # max number of files processed. NOT used when EXCLUSIVE_FILE given
EXCLUSIVE_FILE : reviews.tmx  # If not used leave empty
#EXCLUSIVE_FILE :

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

tmxconverter-0.0.6-py2-none-any.whl (7.0 kB view hashes)

Uploaded Python 2

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page