Python tool to extract sentences from po files and create language datasets for NLP machine learning
PO2Dataset
po2dataset is a Python tool that extracts sentences from PO files and creates language datasets for machine translation.
This command-line tool is intended to create dataset packages suitable for Argos Train.
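To illustrate the kind of extraction involved, here is a minimal sketch that pulls translated (source, target) sentence pairs out of PO-formatted text using only the standard library. It is not po2dataset's actual implementation: the function name `extract_pairs` and the sample data are hypothetical, and the sketch deliberately ignores plural forms, `msgctxt`, and fuzzy flags.

```python
import ast

# Hypothetical sample data, used only for illustration.
SAMPLE_PO = r'''
msgid ""
msgstr ""
"Language: eu\n"

#: views.py:10
msgid "Hello"
msgstr "Kaixo"

msgid "Good "
"morning"
msgstr "Egun "
"on"

msgid "Untranslated"
msgstr ""
'''


def extract_pairs(po_text):
    """Return (msgid, msgstr) pairs for translated entries in PO text.

    Simplified sketch: quoted strings are decoded with ast.literal_eval,
    which covers the common escapes but is not a full PO parser.
    """
    pairs = []
    entry = None   # entry being built: {"msgid": ..., "msgstr": ...}
    field = None   # keyword that continuation lines belong to

    def keep(e):
        # Skip the header (empty msgid) and untranslated entries.
        if e and e["msgid"] and e["msgstr"]:
            pairs.append((e["msgid"], e["msgstr"]))

    for raw in po_text.splitlines():
        line = raw.strip()
        if line.startswith("msgid "):
            keep(entry)
            entry = {"msgid": ast.literal_eval(line[6:]), "msgstr": ""}
            field = "msgid"
        elif line.startswith("msgstr ") and entry is not None:
            entry["msgstr"] = ast.literal_eval(line[7:])
            field = "msgstr"
        elif line.startswith('"') and entry is not None and field:
            # Continuation line: append to the current field.
            entry[field] += ast.literal_eval(line)
        else:
            # Blank line, comment, or unsupported keyword.
            field = None
    keep(entry)
    return pairs


if __name__ == "__main__":
    for source, target in extract_pairs(SAMPLE_PO):
        print(f"{source}\t{target}")
```

Each resulting pair corresponds to one aligned source/target sentence, which is the raw material a machine translation dataset is built from.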
How to install
From pip
pip install po2dataset
Manual installation
Create a virtual environment using virtualenv
git clone https://github.com/urtzai/po2dataset.git
virtualenv po2dataset
cd po2dataset
source ./bin/activate
Quick start guide
Create a dataset suitable for Argos Train
po2dataset <path_to_po_file> --name <project_name> --source_code <source_lang_code> --target_code <target_lang_code> --ref "Some reference information of the project"
Where:
name
: The name of the project
source_code
: Source language code (ISO 639)
target_code
: Target language code (ISO 639)
ref
: Some reference information about the project
Optional arguments:
format
: Extension name of the zip file (default argosdata)
license
: License to add into the package (default CC0)
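When several PO files need to be processed, it can help to assemble the command line programmatically. The sketch below builds the argument list from the options described above. The helper name `build_command` and the example values are hypothetical, and the `--format` and `--license` spellings are an assumption based on how the required options are written; check `po2dataset --help` before relying on them.

```python
import shlex


def build_command(po_path, name, source_code, target_code, ref,
                  fmt=None, license_name=None):
    """Build an argument list for the po2dataset command line."""
    cmd = [
        "po2dataset", str(po_path),
        "--name", name,
        "--source_code", source_code,
        "--target_code", target_code,
        "--ref", ref,
    ]
    # Assumption: optional arguments use the same "--<name>" style.
    if fmt is not None:
        cmd += ["--format", fmt]
    if license_name is not None:
        cmd += ["--license", license_name]
    return cmd


if __name__ == "__main__":
    # Hypothetical example values.
    cmd = build_command("messages.po", "myproject", "en", "eu",
                        "Example reference text")
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once po2dataset is installed; using a list rather than a single string avoids quoting problems with the free-text `--ref` value.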
Support
Should you experience any issues, do not hesitate to open an issue or contribute to this project through pull requests.