Unsupervised Word Segmentation for Neural Machine Translation and Text Generation

Subword Neural Machine Translation

This repository contains preprocessing scripts to segment text into subword units. The primary purpose is to facilitate the reproduction of our experiments on Neural Machine Translation with subword units (see below for reference).

INSTALLATION

Install via pip (from PyPI):

pip install subword-nmt

Install via pip (from GitHub):

pip install https://github.com/rsennrich/subword-nmt/archive/master.zip

Alternatively, clone this repository; the scripts are executable stand-alone.

USAGE INSTRUCTIONS

Check the individual files for usage instructions.

To apply byte pair encoding to word segmentation, invoke these commands:

subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}
subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}
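To make the learning step concrete, here is a minimal toy sketch of BPE learning in plain Python. It is not the package's implementation (which is more efficient and reads a corpus from a file); the function names and the toy vocabulary are assumptions made for illustration. Words are given as space-separated symbols with the end-of-word marker attached to the final character.

```python
import collections
import re


def get_pair_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(pair, vocab):
    """Join every occurrence of `pair` into a single symbol."""
    # Match the two symbols only at symbol boundaries (whitespace or string edge).
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(vocab, num_operations):
    """Return the list of learned merges and the re-segmented vocabulary."""
    merges = []
    for _ in range(num_operations):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        # Most frequent pair; ties are broken by comparing the pairs themselves.
        best = max(pairs, key=lambda p: (pairs[p], p))
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Hypothetical toy vocabulary: segmented word -> frequency.
toy_vocab = {
    'l o w</w>': 5,
    'l o w e r</w>': 2,
    'n e w e s t</w>': 6,
    'w i d e s t</w>': 3,
}
merges, segmented = learn_bpe(toy_vocab, 3)
print(merges)
```

Each learned merge corresponds to one of the {num_operations} operations requested with -s.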

To segment rare words into character n-grams, do the following:

subword-nmt get-vocab --train_file {train_file} --vocab_file {vocab_file}
subword-nmt segment-char-ngrams --vocab {vocab_file} -n {order} --shortlist {size} < {test_file} > {out_file}
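The idea behind character n-gram segmentation can be sketched as follows. This is an illustration under stated assumptions, not the script itself: it assumes that words in the shortlist are kept whole, that every other word is cut into consecutive, non-overlapping chunks of n characters, and that non-final chunks are marked with "@@". The function name and the `shortlist` set are hypothetical.

```python
def segment_rare_words(line, shortlist, n, separator='@@'):
    """Split words outside `shortlist` into character n-grams (assumed behaviour)."""
    out = []
    for word in line.split():
        if word in shortlist:
            # Frequent word: keep it as a single unit.
            out.append(word)
            continue
        chunks = [word[i:i + n] for i in range(0, len(word), n)]
        # Every chunk except the last is marked as word-internal.
        out.extend(chunk + separator for chunk in chunks[:-1])
        out.append(chunks[-1])
    return ' '.join(out)


print(segment_rare_words('the unrelated cat', {'the', 'cat'}, 3))
# the unr@@ ela@@ ted cat
```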

The original segmentation can be restored with a simple replacement:

sed -r 's/(@@ )|(@@ ?$)//g'
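If the output is post-processed in Python rather than in a shell pipeline, the same replacement can be written with the standard re module. This is a sketch that assumes the default "@@" separator; the function name is hypothetical.

```python
import re


def restore_segmentation(line):
    """Remove '@@ ' joiners and a trailing '@@' to undo subword segmentation."""
    return re.sub(r'(@@ )|(@@ ?$)', '', line)


print(restore_segmentation('un@@ related new@@ est'))
# unrelated newest
```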

If you cloned the repository and did not install a package, you can also run the individual commands as scripts:

./subword_nmt/learn_bpe.py -s {num_operations} < {train_file} > {codes_file}

BEST PRACTICE ADVICE FOR BYTE PAIR ENCODING IN NMT

We found that for languages that share an alphabet, learning BPE on the concatenation of the (two or more) involved languages increases the consistency of segmentation, and reduces the problem of inserting/deleting characters when copying/transliterating names.

However, this introduces undesirable edge cases in that a word may be segmented in a way that has only been observed in the other language, and is thus unknown at test time. To prevent this, apply_bpe.py accepts a --vocabulary and a --vocabulary-threshold option so that the script will only produce symbols which also appear in the vocabulary (with at least some frequency).

To use this functionality, we recommend the following recipe (assuming L1 and L2 are the two languages):

Learn byte pair encoding on the concatenation of the training text, and get the resulting vocabulary for each language:

cat {train_file}.L1 {train_file}.L2 | subword-nmt learn-bpe -s {num_operations} -o {codes_file}
subword-nmt apply-bpe -c {codes_file} < {train_file}.L1 | subword-nmt get-vocab > {vocab_file}.L1
subword-nmt apply-bpe -c {codes_file} < {train_file}.L2 | subword-nmt get-vocab > {vocab_file}.L2

More conveniently, you can do the same with this command:

subword-nmt learn-joint-bpe-and-vocab --input {train_file}.L1 {train_file}.L2 -s {num_operations} -o {codes_file} --write-vocabulary {vocab_file}.L1 {vocab_file}.L2
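Conceptually, the per-language vocabulary is a frequency count of the subword symbols in the BPE-segmented training text. The sketch below shows that idea with hypothetical function names; it assumes one "symbol count" entry per line, sorted by decreasing frequency, and it is not the package's own code.

```python
from collections import Counter


def count_symbols(lines):
    """Count whitespace-separated symbols in BPE-segmented text."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def format_vocab(counts):
    """Render the counts as 'symbol count' lines, most frequent first."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ['{0} {1}'.format(symbol, freq) for symbol, freq in ordered]


segmented = ['lo@@ w lo@@ w@@ er', 'new@@ est lo@@ w']
print(format_vocab(count_symbols(segmented)))
```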

Re-apply byte pair encoding with the vocabulary filter:

subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {train_file}.L1 > {train_file}.BPE.L1
subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L2 --vocabulary-threshold 50 < {train_file}.L2 > {train_file}.BPE.L2
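The effect of --vocabulary-threshold can be pictured as a filter over the vocabulary file: only symbols seen at least that many times count as known. The sketch below assumes the "symbol count" line format and uses hypothetical names and counts; the real script additionally splits rejected symbols back into smaller units, which is not shown here.

```python
def read_vocabulary(lines, threshold=None):
    """Return the set of symbols whose frequency is at least `threshold`."""
    vocabulary = set()
    for line in lines:
        symbol, freq = line.rstrip('\r\n ').split(' ')
        if threshold is None or int(freq) >= threshold:
            vocabulary.add(symbol)
    return vocabulary


def rejected_symbols(segmented_line, vocabulary):
    """List the symbols of a segmented line that the filter would not allow."""
    return [s for s in segmented_line.split() if s not in vocabulary]


vocab_lines = ['lo@@ 120', 'w 75', 'est 12']  # hypothetical counts
allowed = read_vocabulary(vocab_lines, threshold=50)
print(rejected_symbols('lo@@ w est', allowed))
# ['est']
```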

As a last step, extract the vocabulary to be used by the neural network. Example with Nematus:

nematus/data/build_dictionary.py {train_file}.BPE.L1 {train_file}.BPE.L2

[You may want to take the union of all vocabularies to support multilingual systems.]

For test/dev data, re-use the same options for consistency:

subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {test_file}.L1 > {test_file}.BPE.L1

ADVANCED FEATURES

On top of the basic BPE implementation, this repository supports:

- BPE dropout (Provilkov, Emelianenko and Voita, 2019): https://arxiv.org/abs/1910.13267
  Use the argument --dropout 0.1 for subword-nmt apply-bpe to randomly drop out possible merges. Doing this on the training corpus can improve the quality of the final system; at test time, use BPE without dropout. To obtain reproducible results, use the argument --seed to set the random seed.

  Note: In the original paper, the authors applied BPE-Dropout to each new batch separately. To get similar behavior, you can copy the training corpus several times, so that the same sentence receives multiple segmentations.
- Support for glossaries: use the argument --glossaries for subword-nmt apply-bpe to provide a list of words and/or regular expressions that should always be passed to the output without subword segmentation.
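The BPE dropout idea described above can be sketched in a few lines: when a word is encoded, each applicable merge is skipped with probability equal to the dropout rate. This is a simplified illustration, not the package's encoder; the function name, the `merge_ranks` table and the `rng` parameter are assumptions, and the sketch merges one occurrence at a time.

```python
import random


def encode_with_dropout(word, merge_ranks, dropout=0.0, rng=random):
    """Segment a non-empty word, randomly skipping merges with probability `dropout`."""
    # Start from characters, with the end-of-word marker on the last one.
    symbols = list(word[:-1]) + [word[-1] + '</w>']
    while len(symbols) > 1:
        candidates = [
            (merge_ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in merge_ranks and (not dropout or rng.random() > dropout)
        ]
        if not candidates:
            break
        # Apply the earliest-learned merge that survived dropout.
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hypothetical merge table: pair -> rank (lower rank = learned earlier).
ranks = {('s', 't</w>'): 0, ('e', 'st</w>'): 1, ('l', 'o'): 2}
print(encode_with_dropout('lowest', ranks))                # no dropout
print(encode_with_dropout('lowest', ranks, dropout=0.1,
                          rng=random.Random(1)))           # seeded, reproducible
```

Passing a seeded random.Random instance plays the role of --seed here: the same seed yields the same segmentation.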

PUBLICATIONS

The segmentation methods are described in:

Rico Sennrich, Barry Haddow and Alexandra Birch (2016): Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

HOW IMPLEMENTATION DIFFERS FROM Sennrich et al. (2016)

This repository implements the subword segmentation as described in Sennrich et al. (2016), but since version 0.2, there is one core difference related to end-of-word tokens.

In Sennrich et al. (2016), the end-of-word token </w> is initially represented as a separate token, which can be merged with other subwords over time:

u n d </w>
f u n d </w>

Since 0.2, end-of-word tokens are initially concatenated with the word-final character:

u n d</w>
f u n d</w>

The new representation ensures that when BPE codes are learned from the above examples and then applied to new text, it is clear that a subword unit und is unambiguously word-final, and un is unambiguously word-internal, preventing the production of up to two different subword units from each BPE merge operation.
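The two initial representations can be compared side by side with a small sketch. The function names are hypothetical and only illustrate how a word is split before any merge is applied.

```python
def initial_symbols_old(word):
    """Sennrich et al. (2016): '</w>' starts as a separate symbol."""
    return list(word) + ['</w>']


def initial_symbols_new(word):
    """Since 0.2: '</w>' is attached to the word-final character."""
    return list(word[:-1]) + [word[-1] + '</w>']


print(' '.join(initial_symbols_old('fund')))  # f u n d </w>
print(' '.join(initial_symbols_new('fund')))  # f u n d</w>
```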

apply_bpe.py is backward-compatible and continues to accept old-style BPE files. New-style BPE files are identified by having the following first line: #version: 0.2
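A reader of BPE files could tell the two formats apart by inspecting the first line, roughly as sketched below. The function name is hypothetical, and treating a file without a header as version 0.1 is an assumption based on the description above.

```python
def parse_codes_version(first_line):
    """Return the BPE file version as a tuple, e.g. (0, 2)."""
    if first_line.startswith('#version:'):
        number = first_line.split()[-1]
        return tuple(int(part) for part in number.split('.'))
    # No header: assume an old-style file.
    return (0, 1)


print(parse_codes_version('#version: 0.2\n'))  # (0, 2)
print(parse_codes_version('e s\n'))            # (0, 1)
```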

ACKNOWLEDGMENTS

This project has received funding from Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland, and from the European Union’s Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).

CHANGELOG

v0.3.8:

multiprocessing support (get_vocab and apply_bpe)
progress bar for learn_bpe
seed parameter for deterministic BPE dropout
ignore some unicode line separators which would crash subword-nmt

v0.3.7:

BPE dropout (Provilkov et al., 2019)
more efficient glossaries (https://github.com/rsennrich/subword-nmt/pull/69)

v0.3.6:

fix to subword-bpe command encoding

v0.3.5:

fix to subword-bpe command under Python 2
wider support of --total-symbols argument

v0.3.4:

segment_tokens method to improve library usability (https://github.com/rsennrich/subword-nmt/pull/52)
support regex glossaries (https://github.com/rsennrich/subword-nmt/pull/56)
allow unicode separators (https://github.com/rsennrich/subword-nmt/pull/57)
new option --total-symbols in learn-bpe (commit 61ad8)
fix documentation (best practices) (https://github.com/rsennrich/subword-nmt/pull/60)

v0.3:

library is now installable via pip
fix occasional problems with UTF-8 whitespace and new lines in learn_bpe and apply_bpe.
- do not silently convert UTF-8 newline characters into "\n"
- do not silently convert UTF-8 whitespace characters into " "
- UTF-8 whitespace and newline characters are now considered part of a word, and segmented by BPE

v0.2:

different, more consistent handling of end-of-word token (commit a749a7) (https://github.com/rsennrich/subword-nmt/issues/19)
allow passing of vocabulary and frequency threshold to apply_bpe.py, preventing the production of OOV (or rare) subword units (commit a00db)
made learn_bpe.py deterministic (commit 4c54e)
various changes to make handling of UTF more consistent between Python versions
new command line arguments for apply_bpe.py:
- '--glossaries' to prevent given strings from being affected by BPE
- '--merges' to apply a subset of learned BPE operations
new command line arguments for learn_bpe.py:
- '--dict-input': rather than raw text file, interpret input as a frequency dictionary (as created by get_vocab.py).

v0.1:

consistent cross-version unicode handling
all scripts are now deterministic

Release history

0.3.8 (this version): Dec 8, 2021
0.3.7: Nov 25, 2019
0.3.6: Dec 11, 2018
0.3.5: Sep 17, 2018
0.3.4: Aug 17, 2018
0.3.3: May 21, 2018
0.3.2: May 17, 2018
0.3.1: May 17, 2018

Download files

Source Distribution

subword_nmt-0.3.8.tar.gz (22.1 kB), uploaded Dec 8, 2021, Source

Built Distribution

subword_nmt-0.3.8-py3-none-any.whl (27.3 kB), uploaded Dec 8, 2021, Python 3

File details

Details for the file subword_nmt-0.3.8.tar.gz.

File metadata

Download URL: subword_nmt-0.3.8.tar.gz
Upload date: Dec 8, 2021
Size: 22.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.7.1 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.0

File hashes

Hashes for subword_nmt-0.3.8.tar.gz
Algorithm	Hash digest
SHA256	`3964c66b37712ca1d9fb9a1a6ff7e57c9ab72d838813da3e9a1d4d4997f4fb75`
MD5	`3acac581aa484334d66b68437b5c5410`
BLAKE2b-256	`c71abc10ed2b43788716c9b25ff066c92d6838444a7883f462abbc2e25b34c03`

File details

Details for the file subword_nmt-0.3.8-py3-none-any.whl.

File metadata

Download URL: subword_nmt-0.3.8-py3-none-any.whl
Upload date: Dec 8, 2021
Size: 27.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.7.1 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.10.0

File hashes

Hashes for subword_nmt-0.3.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d22526b557752f35ac15e8ea384ea7773e50a51d966b8752d023d16cb87eac36`
MD5	`dabf557fd35873e795795067f0d4b348`
BLAKE2b-256	`1b9a488ecac22d78eb429928b9ee4f6b6c692e116ca4bd43ef42a475698def32`
