
Unsupervised Word Segmentation for Neural Machine Translation and Text Generation

Project description

Subword Neural Machine Translation

This repository contains preprocessing scripts to segment text into subword units. The primary purpose is to facilitate the reproduction of our experiments on Neural Machine Translation with subword units (see below for reference).

INSTALLATION

install via pip (from PyPI):

pip install subword-nmt

install via pip (from GitHub):

pip install https://github.com/rsennrich/subword-nmt/archive/master.zip

alternatively, clone this repository; the scripts are executable stand-alone.

USAGE INSTRUCTIONS

Check the individual files for usage instructions.

To apply byte pair encoding to word segmentation, invoke these commands:

subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}
subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}
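
A toy end-to-end example (corpus.en, codes.en, and the output shown are illustrative; the learned merges depend entirely on the training data):

subword-nmt learn-bpe -s 1000 < corpus.en > codes.en
echo "subword segmentation helps with rare words" | subword-nmt apply-bpe -c codes.en
# possible output: sub@@ word segment@@ ation helps with ra@@ re words

Frequent words survive as single symbols; rarer words are segmented into subword units, with @@ marking non-final parts.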

To segment rare words into character n-grams, do the following:

subword-nmt get-vocab --train_file {train_file} --vocab_file {vocab_file}
subword-nmt segment-char-ngrams --vocab {vocab_file} -n {order} --shortlist {size} < {test_file} > {out_file}
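
For illustration (hedged; the exact behavior depends on the vocabulary file): words among the {size} most frequent entries pass through unchanged, while all other words are split into consecutive character n-grams, marked with the same @@ separator. With -n 3, a rare word such as "unfathomable" would be segmented as:

unf@@ ath@@ oma@@ ble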

The original segmentation can be restored with a simple replacement:

sed -r 's/(@@ )|(@@ ?$)//g'
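
For example, the following round trip removes the @@ continuation markers and restores the original tokens:

echo "sub@@ word segment@@ ation helps" | sed -r 's/(@@ )|(@@ ?$)//g'
# prints: subword segmentation helps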

If you cloned the repository and did not install a package, you can also run the individual commands as scripts:

./subword_nmt/learn_bpe.py -s {num_operations} < {train_file} > {codes_file}
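
and likewise for applying the learned codes:

./subword_nmt/apply_bpe.py -c {codes_file} < {test_file} > {out_file}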

BEST PRACTICE ADVICE FOR BYTE PAIR ENCODING IN NMT

We found that for languages that share an alphabet, learning BPE on the concatenation of the (two or more) involved languages increases the consistency of segmentation, and reduces the problem of inserting/deleting characters when copying/transliterating names.

However, this introduces undesirable edge cases in that a word may be segmented in a way that has only been observed in the other language, and is thus unknown at test time. To prevent this, apply_bpe.py accepts --vocabulary and --vocabulary-threshold options so that the script will only produce symbols that also appear in the vocabulary (with at least some frequency).

To use this functionality, we recommend the following recipe (assuming L1 and L2 are the two languages):

Learn byte pair encoding on the concatenation of the training texts, and get the resulting vocabulary for each language:

cat {train_file}.L1 {train_file}.L2 | subword-nmt learn-bpe -s {num_operations} -o {codes_file}
subword-nmt apply-bpe -c {codes_file} < {train_file}.L1 | subword-nmt get-vocab > {vocab_file}.L1
subword-nmt apply-bpe -c {codes_file} < {train_file}.L2 | subword-nmt get-vocab > {vocab_file}.L2

More conveniently, you can do the same with this command:

subword-nmt learn-joint-bpe-and-vocab --input {train_file}.L1 {train_file}.L2 -s {num_operations} -o {codes_file} --write-vocabulary {vocab_file}.L1 {vocab_file}.L2

Re-apply byte pair encoding with the vocabulary filter:

subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {train_file}.L1 > {train_file}.BPE.L1
subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L2 --vocabulary-threshold 50 < {train_file}.L2 > {train_file}.BPE.L2
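
To see what the filter does, here is a hedged sketch (vocab.tmp and its contents are hypothetical, and the output assumes the relevant merges exist in the codes): a symbol that is missing from the vocabulary, or below the threshold, is split further into smaller units that do meet it.

printf 'low 100\ner 80\n' > vocab.tmp
echo "lower" | subword-nmt apply-bpe -c {codes_file} --vocabulary vocab.tmp --vocabulary-threshold 50
# possible output: low@@ er  (the merged symbol "lower" is suppressed because
# it is not in vocab.tmp; "low" and "er" pass the threshold)

This guarantees that the test-time symbol inventory is a subset of what the network saw in training.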

As a last step, extract the vocabulary to be used by the neural network. Example with Nematus:

nematus/data/build_dictionary.py {train_file}.BPE.L1 {train_file}.BPE.L2

[you may want to take the union of all vocabularies to support multilingual systems]

For test/dev data, re-use the same options for consistency:

subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {test_file}.L1 > {test_file}.BPE.L1
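
and likewise for the second language:

subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L2 --vocabulary-threshold 50 < {test_file}.L2 > {test_file}.BPE.L2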

PUBLICATIONS

The segmentation methods are described in:

Rico Sennrich, Barry Haddow and Alexandra Birch (2016): Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

ACKNOWLEDGMENTS

This project has received funding from Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland, and from the European Union’s Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).

CHANGELOG

v0.3:

  • library is now installable via pip
  • fix occasional problems with UTF-8 whitespace and new lines in learn_bpe and apply_bpe.
    • do not silently convert UTF-8 newline characters into "\n"
    • do not silently convert UTF-8 whitespace characters into " "
    • UTF-8 whitespace and newline characters are now considered part of a word, and segmented by BPE

v0.2:

  • different, more consistent handling of end-of-word token (commit a749a7; https://github.com/rsennrich/subword-nmt/issues/19)
  • allow passing of vocabulary and frequency threshold to apply_bpe.py, preventing the production of OOV (or rare) subword units (commit a00db)
  • made learn_bpe.py deterministic (commit 4c54e)
  • various changes to make UTF-8 handling more consistent between Python versions
  • new command line arguments for apply_bpe.py:
    • '--glossaries' to prevent given strings from being affected by BPE
    • '--merges' to apply a subset of learned BPE operations
  • new command line arguments for learn_bpe.py:
    • '--dict-input': rather than raw text file, interpret input as a frequency dictionary (as created by get_vocab.py).

v0.1:

  • consistent cross-version unicode handling
  • all scripts are now deterministic
