C++ implementation of Neural Machine Translation of Rare Words with Subword Units, with Python API.


Compile with:

g++ -std=c++11 -pthread -O3 fastBPE/ -IfastBPE -o fast


List commands

usage: fastbpe <command> <args>

The commands supported by fastBPE are:

getvocab input1 [input2]             extract the vocabulary from one or two text files
learnbpe nCodes input1 [input2]      learn BPE codes from one or two text files
applybpe output input codes [vocab]  apply BPE codes to a text file
applybpe_stream codes [vocab]        apply BPE codes to stdin and outputs to stdout

fastBPE also supports stdin inputs. For instance, these two commands are equivalent:

./fast getvocab text > vocab
cat text | ./fast getvocab - > vocab

The first one, however, memory-maps the input file to read it efficiently, which can be more than twice as fast as reading from stdin on very large files. Similarly, these two commands are equivalent:

./fast applybpe output input codes vocab
cat input | ./fast applybpe_stream codes vocab > output

The first one is significantly faster on large datasets, however, because it uses multi-threading to pre-compute the BPE splits of all words in the input file.
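The idea of pre-computing splits can be sketched in Python as follows. This is an illustration only, not fastBPE's C++ code: `apply_with_precomputed_splits` and `toy_split` are hypothetical names, and `toy_split` is a stand-in for a real BPE splitter. Each distinct word is split once by a pool of worker threads, after which rewriting the text is a dictionary lookup per token. (Python threads are used here only to show the structure; the GIL limits the speed-up for pure-Python work.)

```python
from concurrent.futures import ThreadPoolExecutor


def apply_with_precomputed_splits(lines, split_word, workers=4):
    """Split each distinct word once, then rewrite every line by lookup."""
    # Corpora repeat words heavily, so collect each distinct word once.
    unique_words = sorted({word for line in lines for word in line.split()})
    # Split the distinct words in parallel across worker threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        splits = dict(zip(unique_words, pool.map(split_word, unique_words)))
    # Rewriting the text is now one dictionary lookup per token.
    return [" ".join(splits[word] for word in line.split()) for line in lines]


def toy_split(word):
    # Hypothetical stand-in for a BPE splitter: cut long words after 3 characters.
    return word if len(word) <= 3 else word[:3] + "@@ " + word[3:]


lines = ["the lower tower", "the lowest tower"]
result = apply_with_precomputed_splits(lines, toy_split)
print(result)
```

The streaming variant cannot do this up front, since it does not see the whole input before it has to start writing output.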

Learn codes

./fast learnbpe 40000 train.en > codes
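For intuition, here is a minimal Python sketch of how BPE codes are learned, following the general procedure from the paper named above: repeatedly count adjacent symbol pairs and merge the most frequent one. It is not fastBPE's implementation. The function name `learn_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for this illustration.

```python
from collections import Counter


def learn_bpe(word_counts, n_codes):
    """Return up to n_codes merge operations learned from word frequencies."""
    # Each word becomes a tuple of symbols; the last symbol carries an end marker.
    vocab = {
        tuple(word[:-1]) + (word[-1] + "</w>",): count
        for word, count in word_counts.items()
    }
    codes = []
    for _ in range(n_codes):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair (ties broken by the pair itself, an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        codes.append(best)
        # Replace each occurrence of the chosen pair with the merged symbol.
        merged = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + count
        vocab = merged
    return codes


codes = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(codes)
```

The first argument to `learnbpe` (40000 in the command above) plays the role of `n_codes` here: it is the number of merge operations to learn.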

Apply codes to train

./fast applybpe train.en.40000 train.en codes

Get train vocabulary

./fast getvocab train.en.40000 > vocab.en.40000

Apply codes to valid and test

./fast applybpe valid.en.40000 valid.en codes vocab.en.40000
./fast applybpe test.en.40000  test.en  codes vocab.en.40000

Python API

To install the Python API, simply run:

python install

Note: For Mac OSX Users, add export MACOSX_DEPLOYMENT_TARGET=10.x (x=9 or 10, depending on your version) or -stdlib=libc++ to the extra_compile_args of before/during the above install command, as appropriate.

Call the API using:

import fastBPE

bpe = fastBPE.fastBPE(codes_path, vocab_path)
bpe.apply(["Roasted barramundi fish", "Centrally managed over a client-server architecture"])

>> ['Ro@@ asted barr@@ am@@ un@@ di fish', 'Centr@@ ally managed over a cli@@ ent-@@ server architecture']
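In the output above, a trailing `@@` marks a token that continues into the next one. To recover the original text, the markers can be removed with a small helper. `remove_bpe` below is a hypothetical function written for this example; it is not part of the fastBPE API.

```python
import re


def remove_bpe(sentence, marker="@@"):
    """Join BPE pieces by deleting the continuation marker and the space after it."""
    # Match the marker followed by a space, or the marker at the very end.
    pattern = re.escape(marker) + r"( |$)"
    return re.sub(pattern, "", sentence)


encoded = [
    "Ro@@ asted barr@@ am@@ un@@ di fish",
    "Centr@@ ally managed over a cli@@ ent-@@ server architecture",
]
decoded = [remove_bpe(sentence) for sentence in encoded]
print(decoded)
```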

Files for fastBPE, version 0.1.0: fastBPE-0.1.0.tar.gz (35.1 kB, source distribution)