OpenNMT tokenization library

Tokenizer

Tokenizer is a fast, generic, and customizable text tokenization library for C++ and Python with minimal dependencies.

Overview

By default, the Tokenizer applies a simple tokenization based on Unicode types. It can be customized in several ways:

  • Reversible tokenization
    Marking joints or spaces by annotating tokens or injecting modifier characters.
  • Subword tokenization
    Support for training and using BPE and SentencePiece models.
  • Advanced text segmentation
    Split digits, segment on case or alphabet change, segment each character of selected alphabets, etc.
  • Case management
    Lowercase text and return case information as a separate feature or inject case modifier tokens.
  • Protected sequences
    Sequences can be protected against tokenization with the special characters "⦅" and "⦆".

See the available options for an overview of supported features.
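To make the idea of protected sequences concrete, here is a minimal pure-Python sketch. It does not use the Tokenizer library and is not its implementation: the function name `tokenize_with_protection` and the simple word/punctuation split are assumptions made only for illustration. Text between "⦅" and "⦆" is emitted as a single token, while everything else is split normally.

```python
import re

# A protected span starts with "⦅" and ends with "⦆" (the characters named above).
PROTECTED = re.compile(r"⦅[^⦆]*⦆")

# Hypothetical stand-in for a real tokenizer: words and single punctuation marks.
SIMPLE_TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize_with_protection(text):
    """Split text into tokens, leaving protected spans untouched."""
    tokens = []
    pos = 0
    for match in PROTECTED.finditer(text):
        # Tokenize the unprotected text before the span, then keep the span whole.
        tokens.extend(SIMPLE_TOKEN.findall(text[pos:match.start()]))
        tokens.append(match.group())
        pos = match.end()
    # Tokenize whatever follows the last protected span.
    tokens.extend(SIMPLE_TOKEN.findall(text[pos:]))
    return tokens


print(tokenize_with_protection("Email ⦅john.doe@example.com⦆ now!"))
# ['Email', '⦅john.doe@example.com⦆', 'now', '!']
```

Without the markers, the same simple split would break the address at every "." and "@".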

Using

The Tokenizer can be used from Python, C++, or the command line. Each interface exposes the same set of options.

Python API

$ pip install pyonmttok

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'

See the Python API description for more details.
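The joiner marker "■" in the output above is what makes the tokenization reversible: it records that a token was attached to its neighbor without a space. The following pure-Python sketch illustrates that principle only. It is not how pyonmttok is implemented, the helper names `tokenize` and `detokenize` are hypothetical, and it assumes single spaces between words and input that does not already contain the marker.

```python
import re

JOINER = "■"  # same marker character as in the output above


def tokenize(text):
    """Split on whitespace and punctuation, marking joints with JOINER."""
    tokens = []
    for word in text.split():
        # Separate runs of word characters from individual punctuation marks.
        pieces = re.findall(r"\w+|[^\w\s]", word)
        for i, piece in enumerate(pieces):
            # A piece that was glued to the previous one gets a joiner prefix.
            tokens.append(JOINER + piece if i > 0 else piece)
    return tokens


def detokenize(tokens):
    """Join tokens with spaces, then drop each space that precedes a joiner."""
    return " ".join(tokens).replace(" " + JOINER, "")


tokens = tokenize("Hello World!")
print(tokens)              # ['Hello', 'World', '■!']
print(detokenize(tokens))  # Hello World!
```

Without the marker, a detokenizer could not tell whether "!" originally followed a space or was attached to "World".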

C++ API

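#include <string>
#include <vector>

#include <onmt/Tokenizer.h>

using namespace onmt;

int main() {
  // Conservative mode with joiner annotation, as in the Python example.
  Tokenizer tokenizer(Tokenizer::Mode::Conservative, Tokenizer::Flags::JoinerAnnotate);
  std::vector<std::string> tokens;
  // Fills `tokens` with the tokenized text.
  tokenizer.tokenize("Hello World!", tokens);
}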

See the Tokenizer class for more details.

Command line clients

$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate
Hello World ■!
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate | cli/detokenize
Hello World!

Run either client with the -h flag to list the available options.

Development

Dependencies

Compiling

CMake and a compiler that supports the C++11 standard are required to compile the project.

git submodule update --init
mkdir build
cd build
cmake ..
make

This produces the dynamic library libOpenNMTTokenizer and the tokenization clients in cli/.

  • To compile only the library, use the -DLIB_ONLY=ON flag.

Testing

The tests use Google Test, which is included as a Git submodule. Run them with:

mkdir build
cd build
cmake -DBUILD_TESTS=ON ..
make
test/onmt_tokenizer_test ../test/data

Built Distributions

  • pyonmttok-1.22.2-cp39-cp39-manylinux1_x86_64.whl (2.5 MB), CPython 3.9
  • pyonmttok-1.22.2-cp38-cp38-manylinux1_x86_64.whl (2.5 MB), CPython 3.8
  • pyonmttok-1.22.2-cp37-cp37m-manylinux1_x86_64.whl (2.5 MB), CPython 3.7m
  • pyonmttok-1.22.2-cp36-cp36m-manylinux1_x86_64.whl (2.5 MB), CPython 3.6m
  • pyonmttok-1.22.2-cp35-cp35m-manylinux1_x86_64.whl (2.5 MB), CPython 3.5m
  • pyonmttok-1.22.2-cp27-cp27mu-manylinux1_x86_64.whl (2.5 MB), CPython 2.7mu
  • pyonmttok-1.22.2-cp27-cp27m-manylinux1_x86_64.whl (2.5 MB), CPython 2.7m
