OpenNMT tokenization library

Tokenizer

Tokenizer is a fast, generic, and customizable text tokenization library for C++ and Python with minimal dependencies.

Overview

By default, the Tokenizer applies a simple tokenization based on Unicode types. It can be customized in several ways:

  • Reversible tokenization
    Marking joints or spaces by annotating tokens or injecting modifier characters.
  • Subword tokenization
    Support for training and using BPE and SentencePiece models.
  • Advanced text segmentation
    Split digits, segment on case or alphabet change, segment each character of selected alphabets, etc.
  • Case management
    Lowercase text and return case information as a separate feature or inject case modifier tokens.
  • Protected sequences
    Sequences can be protected against tokenization with the special characters "⦅" and "⦆".

See the available options for an overview of supported features.
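
To make the idea of protected sequences concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the function name toy_tokenize_protected and the two regular expressions are hypothetical, and the sketch assumes that protected spans do not nest and that the delimiters stay attached to the protected token.

```python
import re

# A protected span, delimiters included (non-greedy, nesting not handled).
PROTECTED = re.compile(r"⦅.*?⦆")
# Toy segmentation rule: a run of word characters, or one non-space symbol.
WORD_OR_SYMBOL = re.compile(r"\w+|[^\w\s]")


def toy_tokenize_protected(text):
    """Tokenize text while keeping every ⦅...⦆ span as a single token."""
    tokens = []
    pos = 0
    for match in PROTECTED.finditer(text):
        # Tokenize the unprotected text before the span as usual.
        tokens.extend(WORD_OR_SYMBOL.findall(text[pos:match.start()]))
        # Emit the protected span untouched.
        tokens.append(match.group())
        pos = match.end()
    # Tokenize whatever follows the last protected span.
    tokens.extend(WORD_OR_SYMBOL.findall(text[pos:]))
    return tokens


if __name__ == "__main__":
    print(toy_tokenize_protected("Visit ⦅http://example.com⦆ now!"))
    # ['Visit', '⦅http://example.com⦆', 'now', '!']
```

Without the delimiters, the same toy rule would break the URL into seven tokens, which is the situation protected sequences are meant to avoid.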

Using

The Tokenizer can be used from Python, from C++, or from the command line. All three interfaces expose the same set of options.

Python API

$ pip install pyonmttok

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'

See the Python API description for more details.
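
The round trip above works because the joiner marker records where a token was attached to its left neighbour without a space. The following pure-Python sketch illustrates that principle only; it is not how pyonmttok is implemented. The names toy_tokenize and toy_detokenize are hypothetical, the segmentation rule is a simplification, and the sketch assumes the input does not itself contain the marker and that whitespace may be normalized to single spaces.

```python
import unicodedata

JOINER = "■"  # marker as displayed in the example output above


def char_class(ch):
    """Return a coarse class for ch: 'space', 'alnum', or 'other'."""
    if ch.isspace():
        return "space"
    if unicodedata.category(ch)[0] in ("L", "N", "M"):
        return "alnum"
    return "other"


def toy_tokenize(text):
    """Split text into words and single symbols, marking glued tokens."""
    tokens = []
    prev = "space"  # the start of the text behaves like a preceding space
    for ch in text:
        cls = char_class(ch)
        if cls == "space":
            pass  # spaces are dropped; their absence is what gets marked
        elif cls == "alnum" and prev == "alnum":
            tokens[-1] += ch  # extend the current word
        elif prev == "space":
            tokens.append(ch)  # new token that followed a real space
        else:
            tokens.append(JOINER + ch)  # new token glued to the previous one
        prev = cls
    return tokens


def toy_detokenize(tokens):
    """Rebuild the text: a space before every token that has no marker."""
    parts = []
    for i, token in enumerate(tokens):
        if token.startswith(JOINER):
            parts.append(token[len(JOINER):])
        else:
            if i > 0:
                parts.append(" ")
            parts.append(token)
    return "".join(parts)


if __name__ == "__main__":
    tokens = toy_tokenize("Hello World!")
    print(tokens)                  # ['Hello', 'World', '■!']
    print(toy_detokenize(tokens))  # Hello World!
```

A plain whitespace join of the tokens would produce "Hello World !", so the marker is the only thing that lets the detokenizer tell the two cases apart.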

C++ API

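#include <string>
#include <vector>

#include <onmt/Tokenizer.h>

using namespace onmt;

int main() {
  // Conservative mode, with joiner annotation so the tokenization is reversible.
  Tokenizer tokenizer(Tokenizer::Mode::Conservative, Tokenizer::Flags::JoinerAnnotate);
  std::vector<std::string> tokens;
  // The resulting tokens are written to the output vector.
  tokenizer.tokenize("Hello World!", tokens);
}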

See the Tokenizer class for more details.

Command line clients

$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate
Hello World ■!
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate | cli/detokenize
Hello World!

Run either client with the -h flag to list the available options.

Development

Dependencies

Compiling

CMake and a compiler that supports the C++11 standard are required to compile the project.

git submodule update --init
mkdir build
cd build
cmake ..
make

This produces the dynamic library libOpenNMTTokenizer and the tokenization clients in cli/.

  • To compile only the library, pass the -DLIB_ONLY=ON flag to cmake.

Testing

The tests use Google Test, which is included as a Git submodule. Run the tests with:

mkdir build
cd build
cmake -DBUILD_TESTS=ON ..
make
test/onmt_tokenizer_test ../test/data

Download files

No source distribution is available for release 1.20.0. The built distributions are manylinux1 x86_64 wheels of 2.5 MB each:

  • pyonmttok-1.20.0-cp39-cp39-manylinux1_x86_64.whl (CPython 3.9)
  • pyonmttok-1.20.0-cp38-cp38-manylinux1_x86_64.whl (CPython 3.8)
  • pyonmttok-1.20.0-cp37-cp37m-manylinux1_x86_64.whl (CPython 3.7m)
  • pyonmttok-1.20.0-cp36-cp36m-manylinux1_x86_64.whl (CPython 3.6m)
  • pyonmttok-1.20.0-cp35-cp35m-manylinux1_x86_64.whl (CPython 3.5m)
  • pyonmttok-1.20.0-cp27-cp27mu-manylinux1_x86_64.whl (CPython 2.7mu)
  • pyonmttok-1.20.0-cp27-cp27m-manylinux1_x86_64.whl (CPython 2.7m)
