OpenNMT tokenization library

Tokenizer

Tokenizer is a fast, generic, and customizable text tokenization library for C++ and Python with minimal dependencies.

Overview

By default, the Tokenizer applies a simple tokenization based on Unicode types. It can be customized in several ways:

  • Reversible tokenization
    Marking joints or spaces by annotating tokens or injecting modifier characters.
  • Subword tokenization
    Support for training and using BPE and SentencePiece models.
  • Advanced text segmentation
    Split digits, segment on case or alphabet change, segment each character of selected alphabets, etc.
  • Case management
    Lowercase text and return case information as a separate feature or inject case modifier tokens.
  • Protected sequences
    Sequences can be protected against tokenization with the special characters "｟" and "｠".

See the available options for an overview of supported features.
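
To make the case management option above more concrete, here is a minimal pure-Python sketch of returning case information as a separate feature. It does not use the library: the function names and the single-letter labels (C, U, L, M, N) are assumptions chosen for illustration, and the Tokenizer's actual behavior may differ.

```python
def case_label(token):
    """Classify the casing of a token (labels are illustrative, not the library's)."""
    lowered, uppered = token.lower(), token.upper()
    if lowered == uppered:
        return "N"  # no cased characters (digits, punctuation)
    if token == lowered:
        return "L"  # all lowercase
    if token == uppered:
        return "U"  # all uppercase (also matches a single capital letter)
    if token[0].isupper() and token[1:] == lowered[1:]:
        return "C"  # capitalized: first letter upper, rest lower
    return "M"      # mixed case


def annotate_case(tokens):
    """Lowercase every token and return the case labels as a parallel list."""
    lowered = [token.lower() for token in tokens]
    features = [case_label(token) for token in tokens]
    return lowered, features


tokens, features = annotate_case(["Hello", "NASA", "iPhone", "world", "42"])
print(tokens)
print(features)
```

Keeping the labels in a parallel list leaves the vocabulary lowercase while preserving enough information to restore capitalized and uppercase tokens afterwards; mixed-case tokens cannot be restored from a single label.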

Using

The Tokenizer can be used from Python, from C++, or from the command line. Each mode exposes the same set of options.

Python API

pip install pyonmttok

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '￭!']
>>> tokenizer.detokenize(tokens)
'Hello World!'

See the Python API description for more details.
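
The round trip in the session above works because the joiner marker records where a token was attached to its neighbor. The following is a toy sketch of that idea in plain Python, not the library's implementation: the regular expression, the function names, and the choice to mark only left-side joints are assumptions made for illustration.

```python
import re

JOINER = "\uffed"  # the "￭" marker; assumed here to match the library's default joiner


def tokenize(text):
    """Split into word runs and single punctuation marks, marking left-side joints."""
    tokens = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        start = match.start()
        # A token is "attached" when no whitespace separated it from the previous one.
        attached = start > 0 and not text[start - 1].isspace()
        tokens.append(JOINER + match.group() if attached else match.group())
    return tokens


def detokenize(tokens):
    """Rebuild the text: joiner-marked tokens are glued on, others get a space."""
    text = ""
    for index, token in enumerate(tokens):
        if token.startswith(JOINER):
            text += token[len(JOINER):]
        elif index > 0:
            text += " " + token
        else:
            text += token
    return text


tokens = tokenize("Hello World!")
print(tokens)
print(detokenize(tokens))
```

This sketch collapses runs of whitespace into a single space, so it is only reversible for text that is already normalized in that way.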

C++ API

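#include <string>
#include <vector>

#include <onmt/Tokenizer.h>

using namespace onmt;

int main() {
  // Conservative mode, with joiner markers so the output can be detokenized.
  Tokenizer tokenizer(Tokenizer::Mode::Conservative, Tokenizer::Flags::JoinerAnnotate);
  std::vector<std::string> tokens;
  tokenizer.tokenize("Hello World!", tokens);  // fills `tokens` in place
  return 0;
}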

See the Tokenizer class for more details.

Command line clients

$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate
Hello World ￭!
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate | cli/detokenize
Hello World!

Run either client with the -h flag to list the available options.

Development

Dependencies

Compiling

CMake and a compiler that supports the C++11 standard are required to compile the project.

git submodule update --init
mkdir build
cd build
cmake ..
make

This produces the dynamic library libOpenNMTTokenizer and the tokenization clients in cli/.

  • To compile only the library, use the -DLIB_ONLY=ON flag.

Testing

The tests use Google Test, which is included as a Git submodule. Run the tests with:

mkdir build
cd build
cmake -DBUILD_TESTS=ON ..
make
test/onmt_tokenizer_test ../test/data

Download files

No source distribution files are available for this release. The following built distributions (wheels) are provided, each 2.5 MB:

  • pyonmttok-1.22.1-cp39-cp39-manylinux1_x86_64.whl (CPython 3.9)
  • pyonmttok-1.22.1-cp38-cp38-manylinux1_x86_64.whl (CPython 3.8)
  • pyonmttok-1.22.1-cp37-cp37m-manylinux1_x86_64.whl (CPython 3.7m)
  • pyonmttok-1.22.1-cp36-cp36m-manylinux1_x86_64.whl (CPython 3.6m)
  • pyonmttok-1.22.1-cp35-cp35m-manylinux1_x86_64.whl (CPython 3.5m)
  • pyonmttok-1.22.1-cp27-cp27mu-manylinux1_x86_64.whl (CPython 2.7mu)
  • pyonmttok-1.22.1-cp27-cp27m-manylinux1_x86_64.whl (CPython 2.7m)
