OpenNMT tokenization library

Tokenizer

Tokenizer is a fast, generic, and customizable text tokenization library for C++ and Python with minimal dependencies.

Overview

By default, the Tokenizer applies a simple tokenization based on Unicode types. It can be customized in several ways:

  • Reversible tokenization
    Marking joints or spaces by annotating tokens or injecting modifier characters.
  • Subword tokenization
    Support for training and using BPE and SentencePiece models.
  • Advanced text segmentation
    Split digits, segment on case or alphabet change, segment each character of selected alphabets, etc.
  • Case management
    Lowercase text and return case information as a separate feature or inject case modifier tokens.
  • Protected sequences
    Sequences can be protected against tokenization with the special characters "⦅" and "⦆".
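
As an illustration of the protected sequences item above, here is a minimal pure-Python sketch. It does not use the Tokenizer library: toy_tokenize_protected is a hypothetical helper, the word/punctuation split is a stand-in for real tokenization, and keeping the "⦅" and "⦆" delimiters in the output token is an assumption of this sketch.

```python
import re

# Matches one protected span, delimiters included (non-greedy).
PROTECTED = re.compile(r"(⦅.*?⦆)")


def toy_tokenize_protected(text):
    """Toy tokenizer (hypothetical, not the library's algorithm).

    Text outside protected spans is split into word runs and single
    punctuation marks; each protected span is kept as one token.
    """
    tokens = []
    # re.split with a capturing group keeps the protected spans in the result.
    for chunk in PROTECTED.split(text):
        if PROTECTED.fullmatch(chunk):
            tokens.append(chunk)  # protected: emitted untouched
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


print(toy_tokenize_protected("Contact ⦅user@example.com⦆ today!"))
# ['Contact', '⦅user@example.com⦆', 'today', '!']
```

Without the delimiters, the same toy splitter would break the address into several tokens at "@" and ".".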

See the available options for an overview of supported features.
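
To make the case management idea concrete, the following pure-Python sketch lowercases tokens and returns case information as a separate feature list. It is not the library's implementation: the function names and the single-letter labels (C, U, L, M, N) are assumptions chosen for this example.

```python
def case_label(token):
    """Return a one-letter case label for a token (labels are hypothetical)."""
    if not any(ch.isalpha() for ch in token):
        return "N"  # no letters: case does not apply
    if token.islower():
        return "L"  # all lowercase
    if token.isupper():
        return "U"  # all uppercase
    if token[0].isupper() and token[1:].islower():
        return "C"  # capitalized
    return "M"      # mixed case


def lowercase_with_case_feature(tokens):
    """Lowercase every token and return the original casing as a parallel list."""
    lowered = [token.lower() for token in tokens]
    features = [case_label(token) for token in tokens]
    return lowered, features


print(lowercase_with_case_feature(["Hello", "WORLD", "iPhone", "42"]))
# (['hello', 'world', 'iphone', '42'], ['C', 'U', 'M', 'N'])
```

Keeping the feature list parallel to the token list lets a downstream model see a smaller lowercase vocabulary while the original casing can still be restored where it is unambiguous.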

Using

The Tokenizer can be used from Python, from C++, or on the command line. Each mode exposes the same set of options.

Python API

$ pip install pyonmttok

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'

See the Python API description for more details.
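
The round trip above works because the joiner marker "■" records where a token was attached to its neighbor. The following pure-Python sketch shows the idea without the library: toy_tokenize and toy_detokenize are hypothetical helpers, the splitting rule is far simpler than the library's conservative mode, and the side on which the joiner is placed may differ from what the library produces on other inputs.

```python
import re

JOINER = "■"  # joiner marker, as seen in the example output above


def toy_tokenize(text):
    """Split on whitespace and detach punctuation, marking joints with a joiner."""
    tokens = []
    for word in text.split():
        # Break the word into runs of word characters and single punctuation marks.
        pieces = re.findall(r"\w+|[^\w\s]", word)
        for i, piece in enumerate(pieces):
            # Every piece after the first was glued to its left neighbor.
            tokens.append(piece if i == 0 else JOINER + piece)
    return tokens


def toy_detokenize(tokens):
    """Join tokens with spaces, then drop the space next to each joiner."""
    text = " ".join(tokens)
    text = text.replace(" " + JOINER, "").replace(JOINER + " ", "")
    return text.replace(JOINER, "")


tokens = toy_tokenize("Hello World!")
print(tokens)                  # ['Hello', 'World', '■!']
print(toy_detokenize(tokens))  # Hello World!
```

Because the marker lives inside the tokens, detokenization needs no language-specific spacing rules. The sketch assumes the input text does not itself contain "■".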

C++ API

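#include <string>
#include <vector>

#include <onmt/Tokenizer.h>

using namespace onmt;

int main() {
  // Conservative mode, annotating tokens with joiner markers.
  Tokenizer tokenizer(Tokenizer::Mode::Conservative, Tokenizer::Flags::JoinerAnnotate);
  std::vector<std::string> tokens;
  // Fills `tokens` with the tokenized form of the input string.
  tokenizer.tokenize("Hello World!", tokens);
}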

See the Tokenizer class for more details.

Command line clients

$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate
Hello World ■!
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate | cli/detokenize
Hello World!

Run either client with the -h flag to list the available options.

Development

Dependencies

Compiling

CMake and a compiler that supports the C++11 standard are required to compile the project.

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=<Release or Debug> ..
make

This produces the dynamic library libOpenNMTTokenizer and the tokenization clients in cli/.

  • To compile only the library, use the -DLIB_ONLY=ON flag.
  • To compile with the ICU Unicode backend, use the -DWITH_ICU=ON flag.

Testing

The tests use Google Test, which is included as a Git submodule. Run the tests with:

test/onmt_tokenizer_test ../test/data
