OpenNMT tokenization library

Tokenizer

Tokenizer is a fast, generic, and customizable text tokenization library for C++ and Python with minimal dependencies.

Overview

By default, the Tokenizer applies a simple tokenization based on Unicode types. It can be customized in several ways:

  • Reversible tokenization
    Marking joints or spaces by annotating tokens or injecting modifier characters.
  • Subword tokenization
    Support for training and using BPE and SentencePiece models.
  • Advanced text segmentation
    Split digits, segment on case or alphabet change, segment each character of selected alphabets, etc.
  • Case management
    Lowercase text and return case information as a separate feature or inject case modifier tokens.
  • Protected sequences
    Sequences can be protected against tokenization with the special characters "⦅" and "⦆".

See the available options for an overview of supported features.
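The case management feature listed above can be illustrated with a small standalone sketch. It does not use or reproduce the library's implementation: the function name case_feature and the single-letter labels are hypothetical, chosen only to show how a token can be lowercased while its original casing is kept as a separate feature.

```python
def case_feature(token):
    """Return (lowercased token, case label) for a single token.

    The labels are illustrative assumptions, not the library's output:
    N = no cased characters, L = lowercase, U = uppercase,
    C = capitalized, M = mixed case.
    """
    lowered = token.lower()
    if lowered == token.upper():
        # Lowercasing and uppercasing agree: nothing in the token has case.
        return token, "N"
    if token == lowered:
        return lowered, "L"
    if token == token.upper():
        return lowered, "U"
    if token[0] != token[0].lower() and token[1:] == token[1:].lower():
        # Only the first character is uppercase.
        return lowered, "C"
    return lowered, "M"


pairs = [case_feature(t) for t in ["Hello", "WORLD", "42"]]
# pairs == [('hello', 'C'), ('world', 'U'), ('42', 'N')]
```

Keeping the label next to the lowercased token is what makes the transformation recoverable: a downstream step can restore "Hello" from ("hello", "C") without storing the original string.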

Using

The Tokenizer can be used from Python, C++, or the command line. Each mode exposes the same set of options.

Python API

$ pip install pyonmttok

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'

See the Python API description for more details.
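To make the idea behind joiner annotation concrete, here is a minimal pure-Python sketch of a reversible tokenizer. It is not pyonmttok's algorithm: the regular expression, the function names, and the rule of always placing the marker on the left of an attached piece are simplifying assumptions, and the JOINER constant is copied from the example output above. The round trip is only exact for text whose words are separated by single spaces and that does not contain the marker itself.

```python
import re

# Marker copied from the example output above (assumption for this sketch).
JOINER = "■"


def tokenize(text):
    """Split on whitespace, then split each word into runs of word
    characters and single punctuation marks. Any piece that was attached
    to the previous piece is prefixed with the joiner marker."""
    tokens = []
    for word in text.split():
        pieces = re.findall(r"\w+|[^\w\s]", word)
        for i, piece in enumerate(pieces):
            tokens.append(piece if i == 0 else JOINER + piece)
    return tokens


def detokenize(tokens):
    """Rebuild the text: a joiner-prefixed token is glued to the previous
    token, any other token is preceded by a space."""
    text = ""
    for i, token in enumerate(tokens):
        if token.startswith(JOINER):
            text += token[len(JOINER):]
        else:
            text += (" " if i > 0 else "") + token
    return text


tokens = tokenize("Hello World!")
# tokens == ['Hello', 'World', '■!']
restored = detokenize(tokens)
# restored == 'Hello World!'
```

Because the marker records where a space was absent, no information about spacing is lost at tokenization time, which is what lets detokenization work without language-specific rules.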

C++ API

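#include <string>
#include <vector>

#include <onmt/Tokenizer.h>

using namespace onmt;

int main() {
  // Conservative mode, with joiner annotation enabled.
  Tokenizer tokenizer(Tokenizer::Mode::Conservative, Tokenizer::Flags::JoinerAnnotate);
  std::vector<std::string> tokens;
  // The resulting tokens are written to the output vector.
  tokenizer.tokenize("Hello World!", tokens);
  return 0;
}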

See the Tokenizer class for more details.

Command line clients

$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate
Hello World ■!
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate | cli/detokenize
Hello World!

Run either client with the -h flag to list the available options.

Development

Dependencies

Compiling

CMake and a compiler that supports the C++11 standard are required to compile the project.

git submodule update --init
mkdir build
cd build
cmake ..
make

This produces the dynamic library libOpenNMTTokenizer and the tokenization clients in cli/.

  • To compile only the library, use the -DLIB_ONLY=ON flag.

Testing

The tests use Google Test, which is included as a Git submodule. Run the tests with:

mkdir build
cd build
cmake -DBUILD_TESTS=ON ..
make
test/onmt_tokenizer_test ../test/data



