Fast and customizable text tokenization library with BPE and SentencePiece support

pyonmttok

pyonmttok is the Python wrapper for OpenNMT/Tokenizer, a fast and customizable text tokenization library with BPE and SentencePiece support.

Installation:

pip install pyonmttok

Requirements:

  • OS: Linux, macOS
  • Python version: >= 3.5

Table of contents

  1. Tokenization
  2. Subword learning
  3. Token API
  4. Utilities

Tokenization

Example

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'
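
To make the role of the joiner marker in the example above more concrete, here is a minimal pure-Python sketch of how joiner-annotated tokens can be glued back into text. It is not the library's implementation: the function name join_annotated is hypothetical, and the sketch only covers joiners attached to a token (the joiner_annotate=True case), not standalone joiners, spacers, or case features.

```python
def join_annotated(tokens, joiner="■"):
    """Rebuild text from joiner-annotated tokens (illustrative sketch only)."""
    pieces = []
    attach_next = False  # True when the previous token ended with a joiner
    for token in tokens:
        # A leading joiner means "no space before this token".
        attach_left = token.startswith(joiner)
        if attach_left:
            token = token[len(joiner):]
        # A trailing joiner means "no space after this token".
        attach_right = token.endswith(joiner)
        if attach_right:
            token = token[:-len(joiner)]
        if pieces and not (attach_left or attach_next):
            pieces.append(" ")
        pieces.append(token)
        attach_next = attach_right
    return "".join(pieces)


print(join_annotated(["Hello", "World", "■!"]))  # Hello World!
```

In real code, tokenizer.detokenize should be preferred, since it also handles the other tokenization options.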

Interface

Constructor

tokenizer = pyonmttok.Tokenizer(
    mode: str,
    *,
    lang: str = "",
    bpe_model_path: str = "",
    bpe_dropout: float = 0,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    sp_model_path: str = "",
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "■",
    joiner_annotate: bool = False,
    joiner_new: bool = False,
    support_prior_joiners: bool = False,
    spacer_annotate: bool = False,
    spacer_new: bool = False,
    case_feature: bool = False,
    case_markup: bool = False,
    soft_case_regions: bool = False,
    no_substitution: bool = False,
    preserve_placeholders: bool = False,
    preserve_segmented_tokens: bool = False,
    segment_case: bool = False,
    segment_numbers: bool = False,
    segment_alphabet_change: bool = False,
    segment_alphabet: Optional[List[str]] = None)

# SentencePiece-compatible tokenizer.
tokenizer = pyonmttok.SentencePieceTokenizer(
    model_path: str,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    nbest_size: int = 0,
    alpha: float = 0.1,
)

# Copy constructor.
tokenizer = pyonmttok.Tokenizer(tokenizer: pyonmttok.Tokenizer)

# Return the tokenization options (excluding options related to subword).
tokenizer.options

See the documentation for a description of each tokenization option.

Tokenization

# By default, tokenize returns the tokens and features.
# When training=False, subword regularization such as BPE dropout is disabled.
tokenizer.tokenize(
    text: str,
    as_token_objects: bool = False,
    training: bool = True,
) -> Tuple[List[str], Optional[List[List[str]]]]

# The as_token_objects flag can alternatively return Token objects (see below).
tokenizer.tokenize(
    text: str,
    as_token_objects: bool = True,
    training: bool = True,
) -> List[pyonmttok.Token]

# Tokenize a file.
tokenizer.tokenize_file(
    input_path: str,
    output_path: str,
    num_threads: int = 1,
    verbose: bool = False,
    training: bool = True,
)

Detokenization

# The detokenize method converts a list of tokens back to a string.
tokenizer.detokenize(
    tokens: Union[List[str], List[pyonmttok.Token]],
    features: Optional[List[List[str]]] = None
) -> str

# The detokenize_with_ranges method also returns a dictionary mapping a token
# index to a range in the detokenized text.
# Set merge_ranges=True to merge consecutive ranges, e.g. subwords of the same
# token in case of subword tokenization.
# Set unicode_ranges=True to return ranges over Unicode characters instead of bytes.
tokenizer.detokenize_with_ranges(
    tokens: Union[List[str], List[pyonmttok.Token]],
    merge_ranges: bool = False,
    unicode_ranges: bool = False
) -> Tuple[str, Dict[int, Tuple[int, int]]]
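
The difference between byte ranges and Unicode character ranges matters as soon as the text contains non-ASCII characters. The sketch below converts character ranges into byte ranges in plain Python. It rests on two assumptions that are not stated in this document: ranges are (start, end) pairs with an inclusive end, and byte offsets are counted over the UTF-8 encoding. The helper name char_to_byte_range is hypothetical.

```python
def char_to_byte_range(text, char_range):
    """Convert an inclusive (start, end) character range to UTF-8 byte offsets."""
    start, end = char_range
    # Bytes before the first character of the range.
    byte_start = len(text[:start].encode("utf-8"))
    # Bytes up to and including the last character of the range.
    byte_end = len(text[:end + 1].encode("utf-8")) - 1
    return byte_start, byte_end


text = "Héllo World!"
# Character ranges for the tokens "Héllo", "World" and "!" (assumed inclusive).
char_ranges = {0: (0, 4), 1: (6, 10), 2: (11, 11)}
byte_ranges = {
    index: char_to_byte_range(text, char_range)
    for index, char_range in char_ranges.items()
}
print(byte_ranges)  # {0: (0, 5), 1: (7, 11), 2: (12, 12)}
```

Because "é" takes two bytes in UTF-8, every byte offset after it is shifted by one compared to the character offset.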

# Detokenize a file.
tokenizer.detokenize_file(input_path: str, output_path: str)

Subword learning

Example

The Python wrapper supports BPE and SentencePiece subword learning through a common interface:

1. Create the subword learner with the tokenization you want to apply, e.g.:

# BPE is trained and applied on the tokenization output before joiner (or spacer) annotations.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

# SentencePiece can learn from raw sentences so a tokenizer is not required.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.98)

2. Feed some raw data:

# Feed detokenized sentences:
learner.ingest("Hello world!")
learner.ingest("How are you?")

# or detokenized text files:
learner.ingest_file("/data/train1.en")
learner.ingest_file("/data/train2.en")

3. Start the learning process:

tokenizer = learner.learn("/data/model-32k")

The returned tokenizer instance can be used to apply subword tokenization on new data.
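
For intuition about what a BPE learner computes, the following pure-Python sketch learns merge operations from a small list of words by repeatedly merging the most frequent adjacent symbol pair. It is a simplified illustration, not what pyonmttok.BPELearner runs: the function name learn_bpe is hypothetical, there is no end-of-word marker or minimum frequency, and ties are broken arbitrarily but deterministically.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merge operations from a list of words."""
    # Each distinct word is a tuple of symbols (initially characters) with a count.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties go to the lexicographically larger pair.
        best = max(pair_counts, key=lambda pair: (pair_counts[pair], pair))
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('o', 'w'), ('l', 'ow')]
```

The symbols argument of BPELearner plays a role comparable to num_merges here: it bounds how many merge operations are learned.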

Interface

# See https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py
# for argument documentation.
learner = pyonmttok.BPELearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "space".
    symbols: int = 10000,
    min_frequency: int = 2,
    total_symbols: bool = False)

# See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc
# for available training options.
learner = pyonmttok.SentencePieceLearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "none".
    keep_vocab: bool = False,  # Keep the generated vocabulary (model_path will act like model_prefix in spm_train)
    **training_options)

learner.ingest(text: str)
learner.ingest_file(path: str)
learner.ingest_token(token: Union[str, pyonmttok.Token])

learner.learn(model_path: str, verbose: bool = False) -> pyonmttok.Tokenizer

Token API

The Token API tokenizes text into pyonmttok.Token objects. This API is useful for applying logic at the token level while retaining enough information to write the tokenization to disk or to detokenize.

Example

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.tokenize("Hello World!", as_token_objects=True)
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokens[-1].surface
'!'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']
>>> tokens[-1].surface = '.'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■.']
>>> tokenizer.detokenize(tokens)
'Hello World.'

Interface

The pyonmttok.Token class has the following attributes:

  • surface: a string, the token value
  • type: a pyonmttok.TokenType value, the type of the token
  • join_left: a boolean, whether the token should be joined to the token on the left or not
  • join_right: a boolean, whether the token should be joined to the token on the right or not
  • preserve: a boolean, whether joiners and spacers can be attached to this token or not
  • features: a list of strings, the features attached to the token
  • spacer: a boolean, whether the token is prefixed by a SentencePiece spacer or not (only set when using SentencePiece)
  • casing: a pyonmttok.Casing value, the casing of the token (only set when tokenizing with case_feature or case_markup)

The pyonmttok.TokenType enumeration is used to identify tokens that were split by a subword tokenization. The enumeration has the following values:

  • TokenType.WORD
  • TokenType.LEADING_SUBWORD
  • TokenType.TRAILING_SUBWORD

The pyonmttok.Casing enumeration is used to identify the original casing of a token that was lowercased by the case_feature or case_markup tokenization options. The enumeration has the following values:

  • Casing.LOWERCASE
  • Casing.UPPERCASE
  • Casing.MIXED
  • Casing.CAPITALIZED
  • Casing.NONE
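
As a rough illustration of what these casing values describe, the sketch below classifies a string into the five categories in plain Python. The CasingSketch enumeration and classify_casing function are hypothetical stand-ins, and the rules are an assumption: the library's exact behavior (for example on single-letter tokens) may differ.

```python
from enum import Enum


class CasingSketch(Enum):
    """Hypothetical stand-in mirroring the names of pyonmttok.Casing."""
    NONE = "none"
    LOWERCASE = "lowercase"
    UPPERCASE = "uppercase"
    MIXED = "mixed"
    CAPITALIZED = "capitalized"


def classify_casing(token):
    """Classify the casing of a token using simple, assumed rules."""
    # Only characters that have a case take part in the decision.
    cased = [ch for ch in token if ch.isupper() or ch.islower()]
    if not cased:
        return CasingSketch.NONE
    if all(ch.islower() for ch in cased):
        return CasingSketch.LOWERCASE
    if all(ch.isupper() for ch in cased):
        return CasingSketch.UPPERCASE
    if cased[0].isupper() and all(ch.islower() for ch in cased[1:]):
        return CasingSketch.CAPITALIZED
    return CasingSketch.MIXED


for word in ["hello", "HELLO", "Hello", "iPhone", "123"]:
    print(word, classify_casing(word).name)
```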

The Tokenizer instances provide methods to serialize or deserialize Token objects:

# Serialize Token objects to strings that can be saved on disk.
tokenizer.serialize_tokens(tokens: List[pyonmttok.Token]) -> Tuple[List[str], List[List[str]]]

# Deserialize strings into Token objects.
tokenizer.deserialize_tokens(
    tokens: List[str],
    features: Optional[List[List[str]]] = None
) -> List[pyonmttok.Token]
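
The relationship between the join_left and join_right attributes and the serialized strings can be sketched without the library. The SketchToken class and serialize function below are hypothetical stand-ins that model only the surface and join attributes with the default joiner. They ignore features, spacers, casing, and the preserve flag, so this is an illustration of the idea rather than the behavior of serialize_tokens.

```python
from dataclasses import dataclass


@dataclass
class SketchToken:
    """Hypothetical, reduced stand-in for pyonmttok.Token."""
    surface: str
    join_left: bool = False
    join_right: bool = False


def serialize(tokens, joiner="■"):
    """Turn token objects into joiner-annotated strings."""
    serialized = []
    for token in tokens:
        left = joiner if token.join_left else ""
        right = joiner if token.join_right else ""
        serialized.append(left + token.surface + right)
    return serialized


tokens = [
    SketchToken("Hello"),
    SketchToken("World"),
    SketchToken("!", join_left=True),
]
before = serialize(tokens)
print(before)  # ['Hello', 'World', '■!']

# Editing the surface keeps the join information attached to the token.
tokens[-1].surface = "."
after = serialize(tokens)
print(after)  # ['Hello', 'World', '■.']
```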

Utilities

Interface

# Returns True if the string has the placeholder format.
pyonmttok.is_placeholder(token: str)

# Sets the random seed for reproducible tokenization.
pyonmttok.set_random_seed(seed: int)
