pyonmttok

pyonmttok is the Python wrapper for OpenNMT/Tokenizer, a fast and customizable text tokenization library with BPE and SentencePiece support.

Installation:

pip install pyonmttok

Requirements:

  • OS: Linux, macOS
  • Python version: >= 3.5

Table of contents

  1. Tokenization
  2. Subword learning
  3. Token API
  4. Utilities

Tokenization

Example

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'
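
To see why joiner annotation makes the tokenization reversible, here is a minimal pure-Python sketch of joiner-based detokenization. It is an illustration of the idea only, not the library's implementation: the function name join_tokens is hypothetical, and the joiner is passed in explicitly, using the marker as it is printed in the example above.

```python
def join_tokens(tokens, joiner):
    """Rebuild text from tokens annotated with a joiner marker.

    A joiner at the start of a token means "attach to the previous token";
    a joiner at the end means "attach to the next token". All other token
    boundaries are rendered as a single space.
    """
    if not joiner:
        raise ValueError("joiner must be a non-empty string")
    pieces = []
    attach_next = False  # True when the previous token ended with a joiner
    for token in tokens:
        attach_left = token.startswith(joiner)
        if attach_left:
            token = token[len(joiner):]
        attach_right = token.endswith(joiner)
        if attach_right:
            token = token[:-len(joiner)]
        # Insert a space only when neither side asked to be joined.
        if pieces and not (attach_left or attach_next):
            pieces.append(" ")
        pieces.append(token)
        attach_next = attach_right
    return "".join(pieces)


JOINER = "■"  # marker as printed in the example above
print(join_tokens(["Hello", "World", JOINER + "!"], JOINER))
```

Without the joiner annotation, the tokens 'World' and '!' would be indistinguishable from two space-separated words, so the original spacing could not be recovered.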

Interface

Constructor

tokenizer = pyonmttok.Tokenizer(
    mode: str,
    *,
    lang: str = "",
    bpe_model_path: str = "",
    bpe_dropout: float = 0,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    sp_model_path: str = "",
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "■",
    joiner_annotate: bool = False,
    joiner_new: bool = False,
    support_prior_joiners: bool = False,
    spacer_annotate: bool = False,
    spacer_new: bool = False,
    case_feature: bool = False,
    case_markup: bool = False,
    soft_case_regions: bool = False,
    no_substitution: bool = False,
    preserve_placeholders: bool = False,
    preserve_segmented_tokens: bool = False,
    segment_case: bool = False,
    segment_numbers: bool = False,
    segment_alphabet_change: bool = False,
    segment_alphabet: Optional[List[str]] = None)

# SentencePiece-compatible tokenizer.
tokenizer = pyonmttok.SentencePieceTokenizer(
    model_path: str,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    nbest_size: int = 0,
    alpha: float = 0.1,
)

# Copy constructor.
tokenizer = pyonmttok.Tokenizer(tokenizer: pyonmttok.Tokenizer)

# Return the tokenization options (excluding options related to subword).
tokenizer.options

See the documentation for a description of each tokenization option.

Tokenization

# By default, tokenize returns the tokens and features.
# When training=False, subword regularization such as BPE dropout is disabled.
tokenizer.tokenize(
    text: str,
    training: bool = True,
) -> Tuple[List[str], List[List[str]]]

# The as_token_objects flag can alternatively return Token objects (see below).
tokenizer.tokenize(
    text: str,
    as_token_objects=True,  # Set explicitly; Token objects are not returned by default.
    training: bool = True,
) -> List[pyonmttok.Token]

# Tokenize a file.
tokenizer.tokenize_file(
    input_path: str,
    output_path: str,
    num_threads: int = 1,
    verbose: bool = False,
    training: bool = True,
)

Detokenization

# The detokenize method converts a list of tokens back to a string.
tokenizer.detokenize(
    tokens: Union[List[str], List[pyonmttok.Token]],
    features: Optional[List[List[str]]] = None
) -> str

# The detokenize_with_ranges method also returns a dictionary mapping a token
# index to a range in the detokenized text.
# Set merge_ranges=True to merge consecutive ranges, e.g. subwords of the same
# token in case of subword tokenization.
# Set unicode_ranges=True to return ranges over Unicode characters instead of bytes.
tokenizer.detokenize_with_ranges(
    tokens: Union[List[str], List[pyonmttok.Token]],
    merge_ranges: bool = False,
    unicode_ranges: bool = False
) -> Tuple[str, Dict[int, Tuple[int, int]]]
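
The distinction behind unicode_ranges matters as soon as the text contains non-ASCII characters, because one character can span several UTF-8 bytes. The sketch below converts a character range into the matching byte range. It is a standalone illustration: the helper name is hypothetical, and the half-open [start, end) convention is an assumption of this sketch, not a statement about the ranges returned by detokenize_with_ranges.

```python
def char_range_to_byte_range(text, start, end):
    """Map a half-open character range to the matching UTF-8 byte range."""
    prefix_bytes = len(text[:start].encode("utf-8"))
    span_bytes = len(text[start:end].encode("utf-8"))
    return prefix_bytes, prefix_bytes + span_bytes


text = "Café au lait"
# "Café" is 4 characters but 5 bytes in UTF-8, so later offsets shift by one.
print(char_range_to_byte_range(text, 0, 4))
print(char_range_to_byte_range(text, 5, 7))
```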

# Detokenize a file.
tokenizer.detokenize_file(input_path: str, output_path: str)

Subword learning

Example

The Python wrapper supports BPE and SentencePiece subword learning through a common interface:

1. Create the subword learner with the tokenization you want to apply, e.g.:

# BPE is trained and applied on the tokenization output before joiner (or spacer) annotations.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

# SentencePiece can learn from raw sentences so a tokenizer is not required.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.98)

2. Feed some raw data:

# Feed detokenized sentences:
learner.ingest("Hello world!")
learner.ingest("How are you?")

# or detokenized text files:
learner.ingest_file("/data/train1.en")
learner.ingest_file("/data/train2.en")

3. Start the learning process:

tokenizer = learner.learn("/data/model-32k")

The returned tokenizer instance can be used to apply subword tokenization on new data.
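
For intuition about what BPE learning does with the ingested text, here is a toy sketch of the classic merge loop: count adjacent symbol pairs, merge the most frequent one, and repeat. It illustrates the general algorithm only and is not the code run by pyonmttok.BPELearner. The names learn_bpe and merge_pair are hypothetical, and the tie-breaking rule is an arbitrary choice made to keep the sketch deterministic.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges, min_frequency=2):
    """Return the learned merges in the order they were selected."""
    # Each distinct word is a tuple of symbols mapped to its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by comparing the pairs.
        best, count = max(pairs.items(), key=lambda item: (item[1], item[0]))
        if count < min_frequency:
            break
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


corpus = "low low low lower lowest".split()
print(learn_bpe(corpus, num_merges=10))
```

In this sketch, num_merges plays a role comparable to the symbols argument, and learning stops early once no pair reaches min_frequency.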

Interface

# See https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py
# for argument documentation.
learner = pyonmttok.BPELearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "space".
    symbols: int = 10000,
    min_frequency: int = 2,
    total_symbols: bool = False)

# See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc
# for available training options.
learner = pyonmttok.SentencePieceLearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "none".
    keep_vocab: bool = False,  # Keep the generated vocabulary (model_path will act like model_prefix in spm_train)
    **training_options)

learner.ingest(text: str)
learner.ingest_file(path: str)
learner.ingest_token(token: Union[str, pyonmttok.Token])

learner.learn(model_path: str, verbose: bool = False) -> pyonmttok.Tokenizer

Token API

The Token API tokenizes text into pyonmttok.Token objects. This API is useful for applying logic at the token level while retaining enough information to write the tokenization to disk or to detokenize.

Example

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.tokenize("Hello World!", as_token_objects=True)
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokens[-1].surface
'!'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']
>>> tokens[-1].surface = '.'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■.']
>>> tokenizer.detokenize(tokens)
'Hello World.'

Interface

The pyonmttok.Token class has the following attributes:

  • surface: a string, the token value
  • type: a pyonmttok.TokenType value, the type of the token
  • join_left: a boolean, whether the token should be joined to the token on the left or not
  • join_right: a boolean, whether the token should be joined to the token on the right or not
  • preserve: a boolean, whether joiners and spacers can be attached to this token or not
  • features: a list of strings, the features attached to the token
  • spacer: a boolean, whether the token is prefixed by a SentencePiece spacer or not (only set when using SentencePiece)
  • casing: a pyonmttok.Casing value, the casing of the token (only set when tokenizing with case_feature or case_markup)

The pyonmttok.TokenType enumeration is used to identify tokens that were split by subword tokenization. The enumeration has the following values:

  • TokenType.WORD
  • TokenType.LEADING_SUBWORD
  • TokenType.TRAILING_SUBWORD

The pyonmttok.Casing enumeration is used to identify the original casing of a token that was lowercased by the case_feature or case_markup tokenization options. The enumeration has the following values:

  • Casing.LOWERCASE
  • Casing.UPPERCASE
  • Casing.MIXED
  • Casing.CAPITALIZED
  • Casing.NONE
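
As a rough illustration of what these casing categories describe, the sketch below classifies a word into one of five labels with the same names. The classification rules are assumptions made for this sketch; the library's exact rules are not stated here and may differ, for example for single-letter tokens. SketchCasing and classify_casing are hypothetical names.

```python
from enum import Enum


class SketchCasing(Enum):
    """Stand-in for the casing categories listed above (sketch only)."""
    NONE = "none"
    LOWERCASE = "lowercase"
    UPPERCASE = "uppercase"
    CAPITALIZED = "capitalized"
    MIXED = "mixed"


def classify_casing(word):
    """Classify a word by the case of its cased characters."""
    cased = [c for c in word if c.isupper() or c.islower()]
    if not cased:
        # Digits, punctuation, or scripts without case.
        return SketchCasing.NONE
    if all(c.islower() for c in cased):
        return SketchCasing.LOWERCASE
    if all(c.isupper() for c in cased):
        # A single uppercase letter also lands here: an arbitrary choice.
        return SketchCasing.UPPERCASE
    if cased[0].isupper() and all(c.islower() for c in cased[1:]):
        return SketchCasing.CAPITALIZED
    return SketchCasing.MIXED


for word in ["hello", "HELLO", "Hello", "hELLo", "123"]:
    print(word, classify_casing(word).name)
```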

The Tokenizer instances provide methods to serialize or deserialize Token objects:

# Serialize Token objects to strings that can be saved on disk.
tokenizer.serialize_tokens(tokens: List[pyonmttok.Token]) -> Tuple[List[str], List[List[str]]]

# Deserialize strings into Token objects.
tokenizer.deserialize_tokens(
    tokens: List[str],
    features: Optional[List[List[str]]] = None
) -> List[pyonmttok.Token]
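
The relationship between the join_left and join_right attributes and the serialized strings can be sketched in plain Python. This is a simplified stand-in, not the library's code: SketchToken, serialize and deserialize are hypothetical names, only the surface and join flags are modeled, and the joiner is the marker as printed in the examples above.

```python
from dataclasses import dataclass


@dataclass
class SketchToken:
    """Minimal stand-in for a token: a surface plus join flags."""
    surface: str
    join_left: bool = False
    join_right: bool = False


def serialize(tokens, joiner):
    """Turn token objects into strings by materializing the join flags."""
    strings = []
    for token in tokens:
        text = token.surface
        if token.join_left:
            text = joiner + text
        if token.join_right:
            text = text + joiner
        strings.append(text)
    return strings


def deserialize(strings, joiner):
    """Inverse of serialize: strip joiner markers back into join flags."""
    tokens = []
    for text in strings:
        join_left = text.startswith(joiner)
        if join_left:
            text = text[len(joiner):]
        join_right = text.endswith(joiner)
        if join_right:
            text = text[:-len(joiner)]
        tokens.append(SketchToken(text, join_left, join_right))
    return tokens


JOINER = "■"  # marker as printed in the examples above
tokens = [SketchToken("Hello"), SketchToken("World"), SketchToken("!", join_left=True)]
strings = serialize(tokens, JOINER)
print(strings)
print(deserialize(strings, JOINER) == tokens)
```

Keeping the join information as flags rather than as characters inside the surface is what lets the example above change the surface from '!' to '.' without touching the joiner.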

Utilities

Interface

# Returns True if the string has the placeholder format.
pyonmttok.is_placeholder(token: str)

# Sets the random seed for reproducible tokenization.
pyonmttok.set_random_seed(seed: int)
