Fast and customizable text tokenization library with BPE and SentencePiece support

pyonmttok

pyonmttok is the Python wrapper for OpenNMT/Tokenizer, a fast and customizable text tokenization library with BPE and SentencePiece support.

Installation:

pip install pyonmttok

Requirements:

  • OS: Linux, macOS
  • Python version: >= 3.5

Table of contents

  1. Tokenization
  2. Subword learning
  3. Token API
  4. Utilities

Tokenization

Example

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'
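
The joiner marker is what makes this tokenization reversible: it records that a token was attached to its neighbor in the original text. The following is a minimal pure-Python sketch of that idea, not the library's implementation. The helper name `join_tokens` is hypothetical, and the sketch only handles joiner markers on plain string tokens.

```python
JOINER = "■"  # default joiner character, as in the constructor signature


def join_tokens(tokens, joiner=JOINER):
    """Rebuild text from joiner-annotated tokens (illustrative only)."""
    text = ""
    glue_next = False  # True when the previous token ended with a joiner
    for token in tokens:
        glue_left = token.startswith(joiner)
        if glue_left:
            token = token[len(joiner):]
        glue_right = token.endswith(joiner)
        if glue_right:
            token = token[:-len(joiner)]
        # Insert a space unless a joiner says the two tokens were attached.
        if text and not (glue_left or glue_next):
            text += " "
        text += token
        glue_next = glue_right
    return text


print(join_tokens(["Hello", "World", "■!"]))  # Hello World!
```

Without the marker on the last token, the same function would produce 'Hello World !', which is why joiner_annotate=True is needed for a lossless round trip.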

Interface

Constructor

tokenizer = pyonmttok.Tokenizer(
    mode: str,
    *,
    lang: str = "",
    bpe_model_path: str = "",
    bpe_dropout: float = 0,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    sp_model_path: str = "",
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "■",
    joiner_annotate: bool = False,
    joiner_new: bool = False,
    support_prior_joiners: bool = False,
    spacer_annotate: bool = False,
    spacer_new: bool = False,
    case_feature: bool = False,
    case_markup: bool = False,
    soft_case_regions: bool = False,
    no_substitution: bool = False,
    preserve_placeholders: bool = False,
    preserve_segmented_tokens: bool = False,
    segment_case: bool = False,
    segment_numbers: bool = False,
    segment_alphabet_change: bool = False,
    segment_alphabet: Optional[List[str]] = None)

# SentencePiece-compatible tokenizer.
tokenizer = pyonmttok.SentencePieceTokenizer(
    model_path: str,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    nbest_size: int = 0,
    alpha: float = 0.1,
)

# Copy constructor.
tokenizer = pyonmttok.Tokenizer(tokenizer: pyonmttok.Tokenizer)

# Return the tokenization options (excluding options related to subword).
tokenizer.options

See the documentation for a description of each tokenization option.

Tokenization

# By default, tokenize returns the tokens and features.
# When training=False, subword regularization such as BPE dropout is disabled.
tokenizer.tokenize(
    text: str,
    training: bool = True,
) -> Tuple[List[str], List[List[str]]]

# Pass as_token_objects=True to return Token objects instead (see below).
tokenizer.tokenize(
    text: str,
    as_token_objects=True,
    training: bool = True,
) -> List[pyonmttok.Token]

# Tokenize a file.
tokenizer.tokenize_file(
    input_path: str,
    output_path: str,
    num_threads: int = 1,
    verbose: bool = False,
    training: bool = True,
)
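
The training flag matters when subword regularization is enabled. As a rough illustration of what BPE dropout does, the sketch below randomly skips each candidate merge with probability `dropout` during training and never skips merges otherwise. It shows the general technique in plain Python and is not pyonmttok's code; `apply_bpe`, the merge table, and the `rng` parameter are hypothetical names.

```python
import random


def apply_bpe(word, merges, dropout=0.0, training=True, rng=random):
    """Segment a word with ranked BPE merges, optionally dropping merges."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    p = dropout if training else 0.0  # regularization is off outside training
    while len(symbols) > 1:
        # Collect applicable merges, skipping each one with probability p.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and not (p > 0 and rng.random() < p)
        ]
        if not candidates:
            break
        _, i = min(candidates)  # highest-priority merge, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(apply_bpe("lower", merges, dropout=0.5, training=False))  # ['low', 'er']
```

With training=False the segmentation is deterministic, which is usually what you want when tokenizing evaluation or inference data.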

Detokenization

# The detokenize method converts a list of tokens back to a string.
tokenizer.detokenize(
    tokens: Union[List[str], List[pyonmttok.Token]],
    features: Optional[List[List[str]]] = None
) -> str

# The detokenize_with_ranges method also returns a dictionary mapping a token
# index to a range in the detokenized text.
# Set merge_ranges=True to merge consecutive ranges, e.g. subwords of the same
# token in case of subword tokenization.
# Set unicode_ranges=True to return ranges over Unicode characters instead of bytes.
tokenizer.detokenize_with_ranges(
    tokens: Union[List[str], List[pyonmttok.Token]],
    merge_ranges: bool = False,
    unicode_ranges: bool = False
) -> Tuple[str, Dict[int, Tuple[int, int]]]

# Detokenize a file.
tokenizer.detokenize_file(input_path: str, output_path: str)
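
The difference between byte ranges and Unicode ranges only shows up with non-ASCII text. The sketch below is plain Python and does not call the library: it joins tokens with single spaces and reports an inclusive (start, end) range for each token index. The inclusive convention and the helper name `token_ranges` are assumptions made for illustration; check the library's output for its actual convention.

```python
def token_ranges(tokens, unicode_ranges=False):
    """Map each token index to its (start, end) span in the joined text."""
    ranges = {}
    offset = 0
    for index, token in enumerate(tokens):
        # Measure in characters, or in UTF-8 bytes when unicode_ranges is off.
        length = len(token) if unicode_ranges else len(token.encode("utf-8"))
        ranges[index] = (offset, offset + length - 1)
        offset += length + 1  # one separating space (1 character, 1 byte)
    return " ".join(tokens), ranges


words = ["caf\u00e9", "noir"]  # "\u00e9" is 1 character but 2 bytes in UTF-8
text, byte_ranges = token_ranges(words)
_, char_ranges = token_ranges(words, unicode_ranges=True)
print(byte_ranges)  # {0: (0, 4), 1: (6, 9)}
print(char_ranges)  # {0: (0, 3), 1: (5, 8)}
```

Character ranges are the ones to use when slicing a Python str, while byte ranges match offsets in UTF-8 encoded data.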

Subword learning

Example

The Python wrapper supports BPE and SentencePiece subword learning through a common interface:

1. Create the subword learner with the tokenization you want to apply, e.g.:

# BPE is trained and applied on the tokenization output before joiner (or spacer) annotations.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

# SentencePiece can learn from raw sentences so a tokenizer is not required.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.98)

2. Feed some raw data:

# Feed detokenized sentences:
learner.ingest("Hello world!")
learner.ingest("How are you?")

# or detokenized text files:
learner.ingest_file("/data/train1.en")
learner.ingest_file("/data/train2.en")

3. Start the learning process:

tokenizer = learner.learn("/data/model-32k")

The returned tokenizer instance can be used to apply subword tokenization on new data.
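
For intuition about what a BPE learner computes, here is a toy version of the merge-learning loop popularized by subword-nmt. It is a simplified sketch, not what BPELearner runs internally: `learn_bpe_merges` is a hypothetical helper, words are split on whitespace, and its `symbols` and `min_frequency` parameters only loosely mirror the options of the same names.

```python
from collections import Counter


def learn_bpe_merges(sentences, symbols=10, min_frequency=2):
    """Return up to `symbols` merge operations, most frequent pair first."""
    # Represent each distinct word as a tuple of symbols with its count.
    vocab = Counter(tuple(word) for s in sentences for word in s.split())
    merges = []
    for _ in range(symbols):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, count in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best, frequency = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if frequency < min_frequency:
            break
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab = Counter()
        for word, count in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if word[i:i + 2] == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += count
        vocab = new_vocab
    return merges


print(learn_bpe_merges(["low lower lowest", "low low"], symbols=2))
# [('o', 'w'), ('l', 'ow')]
```

Ties between equally frequent pairs are broken here by comparing the pairs themselves, which is an arbitrary choice for the sketch.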

Interface

# See https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py
# for argument documentation.
learner = pyonmttok.BPELearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "space".
    symbols: int = 10000,
    min_frequency: int = 2,
    total_symbols: bool = False)

# See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc
# for available training options.
learner = pyonmttok.SentencePieceLearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "none".
    keep_vocab: bool = False,  # Keep the generated vocabulary (model_path will act like model_prefix in spm_train)
    **training_options)

learner.ingest(text: str)
learner.ingest_file(path: str)
learner.ingest_token(token: Union[str, pyonmttok.Token])

learner.learn(model_path: str, verbose: bool = False) -> pyonmttok.Tokenizer

Token API

The Token API tokenizes text into pyonmttok.Token objects. It is useful for applying logic at the token level while retaining enough information to write the tokenization to disk or to detokenize.

Example

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.tokenize("Hello World!", as_token_objects=True)
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokens[-1].surface
'!'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']
>>> tokens[-1].surface = '.'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■.']
>>> tokenizer.detokenize(tokens)
'Hello World.'
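
To make the role of join_left concrete, the following dataclass mimics just enough of a token to reproduce the serialization shown above. `SketchToken` and `serialize` are hypothetical stand-ins written for this illustration; the real pyonmttok.Token carries more attributes, and the real serialize_tokens method also returns features.

```python
from dataclasses import dataclass

JOINER = "■"  # default joiner character


@dataclass
class SketchToken:
    """Stand-in for a token: a surface form plus join flags."""
    surface: str
    join_left: bool = False
    join_right: bool = False


def serialize(tokens, joiner=JOINER):
    """Attach joiner markers according to each token's join flags."""
    return [
        (joiner if t.join_left else "") + t.surface + (joiner if t.join_right else "")
        for t in tokens
    ]


tokens = [SketchToken("Hello"), SketchToken("World"), SketchToken("!", join_left=True)]
print(serialize(tokens))  # ['Hello', 'World', '■!']
tokens[-1].surface = "."  # edit at the token level; the join flags are kept
print(serialize(tokens))  # ['Hello', 'World', '■.']
```

Keeping the join information separate from the surface is what lets the surface be edited without losing the ability to detokenize.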

Interface

The pyonmttok.Token class has the following attributes:

  • surface: a string, the token value
  • type: a pyonmttok.TokenType value, the type of the token
  • join_left: a boolean, whether the token should be joined to the token on the left or not
  • join_right: a boolean, whether the token should be joined to the token on the right or not
  • preserve: a boolean, whether joiners and spacers can be attached to this token or not
  • features: a list of strings, the features attached to the token
  • spacer: a boolean, whether the token is prefixed by a SentencePiece spacer or not (only set when using SentencePiece)
  • casing: a pyonmttok.Casing value, the casing of the token (only set when tokenizing with case_feature or case_markup)

The pyonmttok.TokenType enumeration is used to identify tokens that were split by subword tokenization. The enumeration has the following values:

  • TokenType.WORD
  • TokenType.LEADING_SUBWORD
  • TokenType.TRAILING_SUBWORD

The pyonmttok.Casing enumeration is used to identify the original casing of a token that was lowercased by the case_feature or case_markup tokenization options. The enumeration has the following values:

  • Casing.LOWERCASE
  • Casing.UPPERCASE
  • Casing.MIXED
  • Casing.CAPITALIZED
  • Casing.NONE
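
As an illustration of what these values describe, the sketch below classifies the casing of a string in plain Python. The `Casing` enum here is a local stand-in for pyonmttok.Casing, and the classification rules (in particular for single letters and for tokens without cased characters) are assumptions; the library's own rules may differ.

```python
from enum import Enum


class Casing(Enum):  # local stand-in, not pyonmttok.Casing
    NONE = 0
    LOWERCASE = 1
    UPPERCASE = 2
    MIXED = 3
    CAPITALIZED = 4


def guess_casing(token):
    """Classify the casing of a token using str methods."""
    if not any(c.isalpha() and c.lower() != c.upper() for c in token):
        return Casing.NONE  # digits, punctuation, caseless scripts
    if token.islower():
        return Casing.LOWERCASE
    if token[0].isupper() and (len(token) == 1 or token[1:].islower()):
        return Casing.CAPITALIZED
    if token.isupper():
        return Casing.UPPERCASE
    return Casing.MIXED


samples = ["hello", "HELLO", "Hello", "hELLo", "123"]
print([guess_casing(t).name for t in samples])
# ['LOWERCASE', 'UPPERCASE', 'CAPITALIZED', 'MIXED', 'NONE']
```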

The Tokenizer instances provide methods to serialize or deserialize Token objects:

# Serialize Token objects to strings that can be saved on disk.
tokenizer.serialize_tokens(tokens: List[pyonmttok.Token]) -> Tuple[List[str], List[List[str]]]

# Deserialize strings into Token objects.
tokenizer.deserialize_tokens(
    tokens: List[str],
    features: Optional[List[List[str]]] = None
) -> List[pyonmttok.Token]

Utilities

Interface

# Returns True if the string has the placeholder format.
pyonmttok.is_placeholder(token: str)

# Sets the random seed for reproducible tokenization.
pyonmttok.set_random_seed(seed: int)

