Skip to main content

Fast and customizable text tokenization library with BPE and SentencePiece support

Project description

pyonmttok

pyonmttok is the Python wrapper for OpenNMT/Tokenizer, a fast and customizable text tokenization library with BPE and SentencePiece support.

Installation:

pip install pyonmttok

Requirements:

  • OS: Linux, macOS
  • Python version: >= 3.5

Table of contents

  1. Tokenization
  2. Subword learning
  3. Token API
  4. Utilities

Tokenization

Example

>>> import pyonmtok
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'

Interface

Constructor

tokenizer = pyonmttok.Tokenizer(
    mode: str,
    *,
    bpe_model_path: str = "",
    bpe_dropout: float = 0,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    sp_model_path: str = "",
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "■",
    joiner_annotate: bool = False,
    joiner_new: bool = False,
    support_prior_joiners: bool = False,
    spacer_annotate: bool = False,
    spacer_new: bool = False,
    case_feature: bool = False,
    case_markup: bool = False,
    soft_case_regions: bool = False,
    no_substitution: bool = False,
    preserve_placeholders: bool = False,
    preserve_segmented_tokens: bool = False,
    segment_case: bool = False,
    segment_numbers: bool = False,
    segment_alphabet_change: bool = False,
    segment_alphabet: Optional[List[str]] = None)

# SentencePiece-compatible tokenizer.
tokenizer = pyonmttok.SentencePieceTokenizer(
    model_path: str,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    nbest_size: int = 0,
    alpha: float = 0.1,
)

# Copy constructor.
tokenizer = pyonmttok.Tokenizer(tokenizer: pyonmttok.Tokenizer)

# Return the tokenization options (excluding options related to subword).
tokenizer.options

See the documentation for a description of each tokenization option.

Tokenization

# By default, tokenize returns the tokens and features.
# When training=False, subword regularization such as BPE dropout is disabled.
tokenizer.tokenize(
    text: str,
    training: bool = True,
) -> Tuple[List[str], List[List[str]]]

# The as_token_objects flag can alternatively return Token objects (see below).
tokenizer.tokenize(
    text: str,
    as_token_objects: bool = True,
    training: bool = True,
) -> List[pyonmttok.Token]

# Tokenize a file.
tokenizer.tokenize_file(
    input_path: str,
    output_path: str,
    num_threads: int = 1,
    verbose: bool = False,
    training: bool = True,
)

Detokenization

# The detokenize method converts a list of tokens back to a string.
tokenizer.detokenize(
    tokens: Union[List[str], List[pyonmttok.Token]],
    features: Optional[List[List[str]]] = None
) -> str

# The detokenize_with_ranges method also returns a dictionary mapping a token
# index to a range in the detokenized text.
# Set merge_ranges=True to merge consecutive ranges, e.g. subwords of the same
# token in case of subword tokenization.
# Set unicode_ranges=True to return ranges over Unicode characters instead of bytes.
tokenizer.detokenize_with_ranges(
    tokens: Union[List[str], List[pyonmttok.Token]],
    merge_ranges: bool = True,
    unicode_ranges: bool = True
) -> Tuple[str, Dict[int, Pair[int, int]]]

# Detokenize a file.
tokenizer.detokenize_file(input_path: str, output_path: str)

Subword learning

Example

The Python wrapper supports BPE and SentencePiece subword learning through a common interface:

1. Create the subword learner with the tokenization you want to apply, e.g.:

# BPE is trained and applied on the tokenization output before joiner (or spacer) annotations.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

# SentencePiece can learn from raw sentences so a tokenizer in not required.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.98)

2. Feed some raw data:

# Feed detokenized sentences:
learner.ingest("Hello world!")
learner.ingest("How are you?")

# or detokenized text files:
learner.ingest_file("/data/train1.en")
learner.ingest_file("/data/train2.en")

3. Start the learning process:

tokenizer = learner.learn("/data/model-32k")

The returned tokenizer instance can be used to apply subword tokenization on new data.

Interface

# See https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py
# for argument documentation.
learner = pyonmttok.BPELearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "space".
    symbols: int = 10000,
    min_frequency: int = 2,
    total_symbols: bool = False)

# See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc
# for available training options.
learner = pyonmttok.SentencePieceLearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "none".
    keep_vocab: bool = False,  # Keep the generated vocabulary (model_path will act like model_prefix in spm_train)
    **training_options)

learner.ingest(text: str)
learner.ingest_file(path: str)
learner.ingest_token(token: Union[str, pyonmttok.Token])

learner.learn(model_path: str, verbose: bool = False) -> pyonmttok.Tokenizer

Token API

The Token API allows to tokenize text into pyonmttok.Token objects. This API can be useful to apply some logics at the token level but still retain enough information to write the tokenization on disk or detokenize.

Example

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.tokenize("Hello World!", as_token_objects=True)
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokens[-1].surface
'!'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']
>>> tokens[-1].surface = '.'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■.']
>>> tokenizer.detokenize(tokens)
'Hello World.'

Interface

The pyonmttok.Token class has the following attributes:

  • surface: a string, the token value
  • type: a pyonmttok.TokenType value, the type of the token
  • join_left: a boolean, whether the token should be joined to the token on the left or not
  • join_right: a boolean, whether the token should be joined to the token on the right or not
  • preserve: a boolean, whether joiners and spacers can be attached to this token or not
  • features: a list of string, the features attached to the token
  • spacer: a boolean, whether the token is prefixed by a SentencePiece spacer or not (only set when using SentencePiece)
  • casing: a pyonmttok.Casing value, the casing of the token (only set when tokenizing with case_feature or case_markup)

The pyonmttok.TokenType enumeration is used to identify tokens that were split by a subword tokenization. The enumeration has the following values:

  • TokenType.WORD
  • TokenType.LEADING_SUBWORD
  • TokenType.TRAILING_SUBWORD

The pyonmttok.Casing enumeration is used to identify the original casing of a token that was lowercased by the case_feature or case_markup tokenization options. The enumeration has the following values:

  • Casing.LOWERCASE
  • Casing.UPPERCASE
  • Casing.MIXED
  • Casing.CAPITALIZED
  • Casing.NONE

The Tokenizer instances provide methods to serialize or deserialize Token objects:

# Serialize Token objects to strings that can be saved on disk.
tokenizer.serialize_tokens(tokens: List[pyonmttok.Token]) -> Tuple[List[str], List[List[str]]]

# Deserialize strings into Token objects.
tokenizer.deserialize_tokens(
    tokens: List[str],
    features: Optional[List[List[str]]] = None
) -> List[pyonmttok.Token]

Utilities

Interface

# Returns True if the string has the placeholder format.
pyonmttok.is_placeholder(token: str)

# Sets the random seed for reproducible tokenization.
pyonmttok.set_random_seed(seed: int)

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

pyonmttok-1.25.0-cp39-cp39-manylinux1_x86_64.whl (2.6 MB view details)

Uploaded CPython 3.9

pyonmttok-1.25.0-cp39-cp39-macosx_10_9_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.9macOS 10.9+ x86-64

pyonmttok-1.25.0-cp38-cp38-manylinux1_x86_64.whl (2.6 MB view details)

Uploaded CPython 3.8

pyonmttok-1.25.0-cp38-cp38-macosx_10_9_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.8macOS 10.9+ x86-64

pyonmttok-1.25.0-cp37-cp37m-manylinux1_x86_64.whl (2.6 MB view details)

Uploaded CPython 3.7m

pyonmttok-1.25.0-cp37-cp37m-macosx_10_9_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.7mmacOS 10.9+ x86-64

pyonmttok-1.25.0-cp36-cp36m-manylinux1_x86_64.whl (2.6 MB view details)

Uploaded CPython 3.6m

pyonmttok-1.25.0-cp36-cp36m-macosx_10_9_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.6mmacOS 10.9+ x86-64

pyonmttok-1.25.0-cp35-cp35m-manylinux1_x86_64.whl (2.6 MB view details)

Uploaded CPython 3.5m

pyonmttok-1.25.0-cp35-cp35m-macosx_10_9_x86_64.whl (1.4 MB view details)

Uploaded CPython 3.5mmacOS 10.9+ x86-64

File details

Details for the file pyonmttok-1.25.0-cp39-cp39-manylinux1_x86_64.whl.

File metadata

  • Download URL: pyonmttok-1.25.0-cp39-cp39-manylinux1_x86_64.whl
  • Upload date:
  • Size: 2.6 MB
  • Tags: CPython 3.9
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/54.1.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for pyonmttok-1.25.0-cp39-cp39-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 acb1ae938fae49abdf76496d0dd4da7b276dc7300d9a22a1f3d2d93459a75feb
MD5 c41b778ee10d343ceb8330f7719e3f46
BLAKE2b-256 af9fb02ac11a6cf7993ab0841748175486641ace48dfbe24f45748e423fef73e

See more details on using hashes here.

File details

Details for the file pyonmttok-1.25.0-cp39-cp39-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: pyonmttok-1.25.0-cp39-cp39-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 1.4 MB
  • Tags: CPython 3.9, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/54.1.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for pyonmttok-1.25.0-cp39-cp39-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 bbaf3bd85f74005cc02ada89638045cbb0b18e027844c86dd63dc0675af5d955
MD5 d05e3600aba8e7ea48e5525e03588700
BLAKE2b-256 2fce7096ff5e200e25c76e5d5e2f3631310fb7c570ed982533d9aa5c2c718b07

See more details on using hashes here.

File details

Details for the file pyonmttok-1.25.0-cp38-cp38-manylinux1_x86_64.whl.

File metadata

  • Download URL: pyonmttok-1.25.0-cp38-cp38-manylinux1_x86_64.whl
  • Upload date:
  • Size: 2.6 MB
  • Tags: CPython 3.8
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/54.1.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for pyonmttok-1.25.0-cp38-cp38-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 88d07315aff07ac399bbdec8d5c41c839a56d9ee0377f5368f5b8142575a7f1d
MD5 50dc3dc471373ba37376448babdff88e
BLAKE2b-256 6329b634c6cb281d63c74c1c70b00d1c4b4204ce324c5a64e6082e5fa38b39ba

See more details on using hashes here.

File details

Details for the file pyonmttok-1.25.0-cp38-cp38-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: pyonmttok-1.25.0-cp38-cp38-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 1.4 MB
  • Tags: CPython 3.8, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/54.1.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for pyonmttok-1.25.0-cp38-cp38-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 cdd18b91f08b62ef3e7b0f7adbf55dcb4cbdcc1d599014c619c92b1c5d1f8a42
MD5 77ef0b5d6178de5bbc0b32aa8f65a41b
BLAKE2b-256 5330cca18639366f9305dd65e2520737dfa50817cc75e9a0cdb181c9fbbd7e40

See more details on using hashes here.

File details

Details for the file pyonmttok-1.25.0-cp37-cp37m-manylinux1_x86_64.whl.

File metadata

  • Download URL: pyonmttok-1.25.0-cp37-cp37m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 2.6 MB
  • Tags: CPython 3.7m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/54.1.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for pyonmttok-1.25.0-cp37-cp37m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 543b6855304c14a75feb91f15f31f91f874ce158344f0bbe2b97066912ecae0c
MD5 bbab0f25de2e37a047d4fd06baf5956f
BLAKE2b-256 3f6317c6ac0d8a0cfa5ff7257e52edb6759d12dc266392f6c97f5c65c0c7238c

See more details on using hashes here.

File details

Details for the file pyonmttok-1.25.0-cp37-cp37m-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: pyonmttok-1.25.0-cp37-cp37m-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 1.4 MB
  • Tags: CPython 3.7m, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/54.1.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for pyonmttok-1.25.0-cp37-cp37m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 ade69c8dc37e1fd2ed0a2e2158ce975579fdf2eca588d2d3a5e6d026320d033a
MD5 f7537e488f8a15d7b92a8071b7b57f0f
BLAKE2b-256 95bdd47ad1bbb57a504896b72a4be9fef101db4b0a6f3d42111409c943aab190

See more details on using hashes here.

File details

Details for the file pyonmttok-1.25.0-cp36-cp36m-manylinux1_x86_64.whl.

File metadata

  • Download URL: pyonmttok-1.25.0-cp36-cp36m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 2.6 MB
  • Tags: CPython 3.6m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/54.1.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for pyonmttok-1.25.0-cp36-cp36m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 d3cee372e1ace4c9368fa6252e7e7f9f1878091b4d1d948dc43049d72eae3588
MD5 0816cfe65429e2bed47e9d333557918e
BLAKE2b-256 f595624409855941d5a1943264e550f060bfde7ec2d516b43a7bc922010a49f5

See more details on using hashes here.

File details

Details for the file pyonmttok-1.25.0-cp36-cp36m-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: pyonmttok-1.25.0-cp36-cp36m-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 1.4 MB
  • Tags: CPython 3.6m, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/54.1.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for pyonmttok-1.25.0-cp36-cp36m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 280128a11d22f43060cceb2b3bfa7ddf9c08428a62b075ff97814cd41377b77c
MD5 363752392754cbfb5f82c80aa89c2b11
BLAKE2b-256 6961b8fa90d52ae9ee457023e30cecb27c4e72023c6eb4145e21093d00e2b8f8

See more details on using hashes here.

File details

Details for the file pyonmttok-1.25.0-cp35-cp35m-manylinux1_x86_64.whl.

File metadata

  • Download URL: pyonmttok-1.25.0-cp35-cp35m-manylinux1_x86_64.whl
  • Upload date:
  • Size: 2.6 MB
  • Tags: CPython 3.5m
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/54.1.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for pyonmttok-1.25.0-cp35-cp35m-manylinux1_x86_64.whl
Algorithm Hash digest
SHA256 d327550eff753bb2d828d92d8740fa7938c66a9a9639113477fb7faaeff58c5a
MD5 8a80f86141c306696e2afc9b8fb51a3f
BLAKE2b-256 3bbe614a33a3ba7e9ea0e06d9cc7ed784cd98ec5857eb691fd5a803e1e925d39

See more details on using hashes here.

File details

Details for the file pyonmttok-1.25.0-cp35-cp35m-macosx_10_9_x86_64.whl.

File metadata

  • Download URL: pyonmttok-1.25.0-cp35-cp35m-macosx_10_9_x86_64.whl
  • Upload date:
  • Size: 1.4 MB
  • Tags: CPython 3.5m, macOS 10.9+ x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/54.1.1 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.9.2

File hashes

Hashes for pyonmttok-1.25.0-cp35-cp35m-macosx_10_9_x86_64.whl
Algorithm Hash digest
SHA256 1e639ab59bfcc6a930e414dec1cf5a2c786cfc8b1f425a29795cab9d25fa5a88
MD5 789b714be700b73344ead1fcd5706ae0
BLAKE2b-256 308d172a228ef1da14ca40a3f9924a89f100c938744da9ce19accd7a63675cc1

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page