Fast and customizable text tokenization library with BPE and SentencePiece support

pyonmttok

pyonmttok is the Python wrapper for OpenNMT/Tokenizer, a fast and customizable text tokenization library with BPE and SentencePiece support.

Installation:

pip install pyonmttok

Requirements:

  • OS: Linux, macOS
  • Python version: >= 3.5

Table of contents

  1. Tokenization
  2. Subword learning
  3. Token API
  4. Utilities

Tokenization

Example

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'
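
The ■ joiner records that a token was attached to its neighbor in the original text, which is what makes the tokenization reversible. The following pure-Python sketch illustrates that idea only; simple_detokenize is a hypothetical helper, not the library's implementation, and it assumes plain string tokens annotated with the default joiner.

```python
JOINER = "■"  # default joiner character shown in the constructor signature


def simple_detokenize(tokens, joiner=JOINER):
    """Rebuild text from joiner-annotated tokens (illustrative only)."""
    text = ""
    glue_next = False  # True when the previous token ended with a joiner
    for token in tokens:
        join_left = token.startswith(joiner)
        join_right = token.endswith(joiner) and len(token) > len(joiner)
        surface = token
        if join_left:
            surface = surface[len(joiner):]
        if join_right:
            surface = surface[:-len(joiner)]
        # Insert a space unless a joiner glues this token to the previous one.
        if text and not (join_left or glue_next):
            text += " "
        text += surface
        glue_next = join_right
    return text


print(simple_detokenize(["Hello", "World", "■!"]))  # Hello World!
```

For real data, use tokenizer.detokenize, which also handles spacers, case markup, and placeholders.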

Interface

Constructor

tokenizer = pyonmttok.Tokenizer(
    mode: str,
    *,
    lang: str = "",
    bpe_model_path: str = "",
    bpe_dropout: float = 0,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    sp_model_path: str = "",
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "■",
    joiner_annotate: bool = False,
    joiner_new: bool = False,
    support_prior_joiners: bool = False,
    spacer_annotate: bool = False,
    spacer_new: bool = False,
    case_feature: bool = False,
    case_markup: bool = False,
    soft_case_regions: bool = False,
    no_substitution: bool = False,
    preserve_placeholders: bool = False,
    preserve_segmented_tokens: bool = False,
    segment_case: bool = False,
    segment_numbers: bool = False,
    segment_alphabet_change: bool = False,
    segment_alphabet: Optional[List[str]] = None)

# SentencePiece-compatible tokenizer.
tokenizer = pyonmttok.SentencePieceTokenizer(
    model_path: str,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    nbest_size: int = 0,
    alpha: float = 0.1,
)

# Copy constructor.
tokenizer = pyonmttok.Tokenizer(tokenizer: pyonmttok.Tokenizer)

# Return the tokenization options (excluding subword-related options).
tokenizer.options

See the documentation for a description of each tokenization option.

Tokenization

# By default, tokenize returns the tokens and features.
# When training=False, subword regularization such as BPE dropout is disabled.
tokenizer.tokenize(
    text: str,
    training: bool = True,
) -> Tuple[List[str], List[List[str]]]

# The as_token_objects flag can alternatively return Token objects (see below).
tokenizer.tokenize(
    text: str,
    as_token_objects: bool = True,
    training: bool = True,
) -> List[pyonmttok.Token]

# Tokenize a file.
tokenizer.tokenize_file(
    input_path: str,
    output_path: str,
    num_threads: int = 1,
    verbose: bool = False,
    training: bool = True,
)

Detokenization

# The detokenize method converts a list of tokens back to a string.
tokenizer.detokenize(
    tokens: Union[List[str], List[pyonmttok.Token]],
    features: Optional[List[List[str]]] = None
) -> str

# The detokenize_with_ranges method also returns a dictionary mapping a token
# index to a range in the detokenized text.
# Set merge_ranges=True to merge consecutive ranges, e.g. subwords of the same
# token in case of subword tokenization.
# Set unicode_ranges=True to return ranges over Unicode characters instead of bytes.
tokenizer.detokenize_with_ranges(
    tokens: Union[List[str], List[pyonmttok.Token]],
    merge_ranges: bool = True,
    unicode_ranges: bool = True
) -> Tuple[str, Dict[int, Tuple[int, int]]]

# Detokenize a file.
tokenizer.detokenize_file(input_path: str, output_path: str)
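
The distinction between byte ranges and Unicode character ranges only matters for non-ASCII text. The sketch below does not call the library; it converts character offsets into UTF-8 byte offsets to show how the two can diverge. char_range_to_byte_range is a hypothetical helper, and the inclusive (start, end) convention is an assumption made for this illustration.

```python
def char_range_to_byte_range(text, start, end):
    """Map an inclusive character range to an inclusive UTF-8 byte range."""
    byte_start = len(text[:start].encode("utf-8"))
    byte_end = len(text[: end + 1].encode("utf-8")) - 1
    return byte_start, byte_end


text = "caf\u00e9 noir"  # "café noir" with a precomposed é
# "café" spans characters 0..3 but bytes 0..4, because "é" takes 2 bytes.
print(char_range_to_byte_range(text, 0, 3))  # (0, 4)
# "noir" spans characters 5..8 and bytes 6..9.
print(char_range_to_byte_range(text, 5, 8))  # (6, 9)
```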

Subword learning

Example

The Python wrapper supports BPE and SentencePiece subword learning through a common interface:

1. Create the subword learner with the tokenization you want to apply, e.g.:

# BPE is trained and applied on the tokenization output before joiner (or spacer) annotations.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

# SentencePiece can learn from raw sentences so a tokenizer is not required.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.98)

2. Feed some raw data:

# Feed detokenized sentences:
learner.ingest("Hello world!")
learner.ingest("How are you?")

# or detokenized text files:
learner.ingest_file("/data/train1.en")
learner.ingest_file("/data/train2.en")

3. Start the learning process:

tokenizer = learner.learn("/data/model-32k")

The returned tokenizer instance can be used to apply subword tokenization on new data.
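
For intuition about what a BPE learner computes from the ingested text, here is a toy pure-Python sketch of the merge-learning loop. It is not the pyonmttok or subword-nmt implementation: learn_bpe_merges is a hypothetical function, its num_merges and min_frequency arguments only loosely mirror symbols and min_frequency, and details such as end-of-word markers and tie-breaking are simplified.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges, min_frequency=2):
    """Learn up to num_merges BPE merge operations from a list of words."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair; break ties deterministically.
        best, count = max(pairs.items(), key=lambda item: (item[1], item[0]))
        if count < min_frequency:
            break
        merges.append(best)
        # Rewrite every word, replacing the chosen pair by a merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


corpus = "low low low lower lowest".split()
print(learn_bpe_merges(corpus, num_merges=2))  # [('o', 'w'), ('l', 'ow')]
```

In practice, BPELearner performs this kind of learning on the output of the configured tokenizer and writes the resulting model to the path given to learn.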

Interface

# See https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py
# for argument documentation.
learner = pyonmttok.BPELearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "space".
    symbols: int = 10000,
    min_frequency: int = 2,
    total_symbols: bool = False)

# See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc
# for available training options.
learner = pyonmttok.SentencePieceLearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "none".
    keep_vocab: bool = False,  # Keep the generated vocabulary (model_path will act like model_prefix in spm_train)
    **training_options)

learner.ingest(text: str)
learner.ingest_file(path: str)
learner.ingest_token(token: Union[str, pyonmttok.Token])

learner.learn(model_path: str, verbose: bool = False) -> pyonmttok.Tokenizer

Token API

The Token API tokenizes text into pyonmttok.Token objects. It is useful for applying logic at the token level while retaining enough information to write the tokenization to disk or to detokenize.

Example

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.tokenize("Hello World!", as_token_objects=True)
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokens[-1].surface
'!'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']
>>> tokens[-1].surface = '.'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■.']
>>> tokenizer.detokenize(tokens)
'Hello World.'

Interface

The pyonmttok.Token class has the following attributes:

  • surface: a string, the token value
  • type: a pyonmttok.TokenType value, the type of the token
  • join_left: a boolean, whether the token should be joined to the token on the left or not
  • join_right: a boolean, whether the token should be joined to the token on the right or not
  • preserve: a boolean, whether joiners and spacers can be attached to this token or not
  • features: a list of strings, the features attached to the token
  • spacer: a boolean, whether the token is prefixed by a SentencePiece spacer or not (only set when using SentencePiece)
  • casing: a pyonmttok.Casing value, the casing of the token (only set when tokenizing with case_feature or case_markup)

The pyonmttok.TokenType enumeration is used to identify tokens that were split by subword tokenization. The enumeration has the following values:

  • TokenType.WORD
  • TokenType.LEADING_SUBWORD
  • TokenType.TRAILING_SUBWORD

The pyonmttok.Casing enumeration is used to identify the original casing of a token that was lowercased by the case_feature or case_markup tokenization options. The enumeration has the following values:

  • Casing.LOWERCASE
  • Casing.UPPERCASE
  • Casing.MIXED
  • Casing.CAPITALIZED
  • Casing.NONE
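
To make the five casing categories concrete, the sketch below classifies a word with plain Python string methods. It defines its own local Casing enum that only mirrors the value names listed above, and guess_casing is a hypothetical helper; how the library resolves edge cases (for example single-letter words) is not specified here, so the rules are assumptions.

```python
from enum import Enum


class Casing(Enum):
    """Local mirror of the value names listed above (not pyonmttok.Casing)."""
    NONE = 0
    LOWERCASE = 1
    UPPERCASE = 2
    MIXED = 3
    CAPITALIZED = 4


def guess_casing(word):
    """Classify the casing of a word (assumed rules, for illustration)."""
    # Only cased characters matter; digits and punctuation are ignored.
    letters = [c for c in word if c.isupper() or c.islower()]
    if not letters:
        return Casing.NONE
    if all(c.islower() for c in letters):
        return Casing.LOWERCASE
    if all(c.isupper() for c in letters):
        return Casing.UPPERCASE
    if letters[0].isupper() and all(c.islower() for c in letters[1:]):
        return Casing.CAPITALIZED
    return Casing.MIXED


for word in ["hello", "NASA", "Hello", "iPhone", "123"]:
    print(word, guess_casing(word).name)
```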

The Tokenizer instances provide methods to serialize or deserialize Token objects:

# Serialize Token objects to strings that can be saved on disk.
tokenizer.serialize_tokens(tokens: List[pyonmttok.Token]) -> Tuple[List[str], List[List[str]]]

# Deserialize strings into Token objects.
tokenizer.deserialize_tokens(
    tokens: List[str],
    features: Optional[List[List[str]]] = None
) -> List[pyonmttok.Token]
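
The round trip between token objects and annotated strings can be pictured with a small stand-in class. SimpleToken, serialize, and deserialize below are hypothetical names covering only the surface, join_left, and join_right attributes; this is a sketch of the concept under those assumptions, not the library's code.

```python
from dataclasses import dataclass

JOINER = "■"  # default joiner character


@dataclass
class SimpleToken:
    """Hypothetical stand-in with a subset of pyonmttok.Token attributes."""
    surface: str
    join_left: bool = False
    join_right: bool = False


def serialize(tokens, joiner=JOINER):
    """Turn token objects into joiner-annotated strings."""
    return [
        (joiner if t.join_left else "") + t.surface + (joiner if t.join_right else "")
        for t in tokens
    ]


def deserialize(strings, joiner=JOINER):
    """Parse joiner-annotated strings back into token objects."""
    tokens = []
    for s in strings:
        join_left = s.startswith(joiner)
        if join_left:
            s = s[len(joiner):]
        join_right = s.endswith(joiner)
        if join_right:
            s = s[:-len(joiner)]
        tokens.append(SimpleToken(s, join_left, join_right))
    return tokens


tokens = [SimpleToken("Hello"), SimpleToken("World"), SimpleToken("!", join_left=True)]
tokens[-1].surface = "."  # edit at the token level, then serialize
print(serialize(tokens))  # ['Hello', 'World', '■.']
print(deserialize(serialize(tokens)) == tokens)  # True
```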

Utilities

Interface

# Returns True if the string has the placeholder format.
pyonmttok.is_placeholder(token: str)

# Sets the random seed for reproducible tokenization.
pyonmttok.set_random_seed(seed: int)
