
pyonmttok

pyonmttok is the Python wrapper for OpenNMT/Tokenizer, a fast and customizable text tokenization library with BPE and SentencePiece support.

Installation:

pip install pyonmttok

Requirements:

  • OS: Linux, macOS
  • Python version: >= 3.5
  • pip version: >= 19.0

Table of contents

  1. Tokenization
  2. Subword learning
  3. Token API
  4. Utilities

Tokenization

Example

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'

Interface

Constructor

tokenizer = pyonmttok.Tokenizer(
    mode: str,
    *,
    lang: Optional[str] = None,
    bpe_model_path: Optional[str] = None,
    bpe_dropout: float = 0,
    vocabulary_path: Optional[str] = None,
    vocabulary_threshold: int = 0,
    sp_model_path: Optional[str] = None,
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "■",
    joiner_annotate: bool = False,
    joiner_new: bool = False,
    support_prior_joiners: bool = False,
    spacer_annotate: bool = False,
    spacer_new: bool = False,
    case_feature: bool = False,
    case_markup: bool = False,
    soft_case_regions: bool = False,
    no_substitution: bool = False,
    preserve_placeholders: bool = False,
    preserve_segmented_tokens: bool = False,
    segment_case: bool = False,
    segment_numbers: bool = False,
    segment_alphabet_change: bool = False,
    segment_alphabet: Optional[List[str]] = None,
)

# SentencePiece-compatible tokenizer.
tokenizer = pyonmttok.SentencePieceTokenizer(
    model_path: str,
    vocabulary_path: Optional[str] = None,
    vocabulary_threshold: int = 0,
    nbest_size: int = 0,
    alpha: float = 0.1,
)

# Copy constructor.
tokenizer = pyonmttok.Tokenizer(tokenizer: pyonmttok.Tokenizer)

# Return the tokenization options (excluding subword-related options).
tokenizer.options

See the documentation for a description of each tokenization option.

Tokenization

# By default, tokenize returns the tokens and features.
# When as_token_objects=True, the method returns Token objects (see below).
# When training=False, subword regularization such as BPE dropout is disabled.
tokenizer.tokenize(
    text: str,
    as_token_objects: bool = False,
    training: bool = True,
) -> Union[Tuple[List[str], Optional[List[List[str]]]], List[pyonmttok.Token]]

# Tokenize a file.
tokenizer.tokenize_file(
    input_path: str,
    output_path: str,
    num_threads: int = 1,
    verbose: bool = False,
    training: bool = True,
)

Detokenization

# The detokenize method converts a list of tokens back to a string.
tokenizer.detokenize(
    tokens: List[str],
    features: Optional[List[List[str]]] = None,
) -> str
tokenizer.detokenize(tokens: List[pyonmttok.Token]) -> str

# The detokenize_with_ranges method also returns a dictionary mapping a token
# index to a range in the detokenized text.
# Set merge_ranges=True to merge consecutive ranges, e.g. subwords of the same
# token in case of subword tokenization.
# Set unicode_ranges=True to return ranges over Unicode characters instead of bytes.
tokenizer.detokenize_with_ranges(
    tokens: Union[List[str], List[pyonmttok.Token]],
    merge_ranges: bool = False,
    unicode_ranges: bool = False,
) -> Tuple[str, Dict[int, Tuple[int, int]]]

# Detokenize a file.
tokenizer.detokenize_file(input_path: str, output_path: str)

Subword learning

Example

The Python wrapper supports BPE and SentencePiece subword learning through a common interface:

1. Create the subword learner with the tokenization you want to apply, e.g.:

# BPE is trained and applied on the tokenization output before joiner (or spacer) annotations.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

# SentencePiece can learn from raw sentences, so a tokenizer is not required.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.98)

2. Feed some raw data:

# Feed detokenized sentences:
learner.ingest("Hello world!")
learner.ingest("How are you?")

# or detokenized text files:
learner.ingest_file("/data/train1.en")
learner.ingest_file("/data/train2.en")

3. Start the learning process:

tokenizer = learner.learn("/data/model-32k")

The returned tokenizer instance can be used to apply subword tokenization on new data.

Interface

# See https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py
# for argument documentation.
learner = pyonmttok.BPELearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "space".
    symbols: int = 10000,
    min_frequency: int = 2,
    total_symbols: bool = False,
)

# See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc
# for available training options.
learner = pyonmttok.SentencePieceLearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "none".
    keep_vocab: bool = False,  # Keep the generated vocabulary (model_path will act like model_prefix in spm_train)
    **training_options,
)

learner.ingest(text: str)
learner.ingest_file(path: str)
learner.ingest_token(token: Union[str, pyonmttok.Token])

learner.learn(model_path: str, verbose: bool = False) -> pyonmttok.Tokenizer

Token API

The Token API tokenizes text into pyonmttok.Token objects. It is useful for applying logic at the token level while retaining enough information to write the tokenization to disk or detokenize it later.

Example

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.tokenize("Hello World!", as_token_objects=True)
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokens[-1].surface
'!'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']
>>> tokens[-1].surface = '.'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■.']
>>> tokenizer.detokenize(tokens)
'Hello World.'

Interface

The pyonmttok.Token class has the following attributes:

  • surface: a string, the token value
  • type: a pyonmttok.TokenType value, the type of the token
  • join_left: a boolean, whether the token should be joined to the token on the left or not
  • join_right: a boolean, whether the token should be joined to the token on the right or not
  • preserve: a boolean, whether joiners and spacers can be attached to this token or not
  • features: a list of string, the features attached to the token
  • spacer: a boolean, whether the token is prefixed by a SentencePiece spacer or not (only set when using SentencePiece)
  • casing: a pyonmttok.Casing value, the casing of the token (only set when tokenizing with case_feature or case_markup)

The pyonmttok.TokenType enumeration is used to identify tokens that were split by a subword tokenization. The enumeration has the following values:

  • TokenType.WORD
  • TokenType.LEADING_SUBWORD
  • TokenType.TRAILING_SUBWORD

The pyonmttok.Casing enumeration is used to identify the original casing of a token that was lowercased by the case_feature or case_markup tokenization options. The enumeration has the following values:

  • Casing.LOWERCASE
  • Casing.UPPERCASE
  • Casing.MIXED
  • Casing.CAPITALIZED
  • Casing.NONE

The Tokenizer instances provide methods to serialize or deserialize Token objects:

# Serialize Token objects to strings that can be saved on disk.
tokenizer.serialize_tokens(
    tokens: List[pyonmttok.Token],
) -> Tuple[List[str], Optional[List[List[str]]]]

# Deserialize strings into Token objects.
tokenizer.deserialize_tokens(
    tokens: List[str],
    features: Optional[List[List[str]]] = None,
) -> List[pyonmttok.Token]

Utilities

Interface

# Returns True if the string has the placeholder format.
pyonmttok.is_placeholder(token: str)

# Sets the random seed for reproducible tokenization.
pyonmttok.set_random_seed(seed: int)
