OpenNMT tokenization library

Python API

pip install pyonmttok

Requirements:

  • Python version: >= 3.5

Table of contents

  1. Tokenization
  2. Subword learning
  3. Token API
  4. Utilities

Tokenization

Example

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
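
The joiner marker in the output above is what keeps the tokenization reversible: a token that starts with the joiner attaches to the previous token, and a token that ends with it attaches to the next one. The sketch below illustrates that idea in pure Python. It is not the library's implementation: the function name join_annotated_tokens is hypothetical, and features, spacers, case markup, and placeholders are ignored.

```python
JOINER = "■"  # default joiner character listed in the Tokenizer constructor


def join_annotated_tokens(tokens, joiner=JOINER):
    """Rebuild text from joiner-annotated tokens (illustration only)."""
    pieces = []
    glue_next = False  # True when the previous token ended with a joiner
    for token in tokens:
        join_left = token.startswith(joiner)
        if join_left:
            token = token[len(joiner):]
        join_right = token.endswith(joiner)
        if join_right:
            token = token[:-len(joiner)]
        # Insert a space unless a joiner glues this token to the previous one.
        if pieces and not (join_left or glue_next):
            pieces.append(" ")
        pieces.append(token)
        glue_next = join_right
    return "".join(pieces)


print(join_annotated_tokens(["Hello", "World", "■!"]))  # Hello World!
```

In real code, tokenizer.detokenize should be preferred since it handles all tokenization options.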

Interface

Constructor

tokenizer = pyonmttok.Tokenizer(
    mode: str,
    *,
    bpe_model_path: str = "",
    bpe_dropout: float = 0,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    sp_model_path: str = "",
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "■",
    joiner_annotate: bool = False,
    joiner_new: bool = False,
    support_prior_joiners: bool = False,
    spacer_annotate: bool = False,
    spacer_new: bool = False,
    case_feature: bool = False,
    case_markup: bool = False,
    no_substitution: bool = False,
    preserve_placeholders: bool = False,
    preserve_segmented_tokens: bool = False,
    segment_case: bool = False,
    segment_numbers: bool = False,
    segment_alphabet_change: bool = False,
    segment_alphabet: Optional[List[str]] = None)

# Copy constructor.
tokenizer = pyonmttok.Tokenizer(tokenizer: pyonmttok.Tokenizer)

See the documentation for a description of each tokenization option.

Tokenization

# By default, tokenize returns the tokens and features.
tokenizer.tokenize(text: str) -> Tuple[List[str], List[List[str]]]

# The as_token_objects flag can alternatively return Token objects (see below).
tokenizer.tokenize(text: str, as_token_objects=True) -> List[pyonmttok.Token]

# Tokenize a file.
tokenizer.tokenize_file(input_path: str, output_path: str, num_threads: int = 1)

Detokenization

# The detokenize method converts tokens back to a string.
tokenizer.detokenize(
    tokens: Union[List[str], List[pyonmttok.Token]],
    features: Optional[List[List[str]]] = None
) -> str

# The detokenize_with_ranges method also returns a dictionary mapping a token
# index to a range in the detokenized text.
# Set merge_ranges=True to merge consecutive ranges, e.g. subwords of the same
# token in case of subword tokenization.
# Set unicode_ranges=True to return ranges over Unicode characters instead of bytes.
tokenizer.detokenize_with_ranges(
    tokens: Union[List[str], List[pyonmttok.Token]],
    merge_ranges: bool = True,
    unicode_ranges: bool = True
) -> Tuple[str, Dict[int, Tuple[int, int]]]

# Detokenize a file.
tokenizer.detokenize_file(input_path: str, output_path: str)
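
The unicode_ranges option matters as soon as the text contains non-ASCII characters, because byte offsets and character offsets then diverge. The standalone snippet below shows the difference with plain Python strings. It does not call pyonmttok, the sample text is arbitrary, and it assumes the byte ranges refer to UTF-8 encoded text.

```python
text = "H\u00e9llo w\u00f6rld"  # "Héllo wörld"
word = "w\u00f6rld"

# Offset counted in Unicode characters (code points).
char_start = text.index(word)

# Offset counted in UTF-8 bytes: "é" takes two bytes, which shifts the start by one.
byte_start = text.encode("utf-8").index(word.encode("utf-8"))

print(char_start, byte_start)  # 6 7
print(len(word), len(word.encode("utf-8")))  # 5 6
```

Character ranges are the ones that can be used directly to slice a Python str.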

Subword learning

Example

The Python wrapper supports BPE and SentencePiece subword learning through a common interface:

1. Create the subword learner with the tokenization you want to apply, e.g.:

# BPE is trained and applied on the tokenization output before joiner (or spacer) annotations.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

# SentencePiece can learn from raw sentences so a tokenizer is not required.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.98)

2. Feed some raw data:

# Feed detokenized sentences:
learner.ingest("Hello world!")
learner.ingest("How are you?")

# or detokenized text files:
learner.ingest_file("/data/train1.en")
learner.ingest_file("/data/train2.en")

3. Start the learning process:

tokenizer = learner.learn("/data/model-32k")

The returned tokenizer instance can be used to apply subword tokenization on new data.
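
To give an intuition of what the BPE learning step computes, here is a toy pure-Python learner that repeatedly merges the most frequent pair of adjacent symbols. It is a simplified sketch and not the code used by pyonmttok or subword-nmt: the function name learn_bpe_merges is hypothetical, there is no end-of-word marker and no min_frequency filter, and ties are broken arbitrarily but deterministically.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Toy BPE learner: returns the list of learned merge pairs, in order."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties are resolved by comparing the pairs themselves.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair by the merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe_merges(["low", "low", "lower", "lowest"], 2))
# [('o', 'w'), ('l', 'ow')]
```

The number of merges plays the role of the symbols argument of the BPE learner.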

Interface

# See https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py
# for argument documentation.
learner = pyonmttok.BPELearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "space".
    symbols: int = 10000,
    min_frequency: int = 2,
    total_symbols: bool = False)

# See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc
# for available training options.
learner = pyonmttok.SentencePieceLearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "none".
    keep_vocab: bool = False,  # Keep the generated vocabulary (model_path will act like model_prefix in spm_train)
    **training_options)

learner.ingest(text: str)
learner.ingest_file(path: str)
learner.ingest_token(token: Union[str, pyonmttok.Token])

learner.learn(model_path: str, verbose: bool = False) -> pyonmttok.Tokenizer
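
Because both learners share the ingest, ingest_file, and learn methods listed above, the training steps can be wrapped in one helper that works with either of them. The helper below is a sketch: train_subword_model and its parameters are hypothetical names, and only the three methods from the interface above are called.

```python
def train_subword_model(learner, model_path, sentences=(), files=()):
    """Feed data to a learner, then learn and return the resulting tokenizer.

    `learner` is expected to expose ingest, ingest_file and learn, e.g. a
    pyonmttok.BPELearner or a pyonmttok.SentencePieceLearner.
    """
    for sentence in sentences:
        learner.ingest(sentence)
    for path in files:
        learner.ingest_file(path)
    return learner.learn(model_path)


# Hypothetical usage (requires pyonmttok, existing data files, and a writable path):
# import pyonmttok
# learner = pyonmttok.BPELearner(symbols=32000)
# tokenizer = train_subword_model(
#     learner, "/data/model-32k", files=["/data/train1.en", "/data/train2.en"])
```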

Token API

The Token API tokenizes text into pyonmttok.Token objects. This API is useful for applying logic at the token level while retaining enough information to write the tokenization to disk or to detokenize.

Example

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.tokenize("Hello World!", as_token_objects=True)
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokens[-1].surface
'!'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']
>>> tokens[-1].surface = '.'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■.']
>>> tokenizer.detokenize(tokens)
'Hello World.'
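
As a further illustration of token-level logic, the sketch below replaces the surface of all-digit tokens with a mask while leaving the join flags untouched. The function name mask_digit_tokens and the mask value are hypothetical. So that the snippet runs without the library, SimpleNamespace objects stand in for pyonmttok.Token and carry only the surface and join_left attributes.

```python
from types import SimpleNamespace


def mask_digit_tokens(tokens, mask="<num>"):
    """Replace the surface of all-digit tokens in place; other attributes are untouched."""
    for token in tokens:
        if token.surface.isdigit():
            token.surface = mask
    return tokens


# Stand-ins for pyonmttok.Token objects (assumption: only these attributes are needed).
tokens = [
    SimpleNamespace(surface="Room", join_left=False),
    SimpleNamespace(surface="42", join_left=False),
    SimpleNamespace(surface="!", join_left=True),
]
mask_digit_tokens(tokens)
print([t.surface for t in tokens])  # ['Room', '<num>', '!']
```

With real Token objects, the modified list could then be passed to tokenizer.serialize_tokens or tokenizer.detokenize as in the example above.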

Interface

The pyonmttok.Token class has the following attributes:

  • surface: a string, the token value
  • type: a pyonmttok.TokenType value, the type of the token
  • join_left: a boolean, whether the token should be joined to the token on the left or not
  • join_right: a boolean, whether the token should be joined to the token on the right or not
  • preserve: a boolean, whether joiners and spacers can be attached to this token or not
  • features: a list of strings, the features attached to the token
  • spacer: a boolean, whether the token is prefixed by a SentencePiece spacer or not (only set when using SentencePiece)
  • casing: a pyonmttok.Casing value, the casing of the token (only set when tokenizing with case_feature or case_markup)

The pyonmttok.TokenType enumeration is used to identify tokens that were split by a subword tokenization. The enumeration has the following values:

  • TokenType.WORD
  • TokenType.LEADING_SUBWORD
  • TokenType.TRAILING_SUBWORD

The pyonmttok.Casing enumeration is used to identify the original casing of a token that was lowercased by the case_feature or case_markup tokenization options. The enumeration has the following values:

  • Casing.LOWERCASE
  • Casing.UPPERCASE
  • Casing.MIXED
  • Casing.CAPITALIZED
  • Casing.NONE
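
To make the five casing values concrete, the following sketch classifies a word with simple rules. These rules are an assumption for illustration and may differ from what the library does, in particular for single-letter words and for words mixing letters with other characters. The Casing class here is a local stand-in that only mirrors the names of pyonmttok.Casing, and guess_casing is a hypothetical function.

```python
from enum import Enum


class Casing(Enum):  # local stand-in mirroring the names of pyonmttok.Casing
    LOWERCASE = "lowercase"
    UPPERCASE = "uppercase"
    MIXED = "mixed"
    CAPITALIZED = "capitalized"
    NONE = "none"


def guess_casing(word):
    """Classify the casing of a word (assumed rules, not the library's)."""
    # Only characters that have a case take part in the decision.
    letters = [c for c in word if c.isupper() or c.islower()]
    if not letters:
        return Casing.NONE
    if all(c.islower() for c in letters):
        return Casing.LOWERCASE
    if letters[0].isupper() and all(c.islower() for c in letters[1:]):
        return Casing.CAPITALIZED
    if all(c.isupper() for c in letters):
        return Casing.UPPERCASE
    return Casing.MIXED


for word in ["hello", "Hello", "NASA", "iPhone", "42"]:
    print(word, guess_casing(word).name)
```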

The Tokenizer instances provide methods to serialize or deserialize Token objects:

# Serialize Token objects to strings that can be saved on disk.
tokenizer.serialize_tokens(tokens: List[pyonmttok.Token]) -> Tuple[List[str], List[List[str]]]

# Deserialize strings into Token objects.
tokenizer.deserialize_tokens(
    tokens: List[str],
    features: Optional[List[List[str]]] = None
) -> List[pyonmttok.Token]
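
Once tokens are serialized to strings, saving them on disk is plain text I/O. The helpers below sketch one possible layout, with one sentence per line and tokens separated by a single space. The layout and the function names write_token_lines and read_token_lines are assumptions, features are not handled, and the token lists are hard-coded here instead of coming from serialize_tokens.

```python
import os
import tempfile


def write_token_lines(path, sentences):
    """Write one sentence per line, tokens separated by a single space."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in sentences:
            f.write(" ".join(tokens) + "\n")


def read_token_lines(path):
    """Read back the token lists written by write_token_lines."""
    sentences = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            # Split on the plain space only, so other whitespace inside tokens survives.
            sentences.append(line.split(" ") if line else [])
    return sentences


original = [["Hello", "World", "■!"], ["How", "are", "you", "■?"]]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train.tok")
    write_token_lines(path, original)
    restored = read_token_lines(path)
print(restored == original)  # True
```

The restored string lists are in the form expected by the tokens argument of deserialize_tokens.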

Utilities

Interface

# Returns True if the string has the placeholder format.
pyonmttok.is_placeholder(token: str)

# Sets the random seed for reproducible tokenization.
pyonmttok.set_random_seed(seed: int)
