
OpenNMT tokenization library

Project description

Python API

pip install pyonmttok

Requirements:

  • Python version: >= 3.5

Table of contents

  1. Tokenization
  2. Subword learning
  3. Token API
  4. Utilities

Tokenization

Example

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']

Interface

Constructor

tokenizer = pyonmttok.Tokenizer(
    mode: str,
    *,
    bpe_model_path: str = "",
    bpe_dropout: float = 0,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    sp_model_path: str = "",
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "■",
    joiner_annotate: bool = False,
    joiner_new: bool = False,
    support_prior_joiners: bool = False,
    spacer_annotate: bool = False,
    spacer_new: bool = False,
    case_feature: bool = False,
    case_markup: bool = False,
    soft_case_regions: bool = False,
    no_substitution: bool = False,
    preserve_placeholders: bool = False,
    preserve_segmented_tokens: bool = False,
    segment_case: bool = False,
    segment_numbers: bool = False,
    segment_alphabet_change: bool = False,
    segment_alphabet: Optional[List[str]] = None)

# SentencePiece-compatible tokenizer.
tokenizer = pyonmttok.SentencePieceTokenizer(
    model_path: str,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    nbest_size: int = 0,
    alpha: float = 0.1,
)

# Copy constructor.
tokenizer = pyonmttok.Tokenizer(tokenizer: pyonmttok.Tokenizer)

# Return the tokenization options (excluding subword-related options).
tokenizer.options

See the documentation for a description of each tokenization option.

Tokenization

# By default, tokenize returns the tokens and features.
tokenizer.tokenize(text: str) -> Tuple[List[str], List[List[str]]]

# Set as_token_objects=True to return Token objects instead (see below).
tokenizer.tokenize(text: str, as_token_objects=True) -> List[pyonmttok.Token]

# Tokenize a file.
tokenizer.tokenize_file(
    input_path: str,
    output_path: str,
    num_threads: int = 1,
    verbose: bool = False,
)

Detokenization

# The detokenize method converts tokens back to a string.
tokenizer.detokenize(
    tokens: Union[List[str], List[pyonmttok.Token]],
    features: Optional[List[List[str]]] = None
) -> str

# The detokenize_with_ranges method also returns a dictionary mapping a token
# index to a range in the detokenized text.
# Set merge_ranges=True to merge consecutive ranges, e.g. subwords of the same
# token in case of subword tokenization.
# Set unicode_ranges=True to return ranges over Unicode characters instead of bytes.
tokenizer.detokenize_with_ranges(
    tokens: Union[List[str], List[pyonmttok.Token]],
    merge_ranges: bool = True,
    unicode_ranges: bool = True
) -> Tuple[str, Dict[int, Pair[int, int]]]

# Detokenize a file.
tokenizer.detokenize_file(input_path: str, output_path: str)

Subword learning

Example

The Python wrapper supports BPE and SentencePiece subword learning through a common interface:

1. Create the subword learner with the tokenization you want to apply, e.g.:

# BPE is trained and applied on the tokenization output before joiner (or spacer) annotations.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

# SentencePiece can learn from raw sentences, so a tokenizer is not required.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.98)

2. Feed some raw data:

# Feed detokenized sentences:
learner.ingest("Hello world!")
learner.ingest("How are you?")

# or detokenized text files:
learner.ingest_file("/data/train1.en")
learner.ingest_file("/data/train2.en")

3. Start the learning process:

tokenizer = learner.learn("/data/model-32k")

The returned tokenizer instance can be used to apply subword tokenization on new data.

Interface

# See https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py
# for argument documentation.
learner = pyonmttok.BPELearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "space".
    symbols: int = 10000,
    min_frequency: int = 2,
    total_symbols: bool = False)

# See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc
# for available training options.
learner = pyonmttok.SentencePieceLearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "none".
    keep_vocab: bool = False,  # Keep the generated vocabulary (model_path will act like model_prefix in spm_train)
    **training_options)

learner.ingest(text: str)
learner.ingest_file(path: str)
learner.ingest_token(token: Union[str, pyonmttok.Token])

learner.learn(model_path: str, verbose: bool = False) -> pyonmttok.Tokenizer

Token API

The Token API tokenizes text into pyonmttok.Token objects. It is useful for applying logic at the token level while retaining enough information to write the tokenization to disk or detokenize.

Example

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.tokenize("Hello World!", as_token_objects=True)
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokens[-1].surface
'!'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']
>>> tokens[-1].surface = '.'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■.']
>>> tokenizer.detokenize(tokens)
'Hello World.'

Interface

The pyonmttok.Token class has the following attributes:

  • surface: a string, the token value
  • type: a pyonmttok.TokenType value, the type of the token
  • join_left: a boolean, whether the token should be joined to the token on the left or not
  • join_right: a boolean, whether the token should be joined to the token on the right or not
  • preserve: a boolean, whether joiners and spacers can be attached to this token or not
  • features: a list of strings, the features attached to the token
  • spacer: a boolean, whether the token is prefixed by a SentencePiece spacer or not (only set when using SentencePiece)
  • casing: a pyonmttok.Casing value, the casing of the token (only set when tokenizing with case_feature or case_markup)

The pyonmttok.TokenType enumeration is used to identify tokens that were split by a subword tokenization. The enumeration has the following values:

  • TokenType.WORD
  • TokenType.LEADING_SUBWORD
  • TokenType.TRAILING_SUBWORD

The pyonmttok.Casing enumeration is used to identify the original casing of a token that was lowercased by the case_feature or case_markup tokenization options. The enumeration has the following values:

  • Casing.LOWERCASE
  • Casing.UPPERCASE
  • Casing.MIXED
  • Casing.CAPITALIZED
  • Casing.NONE

Tokenizer instances provide methods to serialize or deserialize Token objects:

# Serialize Token objects to strings that can be saved on disk.
tokenizer.serialize_tokens(tokens: List[pyonmttok.Token]) -> Tuple[List[str], List[List[str]]]

# Deserialize strings into Token objects.
tokenizer.deserialize_tokens(
    tokens: List[str],
    features: Optional[List[List[str]]] = None
) -> List[pyonmttok.Token]

Utilities

Interface

# Returns True if the string has the placeholder format.
pyonmttok.is_placeholder(token: str) -> bool

# Sets the random seed for reproducible tokenization.
pyonmttok.set_random_seed(seed: int)
