Fast and customizable text tokenization library with BPE and SentencePiece support

Project description

pyonmttok

pyonmttok is the Python wrapper for OpenNMT/Tokenizer, a fast and customizable text tokenization library with BPE and SentencePiece support.

Installation:

pip install pyonmttok

Requirements:

  • OS: Linux, macOS
  • Python version: >= 3.5
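
The requirements above can be checked before installing. The following is a minimal sketch using only the standard library; the helper name `is_supported` and its optional arguments are hypothetical and not part of pyonmttok.

```python
import platform
import sys

# platform.system() reports macOS as "Darwin".
SUPPORTED_SYSTEMS = ("Linux", "Darwin")
MIN_PYTHON = (3, 5)


def is_supported(system=None, version=None):
    """Return True if the OS and Python version match the listed requirements.

    Both arguments default to the values of the running interpreter.
    """
    system = platform.system() if system is None else system
    version = sys.version_info[:2] if version is None else tuple(version)[:2]
    return system in SUPPORTED_SYSTEMS and version >= MIN_PYTHON


if __name__ == "__main__":
    print("supported platform:", is_supported())
```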

Table of contents

  1. Tokenization
  2. Subword learning
  3. Token API
  4. Utilities

Tokenization

Example

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'
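
To illustrate why the joiner annotation makes the tokenization reversible, here is a pure-Python sketch of the idea. It is not the library's implementation: the function name `detokenize_with_joiners` is hypothetical, and it only handles the plain joiner marker shown in the example above.

```python
JOINER = "■"


def detokenize_with_joiners(tokens, joiner=JOINER):
    """Rebuild a string from tokens annotated with joiner markers.

    A leading joiner means "attach to the previous token" and a trailing
    joiner means "attach to the next token". All other token boundaries
    are rendered as a single space.
    """
    pieces = []
    prev_join_right = False
    for token in tokens:
        join_left = token.startswith(joiner)
        if join_left:
            token = token[len(joiner):]
        join_right = token.endswith(joiner)
        if join_right:
            token = token[:-len(joiner)]
        # Insert a space only when neither side asks to be joined.
        if pieces and not (join_left or prev_join_right):
            pieces.append(" ")
        pieces.append(token)
        prev_join_right = join_right
    return "".join(pieces)


if __name__ == "__main__":
    print(detokenize_with_joiners(["Hello", "World", "■!"]))
```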

Interface

Constructor

tokenizer = pyonmttok.Tokenizer(
    mode: str,
    *,
    lang: str = "",
    bpe_model_path: str = "",
    bpe_dropout: float = 0,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    sp_model_path: str = "",
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "■",
    joiner_annotate: bool = False,
    joiner_new: bool = False,
    support_prior_joiners: bool = False,
    spacer_annotate: bool = False,
    spacer_new: bool = False,
    case_feature: bool = False,
    case_markup: bool = False,
    soft_case_regions: bool = False,
    no_substitution: bool = False,
    preserve_placeholders: bool = False,
    preserve_segmented_tokens: bool = False,
    segment_case: bool = False,
    segment_numbers: bool = False,
    segment_alphabet_change: bool = False,
    segment_alphabet: Optional[List[str]] = None)

# SentencePiece-compatible tokenizer.
tokenizer = pyonmttok.SentencePieceTokenizer(
    model_path: str,
    vocabulary_path: str = "",
    vocabulary_threshold: int = 0,
    nbest_size: int = 0,
    alpha: float = 0.1,
)

# Copy constructor.
tokenizer = pyonmttok.Tokenizer(tokenizer: pyonmttok.Tokenizer)

# Return the tokenization options (excluding options related to subword).
tokenizer.options

See the documentation for a description of each tokenization option.

Tokenization

# By default, tokenize returns the tokens and features.
# When training=False, subword regularization such as BPE dropout is disabled.
tokenizer.tokenize(
    text: str,
    training: bool = True,
) -> Tuple[List[str], List[List[str]]]

# With as_token_objects=True, tokenize returns Token objects instead (see below).
tokenizer.tokenize(
    text: str,
    as_token_objects: bool = False,  # Pass True to get Token objects.
    training: bool = True,
) -> List[pyonmttok.Token]

# Tokenize a file.
tokenizer.tokenize_file(
    input_path: str,
    output_path: str,
    num_threads: int = 1,
    verbose: bool = False,
    training: bool = True,
)
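
The training flag controls subword regularization such as BPE dropout. As a rough illustration of the idea, the sketch below applies a list of BPE merges to a word and randomly skips merges with a given probability. It is a simplified stand-in, not the library's implementation; `apply_bpe`, its parameters, and the example merges are all hypothetical.

```python
import random


def apply_bpe(word, merges, dropout=0.0, rng=None):
    """Apply BPE merges to a word, in the order the merges are given.

    With dropout > 0, each candidate merge is skipped with that
    probability, so the same word can be segmented in different ways.
    """
    rng = rng or random.Random()
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            is_pair = (
                i + 1 < len(symbols)
                and symbols[i] == left
                and symbols[i + 1] == right
            )
            if is_pair and not (dropout > 0 and rng.random() < dropout):
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


if __name__ == "__main__":
    merges = [("o", "w"), ("l", "ow")]
    # No dropout: deterministic segmentation, as when training=False.
    print(apply_bpe("lower", merges))
    # With dropout: the segmentation may vary unless the generator is seeded.
    print(apply_bpe("lower", merges, dropout=0.5, rng=random.Random(1)))
```

Passing a seeded `random.Random` makes the sampled segmentation repeatable, which is the same motivation as fixing the random seed for reproducible tokenization.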

Detokenization

# The detokenize method converts a list of tokens back to a string.
tokenizer.detokenize(
    tokens: Union[List[str], List[pyonmttok.Token]],
    features: Optional[List[List[str]]] = None
) -> str

# The detokenize_with_ranges method also returns a dictionary mapping a token
# index to a range in the detokenized text.
# Set merge_ranges=True to merge consecutive ranges, e.g. subwords of the same
# token in case of subword tokenization.
# Set unicode_ranges=True to return ranges over Unicode characters instead of bytes.
tokenizer.detokenize_with_ranges(
    tokens: Union[List[str], List[pyonmttok.Token]],
    merge_ranges: bool = False,
    unicode_ranges: bool = False
) -> Tuple[str, Dict[int, Tuple[int, int]]]

# Detokenize a file.
tokenizer.detokenize_file(input_path: str, output_path: str)
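
The difference between byte ranges and Unicode character ranges matters as soon as the text contains non-ASCII characters. The sketch below joins tokens with spaces and computes both kinds of ranges. It is only an illustration: the function `join_with_ranges` is hypothetical, and the half-open `(start, end)` convention is this sketch's own choice, which may differ from what the library returns.

```python
def join_with_ranges(tokens, unicode_ranges=False):
    """Join tokens with spaces and map each token index to its range.

    Ranges are half-open (start, end) offsets, counted in Unicode
    characters when unicode_ranges is True and in UTF-8 bytes otherwise.
    """
    text = " ".join(tokens)
    ranges = {}
    offset = 0
    for index, token in enumerate(tokens):
        if unicode_ranges:
            length = len(token)
        else:
            length = len(token.encode("utf-8"))
        ranges[index] = (offset, offset + length)
        # The separating space is one character and one UTF-8 byte.
        offset += length + 1
    return text, ranges


if __name__ == "__main__":
    tokens = ["caf\u00e9", "au"]
    print(join_with_ranges(tokens, unicode_ranges=True))
    print(join_with_ranges(tokens, unicode_ranges=False))
```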

Subword learning

Example

The Python wrapper supports BPE and SentencePiece subword learning through a common interface:

1. Create the subword learner with the tokenization you want to apply, e.g.:

# BPE is trained and applied on the tokenization output before joiner (or spacer) annotations.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

# SentencePiece can learn from raw sentences so a tokenizer is not required.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.98)

2. Feed some raw data:

# Feed detokenized sentences:
learner.ingest("Hello world!")
learner.ingest("How are you?")

# or detokenized text files:
learner.ingest_file("/data/train1.en")
learner.ingest_file("/data/train2.en")

3. Start the learning process:

tokenizer = learner.learn("/data/model-32k")

The returned tokenizer instance can be used to apply subword tokenization on new data.
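
For readers unfamiliar with what a BPE learner computes, the following pure-Python sketch shows the core loop of the general BPE algorithm: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is a teaching sketch, not the library's implementation. The function `learn_bpe` is hypothetical, and breaking ties by picking the lexicographically largest pair is an arbitrary choice made here to keep the output deterministic.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merge operations from a list of words."""
    # Each word starts as a sequence of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically largest pair.
        best = max(pairs, key=lambda pair: (pairs[pair], pair))
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    print(learn_bpe(["low", "low", "lower", "lowest"], num_merges=2))
```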

Interface

# See https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py
# for argument documentation.
learner = pyonmttok.BPELearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "space".
    symbols: int = 10000,
    min_frequency: int = 2,
    total_symbols: bool = False)

# See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc
# for available training options.
learner = pyonmttok.SentencePieceLearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "none".
    keep_vocab: bool = False,  # Keep the generated vocabulary (model_path will act like model_prefix in spm_train)
    **training_options)

learner.ingest(text: str)
learner.ingest_file(path: str)
learner.ingest_token(token: Union[str, pyonmttok.Token])

learner.learn(model_path: str, verbose: bool = False) -> pyonmttok.Tokenizer

Token API

The Token API lets you tokenize text into pyonmttok.Token objects. This API is useful for applying logic at the token level while retaining enough information to write the tokenization to disk or to detokenize.

Example

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.tokenize("Hello World!", as_token_objects=True)
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokens[-1].surface
'!'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']
>>> tokens[-1].surface = '.'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■.']
>>> tokenizer.detokenize(tokens)
'Hello World.'

Interface

The pyonmttok.Token class has the following attributes:

  • surface: a string, the token value
  • type: a pyonmttok.TokenType value, the type of the token
  • join_left: a boolean, whether the token should be joined to the token on the left or not
  • join_right: a boolean, whether the token should be joined to the token on the right or not
  • preserve: a boolean, whether joiners and spacers can be attached to this token or not
  • features: a list of strings, the features attached to the token
  • spacer: a boolean, whether the token is prefixed by a SentencePiece spacer or not (only set when using SentencePiece)
  • casing: a pyonmttok.Casing value, the casing of the token (only set when tokenizing with case_feature or case_markup)

The pyonmttok.TokenType enumeration is used to identify tokens that were split by a subword tokenization. The enumeration has the following values:

  • TokenType.WORD
  • TokenType.LEADING_SUBWORD
  • TokenType.TRAILING_SUBWORD

The pyonmttok.Casing enumeration is used to identify the original casing of a token that was lowercased by the case_feature or case_markup tokenization options. The enumeration has the following values:

  • Casing.LOWERCASE
  • Casing.UPPERCASE
  • Casing.MIXED
  • Casing.CAPITALIZED
  • Casing.NONE
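
To make the five casing categories concrete, here is a small sketch that classifies a string into one of them. The enum `SketchCasing`, the function `classify_casing`, and the exact rules (for example, treating a single uppercase letter as capitalized) are assumptions made for illustration; the library's own rules may differ.

```python
import enum


class SketchCasing(enum.Enum):
    LOWERCASE = "lowercase"
    UPPERCASE = "uppercase"
    MIXED = "mixed"
    CAPITALIZED = "capitalized"
    NONE = "none"


def classify_casing(word):
    """Classify the casing of a word using simple, assumed rules."""
    # Words without any cased character (digits, punctuation) have no casing.
    if not any(c.isupper() or c.islower() for c in word):
        return SketchCasing.NONE
    if word.islower():
        return SketchCasing.LOWERCASE
    # First character uppercase and no other uppercase character.
    if word[0].isupper() and not any(c.isupper() for c in word[1:]):
        return SketchCasing.CAPITALIZED
    if word.isupper():
        return SketchCasing.UPPERCASE
    return SketchCasing.MIXED


if __name__ == "__main__":
    for word in ["hello", "HELLO", "Hello", "hELLo", "123"]:
        print(word, classify_casing(word).name)
```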

The Tokenizer instances provide methods to serialize or deserialize Token objects:

# Serialize Token objects to strings that can be saved on disk.
tokenizer.serialize_tokens(tokens: List[pyonmttok.Token]) -> Tuple[List[str], List[List[str]]]

# Deserialize strings into Token objects.
tokenizer.deserialize_tokens(
    tokens: List[str],
    features: Optional[List[List[str]]] = None
) -> List[pyonmttok.Token]
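
The round trip between token objects and their string form can be sketched without the library. The example below uses a hypothetical `SketchToken` dataclass with only three of the attributes listed above, and hypothetical `serialize` and `deserialize` helpers that encode the join flags with the default joiner. It ignores features, spacers, and casing, so it is a simplified model rather than the library's behavior.

```python
from dataclasses import dataclass

JOINER = "■"


@dataclass
class SketchToken:
    surface: str
    join_left: bool = False
    join_right: bool = False


def serialize(tokens, joiner=JOINER):
    """Turn tokens into strings, encoding join flags as joiner markers."""
    return [
        (joiner if t.join_left else "") + t.surface + (joiner if t.join_right else "")
        for t in tokens
    ]


def deserialize(strings, joiner=JOINER):
    """Rebuild tokens from strings by reading and stripping joiner markers."""
    tokens = []
    for s in strings:
        join_left = s.startswith(joiner)
        if join_left:
            s = s[len(joiner):]
        join_right = s.endswith(joiner)
        if join_right:
            s = s[:-len(joiner)]
        tokens.append(SketchToken(s, join_left, join_right))
    return tokens


if __name__ == "__main__":
    tokens = [SketchToken("Hello"), SketchToken("World"), SketchToken("!", join_left=True)]
    # Edit a token, then write the result back as strings.
    tokens[-1].surface = "."
    print(serialize(tokens))
```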

Utilities

Interface

# Returns True if the string has the placeholder format.
pyonmttok.is_placeholder(token: str)

# Sets the random seed for reproducible tokenization.
pyonmttok.set_random_seed(seed: int)
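
This page does not spell out the placeholder format. The sketch below assumes that a placeholder is text wrapped in the markers ⦅ (U+2985) and ⦆ (U+2986) with at least one character in between, which is my understanding of the OpenNMT Tokenizer convention and should be checked against its documentation. The function `looks_like_placeholder` is hypothetical and is not the library's `is_placeholder`.

```python
PH_OPEN = "\u2985"   # "⦅", assumed opening marker
PH_CLOSE = "\u2986"  # "⦆", assumed closing marker


def looks_like_placeholder(token):
    """Return True if the token contains an assumed placeholder span."""
    start = token.find(PH_OPEN)
    if start == -1:
        return False
    # Require at least one character between the two markers.
    return token.find(PH_CLOSE, start + len(PH_OPEN) + 1) != -1


if __name__ == "__main__":
    for token in ["\u2985URL\u2986", "hello", "\u2985\u2986"]:
        print(repr(token), looks_like_placeholder(token))
```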

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • pyonmttok-1.26.4-cp39-cp39-manylinux1_x86_64.whl (14.3 MB): CPython 3.9
  • pyonmttok-1.26.4-cp39-cp39-macosx_10_9_x86_64.whl (13.0 MB): CPython 3.9, macOS 10.9+ x86-64
  • pyonmttok-1.26.4-cp38-cp38-manylinux1_x86_64.whl (14.3 MB): CPython 3.8
  • pyonmttok-1.26.4-cp38-cp38-macosx_10_9_x86_64.whl (13.0 MB): CPython 3.8, macOS 10.9+ x86-64
  • pyonmttok-1.26.4-cp37-cp37m-manylinux1_x86_64.whl (14.3 MB): CPython 3.7m
  • pyonmttok-1.26.4-cp37-cp37m-macosx_10_9_x86_64.whl (13.0 MB): CPython 3.7m, macOS 10.9+ x86-64
  • pyonmttok-1.26.4-cp36-cp36m-manylinux1_x86_64.whl (14.3 MB): CPython 3.6m
  • pyonmttok-1.26.4-cp36-cp36m-macosx_10_9_x86_64.whl (13.0 MB): CPython 3.6m, macOS 10.9+ x86-64
  • pyonmttok-1.26.4-cp35-cp35m-manylinux1_x86_64.whl (14.3 MB): CPython 3.5m
  • pyonmttok-1.26.4-cp35-cp35m-macosx_10_9_x86_64.whl (13.0 MB): CPython 3.5m, macOS 10.9+ x86-64
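
Wheel file names such as the ones listed above encode the Python, ABI, and platform tags, which is how pip picks the right file. The sketch below splits a file name into those fields following the standard wheel naming scheme (distribution, version, optional build tag, Python tag, ABI tag, platform tag). The names `WheelTags` and `parse_wheel_filename` are hypothetical helpers written for this illustration.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated components."""
    suffix = ".whl"
    if not filename.endswith(suffix):
        raise ValueError("not a wheel file name: %r" % filename)
    parts = filename[: -len(suffix)].split("-")
    if len(parts) == 6:
        # Drop the optional build tag, which sits right after the version.
        del parts[2]
    if len(parts) != 5:
        raise ValueError("unexpected wheel file name: %r" % filename)
    return WheelTags(*parts)


if __name__ == "__main__":
    print(parse_wheel_filename("pyonmttok-1.26.4-cp39-cp39-manylinux1_x86_64.whl"))
```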
