
pyonmttok

pyonmttok is the Python wrapper for OpenNMT/Tokenizer, a fast and customizable text tokenization library with BPE and SentencePiece support.

Installation:

pip install pyonmttok

Requirements:

  • OS: Linux, macOS, Windows
  • Python version: >= 3.6
  • pip version: >= 19.3

Table of contents

  1. Tokenization
  2. Subword learning
  3. Vocabulary
  4. Token API
  5. Utilities

Tokenization

Example

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'

Interface

Constructor

tokenizer = pyonmttok.Tokenizer(
    mode: str,
    *,
    lang: Optional[str] = None,
    bpe_model_path: Optional[str] = None,
    bpe_dropout: float = 0,
    vocabulary: Optional[List[str]] = None,
    vocabulary_path: Optional[str] = None,
    vocabulary_threshold: int = 0,
    sp_model_path: Optional[str] = None,
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "■",
    joiner_annotate: bool = False,
    joiner_new: bool = False,
    support_prior_joiners: bool = False,
    spacer_annotate: bool = False,
    spacer_new: bool = False,
    case_feature: bool = False,
    case_markup: bool = False,
    soft_case_regions: bool = False,
    no_substitution: bool = False,
    with_separators: bool = False,
    allow_isolated_marks: bool = False,
    preserve_placeholders: bool = False,
    preserve_segmented_tokens: bool = False,
    segment_case: bool = False,
    segment_numbers: bool = False,
    segment_alphabet_change: bool = False,
    segment_alphabet: Optional[List[str]] = None,
)

# SentencePiece-compatible tokenizer.
tokenizer = pyonmttok.SentencePieceTokenizer(
    model_path: str,
    vocabulary_path: Optional[str] = None,
    vocabulary_threshold: int = 0,
    nbest_size: int = 0,
    alpha: float = 0.1,
)

# Copy constructor.
tokenizer = pyonmttok.Tokenizer(tokenizer: pyonmttok.Tokenizer)

# Return the tokenization options (excluding subword-related options).
tokenizer.options

See the documentation for a description of each tokenization option.
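
For example, a minimal sketch combining several of these options (the exact representation of tokenizer.options may vary across versions):

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer(
...     "conservative",
...     joiner_annotate=True,
...     case_markup=True,
...     segment_numbers=True,
... )
>>> tokenizer.options  # introspect the active tokenization options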

Tokenization

# Tokenize a text.
# When training=False, subword regularization such as BPE dropout is disabled.
tokenizer.__call__(text: str, training: bool = True) -> List[str]

# Tokenize a text and return optional features.
# When as_token_objects=True, the method returns Token objects (see below).
tokenizer.tokenize(
    text: str,
    as_token_objects: bool = False,
    training: bool = True,
) -> Union[Tuple[List[str], Optional[List[List[str]]]], List[pyonmttok.Token]]

# Tokenize a batch of text.
tokenizer.tokenize_batch(
    batch_text: List[str],
    as_token_objects: bool = False,
    training: bool = True,
) -> Union[Tuple[List[List[str]], List[Optional[List[List[str]]]]], List[List[pyonmttok.Token]]]

# Tokenize a file.
tokenizer.tokenize_file(
    input_path: str,
    output_path: str,
    num_threads: int = 1,
    verbose: bool = False,
    training: bool = True,
    tokens_delimiter: str = " ",
)
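
For example, a minimal sketch contrasting the return types (outputs shown are indicative):

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens, features = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> features is None  # no features unless e.g. case_feature is enabled
True
>>> batch_tokens, batch_features = tokenizer.tokenize_batch(["Hello World!", "How are you?"])
>>> batch_tokens[0]
['Hello', 'World', '■!']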

Detokenization

# The detokenize method converts a list of tokens back to a string.
tokenizer.detokenize(
    tokens: List[str],
    features: Optional[List[List[str]]] = None,
) -> str
tokenizer.detokenize(tokens: List[pyonmttok.Token]) -> str

# The detokenize_with_ranges method also returns a dictionary mapping a token
# index to a range in the detokenized text.
# Set merge_ranges=True to merge consecutive ranges, e.g. subwords of the same
# token in case of subword tokenization.
# Set unicode_ranges=True to return ranges over Unicode characters instead of bytes.
tokenizer.detokenize_with_ranges(
    tokens: Union[List[str], List[pyonmttok.Token]],
    merge_ranges: bool = False,
    unicode_ranges: bool = False,
) -> Tuple[str, Dict[int, Tuple[int, int]]]

# Detokenize a file.
tokenizer.detokenize_file(
    input_path: str,
    output_path: str,
    tokens_delimiter: str = " ",
)
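
For example, a minimal sketch of detokenize_with_ranges (the exact offsets depend on the merge_ranges and unicode_ranges settings, so they are not reproduced here):

>>> tokens = tokenizer("Hello World!")
>>> text, ranges = tokenizer.detokenize_with_ranges(tokens, merge_ranges=True)
>>> text
'Hello World!'
>>> type(ranges)  # maps a token index to a (start, end) range in text
<class 'dict'>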

Subword learning

Example

The Python wrapper supports BPE and SentencePiece subword learning through a common interface:

1. Create the subword learner with the tokenization you want to apply, e.g.:

# BPE is trained and applied on the tokenization output before joiner (or spacer) annotations.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

# SentencePiece can learn from raw sentences, so a tokenizer is not required.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.98)

2. Feed some raw data:

# Feed detokenized sentences:
learner.ingest("Hello world!")
learner.ingest("How are you?")

# or detokenized text files:
learner.ingest_file("/data/train1.en")
learner.ingest_file("/data/train2.en")

3. Start the learning process:

tokenizer = learner.learn("/data/model-32k")

The returned tokenizer instance can be used to apply subword tokenization on new data.
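
For example (a hedged sketch; the actual subwords depend on the learned model and the learner type):

tokens = tokenizer("How are you?")
# With a BPE model and joiner_annotate=True, out-of-vocabulary words are split
# into joiner-annotated subwords, e.g. something like ['How', 'are', 'you', '■?'].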

Interface

# See https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py
# for argument documentation.
learner = pyonmttok.BPELearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "space".
    symbols: int = 10000,
    min_frequency: int = 2,
    total_symbols: bool = False,
)

# See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc
# for available training options.
learner = pyonmttok.SentencePieceLearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "none".
    keep_vocab: bool = False,  # Keep the generated vocabulary (model_path will act like model_prefix in spm_train)
    **training_options,
)

learner.ingest(text: str)
learner.ingest_file(path: str)
learner.ingest_token(token: Union[str, pyonmttok.Token])

learner.learn(model_path: str, verbose: bool = False) -> pyonmttok.Tokenizer
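
For example, a hedged sketch forwarding extra training options through **training_options (model_type and input_sentence_size are standard SentencePiece trainer flags, but verify them against your installed SentencePiece version):

learner = pyonmttok.SentencePieceLearner(
    vocab_size=32000,
    character_coverage=1.0,
    model_type="bpe",             # forwarded to the SentencePiece trainer
    input_sentence_size=1000000,  # cap the number of training sentences
)
learner.ingest_file("/data/train1.en")
tokenizer = learner.learn("/data/spm-bpe-32k", verbose=True)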

Vocabulary

Example

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

with open("train.txt") as train_file:
    vocab = pyonmttok.build_vocab_from_lines(
        train_file,
        tokenizer=tokenizer,
        maximum_size=32000,
        special_tokens=["<blank>", "<unk>", "<s>", "</s>"],
    )

with open("vocab.txt", "w") as vocab_file:
    for token in vocab.ids_to_tokens:
        vocab_file.write("%s\n" % token)

Interface

# Special tokens are added with ids 0, 1, etc., and are never removed by a resize.
vocab = pyonmttok.Vocab(special_tokens: Optional[List[str]] = None)

# Read-only properties.
vocab.tokens_to_ids -> Dict[str, int]
vocab.ids_to_tokens -> List[str]
vocab.counters -> List[int]

# Get or set the ID returned for out-of-vocabulary tokens.
# By default, it is the ID of the token <unk> if present in the vocabulary, len(vocab) otherwise.
vocab.default_id -> int

vocab.lookup_token(token: str) -> int
vocab.lookup_index(index: int) -> str

# Calls lookup_token on a batch of tokens.
vocab.__call__(tokens: List[str]) -> List[int]

vocab.__len__() -> int                  # Implements: len(vocab)
vocab.__contains__(token: str) -> bool  # Implements: "hello" in vocab
vocab.__getitem__(token: str) -> int    # Implements: vocab["hello"]

# Add tokens to the vocabulary after tokenization.
# If a tokenizer is not set, the text is split on spaces.
vocab.add_from_text(text: str, tokenizer: Optional[pyonmttok.Tokenizer] = None) -> None
vocab.add_from_file(path: str, tokenizer: Optional[pyonmttok.Tokenizer] = None) -> None
vocab.add_token(token: str, count: int = 1) -> None

vocab.resize(maximum_size: int = 0, minimum_frequency: int = 1) -> None
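
For example, a minimal sketch of the incremental API (IDs shown assume insertion order and may vary):

>>> vocab = pyonmttok.Vocab(special_tokens=["<blank>", "<unk>"])
>>> vocab.add_from_text("hello world hello")  # no tokenizer: split on spaces
>>> vocab["hello"]
2
>>> len(vocab)
4
>>> vocab.resize(minimum_frequency=2)  # special tokens are never removed
>>> "world" in vocab
False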


# Build a vocabulary from an iterator of lines.
# If a tokenizer is not set, the lines are split on spaces.
pyonmttok.build_vocab_from_lines(
    lines: Iterable[str],
    tokenizer: Optional[pyonmttok.Tokenizer] = None,
    maximum_size: int = 0,
    minimum_frequency: int = 1,
    special_tokens: Optional[List[str]] = None,
) -> pyonmttok.Vocab

# Build a vocabulary from an iterator of tokens.
pyonmttok.build_vocab_from_tokens(
    tokens: Iterable[str],
    maximum_size: int = 0,
    minimum_frequency: int = 1,
    special_tokens: Optional[List[str]] = None,
) -> pyonmttok.Vocab
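
For example, a minimal sketch building a vocabulary from a stream of pre-tokenized data:

tokens = ["hello", "world", "hello"]
vocab = pyonmttok.build_vocab_from_tokens(
    tokens,
    minimum_frequency=2,
    special_tokens=["<unk>"],
)
# Only "hello" (count 2) survives minimum_frequency=2, alongside <unk>.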

Token API

The Token API allows tokenizing text into pyonmttok.Token objects. This API can be useful for applying logic at the token level while retaining enough information to write the tokenization to disk or detokenize it.

Example

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.tokenize("Hello World!", as_token_objects=True)
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokens[-1].surface
'!'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']
>>> tokens[-1].surface = '.'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■.']
>>> tokenizer.detokenize(tokens)
'Hello World.'

Interface

The pyonmttok.Token class has the following attributes:

  • surface: a string, the token value
  • type: a pyonmttok.TokenType value, the type of the token
  • join_left: a boolean, whether the token should be joined to the token on the left or not
  • join_right: a boolean, whether the token should be joined to the token on the right or not
  • preserve: a boolean, whether joiners and spacers can be attached to this token or not
  • features: a list of string, the features attached to the token
  • spacer: a boolean, whether the token is prefixed by a SentencePiece spacer or not (only set when using SentencePiece)
  • casing: a pyonmttok.Casing value, the casing of the token (only set when tokenizing with case_feature or case_markup)

The pyonmttok.TokenType enumeration is used to identify tokens that were split by subword tokenization. The enumeration has the following values:

  • TokenType.WORD
  • TokenType.LEADING_SUBWORD
  • TokenType.TRAILING_SUBWORD

The pyonmttok.Casing enumeration is used to identify the original casing of a token that was lowercased by the case_feature or case_markup tokenization options. The enumeration has the following values:

  • Casing.LOWERCASE
  • Casing.UPPERCASE
  • Casing.MIXED
  • Casing.CAPITALIZED
  • Casing.NONE
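
For example, a minimal sketch of casing attributes under case_markup (the exact repr of the enum values may differ across versions):

>>> tokenizer = pyonmttok.Tokenizer("aggressive", case_markup=True)
>>> tokens = tokenizer.tokenize("Hello WORLD", as_token_objects=True)
>>> [(token.surface, token.casing) for token in tokens]
[('hello', Casing.CAPITALIZED), ('world', Casing.UPPERCASE)]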

Tokenizer instances provide methods to serialize or deserialize Token objects:

# Serialize Token objects to strings that can be saved on disk.
tokenizer.serialize_tokens(
    tokens: List[pyonmttok.Token],
) -> Tuple[List[str], Optional[List[List[str]]]]

# Deserialize strings into Token objects.
tokenizer.deserialize_tokens(
    tokens: List[str],
    features: Optional[List[List[str]]] = None,
) -> List[pyonmttok.Token]
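
For example, a serialization round trip (reprs shown are indicative):

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.deserialize_tokens(['Hello', 'World', '■!'])
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']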

Utilities

Interface

# Returns True if the string has the placeholder format.
pyonmttok.is_placeholder(token: str)

# Sets the random seed for reproducible tokenization.
pyonmttok.set_random_seed(seed: int)

# Checks if the language code is valid.
pyonmttok.is_valid_language(lang: str)
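
For example (a minimal sketch; it assumes the usual ｟...｠ placeholder delimiters used by OpenNMT/Tokenizer):

>>> pyonmttok.is_placeholder("｟URL｠")
True
>>> pyonmttok.set_random_seed(42)  # e.g. for reproducible BPE dropout or SentencePiece sampling
>>> pyonmttok.is_valid_language("en")
True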
