pyonmttok

pyonmttok is the Python wrapper for OpenNMT/Tokenizer, a fast and customizable text tokenization library with BPE and SentencePiece support.

Installation:

pip install pyonmttok

Requirements:

  • OS: Linux, macOS, Windows
  • Python version: >= 3.6
  • pip version: >= 19.3

Table of contents

  1. Tokenization
  2. Subword learning
  3. Vocabulary
  4. Token API
  5. Utilities

Tokenization

Example

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'

Interface

Constructor

tokenizer = pyonmttok.Tokenizer(
    mode: str,
    *,
    lang: Optional[str] = None,
    bpe_model_path: Optional[str] = None,
    bpe_dropout: float = 0,
    vocabulary: Optional[List[str]] = None,
    vocabulary_path: Optional[str] = None,
    vocabulary_threshold: int = 0,
    sp_model_path: Optional[str] = None,
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "■",
    joiner_annotate: bool = False,
    joiner_new: bool = False,
    support_prior_joiners: bool = False,
    spacer_annotate: bool = False,
    spacer_new: bool = False,
    case_feature: bool = False,
    case_markup: bool = False,
    soft_case_regions: bool = False,
    no_substitution: bool = False,
    with_separators: bool = False,
    allow_isolated_marks: bool = False,
    preserve_placeholders: bool = False,
    preserve_segmented_tokens: bool = False,
    segment_case: bool = False,
    segment_numbers: bool = False,
    segment_alphabet_change: bool = False,
    segment_alphabet: Optional[List[str]] = None,
)

# SentencePiece-compatible tokenizer.
tokenizer = pyonmttok.SentencePieceTokenizer(
    model_path: str,
    vocabulary_path: Optional[str] = None,
    vocabulary_threshold: int = 0,
    nbest_size: int = 0,
    alpha: float = 0.1,
)

# Copy constructor.
tokenizer = pyonmttok.Tokenizer(tokenizer: pyonmttok.Tokenizer)

# Return the tokenization options (excluding subword-related options).
tokenizer.options

See the documentation for a description of each tokenization option.
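
For example, a SentencePiece-compatible tokenizer is built directly from a trained model. A minimal sketch, assuming a trained SentencePiece model exists at the placeholder path below:

import pyonmttok

tokenizer = pyonmttok.SentencePieceTokenizer("/data/model.spm")
tokens = tokenizer("Hello World!")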

Tokenization

# Tokenize a text.
# When training=False, subword regularization such as BPE dropout is disabled.
tokenizer.__call__(text: str, training: bool = True) -> List[str]

# Tokenize a text and return optional features.
# When as_token_objects=True, the method returns Token objects (see below).
tokenizer.tokenize(
    text: str,
    as_token_objects: bool = False,
    training: bool = True,
) -> Union[Tuple[List[str], Optional[List[List[str]]]], List[pyonmttok.Token]]

# Tokenize a batch of text.
tokenizer.tokenize_batch(
    batch_text: List[str],
    as_token_objects: bool = False,
    training: bool = True,
) -> Union[Tuple[List[List[str]], List[Optional[List[List[str]]]]], List[List[pyonmttok.Token]]]

# Tokenize a file.
tokenizer.tokenize_file(
    input_path: str,
    output_path: str,
    num_threads: int = 1,
    verbose: bool = False,
    training: bool = True,
    tokens_delimiter: str = " ",
)
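
A short usage sketch of the return values (features is None here because no token-level features such as case_feature are enabled):

import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# tokenize returns a (tokens, features) pair.
tokens, features = tokenizer.tokenize("Hello World!")

# tokenize_batch returns one entry per input text.
batch_tokens, batch_features = tokenizer.tokenize_batch(["Hello World!", "How are you?"])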

Detokenization

# The detokenize method converts a list of tokens back to a string.
tokenizer.detokenize(
    tokens: List[str],
    features: Optional[List[List[str]]] = None,
) -> str
tokenizer.detokenize(tokens: List[pyonmttok.Token]) -> str

# The detokenize_with_ranges method also returns a dictionary mapping a token
# index to a range in the detokenized text.
# Set merge_ranges=True to merge consecutive ranges, e.g. subwords of the same
# token in case of subword tokenization.
# Set unicode_ranges=True to return ranges over Unicode characters instead of bytes.
tokenizer.detokenize_with_ranges(
    tokens: Union[List[str], List[pyonmttok.Token]],
    merge_ranges: bool = False,
    unicode_ranges: bool = False,
) -> Tuple[str, Dict[int, Tuple[int, int]]]

# Detokenize a file.
tokenizer.detokenize_file(
    input_path: str,
    output_path: str,
    tokens_delimiter: str = " ",
)
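
A minimal sketch of detokenize_with_ranges (exact offsets are not asserted here; each range is a pair of offsets into the detokenized text):

import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
tokens = tokenizer("Hello World!")

# merge_ranges=True merges the ranges of subwords that belong to the same word.
text, ranges = tokenizer.detokenize_with_ranges(tokens, merge_ranges=True)
for token_index, span in sorted(ranges.items()):
    print(tokens[token_index], span)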

Subword learning

Example

The Python wrapper supports BPE and SentencePiece subword learning through a common interface:

1. Create the subword learner with the tokenization you want to apply, e.g.:

# BPE is trained and applied on the tokenization output before joiner (or spacer) annotations.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

# SentencePiece can learn from raw sentences, so a tokenizer is not required.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.98)

2. Feed some raw data:

# Feed detokenized sentences:
learner.ingest("Hello world!")
learner.ingest("How are you?")

# or detokenized text files:
learner.ingest_file("/data/train1.en")
learner.ingest_file("/data/train2.en")

3. Start the learning process:

tokenizer = learner.learn("/data/model-32k")

The returned tokenizer instance can be used to apply subword tokenization on new data.

Interface

# See https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py
# for argument documentation.
learner = pyonmttok.BPELearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "space".
    symbols: int = 10000,
    min_frequency: int = 2,
    total_symbols: bool = False,
)

# See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc
# for available training options.
learner = pyonmttok.SentencePieceLearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "none".
    keep_vocab: bool = False,  # Keep the generated vocabulary (model_path will act like model_prefix in spm_train)
    **training_options,
)

learner.ingest(text: str)
learner.ingest_file(path: str)
learner.ingest_token(token: Union[str, pyonmttok.Token])

learner.learn(model_path: str, verbose: bool = False) -> pyonmttok.Tokenizer
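
As a sketch, additional SentencePiece training options are forwarded as keyword arguments (vocab_size, character_coverage, and model_type are standard spm_train options; the paths are placeholders):

import pyonmttok

learner = pyonmttok.SentencePieceLearner(
    keep_vocab=True,  # model_path below acts like model_prefix in spm_train.
    vocab_size=32000,
    character_coverage=0.9995,
    model_type="unigram",
)
learner.ingest_file("/data/train.en")
tokenizer = learner.learn("/data/sp-32k", verbose=True)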

Vocabulary

Example

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

with open("train.txt") as train_file:
    vocab = pyonmttok.build_vocab_from_lines(
        train_file,
        tokenizer=tokenizer,
        maximum_size=32000,
        special_tokens=["<blank>", "<unk>", "<s>", "</s>"],
    )

with open("vocab.txt", "w") as vocab_file:
    for token in vocab.ids_to_tokens:
        vocab_file.write("%s\n" % token)

Interface

# Special tokens are added with ids 0, 1, etc., and are never removed by a resize.
vocab = pyonmttok.Vocab(special_tokens: Optional[List[str]] = None)

# Read-only properties.
vocab.tokens_to_ids -> Dict[str, int]
vocab.ids_to_tokens -> List[str]
vocab.counters -> List[int]

# Get or set the ID returned for out-of-vocabulary tokens.
# By default, it is the ID of the token <unk> if present in the vocabulary, len(vocab) otherwise.
vocab.default_id -> int

vocab.lookup_token(token: str) -> int
vocab.lookup_index(index: int) -> str

# Calls lookup_token on a batch of tokens.
vocab.__call__(tokens: List[str]) -> List[int]

vocab.__len__() -> int                  # Implements: len(vocab)
vocab.__contains__(token: str) -> bool  # Implements: "hello" in vocab
vocab.__getitem__(token: str) -> int    # Implements: vocab["hello"]

# Add tokens to the vocabulary after tokenization.
# If a tokenizer is not set, the text is split on spaces.
vocab.add_from_text(text: str, tokenizer: Optional[pyonmttok.Tokenizer] = None) -> None
vocab.add_from_file(path: str, tokenizer: Optional[pyonmttok.Tokenizer] = None) -> None
vocab.add_token(token: str, count: int = 1) -> None

vocab.resize(maximum_size: int = 0, minimum_frequency: int = 1) -> None


# Build a vocabulary from an iterator of lines.
# If a tokenizer is not set, the lines are split on spaces.
pyonmttok.build_vocab_from_lines(
    lines: Iterable[str],
    tokenizer: Optional[pyonmttok.Tokenizer] = None,
    maximum_size: int = 0,
    minimum_frequency: int = 1,
    special_tokens: Optional[List[str]] = None,
) -> pyonmttok.Vocab

# Build a vocabulary from an iterator of tokens.
pyonmttok.build_vocab_from_tokens(
    tokens: Iterable[str],
    maximum_size: int = 0,
    minimum_frequency: int = 1,
    special_tokens: Optional[List[str]] = None,
) -> pyonmttok.Vocab
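
A short sketch of direct Vocab manipulation (token IDs are not shown since they depend on insertion order):

import pyonmttok

vocab = pyonmttok.Vocab(special_tokens=["<blank>", "<unk>"])
vocab.add_token("Hello")
vocab.add_token("World", count=3)

print(len(vocab))        # Number of entries, including special tokens.
print("Hello" in vocab)  # True
print(vocab["World"])    # ID assigned to "World".

# Drop tokens seen fewer than 2 times; special tokens are always kept.
vocab.resize(minimum_frequency=2)

# Vocabularies can also be built from an iterator of tokens.
vocab = pyonmttok.build_vocab_from_tokens(
    ["Hello", "World", "Hello"],
    special_tokens=["<unk>"],
)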

Token API

The Token API tokenizes text into pyonmttok.Token objects. It is useful for applying logic at the token level while retaining enough information to write the tokenization to disk or to detokenize.

Example

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.tokenize("Hello World!", as_token_objects=True)
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokens[-1].surface
'!'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']
>>> tokens[-1].surface = '.'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■.']
>>> tokenizer.detokenize(tokens)
'Hello World.'

Interface

The pyonmttok.Token class has the following attributes:

  • surface: a string, the token value
  • type: a pyonmttok.TokenType value, the type of the token
  • join_left: a boolean, whether the token should be joined to the token on the left or not
  • join_right: a boolean, whether the token should be joined to the token on the right or not
  • preserve: a boolean, whether joiners and spacers can be attached to this token or not
  • features: a list of strings, the features attached to the token
  • spacer: a boolean, whether the token is prefixed by a SentencePiece spacer or not (only set when using SentencePiece)
  • casing: a pyonmttok.Casing value, the casing of the token (only set when tokenizing with case_feature or case_markup)

The pyonmttok.TokenType enumeration is used to identify tokens that were split by subword tokenization. The enumeration has the following values:

  • TokenType.WORD
  • TokenType.LEADING_SUBWORD
  • TokenType.TRAILING_SUBWORD

The pyonmttok.Casing enumeration is used to identify the original casing of a token that was lowercased by the case_feature or case_markup tokenization options. The enumeration has the following values:

  • Casing.LOWERCASE
  • Casing.UPPERCASE
  • Casing.MIXED
  • Casing.CAPITALIZED
  • Casing.NONE
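
For example, a minimal sketch inspecting these attributes (the casing values are only set here because case_markup is enabled; the exact output is not asserted):

import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", case_markup=True)
tokens = tokenizer.tokenize("Hello WORLD!", as_token_objects=True)
for token in tokens:
    print(token.surface, token.type, token.casing)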

Tokenizer instances provide methods to serialize or deserialize Token objects:

# Serialize Token objects to strings that can be saved on disk.
tokenizer.serialize_tokens(
    tokens: List[pyonmttok.Token],
) -> Tuple[List[str], Optional[List[List[str]]]]

# Deserialize strings into Token objects.
tokenizer.deserialize_tokens(
    tokens: List[str],
    features: Optional[List[List[str]]] = None,
) -> List[pyonmttok.Token]
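
A short round-trip sketch with deserialize_tokens, reusing the serialized strings from the Token API example above:

import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

tokens = tokenizer.deserialize_tokens(["Hello", "World", "■!"])
print(tokens[-1].join_left)          # True: "■!" attaches to the previous token.
print(tokenizer.detokenize(tokens))  # 'Hello World!'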

Utilities

Interface

# Returns True if the string has the placeholder format.
pyonmttok.is_placeholder(token: str)

# Sets the random seed for reproducible tokenization.
pyonmttok.set_random_seed(seed: int)

# Checks if the language code is valid.
pyonmttok.is_valid_language(lang: str)
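
A brief usage sketch of these utilities (placeholders use the ｟...｠ markers recognized by the library):

import pyonmttok

print(pyonmttok.is_placeholder("｟URL｠"))  # True
print(pyonmttok.is_placeholder("Hello"))   # False

# Make subword regularization (e.g. BPE dropout or SentencePiece sampling) reproducible.
pyonmttok.set_random_seed(42)

print(pyonmttok.is_valid_language("en"))   # True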
