pyonmttok

pyonmttok is the Python wrapper for OpenNMT/Tokenizer, a fast and customizable text tokenization library with BPE and SentencePiece support.

Installation:

pip install pyonmttok

Requirements:

  • OS: Linux, macOS, Windows
  • Python version: >= 3.6
  • pip version: >= 19.3

Table of contents

  1. Tokenization
  2. Subword learning
  3. Vocabulary
  4. Token API
  5. Utilities

Tokenization

Example

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'

Interface

Constructor

tokenizer = pyonmttok.Tokenizer(
    mode: str,
    *,
    lang: Optional[str] = None,
    bpe_model_path: Optional[str] = None,
    bpe_dropout: float = 0,
    vocabulary: Optional[List[str]] = None,
    vocabulary_path: Optional[str] = None,
    vocabulary_threshold: int = 0,
    sp_model_path: Optional[str] = None,
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "■",
    joiner_annotate: bool = False,
    joiner_new: bool = False,
    support_prior_joiners: bool = False,
    spacer_annotate: bool = False,
    spacer_new: bool = False,
    case_feature: bool = False,
    case_markup: bool = False,
    soft_case_regions: bool = False,
    no_substitution: bool = False,
    with_separators: bool = False,
    allow_isolated_marks: bool = False,
    preserve_placeholders: bool = False,
    preserve_segmented_tokens: bool = False,
    segment_case: bool = False,
    segment_numbers: bool = False,
    segment_alphabet_change: bool = False,
    segment_alphabet: Optional[List[str]] = None,
)

# SentencePiece-compatible tokenizer.
tokenizer = pyonmttok.SentencePieceTokenizer(
    model_path: str,
    vocabulary_path: Optional[str] = None,
    vocabulary_threshold: int = 0,
    nbest_size: int = 0,
    alpha: float = 0.1,
)

# Copy constructor.
tokenizer = pyonmttok.Tokenizer(tokenizer: pyonmttok.Tokenizer)

# Return the tokenization options (excluding subword-related options).
tokenizer.options

See the documentation for a description of each tokenization option.
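
For example, a SentencePiece-compatible tokenizer is built directly from a trained model. A minimal sketch, assuming a trained SentencePiece model exists at the placeholder path below:

import pyonmttok

tokenizer = pyonmttok.SentencePieceTokenizer("/data/model.spm")
tokens = tokenizer("Hello World!")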

Tokenization

# Tokenize a text.
# When training=False, subword regularization such as BPE dropout is disabled.
tokenizer.__call__(text: str, training: bool = True) -> List[str]

# Tokenize a text and return optional features.
# When as_token_objects=True, the method returns Token objects (see below).
tokenizer.tokenize(
    text: str,
    as_token_objects: bool = False,
    training: bool = True,
) -> Union[Tuple[List[str], Optional[List[List[str]]]], List[pyonmttok.Token]]

# Tokenize a batch of text.
tokenizer.tokenize_batch(
    batch_text: List[str],
    as_token_objects: bool = False,
    training: bool = True,
) -> Union[Tuple[List[List[str]], List[Optional[List[List[str]]]]], List[List[pyonmttok.Token]]]

# Tokenize a file.
tokenizer.tokenize_file(
    input_path: str,
    output_path: str,
    num_threads: int = 1,
    verbose: bool = False,
    training: bool = True,
    tokens_delimiter: str = " ",
)
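
A short usage sketch of the return values (features is None here because no token-level features such as case_feature are enabled):

import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# tokenize returns a (tokens, features) pair.
tokens, features = tokenizer.tokenize("Hello World!")

# tokenize_batch returns one entry per input text.
batch_tokens, batch_features = tokenizer.tokenize_batch(["Hello World!", "How are you?"])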

Detokenization

# The detokenize method converts a list of tokens back to a string.
tokenizer.detokenize(
    tokens: List[str],
    features: Optional[List[List[str]]] = None,
) -> str
tokenizer.detokenize(tokens: List[pyonmttok.Token]) -> str

# The detokenize_with_ranges method also returns a dictionary mapping a token
# index to a range in the detokenized text.
# Set merge_ranges=True to merge consecutive ranges, e.g. subwords of the same
# token in case of subword tokenization.
# Set unicode_ranges=True to return ranges over Unicode characters instead of bytes.
tokenizer.detokenize_with_ranges(
    tokens: Union[List[str], List[pyonmttok.Token]],
    merge_ranges: bool = False,
    unicode_ranges: bool = False,
) -> Tuple[str, Dict[int, Tuple[int, int]]]

# Detokenize a file.
tokenizer.detokenize_file(
    input_path: str,
    output_path: str,
    tokens_delimiter: str = " ",
)
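
A minimal sketch of detokenize_with_ranges (exact offsets are not asserted here; each range is a pair of offsets into the detokenized text):

import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
tokens = tokenizer("Hello World!")

# merge_ranges=True merges the ranges of subwords that belong to the same word.
text, ranges = tokenizer.detokenize_with_ranges(tokens, merge_ranges=True)
for token_index, span in sorted(ranges.items()):
    print(tokens[token_index], span)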

Subword learning

Example

The Python wrapper supports BPE and SentencePiece subword learning through a common interface:

1. Create the subword learner with the tokenization you want to apply, e.g.:

# BPE is trained and applied on the tokenization output before joiner (or spacer) annotations.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

# SentencePiece can learn from raw sentences, so a tokenizer is not required.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.98)

2. Feed some raw data:

# Feed detokenized sentences:
learner.ingest("Hello world!")
learner.ingest("How are you?")

# or detokenized text files:
learner.ingest_file("/data/train1.en")
learner.ingest_file("/data/train2.en")

3. Start the learning process:

tokenizer = learner.learn("/data/model-32k")

The returned tokenizer instance can be used to apply subword tokenization on new data.

Interface

# See https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py
# for argument documentation.
learner = pyonmttok.BPELearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "space".
    symbols: int = 10000,
    min_frequency: int = 2,
    total_symbols: bool = False,
)

# See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc
# for available training options.
learner = pyonmttok.SentencePieceLearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "none".
    keep_vocab: bool = False,  # Keep the generated vocabulary (model_path will act like model_prefix in spm_train)
    **training_options,
)

learner.ingest(text: str)
learner.ingest_file(path: str)
learner.ingest_token(token: Union[str, pyonmttok.Token])

learner.learn(model_path: str, verbose: bool = False) -> pyonmttok.Tokenizer
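
As a sketch, additional SentencePiece training options are forwarded as keyword arguments (vocab_size, character_coverage, and model_type are standard spm_train options; the paths are placeholders):

import pyonmttok

learner = pyonmttok.SentencePieceLearner(
    keep_vocab=True,  # model_path below acts like model_prefix in spm_train.
    vocab_size=32000,
    character_coverage=0.9995,
    model_type="unigram",
)
learner.ingest_file("/data/train.en")
tokenizer = learner.learn("/data/sp-32k", verbose=True)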

Vocabulary

Example

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

with open("train.txt") as train_file:
    vocab = pyonmttok.build_vocab_from_lines(
        train_file,
        tokenizer=tokenizer,
        maximum_size=32000,
        special_tokens=["<blank>", "<unk>", "<s>", "</s>"],
    )

with open("vocab.txt", "w") as vocab_file:
    for token in vocab.ids_to_tokens:
        vocab_file.write("%s\n" % token)

Interface

# Special tokens are added with ids 0, 1, etc., and are never removed by a resize.
vocab = pyonmttok.Vocab(special_tokens: Optional[List[str]] = None)

# Read-only properties.
vocab.tokens_to_ids -> Dict[str, int]
vocab.ids_to_tokens -> List[str]
vocab.counters -> List[int]

# Get or set the ID returned for out-of-vocabulary tokens.
# By default, it is the ID of the token <unk> if present in the vocabulary, len(vocab) otherwise.
vocab.default_id -> int

vocab.lookup_token(token: str) -> int
vocab.lookup_index(index: int) -> str

# Calls lookup_token on a batch of tokens.
vocab.__call__(tokens: List[str]) -> List[int]

vocab.__len__() -> int                  # Implements: len(vocab)
vocab.__contains__(token: str) -> bool  # Implements: "hello" in vocab
vocab.__getitem__(token: str) -> int    # Implements: vocab["hello"]

# Add tokens to the vocabulary after tokenization.
# If a tokenizer is not set, the text is split on spaces.
vocab.add_from_text(text: str, tokenizer: Optional[pyonmttok.Tokenizer] = None) -> None
vocab.add_from_file(path: str, tokenizer: Optional[pyonmttok.Tokenizer] = None) -> None
vocab.add_token(token: str, count: int = 1) -> None

vocab.resize(maximum_size: int = 0, minimum_frequency: int = 1) -> None


# Build a vocabulary from an iterator of lines.
# If a tokenizer is not set, the lines are split on spaces.
pyonmttok.build_vocab_from_lines(
    lines: Iterable[str],
    tokenizer: Optional[pyonmttok.Tokenizer] = None,
    maximum_size: int = 0,
    minimum_frequency: int = 1,
    special_tokens: Optional[List[str]] = None,
) -> pyonmttok.Vocab

# Build a vocabulary from an iterator of tokens.
pyonmttok.build_vocab_from_tokens(
    tokens: Iterable[str],
    maximum_size: int = 0,
    minimum_frequency: int = 1,
    special_tokens: Optional[List[str]] = None,
) -> pyonmttok.Vocab
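
A short sketch of direct Vocab manipulation (token IDs are not shown since they depend on insertion order):

import pyonmttok

vocab = pyonmttok.Vocab(special_tokens=["<blank>", "<unk>"])
vocab.add_token("Hello")
vocab.add_token("World", count=3)

print(len(vocab))        # Number of entries, including special tokens.
print("Hello" in vocab)  # True
print(vocab["World"])    # ID assigned to "World".

# Drop tokens seen fewer than 2 times; special tokens are always kept.
vocab.resize(minimum_frequency=2)

# Vocabularies can also be built from an iterator of tokens.
vocab = pyonmttok.build_vocab_from_tokens(
    ["Hello", "World", "Hello"],
    special_tokens=["<unk>"],
)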

Token API

The Token API tokenizes text into pyonmttok.Token objects. It is useful for applying logic at the token level while retaining enough information to write the tokenization to disk or to detokenize.

Example

>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.tokenize("Hello World!", as_token_objects=True)
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokens[-1].surface
'!'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']
>>> tokens[-1].surface = '.'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■.']
>>> tokenizer.detokenize(tokens)
'Hello World.'

Interface

The pyonmttok.Token class has the following attributes:

  • surface: a string, the token value
  • type: a pyonmttok.TokenType value, the type of the token
  • join_left: a boolean, whether the token should be joined to the token on the left or not
  • join_right: a boolean, whether the token should be joined to the token on the right or not
  • preserve: a boolean, whether joiners and spacers can be attached to this token or not
  • features: a list of strings, the features attached to the token
  • spacer: a boolean, whether the token is prefixed by a SentencePiece spacer or not (only set when using SentencePiece)
  • casing: a pyonmttok.Casing value, the casing of the token (only set when tokenizing with case_feature or case_markup)

The pyonmttok.TokenType enumeration is used to identify tokens that were split by subword tokenization. The enumeration has the following values:

  • TokenType.WORD
  • TokenType.LEADING_SUBWORD
  • TokenType.TRAILING_SUBWORD

The pyonmttok.Casing enumeration is used to identify the original casing of a token that was lowercased by the case_feature or case_markup tokenization options. The enumeration has the following values:

  • Casing.LOWERCASE
  • Casing.UPPERCASE
  • Casing.MIXED
  • Casing.CAPITALIZED
  • Casing.NONE
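
For example, a minimal sketch inspecting these attributes (the casing values are only set here because case_markup is enabled; the exact output is not asserted):

import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", case_markup=True)
tokens = tokenizer.tokenize("Hello WORLD!", as_token_objects=True)
for token in tokens:
    print(token.surface, token.type, token.casing)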

Tokenizer instances provide methods to serialize or deserialize Token objects:

# Serialize Token objects to strings that can be saved on disk.
tokenizer.serialize_tokens(
    tokens: List[pyonmttok.Token],
) -> Tuple[List[str], Optional[List[List[str]]]]

# Deserialize strings into Token objects.
tokenizer.deserialize_tokens(
    tokens: List[str],
    features: Optional[List[List[str]]] = None,
) -> List[pyonmttok.Token]
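
A short round-trip sketch with deserialize_tokens, reusing the serialized strings from the Token API example above:

import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

tokens = tokenizer.deserialize_tokens(["Hello", "World", "■!"])
print(tokens[-1].join_left)          # True: "■!" attaches to the previous token.
print(tokenizer.detokenize(tokens))  # 'Hello World!'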

Utilities

Interface

# Returns True if the string has the placeholder format.
pyonmttok.is_placeholder(token: str)

# Sets the random seed for reproducible tokenization.
pyonmttok.set_random_seed(seed: int)

# Checks if the language code is valid.
pyonmttok.is_valid_language(lang: str)
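
A brief usage sketch of these utilities (placeholders use the ｟...｠ markers recognized by the library):

import pyonmttok

print(pyonmttok.is_placeholder("｟URL｠"))  # True
print(pyonmttok.is_placeholder("Hello"))   # False

# Make subword regularization (e.g. BPE dropout or SentencePiece sampling) reproducible.
pyonmttok.set_random_seed(42)

print(pyonmttok.is_valid_language("en"))   # True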
