Fast and customizable text tokenization library with BPE and SentencePiece support
pyonmttok
pyonmttok is the Python wrapper for OpenNMT/Tokenizer, a fast and customizable text tokenization library with BPE and SentencePiece support.
Installation:
pip install pyonmttok
Requirements:
- OS: Linux, macOS, Windows
- Python version: >= 3.5
- pip version: >= 19.0
Tokenization
Example
>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'
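To make the role of the joiner marker concrete, here is a minimal pure-Python sketch of how joiner-annotated tokens can be turned back into text. It is an illustration only, not the library's implementation: the helper name `sketch_detokenize` is hypothetical, and it ignores spacers, case markup, and placeholders.

```python
JOINER = "■"  # default joiner character listed in the constructor options


def sketch_detokenize(tokens, joiner=JOINER):
    """Rebuild text from joiner-annotated tokens (simplified illustration)."""
    pieces = []
    glue_next = False  # True when the previous token ended with a joiner
    for token in tokens:
        # A leading joiner means "attach to the token on the left".
        join_left = token.startswith(joiner)
        if join_left:
            token = token[len(joiner):]
        # A trailing joiner means "attach to the token on the right".
        join_right = token.endswith(joiner)
        if join_right:
            token = token[:-len(joiner)]
        # Insert a space only when neither side asked to be joined.
        if pieces and not (join_left or glue_next):
            pieces.append(" ")
        pieces.append(token)
        glue_next = join_right
    return "".join(pieces)


print(sketch_detokenize(["Hello", "World", "■!"]))
```

Because the marker records where whitespace was absent in the input, the token list alone is enough to restore the original string.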
Interface
Constructor
tokenizer = pyonmttok.Tokenizer(
    mode: str,
    *,
    lang: Optional[str] = None,
    bpe_model_path: Optional[str] = None,
    bpe_dropout: float = 0,
    vocabulary_path: Optional[str] = None,
    vocabulary_threshold: int = 0,
    sp_model_path: Optional[str] = None,
    sp_nbest_size: int = 0,
    sp_alpha: float = 0.1,
    joiner: str = "■",
    joiner_annotate: bool = False,
    joiner_new: bool = False,
    support_prior_joiners: bool = False,
    spacer_annotate: bool = False,
    spacer_new: bool = False,
    case_feature: bool = False,
    case_markup: bool = False,
    soft_case_regions: bool = False,
    no_substitution: bool = False,
    with_separators: bool = False,
    preserve_placeholders: bool = False,
    preserve_segmented_tokens: bool = False,
    segment_case: bool = False,
    segment_numbers: bool = False,
    segment_alphabet_change: bool = False,
    segment_alphabet: Optional[List[str]] = None,
)
# SentencePiece-compatible tokenizer.
tokenizer = pyonmttok.SentencePieceTokenizer(
    model_path: str,
    vocabulary_path: Optional[str] = None,
    vocabulary_threshold: int = 0,
    nbest_size: int = 0,
    alpha: float = 0.1,
)
# Copy constructor.
tokenizer = pyonmttok.Tokenizer(tokenizer: pyonmttok.Tokenizer)
# Return the tokenization options (excluding options related to subword).
tokenizer.options
See the documentation for a description of each tokenization option.
Tokenization
# By default, tokenize returns the tokens and features.
# When as_token_objects=True, the method returns Token objects (see below).
# When training=False, subword regularization such as BPE dropout is disabled.
tokenizer.tokenize(
    text: str,
    as_token_objects: bool = False,
    training: bool = True,
) -> Union[Tuple[List[str], Optional[List[List[str]]]], List[pyonmttok.Token]]
# Tokenize a file.
tokenizer.tokenize_file(
    input_path: str,
    output_path: str,
    num_threads: int = 1,
    verbose: bool = False,
    training: bool = True,
    tokens_delimiter: str = " ",
)
Detokenization
# The detokenize method converts a list of tokens back to a string.
tokenizer.detokenize(
    tokens: List[str],
    features: Optional[List[List[str]]] = None,
) -> str
tokenizer.detokenize(tokens: List[pyonmttok.Token]) -> str
# The detokenize_with_ranges method also returns a dictionary mapping a token
# index to a range in the detokenized text.
# Set merge_ranges=True to merge consecutive ranges, e.g. subwords of the same
# token in case of subword tokenization.
# Set unicode_ranges=True to return ranges over Unicode characters instead of bytes.
tokenizer.detokenize_with_ranges(
    tokens: Union[List[str], List[pyonmttok.Token]],
    merge_ranges: bool = False,
    unicode_ranges: bool = False,
) -> Tuple[str, Dict[int, Tuple[int, int]]]
# Detokenize a file.
tokenizer.detokenize_file(
    input_path: str,
    output_path: str,
    tokens_delimiter: str = " ",
)
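The difference between byte ranges and Unicode character ranges is easiest to see on text with non-ASCII characters. The following pure-Python sketch mimics the idea of mapping token indices to ranges in the detokenized text. It is not the library's implementation: the function name `sketch_detokenize_with_ranges` is hypothetical, the inclusive `(start, end)` convention is an assumption of this sketch, and `merge_ranges` is not modeled.

```python
def sketch_detokenize_with_ranges(tokens, joiner="■", unicode_ranges=False):
    """Return (text, {token_index: (start, end)}) with inclusive offsets.

    Offsets count UTF-8 bytes by default and Unicode characters when
    unicode_ranges=True. The inclusive convention is an assumption here.
    """

    def position(text):
        # Current length of the text in the selected unit.
        return len(text) if unicode_ranges else len(text.encode("utf-8"))

    text = ""
    ranges = {}
    glue_next = False  # True when the previous token ended with a joiner
    for index, token in enumerate(tokens):
        join_left = token.startswith(joiner)
        if join_left:
            token = token[len(joiner):]
        join_right = token.endswith(joiner)
        if join_right:
            token = token[:-len(joiner)]
        if text and not (join_left or glue_next):
            text += " "
        if token:
            start = position(text)
            text += token
            ranges[index] = (start, position(text) - 1)
        glue_next = join_right
    return text, ranges


tokens = ["Héllo", "World", "■!"]
print(sketch_detokenize_with_ranges(tokens))
print(sketch_detokenize_with_ranges(tokens, unicode_ranges=True))
```

Since "é" occupies two bytes in UTF-8, every byte offset after it is shifted by one relative to the character offset, which is why the two modes disagree on this input.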
Subword learning
Example
The Python wrapper supports BPE and SentencePiece subword learning through a common interface:
1. Create the subword learner with the tokenization you want to apply, e.g.:
# BPE is trained and applied on the tokenization output before joiner (or spacer) annotations.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)
# SentencePiece can learn from raw sentences so a tokenizer is not required.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.98)
2. Feed some raw data:
# Feed detokenized sentences:
learner.ingest("Hello world!")
learner.ingest("How are you?")
# or detokenized text files:
learner.ingest_file("/data/train1.en")
learner.ingest_file("/data/train2.en")
3. Start the learning process:
tokenizer = learner.learn("/data/model-32k")
The returned tokenizer instance can be used to apply subword tokenization on new data.
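For readers unfamiliar with what a BPE learner computes, the core loop can be sketched in a few lines of pure Python: count adjacent symbol pairs, merge the most frequent pair, and repeat. This is a toy illustration under simplifying assumptions (no end-of-word marker, no minimum frequency, deterministic tie-breaking chosen for this sketch) and is not how pyonmttok.BPELearner is implemented. The function name `learn_bpe_merges` is hypothetical.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges merge operations from a list of words (toy BPE)."""
    # Represent each distinct word as a tuple of symbols with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged into one symbol.
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges


print(learn_bpe_merges(["low", "low", "lower", "lowest"], 2))
```

The learned merge list plays the role of the model file: applying the merges in order to new words yields their subword segmentation.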
Interface
# See https://github.com/rsennrich/subword-nmt/blob/master/subword_nmt/learn_bpe.py
# for argument documentation.
learner = pyonmttok.BPELearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "space".
    symbols: int = 10000,
    min_frequency: int = 2,
    total_symbols: bool = False,
)
# See https://github.com/google/sentencepiece/blob/master/src/spm_train_main.cc
# for available training options.
learner = pyonmttok.SentencePieceLearner(
    tokenizer: Optional[pyonmttok.Tokenizer] = None,  # Defaults to tokenization mode "none".
    keep_vocab: bool = False,  # Keep the generated vocabulary (model_path will act like model_prefix in spm_train)
    **training_options,
)
learner.ingest(text: str)
learner.ingest_file(path: str)
learner.ingest_token(token: Union[str, pyonmttok.Token])
learner.learn(model_path: str, verbose: bool = False) -> pyonmttok.Tokenizer
Token API
The Token API tokenizes text into pyonmttok.Token objects. It is useful for applying logic at the token level while retaining enough information to write the tokenization to disk or to detokenize.
Example
>>> tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
>>> tokens = tokenizer.tokenize("Hello World!", as_token_objects=True)
>>> tokens
[Token('Hello'), Token('World'), Token('!', join_left=True)]
>>> tokens[-1].surface
'!'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■!']
>>> tokens[-1].surface = '.'
>>> tokenizer.serialize_tokens(tokens)[0]
['Hello', 'World', '■.']
>>> tokenizer.detokenize(tokens)
'Hello World.'
Interface
The pyonmttok.Token class has the following attributes:
- surface: a string, the token value
- type: a pyonmttok.TokenType value, the type of the token
- join_left: a boolean, whether the token should be joined to the token on the left or not
- join_right: a boolean, whether the token should be joined to the token on the right or not
- preserve: a boolean, whether joiners and spacers can be attached to this token or not
- features: a list of strings, the features attached to the token
- spacer: a boolean, whether the token is prefixed by a SentencePiece spacer or not (only set when using SentencePiece)
- casing: a pyonmttok.Casing value, the casing of the token (only set when tokenizing with case_feature or case_markup)
The pyonmttok.TokenType enumeration is used to identify tokens that were split by a subword tokenization. The enumeration has the following values:
TokenType.WORD
TokenType.LEADING_SUBWORD
TokenType.TRAILING_SUBWORD
The pyonmttok.Casing enumeration is used to identify the original casing of a token that was lowercased by the case_feature or case_markup tokenization options. The enumeration has the following values:
Casing.LOWERCASE
Casing.UPPERCASE
Casing.MIXED
Casing.CAPITALIZED
Casing.NONE
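As a rough illustration of what these five values distinguish, the sketch below classifies a token surface into an equivalent category using only Python string methods. The enum `SketchCasing` and the function `classify_casing` are hypothetical stand-ins, and the exact rules the library applies (for example to single-letter tokens or tokens mixing letters and digits) may differ from the assumptions made here.

```python
import enum


class SketchCasing(enum.Enum):
    """Stand-in for the casing categories listed above (not the library enum)."""
    LOWERCASE = "lowercase"
    UPPERCASE = "uppercase"
    MIXED = "mixed"
    CAPITALIZED = "capitalized"
    NONE = "none"


def classify_casing(surface):
    """Classify the casing of a token surface (simplified assumption-based rules)."""
    # Tokens without any cased character (digits, punctuation) have no casing.
    if not any(c.isupper() or c.islower() for c in surface):
        return SketchCasing.NONE
    if surface.islower():
        return SketchCasing.LOWERCASE
    if surface.isupper():
        return SketchCasing.UPPERCASE
    # First character uppercase, everything after it lowercase.
    if surface[0].isupper() and surface[1:].islower():
        return SketchCasing.CAPITALIZED
    return SketchCasing.MIXED


for word in ["hello", "HELLO", "Hello", "hELLo", "123"]:
    print(word, classify_casing(word).name)
```

Recording a category like this next to a lowercased surface is what lets the original casing be restored at detokenization time.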
The Tokenizer instances provide methods to serialize or deserialize Token objects:
# Serialize Token objects to strings that can be saved on disk.
tokenizer.serialize_tokens(
    tokens: List[pyonmttok.Token],
) -> Tuple[List[str], Optional[List[List[str]]]]
# Deserialize strings into Token objects.
tokenizer.deserialize_tokens(
    tokens: List[str],
    features: Optional[List[List[str]]] = None,
) -> List[pyonmttok.Token]
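The relationship between Token objects and their serialized strings can be sketched with a small stand-in class. The `SketchToken` dataclass and the `serialize` and `deserialize` helpers below are hypothetical and cover only the surface, join_left, and join_right attributes; features, spacers, casing, and the preserve flag are deliberately left out, so this is a simplified model rather than the library's behavior.

```python
from dataclasses import dataclass

JOINER = "■"  # default joiner character


@dataclass
class SketchToken:
    """Minimal stand-in for a token with join information."""
    surface: str
    join_left: bool = False
    join_right: bool = False


def serialize(tokens, joiner=JOINER):
    """Turn token objects into joiner-annotated strings."""
    return [
        (joiner if t.join_left else "") + t.surface + (joiner if t.join_right else "")
        for t in tokens
    ]


def deserialize(strings, joiner=JOINER):
    """Parse joiner-annotated strings back into token objects."""
    tokens = []
    for s in strings:
        join_left = s.startswith(joiner)
        if join_left:
            s = s[len(joiner):]
        join_right = s.endswith(joiner)
        if join_right:
            s = s[:-len(joiner)]
        tokens.append(SketchToken(s, join_left, join_right))
    return tokens


tokens = [SketchToken("Hello"), SketchToken("World"), SketchToken("!", join_left=True)]
print(serialize(tokens))
tokens[-1].surface = "."  # edit at the token level, then serialize again
print(serialize(tokens))
```

Keeping the join flags separate from the surface is what makes token-level edits safe: the surface can be changed freely without having to re-insert the joiner marker by hand.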
Utilities
Interface
# Returns True if the string has the placeholder format.
pyonmttok.is_placeholder(token: str)
# Sets the random seed for reproducible tokenization.
pyonmttok.set_random_seed(seed: int)