Convert tokenizers into OpenVINO models
OpenVINO Tokenizers
OpenVINO Tokenizers adds text processing operations to OpenVINO.
Features
- Perform tokenization and detokenization without third-party dependencies
- Convert a HuggingFace tokenizer into an OpenVINO tokenizer model and detokenizer model
- Combine OpenVINO models into a single model
- Add a greedy decoding pipeline to a text generation model
Installation
(Recommended) Create and activate virtual env:
python3 -m venv venv
source venv/bin/activate
# or
conda create --name openvino_tokenizers
conda activate openvino_tokenizers
Minimal Installation
Use minimal installation when you have a converted OpenVINO tokenizer:
pip install openvino-tokenizers
# or
conda install -c conda-forge openvino openvino-tokenizers
Convert Tokenizers Installation
If you want to convert HuggingFace tokenizers into OpenVINO tokenizers:
pip install openvino-tokenizers[transformers]
# or
conda install -c conda-forge openvino openvino-tokenizers && pip install transformers[sentencepiece] tiktoken
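To check which of the two installation variants an environment currently satisfies, a sketch like the following may help. It uses only the standard library. The helper name `installed_components` is illustrative and not part of OpenVINO Tokenizers, and the top-level module names are assumed to match the packages installed above.

```python
from importlib.util import find_spec

# Top-level module names assumed to correspond to the packages installed above.
RUNTIME_MODULES = ("openvino", "openvino_tokenizers")
CONVERSION_MODULES = ("transformers", "tiktoken")


def installed_components(modules):
    """Map each top-level module name to True if it can be found for import."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    runtime = installed_components(RUNTIME_MODULES)
    conversion = installed_components(CONVERSION_MODULES)
    print("Minimal installation ready:", all(runtime.values()))
    print("Conversion dependencies ready:", all(conversion.values()))
```

If the second line reports `False`, only already converted tokenizers can be used until the conversion dependencies are installed.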
Install Pre-release Version
Use openvino-tokenizers[transformers] to install tokenizers conversion dependencies.
pip install --pre -U openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
Build and Install from Source
Install the OpenVINO archive distribution. Use --no-deps to avoid installing OpenVINO from PyPI.
source path/to/installed/openvino/setupvars.sh
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
pip install --no-deps .
This command is the equivalent of minimal installation. Install tokenizers conversion dependencies if needed:
pip install transformers[sentencepiece] tiktoken
:warning: Latest commit of OpenVINO Tokenizers might rely on features that are not present in the release OpenVINO version. Use a nightly build of OpenVINO or build OpenVINO Tokenizers from a release branch if you have issues with the build process.
Build and install for development
source path/to/installed/openvino/setupvars.sh
git clone https://github.com/openvinotoolkit/openvino_tokenizers.git
cd openvino_tokenizers
pip install -e .[all]
# verify installation by running tests
cd tests/
pytest .
C++ Installation
You can use converted tokenizers in C++ pipelines with prebuilt binaries.
- Download the OpenVINO archive distribution for your OS from here and extract the archive.
- Download the OpenVINO Tokenizers prebuilt libraries from here. To ensure compatibility, the first three numbers of the OpenVINO Tokenizers version should match the OpenVINO version, and the library should be built for the same OS.
- Extract OpenVINO Tokenizers archive into OpenVINO installation directory:
- Windows:
<openvino_dir>\runtime\bin\intel64\Release\
- MacOS_x86:
<openvino_dir>/runtime/lib/intel64/Release
- MacOS_arm64:
<openvino_dir>/runtime/lib/arm64/Release/
- Linux_x86:
<openvino_dir>/runtime/lib/intel64/
- Linux_arm64:
<openvino_dir>/runtime/lib/aarch64/
After that you can add the binary extension in the code with:
- core.add_extension("openvino_tokenizers.dll") for Windows
- core.add_extension("libopenvino_tokenizers.dylib") for MacOS
- core.add_extension("libopenvino_tokenizers.so") for Linux
and read/compile converted (de)tokenizer models.
If you use version 2023.3.0.0, the binary extension file is called (lib)user_ov_extension.(dll/dylib/so).
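The platform-specific file names and the version rule above can be captured in a few lines. The following is a minimal sketch, not part of OpenVINO Tokenizers: the helper names are hypothetical, and the version check assumes plain dotted release strings such as `2024.1.0` and `2024.1.0.0` without build suffixes.

```python
import sys

# File names taken from the list above; keys are prefixes of sys.platform.
EXTENSION_LIBRARIES = {
    "win": "openvino_tokenizers.dll",
    "darwin": "libopenvino_tokenizers.dylib",
    "linux": "libopenvino_tokenizers.so",
}


def extension_library_name(platform=sys.platform):
    """Return the tokenizers extension file name for the given platform string."""
    for prefix, library in EXTENSION_LIBRARIES.items():
        if platform.startswith(prefix):
            return library
    raise ValueError(f"Unsupported platform: {platform}")


def versions_compatible(openvino_version, tokenizers_version):
    """Compare the first three dot-separated numbers of both version strings."""
    return openvino_version.split(".")[:3] == tokenizers_version.split(".")[:3]


if __name__ == "__main__":
    print(extension_library_name())
    print(versions_compatible("2024.1.0", "2024.1.0.0"))
```

The returned file name is what would be passed to `core.add_extension`, with a full path if the library is not on the default search path.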
Usage
:warning: OpenVINO Tokenizers can be inferred on a CPU device only.
Convert HuggingFace tokenizer
OpenVINO Tokenizers ships with a CLI tool that can convert tokenizers from the Huggingface Hub or Huggingface tokenizers saved on disk:
convert_tokenizer codellama/CodeLlama-7b-hf --with-detokenizer -o output_dir
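When several tokenizers need converting, the CLI call above can be scripted. The sketch below only assembles the same command line as an argument list and uses the flags shown above (`--with-detokenizer`, `-o`); the helper names `build_convert_command` and `run_conversion` are hypothetical.

```python
import shutil
import subprocess


def build_convert_command(model_id, output_dir, with_detokenizer=True):
    """Assemble the convert_tokenizer command line shown above as an argv list."""
    command = ["convert_tokenizer", model_id]
    if with_detokenizer:
        command.append("--with-detokenizer")
    command += ["-o", output_dir]
    return command


def run_conversion(model_id, output_dir):
    """Run the CLI tool; raises if it is missing from PATH or exits with an error."""
    command = build_convert_command(model_id, output_dir)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError("convert_tokenizer was not found on PATH")
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the command here; call run_conversion() to execute it.
    print(" ".join(build_convert_command("codellama/CodeLlama-7b-hf", "output_dir")))
```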
There is also a convert_tokenizer function that can convert a tokenizer Python object.
import numpy as np
from transformers import AutoTokenizer
from openvino import compile_model, save_model
from openvino_tokenizers import convert_tokenizer
hf_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ov_tokenizer = convert_tokenizer(hf_tokenizer)
compiled_tokenizer = compile_model(ov_tokenizer)
text_input = ["Test string"]
hf_output = hf_tokenizer(text_input, return_tensors="np")
ov_output = compiled_tokenizer(text_input)
for output_name in hf_output:
    print(f"OpenVINO {output_name} = {ov_output[output_name]}")
    print(f"HuggingFace {output_name} = {hf_output[output_name]}")
# OpenVINO input_ids = [[ 101 3231 5164 102]]
# HuggingFace input_ids = [[ 101 3231 5164 102]]
# OpenVINO token_type_ids = [[0 0 0 0]]
# HuggingFace token_type_ids = [[0 0 0 0]]
# OpenVINO attention_mask = [[1 1 1 1]]
# HuggingFace attention_mask = [[1 1 1 1]]
# save tokenizer for later use
save_model(ov_tokenizer, "openvino_tokenizer.xml")
loaded_tokenizer = compile_model("openvino_tokenizer.xml")
loaded_ov_output = loaded_tokenizer(text_input)
for output_name in hf_output:
    assert np.all(loaded_ov_output[output_name] == ov_output[output_name])
Connect Tokenizer to a Model
To infer and convert the original model, install torch or torch-cpu in the virtual environment.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from openvino import compile_model, convert_model
from openvino_tokenizers import convert_tokenizer, connect_models
checkpoint = "mrm8488/bert-tiny-finetuned-sms-spam-detection"
hf_tokenizer = AutoTokenizer.from_pretrained(checkpoint)
hf_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
text_input = ["Free money!!!"]
hf_input = hf_tokenizer(text_input, return_tensors="pt")
hf_output = hf_model(**hf_input)
ov_tokenizer = convert_tokenizer(hf_tokenizer)
ov_model = convert_model(hf_model, example_input=hf_input.data)
combined_model = connect_models(ov_tokenizer, ov_model)
compiled_combined_model = compile_model(combined_model)
openvino_output = compiled_combined_model(text_input)
print(f"OpenVINO logits: {openvino_output['logits']}")
# OpenVINO logits: [[ 1.2007061 -1.4698029]]
print(f"HuggingFace logits {hf_output.logits}")
# HuggingFace logits tensor([[ 1.2007, -1.4698]], grad_fn=<AddmmBackward0>)
Use Extension With Converted (De)Tokenizer or Model With (De)Tokenizer
Importing openvino_tokenizers adds all tokenizer-related operations to OpenVINO, after which you can work with saved tokenizers and detokenizers.
import numpy as np
import openvino_tokenizers
from openvino import Core
core = Core()
# detokenizer from codellama sentencepiece model
compiled_detokenizer = core.compile_model("detokenizer.xml")
token_ids = np.random.randint(100, 1000, size=(3, 5))
openvino_output = compiled_detokenizer(token_ids)
print(openvino_output["string_output"])
# ['sc�ouition�', 'intvenord hasient', 'g shouldwer M more']
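The `�` characters in the output above are most likely Unicode replacement characters: random token IDs can decode to byte sequences that are not valid UTF-8. This is an assumption about the cause, not something stated by the project. The effect itself can be reproduced with the standard library alone:

```python
# A truncated multi-byte UTF-8 sequence: the first two bytes of a three-byte character.
incomplete_bytes = "犬".encode("utf-8")[:2]

# errors="replace" substitutes U+FFFD (displayed as �) for the undecodable bytes.
decoded = (b"int" + incomplete_bytes).decode("utf-8", errors="replace")
print(decoded)
```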
Text generation pipeline
import numpy as np
from openvino import compile_model, convert_model
from openvino_tokenizers import add_greedy_decoding, convert_tokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
model_checkpoint = "JackFram/llama-68m"
hf_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
hf_model = AutoModelForCausalLM.from_pretrained(model_checkpoint, use_cache=False)
# convert hf tokenizer
text_input = ["Quick brown fox jumped "]
ov_tokenizer, ov_detokenizer = convert_tokenizer(hf_tokenizer, with_detokenizer=True, skip_special_tokens=True)
compiled_tokenizer = compile_model(ov_tokenizer)
# transform input text into tokens
ov_input = compiled_tokenizer(text_input)
hf_input = hf_tokenizer(text_input, return_tensors="pt")
# convert Pytorch model to OpenVINO IR and add greedy decoding pipeline to it
ov_model = convert_model(hf_model, example_input=hf_input.data)
ov_model_with_greedy_decoding = add_greedy_decoding(ov_model)
compiled_model = compile_model(ov_model_with_greedy_decoding)
# generate new tokens
new_tokens_size = 10
prompt_size = ov_input["input_ids"].shape[-1]
input_dict = {
    output.any_name: np.hstack([tensor, np.zeros(shape=(1, new_tokens_size), dtype=np.int_)])
    for output, tensor in ov_input.items()
}
for idx in range(prompt_size, prompt_size + new_tokens_size):
    output = compiled_model(input_dict)["token_ids"]
    input_dict["input_ids"][:, idx] = output[:, idx - 1]
    input_dict["attention_mask"][:, idx] = 1
ov_token_ids = input_dict["input_ids"]
hf_token_ids = hf_model.generate(
    **hf_input,
    min_new_tokens=new_tokens_size,
    max_new_tokens=new_tokens_size,
    temperature=0,  # greedy decoding
)
# decode model output
compiled_detokenizer = compile_model(ov_detokenizer)
ov_output = compiled_detokenizer(ov_token_ids)["string_output"]
hf_output = hf_tokenizer.batch_decode(hf_token_ids, skip_special_tokens=True)
print(f"OpenVINO output string: `{ov_output}`")
# OpenVINO output string: `['<s> Quick brown fox was walking through the forest. He was looking for something']`
print(f"HuggingFace output string: `{hf_output}`")
# HuggingFace output string: `['Quick brown fox was walking through the forest. He was looking for something']`
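The generation loop above pre-allocates room for the new tokens and fills one position per model call. The same pattern is sketched here with a stand-in model in pure Python. `toy_next_token_ids` and the vocabulary size are hypothetical; they only mimic the shape of the `token_ids` output (one predicted token per position), which is an assumption about what `add_greedy_decoding` produces based on how the loop indexes it.

```python
VOCAB_SIZE = 50  # hypothetical vocabulary size for the stand-in model


def toy_next_token_ids(input_ids, attention_mask):
    """Stand-in for the compiled model: one predicted next token per position.

    A real model would take the argmax over logits; this toy rule depends
    only on the tokens that the attention mask marks as visible.
    """
    predictions = []
    running_sum = 0
    for token, visible in zip(input_ids, attention_mask):
        if visible:
            running_sum += token
        predictions.append((running_sum * 7 + 3) % VOCAB_SIZE)
    return predictions


def greedy_generate(prompt_ids, new_tokens_size):
    """Fill a pre-allocated buffer one token at a time, as in the loop above."""
    prompt_size = len(prompt_ids)
    input_ids = list(prompt_ids) + [0] * new_tokens_size
    attention_mask = [1] * prompt_size + [0] * new_tokens_size
    for idx in range(prompt_size, prompt_size + new_tokens_size):
        output = toy_next_token_ids(input_ids, attention_mask)
        input_ids[idx] = output[idx - 1]  # prediction made at the previous position
        attention_mask[idx] = 1
    return input_ids


print(greedy_generate([1, 2, 3], new_tokens_size=4))
```

Because the whole padded sequence is re-evaluated on every step, this approach is simple but does repeated work; it mirrors the example above, where the model is loaded with `use_cache=False`.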
TensorFlow Text Integration
OpenVINO Tokenizers includes converters for certain TensorFlow Text operations. Currently, only the MUSE model is supported. Here is an example of model conversion and inference:
import numpy as np
import tensorflow_hub as hub
import tensorflow_text # register tf text ops
from openvino import convert_model, compile_model
import openvino_tokenizers # register ov tokenizer ops and translators
sentences = ["dog", "I cuccioli sono carini.", "私は犬と一緒にビーチを散歩するのが好きです"]
tf_embed = hub.load(
    "https://www.kaggle.com/models/google/universal-sentence-encoder/frameworks/"
    "TensorFlow2/variations/multilingual/versions/2"
)
# convert model that uses Sentencepiece tokenizer op from TF Text
ov_model = convert_model(tf_embed)
ov_embed = compile_model(ov_model, "CPU")
ov_result = ov_embed(sentences)[ov_embed.output()]
tf_result = tf_embed(sentences)
assert np.all(np.isclose(ov_result, tf_result, atol=1e-4))
RWKV Tokenizer
from urllib.request import urlopen
from openvino import compile_model
from openvino_tokenizers import build_rwkv_tokenizer
rwkv_vocab_url = (
    "https://raw.githubusercontent.com/BlinkDL/ChatRWKV/main/tokenizer/rwkv_vocab_v20230424.txt"
)
with urlopen(rwkv_vocab_url) as vocab_file:
    vocab = map(bytes.decode, vocab_file)
    tokenizer, detokenizer = build_rwkv_tokenizer(vocab)
tokenizer, detokenizer = compile_model(tokenizer), compile_model(detokenizer)
print(tokenized := tokenizer(["Test string"])["input_ids"]) # [[24235 47429]]
print(detokenizer(tokenized)["string_output"]) # ['Test string']
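The supported types table lists the RWKV tokenizer with the `Trie` model type. As a rough illustration of trie-based tokenization (greedy longest match against a vocabulary), here is a toy sketch in pure Python. The vocabulary and helper names are made up for the example, and the code does not reproduce the actual `build_rwkv_tokenizer` implementation.

```python
def build_trie(vocab):
    """Build a nested-dict trie; the key None stores the token ID at a node."""
    root = {}
    for token, token_id in vocab.items():
        node = root
        for char in token:
            node = node.setdefault(char, {})
        node[None] = token_id
    return root


def tokenize(text, trie):
    """Greedy longest-match tokenization of text against the trie."""
    ids = []
    position = 0
    while position < len(text):
        node = trie
        match_id, match_end = None, position
        for end in range(position, len(text)):
            node = node.get(text[end])
            if node is None:
                break
            if None in node:
                # Remember the longest token seen so far on this path.
                match_id, match_end = node[None], end + 1
        if match_id is None:
            raise ValueError(f"No token matches at position {position}")
        ids.append(match_id)
        position = match_end
    return ids


# Hypothetical toy vocabulary; real RWKV vocabularies are far larger.
toy_vocab = {"T": 0, "e": 1, "s": 2, "t": 3, " ": 4, "Te": 5, "Test": 6, "st": 7, "string": 12}
print(tokenize("Test string", build_trie(toy_vocab)))
```

The longest match wins at each position, so `Test` is preferred over `T` or `Te` even though all three are in the vocabulary.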
Supported Tokenizer Types
Huggingface Tokenizer Type | Tokenizer Model Type | Tokenizer | Detokenizer |
---|---|---|---|
Fast | WordPiece | ✅ | ❌ |
Fast | BPE | ✅ | ✅ |
Fast | Unigram | ❌ | ❌ |
Legacy | SentencePiece .model | ✅ | ✅ |
Custom | tiktoken | ✅ | ✅ |
RWKV | Trie | ✅ | ✅ |
Test Results
This report is autogenerated and includes tokenizer and detokenizer tests. The Output Matched, % column shows the percentage of test strings for which the results of OpenVINO and Huggingface Tokenizers are the same. To update the report, run pytest --update_readme tokenizers_test.py in the tests directory.
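To make the metric concrete, the sketch below computes a match percentage from two lists of outputs. It is an illustration of the definition above, not the project's test code; the function name and the sample token IDs are made up.

```python
def output_matched_percent(expected_outputs, actual_outputs):
    """Percentage of test strings whose two outputs are identical, rounded to 2 places."""
    if len(expected_outputs) != len(actual_outputs):
        raise ValueError("Both lists must cover the same test strings")
    if not expected_outputs:
        raise ValueError("At least one test string is required")
    matches = sum(e == a for e, a in zip(expected_outputs, actual_outputs))
    return round(100 * matches / len(expected_outputs), 2)


# Made-up token IDs for four test strings; the last pair differs.
hf_ids = [[101, 3231, 102], [101, 5164, 102], [101, 102], [101, 7592, 102]]
ov_ids = [[101, 3231, 102], [101, 5164, 102], [101, 102], [101, 100, 102]]
print(output_matched_percent(hf_ids, ov_ids))
```

With 181 test strings, 179 identical outputs give 98.9, which is consistent with the 98.90 entries in the tables.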
Output Match by Tokenizer Type
Tokenizer Type | Output Matched, % | Number of Tests |
---|---|---|
BPE | 96.82 | 3620 |
SentencePiece | 76.33 | 3620 |
Tiktoken | 97.71 | 218 |
WordPiece | 90.43 | 533 |
Output Match by Model
Tokenizer Type | Model | Output Matched, % | Number of Tests |
---|---|---|---|
BPE | EleutherAI/gpt-j-6b | 98.90 | 181 |
BPE | EleutherAI/gpt-neo-125m | 98.90 | 181 |
BPE | EleutherAI/gpt-neox-20b | 97.79 | 181 |
BPE | EleutherAI/pythia-12b-deduped | 97.79 | 181 |
BPE | KoboldAI/fairseq-dense-13B | 98.90 | 181 |
BPE | Salesforce/codegen-16B-multi | 97.79 | 181 |
BPE | ai-forever/rugpt3large_based_on_gpt2 | 97.79 | 181 |
BPE | bigscience/bloom | 99.45 | 181 |
BPE | databricks/dolly-v2-3b | 97.79 | 181 |
BPE | facebook/bart-large-mnli | 98.90 | 181 |
BPE | facebook/galactica-120b | 98.34 | 181 |
BPE | facebook/opt-66b | 98.90 | 181 |
BPE | gpt2 | 98.90 | 181 |
BPE | laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 65.19 | 181 |
BPE | microsoft/deberta-base | 98.90 | 181 |
BPE | roberta-base | 98.90 | 181 |
BPE | sentence-transformers/all-roberta-large-v1 | 98.90 | 181 |
BPE | stabilityai/stablecode-completion-alpha-3b-4k | 98.34 | 181 |
BPE | stabilityai/stablelm-2-1_6b | 98.34 | 181 |
BPE | stabilityai/stablelm-tuned-alpha-7b | 97.79 | 181 |
SentencePiece | NousResearch/Llama-2-13b-hf | 100.00 | 181 |
SentencePiece | NousResearch/Llama-2-13b-hf_slow | 100.00 | 181 |
SentencePiece | THUDM/chatglm2-6b | 100.00 | 181 |
SentencePiece | THUDM/chatglm2-6b_slow | 100.00 | 181 |
SentencePiece | THUDM/chatglm3-6b | 19.34 | 181 |
SentencePiece | THUDM/chatglm3-6b_slow | 19.34 | 181 |
SentencePiece | camembert-base | 0.55 | 181 |
SentencePiece | camembert-base_slow | 74.03 | 181 |
SentencePiece | codellama/CodeLlama-7b-hf | 100.00 | 181 |
SentencePiece | codellama/CodeLlama-7b-hf_slow | 100.00 | 181 |
SentencePiece | facebook/musicgen-small | 80.11 | 181 |
SentencePiece | facebook/musicgen-small_slow | 74.03 | 181 |
SentencePiece | microsoft/deberta-v3-base | 93.37 | 181 |
SentencePiece | microsoft/deberta-v3-base_slow | 100.00 | 181 |
SentencePiece | t5-base | 81.22 | 181 |
SentencePiece | t5-base_slow | 75.14 | 181 |
SentencePiece | xlm-roberta-base | 97.24 | 181 |
SentencePiece | xlm-roberta-base_slow | 97.24 | 181 |
SentencePiece | xlnet-base-cased | 61.33 | 181 |
SentencePiece | xlnet-base-cased_slow | 53.59 | 181 |
Tiktoken | Qwen/Qwen-14B-Chat | 98.17 | 109 |
Tiktoken | Salesforce/xgen-7b-8k-base | 97.25 | 109 |
WordPiece | ProsusAI/finbert | 95.12 | 41 |
WordPiece | bert-base-multilingual-cased | 95.12 | 41 |
WordPiece | bert-base-uncased | 95.12 | 41 |
WordPiece | cointegrated/rubert-tiny2 | 80.49 | 41 |
WordPiece | distilbert-base-uncased-finetuned-sst-2-english | 95.12 | 41 |
WordPiece | google/electra-base-discriminator | 95.12 | 41 |
WordPiece | google/mobilebert-uncased | 95.12 | 41 |
WordPiece | jhgan/ko-sbert-sts | 75.61 | 41 |
WordPiece | prajjwal1/bert-mini | 95.12 | 41 |
WordPiece | rajiv003/ernie-finetuned-qqp | 95.12 | 41 |
WordPiece | rasa/LaBSE | 87.80 | 41 |
WordPiece | sentence-transformers/all-MiniLM-L6-v2 | 75.61 | 41 |
WordPiece | squeezebert/squeezebert-uncased | 95.12 | 41 |
Recreating Tokenizers From Tests
For some tokenizers, you need to select certain settings so that their output is closer to the Huggingface tokenizers:
- The THUDM/chatglm2-6b detokenizer always skips special tokens. Use skip_special_tokens=True during conversion
- The THUDM/chatglm3-6b detokenizer doesn't skip special tokens. Use skip_special_tokens=False during conversion
- All tested tiktoken based detokenizers leave extra spaces. Use clean_up_tokenization_spaces=False during conversion
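These settings can be collected in one place and passed to `convert_tokenizer` as keyword arguments. The following is a minimal sketch under that assumption. The dictionary and helper names are hypothetical, and the `is_tiktoken` flag has to be supplied by the caller, since the sketch does not detect the tokenizer type.

```python
# Settings taken from the notes above; the structure around them is illustrative.
CONVERSION_SETTINGS = {
    "THUDM/chatglm2-6b": {"skip_special_tokens": True},
    "THUDM/chatglm3-6b": {"skip_special_tokens": False},
}
TIKTOKEN_SETTINGS = {"clean_up_tokenization_spaces": False}


def conversion_kwargs(model_id, is_tiktoken=False):
    """Return keyword arguments for convert_tokenizer for a tested model."""
    kwargs = {"with_detokenizer": True}
    kwargs.update(CONVERSION_SETTINGS.get(model_id, {}))
    if is_tiktoken:
        kwargs.update(TIKTOKEN_SETTINGS)
    return kwargs


def convert_with_settings(hf_tokenizer, model_id, is_tiktoken=False):
    """Convert a loaded HuggingFace tokenizer using the settings above."""
    # Imported lazily so the helpers above work without OpenVINO installed.
    from openvino_tokenizers import convert_tokenizer

    return convert_tokenizer(hf_tokenizer, **conversion_kwargs(model_id, is_tiktoken))
```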