Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels.
tftokenizers
Convert Huggingface tokenizers to Tensorflow tokenizers. The main goal is to bundle the tokenizer and model into a single Reusable SavedModel.
Source Code: https://github.com/Hugging-Face-Supporter/tftokenizers
Example
This example shows how to bundle a Huggingface model and tokenizer into a single Reusable SavedModel that yields the same result as using the model and tokenizer from Huggingface directly 🤗
import tensorflow as tf
from tftokenizers import TFModel
from tftokenizers import TFAutoTokenizer
from transformers import TFAutoModel
# Load base models from Huggingface
model_name = "bert-base-cased"
model = TFAutoModel.from_pretrained(model_name)
# Load converted TF tokenizer
tokenizer = TFAutoTokenizer(model_name)
# Create a TF Reusable SavedModel
custom_model = TFModel(model=model, tokenizer=tokenizer)
# Tokenizer and model can handle `tf.Tensors` or regular strings
tf_string = tf.constant(["Hello from Tensorflow"])
s1 = "SponGE bob SQuarePants is an avenger"
s2 = "Huggingface to Tensorflow tokenizers"
s3 = "Hello, world!"
output = custom_model(tf_string)
output = custom_model([s1, s2, s3])
# You can also pass arguments, similar to Huggingface tokenizers
output = custom_model(
    [s1, s2, s3],
    max_length=512,
    padding="max_length",
)
print(output)
# Save the bundled model and tokenizer
saved_name = "reusable_bert_tf"
tf.saved_model.save(custom_model, saved_name)
# Load the bundled model and tokenizer
reloaded_model = tf.saved_model.load(saved_name)
output = reloaded_model([s1, s2, s3])
print(output)
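To make the `max_length` and `padding="max_length"` arguments concrete, here is a minimal pure-Python sketch of what padding a batch to a fixed length means. The function name `pad_to_max_length`, the toy token ids, and the output keys are assumptions for illustration, not this library's code. `pad_id=0` is assumed because BERT vocabularies place `[PAD]` at id 0.

```python
from typing import Dict, List


def pad_to_max_length(
    batch_ids: List[List[int]], max_length: int, pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to max_length and build the attention mask.

    Hypothetical helper: sequences longer than max_length are left as-is,
    since padding alone does not truncate.
    """
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max(0, max_length - len(ids))
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy ids, not real tokenizer output
batch = pad_to_max_length([[101, 7, 102], [101, 7, 8, 9, 102]], max_length=5)
print(batch["input_ids"])
print(batch["attention_mask"])
```

With a fixed length, every row in the batch has the same shape, which is what lets a batch of strings of different lengths be stacked into a single tensor.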
Setup
git clone https://github.com/Hugging-Face-Supporter/tftokenizers.git
cd tftokenizers
poetry install
poetry shell
Run
To convert a Huggingface tokenizer to Tensorflow, first choose a model or tokenizer from the Huggingface hub to download.
NOTE
Currently only BERT models work with the converter.
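For context on what a converted BERT tokenizer has to reproduce: BERT tokenizers split each word with WordPiece, a greedy longest-match-first lookup against a fixed vocabulary. The sketch below is a plain-Python illustration with a toy vocabulary. The function name `wordpiece` and `TOY_VOCAB` are hypothetical, and this is not the converter's implementation.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]", prefix: str = "##") -> List[str]:
    """Split one word into sub-word pieces by greedy longest-match-first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # continuation pieces carry the "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word is unknown
            return [unk]
        tokens.append(match)
        start = end
    return tokens


TOY_VOCAB = {"token", "##izer", "##s", "hug", "##ging", "face"}
print(wordpiece("tokenizers", TOY_VOCAB))
print(wordpiece("hugging", TOY_VOCAB))
```

Other tokenizer families (for example byte-pair encoding or SentencePiece) use different splitting rules, which is one plausible reason support has to be added per tokenizer type.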
Download
First, download tokenizers from the hub by name. Either run the bash script to download multiple tokenizers, or download a single tokenizer with the Python script.
The eventual goal is to download and convert automatically in a single step.
python tftokenizers/download.py -n bert-base-uncased
bash scripts/download_tokenizers.sh
Convert
Convert the downloaded tokenizer from Huggingface format to Tensorflow:
python tftokenizers/convert.py
Before Commit
make build
WIP
- Convert a BERT tokenizer from Huggingface to Tensorflow
- Make a TF Reusable SavedModel with Tokenizer and Model in the same class. Emulate how the TF Hub example for BERT works.
- Find methods for identifying the base tokenizer model and map those settings and special tokens to new tokenizers
- Extend the tokenizers to more tokenizer types and identify them from a Huggingface model name
- Document how others can use the library and document the different stages in the process
- Improve the conversion pipeline (such as downloading and exporting files if they are not passed in or available locally)
- Convert other tokenizers. Identify limitations
- Support encoding of two sentences at a time
- Allow the tokenizers to be used for Masking (MLM)
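The sentence-pair item in the list above refers to BERT's two-segment input layout: `[CLS] A [SEP] B [SEP]`, with segment (token type) ids of 0 for the first sentence and 1 for the second. A minimal sketch of that layout, assuming the sentences are already tokenized; `encode_pair` is a hypothetical helper, not part of this library.

```python
from typing import List, Tuple


def encode_pair(
    tokens_a: List[str], tokens_b: List[str], cls: str = "[CLS]", sep: str = "[SEP]"
) -> Tuple[List[str], List[int]]:
    """Join two token lists into BERT's pair layout with segment ids."""
    tokens = [cls] + tokens_a + [sep] + tokens_b + [sep]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = encode_pair(["hello"], ["world", "!"])
print(tokens)
print(token_type_ids)
```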