tftokenizers Use Huggingface Transformer and Tokenizers as Tensorflow Reusable SavedModels.

TFtftransformers

Converts Huggingface tokenizers to Tensorflow tokenizers. The main reason is to be able to bundle the tokenizer and model into one Reusable SavedModel.


Source Code: https://github.com/Hugging-Face-Supporter/tftransformers


Example

This example bundles a Huggingface model and its tokenizer into a single Reusable SavedModel, which yields the same result as using the model and tokenizer from Huggingface 🤗 separately.

import tensorflow as tf
from tftransformers.model import TFModel
from tftransformers.tokenizer import TFAutoTokenizer
from transformers import TFAutoModel

# Load base models from Huggingface
model_name = "bert-base-cased"
model = TFAutoModel.from_pretrained(model_name)

# Load converted TF tokenizer
tokenizer = TFAutoTokenizer(model_name)

# Create a TF Reusable SavedModel
custom_model = TFModel(model=model, tokenizer=tokenizer)

# Tokenizer and model can handle `tf.Tensors` or regular strings
tf_string = tf.constant(["Hello from Tensorflow"])
s1 = "SponGE bob SQuarePants is an avenger"
s2 = "Huggingface to Tensorflow tokenizers"
s3 = "Hello, world!"

output = custom_model(tf_string)
output = custom_model([s1, s2, s3])

# You can also pass arguments, similar to Huggingface tokenizers
output = custom_model(
    [s1, s2, s3],
    max_length=512,
    padding="max_length",
)
print(output)

# Save the bundled tokenizer and model as a SavedModel
saved_name = "reusable_bert_tf"
tf.saved_model.save(custom_model, saved_name)

# Load the bundled tokenizer and model back from disk
reloaded_model = tf.saved_model.load(saved_name)
output = reloaded_model([s1, s2, s3])
print(output)
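
To make the `max_length` and `padding="max_length"` arguments above concrete, here is a minimal pure-Python sketch of what that padding does to a single tokenized sequence. The helper `pad_to_max_length` and the example ids are hypothetical illustrations and not part of this library. The sketch assumes a pad id of 0, and it leaves longer sequences untouched because no truncation was requested.

```python
from typing import Dict, List


def pad_to_max_length(
    input_ids: List[int], max_length: int, pad_id: int = 0
) -> Dict[str, List[int]]:
    """Right-pad one sequence of token ids up to `max_length`.

    Sequences that are already longer than `max_length` are returned
    unchanged; this sketch pads only and never truncates.
    """
    n_pad = max(0, max_length - len(input_ids))
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


# Illustrative ids only; they are not produced by the converted tokenizer.
encoded = pad_to_max_length([101, 8667, 117, 1362, 102], max_length=8)
print(encoded["input_ids"])       # [101, 8667, 117, 1362, 102, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The attention mask is what lets the model ignore the padded positions, which is why the bundled model can accept a batch of sentences with different lengths.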

Setup

git clone https://github.com/Hugging-Face-Supporter/tftransformers.git
cd tftransformers
poetry install
poetry shell

Run

To convert a Huggingface tokenizer to Tensorflow, first choose a model or tokenizer from the Huggingface hub to download.

NOTE

Currently only BERT models work with the converter.
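
Because only BERT works for now, it can help to check a checkpoint's architecture before attempting a conversion. The sketch below assumes the checkpoint's `config.json` has already been read into a dict, and it inspects the `model_type` field that Huggingface configs use to name the architecture family. `SUPPORTED_MODEL_TYPES` and `is_convertible` are hypothetical names and not part of this library.

```python
import json

# Hypothetical allow-list: the note above says only BERT works today.
SUPPORTED_MODEL_TYPES = {"bert"}


def is_convertible(config: dict) -> bool:
    """Return True if the config describes a supported architecture."""
    return config.get("model_type") in SUPPORTED_MODEL_TYPES


# Trimmed-down stand-ins for the contents of a checkpoint's config.json.
bert_config = json.loads('{"model_type": "bert"}')
gpt2_config = json.loads('{"model_type": "gpt2"}')

print(is_convertible(bert_config))  # True
print(is_convertible(gpt2_config))  # False
```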

Download

First download tokenizers from the hub by name. Either run the bash script to download multiple tokenizers, or download a single tokenizer with the Python script.

The idea is to eventually download and convert automatically.

python tf_transformers/download.py -n bert-base-uncased
bash scripts/download_tokenizers.sh

Convert

Convert the downloaded tokenizer from the Huggingface format to Tensorflow:

python tf_transformers/convert.py
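
The download and convert steps can also be driven from Python. This is a minimal sketch, assuming the scripts are invoked from the repository root with exactly the arguments shown above, and assuming a single `convert.py` call handles everything that was downloaded (the README does not confirm this). `build_commands` and `run_pipeline` are hypothetical helpers. By default the sketch only prints the commands.

```python
import subprocess
import sys
from typing import List


def build_commands(model_names: List[str]) -> List[List[str]]:
    """One download command per model name, followed by one convert command."""
    commands = [
        [sys.executable, "tf_transformers/download.py", "-n", name]
        for name in model_names
    ]
    commands.append([sys.executable, "tf_transformers/convert.py"])
    return commands


def run_pipeline(model_names: List[str], dry_run: bool = True) -> None:
    """Print each command and, unless `dry_run` is set, execute it."""
    for command in build_commands(model_names):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


run_pipeline(["bert-base-uncased", "bert-base-cased"])
```

Pass `dry_run=False` to actually run the scripts once the printed commands look right.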

Before Commit

make build

WIP

  • Convert a BERT tokenizer from Huggingface to Tensorflow
  • Make a TF Reusable SavedModel with Tokenizer and Model in the same class. Emulate how the TF Hub example for BERT works.
  • Find methods for identifying the base tokenizer model and map those settings and special tokens to new tokenizers
  • Extend the tokenizers to more tokenizer types and identify them from a huggingface model name
  • Document how others can use the library and document the different stages in the process
  • Convert other tokenizers. Identify limitations
  • Improve the conversion pipeline (such as downloading and exporting files if they are not passed in or available locally)
  • Support encoding of two sentences at a time
  • Allow the tokenizers to be used for Masking (MLM)
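
For the sentence-pair item in the list above, the target layout is the standard BERT one: `[CLS] A [SEP] B [SEP]`, with `token_type_ids` of 0 for the first segment and 1 for the second. Below is a minimal sketch on already-tokenized input. `encode_pair` is a hypothetical helper and the word pieces are illustrative, not output of this library.

```python
from typing import Dict, List


def encode_pair(tokens_a: List[str], tokens_b: List[str]) -> Dict[str, list]:
    """Join two tokenized sentences in the BERT sentence-pair layout."""
    first = ["[CLS]"] + tokens_a + ["[SEP]"]
    second = tokens_b + ["[SEP]"]
    return {
        "tokens": first + second,
        # Segment 0 covers [CLS], sentence A and its [SEP];
        # segment 1 covers sentence B and the final [SEP].
        "token_type_ids": [0] * len(first) + [1] * len(second),
    }


pair = encode_pair(["hello", "world"], ["how", "are", "you"])
print(pair["tokens"])
print(pair["token_type_ids"])  # [0, 0, 0, 0, 1, 1, 1, 1]
```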
