Tools to adapt a pretrained model to a new vocabulary

Project description

Tokenizer Adapter

A simple tool to adapt a pretrained Hugging Face model to a new, domain-specific vocabulary with (almost) no training.
It should work with almost all language models from the Hugging Face Hub, although more testing is needed.

Install

pip install 

Usage

from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM

BASE_MODEL_PATH = "camembert-base"

# A simple corpus
corpus = ["A first sentence", "A second sentence", "blablabla"]

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)

# Train new vocabulary from the old tokenizer
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=300)

# Default params should work in most cases
adapter = TokenizerAdapter()

# Patch the model with the new tokenizer
model = adapter.adapt_from_pretrained(new_tokenizer, model, tokenizer)

# Save the model and the new tokenizer
model.save_pretrained("my_new_model/")
new_tokenizer.save_pretrained("my_new_model/")
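
To give an intuition for how a model can be moved to a new vocabulary with almost no training, here is a minimal sketch of one common idea: each new token's embedding is initialised as the average of the old embeddings of the pieces the old tokenizer splits that token into. This is an assumption for illustration only and is not taken from the library's source; `init_new_embeddings`, `old_tokenize` and the toy vocabularies are hypothetical names, and the real `TokenizerAdapter` may work differently.

```python
from typing import Callable, Dict, List

Vector = List[float]


def init_new_embeddings(
    new_vocab: Dict[str, int],
    old_tokenize: Callable[[str], List[int]],
    old_embeddings: List[Vector],
) -> List[Vector]:
    """Build an embedding table for new_vocab by averaging old embeddings.

    Hypothetical helper, not part of tokenizer_adapter.
    """
    dim = len(old_embeddings[0])
    # Fallback for tokens the old tokenizer cannot represent:
    # the mean of all old embedding vectors.
    fallback = [sum(col) / len(old_embeddings) for col in zip(*old_embeddings)]
    new_embeddings = [list(fallback) for _ in range(len(new_vocab))]

    for token, new_id in new_vocab.items():
        old_ids = old_tokenize(token)
        if not old_ids:
            continue  # keep the fallback vector
        # Average the old vectors of the pieces that make up the new token.
        new_embeddings[new_id] = [
            sum(old_embeddings[i][d] for i in old_ids) / len(old_ids)
            for d in range(dim)
        ]
    return new_embeddings


# Toy data: a three-token "old" model with 2-dimensional embeddings.
old_vocab = {"a": 0, "b": 1, "c": 2}
old_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]


def old_tokenize(text: str) -> List[int]:
    # Character-level toy tokenizer; unknown characters are dropped.
    return [old_vocab[ch] for ch in text if ch in old_vocab]


new_vocab = {"ab": 0, "c": 1, "zz": 2}
new_emb = init_new_embeddings(new_vocab, old_tokenize, old_embeddings)
```

In this toy setup "ab" receives the average of the vectors for "a" and "b", "c" keeps its old vector, and "zz" (unknown to the old tokenizer) falls back to the mean of all old vectors.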

Download files

Source Distribution

tokenizer-adapter-0.1.0.tar.gz (7.9 kB)

Uploaded Source

Built Distribution

tokenizer_adapter-0.1.0-py3-none-any.whl (8.5 kB)

Uploaded Python 3

File details

Details for the file tokenizer-adapter-0.1.0.tar.gz.

File metadata

  • Download URL: tokenizer-adapter-0.1.0.tar.gz
  • Upload date:
  • Size: 7.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for tokenizer-adapter-0.1.0.tar.gz
Algorithm Hash digest
SHA256 b0fc0884e490d136b9bd587686ae8570498d5047a1edb61ac5915c7a2edd1546
MD5 8141e64a5299ff46a80d762fedde78e9
BLAKE2b-256 b61ff0f72e7d90e5e0332854a62d28abfed6630f65af3149ada1b4c2352d6b5e
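
The digests listed above can be used to check a downloaded file. The following is a minimal sketch using only Python's standard `hashlib`; the helper names `file_digests` and `matches_published` are hypothetical, and it assumes that "BLAKE2b-256" means BLAKE2b with a 32-byte digest.

```python
import hashlib


def file_digests(path, chunk_size=65536):
    """Return the SHA256 and BLAKE2b-256 hex digests of a file."""
    sha256 = hashlib.sha256()
    # Assumption: BLAKE2b-256 is BLAKE2b with a 32-byte (256-bit) digest.
    blake2b_256 = hashlib.blake2b(digest_size=32)
    with open(path, "rb") as fh:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha256.update(chunk)
            blake2b_256.update(chunk)
    return {
        "sha256": sha256.hexdigest(),
        "blake2b_256": blake2b_256.hexdigest(),
    }


def matches_published(path, expected_sha256):
    """Compare a file's SHA256 digest with a published hex digest."""
    return file_digests(path)["sha256"] == expected_sha256.strip().lower()
```

For example, `matches_published("tokenizer-adapter-0.1.0.tar.gz", "b0fc0884e490d136b9bd587686ae8570498d5047a1edb61ac5915c7a2edd1546")` should return `True` for an intact copy of the source distribution.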

File details

Details for the file tokenizer_adapter-0.1.0-py3-none-any.whl.

File hashes

Hashes for tokenizer_adapter-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 e4eb8756f313015bb5c412297683e79545058b015c33e545b18561c7558f318d
MD5 445944a317ce0bbacfa93e49512b6f6b
BLAKE2b-256 c808b1136d77d00fc3cb8b29d4fc6bff39b39ba242a222c50ee65f6136737d24
