A simple tool to adapt a pretrained language model to a new vocabulary

Project description

Tokenizer Adapter

A simple tool to adapt a pretrained Hugging Face model to a new, domain-specific vocabulary with (almost) no training.
It should work with most language models from the Hugging Face Hub (more testing is needed).

Install

pip install tokenizer-adapter --upgrade

Usage

from tokenizer_adapter import TokenizerAdapter
from transformers import AutoTokenizer, AutoModelForMaskedLM

BASE_MODEL_PATH = "camembert-base"

# A simple corpus
corpus = ["A first sentence", "A second sentence", "blablabla"]

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained(BASE_MODEL_PATH)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_PATH)

# Train a new tokenizer (with a smaller, corpus-specific vocabulary) from the old one
new_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=300)

# Default params should work in most cases
adapter = TokenizerAdapter()

# Patch the model with the new tokenizer
model = adapter.adapt_from_pretrained(new_tokenizer, model, tokenizer)

# Save the model and the new tokenizer
model.save_pretrained("my_new_model/")
new_tokenizer.save_pretrained("my_new_model/")
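To give an intuition for what `adapt_from_pretrained` has to do, here is a minimal, hypothetical sketch (not this library's actual code) of the common embedding-averaging approach to vocabulary adaptation: each token of the new vocabulary is re-tokenized with the old tokenizer, and its new embedding is initialized as the mean of the old embeddings of those pieces.

```python
# Hypothetical sketch of vocabulary adaptation by embedding averaging.
# NOT tokenizer-adapter's actual implementation; names and the fallback
# behavior here are illustrative assumptions.

def adapt_embeddings(old_embeddings, old_tokenize, new_vocab):
    """old_embeddings: dict mapping old token -> embedding vector (list of floats)
    old_tokenize: function mapping a string to a list of old tokens
    new_vocab: list of tokens in the new vocabulary"""
    dim = len(next(iter(old_embeddings.values())))
    new_embeddings = {}
    for token in new_vocab:
        # Decompose the new token using the old tokenizer
        pieces = [p for p in old_tokenize(token) if p in old_embeddings]
        if not pieces:
            # No known pieces: fall back to a zero vector (a real tool
            # would likely use the unknown-token embedding instead)
            new_embeddings[token] = [0.0] * dim
            continue
        # Average the old embeddings of the pieces, component-wise
        vecs = [old_embeddings[p] for p in pieces]
        new_embeddings[token] = [sum(xs) / len(vecs) for xs in zip(*vecs)]
    return new_embeddings

# Toy example: the "old" vocabulary is character-level,
# so `list` serves as the old tokenizer.
old_emb = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
new_emb = adapt_embeddings(old_emb, list, ["ab", "a"])
# "ab" decomposes into "a" + "b", so its embedding is their mean: [0.5, 0.5]
```

The model's embedding matrix (and tied LM head, if any) is then resized to the new vocabulary and filled with these vectors, which is why little to no further training is required.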

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tokenizer-adapter-0.1.1.tar.gz (8.0 kB)

Uploaded Source

Built Distribution

tokenizer_adapter-0.1.1-py3-none-any.whl (8.5 kB)

Uploaded Python 3

File details

Details for the file tokenizer-adapter-0.1.1.tar.gz.

File metadata

  • Download URL: tokenizer-adapter-0.1.1.tar.gz
  • Upload date:
  • Size: 8.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.18

File hashes

Hashes for tokenizer-adapter-0.1.1.tar.gz
Algorithm Hash digest
SHA256 ce472ca7d9f4a37a9bd8818c34fc9779a443bbc02b9b14b6390d407c5c639530
MD5 c4de36678a1d0969dbcfe1f9ab445d6e
BLAKE2b-256 4cbe2c58feeb8e5582316af6d260e48e70fe2d7917c7ad3b54168f75447276c8


File details

Details for the file tokenizer_adapter-0.1.1-py3-none-any.whl.

File metadata

File hashes

Hashes for tokenizer_adapter-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 a73f81d769116dbf1901196016dae1aa17f68d4ae26dcffd7f4ced85b796d074
MD5 87ed4655a10e53feed123bf9df0f5f0b
BLAKE2b-256 9832e235680b3ff62a3c56db2f383e63f1c511a0916f970145a3c118d6cfe8e7

