Skip to main content

Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

Project description

wechsel

Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

Example usage

Transferring English roberta-base to Swahili:

import torch
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset
from wechsel import WECHSEL, load_embeddings

source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

target_tokenizer = source_tokenizer.train_new_from_iterator(
    load_dataset("oscar", "unshuffled_deduplicated_sw", split="train")["text"],
    vocab_size=len(source_tokenizer)
)

wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("sw"),
    bilingual_dictionary="swahili"
)

target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)

# use `model` and `target_tokenizer` to continue training in Swahili!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wechsel-0.0.1.tar.gz (2.4 kB view details)

Uploaded Source

File details

Details for the file wechsel-0.0.1.tar.gz.

File metadata

  • Download URL: wechsel-0.0.1.tar.gz
  • Upload date:
  • Size: 2.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.0

File hashes

Hashes for wechsel-0.0.1.tar.gz
Algorithm Hash digest
SHA256 76ec08b483520aa6782d1a6045563f2b741cf22e1eb64471cc81557d4b81e16b
MD5 5bf55ed42a4cc1744982116842d6012b
BLAKE2b-256 9da9568ef34908f5471d750969320b15ebc47d168bf0bdf404552a5e3f29f5f9

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page