Skip to main content

Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

Project description

wechsel

Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

ArXiv: https://arxiv.org/abs/2112.06598

Models from the paper will be available on the Huggingface Hub.

Installation

We distribute a Python Package via PyPI:

pip install wechsel

Alternatively, clone the repository, install requirements.txt and run the code in src/.

Example usage

Transferring English roberta-base to Swahili:

import torch
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset
from wechsel import WECHSEL, load_embeddings

source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

target_tokenizer = source_tokenizer.train_new_from_iterator(
    load_dataset("oscar", "unshuffled_deduplicated_sw", split="train")["text"],
    vocab_size=len(source_tokenizer)
)

wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("sw"),
    bilingual_dictionary="swahili"
)

target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)

# use `model` and `target_tokenizer` to continue training in Swahili!

Citation

Please cite WECHSEL as

@misc{minixhofer2021wechsel,
      title={WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models}, 
      author={Benjamin Minixhofer and Fabian Paischer and Navid Rekabsaz},
      year={2021},
      eprint={2112.06598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wechsel-0.0.2.tar.gz (2.9 kB view details)

Uploaded Source

File details

Details for the file wechsel-0.0.2.tar.gz.

File metadata

  • Download URL: wechsel-0.0.2.tar.gz
  • Upload date:
  • Size: 2.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.25.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.1

File hashes

Hashes for wechsel-0.0.2.tar.gz
Algorithm Hash digest
SHA256 d479c505435bd83d514c08e0e9430ef9400a0246d596202151625a2287ace650
MD5 5f9c37a92bcfd37615d8e62c72a98869
BLAKE2b-256 f5657c443f6ae2729216308ea3184070c109712de6a17425d55ff10d2f4beca7

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page