Skip to main content

Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

Project description

wechsel

Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.

ArXiv: https://arxiv.org/abs/2112.06598

Models from the paper will be available on the Huggingface Hub.

Installation

We distribute a Python Package via PyPI:

pip install wechsel

Alternatively, clone the repository, install requirements.txt and run the code in src/.

Example usage

Transferring English roberta-base to Swahili:

import torch
from transformers import AutoModel, AutoTokenizer
from datasets import load_dataset
from wechsel import WECHSEL, load_embeddings

source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

target_tokenizer = source_tokenizer.train_new_from_iterator(
    load_dataset("oscar", "unshuffled_deduplicated_sw", split="train")["text"],
    vocab_size=len(source_tokenizer)
)

wechsel = WECHSEL(
    load_embeddings("en"),
    load_embeddings("sw"),
    bilingual_dictionary="swahili"
)

target_embeddings, info = wechsel.apply(
    source_tokenizer,
    target_tokenizer,
    model.get_input_embeddings().weight.detach().numpy(),
)

model.get_input_embeddings().weight.data = torch.from_numpy(target_embeddings)

# use `model` and `target_tokenizer` to continue training in Swahili!

Citation

Please cite WECHSEL as

@misc{minixhofer2021wechsel,
      title={WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models}, 
      author={Benjamin Minixhofer and Fabian Paischer and Navid Rekabsaz},
      year={2021},
      eprint={2112.06598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

wechsel-0.0.3.tar.gz (8.3 kB view details)

Uploaded Source

Built Distribution

wechsel-0.0.3-py3-none-any.whl (9.3 kB view details)

Uploaded Python 3

File details

Details for the file wechsel-0.0.3.tar.gz.

File metadata

  • Download URL: wechsel-0.0.3.tar.gz
  • Upload date:
  • Size: 8.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.7.1 importlib_metadata/4.8.2 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.0

File hashes

Hashes for wechsel-0.0.3.tar.gz
Algorithm Hash digest
SHA256 70325bd85e717da23bf4de8d0788871f945615cb04936127d0221a6d3ed5d0c3
MD5 a7074cf393211d91df60577eab4fb363
BLAKE2b-256 ed20af5906182b400d9e16b9500b3baa048c2be36f92005b23c5f4305d884117

See more details on using hashes here.

Provenance

File details

Details for the file wechsel-0.0.3-py3-none-any.whl.

File metadata

  • Download URL: wechsel-0.0.3-py3-none-any.whl
  • Upload date:
  • Size: 9.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.7

File hashes

Hashes for wechsel-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 7107a32b5fd02544634ab85ac2bf7268b66ce0fb7f1b7a9d5966cd00a1c9922b
MD5 d9961c41a18135f01a7b21f4015959f6
BLAKE2b-256 1d3aaaead7f1a9a5837595099bc5ded97e3dbb04dcd21b0340163ab67a0084d1

See more details on using hashes here.

Provenance

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page