langsfer

A library for language transfer methods and algorithms.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

Anes.Benmerzoug

These details have not been verified by PyPI

Project description

Langsfer, a library for language transfer methods and algorithms.

Language transfer refers to a few related things:

initializing a Large Language Model (LLM) in a new, typically low-resource, target language (e.g. German, Arabic) from another LLM trained in a high-resource source language (e.g. English),
extending the vocabulary of an LLM by adding new tokens and initializing their embeddings in a manner that allows them to be used with little to no extra training,
specializing the vocabulary of a multilingual LLM to one of its supported languages.

The library currently implements the following methods:

WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. Minixhofer, Benjamin, Fabian Paischer, and Navid Rekabsaz. arXiv preprint arXiv:2112.06598 (2021).
CLP-Transfer: Efficient language model training through cross-lingual and progressive transfer learning. Ostendorff, Malte, and Georg Rehm. arXiv preprint arXiv:2301.09626 (2023).
FOCUS: Effective Embedding Initialization for Specializing Pretrained Multilingual Models on a Single Language. Dobler, Konstantin, and Gerard de Melo. arXiv preprint arXiv:2305.14481 (2023).
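
Going by the `WeightedAverageEmbeddingsInitialization` class used later in this README, a common building block of these methods appears to be a weighted average of source embeddings. The sketch below illustrates that idea on toy data using only the standard library. It is not Langsfer's implementation: the function names, the softmax weighting and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_target_embedding(target_aux, source_aux, source_embeddings):
    """Hypothetical helper: average the source embeddings, weighted by how
    similar each source token is to the target token in an auxiliary space."""
    similarities = [cosine(target_aux, s) for s in source_aux]
    weights = softmax(similarities)
    dim = len(source_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, source_embeddings))
        for d in range(dim)
    ]


# Toy data: two source tokens with 2D auxiliary vectors and 3D model embeddings.
source_aux = [[1.0, 0.0], [0.0, 1.0]]
source_embeddings = [[2.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
# The target token is equally similar to both source tokens.
target_aux = [1.0, 1.0]

vec = init_target_embedding(target_aux, source_aux, source_embeddings)
print(vec)  # both source tokens get weight 0.5
```

The methods differ mainly in where the similarities come from and how they are turned into weights, which is what the strategy objects in the advanced example control.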

Langsfer is flexible enough to allow mixing and matching strategies between different embedding initialization schemes. For example, you can combine fuzzy token overlap with the CLP-Transfer method to refine the initialization process based on fuzzy matches between source and target tokens. This flexibility enables you to experiment with a variety of strategies for different language transfer tasks, making it easier to fine-tune models for your specific use case.
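
To make the difference between exact and fuzzy token overlap concrete, here is a minimal standard-library sketch. The canonicalization rule (strip the word-boundary markers `Ġ`, `▁` and `##`, then lowercase) is an assumption chosen for illustration and may differ from what Langsfer's `FuzzyMatchTokenOverlap` actually does.

```python
def canonicalize(token):
    """Strip common word-boundary markers and lowercase (illustrative rule set)."""
    for marker in ("Ġ", "▁"):
        if token.startswith(marker):
            token = token[len(marker):]
    if token.startswith("##"):
        token = token[2:]
    return token.lower()


def exact_overlap(source_vocab, target_vocab):
    """Tokens whose surface form is identical in both vocabularies."""
    return sorted(set(source_vocab) & set(target_vocab))


def fuzzy_overlap(source_vocab, target_vocab):
    """Map each target token to the source tokens sharing its canonical form."""
    by_canonical = {}
    for token in source_vocab:
        by_canonical.setdefault(canonicalize(token), []).append(token)
    return {
        token: by_canonical[canonicalize(token)]
        for token in target_vocab
        if canonicalize(token) in by_canonical
    }


# Toy vocabularies using two different tokenizer conventions.
source_vocab = ["Ġhaus", "Ġthe", "ing"]
target_vocab = ["▁Haus", "▁der", "ing"]

exact = exact_overlap(source_vocab, target_vocab)
fuzzy = fuzzy_overlap(source_vocab, target_vocab)
print(exact)  # ['ing']
print(fuzzy)  # {'▁Haus': ['Ġhaus'], 'ing': ['ing']}
```

Fuzzy matching finds more overlapping tokens than exact matching when the two tokenizers mark word boundaries or casing differently.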

Quick Start

Installation

To install the latest stable version from PyPI use:

pip install langsfer

To install the latest development version from TestPyPI use:

pip install -i https://test.pypi.org/simple/ langsfer

To install the latest development version from the repository use:

git clone git@github.com:AnesBenmerzoug/langsfer.git
cd langsfer
pip install .
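
To check which version ended up installed, something like the following should work on Python 3.8 or later. It relies only on the standard library, and the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if langsfer is not installed.
print(installed_version("langsfer"))
```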

Tutorials

The following notebooks serve as tutorials for users of the package:

Simple Example

The package provides high-level interfaces to instantiate each of the methods, without worrying too much about the package's internals.

For example, to use the WECHSEL method, you would use:

from langsfer.high_level import wechsel
from langsfer.embeddings import FastTextEmbeddings
from langsfer.utils import download_file
from transformers import AutoModel, AutoTokenizer

source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
target_tokenizer = AutoTokenizer.from_pretrained("benjamin/roberta-base-wechsel-german")

source_model = AutoModel.from_pretrained("roberta-base")
# For models with non-tied embeddings, you can choose whether to transfer the input and output embeddings separately.
source_embeddings_matrix = source_model.get_input_embeddings().weight.detach().numpy()

source_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("en")
target_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("de")

bilingual_dictionary_file = download_file(
    "https://raw.githubusercontent.com/CPJKU/wechsel/main/dicts/data/german.txt",
    "german.txt",
)

embedding_initializer = wechsel(
    source_tokenizer=source_tokenizer,
    source_embeddings_matrix=source_embeddings_matrix,
    target_tokenizer=target_tokenizer,
    target_auxiliary_embeddings=target_auxiliary_embeddings,
    source_auxiliary_embeddings=source_auxiliary_embeddings,
    bilingual_dictionary_file=bilingual_dictionary_file,
)

To initialize the target embeddings you would then use:

target_embeddings_matrix = embedding_initializer.initialize(seed=16, show_progress=True)

The result is a 2D array that contains the initialized embeddings matrix for the target language model.

We can then replace the source model's embeddings matrix with this newly initialized embeddings matrix:

import torch
from transformers import AutoModel

target_model = AutoModel.from_pretrained("roberta-base")
# Resize its embedding layer
target_model.resize_token_embeddings(len(target_tokenizer))
# Replace the source embeddings matrix with the target embeddings matrix
target_model.get_input_embeddings().weight.data = torch.as_tensor(target_embeddings_matrix)
# Save the new model
target_model.save_pretrained("path/to/target_model")

Advanced Example

Langsfer also provides lower-level interfaces that let you tweak many of the components of the embedding initialization. To use them, however, you need to know a bit more about the package's internals.

For example, if you want to replace the WECHSEL method's weight strategy and token overlap strategy with Sparsemax and fuzzy token overlap, respectively, you would use:

from langsfer.initialization import WeightedAverageEmbeddingsInitialization
from langsfer.alignment import BilingualDictionaryAlignment
from langsfer.embeddings import FastTextEmbeddings
from langsfer.weights import SparsemaxWeights
from langsfer.token_overlap import FuzzyMatchTokenOverlap

embeddings_initializer = WeightedAverageEmbeddingsInitialization(
  source_tokenizer=source_tokenizer,
  source_embeddings_matrix=source_embeddings_matrix,
  target_tokenizer=target_tokenizer,
  target_auxiliary_embeddings=target_auxiliary_embeddings,
  source_auxiliary_embeddings=source_auxiliary_embeddings,
  alignment_strategy=BilingualDictionaryAlignment(
      source_auxiliary_embeddings,
      target_auxiliary_embeddings,
      bilingual_dictionary_file=bilingual_dictionary_file,
  ),
  weights_strategy=SparsemaxWeights(),
  token_overlap_strategy=FuzzyMatchTokenOverlap(),
)
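
For readers unfamiliar with Sparsemax, the sketch below shows the standard sparsemax projection on a plain list of scores. It is a standalone illustration of the technique, not the code behind `SparsemaxWeights`. Unlike softmax, sparsemax can assign exactly zero weight to low-scoring entries while the remaining weights still sum to one.

```python
def sparsemax(scores):
    """Project a list of scores onto the probability simplex (sparsemax)."""
    sorted_scores = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    # Find the threshold tau from the largest prefix that stays in the support.
    for k, score in enumerate(sorted_scores, start=1):
        cumulative += score
        if 1.0 + k * score > cumulative:
            tau = (cumulative - 1.0) / k
    # Entries at or below the threshold get exactly zero weight.
    return [max(score - tau, 0.0) for score in scores]


weights = sparsemax([1.0, 0.5, -1.0])
print(weights)  # [0.75, 0.25, 0.0]
```

When such weights drive a weighted average of source embeddings, each target token draws on only a few source tokens instead of the whole vocabulary.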

You could even implement your own strategies for token overlap computation, embedding alignment, similarity score computation and weight computation.

Roadmap

Here are some of the planned developments for Langsfer:

Performance Optimization: Improve the efficiency and usability of the library to streamline workflows and improve computational performance.
Model Training & Hugging Face Hub Publishing: Train both small and large models with embeddings initialized using Langsfer and publish the resulting models to the Hugging Face Hub for public access and use.
Parameter-Efficient Fine-Tuning: Investigate using techniques such as LoRA (Low-Rank Adaptation) to enable parameter-efficient fine-tuning, making it easier to adapt models to specific languages with minimal overhead.
Implement New Methods: Extend Langsfer with additional language transfer methods, including:
- Ofa: A framework of initializing unseen subword embeddings for efficient large-scale multilingual continued pretraining. Liu, Y., Lin, P., Wang, M. and Schütze, H., 2023. arXiv preprint arXiv:2311.08849.
- Zero-Shot Tokenizer Transfer. Minixhofer, B., Ponti, E.M. and Vulić, I., 2024. arXiv preprint arXiv:2405.07883.
Comprehensive Benchmarking: Run extensive benchmarks across all implemented methods to evaluate their performance, identify strengths and weaknesses, and compare results to establish best practices for language transfer.

Contributing

Refer to the contributing guide for instructions on how you can make contributions to this repository.

Logo

The langsfer logo was created by my good friend Zakaria Taleb Hacine, a 3D artist with industry experience and a packed portfolio.

The logo contains the Latin letters A and I, an acronym for Artificial Intelligence, and the Arabic letters أ and ذ, an acronym for ذكاء اصطناعي, which is Arabic for Artificial Intelligence.

The fonts used are Ethnocentric Regular and Readex Pro.

License

This package is licensed under the LGPL-2.1 license.

Project details

Release history

This version

0.1.0

Nov 25, 2024

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

langsfer-0.1.0-py3-none-any.whl (34.5 kB view details)

Uploaded Nov 25, 2024 Python 3

File details

Details for the file langsfer-0.1.0-py3-none-any.whl.

File metadata

Download URL: langsfer-0.1.0-py3-none-any.whl
Upload date: Nov 25, 2024
Size: 34.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/5.1.1 CPython/3.12.7

File hashes

Hashes for langsfer-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`210a93098e784cee8d358408de0837e61859aa52082120c3deafa50c474ca168`
MD5	`17eb198925003789ee890deb37e548e6`
BLAKE2b-256	`cc8cdda47a2b0154819e40fa60bcae6a1eddc12cbc3803c4ff7596fa647d864f`

See more details on using hashes here.
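
As a sketch of how the SHA256 digest above could be checked after downloading the wheel manually, the following uses only the standard library. The local file path and the helper names are assumptions for this example.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for langsfer-0.1.0-py3-none-any.whl.
EXPECTED_SHA256 = "210a93098e784cee8d358408de0837e61859aa52082120c3deafa50c474ca168"


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 digest equals the expected one."""
    return sha256_of(path) == expected.lower()


# Assumed location of the downloaded wheel: the current working directory.
wheel = Path("langsfer-0.1.0-py3-none-any.whl")
if wheel.exists():
    print("hash OK" if matches_expected(wheel) else "hash MISMATCH")
```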

Provenance

The following attestation bundles were made for langsfer-0.1.0-py3-none-any.whl:

Publisher: publish.yml on AnesBenmerzoug/langsfer

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: langsfer-0.1.0-py3-none-any.whl
- Subject digest: 210a93098e784cee8d358408de0837e61859aa52082120c3deafa50c474ca168
- Sigstore transparency entry: 151453822
- Sigstore integration time: Nov 25, 2024
Source repository:
- Permalink: AnesBenmerzoug/langsfer@cc1fcf83a1cca2f5a654945f3e261a876415989c
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/AnesBenmerzoug
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@cc1fcf83a1cca2f5a654945f3e261a876415989c
- Trigger Event: release
