Langsfer, a library for language transfer methods and algorithms.

Language transfer refers to a few related things:

  • initializing a Large Language Model (LLM) in a new, typically low-resource, target language (e.g. German, Arabic) from another LLM trained in high-resource source language (e.g. English),
  • extending the vocabulary of an LLM by adding new tokens and initializing their embeddings in a manner that allows them to be used with little to no extra training,
  • specializing the vocabulary of a multilingual LLM to one of its supported languages.

The library currently implements the following methods:

Langsfer is flexible enough to allow mixing and matching strategies between different embedding initialization schemes. For example, you can combine fuzzy token overlap with the CLP-Transfer method to refine the initialization process based on fuzzy matches between source and target tokens. This flexibility enables you to experiment with a variety of strategies for different language transfer tasks, making it easier to fine-tune models for your specific use case.

Quick Start

Installation

To install the latest stable version from PyPI use:

pip install langsfer

To install the latest development version from TestPyPI use:

pip install -i https://test.pypi.org/simple/ langsfer

To install the latest development version from the repository use:

git clone git@github.com:AnesBenmerzoug/langsfer.git
cd langsfer
pip install .
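
After installing, you may want to confirm which version ended up in your environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is illustrative and not part of Langsfer.

```python
from importlib import metadata
from typing import Optional


def installed_version(package_name: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("langsfer")
    if version is None:
        print("langsfer is not installed in this environment")
    else:
        print(f"langsfer version: {version}")
```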

Tutorials

The following notebooks serve as tutorials for users of the package:

Simple Example

The package provides high-level interfaces to instantiate each of the methods, so you do not need to know much about the package's internals.

For example, to use the WECHSEL method, you would use:

from langsfer.high_level import wechsel
from langsfer.embeddings import FastTextEmbeddings
from langsfer.utils import download_file
from transformers import AutoModel, AutoTokenizer

source_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
target_tokenizer = AutoTokenizer.from_pretrained("benjamin/roberta-base-wechsel-german")

source_model = AutoModel.from_pretrained("roberta-base")
# For models with non-tied embeddings, you can choose whether to transfer the input and output embeddings separately.
source_embeddings_matrix = source_model.get_input_embeddings().weight.detach().numpy()

source_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("en")
target_auxiliary_embeddings = FastTextEmbeddings.from_model_name_or_path("de")

bilingual_dictionary_file = download_file(
    "https://raw.githubusercontent.com/CPJKU/wechsel/main/dicts/data/german.txt",
    "german.txt",
)

embedding_initializer = wechsel(
    source_tokenizer=source_tokenizer,
    source_embeddings_matrix=source_embeddings_matrix,
    target_tokenizer=target_tokenizer,
    target_auxiliary_embeddings=target_auxiliary_embeddings,
    source_auxiliary_embeddings=source_auxiliary_embeddings,
    bilingual_dictionary_file=bilingual_dictionary_file,
)

To initialize the target embeddings you would then use:

target_embeddings_matrix = embedding_initializer.initialize(seed=16, show_progress=True)

The result is a 2D array that contains the initialized embeddings matrix for the target language model.
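
To give some intuition for what a row of such a matrix can contain, here is a small sketch of a weighted-average initialization in plain Python. It assumes, as the class name `WeightedAverageEmbeddingsInitialization` in the advanced example suggests, that each target token's embedding is a weighted average of source token embeddings. The function name, the toy matrix, and the weights are hypothetical and do not come from Langsfer.

```python
from typing import Dict, List


def weighted_average_embedding(
    weights: Dict[int, float],
    source_embeddings: List[List[float]],
) -> List[float]:
    """Combine source embedding rows into a single target embedding row.

    `weights` maps source token ids to non-negative weights that sum to 1.
    """
    dim = len(source_embeddings[0])
    target = [0.0] * dim
    for token_id, weight in weights.items():
        row = source_embeddings[token_id]
        for i in range(dim):
            target[i] += weight * row[i]
    return target


# Toy source matrix: 3 source tokens, embedding dimension 2.
source_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# Hypothetical weights for one target token that resembles source tokens 0 and 2.
weights = {0: 0.5, 2: 0.5}
print(weighted_average_embedding(weights, source_embeddings))  # [1.5, 1.0]
```

Repeating this for every token in the target vocabulary would yield one row per target token, which is why the result has one row per token and one column per embedding dimension.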

We can then replace the source model's embeddings matrix with this newly initialized embeddings matrix:

import torch
from transformers import AutoModel

target_model = AutoModel.from_pretrained("roberta-base")
# Resize its embedding layer
target_model.resize_token_embeddings(len(target_tokenizer))
# Replace the source embeddings matrix with the target embeddings matrix
target_model.get_input_embeddings().weight.data = torch.as_tensor(target_embeddings_matrix)
# Save the new model
target_model.save_pretrained("path/to/target_model")
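
A cheap safeguard before replacing the embeddings is to confirm that the new matrix has one row per target token and one column per hidden dimension. The helper below is a sketch in plain Python; `check_embeddings_shape` is a hypothetical name, not a Langsfer or Transformers function.

```python
from typing import Sequence


def check_embeddings_shape(
    matrix_shape: Sequence[int], vocab_size: int, hidden_size: int
) -> None:
    """Raise ValueError unless the matrix has shape (vocab_size, hidden_size)."""
    expected = (vocab_size, hidden_size)
    if tuple(matrix_shape) != expected:
        raise ValueError(
            f"Embeddings matrix has shape {tuple(matrix_shape)}, expected {expected}"
        )


# With the names from the example above, the call would look like this
# (commented out because it needs the model and tokenizer to be loaded):
# check_embeddings_shape(
#     target_embeddings_matrix.shape,
#     len(target_tokenizer),
#     target_model.config.hidden_size,
# )
check_embeddings_shape((5, 3), vocab_size=5, hidden_size=3)  # passes silently
```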

Advanced Example

Langsfer also provides lower-level interfaces that let you tweak many of the components of the embedding initialization. However, you need to know a bit more about the package's internals.

For example, to replace the WECHSEL method's weight strategy and token overlap strategy with Sparsemax and fuzzy token overlap, respectively, you would use:

from langsfer.initialization import WeightedAverageEmbeddingsInitialization
from langsfer.alignment import BilingualDictionaryAlignment
from langsfer.embeddings import FastTextEmbeddings
from langsfer.weights import SparsemaxWeights
from langsfer.token_overlap import FuzzyMatchTokenOverlap

embeddings_initializer = WeightedAverageEmbeddingsInitialization(
    source_tokenizer=source_tokenizer,
    source_embeddings_matrix=source_embeddings_matrix,
    target_tokenizer=target_tokenizer,
    target_auxiliary_embeddings=target_auxiliary_embeddings,
    source_auxiliary_embeddings=source_auxiliary_embeddings,
    alignment_strategy=BilingualDictionaryAlignment(
        source_auxiliary_embeddings,
        target_auxiliary_embeddings,
        bilingual_dictionary_file=bilingual_dictionary_file,
    ),
    weights_strategy=SparsemaxWeights(),
    token_overlap_strategy=FuzzyMatchTokenOverlap(),
)
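
For intuition about the Sparsemax weight strategy, the sketch below implements the standard sparsemax transformation in plain Python. It assumes that `SparsemaxWeights` turns similarity scores into weights in this way; the code is not taken from Langsfer's source, and the function name `sparsemax` is illustrative.

```python
from typing import List


def sparsemax(scores: List[float]) -> List[float]:
    """Project a non-empty list of scores onto the probability simplex.

    Unlike softmax, low scores can receive a weight of exactly zero.
    """
    sorted_scores = sorted(scores, reverse=True)
    cumulative = 0.0
    support_size = 0
    support_sum = 0.0
    for k, value in enumerate(sorted_scores, start=1):
        cumulative += value
        # The k largest scores stay in the support while this condition holds.
        if 1.0 + k * value > cumulative:
            support_size = k
            support_sum = cumulative
    threshold = (support_sum - 1.0) / support_size
    return [max(score - threshold, 0.0) for score in scores]


# Toy similarity scores between one target token and three source tokens.
print(sparsemax([0.5, 0.25, -1.0]))  # [0.625, 0.375, 0.0]
```

The resulting weights are non-negative and sum to 1, and the least similar source token is dropped entirely, so it would not contribute to a weighted average.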

You could even implement your own strategies for token overlap computation, embedding alignment, similarity score computation, and weight computation.
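
As an illustration of what a custom token overlap computation might do, the sketch below matches target tokens to source tokens after normalizing subword markers and case, using `difflib` from the standard library. It is a standalone toy: the function names, the normalization rule, and the `min_ratio` threshold are assumptions, and it does not implement Langsfer's strategy interface, which is not described here.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, Optional


def normalize(token: str) -> str:
    """Strip common subword prefix markers and lowercase (assumed rule)."""
    return token.lstrip("Ġ▁#").lower()


def fuzzy_token_overlap(
    source_tokens: Iterable[str],
    target_tokens: Iterable[str],
    min_ratio: float = 0.9,
) -> Dict[str, str]:
    """Map each target token to its most similar source token, if close enough."""
    normalized_sources = [(normalize(tok), tok) for tok in source_tokens]
    matches: Dict[str, str] = {}
    for target in target_tokens:
        target_norm = normalize(target)
        best_ratio = 0.0
        best_source: Optional[str] = None
        # Quadratic scan: fine for a toy example, too slow for full vocabularies.
        for source_norm, source in normalized_sources:
            ratio = SequenceMatcher(None, target_norm, source_norm).ratio()
            if ratio > best_ratio:
                best_ratio, best_source = ratio, source
        if best_source is not None and best_ratio >= min_ratio:
            matches[target] = best_source
    return matches


source_vocab = ["Ġhouse", "Ġwater", "ing"]
target_vocab = ["▁House", "▁Wasser", "ing"]
print(fuzzy_token_overlap(source_vocab, target_vocab))
# {'▁House': 'Ġhouse', 'ing': 'ing'}
```

Here "▁Wasser" stays unmatched because its similarity to "Ġwater" falls below the threshold, so a real method would have to initialize it some other way.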

Roadmap

Here are some of the planned developments for Langsfer:

  • Performance Optimization: Improve the efficiency and usability of the library to streamline workflows and improve computational performance.

  • Model Training & Hugging Face Hub Publishing: Train both small and large models with embeddings initialized using Langsfer and publish the resulting models to the Hugging Face Hub for public access and use.

  • Parameter-Efficient Fine-Tuning: Investigate using techniques such as LoRA (Low-Rank Adaptation) to enable parameter-efficient fine-tuning, making it easier to adapt models to specific languages with minimal overhead.

  • Implement New Methods: Extend Langsfer with additional language transfer methods, including:

  • Comprehensive Benchmarking: Run extensive benchmarks across all implemented methods to evaluate their performance, identify strengths and weaknesses, and compare results to establish best practices for language transfer.

Contributing

Refer to the contributing guide for instructions on how you can contribute to this repository.

Logo

The Langsfer logo was created by my good friend Zakaria Taleb Hacine, a 3D artist with industry experience and a packed portfolio.

The logo contains the Latin alphabet letters A and I, which stand for Artificial Intelligence, and the Arabic alphabet letters أ and ذ, which stand for ذكاء اصطناعي, which is Artificial Intelligence in Arabic.

The fonts used are Ethnocentric Regular and Readex Pro.

License

This package is licensed under the LGPL-2.1 license.
