A package for managing and aligning multilingual GloVe and Graph embeddings

A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge

Project Overview

  • This project provides open-source "green" GloVe embeddings for 87 mid- and low-resource languages trained using CC100.
  • The embeddings can be augmented with graph knowledge from ConceptNet to enhance their performance and semantic understanding.

Embeddings Details

  • Available for 87 languages described in the table below.
  • Our graph-enhanced embeddings can be downloaded from the ConceptNet embeddings on Hugging Face.
  • Augmentation Method: Use the algorithm provided in the repository (merge_emb-s.py) to merge the GloVe embeddings with ConceptNet graph embeddings.

Why Static Embeddings in 2025?

  • Static embeddings outperform large models: Despite the popularity of large language models (LLMs), static embeddings like GloVe still outperform models such as GPT-4 and Llama3 in many tasks, especially for low-resource languages.

  • Lower Environmental Impact:

    • GloVe embeddings: 19.05 kg of CO2 (equivalent to a few car drives).
    • BERT embeddings: 635 kg of CO2 (equivalent to a 1,600 km car trip), about 33 times the emissions from GloVe embeddings.
    • LLaMA-8B model: 390,000 kg of CO2 (equivalent to the annual energy consumption of around 100 homes), roughly 20,470 times the emissions from GloVe embeddings.

Merging Embedding Spaces

We propose a method for merging GloVe embeddings with graph-based embeddings derived from ConceptNet knowledge, while preserving the vocabulary size of GloVe. This overcomes the shortcoming of the retrofitting approach, where the retrofitted words must be present in both graph and corpus embeddings.

Methodology

  1. Singular Value Decomposition:

    • We concatenate GloVe embeddings with pointwise mutual information (PMI)-based graph embeddings from ConceptNet (as proposed by Speer et al. 2017) and apply SVD to the result, with no weighting of the U matrix by singular values (following Levy et al. 2015).
    • Because a word needs both a GloVe vector and a graph vector to be concatenated, this step yields a shared embedding space only for the part of the vocabulary that is common to GloVe and the knowledge graph.
  2. Linear Transformation:

    • We then learn a linear transformation (as proposed by Mikolov et al. 2013) to project the GloVe embeddings into this shared space, allowing us to obtain embeddings for all words in the original GloVe vocabulary.
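
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's merge_emb-s.py: the function name merge_embeddings, the dict-of-vectors input format, the target dimension, and the use of ordinary least squares to fit the linear map are all assumptions made for the example, and the random vectors stand in for real embeddings.

```python
import numpy as np


def merge_embeddings(glove, graph, dim):
    """Sketch of SVD-based merging (hypothetical helper, not the package API).

    glove, graph: dicts mapping a word to a 1-D NumPy vector.
    Returns a dict with one `dim`-dimensional vector per GloVe word.
    """
    shared = sorted(set(glove) & set(graph))
    if not shared:
        raise ValueError("GloVe and graph vocabularies do not overlap")

    glove_shared = np.stack([glove[w] for w in shared])
    graph_shared = np.stack([graph[w] for w in shared])

    # Step 1: SVD of the concatenated vectors for the shared vocabulary.
    concat = np.hstack([glove_shared, graph_shared])
    if dim > min(concat.shape):
        raise ValueError("dim exceeds the rank available from the SVD")
    u, _, _ = np.linalg.svd(concat, full_matrices=False)
    shared_space = u[:, :dim]  # U is used as-is, not scaled by singular values

    # Step 2: least-squares linear map from GloVe space to the shared space.
    transform, *_ = np.linalg.lstsq(glove_shared, shared_space, rcond=None)

    # Project every GloVe word, including those missing from the graph.
    words = list(glove)
    projected = np.stack([glove[w] for w in words]) @ transform
    return dict(zip(words, projected))


# Toy data: 20 GloVe words, of which only the first 12 have graph vectors.
rng = np.random.default_rng(0)
toy_glove = {f"w{i}": rng.normal(size=8) for i in range(20)}
toy_graph = {f"w{i}": rng.normal(size=6) for i in range(12)}
merged = merge_embeddings(toy_glove, toy_graph, dim=5)
print(len(merged), merged["w19"].shape)
```

The point of the second step is visible in the toy data: words w12 to w19 have no graph vector, yet they still receive a vector in the merged space, so the GloVe vocabulary size is preserved.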

ISO     Language Name      Dataset Size  Class  ConceptNet Data
ss      Swati              86K           1
sc      Sardinian          143K          1
yo      Yoruba             1.1M          2
gn      Guarani            1.5M          1
qu      Quechua            1.5M          1
ns      Northern Sotho     1.8M          1
li      Limburgish         2.2M          1
ln      Lingala            2.3M          1
wo      Wolof              3.6M          2
zu      Zulu               4.3M          2
rm      Romansh            4.8M          1
ig      Igbo               6.6M          1
lg      Ganda              7.3M          1
as      Assamese           7.6M          1
tn      Tswana             8.0M          2
ht      Haitian            9.1M          2
om      Oromo              11M           1
su      Sundanese          15M           1
bs      Bosnian            18M           3
br      Breton             21M           1
gd      Scottish Gaelic    22M           1
xh      Xhosa              25M           2
mg      Malagasy           29M           1
jv      Javanese           37M           1
fy      Frisian            38M           0
sa      Sanskrit           44M           2
my      Burmese            46M           1
ug      Uyghur             46M           1
yi      Yiddish            51M           1
or      Oriya              56M           1
ha      Hausa              61M           2
la      Lao                63M           2
sd      Sindhi             67M           1
ta_rom  Tamil Romanized    68M           3
so      Somali             78M           1
te_rom  Telugu Romanized   79M           1
ku      Kurdish            90M           0
pu      Punjabi            90M           2
ps      Pashto             107M          1
ga      Irish              108M          2
am      Amharic            133M          2
ur_rom  Urdu Romanized     141M          3
km      Khmer              153M          1
uz      Uzbek              155M          3
bn_rom  Bengali Romanized  164M          3
ky      Kyrgyz             173M          3
my_zaw  Burmese (Zawgyi)   178M          1
cy      Welsh              179M          1
gu      Gujarati           242M          1
eo      Esperanto          250M          1
af      Afrikaans          305M          3
sw      Swahili            332M          2
mr      Marathi            334M          2
kn      Kannada            360M          1
ne      Nepali             393M          1
mn      Mongolian          397M          1
si      Sinhala            452M          0
te      Telugu             536M          1
la      Latin              609M          3
be      Belarussian        692M          3
tl      Tagalog            701M          3
mk      Macedonian         706M          1
gl      Galician           708M          3
hy      Armenian           776M          1
is      Icelandic          779M          2
ml      Malayalam          831M          1
bn      Bengali            860M          3
ur      Urdu               884M          3
kk      Kazakh             889M          3
ka      Georgian           1.1G          3
az      Azerbaijani        1.3G          1
sq      Albanian           1.3G          1
ta      Tamil              1.3G          3
et      Estonian           1.7G          3
lv      Latvian            2.1G          3
ms      Malay              2.1G          3
sl      Slovenian          2.8G          3
lt      Lithuanian         3.4G          3
he      Hebrew             6.1G          3
sk      Slovak             6.1G          3
el      Greek              7.4G          3
th      Thai               8.7G          3
bg      Bulgarian          9.3G          3
da      Danish             12G           3
uk      Ukrainian          14G           3
ro      Romanian           16G           3
id      Indonesian         36G           3

ConceptNet Data Details

Data Extraction Process

  1. Source: Data is extracted from the ConceptNet database.
  2. Extraction Steps:
    • Clean and analyze the data from the official ConceptNet assertions dump.
    • The extracted dataset is a JSON dictionary keyed by language code, holding the start and end edges for each language.
  3. Start Edges: Represent unique words in a target language.
  4. End Edges: Represent words related to the start edges through various types of relationships (relationship types and sources are not included in the extraction).
  5. Data Availability:
    • The dataset is available on Hugging Face.
    • A detailed description of the amount of data extracted for each language is also provided.
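
To make the description above concrete, here is a minimal sketch of reading such a file in Python. The exact schema is not specified in this README, so the layout below (a language code mapping to a list of objects with "start" and "end" keys) is an assumption, and the words are placeholders rather than real ConceptNet entries. Adjust the key names to match the file you actually download.

```python
import json
from collections import defaultdict

# Hypothetical layout, assumed for illustration only: the real dataset may
# organise start and end edges differently.
sample = """
{"yo": [{"start": "word_a", "end": "word_b"},
        {"start": "word_a", "end": "word_c"},
        {"start": "word_d", "end": "word_a"}]}
"""


def neighbours_by_language(raw_json):
    """Group end words under their start word, separately for each language."""
    data = json.loads(raw_json)
    result = {}
    for language, edges in data.items():
        adjacency = defaultdict(set)
        for edge in edges:
            adjacency[edge["start"]].add(edge["end"])
        result[language] = {word: sorted(ends) for word, ends in adjacency.items()}
    return result


neighbours = neighbours_by_language(sample)
print(neighbours["yo"]["word_a"])
```

Grouping end words by start word gives the per-word neighbourhood from which graph-based embeddings can be computed.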

Usage

The embeddings are available through our pip package:

pip install -i https://test.pypi.org/simple/ gremlin

Download Embeddings

Download embeddings for specific languages and types:

gremlin download -l <language_codes> -t <embedding_type> [-o <output_directory>]

Parameters:

  • -l/--languages: One or more language codes from the ISO column of the language table (e.g., yo, sw, zu)
  • -t/--type: Embedding type (glove or graph)
  • -o/--output: Output directory (optional, default: ./embeddings)

Examples:

# Download GloVe embeddings for Yoruba and Swahili
gremlin download -l yo sw -t glove

# Download graph embeddings for Yoruba into a custom output directory
gremlin download -l yo -t graph -o ./yoruba_embeddings
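
When both embedding types are needed for several languages, the documented command can be driven from a short script. The sketch below only assembles command lines from the flags listed above; the helper names build_download_command and run_all and the ./embeddings/<type> directory layout are hypothetical choices for the example, not part of the package.

```python
import shlex
import subprocess


def build_download_command(languages, embedding_type, output_dir=None):
    """Assemble a `gremlin download` call from the documented flags."""
    if embedding_type not in ("glove", "graph"):
        raise ValueError("embedding_type must be 'glove' or 'graph'")
    command = ["gremlin", "download", "-l", *languages, "-t", embedding_type]
    if output_dir is not None:
        command += ["-o", output_dir]
    return command


def run_all(commands):
    """Run each command in turn; requires the gremlin CLI to be installed."""
    for command in commands:
        subprocess.run(command, check=True)


# One command per embedding type, each with its own output directory.
commands = [
    build_download_command(["yo", "sw"], kind, f"./embeddings/{kind}")
    for kind in ("glove", "graph")
]
for command in commands:
    print(shlex.join(command))
# Call run_all(commands) to actually start the downloads.
```

Printing the commands first makes it easy to check them before anything is downloaded.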

Merge Embeddings

Merge different embedding sources:

gremlin merge -g <glove_file> -p <graph_file> -o <output_file>

Parameters:

  • -g/--glove: Path to GloVe embeddings file
  • -p/--graph: Path to graph embeddings file
  • -o/--output: Output path for merged embeddings

Example:

gremlin merge -g glove_yo.txt -p graph_yo.txt -o merged_yo.txt
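
Once a merged file exists, it can be inspected from Python. The sketch below assumes the file follows the usual plain-text GloVe layout, one token per line followed by whitespace-separated floats, optionally preceded by a "count dimension" header line. That layout is an assumption, since the output format is not described here, and load_text_embeddings and cosine are hypothetical helpers rather than package functions.

```python
import math


def load_text_embeddings(path):
    """Read 'token v1 v2 ...' lines into a dict of float lists."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if len(parts) < 3:
                continue  # skips blank lines and a "count dimension" header
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For example, vectors = load_text_embeddings("merged_yo.txt") followed by cosine(vectors[word_1], vectors[word_2]) compares two words from the merged vocabulary. Note that the length check also skips one-dimensional vectors, which should not occur in practice.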

Extended Help

For comprehensive instructions and usage examples:

gremlin --detailed-help

Citation

  • If you use our embedding enhancement method or pre-trained embeddings, please consider citing our preview paper (the full paper is to be published in the Findings of NAACL 2025):
@misc{gurgurov2024gremlinrepositorygreenbaseline,
      title={GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge}, 
      author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
      year={2024},
      eprint={2409.18193},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.18193}, 
}

License

This project is licensed under the Apache License; see the LICENSE.txt file for details.
