A package for managing and aligning multilingual GloVe and Graph embeddings
A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge
Project Overview
- This project provides open-source "green" GloVe embeddings for 87 mid- and low-resource languages trained using CC100.
- The embeddings can be augmented with graph knowledge from ConceptNet to enhance their performance and semantic understanding.
Embeddings Details
- Available for 87 languages described in the table below.
- The graph-enhanced embeddings can be downloaded from the ConceptNet embeddings repository on Hugging Face.
- Augmentation Method: Use the algorithm provided in the repository (merge_emb-s.py) to merge the GloVe embeddings with ConceptNet graph embeddings.
Why Static Embeddings in 2025?
- Static embeddings outperform large models: Despite the popularity of large language models (LLMs), static embeddings like GloVe still outperform models such as GPT-4 and Llama3 in many tasks, especially for low-resource languages.
- Lower environmental impact:
  - GloVe embeddings: 19.05 kg of CO2 (equivalent to a few car drives).
  - BERT embeddings: 635 kg of CO2 (equivalent to a 1,600 km car trip), about 33 times the emissions of the GloVe embeddings.
  - LLaMA-8B model: 390,000 kg of CO2 (equivalent to the annual energy consumption of around 100 homes), roughly 20,000 times the emissions of the GloVe embeddings.
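As a quick sanity check, the ratios can be recomputed from the three emission figures quoted in this section. The snippet below is only an illustration; the dictionary and function names are hypothetical, and because the inputs are rounded, the results may differ slightly from multipliers quoted elsewhere.

```python
# Emission figures quoted in this section, in kg of CO2.
emissions_kg = {"GloVe": 19.05, "BERT": 635.0, "LLaMA-8B": 390_000.0}


def ratio_to_glove(model: str) -> float:
    """Return how many times larger a model's emissions are than GloVe's."""
    return emissions_kg[model] / emissions_kg["GloVe"]


for name in ("BERT", "LLaMA-8B"):
    print(f"{name}: {ratio_to_glove(name):,.0f}x the GloVe emissions")
```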
Merging Embedding Spaces
We propose a method for merging GloVe embeddings with graph-based embeddings derived from ConceptNet knowledge, while preserving the vocabulary size of GloVe. This overcomes the shortcoming of the retrofitting approach, where the retrofitted words must be present in both graph and corpus embeddings.
Methodology
- Singular Value Decomposition:
  - We apply SVD (with no weighting of the U matrix by singular values, following Levy et al. 2015) to the concatenation of GloVe embeddings and pointwise mutual information (PMI)-based graph embeddings from ConceptNet (as proposed by Speer et al. 2017).
  - The concatenation covers only the part of the vocabulary that is common to GloVe and the knowledge graph, so the decomposition yields a shared embedding space for those words.
- Linear Transformation:
  - We then learn a linear transformation (as proposed by Mikolov et al. 2013) to project the GloVe embeddings into this shared space, allowing us to obtain embeddings for all words in the original GloVe vocabulary.
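The two steps can be sketched with NumPy on toy data. This is a minimal illustration under stated assumptions, not the repository's merge script: the function name `merge_spaces`, its arguments, the random toy matrices, and the use of ordinary least squares to fit the linear map are all choices made here for the example.

```python
import numpy as np


def merge_spaces(glove, graph, common_rows, dim):
    """Sketch of the two-step merge (hypothetical helper, not the package API).

    glove:       (V, d_g) array, one row per word in the GloVe vocabulary
    graph:       (C, d_p) array, graph embeddings for the shared words
    common_rows: length-C index array; glove[common_rows[i]] and graph[i]
                 describe the same word
    dim:         dimensionality of the shared space
    """
    glove_common = glove[common_rows]
    # Step 1: concatenate both views of the shared vocabulary and factorise.
    concat = np.hstack([glove_common, graph])
    u, _s, _vt = np.linalg.svd(concat, full_matrices=False)
    # Keep the leading left singular vectors, unscaled by singular values.
    shared = u[:, :dim]
    # Step 2: fit a linear map from GloVe space to the shared space
    # (least squares is an assumption made for this sketch).
    w, *_ = np.linalg.lstsq(glove_common, shared, rcond=None)
    # Project the full GloVe vocabulary, including words missing from the graph.
    return glove @ w, shared


rng = np.random.default_rng(0)
glove = rng.normal(size=(50, 8))   # toy GloVe vocabulary of 50 words
common_rows = np.arange(30)        # first 30 words also appear in the graph
graph = rng.normal(size=(30, 6))   # toy graph embeddings for those words
merged, shared = merge_spaces(glove, graph, common_rows, dim=5)
print(merged.shape, shared.shape)
```

Every GloVe word receives a vector in the merged space, which is the property that distinguishes this approach from retrofitting.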
Language Data Table
| ISO | Language Name | Dataset Size | Class | ConceptNet Data |
|---|---|---|---|---|
| ss | Swati | 86K | 1 | ✘ |
| sc | Sardinian | 143K | 1 | ✓ |
| yo | Yoruba | 1.1M | 2 | ✓ |
| gn | Guarani | 1.5M | 1 | ✓ |
| qu | Quechua | 1.5M | 1 | ✓ |
| ns | Northern Sotho | 1.8M | 1 | ✘ |
| li | Limburgish | 2.2M | 1 | ✓ |
| ln | Lingala | 2.3M | 1 | ✓ |
| wo | Wolof | 3.6M | 2 | ✓ |
| zu | Zulu | 4.3M | 2 | ✓ |
| rm | Romansh | 4.8M | 1 | ✓ |
| ig | Igbo | 6.6M | 1 | ✘ |
| lg | Ganda | 7.3M | 1 | ✘ |
| as | Assamese | 7.6M | 1 | ✘ |
| tn | Tswana | 8.0M | 2 | ✘ |
| ht | Haitian | 9.1M | 2 | ✓ |
| om | Oromo | 11M | 1 | ✘ |
| su | Sundanese | 15M | 1 | ✓ |
| bs | Bosnian | 18M | 3 | ✘ |
| br | Breton | 21M | 1 | ✓ |
| gd | Scottish Gaelic | 22M | 1 | ✓ |
| xh | Xhosa | 25M | 2 | ✓ |
| mg | Malagasy | 29M | 1 | ✓ |
| jv | Javanese | 37M | 1 | ✓ |
| fy | Frisian | 38M | 0 | ✓ |
| sa | Sanskrit | 44M | 2 | ✓ |
| my | Burmese | 46M | 1 | ✓ |
| ug | Uyghur | 46M | 1 | ✓ |
| yi | Yiddish | 51M | 1 | ✓ |
| or | Oriya | 56M | 1 | ✓ |
| ha | Hausa | 61M | 2 | ✓ |
| lo | Lao | 63M | 2 | ✓ |
| sd | Sindhi | 67M | 1 | ✓ |
| ta_rom | Tamil Romanized | 68M | 3 | ✘ |
| so | Somali | 78M | 1 | ✓ |
| te_rom | Telugu Romanized | 79M | 1 | ✘ |
| ku | Kurdish | 90M | 0 | ✓ |
| pa | Punjabi | 90M | 2 | ✓ |
| ps | Pashto | 107M | 1 | ✓ |
| ga | Irish | 108M | 2 | ✓ |
| am | Amharic | 133M | 2 | ✓ |
| ur_rom | Urdu Romanized | 141M | 3 | ✘ |
| km | Khmer | 153M | 1 | ✓ |
| uz | Uzbek | 155M | 3 | ✓ |
| bn_rom | Bengali Romanized | 164M | 3 | ✘ |
| ky | Kyrgyz | 173M | 3 | ✓ |
| my_zaw | Burmese (Zawgyi) | 178M | 1 | ✘ |
| cy | Welsh | 179M | 1 | ✓ |
| gu | Gujarati | 242M | 1 | ✓ |
| eo | Esperanto | 250M | 1 | ✓ |
| af | Afrikaans | 305M | 3 | ✓ |
| sw | Swahili | 332M | 2 | ✓ |
| mr | Marathi | 334M | 2 | ✓ |
| kn | Kannada | 360M | 1 | ✓ |
| ne | Nepali | 393M | 1 | ✓ |
| mn | Mongolian | 397M | 1 | ✓ |
| si | Sinhala | 452M | 0 | ✓ |
| te | Telugu | 536M | 1 | ✓ |
| la | Latin | 609M | 3 | ✓ |
| be | Belarusian | 692M | 3 | ✓ |
| tl | Tagalog | 701M | 3 | ✘ |
| mk | Macedonian | 706M | 1 | ✓ |
| gl | Galician | 708M | 3 | ✓ |
| hy | Armenian | 776M | 1 | ✓ |
| is | Icelandic | 779M | 2 | ✓ |
| ml | Malayalam | 831M | 1 | ✓ |
| bn | Bengali | 860M | 3 | ✓ |
| ur | Urdu | 884M | 3 | ✓ |
| kk | Kazakh | 889M | 3 | ✓ |
| ka | Georgian | 1.1G | 3 | ✓ |
| az | Azerbaijani | 1.3G | 1 | ✓ |
| sq | Albanian | 1.3G | 1 | ✓ |
| ta | Tamil | 1.3G | 3 | ✓ |
| et | Estonian | 1.7G | 3 | ✓ |
| lv | Latvian | 2.1G | 3 | ✓ |
| ms | Malay | 2.1G | 3 | ✓ |
| sl | Slovenian | 2.8G | 3 | ✓ |
| lt | Lithuanian | 3.4G | 3 | ✓ |
| he | Hebrew | 6.1G | 3 | ✓ |
| sk | Slovak | 6.1G | 3 | ✓ |
| el | Greek | 7.4G | 3 | ✓ |
| th | Thai | 8.7G | 3 | ✓ |
| bg | Bulgarian | 9.3G | 3 | ✓ |
| da | Danish | 12G | 3 | ✓ |
| uk | Ukrainian | 14G | 3 | ✓ |
| ro | Romanian | 16G | 3 | ✓ |
| id | Indonesian | 36G | 3 | ✘ |
ConceptNet Data Details
Data Extraction Process
- Source: Data is extracted from the ConceptNet database.
- Extraction Steps:
  - Clean and analyze the data from the official ConceptNet assertions dump.
  - The extracted dataset is in JSON format and represents a dictionary with language codes and start and end edges for each language.
    - Start Edges: Represent unique words in a target language.
    - End Edges: Represent words related to the start edges through various types of relationships (relationship types and sources are not included in the extraction).
- Data Availability:
  - The dataset is available on Hugging Face.
  - A detailed description of the amount of data extracted for each language is also provided.
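The extraction can be pictured with a short sketch. It assumes the tab-separated layout of the ConceptNet assertions dump (edge URI, relation, start node, end node, JSON metadata) and concept URIs of the form `/c/<lang>/<term>`. The function name `extract_edges`, the exact shape of the output dictionary, and the two sample rows are illustrative assumptions, not the project's actual script or data.

```python
import json
from collections import defaultdict


def extract_edges(lines, languages):
    """Group end terms by start term for each requested language code."""
    data = {lang: defaultdict(set) for lang in languages}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed rows
        # Concept URIs look like /c/<lang>/<term>[/<pos>...]
        start, end = fields[2].split("/"), fields[3].split("/")
        if len(start) < 4 or len(end) < 4 or start[1] != "c" or end[1] != "c":
            continue
        lang = start[2]
        if lang in data:
            # The relation type (fields[1]) and metadata are dropped on purpose.
            data[lang][start[3]].add(end[3])
    return {
        lang: {word: sorted(ends) for word, ends in words.items()}
        for lang, words in data.items()
    }


# Two made-up rows in the assumed dump layout.
sample = [
    '/a/[/r/RelatedTo/,/c/yo/omi/,/c/en/water/]\t/r/RelatedTo\t/c/yo/omi\t/c/en/water\t{"weight": 1.0}',
    '/a/[/r/Synonym/,/c/sw/maji/,/c/yo/omi/]\t/r/Synonym\t/c/sw/maji\t/c/yo/omi\t{"weight": 1.0}',
]
edges = extract_edges(sample, ["yo", "sw"])
print(json.dumps(edges, ensure_ascii=False))
```

Here end terms are kept regardless of their language; whether the published dataset does the same is not stated above, so treat that as an assumption of the sketch.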
Usage
The embeddings are available through our pip package:
pip install -i https://test.pypi.org/simple/ gremlin
Download Embeddings
Download embeddings for specific languages and types:
gremlin download -l <language_codes> -t <embedding_type> [-o <output_directory>]
Parameters:
- -l/--languages: Language codes (e.g., en, fr, de)
- -t/--type: Embedding type (glove or graph)
- -o/--output: Output directory (optional, default: ./embeddings)
Examples:
# Download GloVe embeddings for Yoruba and Swahili
gremlin download -l yo sw -t glove
# Download graph embeddings for Yoruba with custom output
gremlin download -l yo -t graph -o ./yoruba_embeddings
Merge Embeddings
Merge different embedding sources:
gremlin merge -g <glove_file> -p <graph_file> -o <output_file>
Parameters:
- -g/--glove: Path to GloVe embeddings file
- -p/--graph: Path to graph embeddings file
- -o/--output: Output path for merged embeddings
Example:
gremlin merge -g glove_yo.txt -p graph_yo.txt -o merged_yo.txt
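Once a merged file exists, it can be inspected from Python. The sketch below assumes the output uses the common GloVe text layout (one word per line followed by space-separated floats); that format is not confirmed here, and the helper names and demo vectors are hypothetical.

```python
import io

import numpy as np


def load_text_embeddings(handle):
    """Read 'word v1 v2 ...' lines into a {word: vector} dictionary."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 3:
            continue  # skip blank lines or a possible "count dim" header
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]], dtype=np.float32)
    return vectors


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Stand-in for open("merged_yo.txt", encoding="utf-8"); the values are made up.
demo = io.StringIO("omi 0.1 0.2 0.3\nodo 0.1 0.2 0.25\n")
emb = load_text_embeddings(demo)
print(cosine(emb["omi"], emb["odo"]))
```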
Extended Help
For comprehensive instructions and usage examples:
gremlin --detailed-help
Citation
- If you use our embedding enhancement method or pre-trained embeddings, please consider citing our preview paper (the full paper is to be published in the Findings of NAACL 2025):
@misc{gurgurov2024gremlinrepositorygreenbaseline,
title={GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge},
author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
year={2024},
eprint={2409.18193},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.18193},
}
License
This project is licensed under the Apache License; see the LICENSE.txt file for details.