gremlinn

A package for managing and aligning multilingual GloVe and Graph embeddings

These details have not been verified by PyPI

Project links

Homepage

Project description

A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge

Project Overview

This project provides open-source "green" GloVe embeddings for 87 mid- and low-resource languages trained using CC100.
The embeddings can be augmented with graph knowledge from ConceptNet to enhance their performance and semantic understanding.

Embeddings Details

Available for 87 languages described in the table below.
Our graph-enhanced embeddings can be downloaded from the ConceptNet embeddings on Hugging Face.
Augmentation Method: Use the algorithm provided in the repository (merge_emb-s.py) to merge the GloVe embeddings with ConceptNet graph embeddings.

Why Static Embeddings in 2025?

Static embeddings outperform large models: Despite the popularity of large language models (LLMs), static embeddings like GloVe still outperform models such as GPT-4 and Llama3 in many tasks, especially for low-resource languages.
Lower Environmental Impact:
- GloVe embeddings: 19.05 kg of CO2 (equivalent to a few car drives).
- BERT embeddings: 635 kg of CO2 (equivalent to a 1,600 km car trip), roughly 33 times the emissions of the GloVe embeddings.
- LLaMA-8B model: 390,000 kg of CO2 (equivalent to the annual energy consumption of around 100 homes), roughly 20,500 times the emissions of the GloVe embeddings.
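
The ratios between these footprints can be recomputed directly from the three CO2 figures quoted above. The snippet below is only a sanity check on that arithmetic; the variable names are illustrative and not part of the package.

```python
# CO2 figures (in kg) as quoted in the list above.
glove_kg = 19.05
bert_kg = 635
llama_kg = 390_000

# How many times larger each footprint is than the GloVe one.
bert_ratio = bert_kg / glove_kg
llama_ratio = llama_kg / glove_kg

print(f"BERT / GloVe:     {bert_ratio:.1f}x")     # 33.3x
print(f"LLaMA-8B / GloVe: {llama_ratio:,.0f}x")   # 20,472x
```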

Merging Embedding Spaces

We propose a method for merging GloVe embeddings with graph-based embeddings derived from ConceptNet knowledge, while preserving the vocabulary size of GloVe. This overcomes the shortcoming of the retrofitting approach, where the retrofitted words must be present in both graph and corpus embeddings.

Methodology

Singular Value Decomposition:
- We concatenate GloVe embeddings with pointwise mutual information (PMI)-based graph embeddings from ConceptNet (as proposed by Speer et al. 2017) for the part of the vocabulary that is common to GloVe and the knowledge graph.
- We apply SVD to the concatenated vectors (with no weighting of the U matrix by singular values, following Levy et al. 2015) to generate a shared embedding space for these common words.
Linear Transformation:
- We then learn a linear transformation (as proposed by Mikolov et al. 2013) to project the GloVe embeddings into this shared space, allowing us to obtain embeddings for all words in the original GloVe vocabulary.
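
The two steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the repository's merge script: the function name `merge_spaces`, the dictionary-based inputs, the target dimensionality, and the use of ordinary least squares to fit the linear map are all choices made for this example, and the toy vectors are random.

```python
import numpy as np

def merge_spaces(glove, graph, dim):
    """Hypothetical helper: glove and graph map word -> 1-D vector."""
    # Step 1: concatenate the two views for the shared vocabulary.
    shared = sorted(set(glove) & set(graph))
    G = np.stack([glove[w] for w in shared])   # (n_shared, d_glove)
    P = np.stack([graph[w] for w in shared])   # (n_shared, d_graph)
    X = np.hstack([G, P])                      # (n_shared, d_glove + d_graph)
    if dim > min(X.shape):
        raise ValueError("dim exceeds the rank available from the SVD")

    # Step 2: SVD; keep U without scaling it by the singular values.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    shared_space = U[:, :dim]                  # (n_shared, dim)

    # Step 3: least-squares linear map W from GloVe space to the shared space.
    W, *_ = np.linalg.lstsq(G, shared_space, rcond=None)

    # Step 4: project every GloVe word, so the GloVe vocabulary is preserved.
    words = list(glove)
    projected = np.stack([glove[w] for w in words]) @ W
    return dict(zip(words, projected))

# Toy data: 50 "GloVe" words, of which the first 30 also have graph vectors.
rng = np.random.default_rng(0)
glove = {f"w{i}": rng.normal(size=8) for i in range(50)}
graph = {f"w{i}": rng.normal(size=6) for i in range(30)}

merged = merge_spaces(glove, graph, dim=5)
print(len(merged), merged["w49"].shape)
```

Words such as `w49`, which have no graph vector, still receive a merged embedding because the learned map is applied to the whole GloVe vocabulary.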

Language Data Table

ISO	Language Name	Dataset Size	Class	ConceptNet Data
ss	Swati	86K	1	✘
sc	Sardinian	143K	1	✓
yo	Yoruba	1.1M	2	✓
gn	Guarani	1.5M	1	✓
qu	Quechua	1.5M	1	✓
ns	Northern Sotho	1.8M	1	✘
li	Limburgish	2.2M	1	✓
ln	Lingala	2.3M	1	✓
wo	Wolof	3.6M	2	✓
zu	Zulu	4.3M	2	✓
rm	Romansh	4.8M	1	✓
ig	Igbo	6.6M	1	✘
lg	Ganda	7.3M	1	✘
as	Assamese	7.6M	1	✘
tn	Tswana	8.0M	2	✘
ht	Haitian	9.1M	2	✓
om	Oromo	11M	1	✘
su	Sundanese	15M	1	✓
bs	Bosnian	18M	3	✘
br	Breton	21M	1	✓
gd	Scottish Gaelic	22M	1	✓
xh	Xhosa	25M	2	✓
mg	Malagasy	29M	1	✓
jv	Javanese	37M	1	✓
fy	Frisian	38M	0	✓
sa	Sanskrit	44M	2	✓
my	Burmese	46M	1	✓
ug	Uyghur	46M	1	✓
yi	Yiddish	51M	1	✓
or	Oriya	56M	1	✓
ha	Hausa	61M	2	✓
lo	Lao	63M	2	✓
sd	Sindhi	67M	1	✓
ta_rom	Tamil Romanized	68M	3	✘
so	Somali	78M	1	✓
te_rom	Telugu Romanized	79M	1	✘
ku	Kurdish	90M	0	✓
pa	Punjabi	90M	2	✓
ps	Pashto	107M	1	✓
ga	Irish	108M	2	✓
am	Amharic	133M	2	✓
ur_rom	Urdu Romanized	141M	3	✘
km	Khmer	153M	1	✓
uz	Uzbek	155M	3	✓
bn_rom	Bengali Romanized	164M	3	✘
ky	Kyrgyz	173M	3	✓
my_zaw	Burmese (Zawgyi)	178M	1	✘
cy	Welsh	179M	1	✓
gu	Gujarati	242M	1	✓
eo	Esperanto	250M	1	✓
af	Afrikaans	305M	3	✓
sw	Swahili	332M	2	✓
mr	Marathi	334M	2	✓
kn	Kannada	360M	1	✓
ne	Nepali	393M	1	✓
mn	Mongolian	397M	1	✓
si	Sinhala	452M	0	✓
te	Telugu	536M	1	✓
la	Latin	609M	3	✓
be	Belarusian	692M	3	✓
tl	Tagalog	701M	3	✘
mk	Macedonian	706M	1	✓
gl	Galician	708M	3	✓
hy	Armenian	776M	1	✓
is	Icelandic	779M	2	✓
ml	Malayalam	831M	1	✓
bn	Bengali	860M	3	✓
ur	Urdu	884M	3	✓
kk	Kazakh	889M	3	✓
ka	Georgian	1.1G	3	✓
az	Azerbaijani	1.3G	1	✓
sq	Albanian	1.3G	1	✓
ta	Tamil	1.3G	3	✓
et	Estonian	1.7G	3	✓
lv	Latvian	2.1G	3	✓
ms	Malay	2.1G	3	✓
sl	Slovenian	2.8G	3	✓
lt	Lithuanian	3.4G	3	✓
he	Hebrew	6.1G	3	✓
sk	Slovak	6.1G	3	✓
el	Greek	7.4G	3	✓
th	Thai	8.7G	3	✓
bg	Bulgarian	9.3G	3	✓
da	Danish	12G	3	✓
uk	Ukrainian	14G	3	✓
ro	Romanian	16G	3	✓
id	Indonesian	36G	3	✘

ConceptNet Data Details

Data Extraction Process

Source: Data is extracted from the ConceptNet database (available here).
Extraction Steps:
- Clean and analyze the data from the official ConceptNet dump (ConceptNet assertions dump).
- The extracted dataset is a JSON dictionary keyed by language code, holding the start and end edges for each language.
Start Edges: Represent unique words in a target language.
End Edges: Represent words related to the start edges through various types of relationships (relationship types and sources are not included in the extraction).
Data Availability:
- The dataset is available on Hugging Face.
- A detailed description of the amount of data extracted for each language is also provided.
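
The extraction described above can be sketched as follows. This is an assumption-laden illustration rather than the project's actual script: it relies on the tab-separated layout of the ConceptNet assertions dump (edge URI, relation, start node, end node, JSON metadata), the function names are hypothetical, the sample rows are made up for the example, and the exact shape of the output dictionary is a guess at the format described here.

```python
import json
from collections import defaultdict

def parse_concept(uri):
    """Split a concept URI such as /c/yo/omi/n into (language, word)."""
    parts = uri.split("/")
    if len(parts) < 4 or parts[1] != "c":
        return None
    return parts[2], parts[3].replace("_", " ")

def extract(lines, target_langs):
    """Collect, per target language, each start word and its related end words."""
    out = {lang: defaultdict(set) for lang in target_langs}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        start, end = parse_concept(fields[2]), parse_concept(fields[3])
        if start is None or end is None:
            continue
        lang, word = start
        if lang in out:
            # The relation type (fields[1]) and sources are deliberately dropped.
            out[lang][word].add(end[1])
    return {lang: {w: sorted(ends) for w, ends in words.items()}
            for lang, words in out.items()}

# Made-up rows in the dump's layout, for illustration only.
rows = [
    ("/r/Synonym", "/c/yo/omi", "/c/en/water"),
    ("/r/RelatedTo", "/c/yo/omi", "/c/yo/odo"),
    ("/r/RelatedTo", "/c/sw/maji/n", "/c/en/water"),
    ("/r/IsA", "/c/en/water", "/c/en/liquid"),
]
lines = ["\t".join([f"/a/[{rel}/,{s}/,{e}/]", rel, s, e, '{"weight": 1.0}'])
         for rel, s, e in rows]

extracted = extract(lines, ["yo", "sw"])
print(json.dumps(extracted, ensure_ascii=False, indent=2))
```

Whether end words from other languages are kept, as they are in this sketch, is not specified above and may differ in the published dataset.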

Usage

The embeddings are available through our pip package:

pip install -i https://test.pypi.org/simple/ gremlinn

Download Embeddings

Download embeddings for specific languages and types:

gremlin download -l <language_codes> -t <embedding_type> [-o <output_directory>]

Parameters:

-l/--languages: Language codes from the table above (e.g., yo, sw, zu)
-t/--type: Embedding type (glove or graph)
-o/--output: Output directory (optional, default: ./embeddings)

Examples:

# Download GloVe embeddings for Yoruba and Swahili
gremlin download -l yo sw -t glove

# Download graph embeddings for Yoruba with custom output
gremlin download -l yo -t graph -o ./yoruba_embeddings
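
When downloads need to be scripted, the command line shown above can be assembled from Python. The helper below is a hypothetical convenience wrapper, not part of the package; it uses only the flags documented in this section and prints the command instead of running it.

```python
import shlex

def build_download_command(languages, embedding_type, output_dir=None):
    """Hypothetical helper: build the argv list for `gremlin download`."""
    if embedding_type not in ("glove", "graph"):
        raise ValueError("embedding_type must be 'glove' or 'graph'")
    cmd = ["gremlin", "download", "-l", *languages, "-t", embedding_type]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    return cmd

cmd = build_download_command(["yo", "sw"], "glove")
print(shlex.join(cmd))  # gremlin download -l yo sw -t glove

# To actually run it (requires the package to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```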

Merge Embeddings

Merge different embedding sources:

gremlin merge -g <glove_file> -p <graph_file> -o <output_file>

Parameters:

-g/--glove: Path to GloVe embeddings file
-p/--graph: Path to graph embeddings file
-o/--output: Output path for merged embeddings

Example:

gremlin merge -g glove_yo.txt -p graph_yo.txt -o merged_yo.txt
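
Once a merged file exists, it can be inspected with a few lines of Python. The sketch below assumes the output uses the plain-text GloVe layout (one word per line followed by space-separated floats), which this page does not confirm; the function names and the three toy vectors are invented for the example.

```python
import io
import numpy as np

def load_text_embeddings(handle):
    """Read word vectors from a GloVe-style text stream (assumed format)."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 3:  # skip blank lines and a possible "count dim" header
            continue
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]], dtype=np.float32)
    return vectors

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-in for a file such as merged_yo.txt; the values are not real embeddings.
sample = io.StringIO("omi 0.9 0.1 0.0\nodo 0.8 0.2 0.1\nile 0.0 0.1 0.9\n")
vectors = load_text_embeddings(sample)

print(cosine(vectors["omi"], vectors["odo"]))
print(cosine(vectors["omi"], vectors["ile"]))
```

For a real file, replace the `io.StringIO` object with `open("merged_yo.txt", encoding="utf-8")`.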

Extended Help

For comprehensive instructions and usage examples:

gremlin --detailed-help

Citation

If you use our embedding enhancement method or pre-trained embeddings, please consider citing our preview paper (the full paper is to be published in the Findings of NAACL 2025):

@misc{gurgurov2024gremlinrepositorygreenbaseline,
      title={GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge}, 
      author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
      year={2024},
      eprint={2409.18193},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.18193}, 
}

License

This project is licensed under the Apache License - see the [LICENSE.txt] file for details.

Release history

This version

1.0.0

Feb 14, 2025

Download files

Source Distribution

gremlinn-1.0.0.tar.gz (20.7 kB view details)

Uploaded Feb 14, 2025 Source

Built Distribution

gremlinn-1.0.0-py3-none-any.whl (16.9 kB view details)

Uploaded Feb 14, 2025 Python 3

File details

Details for the file gremlinn-1.0.0.tar.gz.

File metadata

Download URL: gremlinn-1.0.0.tar.gz
Upload date: Feb 14, 2025
Size: 20.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.6

File hashes

Hashes for gremlinn-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`510aa19a509a566360257b98ba05fd51b97b1a1718cfa50970ac7122de45f1c1`
MD5	`e4c47d034e643e722716d1e440e5f53f`
BLAKE2b-256	`b104ad2c53887e9e6a07701883de21b75eb568e3cb93f423b5ccdb686457db93`

File details

Details for the file gremlinn-1.0.0-py3-none-any.whl.

File metadata

Download URL: gremlinn-1.0.0-py3-none-any.whl
Upload date: Feb 14, 2025
Size: 16.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.6

File hashes

Hashes for gremlinn-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6049ff3f176b5ad1c4271e6c04890d3351c806ff77157295c1af891e3accd9f6`
MD5	`76fe6ab7c3a106626e435c7a56e9817e`
BLAKE2b-256	`0776b9995886585a40ecaf51c16d78b3668b9112a0d8c7b42658795fcb6c4ca4`
