BabelVec
Position-aware, cross-lingually aligned word embeddings built on FastText.
Features
- Cross-Lingual Alignment: Procrustes alignment for multilingual compatibility
- Position-Aware Embeddings: Optional positional encoding (RoPE, sinusoidal, decay)
- FastText Foundation: Handles OOV words through subword information
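To make the out-of-vocabulary point concrete, here is a minimal sketch of the character n-gram scheme usually described for FastText: a word is wrapped in `<` and `>` boundary markers and split into overlapping n-grams, so an unseen word still shares pieces with known words. The function name `char_ngrams` and the 3 to 6 length range are assumptions for illustration, not BabelVec's API or its configured values.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# A word missing from the vocabulary still overlaps with a known one.
known = set(char_ngrams("training"))
unseen = set(char_ngrams("retraining"))
shared = known & unseen
print(f"{len(shared)} shared n-grams, e.g. {sorted(shared)[:3]}")
```

In a FastText-style model the vector for an unseen word is built from the vectors of such n-grams, which is why a lookup does not fail on new words.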
Installation
pip install babelvec
For visualization support:
pip install babelvec[viz]
Quick Start
from babelvec import BabelVec
# Load a model
model = BabelVec.load('path/to/model.bin')
# Get word vector
vec = model.get_word_vector("hello")
# Position-aware sentence embedding
vec1 = model.get_sentence_vector("The dog bites the man", method='rope')
vec2 = model.get_sentence_vector("The man bites the dog", method='rope')
# vec1 != vec2 because word order is encoded
# Simple averaging (no position encoding)
vec = model.get_sentence_vector("Hello world", method='average')
Training
Monolingual Training
from babelvec.training import train_monolingual
model = train_monolingual(
    lang='en',
    corpus_path='corpus.txt',
    dim=300,
    epochs=5,
    threads=8  # Optional: specify number of threads
)
model.save('en_300d.bin')
Parallel Multi-Language Training (v0.1.4+)
Train multiple languages simultaneously for faster training on multi-core servers:
from babelvec.training import train_multiple_languages, get_cpu_count
# Auto-detects CPU cores
print(f"Using {get_cpu_count()} cores")
models = train_multiple_languages(
    languages={'en': 'en_corpus.txt', 'ar': 'ar_corpus.txt'},
    parallel=True,  # Train languages simultaneously
    max_workers=2,  # Number of parallel training jobs
)
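The fan-out pattern behind training several languages at once can be sketched with the standard library. This is not BabelVec's implementation: `train_one` is a hypothetical placeholder that returns a label instead of a model, and a thread pool is used only to keep the sketch short (CPU-bound training would more plausibly use separate processes).

```python
import os
from concurrent.futures import ThreadPoolExecutor


def train_one(lang, corpus_path):
    # Hypothetical stand-in for a real training call.
    return f"{lang}-model-from-{corpus_path}"


def train_all(languages, max_workers=None):
    """Run one training job per language, at most max_workers at a time."""
    workers = max_workers or min(len(languages), os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            lang: pool.submit(train_one, lang, path)
            for lang, path in languages.items()
        }
        # Collect results keyed by language code.
        return {lang: future.result() for lang, future in futures.items()}


results = train_all({"en": "en_corpus.txt", "ar": "ar_corpus.txt"}, max_workers=2)
print(results)
```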
Multilingual Training with Alignment
from babelvec.training import train_multilingual
# parallel_pairs: parallel en-ar data prepared beforehand (not shown here)
models = train_multilingual(
    languages=['en', 'ar'],
    corpus_paths={'en': 'en.txt', 'ar': 'ar.txt'},
    parallel_data={('en', 'ar'): parallel_pairs},
    alignment='procrustes',
    threads=8  # Optional: specify number of threads
)
Post-hoc Alignment
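from babelvec.training import align_models
# model_en, model_ar: already-trained models
# parallel_pairs: parallel en-ar data (not shown here)
aligned = align_models(
    models={'en': model_en, 'ar': model_ar},
    parallel_data={('en', 'ar'): parallel_pairs},
    method='procrustes'
)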
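For readers unfamiliar with Procrustes alignment, the following is a minimal NumPy sketch of the textbook orthogonal Procrustes solution: given paired vectors from two spaces, it finds the orthogonal matrix that best maps one onto the other. It illustrates the general technique under the assumption that row `i` of `source` and row `i` of `target` are translations of each other; `procrustes_rotation` is a hypothetical name, and BabelVec's internals may differ.

```python
import numpy as np


def procrustes_rotation(source, target):
    """Return the orthogonal W minimizing ||source @ W - target||_F."""
    # The SVD of the cross-covariance gives the closed-form solution W = U V^T.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


rng = np.random.default_rng(0)
target = rng.normal(size=(50, 8))

# Build a "source" space by applying a random orthogonal map to the target.
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
source = target @ q.T

w = procrustes_rotation(source, target)
print("recovered alignment:", np.allclose(source @ w, target))
```

Because `w` is orthogonal, it preserves distances and angles within the source space, so monolingual similarity structure is unchanged by the alignment.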
Model Save/Load (v0.1.3+)
Models save projection matrices alongside the FastText binary:
# Save model
model.save('model.bin')
# Creates: model.bin, model.projection.npy (if aligned), model.meta.json
# Load model - projection is automatically restored
model = BabelVec.load('model.bin')
print(model.is_aligned) # True if projection was loaded
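The file naming described in the comments above can be checked before loading. This is a small sketch that only derives the sidecar names from the main path, assuming they follow the `model.bin` / `model.projection.npy` / `model.meta.json` pattern shown; both helper functions are hypothetical and not part of BabelVec.

```python
from pathlib import Path


def expected_model_files(model_path, aligned):
    """List the files a saved model is expected to consist of."""
    base = Path(model_path)
    files = [base, base.with_suffix(".meta.json")]
    if aligned:
        # The projection matrix is only written for aligned models.
        files.append(base.with_suffix(".projection.npy"))
    return files


def missing_model_files(model_path, aligned):
    """Return the expected files that are not present on disk."""
    return [p for p in expected_model_files(model_path, aligned) if not p.exists()]


print([p.name for p in expected_model_files("model.bin", aligned=True)])
```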
Encoding Methods
| Method | Description |
|---|---|
| rope | Rotary Position Embedding |
| decay | Exponential position decay |
| sinusoidal | Transformer-style positional encoding |
| average | Simple averaging (no position encoding) |
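To show why a position-aware method distinguishes sentences that plain averaging cannot, here is a simplified sketch of rotary-style pooling: each word vector has its coordinate pairs rotated by an angle that depends on the word's position, and the rotated vectors are then averaged. The toy vectors, the function names, and the `base` of 10000 are assumptions for illustration; this is not BabelVec's implementation of `rope`.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Rotate consecutive coordinate pairs by a position-dependent angle."""
    dim = len(vec)
    out = list(vec)
    for i in range(0, dim - 1, 2):
        theta = position / (base ** (i / dim))
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * cos_t - y * sin_t
        out[i + 1] = x * sin_t + y * cos_t
    return out


def mean_pool(vectors):
    """Average equal-length vectors coordinate by coordinate."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rope_pool(vectors):
    """Rotate each word vector by its position, then average."""
    return mean_pool([rotate_pairs(v, pos) for pos, v in enumerate(vectors)])


def close(u, v, tol=1e-9):
    """Compare two vectors up to floating-point tolerance."""
    return all(math.isclose(x, y, abs_tol=tol) for x, y in zip(u, v))


# Toy 4-dimensional word vectors (made-up values).
toy = {
    "dog": [1.0, 0.0, 0.5, 0.2],
    "bites": [0.0, 1.0, 0.1, 0.9],
    "man": [0.7, 0.7, 0.3, 0.3],
}
a = [toy[w] for w in ["dog", "bites", "man"]]
b = [toy[w] for w in ["man", "bites", "dog"]]

print("average identical:", close(mean_pool(a), mean_pool(b)))
print("rope identical:", close(rope_pool(a), rope_pool(b)))
```

Averaging is order-invariant because addition commutes; the rotation breaks that symmetry by making each word's contribution depend on where it occurs.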
Evaluation
from babelvec.evaluation import cross_lingual_retrieval
# model_en, model_ar: aligned models; test_pairs: held-out parallel sentences (not shown here)
metrics = cross_lingual_retrieval(
    model_src=model_en,
    model_tgt=model_ar,
    parallel_sentences=test_pairs,
    method='rope'
)
print(f"Recall@1: {metrics['recall@1']:.3f}")
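As a reference for what a Recall@k retrieval metric measures, here is a self-contained sketch: for each source sentence vector, rank all target vectors by cosine similarity and count a hit when the true translation (assumed to sit at the same index) is in the top k. The function names and toy vectors are illustrative assumptions; `cross_lingual_retrieval` may compute its metrics differently.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def recall_at_k(src_vecs, tgt_vecs, k=1):
    """Fraction of sources whose same-index target ranks in the top k."""
    hits = 0
    for i, src in enumerate(src_vecs):
        ranked = sorted(
            range(len(tgt_vecs)),
            key=lambda j: cosine(src, tgt_vecs[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(src_vecs)


# Toy sentence vectors (made-up values); index i in src pairs with index i in tgt.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]
for k in (1, 2, 3):
    print(f"Recall@{k}: {recall_at_k(src, tgt, k):.3f}")
```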
Language Families for Joint Training
BabelVec includes a curated family assignment system for 355 Wikipedia languages, optimized for joint multilingual training.
from babelvec.families import get_family_key, get_family_languages, get_training_groups
# Get family for a language
get_family_key("ary") # -> "arabic"
get_family_key("fr") # -> "romance_galloitalic"
# Get all languages in a family
get_family_languages("arabic") # -> ["ar", "ary", "arz"]
# Create training groups (hybrid strategy)
groups = get_training_groups(
    languages=["en", "ar", "ary", "arz"],
    article_counts={"en": 6000000, "ar": 840000, "ary": 17000, "arz": 40000},
    low_resource_threshold=50000
)
# -> {"separate": ["en", "ar"], "joint": {"arabic": ["ary", "arz"]}}
Joint training substantially improves low-resource languages (reported gains of 200-600% for Arabic dialects), whereas high-resource languages should be trained separately.
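The hybrid strategy can be read as a simple rule: languages at or above the article threshold train on their own, and the rest are pooled by family. The sketch below reproduces the example output under that reading. `FAMILY_OF` is a hypothetical stand-in for the curated family map, `split_training_groups` is not a BabelVec function, and the real `get_training_groups` may handle edge cases (such as a family with a single low-resource language) differently.

```python
from collections import defaultdict

# Hypothetical excerpt of a family map; languages not listed fall back to
# their own code as the family key.
FAMILY_OF = {"ar": "arabic", "ary": "arabic", "arz": "arabic"}


def split_training_groups(languages, article_counts, low_resource_threshold):
    """Split languages into separately trained and jointly trained groups."""
    separate = []
    joint = defaultdict(list)
    for lang in languages:
        if article_counts.get(lang, 0) >= low_resource_threshold:
            separate.append(lang)
        else:
            joint[FAMILY_OF.get(lang, lang)].append(lang)
    return {"separate": separate, "joint": dict(joint)}


groups = split_training_groups(
    languages=["en", "ar", "ary", "arz"],
    article_counts={"en": 6000000, "ar": 840000, "ary": 17000, "arz": 40000},
    low_resource_threshold=50000,
)
print(groups)
```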
Examples
See the examples/ directory:
- 01_basic_usage.py: Getting started
Citation
@misc{babelvec2025,
  title = {BabelVec: Position-Aware Cross-Lingual Word Embeddings},
  author = {Kamali, Omar},
  doi = {10.5281/zenodo.18065206},
  publisher = {Zenodo},
  year = {2025},
  url = {https://github.com/omarkamali/babelvec}
}
License
MIT License - see LICENSE for details.
Copyright © 2025 Omar Kamali