BabelVec

Position-aware, cross-lingually aligned word embeddings built on FastText.

Features

Cross-Lingual Alignment: Procrustes alignment for multilingual compatibility
Position-Aware Embeddings: Optional positional encoding (RoPE, sinusoidal, decay)
FastText Foundation: Handles OOV words through subword information
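
As a rough illustration of the subword idea behind the OOV handling, the sketch below extracts FastText-style character n-grams from a word wrapped in boundary markers. The function name `char_ngrams` is hypothetical and not part of BabelVec. The 3 to 6 n-gram range mirrors fastText's usual unsupervised defaults; the settings BabelVec itself uses are not stated here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of `word` wrapped in '<' and '>' boundary markers.

    Hypothetical helper for illustration only. fastText additionally hashes
    these n-grams into buckets, which this sketch leaves out.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

An unseen word still shares many of these n-grams with known words, which is why a vector can be composed for it.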

Installation

pip install babelvec

For visualization support:

pip install babelvec[viz]

Quick Start

from babelvec import BabelVec

# Load a model
model = BabelVec.load('path/to/model.bin')

# Get word vector
vec = model.get_word_vector("hello")

# Position-aware sentence embedding
vec1 = model.get_sentence_vector("The dog bites the man", method='rope')
vec2 = model.get_sentence_vector("The man bites the dog", method='rope')
# vec1 != vec2 because word order is encoded

# Simple averaging (no position encoding)
vec = model.get_sentence_vector("Hello world", method='average')
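
To quantify how far apart two sentence vectors such as `vec1` and `vec2` are, cosine similarity is a common choice. The helper below is a minimal sketch and not part of BabelVec; it assumes each vector is a one-dimensional sequence of floats (a NumPy array also works, since it is only iterated).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

With a loaded model, `cosine_similarity(vec1, vec2)` should fall below 1.0 when a position-aware method is used, because the two word orders produce different vectors.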

Training

Monolingual Training

from babelvec.training import train_monolingual

model = train_monolingual(
    lang='en',
    corpus_path='corpus.txt',
    dim=300,
    epochs=5,
    threads=8  # Optional: specify number of threads
)
model.save('en_300d.bin')

Parallel Multi-Language Training (v0.1.4+)

Train several languages concurrently to shorten total training time on multi-core servers:

from babelvec.training import train_multiple_languages, get_cpu_count

# Auto-detects CPU cores
print(f"Using {get_cpu_count()} cores")

models = train_multiple_languages(
    languages={'en': 'en_corpus.txt', 'ar': 'ar_corpus.txt'},
    parallel=True,      # Train languages simultaneously
    max_workers=2,      # Number of parallel training jobs
)
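
The general pattern behind running several training jobs at once can be sketched with the standard library. This is not how `train_multiple_languages` is implemented (the README does not say); `fake_train` is a hypothetical stand-in for a real training call, and the even split of cores across jobs is an assumption. Threads are used only to keep the sketch runnable anywhere; CPU-bound training would more plausibly run in separate processes.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def fake_train(lang, corpus_path, threads):
    """Hypothetical stand-in for a real training call; returns a summary dict."""
    return {"lang": lang, "corpus": corpus_path, "threads": threads}


def train_all(languages, max_workers=2):
    """Run one job per language, at most `max_workers` at a time."""
    cores = os.cpu_count() or 1
    # Assumption: divide the available cores evenly between concurrent jobs.
    threads_per_job = max(1, cores // max_workers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            lang: pool.submit(fake_train, lang, path, threads_per_job)
            for lang, path in languages.items()
        }
        return {lang: future.result() for lang, future in futures.items()}


results = train_all({"en": "en_corpus.txt", "ar": "ar_corpus.txt"})
```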

Multilingual Training with Alignment

from babelvec.training import train_multilingual

# parallel_pairs: en-ar parallel data prepared beforehand (not defined in this snippet)
models = train_multilingual(
    languages=['en', 'ar'],
    corpus_paths={'en': 'en.txt', 'ar': 'ar.txt'},
    parallel_data={('en', 'ar'): parallel_pairs},
    alignment='procrustes',
    threads=8  # Optional: specify number of threads
)

Post-hoc Alignment

from babelvec.training import align_models

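# model_en, model_ar: previously trained or loaded models
# parallel_pairs: en-ar parallel data prepared beforehand
aligned = align_models(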
    models={'en': model_en, 'ar': model_ar},
    parallel_data={('en', 'ar'): parallel_pairs},
    method='procrustes'
)
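
For readers unfamiliar with Procrustes alignment, the sketch below shows the standard orthogonal Procrustes solution with NumPy: given two matrices whose rows are vectors for the same items in two spaces, it finds the orthogonal map that best carries one onto the other. How BabelVec derives those rows from `parallel_data` is not described here, so the synthetic data and the name `procrustes_rotation` are assumptions made for illustration.

```python
import numpy as np


def procrustes_rotation(src, tgt):
    """Return the orthogonal matrix W minimizing ||src @ W - tgt||_F.

    Standard closed form: W = U @ Vt, where U, S, Vt = svd(src.T @ tgt).
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Synthetic check: build a source space as an exact rotation of the target.
rng = np.random.default_rng(0)
tgt = rng.normal(size=(50, 4))
true_rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
src = tgt @ true_rotation.T  # so that src @ true_rotation == tgt

w = procrustes_rotation(src, tgt)
print(np.allclose(src @ w, tgt))
```

Because the map is orthogonal, distances and angles within the source space are preserved; only its orientation changes.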

Model Save/Load (v0.1.3+)

Models save projection matrices alongside the FastText binary:

# Save model
model.save('model.bin')
# Creates: model.bin, model.projection.npy (if aligned), model.meta.json

# Load model - projection is automatically restored
model = BabelVec.load('model.bin')
print(model.is_aligned)  # True if projection was loaded
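
When copying or deploying a model, it helps to know which companion files travel with the binary. The sketch below derives the file names listed above from the model path and reports which ones are absent. Both helper names are hypothetical and not part of BabelVec.

```python
from pathlib import Path


def sidecar_paths(model_path):
    """Map the model binary path to the companion files named in this section."""
    base = Path(model_path)
    return {
        "binary": base,
        "projection": base.with_suffix(".projection.npy"),  # present only if aligned
        "meta": base.with_suffix(".meta.json"),
    }


def missing_files(model_path):
    """Return the roles whose files do not exist on disk."""
    return [role for role, path in sidecar_paths(model_path).items() if not path.exists()]
```

For `model.bin`, a missing `projection` entry is expected for an unaligned model, while a missing `binary` means the model cannot be loaded at all.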

Encoding Methods

Method	Description
`rope`	Rotary Position Embedding
`decay`	Exponential position decay
`sinusoidal`	Transformer-style positional encoding
`average`	Simple averaging (no position encoding)
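
To make the difference between these methods concrete, here is a minimal pure-Python sketch of three pooling strategies over toy word vectors. It illustrates the general techniques only: the exact formulas and parameters BabelVec uses are not given here, so `rate`, `base`, and all function names are assumptions.

```python
import math


def average_pool(vectors):
    """Plain mean of the word vectors; word order has no effect."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def decay_pool(vectors, rate=0.1):
    """Weighted mean with weights exp(-rate * position)."""
    weights = [math.exp(-rate * pos) for pos in range(len(vectors))]
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, column)) / total
            for column in zip(*vectors)]


def rope_rotate(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    dim = len(vec)
    out = list(vec)
    for i in range(0, dim - 1, 2):
        angle = pos / (base ** (i / dim))
        c, s = math.cos(angle), math.sin(angle)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out


def rope_pool(vectors):
    """Rotate each word vector by its position, then average."""
    return average_pool([rope_rotate(v, pos) for pos, v in enumerate(vectors)])


# Toy 4-dimensional "word vectors" (made up for the example).
dog, bites, man = [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]
sentence_a = [dog, bites, man]
sentence_b = [man, bites, dog]
```

Averaging gives the same vector for `sentence_a` and `sentence_b`, while the decay and rotary variants do not, which is the property the position-aware methods rely on.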

Evaluation

from babelvec.evaluation import cross_lingual_retrieval

# model_en, model_ar: aligned models; test_pairs: parallel sentences used for evaluation
metrics = cross_lingual_retrieval(
    model_src=model_en,
    model_tgt=model_ar,
    parallel_sentences=test_pairs,
    method='rope'
)
print(f"Recall@1: {metrics['recall@1']:.3f}")
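
As a reference for reading the metric, the sketch below computes recall@k the way it is commonly defined for retrieval: each source sentence vector is compared against all target vectors by cosine similarity, and a hit is counted when its true translation ranks within the top k. Whether `cross_lingual_retrieval` computes it exactly this way is an assumption, and `recall_at_k` is a hypothetical name.

```python
import math


def _unit(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def recall_at_k(src_vecs, tgt_vecs, k=1):
    """Fraction of sources whose paired target (same index) ranks in the top k."""
    tgt_units = [_unit(v) for v in tgt_vecs]
    hits = 0
    for i, src in enumerate(src_vecs):
        s = _unit(src)
        scores = [sum(a * b for a, b in zip(s, t)) for t in tgt_units]
        # Indices of targets sorted from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(src_vecs)
```

A recall@1 of 1.0 means every source sentence retrieved its own translation first.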

Language Families for Joint Training

BabelVec includes a curated family assignment system for 355 Wikipedia languages, optimized for joint multilingual training.

from babelvec.families import get_family_key, get_family_languages, get_training_groups

# Get family for a language
get_family_key("ary")  # -> "arabic"
get_family_key("fr")   # -> "romance_galloitalic"

# Get all languages in a family
get_family_languages("arabic")  # -> ["ar", "ary", "arz"]

# Create training groups (hybrid strategy)
groups = get_training_groups(
    languages=["en", "ar", "ary", "arz"],
    article_counts={"en": 6000000, "ar": 840000, "ary": 17000, "arz": 40000},
    low_resource_threshold=50000
)
# -> {"separate": ["en", "ar"], "joint": {"arabic": ["ary", "arz"]}}
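
The hybrid strategy in the example above can be read as a simple rule: languages at or above the threshold train on their own, and low-resource languages are pooled by family. The sketch below reproduces that example under this reading. It is not BabelVec's implementation; `plan_training_groups`, the `family_of` argument, and the handling of a low-resource language with no family partner are assumptions.

```python
from collections import defaultdict


def plan_training_groups(languages, article_counts, family_of, low_resource_threshold=50000):
    """Split languages into separately trained ones and jointly trained families."""
    separate = []
    by_family = defaultdict(list)
    for lang in languages:
        if article_counts.get(lang, 0) >= low_resource_threshold:
            separate.append(lang)
        else:
            by_family[family_of[lang]].append(lang)

    joint = {}
    for family, members in by_family.items():
        if len(members) > 1:
            joint[family] = members
        else:
            # Assumption: a lone low-resource language has nothing to pool with.
            separate.extend(members)
    return {"separate": separate, "joint": joint}


groups = plan_training_groups(
    languages=["en", "ar", "ary", "arz"],
    article_counts={"en": 6000000, "ar": 840000, "ary": 17000, "arz": 40000},
    family_of={"ar": "arabic", "ary": "arabic", "arz": "arabic"},
)
print(groups)
# {'separate': ['en', 'ar'], 'joint': {'arabic': ['ary', 'arz']}}
```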

Joint training substantially improves low-resource languages (+200-600% reported for Arabic dialects), whereas high-resource languages should be trained separately.

Examples

See the examples/ directory:

01_basic_usage.py - Getting started

Citation

@misc{babelvec2025,
  title = {BabelVec: Position-Aware Cross-Lingual Word Embeddings},
  author = {Kamali, Omar},
  doi = {10.5281/zenodo.18065206},
  publisher = {Zenodo},
  year = {2025},
  url = {https://github.com/omarkamali/babelvec}
}

License

MIT License - see LICENSE for details.
