BabelVec

Position-aware, cross-lingually aligned word embeddings built on FastText.

Requires Python 3.10+.

Features

  • Cross-Lingual Alignment: Procrustes alignment for multilingual compatibility
  • Position-Aware Embeddings: Optional positional encoding (RoPE, sinusoidal, decay)
  • FastText Foundation: Handles OOV words through subword information
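The subword mechanism behind the OOV handling can be illustrated with a few lines of plain Python. FastText represents a word by its character n-grams (3 to 6 characters by default), extracted after wrapping the word in `<` and `>` boundary markers, so an unseen word still shares n-grams with words seen in training. The helper name `char_ngrams` below is hypothetical and only sketches the extraction step; the real library also hashes n-grams into buckets and keeps a vector for the whole word.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return the character n-grams of `word`, FastText-style."""
    # Boundary markers let the model tell prefixes and suffixes apart.
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

A vector for an out-of-vocabulary word is then built from the vectors of these n-grams, which is why misspellings and rare inflections still get a usable embedding.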

Installation

pip install babelvec

For visualization support:

pip install babelvec[viz]

Quick Start

from babelvec import BabelVec

# Load a model
model = BabelVec.load('path/to/model.bin')

# Get word vector
vec = model.get_word_vector("hello")

# Position-aware sentence embedding
vec1 = model.get_sentence_vector("The dog bites the man", method='rope')
vec2 = model.get_sentence_vector("The man bites the dog", method='rope')
# vec1 != vec2 because word order is encoded

# Simple averaging (no position encoding)
vec = model.get_sentence_vector("Hello world", method='average')
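To see why `method='average'` cannot distinguish the two sentences while a rotary method can, here is a minimal standalone sketch. It assumes that position is injected by rotating each word vector by a position-dependent angle before averaging, using the standard RoPE frequency schedule; this is an illustration of the idea, not BabelVec's actual implementation. The toy vectors and the helper names (`rope_rotate`, `sentence_vector`, `close`) are made up for the example.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by a position-dependent angle."""
    dim = len(vec)  # assumed to be even
    rotated = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated


def sentence_vector(tokens, table, use_rope):
    """Average the word vectors, optionally rotating each one by its position."""
    vectors = [table[t] for t in tokens]
    if use_rope:
        vectors = [rope_rotate(v, pos) for pos, v in enumerate(vectors)]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def close(u, v):
    return all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(u, v))


# Hypothetical 4-dimensional word vectors.
toy = {
    "dog": [1.0, 0.0, 0.5, 0.5],
    "bites": [0.0, 1.0, -0.5, 0.5],
    "man": [0.7, 0.7, 0.2, -0.1],
}
dog_bites = "dog bites man".split()
man_bites = "man bites dog".split()

print(close(sentence_vector(dog_bites, toy, False), sentence_vector(man_bites, toy, False)))  # True
print(close(sentence_vector(dog_bites, toy, True), sentence_vector(man_bites, toy, True)))    # False
```

Plain averaging is a sum over a bag of words, so any permutation gives the same result; once each vector depends on its position, swapping "dog" and "man" changes the sum.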

Training

Monolingual Training

from babelvec.training import train_monolingual

model = train_monolingual(
    lang='en',
    corpus_path='corpus.txt',
    dim=300,
    epochs=5,
    threads=8  # Optional: specify number of threads
)
model.save('en_300d.bin')

Parallel Multi-Language Training (v0.1.4+)

Train several languages at the same time to reduce total training time on multi-core servers:

from babelvec.training import train_multiple_languages, get_cpu_count

# Auto-detects CPU cores
print(f"Using {get_cpu_count()} cores")

models = train_multiple_languages(
    languages={'en': 'en_corpus.txt', 'ar': 'ar_corpus.txt'},
    parallel=True,      # Train languages simultaneously
    max_workers=2,      # Number of parallel training jobs
)
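The `parallel=True` / `max_workers` pattern can be pictured as a worker pool that runs one training job per language. The sketch below uses the standard library's `ThreadPoolExecutor` with a placeholder `fake_train` function; both the function and the `run_jobs` helper are hypothetical, and how BabelVec schedules its jobs internally is not described here.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_train(lang: str, corpus_path: str) -> str:
    # Placeholder standing in for a real per-language training call.
    return f"model[{lang}] from {corpus_path}"


def run_jobs(languages: dict[str, str], max_workers: int) -> dict[str, str]:
    """Run one job per language, at most `max_workers` at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            lang: pool.submit(fake_train, lang, path)
            for lang, path in languages.items()
        }
        # Collect results keyed by language once each job finishes.
        return {lang: future.result() for lang, future in futures.items()}


results = run_jobs({'en': 'en_corpus.txt', 'ar': 'ar_corpus.txt'}, max_workers=2)
print(results)
```

When several jobs run at once, the cores are shared between them, so it is usually worth keeping `max_workers` times the per-job thread count at or below the number of available cores.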

Multilingual Training with Alignment

from babelvec.training import train_multilingual

# parallel_pairs is your aligned en-ar data, prepared beforehand.
models = train_multilingual(
    languages=['en', 'ar'],
    corpus_paths={'en': 'en.txt', 'ar': 'ar.txt'},
    parallel_data={('en', 'ar'): parallel_pairs},
    alignment='procrustes',
    threads=8  # Optional: specify number of threads
)

Post-hoc Alignment

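from babelvec.training import align_models

# model_en and model_ar are previously trained BabelVec models;
# parallel_pairs is your aligned en-ar data, prepared beforehand.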
aligned = align_models(
    models={'en': model_en, 'ar': model_ar},
    parallel_data={('en', 'ar'): parallel_pairs},
    method='procrustes'
)
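For intuition about what Procrustes alignment computes: given paired vectors from two spaces, it finds the orthogonal map that brings the source points as close as possible to their targets. In the general case the solution is W = U Vᵀ, where U Σ Vᵀ is the SVD of XᵀY. The sketch below covers only the two-dimensional, rotation-only special case, which has a closed form and needs nothing beyond the standard library. It is not BabelVec's implementation, and the helper names are hypothetical.

```python
import math


def fit_rotation(src, tgt):
    """Angle (radians) of the rotation R minimising sum ||R x - y||^2 over paired 2-D points."""
    dot = sum(x[0] * y[0] + x[1] * y[1] for x, y in zip(src, tgt))
    cross = sum(x[0] * y[1] - x[1] * y[0] for x, y in zip(src, tgt))
    return math.atan2(cross, dot)


def rotate(point, angle):
    c, s = math.cos(angle), math.sin(angle)
    return (point[0] * c - point[1] * s, point[0] * s + point[1] * c)


# Toy "source-language" vectors, and the same points rotated by 40 degrees
# to play the role of their translations in the target space.
src = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]
tgt = [rotate(p, math.radians(40)) for p in src]

angle = fit_rotation(src, tgt)
aligned = [rotate(p, angle) for p in src]
print(round(math.degrees(angle), 3))
```

Because the map is a rotation, distances and angles within the source space are preserved; only its orientation relative to the target space changes.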

Model Save/Load (v0.1.3+)

Saving a model writes its projection matrix and metadata alongside the FastText binary:

# Save model
model.save('model.bin')
# Creates: model.bin, model.projection.npy (if aligned), model.meta.json

# Load model - projection is automatically restored
model = BabelVec.load('model.bin')
print(model.is_aligned)  # True if projection was loaded
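Because a saved model spans several files, moving or copying it means taking all of them. The helper below derives the companion file names from the model path, assuming the naming shown in the comment above (`model.bin`, `model.projection.npy`, `model.meta.json`). `sidecar_paths` is a hypothetical convenience function, not part of BabelVec.

```python
from pathlib import Path


def sidecar_paths(model_path: str) -> dict[str, Path]:
    """Map each expected file role to its path, based on the .bin file name."""
    base = Path(model_path)
    return {
        "binary": base,
        "projection": base.with_suffix(".projection.npy"),  # present only if aligned
        "meta": base.with_suffix(".meta.json"),
    }


for role, path in sidecar_paths("model.bin").items():
    print(f"{role}: {path} (exists: {path.exists()})")
```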

Encoding Methods

Method      Description
rope        Rotary Position Embedding
decay       Exponential position decay
sinusoidal  Transformer-style positional encoding
average     Simple averaging (no position encoding)
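As a rough illustration of the `decay` and `sinusoidal` rows, the sketch below shows the textbook form of each idea: exponentially shrinking per-position weights, and the sine/cosine encoding from the original Transformer. The function names, the `rate` parameter, and the choice to weight earlier words more heavily are assumptions made for the example; BabelVec's exact formulas may differ.

```python
import math


def decay_weights(length: int, rate: float = 0.5) -> list[float]:
    """Normalised weights that shrink exponentially with position."""
    raw = [math.exp(-rate * pos) for pos in range(length)]
    total = sum(raw)
    return [w / total for w in raw]


def sinusoidal_encoding(position: int, dim: int, base: float = 10000.0) -> list[float]:
    """Sine/cosine pairs at geometrically spaced frequencies."""
    encoding = []
    for i in range(0, dim, 2):
        angle = position / base ** (i / dim)
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding[:dim]


print([round(w, 3) for w in decay_weights(3)])
print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

With decay, the sentence vector becomes a weighted average of word vectors; with sinusoidal encoding, a position vector is typically added to each word vector before pooling.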

Evaluation

from babelvec.evaluation import cross_lingual_retrieval

# model_en and model_ar are aligned models; test_pairs is held-out parallel data.
metrics = cross_lingual_retrieval(
    model_src=model_en,
    model_tgt=model_ar,
    parallel_sentences=test_pairs,
    method='rope'
)
print(f"Recall@1: {metrics['recall@1']:.3f}")
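Recall@k for cross-lingual retrieval is usually computed as follows: embed every source and target sentence, rank all targets by cosine similarity for each source, and count how often the true translation lands in the top k. The sketch below shows that computation on toy 2-D vectors in plain Python. It is meant to explain the metric, not to reproduce the internals of `cross_lingual_retrieval`, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(src_vecs, tgt_vecs, k=1):
    """Fraction of source items whose paired target (same index) ranks in the top k."""
    hits = 0
    for i, src in enumerate(src_vecs):
        ranked = sorted(
            range(len(tgt_vecs)),
            key=lambda j: cosine(src, tgt_vecs[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(src_vecs)


# Toy sentence embeddings; row i of src is the translation of row i of tgt.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.2]]
tgt = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.7]]

print(f"Recall@1: {recall_at_k(src, tgt, k=1):.3f}")  # Recall@1: 0.667
print(f"Recall@2: {recall_at_k(src, tgt, k=2):.3f}")  # Recall@2: 1.000
```

Here the third source vector is closest to the first target, so it counts as a miss at k=1 but as a hit at k=2.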

Language Families for Joint Training

BabelVec includes a curated family assignment system for 355 Wikipedia languages, optimized for joint multilingual training.

from babelvec.families import get_family_key, get_family_languages, get_training_groups

# Get family for a language
get_family_key("ary")  # -> "arabic"
get_family_key("fr")   # -> "romance_galloitalic"

# Get all languages in a family
get_family_languages("arabic")  # -> ["ar", "ary", "arz"]

# Create training groups (hybrid strategy)
groups = get_training_groups(
    languages=["en", "ar", "ary", "arz"],
    article_counts={"en": 6000000, "ar": 840000, "ary": 17000, "arz": 40000},
    low_resource_threshold=50000
)
# -> {"separate": ["en", "ar"], "joint": {"arabic": ["ary", "arz"]}}

Joint training dramatically improves results for low-resource languages (+200-600% for Arabic dialects), whereas high-resource languages should be trained separately.
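The hybrid strategy can be sketched as a simple partition: languages at or above the article threshold train on their own, and the rest are pooled by family. The code below is an illustration under stated assumptions, not the library's logic: `plan_training_groups` and the `family_of` mapping are hypothetical, the `"germanic"` key for English is a guess, and whether the real threshold comparison is inclusive is not stated in this README.

```python
def plan_training_groups(languages, article_counts, family_of, low_resource_threshold=50000):
    """Split languages into separately trained ones and per-family joint groups."""
    separate, joint = [], {}
    for lang in languages:
        if article_counts.get(lang, 0) >= low_resource_threshold:
            separate.append(lang)  # enough data to train alone
        else:
            joint.setdefault(family_of[lang], []).append(lang)  # pool with relatives
    return {"separate": separate, "joint": joint}


# Hypothetical family lookup; in BabelVec this role is played by get_family_key.
family_of = {"en": "germanic", "ar": "arabic", "ary": "arabic", "arz": "arabic"}

groups = plan_training_groups(
    languages=["en", "ar", "ary", "arz"],
    article_counts={"en": 6000000, "ar": 840000, "ary": 17000, "arz": 40000},
    family_of=family_of,
)
print(groups)
# {'separate': ['en', 'ar'], 'joint': {'arabic': ['ary', 'arz']}}
```

With the article counts from the example above, this reproduces the grouping shown there: English and Standard Arabic stay separate, while the two dialects are trained jointly.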

Examples

See the examples/ directory:

  • 01_basic_usage.py - Getting started

Citation

@misc{babelvec2025,
  title = {BabelVec: Position-Aware Cross-Lingual Word Embeddings},
  author = {Kamali, Omar},
  doi = {10.5281/zenodo.18065206},
  publisher = {Zenodo},
  year = {2025},
  url = {https://github.com/omarkamali/babelvec}
}

License

MIT License - see LICENSE for details.

Copyright © 2025 Omar Kamali
