Position-aware, cross-lingually aligned word embeddings built on FastText

BabelVec

Requires Python 3.9+.

Features

  • Cross-Lingual Alignment: Procrustes alignment for multilingual compatibility
  • Position-Aware Embeddings: Optional positional encoding (RoPE, sinusoidal, decay)
  • FastText Foundation: Handles OOV words through subword information
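
To make the out-of-vocabulary point concrete, here is a minimal sketch of FastText-style character n-gram extraction. It is a simplified illustration, not BabelVec's or FastText's actual code; the function name `char_ngrams` is hypothetical, and the 3 to 6 range mirrors FastText's usual defaults.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return the character n-grams of `word`, wrapped in boundary markers."""
    wrapped = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# A word seen in training and an unseen variant share many n-grams,
# so the unseen word can still be assigned a meaningful vector.
known_grams = set(char_ngrams("walking"))
unseen_grams = set(char_ngrams("walkings"))
shared = known_grams & unseen_grams
print(f"{len(shared)} shared n-grams, e.g. {sorted(shared)[:5]}")
```

A subword model builds the vector of an unseen word from the vectors of its n-grams, which is why a typo or an inflected form still lands near related words.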

Installation

pip install babelvec

For visualization support:

pip install babelvec[viz]

Quick Start

from babelvec import BabelVec

# Load a model
model = BabelVec.load('path/to/model.bin')

# Get word vector
vec = model.get_word_vector("hello")

# Position-aware sentence embedding
vec1 = model.get_sentence_vector("The dog bites the man", method='rope')
vec2 = model.get_sentence_vector("The man bites the dog", method='rope')
# vec1 != vec2 because word order is encoded

# Simple averaging (no position encoding); word order does not affect the result
avg_vec = model.get_sentence_vector("Hello world", method='average')

Training

Monolingual Training

from babelvec.training import train_monolingual

model = train_monolingual(
    lang='en',
    corpus_path='corpus.txt',
    dim=300,
    epochs=5
)
model.save('en_300d.bin')

Parallel Multi-Language Training (v0.1.4+)

Train several languages concurrently to reduce wall-clock time on multi-core machines:

from babelvec.training import train_multiple_languages, get_cpu_count

# Auto-detects CPU cores
print(f"Using {get_cpu_count()} cores")

models = train_multiple_languages(
    languages={'en': 'en_corpus.txt', 'ar': 'ar_corpus.txt'},
    parallel=True,      # Train languages simultaneously
    max_workers=2,      # Number of parallel training jobs
)

Multilingual Training with Alignment

from babelvec.training import train_multilingual

# parallel_pairs: parallel data for the (en, ar) pair, prepared beforehand
# (not defined in this snippet)
models = train_multilingual(
    languages=['en', 'ar'],
    corpus_paths={'en': 'en.txt', 'ar': 'ar.txt'},
    parallel_data={('en', 'ar'): parallel_pairs},
    alignment='procrustes'
)

Post-hoc Alignment

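from babelvec.training import align_models

# model_en and model_ar are models trained or loaded earlier;
# parallel_pairs is the parallel data for the (en, ar) pair
aligned = align_models(
    models={'en': model_en, 'ar': model_ar},
    parallel_data={('en', 'ar'): parallel_pairs},
    method='procrustes'
)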
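
For intuition about what Procrustes alignment computes, the sketch below solves a two-dimensional, rotation-only special case in plain Python. This is an illustration under stated assumptions, not BabelVec's implementation: the helper names `fit_rotation_2d` and `rotate` are hypothetical, and the toy vectors are made up. In the general d-dimensional case the orthogonal map is usually obtained from a singular value decomposition of the cross-covariance of the paired vectors.

```python
import math


def fit_rotation_2d(src, tgt):
    """Angle theta minimizing sum ||R(theta) s - t||^2 over paired 2-D points."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(src, tgt))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(src, tgt))
    return math.atan2(cross, dot)


def rotate(point, theta):
    """Rotate a 2-D point counter-clockwise by theta radians."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


# Toy "source language" vectors and their "target language" counterparts:
# the same layout rotated by 30 degrees (made-up data).
src_vectors = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
tgt_vectors = [rotate(p, math.radians(30)) for p in src_vectors]

theta = fit_rotation_2d(src_vectors, tgt_vectors)
aligned_vectors = [rotate(p, theta) for p in src_vectors]
print(f"recovered rotation: {math.degrees(theta):.1f} degrees")
```

Because the map is a pure rotation, distances and angles within the source space are preserved; only its orientation changes so that translation pairs end up close together.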

Model Save/Load (v0.1.3+)

Saving a model writes the projection matrix and metadata alongside the FastText binary:

# Save model
model.save('model.bin')
# Creates: model.bin, model.projection.npy (if aligned), model.meta.json

# Load model - projection is automatically restored
model = BabelVec.load('model.bin')
print(model.is_aligned)  # True if projection was loaded
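
When copying or archiving a model, all companion files need to travel together. The helper below derives their expected paths from the naming pattern shown above; `companion_files` is a hypothetical name, and the pattern is inferred from the example rather than from a documented API.

```python
from pathlib import Path


def companion_files(model_path: str) -> dict[str, Path]:
    """Expected files for a saved model, following the naming shown above."""
    base = Path(model_path)
    return {
        "binary": base,
        "projection": base.with_suffix(".projection.npy"),  # only if aligned
        "meta": base.with_suffix(".meta.json"),
    }


files = companion_files("model.bin")
missing = [name for name, path in files.items() if not path.exists()]
print("missing:", missing or "none")
```

A missing projection file is not necessarily an error, since it is only written for aligned models.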

Encoding Methods

Method      Description
rope        Rotary Position Embedding (RoPE)
decay       Exponential position decay
sinusoidal  Transformer-style positional encoding
average     Simple averaging (no position encoding)
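
The sketch below illustrates why position-aware pooling distinguishes "dog bites man" from "man bites dog" while plain averaging does not. It is a toy reimplementation under stated assumptions, not BabelVec's code: the function names, the decay factor of 0.9, the RoPE base of 10000, and the one-hot word vectors are all chosen for illustration.

```python
import math


def average_pool(vectors):
    """Plain mean over word vectors; ignores word order."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def decay_pool(vectors, decay=0.9):
    """Weighted mean where the word at position p gets weight decay**p."""
    weights = [decay ** pos for pos in range(len(vectors))]
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, col)) / total
            for col in zip(*vectors)]


def rope_pool(vectors, base=10000.0):
    """Rotate each pair of dimensions by a position-dependent angle, then average."""
    dim = len(vectors[0])
    rotated = []
    for pos, vec in enumerate(vectors):
        out = list(vec)
        for i in range(0, dim - 1, 2):
            angle = pos * base ** (-i / dim)
            c, s = math.cos(angle), math.sin(angle)
            out[i] = c * vec[i] - s * vec[i + 1]
            out[i + 1] = s * vec[i] + c * vec[i + 1]
        rotated.append(out)
    return average_pool(rotated)


# Made-up 4-dimensional word vectors
dog, bites, man = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]
s1 = [dog, bites, man]
s2 = [man, bites, dog]

print("average equal:", average_pool(s1) == average_pool(s2))
print("decay equal:  ", decay_pool(s1) == decay_pool(s2))
print("rope equal:   ", rope_pool(s1) == rope_pool(s2))
```

Averaging sums the same vectors in both sentences, so the results coincide. The decay and rotary variants transform each vector according to its position before pooling, so swapping two words changes the sentence vector.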

Evaluation

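from babelvec.evaluation import cross_lingual_retrieval

# model_en and model_ar are aligned models loaded earlier;
# test_pairs is held-out parallel data (not defined in this snippet)
metrics = cross_lingual_retrieval(
    model_src=model_en,
    model_tgt=model_ar,
    parallel_sentences=test_pairs,
    method='rope'
)
print(f"Recall@1: {metrics['recall@1']:.3f}")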
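
As a reference for reading the metric, here is a minimal sketch of how Recall@k is commonly computed for cross-lingual retrieval: each source sentence vector is compared with every target vector by cosine similarity, and a hit is counted when the true translation ranks in the top k. The function names and toy vectors are hypothetical, and BabelVec's own evaluation may differ in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(src_vecs, tgt_vecs, k=1):
    """Fraction of sources whose true translation (same index) is in the top k."""
    hits = 0
    for i, src in enumerate(src_vecs):
        ranked = sorted(range(len(tgt_vecs)),
                        key=lambda j: cosine(src, tgt_vecs[j]),
                        reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(src_vecs)


# Made-up sentence vectors; index i in src translates to index i in tgt
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.7]]

print(f"Recall@1: {recall_at_k(src, tgt, k=1):.3f}")
print(f"Recall@2: {recall_at_k(src, tgt, k=2):.3f}")
```

In this toy data the second source sentence is closer to the third target than to its own translation, so it counts as a miss at k=1 and a hit at k=2.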

Language Families for Joint Training

BabelVec ships a curated mapping that assigns each of 355 Wikipedia languages to a language family. The mapping is intended for deciding which languages to train jointly.

from babelvec.families import get_family_key, get_family_languages, get_training_groups

# Get family for a language
get_family_key("ary")  # -> "arabic"
get_family_key("fr")   # -> "romance_galloitalic"

# Get all languages in a family
get_family_languages("arabic")  # -> ["ar", "ary", "arz"]

# Create training groups (hybrid strategy)
groups = get_training_groups(
    languages=["en", "ar", "ary", "arz"],
    article_counts={"en": 6000000, "ar": 840000, "ary": 17000, "arz": 40000},
    low_resource_threshold=50000
)
# -> {"separate": ["en", "ar"], "joint": {"arabic": ["ary", "arz"]}}

Joint training substantially improves results for low-resource languages (the author reports +200-600% for Arabic dialects), whereas high-resource languages are better trained separately.
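
The hybrid strategy can be sketched in a few lines: languages at or above the article threshold are trained on their own, and the rest are pooled by family. This is an illustrative reimplementation, not the library's code. `FAMILY_OF` and `group_for_training` are hypothetical names, the family map covers only the three Arabic codes from the example, and treating the threshold as inclusive is an assumption.

```python
from collections import defaultdict

# Hypothetical family map; the real mapping lives in babelvec.families
FAMILY_OF = {"ar": "arabic", "ary": "arabic", "arz": "arabic"}


def group_for_training(languages, article_counts, low_resource_threshold=50000):
    """Split languages into separately trained and jointly trained groups."""
    separate = []
    joint = defaultdict(list)
    for lang in languages:
        if article_counts.get(lang, 0) >= low_resource_threshold:
            separate.append(lang)  # enough data to train alone
        else:
            # Assumption: a language without a known family forms its own group
            joint[FAMILY_OF.get(lang, lang)].append(lang)
    return {"separate": separate, "joint": dict(joint)}


groups = group_for_training(
    languages=["en", "ar", "ary", "arz"],
    article_counts={"en": 6000000, "ar": 840000, "ary": 17000, "arz": 40000},
    low_resource_threshold=50000,
)
print(groups)
```

With these counts, English and Standard Arabic exceed the threshold and are kept separate, while the two dialects fall below it and are pooled under the "arabic" family.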

Examples

See the examples/ directory:

  • 01_basic_usage.py - Getting started

Citation

@software{babelvec2025,
  title = {BabelVec: Position-Aware Cross-Lingual Word Embeddings},
  author = {Kamali, Omar},
  year = {2025},
  url = {https://github.com/omarkamali/babelvec}
}

License

MIT License - see LICENSE for details.

Copyright © 2025 Omar Kamali
