Position-aware, cross-lingually aligned word embeddings built on FastText

BabelVec

Requires Python 3.9+.

Features

  • Cross-Lingual Alignment: Procrustes alignment for multilingual compatibility
  • Position-Aware Embeddings: Optional positional encoding (RoPE, sinusoidal, decay)
  • FastText Foundation: Handles OOV words through subword information
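The OOV handling comes from FastText's subword model: a word is wrapped in the boundary markers `<` and `>` and split into character n-grams, so an unseen word still shares n-grams with words seen in training. A minimal sketch of that decomposition, assuming FastText's default n-gram range of 3 to 6 (the helper name `char_ngrams` is illustrative and not part of BabelVec):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, FastText-style."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes from suffixes
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# An unseen inflection overlaps heavily with a known word's n-grams.
known = set(char_ngrams("walking"))
unseen = set(char_ngrams("walkings"))
shared = known & unseen
print(f"{len(shared)} shared n-grams, e.g. {sorted(shared)[:3]}")
```

The vector for an out-of-vocabulary word is then built from the vectors of its n-grams, which is why a lookup does not fail on unknown words.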

Installation

pip install babelvec

For visualization support:

pip install babelvec[viz]

Quick Start

from babelvec import BabelVec

# Load a model
model = BabelVec.load('path/to/model.bin')

# Get word vector
vec = model.get_word_vector("hello")

# Position-aware sentence embedding
vec1 = model.get_sentence_vector("The dog bites the man", method='rope')
vec2 = model.get_sentence_vector("The man bites the dog", method='rope')
# vec1 != vec2 because word order is encoded

# Simple averaging (no position encoding)
vec = model.get_sentence_vector("Hello world", method='average')
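The difference between the two calls can be seen with toy vectors: a plain average is unchanged when words are reordered, while any position-dependent weighting is not. The sketch below uses made-up 2-d vectors and a simple exponential decay weight as a stand-in; it is an illustration of the idea and does not reproduce BabelVec's internal encoders.

```python
# Toy 2-d word vectors (made up for illustration).
VECS = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}


def average(tokens):
    """Order-invariant mean of the word vectors."""
    dim = len(VECS[tokens[0]])
    return [sum(VECS[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def decay_weighted(tokens, rate=0.5):
    """Weighted mean where the word at position p gets weight rate**p."""
    weights = [rate ** pos for pos in range(len(tokens))]
    total = sum(weights)
    dim = len(VECS[tokens[0]])
    return [
        sum(w * VECS[t][i] for w, t in zip(weights, tokens)) / total
        for i in range(dim)
    ]


a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]
print("average equal:", average(a) == average(b))
print("decay-weighted equal:", decay_weighted(a) == decay_weighted(b))
```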

Training

Monolingual Training

from babelvec.training import train_monolingual

model = train_monolingual(
    lang='en',
    corpus_path='corpus.txt',
    dim=300,
    epochs=5
)
model.save('en_300d.bin')

Multilingual Training with Alignment

from babelvec.training import train_multilingual

# parallel_pairs: parallel en-ar data, assumed to be prepared beforehand
models = train_multilingual(
    languages=['en', 'ar'],
    corpus_paths={'en': 'en.txt', 'ar': 'ar.txt'},
    parallel_data={('en', 'ar'): parallel_pairs},
    alignment='procrustes'
)

Post-hoc Alignment

from babelvec.training import align_models

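# model_en, model_ar: previously trained or loaded monolingual models
# parallel_pairs: parallel en-ar data, assumed to be prepared beforehand
aligned = align_models(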
    models={'en': model_en, 'ar': model_ar},
    parallel_data={('en', 'ar'): parallel_pairs},
    method='procrustes'
)
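For orientation, orthogonal Procrustes alignment finds the orthogonal matrix W that minimizes ||XW - Y||_F, where the rows of X and Y are vectors of paired items in the source and target spaces. The closed-form solution is W = U Vᵀ, with U Σ Vᵀ the SVD of Xᵀ Y. A minimal NumPy sketch of that textbook solution follows; it is not taken from BabelVec's source, and the function name `procrustes_map` and the random data are illustrative assumptions.

```python
import numpy as np


def procrustes_map(X, Y):
    """Orthogonal W minimizing ||X @ W - Y||_F (rows are paired vectors)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                  # source-space vectors
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # a hidden orthogonal map
Y = X @ Q                                     # target-space vectors

W = procrustes_map(X, Y)
print("recovered hidden map:", np.allclose(W, Q))
print("W is orthogonal:", np.allclose(W.T @ W, np.eye(4)))
```

Because W is orthogonal, applying it preserves norms and cosine similarities within the source space, so monolingual neighborhoods are unchanged by the alignment.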

Model Save/Load (v0.1.3+)

Models save projection matrices alongside the FastText binary:

# Save model
model.save('model.bin')
# Creates: model.bin, model.projection.npy (if aligned), model.meta.json

# Load model - projection is automatically restored
model = BabelVec.load('model.bin')
print(model.is_aligned)  # True if projection was loaded
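Because a saved model can consist of several files, it may help to check which ones exist before copying or uploading a model. The sketch below derives the companion file names from the pattern shown in the save comment above (`model.bin` alongside `model.projection.npy` and `model.meta.json`); the helper `companion_files` is hypothetical and not part of BabelVec.

```python
from pathlib import Path


def companion_files(model_path):
    """Map each role to the path expected next to the FastText binary."""
    p = Path(model_path)
    return {
        "binary": p,
        "projection": p.with_name(p.stem + ".projection.npy"),
        "meta": p.with_name(p.stem + ".meta.json"),
    }


for role, path in companion_files("model.bin").items():
    status = "found" if path.exists() else "missing"
    print(f"{role}: {path} ({status})")
```

A missing projection file is expected for a model that was never aligned.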

Encoding Methods

Method       Description
rope         Rotary Position Embedding
decay        Exponential position decay
sinusoidal   Transformer-style positional encoding
average      Simple averaging (no position encoding)
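To make the first and third rows concrete, here is a minimal pure-Python sketch of the standard formulations: RoPE rotates each consecutive pair of dimensions by an angle that grows with position, and the sinusoidal encoding produces a sin/cos vector per position. The base of 10000 and the choice to average the rotated word vectors are assumptions for illustration; BabelVec's own parameters and pooling may differ.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE)."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d - 1, 2):
        theta = pos * base ** (-i / d)  # i equals 2k for pair k
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out


def sinusoidal(pos, d, base=10000.0):
    """Transformer-style positional encoding for a single position."""
    pe = []
    for i in range(d):
        angle = pos / base ** ((i - i % 2) / d)  # each pair shares a frequency
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe


def mean(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


words = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8]]  # toy word vectors
forward = mean([rope_rotate(v, p) for p, v in enumerate(words)])
backward = mean([rope_rotate(v, p) for p, v in enumerate(reversed(words))])
print("order changes the result:", forward != backward)
```

Rotation leaves each word vector's norm unchanged, which is one reason RoPE is a gentle way to inject position into pretrained embeddings.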

Evaluation

from babelvec.evaluation import cross_lingual_retrieval

# test_pairs: held-out parallel sentences, assumed to be prepared beforehand
metrics = cross_lingual_retrieval(
    model_src=model_en,
    model_tgt=model_ar,
    parallel_sentences=test_pairs,
    method='rope'
)
print(f"Recall@1: {metrics['recall@1']:.3f}")
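Recall@1 in a retrieval setting like this is usually the fraction of source sentences whose nearest target vector, by cosine similarity, is their own translation. The sketch below computes it on toy vectors under that standard definition; how `cross_lingual_retrieval` computes its metrics internally is not stated here, so treat the details as an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(src_vecs, tgt_vecs):
    """Fraction of sources whose most similar target sits at the same index."""
    hits = 0
    for i, s in enumerate(src_vecs):
        best = max(range(len(tgt_vecs)), key=lambda j: cosine(s, tgt_vecs[j]))
        hits += best == i
    return hits / len(src_vecs)


# Toy sentence vectors: src[i] and tgt[i] are meant to be translations.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.7]]
print(f"Recall@1: {recall_at_1(src, tgt):.3f}")
```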

Examples

See the examples/ directory:

  • 01_basic_usage.py - Getting started

Citation

@software{babelvec2025,
  title = {BabelVec: Position-Aware Cross-Lingual Word Embeddings},
  author = {Kamali, Omar},
  year = {2025},
  url = {https://github.com/omarkamali/babelvec}
}

License

MIT License - see LICENSE for details.

Copyright © 2025 Omar Kamali
