Semanscope

Multilingual Semantic Embedding Visualization and Analysis Toolkit

License: MIT | Python: 3.9+

Semanscope is a toolkit for visualizing and analyzing semantic embeddings across multiple languages. It provides two metrics for evaluating multilingual embedding models: Semantic Affinity (SA), which measures semantic consistency, and Relational Affinity (RA), which measures how well relational structure is preserved.

Key Features

  • Multi-Model Support: LaBSE, SONAR, Gemma, OpenAI, Voyage AI, Google Gemini, Ollama, and 30+ models
  • Dimensionality Reduction: UMAP, PHATE, t-SNE, PaCMAP, TriMap, PCA
  • Semantic Affinity (SA): Novel metric for measuring semantic consistency across embeddings
  • Relational Affinity (RA): Metric for evaluating relational structure preservation
  • Interactive UI: Streamlit-based interface with 11 specialized pages
  • Batch Benchmarking: CLI tools for research-grade evaluation
  • Multilingual: Support for 70+ languages
  • Visualization: Interactive plots with Plotly and ECharts

Quick Start

Installation

# Clone the repository
git clone https://github.com/semanscope/semanscope.git
cd semanscope

# Create conda environment
conda create -n semanscope python=3.11
conda activate semanscope

# Install package with UI support
pip install -e ".[ui]"

# Or install with all dependencies (including API integrations)
pip install -e ".[all]"

Launch the UI

# Option 1: Using the launcher script
python run_app.py

# Option 2: Using the CLI command (after installation)
semanscope-ui

Basic Usage (Python API)

from semanscope.models.model_manager import get_model
from semanscope.components.embedding_viz import EmbeddingVisualizer

# Load a model
model = get_model("LaBSE")

# Create visualizer
viz = EmbeddingVisualizer(model=model)

# Visualize embeddings
words = ["hello", "world", "friend", "peace"]
viz.plot_words(words, method="UMAP", dimension=2)

Batch Benchmarking

# Semantic Affinity benchmark
semanscope-benchmark-sa \
    --dataset data/input/NeurIPS-01-family-relations-v2.5-SA.csv \
    --models LaBSE SONAR \
    --output results/sa_benchmark.csv

# Relational Affinity benchmark
semanscope-benchmark-ra \
    --dataset data/input/NeurIPS-01-family-relations-v2.5-RA.csv \
    --models LaBSE SONAR \
    --languages english chinese \
    --output results/ra_benchmark.csv

Features in Detail

Semantic Affinity (SA) Metric

Measures how consistently a model represents semantic relationships:

from semanscope.components.semantic_affinity import calculate_semantic_affinity

sa_score = calculate_semantic_affinity(
    model=model,
    word_pairs=[("cat", "dog"), ("happy", "sad")],
    metric="cosine"
)

SA Formula:

SA = 1 - std(similarities) / mean(similarities)

Higher SA (→1.0) = more consistent semantic representations
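
The formula can be checked by hand on a couple of toy vectors. The sketch below is not the toolkit's implementation: it uses only the standard library, assumes the population standard deviation (the README does not say which variant is used), and the helper names and 2-D "embeddings" are invented for illustration.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_affinity(similarities):
    """SA = 1 - std(similarities) / mean(similarities).

    Assumption: std is the population standard deviation.
    """
    mu = mean(similarities)
    if mu == 0:
        raise ValueError("SA is undefined when the mean similarity is zero")
    return 1.0 - pstdev(similarities) / mu


# Toy 2-D vectors standing in for word-pair embeddings (illustrative only).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # same direction, similarity 1.0
    ([1.0, 0.0], [1.0, 1.0]),  # 45 degrees apart, similarity ~0.707
]
sims = [cosine_similarity(u, v) for u, v in pairs]
sa = semantic_affinity(sims)
print(f"similarities={sims}, SA={sa:.4f}")
```

When every pair has the same similarity the standard deviation is zero and SA is exactly 1.0; the more the similarities spread out relative to their mean, the lower the score.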

Relational Affinity (RA) Metric

Evaluates preservation of relational structure across languages:

from semanscope.components import calculate_relational_affinity

ra_score = calculate_relational_affinity(
    model=model,
    word_quadruples=[("king", "queen", "man", "woman")],
    languages=["english", "chinese"],
    metric="cosine"
)

RA Formula (Cosine):

rel_vec(w1, w2) = emb(w2) - emb(w1)
RA = cosine_similarity(rel_vec_lang1, rel_vec_lang2)

Higher RA (→1.0) = better relational structure preservation
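
A small worked example of the cosine formula, using only the standard library. This is a sketch under assumptions, not the toolkit's code: the embeddings are made-up 2-D vectors, the helper names are hypothetical, and only one word pair per language is compared (how the toolkit aggregates full quadruples is not described here).

```python
import math


def relation_vector(emb_w1, emb_w2):
    """rel_vec(w1, w2) = emb(w2) - emb(w1)."""
    return [b - a for a, b in zip(emb_w1, emb_w2)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embeddings keyed by language, then by concept (illustrative only).
# The "chinese" entries stand for the embeddings of the translated words.
embeddings = {
    "english": {"king": [2.0, 1.0], "queen": [2.0, 3.0]},
    "chinese": {"king": [1.0, 0.0], "queen": [1.0, 1.0]},
}

rel_en = relation_vector(embeddings["english"]["king"],
                         embeddings["english"]["queen"])
rel_zh = relation_vector(embeddings["chinese"]["king"],
                         embeddings["chinese"]["queen"])

ra = cosine_similarity(rel_en, rel_zh)
print(f"rel_en={rel_en}, rel_zh={rel_zh}, RA={ra:.4f}")
```

Here both relation vectors point in the same direction even though their lengths differ, so the cosine is 1.0: the metric compares the direction of the relation, not its magnitude.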

Interactive UI Pages

  1. Settings (0_🔧_Settings.py): Configure models, methods, cache
  2. Semanscope (1_🧭_Semanscope.py): Main visualization interface
  3. Semanscope ECharts (2_📊_Semanscope-ECharts.py): ECharts-based visualization
  4. Compare (3_⚖️_Semanscope-Compare.py): Side-by-side model comparison
  5. Multilingual (4_๐ŸŒ_Semanscope-Multilingual.py): Multi-language visualization
  6. Zoom (5_🔍_Semanscope-Zoom.py): Interactive zoom and exploration
  7. Semantic Affinity (6_๐Ÿ“_Semantic_Affinity.py): SA metric calculator
  8. Relational Affinity (6_🔗_Relational_Affinity.py): RA metric calculator
  9. Translator (8_๐ŸŒ_Translator.py): Translation utilities
  10. NSM Prime Words (9_๐Ÿ“_NSM_Prime_Words.py): Natural Semantic Metalanguage
  11. Review Images (9_🖼️_Review_Images.py): Visualization gallery

Supported Models

Open Source:

  • LaBSE (Language-agnostic BERT Sentence Embedding)
  • SONAR (Seamless Communication models)
  • XLM-RoBERTa variants
  • mBERT (Multilingual BERT)
  • And 20+ more...

API-based (requires API keys):

  • OpenAI (text-embedding-ada-002, text-embedding-3-small, etc.)
  • Voyage AI (voyage-multilingual-2, voyage-code-2)
  • Google Gemini (text-embedding-004)
  • Ollama (local models)

See semanscope/config.py for complete model catalog.

Dimensionality Reduction Methods

  • UMAP: Uniform Manifold Approximation and Projection
  • PHATE: Potential of Heat-diffusion for Affinity-based Transition Embedding
  • t-SNE: t-Distributed Stochastic Neighbor Embedding
  • PaCMAP: Pairwise Controlled Manifold Approximation
  • TriMap: Triplet-based dimensionality reduction
  • PCA: Principal Component Analysis

Datasets

Semanscope includes 60+ representative datasets: seven ACL categories (ACL-0 to ACL-6) plus the NeurIPS research benchmarks:

  • ACL-0: Chinese morphology (Zinets, Radicals)
  • ACL-1: Alphabets (15+ languages)
  • ACL-2: PeterG vocabulary (semantic primes)
  • ACL-3: Morphological networks
  • ACL-4: Semantic categories (numbers, emotions, animals)
  • ACL-5: Poetry corpora (Li Bai, Du Fu, Frost, Wordsworth)
  • ACL-6: Visual semantics (emoji, pictographs)
  • NeurIPS-01 to NeurIPS-11: Research benchmarks for SA/RA metrics

See data/input/README.md for complete dataset documentation.

Documentation

Architecture

semanscope/
├── semanscope/          # Core Python package
│   ├── components/      # Analysis components (SA, RA, viz)
│   ├── models/          # Model managers and integrations
│   ├── utils/           # Utilities (caching, text processing)
│   ├── services/        # External API integrations
│   └── cli/             # Command-line tools
├── ui/                  # Streamlit UI
├── data/                # Datasets and visualizations
├── tests/               # Test suite
├── demo/                # Usage examples
├── scripts/             # Utility scripts
└── docs/                # Documentation

Development

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run specific test
pytest tests/test_semantic_affinity.py -v

# Code formatting
black semanscope/ ui/ tests/
ruff check semanscope/ ui/

Configuration

Create a .env file for API keys and settings:

# Copy example configuration
cp .env.example .env

# Then edit .env and set your API keys
OPENROUTER_API_KEY=your_key_here
VOYAGE_API_KEY=your_key_here
GOOGLE_API_KEY=your_key_here
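
For readers unfamiliar with the .env format, the sketch below shows how KEY=VALUE lines like the ones above can be parsed with the standard library alone. It is an illustration only: how Semanscope itself loads the file is not stated here, and `load_env_file` is a hypothetical helper, not part of the package.

```python
import os
import tempfile


def load_env_file(path):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop optional surrounding quotes around the value.
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Demo against a throwaway file so nothing real is read or overwritten.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as fh:
        fh.write("# Edit with your API keys\n")
        fh.write("VOYAGE_API_KEY=your_key_here\n")
        fh.write("\n")
        fh.write('GOOGLE_API_KEY="abc"\n')
    settings = load_env_file(env_path)

print(settings)
```

Keeping the keys in a local .env file rather than in code means they stay out of version control, provided .env is listed in .gitignore.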

Performance Tips

  1. Use GPU: Set CUDA_VISIBLE_DEVICES=0 to run on the first GPU (the variable controls which GPUs are visible to the process)
  2. Enable caching: Embeddings are cached automatically to ~/projects/embedding_cache/
  3. Batch processing: Use CLI tools for large-scale benchmarking
  4. Model selection: Start with smaller models (LaBSE, mBERT) for exploration
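
To make the caching tip concrete, here is a minimal sketch of an on-disk embedding cache keyed by model name and input text. It is an assumption about how such a cache could work, not a description of Semanscope's cache format: `cache_path`, `cached_embedding`, and `fake_embed` are hypothetical names, and JSON files are used only to keep the example dependency-free.

```python
import hashlib
import json
import os
import tempfile


def cache_path(cache_dir, model_name, text):
    """Derive a stable file name from the model name and the input text."""
    key = f"{model_name}\x00{text}".encode("utf-8")
    return os.path.join(cache_dir, hashlib.sha256(key).hexdigest() + ".json")


def cached_embedding(cache_dir, model_name, text, embed_fn):
    """Return the cached vector if present, else compute and store it."""
    path = cache_path(cache_dir, model_name, text)
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    vector = embed_fn(text)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(vector, fh)
    return vector


calls = []


def fake_embed(text):
    """Stand-in for a real model call; records each invocation."""
    calls.append(text)
    return [float(len(text)), 1.0]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_embedding(tmp, "toy-model", "hello", fake_embed)
    second = cached_embedding(tmp, "toy-model", "hello", fake_embed)

print(first, second, calls)
```

The second lookup is served from disk, so the (potentially slow or billable) embedding call runs only once per unique model and text.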

Citation

If you use Semanscope in your research, please cite:

@software{semanscope2026,
  title={Semanscope: Multilingual Semantic Embedding Visualization Toolkit},
  author={Semanscope Contributors},
  year={2026},
  url={https://github.com/semanscope/semanscope}
}

License

This project is licensed under the MIT License; see the LICENSE file for details.

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Acknowledgments

  • Language Models: Thanks to Google (LaBSE), Meta (SONAR), and the open-source community
  • Dimensionality Reduction: UMAP, PHATE, t-SNE, PaCMAP, TriMap libraries
  • Visualization: Plotly, Streamlit, ECharts
  • Datasets: Computational linguistics research community

Roadmap

  • PyPI publication
  • Additional embedding models (Cohere, Anthropic)
  • Enhanced visualization options
  • Expanded benchmark datasets
  • Interactive tutorials and examples
  • Web deployment (Streamlit Cloud)

Built with ❤️ for the multilingual NLP community
