Semanscope

Multilingual Semantic Embedding Visualization and Analysis Toolkit

License: MIT Python 3.9+

Semanscope is a toolkit for visualizing and analyzing semantic embeddings across multiple languages. It provides two metrics for multilingual embedding models: Semantic Affinity (SA), which measures semantic consistency, and Relational Affinity (RA), which measures how well relational structure is preserved.

Key Features

  • Multi-Model Support: LaBSE, SONAR, Gemma, OpenAI, Voyage AI, Google Gemini, Ollama, and 30+ other models
  • Advanced Dimensionality Reduction: UMAP, PHATE, t-SNE, PaCMAP, TriMap
  • Semantic Affinity (SA): Novel metric for measuring semantic consistency across embeddings
  • Relational Affinity (RA): Metric for evaluating relational structure preservation
  • Interactive UI: Streamlit-based interface with 11 specialized pages
  • Batch Benchmarking: CLI tools for research-grade evaluation
  • Multilingual: Support for 70+ languages
  • Visualization: Interactive plots with Plotly and ECharts

Quick Start

Installation

# Clone the repository
git clone https://github.com/semanscope/semanscope.git
cd semanscope

# Create conda environment
conda create -n semanscope python=3.11
conda activate semanscope

# Install package with UI support
pip install -e ".[ui]"

# Or install with all dependencies (including API integrations)
pip install -e ".[all]"

Launch the UI

# Option 1: Using the launcher script
python run_app.py

# Option 2: Using the CLI command (after installation)
semanscope-ui

Basic Usage (Python API)

from semanscope.models.model_manager import get_model
from semanscope.components.embedding_viz import EmbeddingVisualizer

# Load a model
model = get_model("LaBSE")

# Create visualizer
viz = EmbeddingVisualizer(model=model)

# Visualize embeddings
words = ["hello", "world", "friend", "peace"]
viz.plot_words(words, method="UMAP", dimension=2)

Batch Benchmarking

# Semantic Affinity benchmark
semanscope-benchmark-sa \
    --dataset data/input/NeurIPS-01-family-relations-v2.5-SA.csv \
    --models LaBSE SONAR \
    --output results/sa_benchmark.csv

# Relational Affinity benchmark
semanscope-benchmark-ra \
    --dataset data/input/NeurIPS-01-family-relations-v2.5-RA.csv \
    --models LaBSE SONAR \
    --languages english chinese \
    --output results/ra_benchmark.csv

Features in Detail

Semantic Affinity (SA) Metric

Measures how consistently a model represents semantic relationships:

from semanscope.components.semantic_affinity import calculate_semantic_affinity

sa_score = calculate_semantic_affinity(
    model=model,
    word_pairs=[("cat", "dog"), ("happy", "sad")],
    metric="cosine"
)

SA Formula:

SA = 1 - std(similarities) / mean(similarities)

Higher SA (→ 1.0) = more consistent semantic representations
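
To make the formula concrete, here is a minimal standard-library sketch that evaluates it on toy data. It is not Semanscope's implementation: the helper names `cosine_similarity` and `semantic_affinity` are hypothetical, the vectors are made-up numbers rather than model output, and the use of the population standard deviation is an assumption, since the formula above does not say which estimator is used.

```python
from math import sqrt
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_affinity(similarities):
    """SA = 1 - std(similarities) / mean(similarities).

    Assumption: population standard deviation (pstdev).
    """
    return 1.0 - pstdev(similarities) / mean(similarities)


# Toy 2-D "embeddings" for three word pairs (illustrative numbers only).
pairs = [
    ([1.0, 0.0], [1.0, 1.0]),
    ([0.0, 2.0], [1.0, 1.0]),
    ([3.0, 3.0], [1.0, 0.0]),
]
similarities = [cosine_similarity(u, v) for u, v in pairs]

# Every pair has the same cosine similarity, so std is ~0 and SA is ~1.0.
sa = semantic_affinity(similarities)

# More spread in the similarities lowers the score.
sa_spread = semantic_affinity([0.8, 0.6, 0.7])
```

Because the score divides by the mean similarity, it is undefined when that mean is zero and unstable when it is close to zero.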

Relational Affinity (RA) Metric

Evaluates preservation of relational structure across languages:

from semanscope.components import calculate_relational_affinity

ra_score = calculate_relational_affinity(
    model=model,
    word_quadruples=[("king", "queen", "man", "woman")],
    languages=["english", "chinese"],
    metric="cosine"
)

RA Formula (Cosine):

rel_vec(w1, w2) = emb(w2) - emb(w1)
RA = cosine_similarity(rel_vec_lang1, rel_vec_lang2)

Higher RA (→ 1.0) = better relational structure preservation
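
The two formula lines can be traced by hand with a small standard-library sketch. This is an illustration under stated assumptions, not the library's code: `rel_vec`, `cosine_similarity`, and `relational_affinity` are hypothetical helpers, the 2-D vectors are invented, and the sketch compares a single word pair across two languages. How Semanscope aggregates over quadruples or over more than two languages is not described above.

```python
from math import sqrt


def rel_vec(emb_w1, emb_w2):
    """Relation vector: emb(w2) - emb(w1)."""
    return [b - a for a, b in zip(emb_w1, emb_w2)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def relational_affinity(pair_lang1, pair_lang2):
    """Cosine similarity between the same relation's vector in two languages."""
    return cosine_similarity(rel_vec(*pair_lang1), rel_vec(*pair_lang2))


# Toy embeddings for one pair (w1, w2) in two languages (illustrative only).
lang1_pair = ([2.0, 1.0], [2.0, 3.0])   # offset [0, 2]
lang2_pair = ([5.0, 0.0], [5.0, 4.0])   # offset [0, 4]: same direction
lang2_other = ([1.0, 1.0], [4.0, 1.0])  # offset [3, 0]: orthogonal direction

ra_aligned = relational_affinity(lang1_pair, lang2_pair)
ra_orthogonal = relational_affinity(lang1_pair, lang2_other)
```

Here `ra_aligned` is 1.0 because the two offsets point the same way even though their lengths differ, while `ra_orthogonal` is 0.0. Cosine similarity ignores the magnitude of the relation vectors and compares direction only.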

Interactive UI Pages

  1. Settings (0_🔧_Settings.py): Configure models, methods, cache
  2. Semanscope (1_🧭_Semanscope.py): Main visualization interface
  3. Semanscope ECharts (2_📊_Semanscope-ECharts.py): ECharts-based visualization
  4. Compare (3_⚖️_Semanscope-Compare.py): Side-by-side model comparison
  5. Multilingual (4_๐ŸŒ_Semanscope-Multilingual.py): Multi-language visualization
  6. Zoom (5_🔍_Semanscope-Zoom.py): Interactive zoom and exploration
  7. Semantic Affinity (6_๐Ÿ“_Semantic_Affinity.py): SA metric calculator
  8. Relational Affinity (6_🔗_Relational_Affinity.py): RA metric calculator
  9. Translator (8_๐ŸŒ_Translator.py): Translation utilities
  10. NSM Prime Words (9_๐Ÿ“_NSM_Prime_Words.py): Natural Semantic Metalanguage
  11. Review Images (9_🖼️_Review_Images.py): Visualization gallery

Supported Models

Open Source:

  • LaBSE (Language-agnostic BERT Sentence Embedding)
  • SONAR (Seamless Communication models)
  • XLM-RoBERTa variants
  • mBERT (Multilingual BERT)
  • And 20+ more...

API-based (requires API keys):

  • OpenAI (text-embedding-ada-002, text-embedding-3-small, etc.)
  • Voyage AI (voyage-multilingual-2, voyage-code-2)
  • Google Gemini (text-embedding-004)
  • Ollama (local models)

See semanscope/config.py for complete model catalog.

Dimensionality Reduction Methods

  • UMAP: Uniform Manifold Approximation and Projection
  • PHATE: Potential of Heat-diffusion for Affinity-based Transition Embedding
  • t-SNE: t-Distributed Stochastic Neighbor Embedding
  • PaCMAP: Pairwise Controlled Manifold Approximation
  • TriMap: Triplet-based dimensionality reduction
  • PCA: Principal Component Analysis

Datasets

Semanscope includes 60+ representative datasets, organized into seven ACL categories plus the NeurIPS research benchmarks:

  • ACL-0: Chinese morphology (Zinets, Radicals)
  • ACL-1: Alphabets (15+ languages)
  • ACL-2: PeterG vocabulary (semantic primes)
  • ACL-3: Morphological networks
  • ACL-4: Semantic categories (numbers, emotions, animals)
  • ACL-5: Poetry corpora (Li Bai, Du Fu, Frost, Wordsworth)
  • ACL-6: Visual semantics (emoji, pictographs)
  • NeurIPS-01 to NeurIPS-11: Research benchmarks for SA/RA metrics

See data/input/README.md for complete dataset documentation.

Documentation

Architecture

semanscope/
├── semanscope/          # Core Python package
│   ├── components/      # Analysis components (SA, RA, viz)
│   ├── models/          # Model managers and integrations
│   ├── utils/           # Utilities (caching, text processing)
│   ├── services/        # External API integrations
│   └── cli/             # Command-line tools
├── ui/                  # Streamlit UI
├── data/                # Datasets and visualizations
├── tests/               # Test suite
├── demo/                # Usage examples
├── scripts/             # Utility scripts
└── docs/                # Documentation

Development

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run specific test
pytest tests/test_semantic_affinity.py -v

# Code formatting
black semanscope/ ui/ tests/
ruff check semanscope/ ui/

Configuration

Create a .env file for API keys and settings:

# Copy the example configuration
cp .env.example .env

# Then edit .env and replace the placeholders with your API keys
OPENROUTER_API_KEY=your_key_here
VOYAGE_API_KEY=your_key_here
GOOGLE_API_KEY=your_key_here
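
Before launching the UI or a benchmark, it can help to confirm that no key was left as a placeholder. The sketch below is a standalone check using only the standard library. `parse_env_text` and `missing_api_keys` are hypothetical helpers, not part of Semanscope, and the parser handles only simple KEY=VALUE lines, which is an assumption about how the .env file is written.

```python
REQUIRED_KEYS = ("OPENROUTER_API_KEY", "VOYAGE_API_KEY", "GOOGLE_API_KEY")
PLACEHOLDER = "your_key_here"


def parse_env_text(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_api_keys(values, required=REQUIRED_KEYS):
    """Return required names that are absent, empty, or still the placeholder."""
    return [name for name in required if values.get(name, "") in ("", PLACEHOLDER)]


# Sample file contents; in practice, read them with Path(".env").read_text().
sample = """
# Edit with your API keys
OPENROUTER_API_KEY=your_key_here
VOYAGE_API_KEY=abc123
"""
missing = missing_api_keys(parse_env_text(sample))
print(missing)  # ['OPENROUTER_API_KEY', 'GOOGLE_API_KEY']
```

Only the keys for the API-based models you plan to use need real values, so treat the reported names as a reminder rather than an error.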

Performance Tips

  1. Use GPU: Set CUDA_VISIBLE_DEVICES=0 to run on the first CUDA device for GPU acceleration
  2. Enable caching: Embeddings are cached automatically to ~/projects/embedding_cache/
  3. Batch processing: Use CLI tools for large-scale benchmarking
  4. Model selection: Start with smaller models (LaBSE, mBERT) for exploration
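
As an illustration of why caching pays off, the sketch below stores each embedding on disk under a key derived from the model name and the input text, so a repeated request skips the model call. It is a generic pattern, not Semanscope's cache: the `EmbeddingCache` class, the JSON file layout, and the `fake_embed` function are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class EmbeddingCache:
    """Minimal on-disk cache: one JSON file per (model, text) pair."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, model_name, text):
        # Hash model name and text together so different models never collide.
        raw = f"{model_name}\x00{text}".encode("utf-8")
        return self.cache_dir / f"{hashlib.sha256(raw).hexdigest()}.json"

    def get_or_compute(self, model_name, text, embed_fn):
        path = self._path(model_name, text)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        vector = list(embed_fn(text))
        path.write_text(json.dumps(vector), encoding="utf-8")
        return vector


calls = []


def fake_embed(text):
    """Stand-in for a real model call; records how often it is invoked."""
    calls.append(text)
    return [float(len(text)), 1.0]


with tempfile.TemporaryDirectory() as tmp:
    cache = EmbeddingCache(tmp)
    first = cache.get_or_compute("toy-model", "hello", fake_embed)
    second = cache.get_or_compute("toy-model", "hello", fake_embed)

# The second lookup is served from disk, so the model ran only once.
print(first == second, len(calls))  # True 1
```

Including the model name in the key matters when several models are compared on the same word list, as in the benchmark commands above.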

Citation

If you use Semanscope in your research, please cite:

@software{semanscope2026,
  title={Semanscope: Multilingual Semantic Embedding Visualization Toolkit},
  author={Semanscope Contributors},
  year={2026},
  url={https://github.com/semanscope/semanscope}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Acknowledgments

  • Language Models: Thanks to Google (LaBSE), Meta (SONAR), and the open-source community
  • Dimensionality Reduction: UMAP, PHATE, t-SNE, PaCMAP, TriMap libraries
  • Visualization: Plotly, Streamlit, ECharts
  • Datasets: Computational linguistics research community

Roadmap

  • PyPI publication
  • Additional embedding models (Cohere, Anthropic)
  • Enhanced visualization options
  • Expanded benchmark datasets
  • Interactive tutorials and examples
  • Web deployment (Streamlit Cloud)

Built with ❤️ for the multilingual NLP community
