Semanscope
Multilingual Semantic Embedding Visualization and Analysis Toolkit
Semanscope is a comprehensive toolkit for visualizing and analyzing semantic embeddings across multiple languages. It features advanced metrics for measuring semantic consistency (Semantic Affinity) and relational structure preservation (Relational Affinity) in multilingual embedding models.
Key Features
- Multi-Model Support: LaBSE, SONAR, Gemma, OpenAI, Voyage AI, Google Gemini, Ollama, and 30+ models
- Advanced Dimensionality Reduction: UMAP, PHATE, t-SNE, PaCMAP, TriMap
- Semantic Affinity (SA): Novel metric for measuring semantic consistency across embeddings
- Relational Affinity (RA): Metric for evaluating relational structure preservation
- Interactive UI: Streamlit-based interface with 11 specialized pages
- Batch Benchmarking: CLI tools for research-grade evaluation
- Multilingual: Support for 70+ languages
- Visualization: Interactive plots with Plotly and ECharts
Quick Start
Installation
# Clone the repository
git clone https://github.com/semanscope/semanscope.git
cd semanscope
# Create conda environment
conda create -n semanscope python=3.11
conda activate semanscope
# Install package with UI support
pip install -e ".[ui]"
# Or install with all dependencies (including API integrations)
pip install -e ".[all]"
Launch the UI
# Option 1: Using the launcher script
python run_app.py
# Option 2: Using the CLI command (after installation)
semanscope-ui
Basic Usage (Python API)
from semanscope.models.model_manager import get_model
from semanscope.components.embedding_viz import EmbeddingVisualizer
# Load a model
model = get_model("LaBSE")
# Create visualizer
viz = EmbeddingVisualizer(model=model)
# Visualize embeddings
words = ["hello", "world", "friend", "peace"]
viz.plot_words(words, method="UMAP", dimension=2)
Batch Benchmarking
# Semantic Affinity benchmark
semanscope-benchmark-sa \
    --dataset data/input/NeurIPS-01-family-relations-v2.5-SA.csv \
    --models LaBSE SONAR \
    --output results/sa_benchmark.csv
# Relational Affinity benchmark
semanscope-benchmark-ra \
    --dataset data/input/NeurIPS-01-family-relations-v2.5-RA.csv \
    --models LaBSE SONAR \
    --languages english chinese \
    --output results/ra_benchmark.csv
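To script several benchmark runs from Python, the same commands can be assembled as argument lists and handed to `subprocess.run`. The sketch below uses only the flags shown above; the helper name `build_benchmark_command` is hypothetical and not part of Semanscope, and the command is executed only if the CLI is found on `PATH`.

```python
import shutil
import subprocess


def build_benchmark_command(tool, dataset, models, output, languages=None):
    """Assemble an argument list using only the flags shown in this README."""
    cmd = [tool, "--dataset", dataset, "--models", *models]
    if languages:
        cmd += ["--languages", *languages]
    cmd += ["--output", output]
    return cmd


sa_cmd = build_benchmark_command(
    "semanscope-benchmark-sa",
    "data/input/NeurIPS-01-family-relations-v2.5-SA.csv",
    ["LaBSE", "SONAR"],
    "results/sa_benchmark.csv",
)
print(" ".join(sa_cmd))

# Run only when the CLI is actually installed; otherwise just show the command.
if shutil.which(sa_cmd[0]):
    subprocess.run(sa_cmd, check=True)
```

Passing a list instead of a shell string avoids quoting problems when dataset paths contain spaces.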
Features in Detail
Semantic Affinity (SA) Metric
Measures how consistently a model represents semantic relationships:
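from semanscope.models.model_manager import get_model
from semanscope.components.semantic_affinity import calculate_semantic_affinity
# Load a model first (same call as in Basic Usage)
model = get_model("LaBSE")
sa_score = calculate_semantic_affinity(
    model=model,
    word_pairs=[("cat", "dog"), ("happy", "sad")],
    metric="cosine"
)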
SA Formula:
SA = 1 - std(similarities) / mean(similarities)
Higher SA (→ 1.0) = more consistent semantic representations
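The formula can be checked by hand on toy vectors. The sketch below is not Semanscope's implementation: it assumes `std` is the population standard deviation and that the similarities are plain cosine similarities. The helper names `cosine` and `semantic_affinity` and the 2-D vectors are made up for illustration.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_affinity(similarities):
    """SA = 1 - std(similarities) / mean(similarities).

    Assumption: population standard deviation; the toolkit may differ.
    """
    return 1.0 - pstdev(similarities) / mean(similarities)


# Toy 2-D "embeddings" for three word pairs (invented numbers).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical vectors
    ([1.0, 0.0], [1.0, 1.0]),  # 45 degrees apart
    ([0.0, 2.0], [0.0, 5.0]),  # same direction, different length
]
sims = [cosine(u, v) for u, v in pairs]
sa = semantic_affinity(sims)
print(f"similarities = {[round(s, 3) for s in sims]}, SA = {sa:.3f}")
```

When every pair has the same similarity the standard deviation is zero and SA equals 1.0; the more the similarities spread out relative to their mean, the lower the score.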
Relational Affinity (RA) Metric
Evaluates preservation of relational structure across languages:
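from semanscope.models.model_manager import get_model
from semanscope.components import calculate_relational_affinity
# Load a model first (same call as in Basic Usage)
model = get_model("LaBSE")
ra_score = calculate_relational_affinity(
    model=model,
    word_quadruples=[("king", "queen", "man", "woman")],
    languages=["english", "chinese"],
    metric="cosine"
)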
RA Formula (Cosine):
rel_vec(w1, w2) = emb(w2) - emb(w1)
RA = cosine_similarity(rel_vec_lang1, rel_vec_lang2)
Higher RA (→ 1.0) = better relational structure preservation
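As a worked instance of the cosine formula, the sketch below builds the relation vector for one word pair in two languages and compares the two. It is an illustration under assumptions, not the toolkit's code: the 3-D embeddings are invented, and how Semanscope aggregates over several pairs or quadruples is not shown here.

```python
import math


def rel_vec(emb_w1, emb_w2):
    """rel_vec(w1, w2) = emb(w2) - emb(w1)."""
    return [b - a for a, b in zip(emb_w1, emb_w2)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented embeddings for the same pair (w1 = "king", w2 = "queen")
# as it might be embedded in two different languages.
emb_lang1 = {"w1": [0.9, 0.1, 0.3], "w2": [0.9, 0.8, 0.3]}
emb_lang2 = {"w1": [0.2, 0.1, 0.7], "w2": [0.3, 0.9, 0.7]}

rel_vec_lang1 = rel_vec(emb_lang1["w1"], emb_lang1["w2"])  # roughly [0.0, 0.7, 0.0]
rel_vec_lang2 = rel_vec(emb_lang2["w1"], emb_lang2["w2"])  # roughly [0.1, 0.8, 0.0]
ra = cosine_similarity(rel_vec_lang1, rel_vec_lang2)
print(f"RA = {ra:.3f}")
```

Because only the direction of the difference vectors matters, two languages can place the words in very different regions of the space and still score close to 1.0.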
Interactive UI Pages
- Settings (0_🔧_Settings.py): Configure models, methods, cache
- Semanscope (1_🧭_Semanscope.py): Main visualization interface
- Semanscope ECharts (2_๐_Semanscope-ECharts.py): ECharts-based visualization
- Compare (3_โ๏ธ_Semanscope-Compare.py): Side-by-side model comparison
- Multilingual (4_๐_Semanscope-Multilingual.py): Multi-language visualization
- Zoom (5_๐_Semanscope-Zoom.py): Interactive zoom and exploration
- Semantic Affinity (6_๐_Semantic_Affinity.py): SA metric calculator
- Relational Affinity (6_๐_Relational_Affinity.py): RA metric calculator
- Translator (8_๐_Translator.py): Translation utilities
- NSM Prime Words (9_๐_NSM_Prime_Words.py): Natural Semantic Metalanguage
- Review Images (9_🖼️_Review_Images.py): Visualization gallery
Supported Models
Open Source:
- LaBSE (Language-agnostic BERT Sentence Embedding)
- SONAR (Seamless Communication models)
- XLM-RoBERTa variants
- mBERT (Multilingual BERT)
- And 20+ more...
API-based (requires API keys):
- OpenAI (text-embedding-ada-002, text-embedding-3-small, etc.)
- Voyage AI (voyage-multilingual-2, voyage-code-2)
- Google Gemini (text-embedding-004)
- Ollama (local models)
See semanscope/config.py for complete model catalog.
Dimensionality Reduction Methods
- UMAP: Uniform Manifold Approximation and Projection
- PHATE: Potential of Heat-diffusion for Affinity-based Transition Embedding
- t-SNE: t-Distributed Stochastic Neighbor Embedding
- PaCMAP: Pairwise Controlled Manifold Approximation
- TriMap: Triplet-based dimensionality reduction
- PCA: Principal Component Analysis
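Of the methods listed, PCA is the simplest to write out in full. The sketch below finds the first principal direction by power iteration on the covariance matrix, using only the standard library. It is a teaching example, not what Semanscope runs internally: the function names are hypothetical, and it assumes the data has a non-zero variance along at least one direction.

```python
import math


def pca_first_component(points, iterations=100):
    """Return (mean, unit direction) of the first principal component."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - mean[j] for j in range(dim)] for p in points]
    # Covariance matrix (dim x dim) of the centered data.
    cov = [
        [sum(row[i] * row[j] for row in centered) / n for j in range(dim)]
        for i in range(dim)
    ]
    direction = [1.0] * dim
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalise to unit length.
        nxt = [sum(cov[i][j] * direction[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        direction = [x / norm for x in nxt]
    return mean, direction


def project(points, mean, direction):
    """1-D coordinate of each point along the principal direction."""
    dim = len(mean)
    return [sum((p[j] - mean[j]) * direction[j] for j in range(dim)) for p in points]


# Four 3-D points lying on a line through the origin along (1, 2, 2).
points = [[t * 1.0, t * 2.0, t * 2.0] for t in (-1, 0, 1, 2)]
mean, direction = pca_first_component(points)
coords = project(points, mean, direction)
print([round(d, 3) for d in direction], [round(c, 3) for c in coords])
```

The nonlinear methods in the list (UMAP, PHATE, t-SNE, PaCMAP, TriMap) optimise neighbourhood-based objectives instead and are best used through their own libraries.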
Datasets
Semanscope includes 60+ representative datasets across 7 categories:
- ACL-0: Chinese morphology (Zinets, Radicals)
- ACL-1: Alphabets (15+ languages)
- ACL-2: PeterG vocabulary (semantic primes)
- ACL-3: Morphological networks
- ACL-4: Semantic categories (numbers, emotions, animals)
- ACL-5: Poetry corpora (Li Bai, Du Fu, Frost, Wordsworth)
- ACL-6: Visual semantics (emoji, pictographs)
- NeurIPS-01 to NeurIPS-11: Research benchmarks for SA/RA metrics
See data/input/README.md for complete dataset documentation.
Documentation
- Usage Guide: Detailed usage instructions
- API Reference: Python API documentation
- Troubleshooting: Common issues and solutions
- GPU Setup: CUDA configuration for acceleration
Architecture
semanscope/
├── semanscope/          # Core Python package
│   ├── components/      # Analysis components (SA, RA, viz)
│   ├── models/          # Model managers and integrations
│   ├── utils/           # Utilities (caching, text processing)
│   ├── services/        # External API integrations
│   └── cli/             # Command-line tools
├── ui/                  # Streamlit UI
├── data/                # Datasets and visualizations
├── tests/               # Test suite
├── demo/                # Usage examples
├── scripts/             # Utility scripts
└── docs/                # Documentation
Development
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run specific test
pytest tests/test_semantic_affinity.py -v
# Code formatting
black semanscope/ ui/ tests/
ruff check semanscope/ ui/
Configuration
Create a .env file for API keys and settings:
# Copy example configuration
cp .env.example .env
# Edit with your API keys
OPENROUTER_API_KEY=your_key_here
VOYAGE_API_KEY=your_key_here
GOOGLE_API_KEY=your_key_here
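Before launching a run that depends on API-based models, it can help to check which keys are actually filled in. The sketch below parses `KEY=value` lines with the standard library only. How Semanscope itself loads `.env` is not stated here, so treat this as an independent pre-flight check; `parse_env_lines` is a hypothetical helper, and the key names are the three shown above.

```python
def parse_env_lines(lines):
    """Parse KEY=value lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


API_KEYS = ["OPENROUTER_API_KEY", "VOYAGE_API_KEY", "GOOGLE_API_KEY"]

# Inline example; for a real file use open(".env", encoding="utf-8") instead.
example = """\
# Edit with your API keys
OPENROUTER_API_KEY=your_key_here
VOYAGE_API_KEY=
"""
settings = parse_env_lines(example.splitlines())
missing = [key for key in API_KEYS if not settings.get(key)]
print("missing keys:", missing)
```

A key counts as missing when it is absent or left empty; the check does not validate the key against the provider.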
Performance Tips
- Use GPU: Set CUDA_VISIBLE_DEVICES=0 for GPU acceleration
- Enable caching: Embeddings are cached automatically to ~/projects/embedding_cache/
- Batch processing: Use CLI tools for large-scale benchmarking
- Model selection: Start with smaller models (LaBSE, mBERT) for exploration
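To illustrate why caching pays off, the sketch below stores each embedding on disk under a hash of the model name and input text, so a repeated request never reaches the model. This is a generic pattern, not Semanscope's actual cache layout or file format: `cached_embedding` and `fake_embed` are hypothetical, and a temporary directory stands in for the real cache location.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_embedding(cache_dir, model_name, text, embed_fn):
    """Return the embedding for `text`, computing it only on a cache miss."""
    key = hashlib.sha256(f"{model_name}\n{text}".encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    vector = embed_fn(text)
    path.write_text(json.dumps(vector), encoding="utf-8")
    return vector


calls = []


def fake_embed(text):
    """Stand-in for a real model call; records how often it is invoked."""
    calls.append(text)
    return [float(len(text)), float(text.count("e"))]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_embedding(tmp, "LaBSE", "peace", fake_embed)
    second = cached_embedding(tmp, "LaBSE", "peace", fake_embed)

print(first == second, len(calls))
```

Including the model name in the key keeps embeddings from different models apart even when the input text is identical.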
Citation
If you use Semanscope in your research, please cite:
@software{semanscope2026,
  title={Semanscope: Multilingual Semantic Embedding Visualization Toolkit},
  author={Semanscope Contributors},
  year={2026},
  url={https://github.com/semanscope/semanscope}
}
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Acknowledgments
- Language Models: Thanks to Google (LaBSE), Meta (SONAR), and the open-source community
- Dimensionality Reduction: UMAP, PHATE, t-SNE, PaCMAP, TriMap libraries
- Visualization: Plotly, Streamlit, ECharts
- Datasets: Computational linguistics research community
Support
- Documentation: GitHub Wiki
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Roadmap
- PyPI publication
- Additional embedding models (Cohere, Anthropic)
- Enhanced visualization options
- Expanded benchmark datasets
- Interactive tutorials and examples
- Web deployment (Streamlit Cloud)
Built with ❤️ for the multilingual NLP community