A command line utility to create plots of word embeddings
Embedding Plot Visualization Tool
Description
Word embeddings transform words into high-dimensional vectors. The vectors attempt to capture the semantic meaning and relationships of the words, so that similar or related words have similar vectors. For example, "Cat", "Kitten", "Feline", "Tiger" and "Lion" would have embedding vectors that are similar to varying degrees, but would all be very dissimilar to the vector of a word like "Toolbox".
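One common way to quantify how similar two embedding vectors are is cosine similarity. The sketch below is only an illustration: the 3-dimensional vectors are invented values (real embeddings have hundreds of dimensions), and the text above does not say which similarity measure it has in mind.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, invented for illustration only.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.8, 0.9, 0.2],
    "toolbox": [0.1, 0.0, 0.9],
}

cat_kitten = cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"])
cat_toolbox = cosine_similarity(toy_vectors["cat"], toy_vectors["toolbox"])
print(f"cat vs kitten:  {cat_kitten:.3f}")
print(f"cat vs toolbox: {cat_toolbox:.3f}")
```

A value close to 1 means the vectors point in nearly the same direction; a value close to 0 means they are nearly unrelated.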
Pretrained Word2Vec embedding models typically have 300 dimensions that capture the semantic meaning of each word. It is not possible to visualize 300 dimensions directly, but dimensionality reduction techniques can project the vectors into a 2- or 3-dimensional latent space that preserves much of the relationship structure and is easy to visualize.
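To make the idea of projecting onto fewer dimensions concrete, here is a minimal, dependency-free sketch of PCA using power iteration with deflation. It is not the tool's implementation (which is not shown here); the function names and the tiny 3-dimensional input are assumptions made for illustration, and in practice a numerical library would normally be used instead.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean from every row."""
    n, dim = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    return [[row[j] - means[j] for j in range(dim)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of rows that are already centered."""
    n, dim = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]


def mat_vec(matrix, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def top_eigenpair(matrix, iterations=200, seed=0):
    """Approximate the dominant eigenvector and eigenvalue by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in matrix]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            break  # no variance left in any direction
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
    return v, eigenvalue


def pca_project(rows, n_components=2):
    """Project rows onto their first n_components principal components."""
    centered = center(rows)
    cov = covariance(centered)
    components = []
    for _ in range(n_components):
        v, eigenvalue = top_eigenpair(cov)
        components.append(v)
        # Deflation: remove the variance explained by the component just found.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    return [[sum(a * b for a, b in zip(row, c)) for c in components]
            for row in centered]


# Hypothetical 3-dimensional "embeddings"; real ones would have ~300 values.
vectors = [[0.0, 0.0, 5.0], [2.0, 1.0, 5.0], [4.0, 0.0, 5.0], [6.0, 1.0, 5.0]]
projected = pca_project(vectors)
for point in projected:
    print([round(x, 3) for x in point])
```

Each input row becomes a 2-dimensional point that can be drawn on a scatter plot. PCA is a linear projection; t-SNE is a nonlinear alternative and is not sketched here.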
The embeddings-plot command line utility visualizes word embeddings in a scatter plot, using dimensionality reduction (PCA or t-SNE) and clustering.
Features
- Supports Word2vec pretrained embedding models
- Dimensionality reduction using PCA or t-SNE
- Specify a number of clusters to identify in the plot
- Interactive HTML output
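The clustering feature groups nearby points in the plot; the parameter list names KMeans as the method. The following is a minimal sketch of the basic k-means (Lloyd's) algorithm on 2-dimensional points, written in plain Python. The function names and sample points are assumptions for illustration, and this is not the tool's own code.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50, seed=0):
    """Return (labels, centroids) from a basic Lloyd's-algorithm k-means."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Points that receive the same label would be drawn in the same color. The number of clusters has to be chosen up front, which is why the tool exposes it as an option.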
Installation
Prerequisites
- Python 3.9 or higher.
Install via pip
pip install embeddings_plot
Embedding model
To use this tool, you must either train your own embedding model or use an existing pretrained one. The tool expects models to be in word2vec format. Two ready-to-use pretrained models are:
- https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
- https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
Download one of these models and unzip it, train your own model, or look for other pretrained word2vec models available online.
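For orientation, the word2vec text format used by .vec files starts with a header line holding the vocabulary size and the number of dimensions, followed by one line per word with its vector components. The sketch below parses that layout in plain Python; the function name and the `wanted` parameter are assumptions for illustration, and the tool itself may load models differently.

```python
import io


def load_word2vec_text(lines, wanted=None):
    """Parse word2vec text format from an iterable of lines.

    The first line holds "<vocabulary size> <dimensions>"; every following
    line holds a word and its vector components separated by spaces.
    If `wanted` is given, only words in that set are kept.
    """
    lines = iter(lines)
    header = next(lines).split()
    dimensions = int(header[1])
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dimensions + 1:
            continue  # skip blank or malformed lines
        word = parts[0]
        if wanted is None or word in wanted:
            vectors[word] = [float(x) for x in parts[1:]]
    return vectors


# A tiny in-memory sample with made-up values instead of a real model file.
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ntoolbox -0.4 0.5 0.6\n")
vectors = load_word2vec_text(sample)
print(vectors["cat"])
```

With a real model, the same function could be given an open file handle (for example one opened with UTF-8 encoding) and a `wanted` set, so that only the words of interest are kept in memory.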
Usage
After installation, you can use the tool from the command line.
Basic Command
embeddings-plot -m <model_path> -i <input_file> -o <output_file> --labels
Parameters
- -m, --model: Path to the word embeddings model file
- -i, --input: Input text file with words to visualize
- -o, --output: Output HTML file for the visualization
- -l, --labels: (Optional) Show labels on the plot
- -c, --clusters: (Optional) Number of clusters for KMeans. Default is 5.
- -r, --reduction: (Optional) Method for dimensionality reduction (PCA or t-SNE). Default is t-SNE.
- -t, --title: (Optional) Sets the title of the output HTML page
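As a reading aid, the options above can be mirrored with Python's argparse. This is a hypothetical reconstruction, not the tool's actual source: which flags are required and how the reduction value must be spelled are assumptions inferred from the descriptions.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options (assumed layout)."""
    parser = argparse.ArgumentParser(prog="embeddings-plot")
    parser.add_argument("-m", "--model", required=True,
                        help="Path to the word embeddings model file")
    parser.add_argument("-i", "--input", required=True,
                        help="Input text file with words to visualize")
    parser.add_argument("-o", "--output", required=True,
                        help="Output HTML file for the visualization")
    parser.add_argument("-l", "--labels", action="store_true",
                        help="Show labels on the plot")
    parser.add_argument("-c", "--clusters", type=int, default=5,
                        help="Number of clusters for KMeans")
    # The accepted spellings of this value are an assumption.
    parser.add_argument("-r", "--reduction", default="t-SNE",
                        help="Dimensionality reduction method (PCA or t-SNE)")
    parser.add_argument("-t", "--title", default=None,
                        help="Title of the output HTML page")
    return parser


args = build_parser().parse_args([
    "--model", "crawl-300d-2M.vec",
    "--input", "words.txt",
    "--output", "embedding-plot.html",
    "--labels",
    "--clusters", "13",
])
print(args.clusters, args.reduction, args.labels)
```

Options that are not passed fall back to their documented defaults, so here the reduction method stays at t-SNE while the cluster count is overridden.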
Example
embeddings-plot --model crawl-300d-2M.vec --input words.txt --output embedding-plot.html --labels --clusters 13