Read and use various word embedding formats
🐍 snakefusion
Introduction
snakefusion is a Python package for reading, writing, and using finalfusion,
fastText, floret, GloVe, and word2vec embeddings. This package is a thin
wrapper around the Rust finalfusion crate.
snakefusion supports the same types of embeddings as finalfusion:
- Vocabulary:
  - No subwords
  - Subwords
- Embedding matrix:
  - Array
  - Memory-mapped
  - Quantized
- Format:
  - fastText
  - finalfusion
  - floret
  - GloVe
  - word2vec
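To give a feel for what a quantized embedding matrix stores, here is a minimal sketch of product-quantization decoding, assuming the matrix keeps one small integer code per subquantizer and a codebook of centroids per subquantizer. The names `reconstruct`, `CODEBOOKS`, and `QUANTIZED_ROWS` and all values are hypothetical toy data; the actual quantizers and storage layout are implemented in the finalfusion crate and may differ in detail.

```python
def reconstruct(codes, codebooks):
    """Rebuild an approximate vector from one centroid id per subquantizer."""
    vector = []
    for subquantizer, code in enumerate(codes):
        # Each subquantizer contributes the centroid for its slice of dimensions.
        vector.extend(codebooks[subquantizer][code])
    return vector


# Two subquantizers, each covering 2 of the 4 dimensions, with 2 centroids each.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.5, -0.5], [-1.0, 2.0]],
]

# Each row of the toy quantized matrix stores only the centroid ids.
QUANTIZED_ROWS = [(1, 0), (0, 1)]

for row in QUANTIZED_ROWS:
    print(reconstruct(row, CODEBOOKS))
```

The trade-off is that each row shrinks to a few integers, while lookups return an approximation of the original embedding rather than the exact vector.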
Building from source
Building snakefusion from source requires a Rust toolchain installed
through rustup, as well as setuptools-rust:
$ pip install --upgrade setuptools-rust
You can then build and install snakefusion in your environment:
$ pip install .
Usage
Embeddings can be loaded as follows:
import snakefusion
# Loading embeddings in finalfusion format
embeds = snakefusion.Embeddings("myembeddings.fifu")
# Or if you want to memory-map the embedding matrix:
embeds = snakefusion.Embeddings("myembeddings.fifu", mmap=True)
# fastText format
embeds = snakefusion.Embeddings.read_fasttext("myembeddings.bin")
# floret format
embeds = snakefusion.Embeddings.read_floret_text("myembeddings.floret")
# word2vec format
embeds = snakefusion.Embeddings.read_word2vec("myembeddings.w2v")
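The `mmap=True` option avoids reading the whole embedding matrix into memory. As a rough illustration of the idea only, the sketch below stores a toy float32 matrix in a flat file and reads a single row through Python's standard `mmap` module. The file layout, `DIMS`, `write_matrix`, `read_row`, and `demo` are hypothetical and unrelated to the actual finalfusion file format, which snakefusion handles on the Rust side.

```python
import mmap
import os
import struct
import tempfile

DIMS = 4  # hypothetical embedding dimensionality


def write_matrix(path, rows):
    """Write rows of little-endian float32 values back to back (toy layout)."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(struct.pack(f"<{DIMS}f", *row))


def read_row(path, index):
    """Read one row through a memory map instead of loading the whole file."""
    row_bytes = DIMS * 4  # 4 bytes per float32
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return struct.unpack_from(f"<{DIMS}f", mm, index * row_bytes)


def demo():
    rows = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        write_matrix(path, rows)
        return read_row(path, 1)


print(demo())
```

Only the pages that hold the requested row need to be brought in by the operating system, which is why memory mapping suits large matrices from which few rows are queried.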
You can then compute an embedding or perform similarity and analogy queries:
e = embeds.embedding("Tübingen")
# default similarity query for "Tübingen"
embeds.word_similarity("Tübingen")
# similarity query based on a vector, returning the closest embedding to
# the input vector, skipping "Tübingen"
embeds.embeddings_similarity(e, skip={"Tübingen"})
# default analogy query
embeds.analogy("Berlin", "Deutschland", "Amsterdam")
# analogy query allowing "Deutschland" as answer
embeds.analogy("Berlin", "Deutschland", "Amsterdam", mask=(True, False, True))
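To make the query semantics concrete, here is a small pure-Python sketch of a cosine-similarity lookup and an analogy query of the form "a is to b as c is to ?", computed with the vector offset `b - a + c`. It assumes, in line with the comment on the `mask` example, that a `True` entry in the mask excludes the corresponding query word from the candidates. The toy vectors and the helper names `cosine`, `nearest`, and `analogy` are made up for illustration; snakefusion runs its queries in Rust over the real embedding matrix.

```python
import math

# Hypothetical 3-dimensional toy embeddings.
TOY = {
    "Berlin": [1.0, 0.0, 1.0],
    "Deutschland": [0.0, 1.0, 1.0],
    "Amsterdam": [1.0, 0.0, -1.0],
    "Niederlande": [0.0, 1.0, -1.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(vectors, query, skip=frozenset()):
    """Return the word most similar to `query`, ignoring words in `skip`."""
    candidates = [word for word in vectors if word not in skip]
    return max(candidates, key=lambda word: cosine(vectors[word], query))


def analogy(vectors, a, b, c, mask=(True, True, True)):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # A True mask entry removes that query word from the candidates.
    skip = {word for word, masked in zip((a, b, c), mask) if masked}
    return nearest(vectors, query, skip)


print(nearest(TOY, TOY["Berlin"], skip={"Berlin"}))
print(analogy(TOY, "Berlin", "Deutschland", "Amsterdam"))
print(analogy(TOY, "Berlin", "Deutschland", "Amsterdam", mask=(True, False, True)))
```

Unlike this sketch, which returns only the single best candidate, a real query typically returns a ranked list of results.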
If you want to operate directly on the full embedding matrix, you can get a copy of this matrix through:
# get copy of embedding matrix, changes to this won't touch the original matrix
embeds.matrix_copy()
Finally, access to the vocabulary is provided through:
v = embeds.vocab()
# get a list of indices associated with "Tübingen"
v.item_to_indices("Tübingen")
# get a list of `(ngram, index)` tuples for "Tübingen"
v.ngram_indices("Tübingen")
# get a list of subword indices for "Tübingen"
v.subword_indices("Tübingen")
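The relationship between these three lookups can be sketched with a toy subword vocabulary. The sketch assumes that known words occupy the first indices, that character n-grams of the bracketed word are hashed into a fixed number of buckets placed after the word indices, and that an in-vocabulary word resolves to its own index while an unknown word falls back to its subword indices. `ToySubwordVocab`, the trigram length, the bucket count, and the use of CRC32 as a stand-in hash are all assumptions for illustration and do not reproduce the hashing used by finalfusion or fastText.

```python
import zlib


class ToySubwordVocab:
    """Toy vocabulary: known words first, then hashed n-gram buckets."""

    def __init__(self, words, buckets=16, n=3):
        self.word_index = {word: i for i, word in enumerate(words)}
        self.buckets = buckets
        self.n = n

    def ngram_indices(self, word):
        """Return (ngram, index) tuples for the bracketed word."""
        bracketed = f"<{word}>"
        pairs = []
        for start in range(len(bracketed) - self.n + 1):
            ngram = bracketed[start:start + self.n]
            # Stand-in hash (assumption): CRC32 reduced to a bucket number.
            bucket = zlib.crc32(ngram.encode("utf-8")) % self.buckets
            # Bucket indices are placed after the known-word indices.
            pairs.append((ngram, len(self.word_index) + bucket))
        return pairs

    def subword_indices(self, word):
        """Return only the indices of the word's n-grams."""
        return [index for _, index in self.ngram_indices(word)]

    def item_to_indices(self, word):
        """Known words map to their own index, others to subword indices."""
        if word in self.word_index:
            return [self.word_index[word]]
        return self.subword_indices(word)


vocab = ToySubwordVocab(["Berlin", "Tübingen"])
print(vocab.item_to_indices("Tübingen"))
print(vocab.ngram_indices("Tübingen"))
print(vocab.item_to_indices("Unbekannt"))
```

Because different n-grams can hash to the same bucket, several n-grams may share one index, which keeps the subword part of the matrix at a fixed size.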
More usage examples can be found in the examples directory.
Where to go from here
Download files
Source Distribution
File details
Details for the file snakefusion-0.1.0.tar.gz.
File metadata
- Download URL: snakefusion-0.1.0.tar.gz
- Upload date:
- Size: 19.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.6.0 importlib_metadata/4.8.2 pkginfo/1.8.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | b361504fbf932885456dbdb18117bf2761231cce1af7fc8ee95a0c1a6a4a0134
MD5 | 50dc8b00e20477aaae9a3c3d9ee052c3
BLAKE2b-256 | 5a5901c17283875ff5eeb86f2f2d8cf2e1219d047dbf5dcc1159ff95b2d2660f