Read and use various word embedding formats

🐍 snakefusion

Introduction

snakefusion is a Python package for reading, writing, and using finalfusion, fastText, floret, GloVe, and word2vec embeddings. This package is a thin wrapper around the Rust finalfusion crate.

snakefusion supports the same types of embeddings as finalfusion:

  • Vocabulary:
    • No subwords
    • Subwords
  • Embedding matrix:
    • Array
    • Memory-mapped
    • Quantized
  • Format:
    • fastText
    • finalfusion
    • floret
    • GloVe
    • word2vec

Building from source

Building snakefusion from source requires a Rust toolchain installed through rustup, plus the setuptools-rust package:

$ pip install --upgrade setuptools-rust

You can then build and install snakefusion in your environment:

$ pip install .

Usage

Embeddings can be loaded as follows:

import snakefusion

# Loading embeddings in finalfusion format
embeds = snakefusion.Embeddings("myembeddings.fifu")

# Or if you want to memory-map the embedding matrix:
embeds = snakefusion.Embeddings("myembeddings.fifu", mmap=True)

# fastText format
embeds = snakefusion.Embeddings.read_fasttext("myembeddings.bin")

# floret format
embeds = snakefusion.Embeddings.read_floret_text("myembeddings.floret")

# word2vec format
embeds = snakefusion.Embeddings.read_word2vec("myembeddings.w2v")
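
When the format is only known at run time, the readers above can be selected from the file extension. This is a minimal sketch: it assumes the extensions used in the examples above (.fifu, .bin, .floret, .w2v) identify the format, which is a naming convention rather than something snakefusion enforces. `loader_name_for` and `load_embeddings` are hypothetical helpers, not part of the package.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the reader shown above.
# None means "use the snakefusion.Embeddings constructor" (finalfusion format).
_READERS = {
    ".fifu": None,
    ".bin": "read_fasttext",
    ".floret": "read_floret_text",
    ".w2v": "read_word2vec",
}


def loader_name_for(path):
    """Return the reader method name for `path`, or None for finalfusion files."""
    suffix = Path(path).suffix.lower()
    if suffix not in _READERS:
        raise ValueError(f"unrecognized embedding file extension: {suffix!r}")
    return _READERS[suffix]


def load_embeddings(path):
    """Load embeddings with the reader chosen by `loader_name_for`."""
    import snakefusion  # imported lazily so the mapping works without the package

    name = loader_name_for(path)
    if name is None:
        return snakefusion.Embeddings(str(path))
    return getattr(snakefusion.Embeddings, name)(str(path))
```

Calling `load_embeddings("myembeddings.w2v")` would then be equivalent to the word2vec example above. Files with other extensions raise a `ValueError` instead of being guessed at.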

You can then compute an embedding or perform similarity and analogy queries:

e = embeds.embedding("Tübingen")

# default similarity query for "Tübingen"
embeds.word_similarity("Tübingen")

# similarity query based on a vector, returning the closest embedding to
# the input vector, skipping "Tübingen"
embeds.embeddings_similarity(e, skip={"Tübingen"})

# default analogy query
embeds.analogy("Berlin", "Deutschland", "Amsterdam")

# analogy query allowing "Deutschland" as answer
embeds.analogy("Berlin", "Deutschland", "Amsterdam", mask=(True, False, True))
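
To make the role of `skip` and `mask` concrete, here is a conceptual sketch of these queries in plain Python. It is not snakefusion's implementation. It assumes cosine similarity as the ranking measure and the usual analogy offset (second word minus first word plus third word). The four-dimensional toy vectors and the helper names are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(vectors, query, skip=frozenset()):
    """Return the word whose vector is most similar to `query`, ignoring `skip`."""
    candidates = [word for word in vectors if word not in skip]
    return max(candidates, key=lambda word: cosine(vectors[word], query))


def analogy(vectors, a, b, c, mask=(True, True, True)):
    """Answer "a is to b as c is to ?" with the word nearest to b - a + c.

    A True entry in `mask` excludes the corresponding query word from the answers.
    """
    query = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    skip = {word for word, masked in zip((a, b, c), mask) if masked}
    return nearest(vectors, query, skip)


# Toy vectors (assumed): dimensions are city, country, German, Dutch.
toy = {
    "Berlin": [1.0, 0.0, 1.0, 0.0],
    "Deutschland": [0.0, 1.0, 1.0, 0.0],
    "Amsterdam": [1.0, 0.0, 0.0, 1.0],
    "Niederlande": [0.0, 1.0, 0.0, 1.0],
}

default_answer = analogy(toy, "Berlin", "Deutschland", "Amsterdam")
unmasked_answer = analogy(
    toy, "Berlin", "Deutschland", "Amsterdam", mask=(True, False, True)
)
```

With `mask=(True, False, True)` the second query word becomes an eligible answer. It is returned only if it is also the nearest neighbour of the offset vector, which is not the case in this toy data.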

If you want to operate directly on the full embedding matrix, you can get a copy of this matrix through:

# get a copy of the embedding matrix; changes to the copy do not affect the original
embeds.matrix_copy()

Finally, access to the vocabulary is provided through:

v = embeds.vocab()
# get a list of indices associated with "Tübingen"
v.item_to_indices("Tübingen")

# get a list of `(ngram, index)` tuples for "Tübingen"
v.ngram_indices("Tübingen")

# get a list of subword indices for "Tübingen"
v.subword_indices("Tübingen")
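
For intuition about what the n-grams of a word look like, the sketch below extracts character n-grams in the fastText style. It assumes the word is wrapped in `<` and `>` boundary markers and that n-gram lengths run from 3 to 6. Both are conventions assumed here and not read from any particular embedding file. `bracketed_ngrams` is a hypothetical helper, and the order of its output need not match what `ngram_indices` returns.

```python
def bracketed_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of "<word>" with lengths min_n..max_n."""
    bracketed = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the bracketed word.
        for start in range(len(bracketed) - n + 1):
            ngrams.append(bracketed[start:start + n])
    return ngrams


tuebingen_ngrams = bracketed_ngrams("Tübingen")
```

In a subword vocabulary, each such n-gram is mapped to a row of the embedding matrix, which is how words outside the vocabulary can still receive an embedding.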

More usage examples can be found in the examples directory.
