floret Python bindings

floret: fastText + Bloom embeddings for compact, full-coverage vectors with spaCy

floret is an extended version of fastText that can produce word representations for any word from a compact vector table. It combines:

  • fastText's subwords to provide embeddings for any word
  • Bloom embeddings ("hashing trick") for a compact vector table
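As a rough illustration of how the two ideas fit together, the sketch below splits a word into character n-grams, hashes each n-gram to several rows of a small table, and averages those rows. It is not floret's implementation: the hash function (seeded BLAKE2 from the standard library), the averaging step, the random table, and the helper names `char_ngrams`, `hash_rows` and `lookup` are all assumptions made for this example.

```python
import hashlib
import random


def char_ngrams(word, minn=3, maxn=6, bow="<", eow=">"):
    """Return the bracketed word followed by its character n-grams."""
    token = f"{bow}{word}{eow}"
    grams = [token]
    for n in range(minn, maxn + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


def hash_rows(key, bucket, hash_count):
    """Map one key to hash_count row indices using differently salted hashes."""
    rows = []
    for seed in range(hash_count):
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=bytes([seed])
        ).digest()
        rows.append(int.from_bytes(digest, "little") % bucket)
    return rows


def lookup(word, table, bucket, hash_count=2, minn=3, maxn=6):
    """Average every table row selected by the word and its n-grams."""
    dim = len(table[0])
    total = [0.0] * dim
    count = 0
    for gram in char_ngrams(word, minn, maxn):
        for row in hash_rows(gram, bucket, hash_count):
            for j in range(dim):
                total[j] += table[row][j]
            count += 1
    return [value / count for value in total]


# Hypothetical toy table: 1000 rows of 8 random values (a trained table in practice).
rng = random.Random(0)
bucket, dim = 1000, 8
table = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(bucket)]

vec = lookup("broccoli", table, bucket)
```

Because every string maps to some set of rows, any word gets a vector, including words never seen in training. Storing each entry in more than one row is what lets the table stay small: two entries that collide in one row are unlikely to collide in all of them.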

Installation

pip install floret

Usage

Train floret vectors using the following options:

  • hashOnly: if True, train floret vectors, storing both words and subwords in the same compact hash table
  • hashCount: store each entry in 1-4 rows of the hash table (recommended: 2)
  • bucket: the number of rows in the hash table; in combination with hashCount>1, it can be greatly reduced (recommended: 25000-100000, down from the fastText default of 2000000)
  • minn: minimum length of character n-grams (default: 3)
  • maxn: maximum length of character n-grams (default: 6)

import floret

# train vectors
model = floret.train_unsupervised(
    "data.txt",
    model="cbow",
    hashOnly=True,
    hashCount=2,
    bucket=50000,
    minn=3,
    maxn=6,
)

# query vector
model.get_word_vector("broccoli")

# save full model
model.save_model("vectors.bin")

# export standard word-only vector table
model.save_vectors("vectors.vec")

# export floret vector table
model.save_hash_only_vectors("vectors.floret")

Note: with the default setting hashOnly=False, floret trains original fastText vectors.
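To get a feel for what a smaller bucket value saves, the following back-of-the-envelope sketch estimates the size of the hash table alone. The vector dimension of 300 and the 4 bytes per value (32-bit floats) are assumptions for the example, and `table_megabytes` is a hypothetical helper; the estimate ignores the separate word table that original fastText vectors also store.

```python
def table_megabytes(rows, dim, bytes_per_value=4):
    """Approximate size of a dense rows x dim table in megabytes (10**6 bytes)."""
    return rows * dim * bytes_per_value / 1e6


dim = 300  # assumed vector dimension, not a floret default

fasttext_default = table_megabytes(2_000_000, dim)  # bucket=2000000
floret_example = table_megabytes(50_000, dim)       # bucket=50000

print(f"bucket=2000000: {fasttext_default:.0f} MB")
print(f"bucket=50000:   {floret_example:.0f} MB")
```

Under these assumptions the table shrinks from about 2400 MB to about 60 MB, a factor of 40.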

Use floret vectors in spaCy

Import floret vectors into spaCy v3.2+:

spacy init vectors --floret-vectors vectors.floret spacy_vectors_model

Notes

floret contains all features of the original fasttext module. See the fasttext docs for more information.

The binary formats that fasttext and floret write with model.save_model("model.bin") are not compatible with each other.
