floret Python bindings
Project description
floret: fastText + Bloom embeddings for compact, full-coverage vectors with spaCy
floret is an extended version of fastText that can produce word representations for any word from a compact vector table. It combines:
- fastText's subwords to provide embeddings for any word
- Bloom embeddings ("hashing trick") for a compact vector table
Installation
pip install floret
Usage
Train floret vectors using the options:
- hashOnly: if True, train floret vectors, storing both words and subwords in the same compact hash table
- hashCount: store each entry in 1-4 rows in the hash table (recommended: 2)
- bucket: in combination with hashCount>1, the size of the hash table can be greatly reduced (recommended: 25000-100000, reduced from the fastText default of 2000000)
- minn: min length of char ngram (default: 3)
- maxn: max length of char ngram (default: 6)
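For a sense of scale behind the bucket recommendation, the following back-of-the-envelope sketch estimates the size of a vector table with a given number of rows. The dimension of 300 and the 4-byte float32 storage are illustrative assumptions, not values stated here, and the helper name table_megabytes is hypothetical.

```python
# Rough size of a float32 vector table: rows * dim * 4 bytes.
# dim=300 is an illustrative assumption, not a floret recommendation.
BYTES_PER_FLOAT32 = 4


def table_megabytes(rows: int, dim: int) -> float:
    """Approximate size of a rows x dim float32 table in MB (10**6 bytes)."""
    return rows * dim * BYTES_PER_FLOAT32 / 1_000_000


dim = 300
for rows in (2_000_000, 100_000, 50_000, 25_000):
    print(f"{rows:>9} rows x {dim} dims = {table_megabytes(rows, dim):8.1f} MB")
```

The estimate covers only the hashed rows. With hashOnly=True that is the whole table, since words and subwords share it.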
import floret
# train vectors
model = floret.train_unsupervised(
"data.txt",
model="cbow",
hashOnly=True,
hashCount=2,
bucket=50000,
minn=3,
maxn=6,
)
# query vector
model.get_word_vector("broccoli")
# save full model
model.save_model("vectors.bin")
# export standard word-only vector table
model.save_vectors("vectors.vec")
# export floret vector table
model.save_hash_only_vectors("vectors.floret")
Note: with the default setting hashOnly=False, floret trains original fastText vectors.
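To make the roles of minn, maxn, hashCount and bucket more concrete, here is a minimal pure-Python sketch of the general idea: split a word into character n-grams, then map each entry to several rows of one small table. It is not floret's implementation. The CRC32-based hash is a stand-in for floret's own hash function, hashing the whole word with its boundary markers is an assumption, and the function names char_ngrams, hash_rows and rows_for_word are hypothetical.

```python
import zlib

BOW, EOW = "<", ">"  # word boundary markers, as used by fastText


def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of the boundary-marked word, lengths minn..maxn."""
    marked = f"{BOW}{word}{EOW}"
    return [
        marked[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(marked) - n + 1)
    ]


def hash_rows(entry, bucket=50000, hash_count=2):
    """Map one entry to hash_count row indices in a table with bucket rows.

    CRC32 with a seed prefix is a stand-in hash chosen for this sketch only.
    """
    return [
        zlib.crc32(f"{seed}:{entry}".encode("utf-8")) % bucket
        for seed in range(hash_count)
    ]


def rows_for_word(word, minn=3, maxn=6, bucket=50000, hash_count=2):
    """Collect the table rows for the whole word and each of its n-grams."""
    entries = [f"{BOW}{word}{EOW}"] + char_ngrams(word, minn, maxn)
    return {entry: hash_rows(entry, bucket, hash_count) for entry in entries}


rows = rows_for_word("broccoli")
for entry, indices in rows.items():
    print(entry, indices)
```

Because every entry is reduced to row indices, any word, including one never seen in training, maps to rows in the table. Using more than one row per entry makes it less likely that two entries collide on all of their rows, which is what allows a much smaller bucket.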
Use floret vectors in spaCy
Import floret vectors into spaCy v3.2+:
spacy init vectors --floret-vectors vectors.floret spacy_vectors_model
Notes
- floret contains all features of the original fasttext module. See the fasttext docs for more information.
- The fasttext and floret binary formats saved with model.save_model("model.bin") are not compatible.