A package for working with files containing pre-trained word embeddings (aka word vectors).
Project description
A package for working with files containing word embeddings (aka word vectors). Written for:
providing a common interface for different file formats;
providing a flexible function for building “embedding matrices” that you can use for initializing the Embedding layer of your deep learning model;
using as little RAM as possible: no need to load 3M vectors, as with gensim.load_word2vec_format, when you only need 20K;
satisfying my (inexplicable) urge to write a Python package.
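To make the RAM-saving goal concrete, here is a minimal sketch of selective loading from a textual embedding file. It is not embfile's implementation: the function name `load_selected` is hypothetical, and the sketch assumes one "word v1 v2 ..." entry per line with no header line.

```python
import io


def load_selected(lines, wanted):
    """Return ({word: vector}, missing_words), keeping only the wanted words."""
    remaining = set(wanted)
    found = {}
    for line in lines:
        if not remaining:  # stop early once every wanted word has been found
            break
        word, *values = line.rstrip().split(" ")
        if word in remaining:
            found[word] = [float(v) for v in values]
            remaining.discard(word)
    return found, remaining


# Tiny in-memory stand-in for a textual embedding file (hypothetical data)
sample = io.StringIO("ciao 0.1 0.2\nhello 0.3 0.4\nworld 0.5 0.6\n")
vectors, missing = load_selected(sample, ["hello", "someMissingWord"])
print(vectors)  # {'hello': [0.3, 0.4]}
print(missing)  # {'someMissingWord'}
```

Only the requested vectors are ever held in memory; every other line is read and discarded.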
Features
Supports the textual format and Google’s binary format, plus a custom format (.vvm) that provides constant-time access to word vectors (by word).
Makes it easy to implement, test and integrate new file formats.
Supports virtually any text encoding and vector data type (though you should probably use only UTF-8 as encoding).
Well-documented and type-annotated (meaning great IDE support).
Extensively tested.
Progress bars (by default) for every time-consuming operation.
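The idea behind constant-time access by word can be illustrated with a toy layout: vectors are stored as fixed-size binary records and a word-to-offset index lets a reader seek straight to the right record. This is only a sketch under assumptions; it does not reproduce the actual .vvm layout, and the names `write_indexed` and `read_vector` are hypothetical.

```python
import io
import struct


def write_indexed(buffer, word2vec):
    """Write vectors as packed little-endian float32 and return {word: byte offset}."""
    index = {}
    for word, vector in word2vec.items():
        index[word] = buffer.tell()
        buffer.write(struct.pack(f"<{len(vector)}f", *vector))
    return index


def read_vector(buffer, index, word, vector_size):
    """Seek straight to the word's vector instead of scanning the file."""
    buffer.seek(index[word])
    return list(struct.unpack(f"<{vector_size}f", buffer.read(4 * vector_size)))


buf = io.BytesIO()  # in-memory stand-in for a binary file
index = write_indexed(buf, {"ciao": [1.0, 2.0], "hello": [3.0, 4.0]})
print(read_vector(buf, index, "hello", 2))  # [3.0, 4.0]
```

The lookup cost is one dictionary access plus one seek, regardless of where the word sits in the file.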
Installation
pip install embfile
Quick start
import embfile

words = ['ciao', 'hello', 'someMissingWord']  # example word list

with embfile.open("path/to/file.bin") as f:  # infer file format from file extension
    print(f.vocab_size, f.vector_size)

    # Load some word vectors into a dictionary (raises KeyError if any word is missing)
    word2vec = f.load(['ciao', 'hello'])

    # Like f.load() but allows missing words (and returns them in a set)
    word2vec, missing_words = f.find(['ciao', 'hello', 'someMissingWord'])

    # Build a matrix for initializing an Embedding layer either from
    # a list of words or from a dictionary {word: index}. Handles the
    # initialization of any missing word vectors (see "oov_initializer")
    matrix, word2index, missing_words = embfile.build_matrix(f, words)
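For readers unfamiliar with embedding matrices, the following sketch shows conceptually what building one involves: each word gets a row index, known words receive their vector, and missing words receive a randomly initialized one. It is a simplified illustration, not embfile's implementation; `build_matrix_sketch`, its parameters, the zero-based indexing and the Gaussian initialization are all assumptions made for the example.

```python
import random


def build_matrix_sketch(word2vec, words, vector_size, seed=0):
    """Return (matrix, word2index, missing) for a list of words."""
    rng = random.Random(seed)
    unique_words = list(dict.fromkeys(words))  # drop duplicates, keep order
    word2index = {word: i for i, word in enumerate(unique_words)}
    matrix = [None] * len(unique_words)
    missing = set()
    for word, i in word2index.items():
        if word in word2vec:
            matrix[i] = list(word2vec[word])
        else:
            # Out-of-vocabulary word: fall back to a random vector
            missing.add(word)
            matrix[i] = [rng.gauss(0.0, 1.0) for _ in range(vector_size)]
    return matrix, word2index, missing


sketch_matrix, sketch_index, sketch_missing = build_matrix_sketch(
    {"ciao": [1.0, 2.0]}, ["ciao", "someMissingWord"], vector_size=2
)
print(sketch_index)    # {'ciao': 0, 'someMissingWord': 1}
print(sketch_missing)  # {'someMissingWord'}
```

The resulting list of rows can be converted to an array and used as the initial weights of an Embedding layer, with `word2index` mapping tokens to row numbers.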
Examples
The examples show how to use embfile to initialize the Embedding layer of a deep learning model. They are only illustrative; don’t skip the documentation.
Keras using TextVectorization (tensorflow >= 2.1)
Documentation
Read the full documentation at https://embfile.readthedocs.io/.
Changelog
v0.1.1 (2021-02-15)
No changes in the code.
Add support for Python 3.9.
Migrate from TravisCI+AppVeyor to GitHub Actions.
Add examples for Keras.
Minor doc changes.
v0.1.0 (2020-01-24)
First release on PyPI.
Download files
File details
Details for the file embfile-0.1.1.tar.gz.
File metadata
- Download URL: embfile-0.1.1.tar.gz
- Upload date:
- Size: 102.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.56.2 CPython/3.9.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | b45af2c46423e907ca5225c386671016d6c28475a0fa3efca89f3432a20b7c42
MD5 | a700faea19e30437e9703af18404103a
BLAKE2b-256 | 44b1a917a45ca53d9d3c41eb1fdefe15a1f7a2a9036b3abd3298032e72bff129
File details
Details for the file embfile-0.1.1-py2.py3-none-any.whl.
File metadata
- Download URL: embfile-0.1.1-py2.py3-none-any.whl
- Upload date:
- Size: 37.8 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.1 setuptools/53.0.0 requests-toolbelt/0.9.1 tqdm/4.56.2 CPython/3.9.0
File hashes
Algorithm | Hash digest
---|---
SHA256 | ed17fbbc3feece48a5c2535327405593a3c67f5dace677b4239257b9e62caa8a
MD5 | 04cd6cdb0ca84f9ce2600ac6f7794194
BLAKE2b-256 | fd16d89af5c4ccc10e7ef9cfc7981bb6c329ef6a99523083a6868000ec1e2a57