A package for working with files containing pre-trained word embeddings (aka word vectors).

Project description

A package for working with files containing word embeddings (aka word vectors). Written for:

  1. providing a common interface for different file formats;

  2. providing a flexible function for building “embedding matrices” that you can use for initializing the Embedding layer of your deep learning model;

  3. using as little RAM as possible: no need to load 3M vectors, as with gensim.load_word2vec_format, when you only need 20K;

  4. satisfying my (inexplicable) urge to write a Python package.
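
To illustrate the memory-saving idea in point 3, here is a minimal sketch that streams a word2vec-style text file and keeps only the requested words. It is not embfile's implementation: the function name `load_selected_vectors` and the in-memory sample file are assumptions made for this example, and it assumes the textual format starts with a "vocab_size vector_size" header line.

```python
import io

def load_selected_vectors(lines, wanted):
    """Stream a word2vec-style text file, keeping only the wanted words.

    Hypothetical helper for illustration; not part of embfile.
    """
    wanted = set(wanted)
    found = {}
    # Assumed header line: "<vocab_size> <vector_size>"
    vocab_size, vector_size = map(int, next(lines).split())
    for line in lines:
        word, *values = line.rstrip("\n").split(" ")
        if word in wanted:
            found[word] = [float(v) for v in values]
            if len(found) == len(wanted):
                break  # every requested word was found: stop reading
    return found, wanted - set(found)

# A tiny in-memory "file" standing in for a real embedding file
sample = io.StringIO(
    "4 3\n"
    "ciao 0.1 0.2 0.3\n"
    "hello 0.4 0.5 0.6\n"
    "mondo 0.7 0.8 0.9\n"
    "world 1.0 1.1 1.2\n"
)
vectors, missing = load_selected_vectors(sample, ["hello", "someMissingWord"])
```

Only the vectors of the requested words are ever held in memory; every other line is parsed and discarded.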

Features

  • Supports the textual format and Google’s binary format, plus a custom, convenient format (.vvm) that supports constant-time access to word vectors (by word).

  • Makes it easy to implement, test and integrate new file formats.

  • Supports virtually any text encoding and vector data type (though you should probably use only UTF-8 as encoding).

  • Well-documented and type-annotated (meaning great IDE support).

  • Extensively tested.

  • Progress bars (by default) for every time-consuming operation.
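
As a rough illustration of what "constant-time access by word" can mean, the sketch below stores fixed-size vectors contiguously and keeps a word-to-row index, so a lookup is one dictionary access plus one offset computation. This is not the actual .vvm layout; the names `pack_vectors` and `lookup` and the float32 little-endian row format are assumptions chosen for the example.

```python
import struct

VECTOR_SIZE = 3
ROW_FORMAT = f"<{VECTOR_SIZE}f"            # assumed: little-endian float32 rows
ROW_BYTES = struct.calcsize(ROW_FORMAT)    # 12 bytes per vector here

def pack_vectors(word2vec):
    """Return a word -> row index and the vectors packed back to back."""
    word2row = {}
    rows = bytearray()
    for row, (word, vector) in enumerate(word2vec.items()):
        word2row[word] = row
        rows += struct.pack(ROW_FORMAT, *vector)
    return word2row, bytes(rows)

def lookup(word, word2row, rows):
    """Fetch one vector without scanning the others."""
    offset = word2row[word] * ROW_BYTES
    return struct.unpack_from(ROW_FORMAT, rows, offset)

word2row, rows = pack_vectors({
    "ciao": (1.0, 2.0, 3.0),
    "hello": (0.5, 0.25, 0.125),
})
hello_vector = lookup("hello", word2row, rows)
```

With a real file, the same offset arithmetic would be applied with a seek and a read instead of slicing an in-memory buffer.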

Installation

pip install embfile

Quick start

import embfile

with embfile.open("path/to/file.bin") as f:     # infer file format from file extension

    print(f.vocab_size, f.vector_size)

    # Load some word vectors in a dictionary (raise KeyError if any word is missing)
    word2vec = f.load(['ciao', 'hello'])

    # Like f.load() but allows missing words (and returns them in a set)
    word2vec, missing_words = f.find(['ciao', 'hello', 'someMissingWord'])

    # Build a matrix for initializing an Embedding layer either from
    # a list of words or from a dictionary {word: index}. Handles the
    # initialization of any missing word vectors (see "oov_initializer")
    words = ['ciao', 'hello', 'someMissingWord']
    matrix, word2index, missing_words = embfile.build_matrix(f, words)
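
To make the idea of an "embedding matrix" concrete, here is a conceptual sketch in plain Python of what such a build step does: assign each word a row index, copy the known vectors, and initialize the rows of missing words. It is not embfile's code. The function `build_matrix_sketch` is hypothetical, and drawing missing vectors from a standard normal distribution is an arbitrary choice for this example, not a statement about the library's default "oov_initializer".

```python
import random

def build_matrix_sketch(word2vec, words, vector_size, seed=0):
    """Build a list-of-lists matrix with one row per distinct word.

    Hypothetical helper for illustration; not part of embfile.
    """
    rng = random.Random(seed)
    # Assign row indexes in order of first appearance, skipping duplicates
    word2index = {}
    for word in words:
        word2index.setdefault(word, len(word2index))

    matrix = [None] * len(word2index)
    missing = set()
    for word, index in word2index.items():
        if word in word2vec:
            matrix[index] = list(word2vec[word])
        else:
            # Out-of-vocabulary word: random initialization (assumed here)
            missing.add(word)
            matrix[index] = [rng.gauss(0.0, 1.0) for _ in range(vector_size)]
    return matrix, word2index, missing

known = {"ciao": [1.0, 2.0], "hello": [3.0, 4.0]}
sketch_matrix, sketch_index, sketch_missing = build_matrix_sketch(
    known, ["ciao", "hello", "someMissingWord"], vector_size=2
)
```

Row `sketch_index[word]` of the matrix holds the vector for `word`, which is the layout an Embedding layer typically expects for its initial weights.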

Examples

The examples show how to use embfile to initialize the Embedding layer of a deep learning model. They are only illustrative, so don’t skip the documentation.

Documentation

Read the full documentation at https://embfile.readthedocs.io/.

Changelog

v0.1.1 (2021-02-15)

  • No changes in the code.

  • Add support for Python 3.9.

  • Migrate from TravisCI+AppVeyor to GitHub Actions.

  • Add examples for Keras.

  • Minor doc changes.

v0.1.0 (2020-01-24)

  • First release on PyPI.

