
Fast querying of word embeddings using the LMDB "Lightning" Database.

LMDB Embeddings

Query word vectors (embeddings) with very little lookup overhead and far less memory than gensim or equivalent solutions. This is made possible by the Lightning Memory-Mapped Database (LMDB).

Inspired by Delft. As explained in their readme, this approach keeps the pre-trained embeddings immediately "warm" (no load time), frees memory, and allows any number of embeddings to be used simultaneously with a negligible impact on runtime when using an SSD.

For instance, in a traditional approach glove-840B takes around 2 minutes to load and 4 GB of memory. Managed with LMDB, glove-840B can be accessed immediately and takes only a couple of MB of memory, with a negligible impact on runtime (around 1% slower).
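
The "no load time" behaviour comes from memory mapping: the operating system pages in only the bytes that are actually read. The sketch below illustrates that idea with Python's standard `mmap` module on a flat file of fixed-width float32 rows. It is not LMDB and not how this library stores vectors; the file layout and the names `write_rows` and `read_row` are assumptions made purely for illustration.

```python
import mmap
import os
import struct
import tempfile

DIM = 4                              # illustrative vector size
ROW = struct.Struct(f"<{DIM}f")      # one little-endian float32 row


def write_rows(path, rows):
    """Write each vector as one fixed-width binary row."""
    with open(path, "wb") as handle:
        for row in rows:
            handle.write(ROW.pack(*row))


def read_row(path, index):
    """Map the file and decode only the requested row."""
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return ROW.unpack_from(mapped, index * ROW.size)


demo_path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
write_rows(demo_path, [(0.5, 1.0, 1.5, 2.0), (2.5, 3.0, 3.5, 4.0)])
second_row = read_row(demo_path, 1)
```

Opening the mapping costs roughly the same whatever the file size, which is why a large embedding set can be queried without a long initial load.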

Reading vectors

from lmdb_embeddings.reader import LmdbEmbeddingsReader
from lmdb_embeddings.exceptions import MissingWordError

embeddings = LmdbEmbeddingsReader('/path/to/word/vectors/eg/GoogleNews-vectors-negative300')

try:
  vector = embeddings.get_word_vector('google')
except MissingWordError:
  # 'google' is not in the database.
  pass
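
When looking up many words, the try/except pattern above can be wrapped in a small helper. The sketch below is one possible shape, not part of the library: `lookup_many` is a hypothetical name, and `FakeReader` and the local `MissingWordError` are stand-ins so the example runs without a database. With a real database you would pass an `LmdbEmbeddingsReader` and the library's own `MissingWordError`.

```python
class MissingWordError(Exception):
    """Stand-in for lmdb_embeddings.exceptions.MissingWordError."""


class FakeReader:
    """Stand-in reader backed by a dict; mimics get_word_vector only."""

    def __init__(self, vectors):
        self._vectors = vectors

    def get_word_vector(self, word):
        try:
            return self._vectors[word]
        except KeyError:
            raise MissingWordError(word)


def lookup_many(reader, words, missing_error, default=None):
    """Return {word: vector}, substituting `default` for unknown words."""
    found = {}
    for word in words:
        try:
            found[word] = reader.get_word_vector(word)
        except missing_error:
            found[word] = default
    return found


demo_reader = FakeReader({"google": [0.1, 0.2]})
result = lookup_many(demo_reader, ["google", "gooogle"], MissingWordError)
```

Passing the exception class as an argument keeps the helper independent of which reader implementation is used.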

Writing vectors

The example below writes an LMDB vector file from a gensim model. Any iterator that yields word and vector pairs is supported, so if your vectors are in another format you only need to alter the iter_embeddings function below.

I will be writing a CLI to convert standard formats soon.

from gensim.models.keyedvectors import KeyedVectors
from lmdb_embeddings.writer import LmdbEmbeddingsWriter


GOOGLE_NEWS_PATH = 'GoogleNews-vectors-negative300.bin.gz'
OUTPUT_DATABASE_FOLDER = 'GoogleNews-vectors-negative300'


print('Loading gensim model...')
gensim_model = KeyedVectors.load_word2vec_format(GOOGLE_NEWS_PATH, binary = True)


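def iter_embeddings():
    # gensim 4+ exposes the vocabulary as key_to_index;
    # on gensim < 4, iterate over gensim_model.vocab instead.
    for word in gensim_model.key_to_index:
        yield word, gensim_model[word]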

print('Writing vectors to an LMDB database...')

writer = LmdbEmbeddingsWriter(
    iter_embeddings()
).write(OUTPUT_DATABASE_FOLDER)

# These vectors can now be loaded with the LmdbEmbeddingsReader.
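
As an illustration of an alternative iterator, the sketch below reads a plain-text vector file in the common "word v1 v2 ... vN" layout (as used by GloVe text files). The function name `iter_text_embeddings` is hypothetical, and the parser assumes tokens contain no spaces, which does not hold for every published file. It yields plain lists of floats; the example above passes numpy arrays from gensim, so convert with `numpy.asarray` if your serializer or downstream code expects arrays.

```python
import io


def iter_text_embeddings(lines):
    """Yield (word, vector) pairs from lines of 'word v1 v2 ... vN'."""
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        yield word, [float(value) for value in values]


# An in-memory sample stands in for an open text file.
sample = io.StringIO("cat 0.5 -1.0 2.0\ndog 1.5 0.25 -0.75\n")
pairs = list(iter_text_embeddings(sample))
```

With a real file, pass the open file handle in place of `sample`; the generator reads one line at a time, so the whole file is never held in memory.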

Customisation

By default, LMDB Embeddings uses pickle to serialize the vectors to bytes (optimized and pickled with the highest available protocol). To use an alternative approach, inject the serializer and unserializer as callables into the LmdbEmbeddingsWriter and LmdbEmbeddingsReader.

A msgpack serializer is included and can be used in the same way.

from lmdb_embeddings.writer import LmdbEmbeddingsWriter
from lmdb_embeddings.serializers import MsgpackSerializer

writer = LmdbEmbeddingsWriter(
    iter_embeddings(),
    serializer = MsgpackSerializer.serialize
).write(OUTPUT_DATABASE_FOLDER)

from lmdb_embeddings.reader import LmdbEmbeddingsReader
from lmdb_embeddings.serializers import MsgpackSerializer

reader = LmdbEmbeddingsReader(
    OUTPUT_DATABASE_FOLDER,
    unserializer = MsgpackSerializer.unserialize
)
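
A custom serializer can follow the same shape as the msgpack one. The sketch below is a hypothetical `Float32Serializer` built on the standard `struct` module. It assumes the writer calls the serializer with each vector and stores the returned bytes, and that the reader hands those same bytes to the unserializer. It returns a plain list of floats rather than a numpy array, and values are rounded to float32 precision.

```python
import struct


class Float32Serializer:
    """Hypothetical serializer: raw little-endian float32 values."""

    @staticmethod
    def serialize(vector):
        values = [float(value) for value in vector]
        return struct.pack(f"<{len(values)}f", *values)

    @staticmethod
    def unserialize(serialized):
        count = len(serialized) // 4  # 4 bytes per float32
        return list(struct.unpack(f"<{count}f", serialized))


# Round trip on values that float32 represents exactly.
packed = Float32Serializer.serialize([0.5, -1.0, 2.0])
restored = Float32Serializer.unserialize(packed)
```

The two callables would then be passed as `serializer = Float32Serializer.serialize` and `unserializer = Float32Serializer.unserialize`, in the same way as the msgpack example. A database must always be read with the unserializer that matches the serializer used to write it.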

Running tests

pytest
