Fast querying of word embeddings using the LMDB "Lightning" Database.

LMDB Embeddings

Query word vectors (embeddings) with very little lookup overhead and far less memory than gensim or equivalent solutions. This is made possible by the Lightning Memory-Mapped Database (LMDB).

Inspired by Delft. As explained in their readme, this approach lets us have the pre-trained embeddings immediately "warm" (no load time), free memory, and use any number of embeddings simultaneously with a negligible impact on runtime when using an SSD.

For instance, in a traditional approach glove-840B takes around 2 minutes to load and 4 GB of memory. Managed with LMDB, glove-840B can be accessed immediately and takes only a couple of MB of memory, for a negligible impact on runtime (around 1% slower).
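
To illustrate why a memory-mapped store has no load step, here is a minimal sketch using only Python's standard mmap module. It is not how LMDB or this package is implemented: the fixed-size record layout, the file name, and the helpers write_vectors and read_vector are assumptions made up for the example. The point is that only the requested record is decoded, and the rest of the file stays on disk until it is touched.

```python
import mmap
import os
import struct
import tempfile

DIMENSIONS = 4
# One record = DIMENSIONS little-endian float32 values (hypothetical layout).
RECORD = struct.Struct('<%df' % DIMENSIONS)


def write_vectors(path, vectors):
    """Write each vector as one fixed-size binary record."""
    with open(path, 'wb') as handle:
        for vector in vectors:
            handle.write(RECORD.pack(*vector))


def read_vector(path, index):
    """Map the file and decode only the record at position `index`."""
    with open(path, 'rb') as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return list(RECORD.unpack_from(mapped, index * RECORD.size))


demo_path = os.path.join(tempfile.mkdtemp(), 'vectors.bin')
write_vectors(demo_path, [
    [0.0, 0.5, 1.0, 1.5],
    [2.0, 2.5, 3.0, 3.5],
    [4.0, 4.5, 5.0, 5.5],
])
third = read_vector(demo_path, 2)  # no need to read the first two records
```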

Installation

pip install lmdb-embeddings

Reading vectors

from lmdb_embeddings.reader import LmdbEmbeddingsReader
from lmdb_embeddings.exceptions import MissingWordError

embeddings = LmdbEmbeddingsReader('/path/to/word/vectors/eg/GoogleNews-vectors-negative300')

try:
    vector = embeddings.get_word_vector('google')
except MissingWordError:
    # 'google' is not in the database.
    pass
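
When looking up many words at once, it can be convenient to wrap the try/except pattern above in a helper. The sketch below is one possible way to do that. The function lookup_many and the InMemoryReader stand-in are hypothetical names introduced for illustration and are not part of lmdb-embeddings; the only assumption about the reader is the get_word_vector method shown above.

```python
def lookup_many(reader, words, missing_error, default=None):
    """Return {word: vector}, using `default` for words the reader lacks.

    `reader` is any object with a get_word_vector(word) method and
    `missing_error` is the exception it raises for unknown words.
    """
    vectors = {}
    for word in words:
        try:
            vectors[word] = reader.get_word_vector(word)
        except missing_error:
            vectors[word] = default
    return vectors


class InMemoryReader:
    """Hypothetical stand-in so the sketch runs without a database."""

    def __init__(self, table):
        self._table = table

    def get_word_vector(self, word):
        return self._table[word]  # raises KeyError for unknown words


demo_reader = InMemoryReader({'google': [0.1, 0.2], 'apple': [0.3, 0.4]})
result = lookup_many(demo_reader, ['google', 'zzz'], KeyError)
```

With the real reader, the equivalent call would be lookup_many(embeddings, words, MissingWordError).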

Writing vectors

The example below writes an LMDB vector file from a gensim model. Any iterator that yields (word, vector) pairs is supported, so if your vectors are in another format, you only need to adapt the iter_embeddings function below.

I will be writing a CLI to convert standard formats soon.

from gensim.models.keyedvectors import KeyedVectors
from lmdb_embeddings.writer import LmdbEmbeddingsWriter


GOOGLE_NEWS_PATH = 'GoogleNews-vectors-negative300.bin.gz'
OUTPUT_DATABASE_FOLDER = 'GoogleNews-vectors-negative300'


print('Loading gensim model...')
gensim_model = KeyedVectors.load_word2vec_format(GOOGLE_NEWS_PATH, binary=True)


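def iter_embeddings():
    # gensim >= 4.0 renamed `.vocab` to `.key_to_index`; support both.
    vocabulary = getattr(gensim_model, 'key_to_index', None) or gensim_model.vocab
    for word in vocabulary:
        yield word, gensim_model[word]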

print('Writing vectors to an LMDB database...')

writer = LmdbEmbeddingsWriter(iter_embeddings()).write(OUTPUT_DATABASE_FOLDER)

# These vectors can now be loaded with the LmdbEmbeddingsReader.
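
As an example of adapting the iterator to another format, the sketch below parses whitespace-separated text embeddings (one "word v1 v2 ... vN" entry per line, as used by GloVe text files). The function name iter_text_embeddings and its dimensions parameter are hypothetical. It yields plain lists of floats; whether the writer needs NumPy arrays instead is not stated here, so convert them if your setup requires it.

```python
def iter_text_embeddings(lines, dimensions):
    """Yield (word, vector) pairs from 'word v1 ... vN' text lines.

    Splitting from the right keeps tokens that themselves contain spaces
    intact. Lines with the wrong number of fields (for example a
    'count dimensions' header) are skipped.
    """
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue
        parts = line.rsplit(' ', dimensions)
        if len(parts) != dimensions + 1:
            continue
        word, values = parts[0], parts[1:]
        yield word, [float(value) for value in values]


sample_lines = [
    '2 3\n',                  # header line, skipped
    'hello 0.1 0.2 0.3\n',
    'new york 1 2 3\n',       # token containing a space
    '\n',
]
pairs = list(iter_text_embeddings(sample_lines, dimensions=3))
```

Because an open file object is an iterable of lines, it can be passed in place of sample_lines, and the resulting generator can be handed to the writer in the same way as iter_embeddings() above.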

LRU Cache

A reader with an LRU (Least Recently Used) cache is included. It keeps the embeddings for the 50,000 most recently queried words in memory and returns the cached object instead of querying the database again. Its interface is the same as the standard reader's. See functools.lru_cache in the standard library.

from lmdb_embeddings.reader import LruCachedLmdbEmbeddingsReader
from lmdb_embeddings.exceptions import MissingWordError

embeddings = LruCachedLmdbEmbeddingsReader('/path/to/word/vectors/eg/GoogleNews-vectors-negative300')

try:
    vector = embeddings.get_word_vector('google')
except MissingWordError:
    # 'google' is not in the database.
    pass
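
To show what the cache does, here is a minimal sketch built directly on functools.lru_cache. It is not the package's implementation: make_cached_lookup and slow_fetch are hypothetical names, and slow_fetch merely stands in for a database query.

```python
from functools import lru_cache


def make_cached_lookup(fetch, max_size=50_000):
    """Wrap fetch(word) so the most recently used results stay in memory."""
    return lru_cache(maxsize=max_size)(fetch)


calls = []


def slow_fetch(word):
    """Stand-in for a database query; records every real lookup."""
    calls.append(word)
    return [float(len(word))]


cached = make_cached_lookup(slow_fetch, max_size=2)
first = cached('google')
second = cached('google')  # served from the cache, slow_fetch is not called
```

Two consequences follow from this design. The caller receives the same object on every hit, so mutating a returned vector in place also changes what later lookups see. functools.lru_cache also does not cache raised exceptions, so in this sketch a missing word would reach the backing store on every lookup.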

Customisation

By default, LMDB Embeddings uses pickle to serialize the vectors to bytes (optimized and pickled with the highest available protocol). To use a different format, inject the serializer and unserializer as callables into the LmdbEmbeddingsWriter and LmdbEmbeddingsReader respectively.

A msgpack serializer is included and can be used in the same way.

from lmdb_embeddings.writer import LmdbEmbeddingsWriter
from lmdb_embeddings.serializers import MsgpackSerializer

writer = LmdbEmbeddingsWriter(
    iter_embeddings(),
    serializer=MsgpackSerializer().serialize
).write(OUTPUT_DATABASE_FOLDER)

from lmdb_embeddings.reader import LmdbEmbeddingsReader
from lmdb_embeddings.serializers import MsgpackSerializer

reader = LmdbEmbeddingsReader(
    OUTPUT_DATABASE_FOLDER,
    unserializer=MsgpackSerializer().unserialize
)
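
A custom pair of callables could look like the sketch below, which packs a vector as raw little-endian float32 values using the standard struct module. It assumes the serializer receives a sequence of floats and returns bytes, and that the unserializer receives those bytes back. The function names are hypothetical, and the result is a plain list rather than a NumPy array.

```python
import struct

FLOAT_SIZE = 4  # bytes per float32 value


def serialize_float32(vector):
    """Pack a sequence of floats into little-endian float32 bytes."""
    values = list(vector)
    return struct.pack('<%df' % len(values), *values)


def unserialize_float32(data):
    """Unpack little-endian float32 bytes into a list of floats."""
    count = len(data) // FLOAT_SIZE
    return list(struct.unpack('<%df' % count, data))


payload = serialize_float32([0.5, -1.25, 3.0])
restored = unserialize_float32(payload)
```

These would be passed as serializer=serialize_float32 and unserializer=unserialize_float32 in the same way as the msgpack example. Note that float32 storage rounds values that are not exactly representable in 32 bits.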

Running tests

pytest

Author

GitHub: DomHudson

Contributing

Contributions, issues and feature requests are welcome!

Show your support

Give a ⭐️ if this project helped you!

License

Release history

0.4.0 (this version): Feb 24, 2020
0.3.0: Dec 16, 2019
0.2.1: Oct 20, 2018
