
Project description

embeddingcache


Retrieve text embeddings, but cache them locally if we have already computed them.

Motivation

If you are doing a handful of different NLP tasks, or have a single NLP pipeline you keep tuning, you probably don't want to recompute embeddings. Hence, we cache them.

Quickstart

pip install embeddingcache

from pathlib import Path

from embeddingcache.embeddingcache import get_embeddings

embeddings = get_embeddings(
    strs=["hi", "I love Berlin."],
    embedding_model="all-MiniLM-L6-v2",
    db_directory=Path("dbs/"),
    verbose=True,
)
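
On a second call with the same strings and model, embeddings are read back from the SQLite cache in db_directory instead of being recomputed. A minimal sketch of that behavior, assuming the function returns a numpy array (the timing printout is illustrative only, not a benchmark):

import time
from pathlib import Path

from embeddingcache.embeddingcache import get_embeddings

texts = ["hi", "I love Berlin."]

start = time.perf_counter()
first = get_embeddings(
    strs=texts,
    embedding_model="all-MiniLM-L6-v2",
    db_directory=Path("dbs/"),
)
print(f"first call (computes and caches): {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
second = get_embeddings(
    strs=texts,
    embedding_model="all-MiniLM-L6-v2",
    db_directory=Path("dbs/"),
)
print(f"second call (served from cache): {time.perf_counter() - start:.3f}s")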

Design assumptions

We use SQLite3 to cache embeddings. [Since we use SQLAlchemy, this could easily be adapted to other database backends.]

We assume read-heavy loads, with one concurrent writer. (However, we retry on write failures.)
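
For illustration, a write retry of this kind might look like the sketch below; the helper name, backoff schedule, and use of a SQLAlchemy session are assumptions, not the library's actual code:

import time

from sqlalchemy.exc import OperationalError


def write_with_retry(session, rows, attempts=5, backoff=0.1):
    # Retry the cache write a few times: with a single concurrent writer,
    # transient "database is locked" errors should clear quickly.
    for attempt in range(attempts):
        try:
            session.add_all(rows)
            session.commit()
            return
        except OperationalError:
            session.rollback()
            time.sleep(backoff * (2 ** attempt))
    raise RuntimeError("cache write failed after retries")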

We shard SQLite3 into two databases:

  • hashstring.db: a hashstring table. Each row maps a SHA512 hash (unique, primary key) to the original text (also unique). Both fields are indexed.
  • [embedding_model_name].db: an embedding table. Each row maps a SHA512 hash (unique, primary key) to a 1-dimensional numpy float32 vector, which we serialize to the table as bytes.
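
For illustration only, the schema described above might be declared roughly as follows with SQLAlchemy; the table, column, and helper names here are assumptions and may differ from embeddingcache's actual code:

import hashlib

import numpy as np
from sqlalchemy import Column, LargeBinary, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class HashString(Base):
    # hashstring.db: SHA512 digest -> original text, both unique and indexed.
    __tablename__ = "hashstring"
    hash = Column(LargeBinary(64), primary_key=True)  # 64-byte SHA512 digest
    text = Column(Text, unique=True, index=True)


class Embedding(Base):
    # [embedding_model_name].db: SHA512 digest -> serialized float32 vector.
    __tablename__ = "embedding"
    hash = Column(LargeBinary(64), primary_key=True)
    vector = Column(LargeBinary)  # raw bytes of a 1-dim np.float32 array


def sha512(text: str) -> bytes:
    return hashlib.sha512(text.encode("utf-8")).digest()


def vector_to_bytes(vec: np.ndarray) -> bytes:
    return np.asarray(vec, dtype=np.float32).tobytes()


def bytes_to_vector(blob: bytes) -> np.ndarray:
    return np.frombuffer(blob, dtype=np.float32)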

Developer instructions

pre-commit install
pip install -e .
pytest

TODO

  • Update pyproject.toml
  • Add tests
  • Consider other hash functions?
  • float32 and float64 support
  • Consider adding optional joblib for caching?
  • Different ways of computing embeddings (e.g. using an API) rather than locally
  • S3 backup and/or:
    • WAL
    • LiteStream
  • Retry on write errors
  • Other DB backends
  • Best practices: Give specific OpenAI version number.
  • RocksDB / RocksDB-cloud?
  • Include model name in DB for sanity check on slugify.
  • Validate on numpy array size.
  • Validate BLOB size for hashes.
  • Add optional libraries like openai and sentence-transformers
    • Also consider other embedding providers, e.g. cohere
    • And libs just for devs
  • Consider the max_length of each text to embed, and warn if we exceed it
  • pdoc3 and/or sphinx
  • Normalize embeddings by default, but add option
  • Option to return torch tensors
  • Consider reusing the same DB connection instead of creating it from scratch every time.
  • Add batch_size parameter?
  • Test check for collisions
  • Use logging instead of verbose output.
  • Rewrite using classes.
  • Fix dependabot.
  • Don't keep re-creating the DB session from scratch; store it in the class or a global
  • DRY.
  • Suggest to use versioned OpenAI model
  • Add device to sentence transformers
  • Allow fast_sentence_transformers
  • Test that things work if there are duplicate strings
  • Remove DBs after test
  • Do we have to have nested embedding.embedding for all calls?
  • codecov and code quality shields



Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

embeddingcache-0.1.0.tar.gz (11.9 kB)

Uploaded Source

Built Distribution

embeddingcache-0.1.0-py3-none-any.whl (10.4 kB)

Uploaded Python 3

File details

Details for the file embeddingcache-0.1.0.tar.gz.

File metadata

  • Download URL: embeddingcache-0.1.0.tar.gz
  • Upload date:
  • Size: 11.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.4

File hashes

Hashes for embeddingcache-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a1e617b220d8fcdf0dbc391ec4575d5b7debe6b7659eadbc6800dd9491d54e06
MD5 1073c22980094b8d8b32be12d83334d0
BLAKE2b-256 339494db8fd64a06d5f0894c6cc66447d4ffcfda30f1b5e0bd653731e65eb569


File details

Details for the file embeddingcache-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for embeddingcache-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fb7a2fab718d9d1f9780441879b147986129a9f04cf322713c03ab3a7c893346
MD5 e7ac3e6ee2630b7fde2dab3dbe133972
BLAKE2b-256 c5be4475025e728c00ab4c22c12d71a8d08c8e1bd7d58121a54ff38e93adad9f

