embeddingcache

Retrieve text embeddings, but cache them locally if we have already computed them.

Motivation

If you are doing a handful of different NLP tasks, or have a single NLP pipeline you keep tuning, you probably don't want to recompute embeddings. Hence, we cache them.

Quickstart

pip install embeddingcache

from pathlib import Path

from embeddingcache.embeddingcache import get_embeddings

embeddings = get_embeddings(
    strs=["hi", "I love Berlin."],
    embedding_model="all-MiniLM-L6-v2",
    db_directory=Path("dbs/"),
    verbose=True,
)
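
On the first call, embeddings are computed (here with the all-MiniLM-L6-v2 sentence-transformers model) and stored under dbs/; calling get_embeddings again with the same strings and model should hit the local cache rather than recompute.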

Design assumptions

We use SQLite3 to cache embeddings. [Since we use SQLAlchemy, this could easily be adapted to other database backends.]

We assume read-heavy loads, with one concurrent writer. (However, we retry on write failures.)
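
As a rough sketch of that retry behavior (the function name and backoff parameters below are illustrative assumptions, not the package's actual internals), a writer might back off and retry when SQLite reports a transient failure such as "database is locked":

import time

from sqlalchemy.exc import OperationalError

def commit_with_retry(session, retries=5, base_delay=0.05):
    # Commit pending writes, retrying on transient SQLite errors
    # (e.g. "database is locked" raised by a concurrent writer).
    for attempt in range(retries):
        try:
            session.commit()
            return
        except OperationalError:
            session.rollback()
            time.sleep(base_delay * 2**attempt)  # exponential backoff
    raise RuntimeError("commit failed after retries")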

We shard SQLite3 into two databases:

  • hashstring.db — hashstring table. Each row maps a SHA512 hash (unique, primary key) to its text (also unique). Both fields are indexed.

  • [embedding_model_name].db — embedding table. Each row maps a SHA512 hash (unique, primary key) to a 1-dimensional numpy float32 vector, which we serialize to the table as bytes.
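
A minimal sketch of this layout in SQLAlchemy (class and column names here are illustrative assumptions, not the package's actual internals):

import hashlib

import numpy as np
from sqlalchemy import Column, LargeBinary, Text, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class HashString(Base):
    # hashstring.db: SHA512 digest -> original text
    __tablename__ = "hashstring"
    hash = Column(LargeBinary(64), primary_key=True)  # 64-byte SHA512 digest
    text = Column(Text, unique=True, index=True)

class Embedding(Base):
    # [embedding_model_name].db: SHA512 digest -> serialized float32 vector
    __tablename__ = "embedding"
    hash = Column(LargeBinary(64), primary_key=True)
    vector = Column(LargeBinary, nullable=False)  # float32 array as raw bytes

# Each table lives in its own database file, so one engine per shard, e.g.:
engine = create_engine("sqlite:///dbs/hashstring.db")

# Hashing and (de)serialization round-trip:
digest = hashlib.sha512("I love Berlin.".encode("utf-8")).digest()
blob = np.zeros(384, dtype=np.float32).tobytes()  # serialize to bytes
vector = np.frombuffer(blob, dtype=np.float32)    # deserialize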

Developer instructions

pre-commit install   # install the git pre-commit hooks
pip install -e .     # editable install of the package
pytest               # run the test suite

TODO

  • Update pyproject.toml
  • Add tests
  • Consider other hash functions?
  • float32 and float64 support
  • Consider adding optional joblib for caching?
  • Different ways of computing embeddings (e.g. using an API) rather than locally
  • S3 backup and/or:
    • WAL
    • Litestream
  • Retry on write errors
  • Other DB backends
  • Best practices: give a specific OpenAI model version number.
  • RocksDB / RocksDB-cloud?
  • Include the model name in the DB as a sanity check on slugify.
  • Validate numpy array size.
  • Validate BLOB size for hashes.
  • Add optional libraries like openai and sentence-transformers
    • Also consider other embedding providers, e.g. cohere
    • And libs just for devs
  • Consider the max_length of each text to embed; warn if we exceed it
  • pdoc3 and/or sphinx
  • Normalize embeddings by default, but add option
  • Option to return torch tensors
  • Consider reusing the same DB connection instead of creating it from scratch every time.
  • Add batch_size parameter?
  • Test the check for collisions
  • Use logging instead of verbose output.
  • Rewrite using classes.
  • Fix dependabot.
  • Don't keep re-creating the DB session; store it in the class or a global
  • DRY.
  • Suggest using a versioned OpenAI model
  • Add a device option for sentence-transformers
  • Allow fast_sentence_transformers
  • Test that things work if there are duplicate strings
  • Remove DBs after test
  • Do we have to have nested embedding.embedding for all calls?
  • codecov and code quality shields
