Retrieve text embeddings, caching them locally so that previously computed embeddings are never recomputed.
Project description
Motivation
If you are working on a handful of different NLP tasks, or keep tuning a single NLP pipeline, you probably don't want to recompute embeddings for the same text over and over. Hence, we cache them.
Quickstart
pip install embeddingcache
from pathlib import Path
from embeddingcache.embeddingcache import get_embeddings

embeddings = get_embeddings(
    strs=["hi", "I love Berlin."],
    embedding_model="all-MiniLM-L6-v2",
    db_directory=Path("dbs/"),
    verbose=True,
)
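The return value should be one embedding per input string, stored as float32 (see Design assumptions below); calling get_embeddings again with the same strings is served from the local cache rather than recomputed. A minimal sanity-check sketch, assuming the function returns a 2-D numpy array:

import numpy as np

# One row per input string, stored as float32 (see Design assumptions).
assert embeddings.shape[0] == 2
assert embeddings.dtype == np.float32
# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings.
assert embeddings.shape[1] == 384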
Design assumptions
We use SQLite3 to cache embeddings. [Other backends could be adapted easily, since we use SQLAlchemy.]
We assume read-heavy loads, with one concurrent writer. (However, we retry on write failures.)
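Here is a minimal sketch of the retry-on-write idea, assuming SQLAlchemy sessions and SQLite's "database is locked" errors; the function name and parameters are illustrative, not the project's actual implementation:

import time
from sqlalchemy.exc import OperationalError

def commit_with_retry(session, retries=5, backoff=0.1):
    """Commit a session, retrying when SQLite reports a locked database."""
    for attempt in range(retries):
        try:
            session.commit()
            return
        except OperationalError:  # e.g. "database is locked"
            session.rollback()
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError("could not acquire the SQLite write lock")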
We shard SQLite3 into two databases (see the schema sketch below):
- hashstring.db: hashstring table. Each row maps a SHA512 hash (unique, primary key) to the original text (also unique). Both columns are indexed.
- [embedding_model_name].db: embedding table. Each row maps the same SHA512 hash (unique, primary key) to a one-dimensional float32 numpy vector, serialized to the table as bytes.
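The layout above could be expressed as SQLAlchemy models roughly as follows; the class names and helper functions are illustrative assumptions, not the project's actual code:

import hashlib

import numpy as np
from sqlalchemy import Column, LargeBinary, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class HashString(Base):
    # Lives in hashstring.db: maps a SHA512 digest to the original text.
    __tablename__ = "hashstring"
    hash = Column(String(128), primary_key=True)  # SHA512 hex digest
    text = Column(Text, unique=True, nullable=False, index=True)

class Embedding(Base):
    # Lives in [embedding_model_name].db: maps the same digest to a vector.
    __tablename__ = "embedding"
    hash = Column(String(128), primary_key=True)
    embedding = Column(LargeBinary, nullable=False)  # float32 bytes

def text_key(text: str) -> str:
    return hashlib.sha512(text.encode("utf-8")).hexdigest()

def vector_to_bytes(vec: np.ndarray) -> bytes:
    return np.asarray(vec, dtype=np.float32).tobytes()

def bytes_to_vector(blob: bytes) -> np.ndarray:
    return np.frombuffer(blob, dtype=np.float32)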
Developer instructions
pre-commit install
pip install -e .
pytest
TODO
- Update pyproject.toml
- Add tests
- Consider other hash functions?
- float32 and float64 support
- Consider adding optional joblib for caching?
- Support different ways of computing embeddings (e.g. via an API) rather than only locally
- S3 backup and/or:
  - WAL
  - Litestream
- Retry on write errors
- Other DB backends
- Best practices: recommend a specific OpenAI model version number.
- RocksDB / RocksDB-cloud?
- Include the model name in the DB as a sanity check on slugification.
- Validate numpy array size.
- Validate BLOB size for hashes.
- Add optional libraries like openai and sentence-transformers
- Also consider other embedding providers, e.g. Cohere
- Separate out libraries needed just for development
- Consider the max_length of each text to embed, and warn if we exceed it
- pdoc3 and/or sphinx
- Normalize embeddings by default, but add an option to disable it
- Option to return torch tensors
- Consider reusing the same DB connection instead of creating it from scratch every time.
- Add batch_size parameter?
- Test for hash collisions
- Use logging instead of verbose output.
- Rewrite using classes.
- Fix dependabot.
- Don't keep re-creating the DB session; store it in the class or a global
- DRY.
- Suggest using a versioned OpenAI model
- Add a device option for sentence-transformers
- Allow fast_sentence_transformers
- Test that things work when there are duplicate strings
- Remove DBs after tests
- Do we have to have nested embedding.embedding for all calls?
- Add codecov and code quality shields
File details
Details for the file embeddingcache-0.1.0.tar.gz.
File metadata
- Download URL: embeddingcache-0.1.0.tar.gz
- Upload date:
- Size: 11.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | a1e617b220d8fcdf0dbc391ec4575d5b7debe6b7659eadbc6800dd9491d54e06
MD5 | 1073c22980094b8d8b32be12d83334d0
BLAKE2b-256 | 339494db8fd64a06d5f0894c6cc66447d4ffcfda30f1b5e0bd653731e65eb569
File details
Details for the file embeddingcache-0.1.0-py3-none-any.whl.
File metadata
- Download URL: embeddingcache-0.1.0-py3-none-any.whl
- Upload date:
- Size: 10.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | fb7a2fab718d9d1f9780441879b147986129a9f04cf322713c03ab3a7c893346
MD5 | e7ac3e6ee2630b7fde2dab3dbe133972
BLAKE2b-256 | c5be4475025e728c00ab4c22c12d71a8d08c8e1bd7d58121a54ff38e93adad9f