Retrieve text embeddings, but cache them locally if we have already computed them.
Project description
Motivation
If you are doing a handful of different NLP tasks, or have a single NLP pipeline you keep tuning, you probably don't want to recompute embeddings. Hence, we cache them.
Quickstart
pip install embeddingcache
    from pathlib import Path

    from embeddingcache.embeddingcache import get_embeddings

    embeddings = get_embeddings(
        strs=["hi", "I love Berlin."],
        embedding_model="all-MiniLM-L6-v2",
        db_directory=Path("dbs/"),
        verbose=True,
    )
Design assumptions
We use SQLite3 to cache embeddings. [This could be adapted easily, since we use SQLAlchemy.]
We assume read-heavy loads, with one concurrent writer. (However, we retry on write failures.)
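The retry-on-write behavior could look roughly like this sketch, which retries a write when SQLite reports the database is locked by another writer (the function name and backoff parameters are illustrative, not the library's actual code):

```python
import sqlite3
import time


def execute_with_retry(conn, sql, params=(), retries=5, delay=0.1):
    """Execute a write, retrying with exponential backoff if the DB is locked."""
    for attempt in range(retries):
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(sql, params)
            return
        except sqlite3.OperationalError as exc:
            # Only retry lock contention; re-raise anything else, or give up
            # after the last attempt.
            if "locked" not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(delay * (2 ** attempt))
```

With a single concurrent writer this loop rarely spins more than once; the backoff matters mainly when several processes briefly contend for the write lock.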
We shard SQLite3 into two databases:
- hashstring.db: hashstring table. Each row maps a SHA512 hash (unique, primary key) to its original text (also unique). Both fields are indexed.
- [embedding_model_name].db: embedding table. Each row maps a SHA512 hash (unique, primary key) to a 1-dim numpy (float32) vector, which we serialize to the table as bytes.
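The hashing and serialization scheme described above can be sketched as follows (these helper names are illustrative, not the library's API):

```python
import hashlib

import numpy as np


def text_key(text: str) -> bytes:
    # SHA512 of the UTF-8 text serves as the primary key in both tables.
    return hashlib.sha512(text.encode("utf-8")).digest()


def vector_to_bytes(vec: np.ndarray) -> bytes:
    # Store the float32 embedding vector as a raw BLOB.
    return np.ascontiguousarray(vec, dtype=np.float32).tobytes()


def bytes_to_vector(blob: bytes) -> np.ndarray:
    # Recover the 1-dim float32 vector from the stored BLOB.
    return np.frombuffer(blob, dtype=np.float32)
```

Keying both tables on the same SHA512 digest is what lets the text store and the per-model embedding store live in separate database files: a lookup hashes the input text once, then joins across files by digest.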
Developer instructions
pre-commit install
pip install -e .
pytest
TODO
- Update pyproject.toml
- Add tests
- Consider other hash functions?
- float32 and float64 support
- Consider adding optional joblib for caching?
- Different ways of computing embeddings (e.g. using an API) rather than locally
- S3 backup and/or:
  - WAL
  - LiteStream
- Retry on write errors
- Other DB backends
- Best practices: Give specific OpenAI version number.
- RocksDB / RocksDB-cloud?
- Include model name in DB for sanity check on slugify.
- Validate on numpy array size.
- Validate BLOB size for hashes.
- Add optional libraries like openai and sentence-transformers
- Also consider other embedding providers, e.g. cohere
- And optional extras for dev-only libraries
- Consider the max_length of each text to embed, warn if we exceed
- pdoc3 and/or sphinx
- Normalize embeddings by default, but add option
- Option to return torch tensors
- Consider reusing the same DB connection instead of creating it from scratch every time.
- Add batch_size parameter?
- Add a test that checks for hash collisions
- Use logging instead of verbose output.
- Rewrite using classes.
- Fix dependabot.
- Don't keep re-creating the DB session; store it in the class or a global
- DRY.
- Suggest using a versioned OpenAI model name
- Add device to sentence transformers
- Allow fast_sentence_transformers
- Test that things work if there are duplicate strings
- Remove DBs after test
- Do we have to have nested embedding.embedding for all calls?
- codecov and code quality shields
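The "normalize embeddings by default, but add option" item above might look something like this sketch (the function name is hypothetical):

```python
import numpy as np


def normalize_rows(embeddings: np.ndarray) -> np.ndarray:
    """L2-normalize each row so downstream cosine similarity is a dot product."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return embeddings / np.maximum(norms, 1e-12)
```

Normalizing at retrieval time (rather than before caching) keeps the raw vectors in the DB, so an unnormalized option costs nothing extra.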