A GGUF embeddings plugin for OVOS

Project description

GGUFTextEmbeddingsPlugin

The GGUFTextEmbeddingsPlugin is a plugin for generating and managing text embeddings.

It integrates with ovos-chromadb-embeddings-plugin for storing and retrieving text embeddings.

This plugin leverages the llama-cpp-python library to generate text embeddings.
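
As a rough illustration of what generating embeddings with llama-cpp-python looks like, here is a minimal sketch. It assumes a GGUF embedding model is already on disk; the helper name `embed_texts` is hypothetical and this is not the plugin's internal code.

```python
from typing import List


def embed_texts(model_path: str, texts: List[str]) -> List[List[float]]:
    """Return one embedding per input text (illustrative helper, not plugin API)."""
    # Imported lazily so this module loads even without llama-cpp-python installed.
    from llama_cpp import Llama

    # embedding=True loads the model for embedding extraction rather than generation.
    llm = Llama(model_path=model_path, embedding=True, verbose=False)
    response = llm.create_embedding(texts)
    # The response is OpenAI-style: {"data": [{"embedding": [...]}, ...]}
    return [item["embedding"] for item in response["data"]]
```

A call would look like `embed_texts("/path/to/model.gguf", ["hello world"])`, where the path is a placeholder for a model you have downloaded.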

GGUF models are used to keep third-party dependencies to a minimum, ensuring this plugin is lightweight and suitable for low-powered hardware.

Features

Text Embeddings Extraction: Converts text into embeddings using a GGUF model loaded through llama_cpp.
Text Data Storage: Stores and retrieves text embeddings using ChromaEmbeddingsDB.
Text Data Management: Allows for adding, querying, and deleting text embeddings associated with documents.

Suggested Models

You can specify a downloaded model path, or use one of the pre-defined model strings in the table below.

If needed, a model will be automatically downloaded to ~/.cache/gguf_models.
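
The lookup described above (local path first, otherwise the cache directory) could be sketched as follows. The helper `resolve_model_path` and the `<name>.gguf` cache file naming are assumptions made for illustration; the plugin's real resolution and download logic may differ.

```python
from pathlib import Path

# Cache directory named in the text above.
CACHE_DIR = Path.home() / ".cache" / "gguf_models"


def resolve_model_path(model: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Return a local path for `model` (hypothetical helper, not plugin API)."""
    candidate = Path(model).expanduser()
    if candidate.is_file():
        # The caller passed a model file that already exists on disk.
        return candidate
    # Otherwise treat the string as a pre-defined model name and point at the
    # cache, where a real implementation would download the file if missing.
    filename = model if model.endswith(".gguf") else f"{model}.gguf"
    return cache_dir / filename
```

This sketch only computes the path; it does not download anything.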

Model Name	URL	Description	Suggested Use Cases
all-MiniLM-L6-v2	Link	A sentence-transformers model that maps sentences & paragraphs to a 384-dimensional dense vector space. Fine-tuned on a 1B sentence pairs dataset using contrastive learning. Ideal for general-purpose tasks like information retrieval, clustering, and sentence similarity.	Suitable for tasks that require fast inference and can tolerate slightly lower accuracy, such as real-time applications.
all-MiniLM-L12-v2	Link	A larger MiniLM model mapping sentences & paragraphs to a 384-dimensional dense vector space. Fine-tuned on a 1B sentence pairs dataset using contrastive learning. Provides higher accuracy for complex tasks.	Suitable for more complex NLP tasks requiring higher accuracy, such as detailed semantic analysis, document ranking, and clustering.
multi-qa-MiniLM-L6-cos-v1	Link	A sentence-transformers model mapping sentences & paragraphs to a 384-dimensional dense vector space, trained on 215M QA pairs. Designed for semantic search.	Best for semantic search, encoding queries/questions, and finding relevant documents or passages in QA tasks.
gist-all-minilm-l6-v2	Link	Enhanced version of all-MiniLM-L6-v2 using GISTEmbed method, improving in-batch negative selection during training. Demonstrates state-of-the-art performance on specific tasks with a focus on reducing data noise and improving model fine-tuning.	Ideal for high-accuracy retrieval tasks, semantic search, and applications requiring efficient smaller models with robust performance, such as resource-constrained environments.
paraphrase-multilingual-minilm-l12-v2	Link	A sentence-transformers model mapping sentences & paragraphs to a 384-dimensional dense vector space. Supports multiple languages, optimized for paraphrasing tasks.	Perfect for multilingual applications, translation services, and tasks requiring paraphrase detection and generation.
e5-small-v2	Link	Text Embeddings by Weakly-Supervised Contrastive Pre-training. This model has 12 layers and the embedding size is 384. Size is about 30MB.	Ideal for applications requiring efficient, small-sized models with robust text embeddings.
gte-small	Link	General Text Embeddings (GTE) model. Trained using multi-stage contrastive learning by Alibaba DAMO Academy. Based on the BERT framework, it covers a wide range of domains and scenarios. About 30MB.	Suitable for information retrieval, semantic textual similarity, text reranking, and various other downstream tasks requiring text embeddings.
gte-base	Link	Larger version of gte-small. About 75MB.
gte-large	Link	Larger version of gte-base. About 220MB.
snowflake-arctic-embed-l	Link	Part of the snowflake-arctic-embed suite, this model focuses on high-quality retrieval and achieves state-of-the-art performance on the MTEB/BEIR leaderboard. Trained using a multi-stage pipeline with a mix of public and proprietary data. About 215MB.	Optimized for high-performance text retrieval tasks and achieving top accuracy in retrieval benchmarks.
snowflake-arctic-embed-m	Link	Based on the intfloat/e5-base-unsupervised model, this medium-sized model balances high retrieval performance with efficient inference. About 75MB	Ideal for general-purpose retrieval tasks requiring a balance between performance and efficiency.
snowflake-arctic-embed-m.long	Link	Based on the nomic-ai/nomic-embed-text-v1-unsupervised model, this long-context variant supports up to 2048 tokens without RPE and up to 8192 tokens with RPE. Perfect for long-context workloads. About 90MB	Suitable for tasks requiring long-context embeddings, such as complex document analysis or extensive information retrieval.
snowflake-arctic-embed-s	Link	Based on the intfloat/e5-small-unsupervised model, this small model offers high retrieval accuracy with only 33M parameters and 384 dimensions.	Suitable for applications needing efficient, high-accuracy retrieval in constrained environments.
snowflake-arctic-embed-xs	Link	Based on the all-MiniLM-L6-v2 model, this tiny model has only 22M parameters and 384 dimensions, providing a balance of low latency and high retrieval accuracy.	Best for ultra-low latency applications with strict size and cost constraints.
nomic-embed-text-v1.5	Link	About 85MB. Resizable Production Embeddings with Matryoshka Representation Learning. The model is trained in two stages, starting with unsupervised contrastive learning on weakly related text pairs, followed by finetuning with high-quality labeled datasets. It is now multimodal, aligning with nomic-embed-vision-v1	Ideal for applications requiring flexible embedding sizes and multimodal capabilities.
uae-large-v1	Link	Universal AnglE Embedding. AnglE-optimized Text Embeddings with a novel angle optimization approach. About 220MB.	Best for high-quality text embeddings in semantic textual similarity tasks, including short-text and long-text STS.
labse	Link	A port of the LaBSE model. Maps 109 languages to a shared vector space, supports up to 512 tokens of context. The model is optimized for producing similar representations for bilingual sentence pairs. About 390MB.	Suitable for multilingual applications, translation mining, and cross-lingual text embedding tasks.
bge-large-en-v1.5	Link	Part of the BGE series, designed for diverse retrieval tasks. About 216MB.
bge-base-en-v1.5	Link	Medium version of bge-large-en-v1.5. About 80MB.
bge-small-en-v1.5	Link	Small version of bge-large-en-v1.5. About 30MB.
gist-embedding-v0	Link	GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning. Fine-tuned on top of the BAAI/bge-base-en-v1.5 using the MEDI dataset augmented with mined triplets from the MTEB Classification training dataset.	Ideal for applications requiring embeddings without crafting instructions for queries.
gist-large-embedding-v0	Link	Large version of gist-embedding-v0.
gist-small-embedding-v0	Link	Small version of gist-embedding-v0.
mxbai-embed-large-v1	Link	Trained using AnglE loss on high-quality, large-scale data. It achieves SOTA performance at BERT-large scale. About 220MB.	Best for tasks requiring high precision and detailed embeddings. Provides state-of-the-art performance among efficiently sized models.
acge_text_embedding	Link	The ACGE model is developed by the Huhu Information Technology team on the TextIn platform. It is a general-purpose text encoding model that uses Matryoshka Representation Learning for variable-length vectorization. About 200MB.	Ideal for Chinese text.
gte-Qwen2-7B-instruct	Link	The latest in the GTE model family, ranking No.1 in English and Chinese evaluations on the MTEB benchmark. Based on the Qwen2-7B LLM model, it integrates bidirectional attention mechanisms and instruction tuning, with comprehensive multilingual training. 4.68GB	Best for high-performance multilingual text embeddings and complex tasks requiring top-tier contextual understanding.
gte-Qwen2-1.5B-instruct	Link	The latest model in the GTE (General Text Embedding) family. Built on the Qwen2-1.5B LLM, it uses the same training data and strategies as the gte-Qwen2-7B-instruct model. About 1.12GB.

By default, paraphrase-multilingual-minilm-l12-v2 is used if no model is specified.

Usage

Here is a quick example of how to use the GGUFTextEmbeddingsPlugin:

from ovos_gguf_embeddings import GGUFTextEmbeddingsStore
from ovos_chromadb_embeddings import ChromaEmbeddingsDB

# ChromaDB database that stores the embeddings
db = ChromaEmbeddingsDB("./my_db")
# model is a downloaded model path or one of the pre-defined model strings
gguf = GGUFTextEmbeddingsStore(db, model="all-MiniLM-L6-v2.Q4_K_M.gguf")
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]
# Embed each document and add it to the store
for s in corpus:
    gguf.add_document(s)

# Retrieve the 2 stored documents most similar to the query, with their scores
docs = gguf.query_document("does the fish purr like a cat?", top_k=2)
print(docs)
# [('a cat is a feline and likes to purr', 0.6548102001030748),
# ('a fish is a creature that lives in water and swims', 0.5436657174406345)]
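
To give an intuition for how a top_k query ranks documents, here is a small self-contained sketch using cosine similarity over toy vectors. Cosine similarity is an assumption here; the actual metric is whatever the database backend is configured to use, and the function names and 2-dimensional vectors are made up for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_documents(
    query_vec: Sequence[float],
    stored: Dict[str, Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every stored document against the query and keep the best top_k."""
    scored = [(doc, cosine_similarity(query_vec, vec)) for doc, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional vectors, invented for illustration only.
stored = {"cat": [1.0, 0.0], "fish": [0.6, 0.8], "bird": [0.0, 1.0]}
results = top_k_documents([1.0, 0.0], stored, top_k=2)
print(results)  # "cat" ranks first, then "fish"
```

A real store would use an index instead of scoring every document, but the returned shape (document, score) matches the output shown above.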

CLI Interface

$ ovos-gguf-embeddings --help
Usage: ovos-gguf-embeddings [OPTIONS] COMMAND [ARGS]...

  CLI for interacting with the GGUF Text Embeddings Store.

Options:
  --help  Show this message and exit.

Commands:
  add-document     Add a document to the embeddings store.
  delete-document  Delete a document from the embeddings store.
  query-document   Query the embeddings store to find similar documents...

$ ovos-gguf-embeddings add-document --help
Usage: ovos-gguf-embeddings add-document [OPTIONS] DOCUMENT

  Add a document to the embeddings store.

  DOCUMENT: The document string or file path to be added to the store.

  FROM-FILE: Flag indicating whether the DOCUMENT argument is a file path. If
  set, the file is read and processed.

  USE-SENTENCES: Flag indicating whether to tokenize the document into
  sentences. If not set, the document is split into paragraphs.

  DATABASE: Path to the ChromaDB database where the embeddings are stored.
  (Required)

  MODEL: Name or URL of the model used for generating embeddings. (Defaults to
  'paraphrase-multilingual-minilm-l12-v2')

Options:
  --database TEXT  Path to the ChromaDB database where the embeddings are
                   stored.
  --model TEXT     Model name or URL used for generating embeddings. Defaults
                   to "paraphrase-multilingual-minilm-l12-v2".
  --from-file      Indicates if the document argument is a file path.
  --use-sentences  Indicates if the document should be tokenized into
                   sentences; otherwise, it is split into paragraphs.
  --help           Show this message and exit.
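
The --use-sentences flag described above chooses between sentence and paragraph splitting. A naive, regex-based sketch of that choice is shown below; `split_document` is a hypothetical helper and the plugin may use a different tokenizer.

```python
import re
from typing import List


def split_document(text: str, use_sentences: bool = False) -> List[str]:
    """Split text into sentences or paragraphs (illustrative, regex-based)."""
    if use_sentences:
        # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text)
    else:
        # Paragraphs are separated by one or more blank lines.
        parts = re.split(r"\n\s*\n", text)
    # Drop empty chunks and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]
```

Each returned chunk would then be embedded and stored as its own document.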

$ ovos-gguf-embeddings query-document --help
Usage: ovos-gguf-embeddings query-document [OPTIONS] QUERY

  Query the embeddings store to find similar documents to the given query.

  QUERY: The query string used to search for similar documents.

  DATABASE: Path to the ChromaDB database where the embeddings are stored. Can
  be a full path or a simple string.           If a simple string is provided,
  it will be saved in the XDG cache directory (~/.cache/chromadb/{database}).

  MODEL: Name or URL of the model used for generating embeddings. (Defaults to
  'paraphrase-multilingual-minilm-l12-v2')

  TOP-K: Number of top results to return. (Defaults to 5)

Options:
  --database TEXT  Path to the ChromaDB database where the embeddings are
                   stored.
  --model TEXT     Model name or URL used for generating embeddings. Defaults
                   to "paraphrase-multilingual-minilm-l12-v2".
  --top-k INTEGER  Number of top results to return. Defaults to 5.
  --help           Show this message and exit.
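
The DATABASE description above says a simple string is stored under the XDG cache directory. One way that mapping could work is sketched here; `resolve_database_path` and the rule used to tell a path from a simple name are assumptions, not the CLI's actual code.

```python
import os
from pathlib import Path


def resolve_database_path(database: str) -> Path:
    """Map a --database value to a directory (hypothetical helper)."""
    if os.sep in database or database.startswith("~"):
        # Looks like a path: use it as given, expanding "~" if present.
        return Path(database).expanduser()
    # A simple name goes under the XDG cache directory (default ~/.cache).
    cache_home = Path(os.environ.get("XDG_CACHE_HOME") or Path.home() / ".cache")
    return cache_home / "chromadb" / database
```

With no XDG_CACHE_HOME set, `resolve_database_path("my_db")` points at ~/.cache/chromadb/my_db.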

$ ovos-gguf-embeddings delete-document --help
Usage: ovos-gguf-embeddings delete-document [OPTIONS] DOCUMENT

  Delete a document from the embeddings store.

  DOCUMENT: The document string to be deleted from the store.

  DATABASE: Path to the ChromaDB database where the embeddings are stored. Can
  be a full path or a simple string.           If a simple string is provided,
  it will be saved in the XDG cache directory (~/.cache/chromadb/{database}).

  MODEL: Name or URL of the model used for generating embeddings. (Defaults to
  'paraphrase-multilingual-minilm-l12-v2')

Options:
  --database TEXT  ChromaDB database where the embeddings are stored.
  --model TEXT     Model name or URL used for generating embeddings. Defaults
                   to "paraphrase-multilingual-minilm-l12-v2".
  --help           Show this message and exit.

Project details

Release history

0.0.0 (Oct 25, 2024)

0.0.0a2, pre-release (Oct 25, 2024). This version.

Download files

Source Distribution

ovos-gguf-embeddings-plugin-0.0.0a2.tar.gz (18.3 kB)

Uploaded Oct 25, 2024 Source

Built Distribution

ovos_gguf_embeddings_plugin-0.0.0a2-py3-none-any.whl (14.9 kB)

Uploaded Oct 25, 2024 Python 3

Hashes for ovos-gguf-embeddings-plugin-0.0.0a2.tar.gz
Algorithm	Hash digest
SHA256	`fc365f1e64c64368442f1831f13c4fa21154a207a11647e81ef9ea9efb1e28cc`
MD5	`f8212a6b8baedbe725558d021c02e6e4`
BLAKE2b-256	`138914f1780ab2e09c1560d0cb0e611203204fc72410a27b402d47adf30a1e4c`

Hashes for ovos_gguf_embeddings_plugin-0.0.0a2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`36e9353b1ad9a194ce4ce8a6ec05fdfea5c8a7d8e25b22ee1d5219c9d5d207a2`
MD5	`2042c2c22003d903ad0b8db6e283ca91`
BLAKE2b-256	`55ce2be50c428b4e35ccda7b1f577a73a9d43fb451841f96246072b859dfd1e1`
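
A downloaded file can be checked against the digests listed above with Python's standard hashlib module. The sketch below is a minimal example; the helper names are hypothetical, and the file path you pass in is wherever you saved the download.

```python
import hashlib
from pathlib import Path


def file_digests(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Compute SHA256 and BLAKE2b-256 hex digests of a file, reading in chunks."""
    sha256 = hashlib.sha256()
    blake2 = hashlib.blake2b(digest_size=32)  # 32 bytes = BLAKE2b-256
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            sha256.update(chunk)
            blake2.update(chunk)
    return {"sha256": sha256.hexdigest(), "blake2b_256": blake2.hexdigest()}


def matches_sha256(path: Path, expected_hex: str) -> bool:
    """True if the file's SHA256 digest equals the expected hex string."""
    return file_digests(path)["sha256"] == expected_hex.strip().lower()
```

For example, `matches_sha256(Path("ovos-gguf-embeddings-plugin-0.0.0a2.tar.gz"), "fc365f1e64c64368442f1831f13c4fa21154a207a11647e81ef9ea9efb1e28cc")` should return True for an intact source distribution.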