Voyage Embedders - Haystack
Custom component for Haystack (2.x) for creating embeddings using the VoyageAI Embedding Models.
What's New
- [v1.1.0 - 13/12/23]: Added support for the input_type parameter in VoyageTextEmbedder and VoyageDocumentEmbedder.
- [v1.0.0 - 21/11/23]: Added VoyageTextEmbedder and VoyageDocumentEmbedder to embed strings and documents.
Installation
pip install voyage-embedders-haystack
Usage
You can use Voyage Embedding models with two components: VoyageTextEmbedder and VoyageDocumentEmbedder.
To create semantic embeddings for documents, use VoyageDocumentEmbedder in your indexing pipeline. To generate embeddings for queries, use VoyageTextEmbedder. Once you've selected the component that suits your use case, initialize it with the model name and your VoyageAI API key. Alternatively, you can set the VOYAGE_API_KEY environment variable instead of passing the API key as an argument.
Information about the supported models can be found in the Embeddings Documentation.
To get an API key, see the Voyage AI website.
Example
Below is an example Semantic Search pipeline that uses the Simple Wikipedia Dataset from HuggingFace. You can find more examples in the examples folder.
Load the dataset:
# Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Pipeline
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.dataclasses import Document
from haystack.document_stores import InMemoryDocumentStore
# Import Voyage Embedders
from voyage_embedders.voyage_document_embedder import VoyageDocumentEmbedder
from voyage_embedders.voyage_text_embedder import VoyageTextEmbedder
# Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")
docs = [
Document(
content=doc["text"],
meta={
"title": doc["title"],
"url": doc["url"],
},
)
for doc in dataset
]
Index the documents to the InMemoryDocumentStore using the VoyageDocumentEmbedder and DocumentWriter:
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
doc_embedder = VoyageDocumentEmbedder(
model_name="voyage-01",
input_type="document",
batch_size=8,
)
# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=doc_store), name="DocWriter")
indexing_pipeline.connect(connect_from="DocEmbedder", connect_to="DocWriter")
indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
print(f"Number of documents in Document Store: {len(doc_store.filter_documents())}")
print(f"First Document: {doc_store.filter_documents()[0]}")
print(f"Embedding of first Document: {doc_store.filter_documents()[0].embedding}")
Query the Semantic Search Pipeline using the InMemoryEmbeddingRetriever and VoyageTextEmbedder:
text_embedder = VoyageTextEmbedder(model_name="voyage-01", input_type="query")
# Query Pipeline
query_pipeline = Pipeline()
query_pipeline.add_component("TextEmbedder", text_embedder)
query_pipeline.add_component("Retriever", InMemoryEmbeddingRetriever(document_store=doc_store))
query_pipeline.connect("TextEmbedder", "Retriever")
# Search
results = query_pipeline.run({"TextEmbedder": {"text": "Which year did the Joker movie release?"}})
# Print text from top result
top_result = results["Retriever"]["documents"][0].content
print("The top search result is:")
print(top_result)
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Author
License
voyage-embedders-haystack is distributed under the terms of the Apache-2.0 license.