Haystack 2.x component to embed strings and Documents using VoyageAI Embedding models.


Voyage Embedders - Haystack

Custom component for Haystack (2.x) for creating embeddings using the VoyageAI Embedding Models.

Voyage’s embedding models, voyage-2 and voyage-code-2, are state-of-the-art in retrieval accuracy. These models outperform leading embedding models such as intfloat/e5-mistral-7b-instruct and OpenAI/text-embedding-3-large on the MTEB Benchmark. voyage-2 is currently ranked second on the MTEB Leaderboard.

What's New

  • [v1.3.0 - 18/03/24]:

    • Breaking Change: The import path for the embedders has been changed to haystack_integrations.components.embedders.voyage_embedders. Please replace all instances of from voyage_embedders.voyage_document_embedder import VoyageDocumentEmbedder and from voyage_embedders.voyage_text_embedder import VoyageTextEmbedder with
      from haystack_integrations.components.embedders.voyage_embedders import VoyageDocumentEmbedder, VoyageTextEmbedder.
    • The embedders now use the Haystack Secret API for authentication. For more information please see the Secret Management Documentation.
  • [v1.2.0 - 02/02/24]:

    • Breaking Change: VoyageDocumentEmbedder and VoyageTextEmbedder now accept the model parameter instead of model_name.
    • The embedders now use the new voyageai.Client.embed() method instead of the deprecated get_embedding and get_embeddings methods of the global namespace.
    • Support for the new truncate parameter has been added.
    • Default embedding model has been changed to "voyage-2" from the deprecated "voyage-01".
    • The embedders now return the total number of tokens used as part of the "total_tokens" in the metadata.
  • [v1.1.0 - 13/12/23]: Added support for the input_type parameter in VoyageTextEmbedder and VoyageDocumentEmbedder.

  • [v1.0.0 - 21/11/23]: Added VoyageTextEmbedder and VoyageDocumentEmbedder to embed strings and documents.
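
For codebases that still use the pre-1.3.0 import paths, the change described in the v1.3.0 entry can be applied mechanically. The following is a minimal sketch using only the standard library; the helper name migrate_imports is hypothetical and not part of this package, and it only handles the two import forms quoted in the changelog.

```python
import re

# Old module paths (pre-1.3.0), as quoted in the v1.3.0 changelog entry.
_OLD_IMPORT = re.compile(
    r"from\s+voyage_embedders\.voyage_(?:document|text)_embedder\s+import\s+"
)
# New package path introduced in v1.3.0.
_NEW_PREFIX = "from haystack_integrations.components.embedders.voyage_embedders import "


def migrate_imports(source: str) -> str:
    """Rewrite pre-1.3.0 import statements to the new package path.

    Hypothetical helper for illustration; it is not shipped with the package.
    """
    return _OLD_IMPORT.sub(_NEW_PREFIX, source)


old_code = (
    "from voyage_embedders.voyage_document_embedder import VoyageDocumentEmbedder\n"
    "from voyage_embedders.voyage_text_embedder import VoyageTextEmbedder\n"
)
print(migrate_imports(old_code))
```

The sketch leaves each import on its own line; merging them into the single combined import shown in the changelog is optional.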

Installation

pip install voyage-embedders-haystack

Usage

You can use Voyage Embedding models with two components: VoyageTextEmbedder and VoyageDocumentEmbedder.

To create semantic embeddings for documents, use VoyageDocumentEmbedder in your indexing pipeline. For generating embeddings for queries, use VoyageTextEmbedder.

Once you've selected the suitable component for your specific use case, initialize the component with the model name and VoyageAI API key. You can also set the environment variable VOYAGE_API_KEY instead of passing the API key as an argument.
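
The fallback described above (an explicit key if one is passed, otherwise the VOYAGE_API_KEY environment variable) can be sketched as follows. This only illustrates the lookup order and is not the package's internal implementation; resolve_api_key is a hypothetical name, and since v1.3.0 the embedders themselves handle authentication through the Haystack Secret API.

```python
import os
from typing import Optional

ENV_VAR = "VOYAGE_API_KEY"  # environment variable named in the text above


def resolve_api_key(explicit_key: Optional[str] = None) -> str:
    """Return the explicit key if given, otherwise read VOYAGE_API_KEY.

    Hypothetical helper for illustration only.
    """
    key = explicit_key or os.environ.get(ENV_VAR)
    if not key:
        raise ValueError(
            f"Pass an API key or set the {ENV_VAR} environment variable."
        )
    return key
```

Failing early with a clear message, as in this sketch, tends to be easier to debug than an authentication error raised later by the first embedding request.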

Information about the supported models can be found in the Embeddings Documentation.

To get an API key, please see the Voyage AI website.

Example

Below is an example semantic search pipeline that uses the Simple Wikipedia Dataset from HuggingFace. You can find more examples in the examples folder.

Load the dataset:

# Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.dataclasses import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Import Voyage Embedders
from haystack_integrations.components.embedders.voyage_embedders import VoyageDocumentEmbedder, VoyageTextEmbedder

# Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]

Index the documents to the InMemoryDocumentStore using the VoyageDocumentEmbedder and DocumentWriter:

doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
retriever = InMemoryEmbeddingRetriever(document_store=doc_store)
doc_writer = DocumentWriter(document_store=doc_store)

doc_embedder = VoyageDocumentEmbedder(
    model="voyage-2",
    input_type="document",
)
text_embedder = VoyageTextEmbedder(model="voyage-2", input_type="query")

# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=doc_writer, name="DocWriter")
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})

print(f"Number of documents in Document Store: {len(doc_store.filter_documents())}")
print(f"First Document: {doc_store.filter_documents()[0]}")
print(f"Embedding of first Document: {doc_store.filter_documents()[0].embedding}")

Query the Semantic Search Pipeline using the InMemoryEmbeddingRetriever and VoyageTextEmbedder:

text_embedder = VoyageTextEmbedder(model="voyage-2", input_type="query")

# Query Pipeline
query_pipeline = Pipeline()
query_pipeline.add_component(instance=text_embedder, name="TextEmbedder")
query_pipeline.add_component(instance=retriever, name="Retriever")
query_pipeline.connect("TextEmbedder.embedding", "Retriever.query_embedding")

# Search
results = query_pipeline.run({"TextEmbedder": {"text": "Which year did the Joker movie release?"}})

# Print text from top result
top_result = results["Retriever"]["documents"][0].content
print("The top search result is:")
print(top_result)
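
The document store in the example is created with embedding_similarity_function="cosine". To give an intuition for what that ranking means, here is a small standard-library sketch that scores toy vectors against a query vector by cosine similarity. It is not Haystack's implementation, and the function names and two-dimensional vectors are made up for illustration; real Voyage embeddings have many more dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], docs: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: hypothetical 2-dimensional "embeddings".
query_vec = [1.0, 0.0]
doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

ranked = rank_by_similarity(query_vec, doc_vecs)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

In the pipeline above, the query vector comes from VoyageTextEmbedder and the document vectors from VoyageDocumentEmbedder, so both sides must be embedded with the same model for the scores to be comparable.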

Contributing

Pull requests are welcome. For major changes, please open an issue first.

Author

Ashwin Mathur

License

voyage-embedders-haystack is distributed under the terms of the Apache-2.0 license.
