
Voyage Embedders and Rankers - Haystack

Custom components for Haystack for creating embeddings and reranking documents using the Voyage Models.

Voyage's embedding models are state-of-the-art in retrieval accuracy, outperforming top-performing embedding models such as intfloat/e5-mistral-7b-instruct and OpenAI/text-embedding-3-large on the MTEB benchmark.

What's New (v1.10.0)

  • Support for Voyage 4 model family (voyage-4, voyage-4-large, voyage-4-lite).
  • Voyage 4 models support flexible output dimensions (256, 512, 1024, 2048) and multiple output data types (float, int8, uint8, binary, ubinary).
  • Updated examples to use voyage-4 as the default model.
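As a rough illustration of why the smaller output data types matter for index size, here is a quick back-of-the-envelope sketch (an assumption for illustration: float embeddings stored as 4 bytes per dimension, int8/uint8 as 1 byte per dimension, and binary/ubinary packed at 1 bit per dimension):

```python
# Approximate storage per embedding vector for the Voyage 4 output options.
# Byte sizes are illustrative assumptions: float -> 4 bytes/dim,
# int8/uint8 -> 1 byte/dim, binary/ubinary -> 1 bit/dim (dim/8 bytes).
BYTES_PER_DIM = {"float": 4.0, "int8": 1.0, "uint8": 1.0, "binary": 1 / 8, "ubinary": 1 / 8}

def vector_size_bytes(dim: int, dtype: str) -> float:
    """Storage for one embedding vector, in bytes, under the assumptions above."""
    return dim * BYTES_PER_DIM[dtype]

for dim in (256, 512, 1024, 2048):
    for dtype in ("float", "int8", "binary"):
        print(f"dim={dim:5d} dtype={dtype:6s} -> {vector_size_bytes(dim, dtype):7.0f} bytes")
```

Under these assumptions, a 1024-dimensional binary vector takes 128 bytes versus 4096 bytes for floats, a 32x reduction, traded against some retrieval accuracy.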

See the full Changelog for all releases.

Installation

pip install voyage-embedders-haystack

Usage

You can use the Voyage embedding models with multiple components: VoyageTextEmbedder for strings, VoyageDocumentEmbedder for Documents, VoyageMultimodalEmbedder for multimodal inputs, and VoyageContextualizedDocumentEmbedder for contextualized chunk embeddings.

The Voyage reranker models can be used with the VoyageRanker component.

Multimodal Embeddings

The VoyageMultimodalEmbedder uses Voyage's multimodal embedding model (voyage-multimodal-3.5) to encode text, images, and videos into a shared vector space. This enables cross-modal similarity search where you can find images using text queries or find related content across different modalities.

Key features:

  • Supports text, images (PIL Images, ByteStream), and videos
  • Inputs can combine multiple modalities (e.g., text + image)
  • Variable output dimensions: 256, 512, 1024 (default), 2048
  • Recommended model: voyage-multimodal-3.5

Usage example:

from haystack.dataclasses import ByteStream
from haystack_integrations.components.embedders.voyage_embedders import VoyageMultimodalEmbedder
from voyageai.video_utils import Video

# Text-only embedding
embedder = VoyageMultimodalEmbedder(model="voyage-multimodal-3.5")
result = embedder.run(inputs=[["What is in this image?"]])

# Mixed text and image embedding
image_bytes = ByteStream.from_file_path("image.jpg")
result = embedder.run(inputs=[["Describe this image:", image_bytes]])

# Video embedding
video = Video.from_path("video.mp4", model="voyage-multimodal-3.5")
result = embedder.run(inputs=[["Describe this video:", video]])

Contextualized Chunk Embeddings

The VoyageContextualizedDocumentEmbedder uses Voyage's contextualized embedding models to encode document chunks "in context" with other chunks from the same document. This approach preserves semantic relationships between chunks and reduces context loss, leading to improved retrieval accuracy.

Key features:

  • Documents are grouped by a metadata field (default: source_id)
  • Chunks from the same source document are embedded together
  • Maintains semantic connections between related chunks
  • Recommended model: voyage-context-3

For detailed usage examples, see the contextualized embedder example.
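To make the grouping behavior concrete, here is a minimal, dependency-free sketch of the grouping step: chunks sharing the same source_id metadata value are collected into one batch so they can be embedded together. The helper below is illustrative only, not the component's actual implementation, and the sample chunks are made up:

```python
from collections import defaultdict

# Illustrative stand-in for Haystack Documents: (content, meta) pairs.
chunks = [
    ("Paris is the capital of France.", {"source_id": "doc-1"}),
    ("It is known for the Eiffel Tower.", {"source_id": "doc-1"}),
    ("Berlin is the capital of Germany.", {"source_id": "doc-2"}),
]

def group_by_source(chunks, group_key="source_id"):
    """Group chunk texts by a metadata field, preserving order within each group."""
    groups = defaultdict(list)
    for content, meta in chunks:
        groups[meta[group_key]].append(content)
    return dict(groups)

grouped = group_by_source(chunks)
# Each value is the list of chunk texts that would be embedded together,
# so each chunk's embedding can take its sibling chunks into account.
print(grouped)
```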

Once you've selected the component that suits your use case, initialize it with the model name and your VoyageAI API key. Alternatively, you can set the VOYAGE_API_KEY environment variable instead of passing the key as an argument. To get an API key, see the Voyage AI website.
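For example, on Linux or macOS you can export the key once per shell session (the value below is a placeholder):

```shell
export VOYAGE_API_KEY="your-api-key-here"
```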

Information about the supported models can be found in the Voyage AI documentation.

Example

You can find all the examples in the examples folder.

Below is an example semantic search pipeline that uses the Simple Wikipedia dataset from Hugging Face.

Load the dataset:

# Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.dataclasses import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Import Voyage Embedders
from haystack_integrations.components.embedders.voyage_embedders import VoyageDocumentEmbedder, VoyageTextEmbedder

# Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]

Index the documents to the InMemoryDocumentStore using the VoyageDocumentEmbedder and DocumentWriter:

doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
retriever = InMemoryEmbeddingRetriever(document_store=doc_store)
doc_writer = DocumentWriter(document_store=doc_store)

doc_embedder = VoyageDocumentEmbedder(
    model="voyage-4",
    input_type="document",
)

# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=doc_writer, name="DocWriter")
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})

print(f"Number of documents in Document Store: {len(doc_store.filter_documents())}")
print(f"First Document: {doc_store.filter_documents()[0]}")
print(f"Embedding of first Document: {doc_store.filter_documents()[0].embedding}")
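The document store above is configured with embedding_similarity_function="cosine". As a quick refresher, cosine similarity compares the direction of two embedding vectors rather than their magnitude. A minimal stdlib sketch, for illustration only (the document store computes this for you):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

Because cosine similarity ignores vector length, it scores a document by how closely its embedding points in the same direction as the query embedding.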

Query the Semantic Search Pipeline using the InMemoryEmbeddingRetriever and VoyageTextEmbedder:

text_embedder = VoyageTextEmbedder(model="voyage-4", input_type="query")

# Query Pipeline
query_pipeline = Pipeline()
query_pipeline.add_component(instance=text_embedder, name="TextEmbedder")
query_pipeline.add_component(instance=retriever, name="Retriever")
query_pipeline.connect("TextEmbedder.embedding", "Retriever.query_embedding")

# Search
results = query_pipeline.run({"TextEmbedder": {"text": "Which year did the Joker movie release?"}})

# Print text from top result
top_result = results["Retriever"]["documents"][0].content
print("The top search result is:")
print(top_result)

Contributing

We welcome contributions from the community! Please take a look at our contributing guide for details on how to get started. Pull requests are welcome; for major changes, please open an issue first to discuss the proposed changes.

License

voyage-embedders-haystack is distributed under the terms of the Apache-2.0 license.

Maintained by Ashwin Mathur.
