AWS S3 (and S3-compatible) ObjectStorage plugin for mistralai-search-toolkit

AWS S3 Storage Plugin for Search Toolkit

AWS S3 (and S3-compatible) object storage backend for mistralai-search-toolkit.

This plugin implements the Search Toolkit's ObjectStorage interface, enabling the ingestion pipeline to load files directly from S3.

Installation

pip install mistralai-search-toolkit-storage-s3

Or install it as an optional extra of the core package (the quotes keep shells such as zsh from interpreting the square brackets):

pip install "mistralai-search-toolkit[storage-s3]"

Quick Start: Load Files from S3 in the Ingestion Pipeline

1. Upload a File to S3

import asyncio
from mistralai.search.toolkit.plugins.storage.s3 import S3BlobStorage

async def upload_file():
    storage = S3BlobStorage(
        bucket_name="your-bucket",
        region_name="us-east-1",
    )

    # Upload a file
    with open("document.pdf", "rb") as f:
        data = f.read()

    await storage.put(key="documents/document.pdf", data=data)

asyncio.run(upload_file())
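
When several files need to be uploaded, the same put call can be issued concurrently. The following is a minimal sketch, not part of the plugin: upload_many and its prefix parameter are hypothetical helpers, and the only storage method it relies on is put(key=..., data=...) as shown above.

```python
import asyncio
from pathlib import Path


async def upload_many(storage, paths, prefix="documents/"):
    """Upload local files concurrently and return the keys that were written.

    `storage` is any object exposing `async put(key=..., data=...)`,
    for example the S3BlobStorage instance from the example above.
    """

    async def upload_one(path: Path) -> str:
        # Read the file in a worker thread so the event loop is not blocked.
        data = await asyncio.to_thread(path.read_bytes)
        key = f"{prefix}{path.name}"
        await storage.put(key=key, data=data)
        return key

    # gather preserves the order of its inputs in the returned list.
    return await asyncio.gather(*(upload_one(Path(p)) for p in paths))
```

Inside an async function, call it as keys = await upload_many(storage, ["a.pdf", "b.pdf"]). The returned keys can then be passed to the ingestion pipeline in the next step.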

2. Load Files from S3 in Ingestion Pipeline

import asyncio
import os
from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.ingestion.pipelines import Pipeline
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
from mistralai.search.toolkit.embedders import MistralEmbedder, MODEL_1024_EMBEDDING
from mistralai.client import Mistral
from mistralai.search.toolkit.plugins.storage.s3 import S3BlobStorage
from mistralai.search.toolkit.plugins.vespa import VespaClientConfig
from vespa_app import app

async def ingest_from_s3():
    # Create S3 storage factory
    def s3_storage_factory():
        return S3BlobStorage(
            bucket_name="your-bucket",
            region_name="us-east-1",
        )

    # Create FileLoader backed by S3
    file_loader = FileLoader(storage_factory=s3_storage_factory)

    # Create ingestion pipeline
    mistral_client = Mistral(api_key=os.environ.get("MISTRAL_API_KEY"))
    vespa_config = VespaClientConfig(
        endpoint=os.environ.get("VESPA_ENDPOINT", "http://localhost:8080"),
    )
    vector_store = app.get_search_index(vespa_config, collection_name="articles")

    pipeline = Pipeline(
        loader=file_loader,
        text_splitter=CharacterTextSplitter(chunk_size=512),
        embedder=MistralEmbedder(client=mistral_client, model_name=MODEL_1024_EMBEDDING),
        stores=vector_store,
    )

    # Ingest documents from S3
    num_chunks = await pipeline.run(documents=[
        "documents/document1.pdf",
        "documents/document2.pdf",
    ])

    print(f"Indexed {num_chunks} chunks")

asyncio.run(ingest_from_s3())

Configuration

Basic Setup

storage = S3BlobStorage(
    bucket_name="your-bucket",
    region_name="us-east-1",
)

With Credentials

storage = S3BlobStorage(
    bucket_name="your-bucket",
    region_name="us-east-1",
    access_key="your-access-key",
    secret_key="your-secret-key",
)
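
To keep secrets out of source code, the credentials can be read from the environment instead of being written inline. The sketch below is an illustration under assumptions: s3_settings_from_env is a hypothetical helper, the variable names AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION follow the usual AWS convention, and the returned dictionary uses only the constructor parameters shown above.

```python
import os


def s3_settings_from_env(bucket_name, env=None):
    """Build keyword arguments for S3BlobStorage from environment variables."""
    env = os.environ if env is None else env
    settings = {
        "bucket_name": bucket_name,
        # Assumption: fall back to the region used in the examples above.
        "region_name": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }
    access_key = env.get("AWS_ACCESS_KEY_ID")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY")
    # A half-configured key pair is almost always a mistake, so fail early.
    if bool(access_key) != bool(secret_key):
        raise ValueError(
            "Set both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, or neither"
        )
    if access_key:
        settings["access_key"] = access_key
        settings["secret_key"] = secret_key
    return settings


# Usage (requires the plugin to be installed):
# storage = S3BlobStorage(**s3_settings_from_env("your-bucket"))
```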

S3-Compatible Services

Set endpoint_url to point the plugin at MinIO, DigitalOcean Spaces, or another S3-compatible service:

storage = S3BlobStorage(
    bucket_name="bucket",
    endpoint_url="https://minio.example.com",
    access_key="minioadmin",
    secret_key="minioadmin",
)

Local Development

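To test without AWS, run MinIO locally. The --console-address flag pins the web console to port 9001 so that it matches the published port:

docker run -p 9000:9000 -p 9001:9001 minio/minio server /data --console-address ":9001"

Configure the plugin to use the local MinIO instance (minioadmin/minioadmin are MinIO's default root credentials):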

storage = S3BlobStorage(
    bucket_name="documents",
    endpoint_url="http://localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
)
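
A fresh MinIO container starts with no buckets, so the "documents" bucket may need to be created before the first upload. The sketch below assumes the plugin does not create buckets itself (the source does not say either way) and uses boto3, which is an assumption and not a documented dependency of this plugin. ensure_bucket and make_local_minio_client are hypothetical helper names.

```python
def ensure_bucket(client, bucket_name):
    """Create bucket_name if the S3-compatible endpoint does not list it yet.

    `client` is a boto3 S3 client (or any object with the same
    list_buckets / create_bucket methods).
    Returns True if the bucket was created, False if it already existed.
    """
    existing = {b["Name"] for b in client.list_buckets().get("Buckets", [])}
    if bucket_name in existing:
        return False
    client.create_bucket(Bucket=bucket_name)
    return True


def make_local_minio_client():
    """Build a boto3 client for the local MinIO container started above."""
    # Assumption: boto3 is installed and MinIO uses its default credentials.
    import boto3

    return boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",
        aws_access_key_id="minioadmin",
        aws_secret_access_key="minioadmin",
        region_name="us-east-1",
    )
```

With MinIO running, ensure_bucket(make_local_minio_client(), "documents") prepares the bucket used in the configuration above.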

License

This plugin is licensed under the Apache License 2.0.

Support

For Search Toolkit issues, refer to the Search Toolkit documentation.

For AWS S3 documentation, visit AWS S3 Docs.
