AWS S3 (and S3-compatible) ObjectStorage plugin for mistralai-search-toolkit

AWS S3 Storage Plugin for Search Toolkit

AWS S3 (and S3-compatible) object storage backend for mistralai-search-toolkit.

This plugin implements the Search Toolkit's ObjectStorage interface, enabling the ingestion pipeline to load files directly from S3.

Installation

pip install mistralai-search-toolkit-storage-s3

Or as an optional dependency of the core package:

pip install "mistralai-search-toolkit[storage-s3]"
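
To confirm that the plugin landed in the active environment, a small check along these lines may help. It is only a sketch: `installed_version` is a hypothetical helper name, and it relies on the standard library's `importlib.metadata` rather than on anything provided by the toolkit.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Distribution name taken from the pip command above.
print(installed_version("mistralai-search-toolkit-storage-s3"))
```

A result of `None` means pip installed into a different environment than the one running the script.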

Quick Start: Load Files from S3 in Ingestion Pipeline

1. Upload a File to S3

import asyncio
from mistralai.search.toolkit.plugins.storage.s3 import S3ObjectStorage

async def upload_file():
    storage = S3ObjectStorage(
        bucket_name="your-bucket",
        region_name="us-east-1",
    )

    # Upload a file
    with open("document.pdf", "rb") as f:
        data = f.read()

    await storage.put(key="documents/document.pdf", data=data)

asyncio.run(upload_file())
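
The same `put(key=..., data=...)` call can be repeated to upload a whole folder. The sketch below is illustrative only: `upload_directory` and `InMemoryStorage` are hypothetical names, and the in-memory class is a stand-in that mimics the `put` call shown above so the example runs without S3. In real use, an `S3ObjectStorage` instance would be passed instead.

```python
import asyncio
from pathlib import Path


class InMemoryStorage:
    """Hypothetical stand-in exposing the same put(key=..., data=...) call."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    async def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data


async def upload_directory(storage, directory: str, prefix: str) -> list[str]:
    """Upload every regular file under `directory` and return the keys written."""
    root = Path(directory)
    keys = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Build a forward-slash key relative to the directory root.
        key = f"{prefix.rstrip('/')}/{path.relative_to(root).as_posix()}"
        await storage.put(key=key, data=path.read_bytes())
        keys.append(key)
    return keys
```

The returned keys are in the form the ingestion example in the next step passes to `pipeline.run(documents=[...])`.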

2. Load Files from S3 in Ingestion Pipeline

import asyncio
import os
from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.ingestion.pipelines import Pipeline
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
from mistralai.search.toolkit.embedders import MistralEmbedder, MODEL_1024_EMBEDDING
from mistralai.client import Mistral
from mistralai.search.toolkit.plugins.storage.s3 import S3ObjectStorage
from mistralai.search.toolkit.plugins.vespa import VespaClientConfig
from vespa_app import app

async def ingest_from_s3():
    # Create S3 storage factory
    def s3_storage_factory():
        return S3ObjectStorage(
            bucket_name="your-bucket",
            region_name="us-east-1",
        )

    # Create FileLoader backed by S3
    file_loader = FileLoader(storage_factory=s3_storage_factory)

    # Create ingestion pipeline
    mistral_client = Mistral(api_key=os.environ.get("MISTRAL_API_KEY"))
    vespa_config = VespaClientConfig(
        endpoint=os.environ.get("VESPA_ENDPOINT", "http://localhost:8080"),
    )
    vector_store = app.get_search_index(vespa_config, collection_name="articles")

    pipeline = Pipeline(
        loader=file_loader,
        text_splitter=CharacterTextSplitter(chunk_size=512),
        embedder=MistralEmbedder(client=mistral_client, model_name=MODEL_1024_EMBEDDING),
        stores=vector_store,
    )

    # Ingest documents from S3
    num_chunks = await pipeline.run(documents=[
        "documents/document1.pdf",
        "documents/document2.pdf",
    ])

    print(f"Indexed {num_chunks} chunks")

asyncio.run(ingest_from_s3())
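
The example reads `MISTRAL_API_KEY` with `os.environ.get`, which returns `None` when the variable is unset, so a missing key only surfaces later as an API error. One possible guard, sketched here with a hypothetical helper named `require_env`, fails early with a clearer message:

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} must be set before running ingestion"
        )
    return value
```

In the pipeline above, this could replace the lookup as `Mistral(api_key=require_env("MISTRAL_API_KEY"))`.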

Configuration

Basic Setup

storage = S3ObjectStorage(
    bucket_name="your-bucket",
    region_name="us-east-1",
)

With Credentials

storage = S3ObjectStorage(
    bucket_name="your-bucket",
    region_name="us-east-1",
    access_key="your-access-key",
    secret_key="your-secret-key",
)
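
Hardcoding keys in source is easy to leak, so the constructor arguments can instead be assembled from environment variables. The following is a sketch under assumptions: `s3_kwargs_from_env` is a hypothetical helper, `S3_BUCKET` is an illustrative variable name, and the `AWS_*` names follow the usual AWS convention. Only the keyword names (`bucket_name`, `region_name`, `access_key`, `secret_key`) come from the examples above.

```python
import os
from typing import Mapping


def s3_kwargs_from_env(env: Mapping[str, str] = os.environ) -> dict:
    """Build S3ObjectStorage keyword arguments from environment-style settings."""
    kwargs = {
        "bucket_name": env["S3_BUCKET"],  # hypothetical variable name
        "region_name": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }
    access_key = env.get("AWS_ACCESS_KEY_ID")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY")
    # Pass credentials only when both halves are present.
    if access_key and secret_key:
        kwargs["access_key"] = access_key
        kwargs["secret_key"] = secret_key
    return kwargs
```

Usage would then look like `storage = S3ObjectStorage(**s3_kwargs_from_env())`. When no credentials are set, the call reduces to the Basic Setup form.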

S3-Compatible Services

The plugin works with MinIO, DigitalOcean Spaces, and other S3-compatible services when endpoint_url points at the service:

storage = S3ObjectStorage(
    bucket_name="bucket",
    endpoint_url="https://minio.example.com",
    access_key="minioadmin",
    secret_key="minioadmin",
)

Local Development

For testing without AWS, use MinIO:

docker run -p 9000:9000 -p 9001:9001 minio/minio server /data --console-address ":9001"

Configure the plugin to use the local MinIO instance:

storage = S3ObjectStorage(
    bucket_name="documents",
    endpoint_url="http://localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
)
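
Before pointing the plugin at a local container, it can be useful to check that MinIO is actually accepting requests. The sketch below queries MinIO's liveness endpoint (`/minio/health/live`) using only the standard library. `is_minio_live` is a hypothetical helper and is not part of the plugin.

```python
import urllib.request


def is_minio_live(endpoint_url: str, timeout: float = 2.0) -> bool:
    """Return True if the MinIO liveness endpoint answers with HTTP 200."""
    url = endpoint_url.rstrip("/") + "/minio/health/live"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False
```

For the container started above, the call would be `is_minio_live("http://localhost:9000")`. The bucket named in `bucket_name` may still need to be created, for example through the MinIO console.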

License

This plugin is licensed under the Apache License 2.0.

Support

For Search Toolkit issues, refer to the Search Toolkit documentation.

For questions about AWS S3 itself, see the AWS S3 documentation.
