AWS S3 Storage Plugin for Search Toolkit
AWS S3 (and S3-compatible) object storage backend for mistralai-search-toolkit.
This plugin implements the Search Toolkit's ObjectStorage interface, enabling the ingestion pipeline to load files directly from S3.
Installation
pip install mistralai-search-toolkit-storage-s3
Or as an optional dependency of the core package:
pip install "mistralai-search-toolkit[storage-s3]"
Quick Start: Load Files from S3 in Ingestion Pipeline
1. Upload a File to S3
import asyncio

from mistralai.search.toolkit.plugins.storage.s3 import S3ObjectStorage


async def upload_file():
    storage = S3ObjectStorage(
        bucket_name="your-bucket",
        region_name="us-east-1",
    )

    # Upload a file
    with open("document.pdf", "rb") as f:
        data = f.read()
    await storage.put(key="documents/document.pdf", data=data)


asyncio.run(upload_file())
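To upload more than one file, the same `put(key=..., data=...)` call can be wrapped in a small loop. The following is a minimal sketch, not part of the plugin: `plan_uploads` and `upload_directory` are hypothetical helper names, and the only thing assumed about `storage` is the async `put` method used in the example above.

```python
import asyncio
from pathlib import Path


def plan_uploads(local_dir: str, prefix: str) -> list[tuple[Path, str]]:
    """Pair every file under local_dir with the object key it would get."""
    root = Path(local_dir)
    pairs = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Object keys use forward slashes regardless of the local OS.
            relative = path.relative_to(root).as_posix()
            pairs.append((path, f"{prefix.rstrip('/')}/{relative}"))
    return pairs


async def upload_directory(storage, local_dir: str, prefix: str = "documents") -> list[str]:
    """Upload each planned file with storage.put and return the keys written.

    `storage` is assumed to expose `async put(key=..., data=...)`, as
    S3ObjectStorage does in the example above.
    """
    keys = []
    for path, key in plan_uploads(local_dir, prefix):
        await storage.put(key=key, data=path.read_bytes())
        keys.append(key)
    return keys
```

With a real bucket this would be called as `asyncio.run(upload_directory(storage, "./docs"))`. Files are read fully into memory and uploaded one at a time, which keeps the sketch simple but may not suit very large files.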
2. Load Files from S3 in Ingestion Pipeline
import asyncio
import os

from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.ingestion.pipelines import Pipeline
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
from mistralai.search.toolkit.embedders import MistralEmbedder, MODEL_1024_EMBEDDING
from mistralai.client import Mistral
from mistralai.search.toolkit.plugins.storage.s3 import S3ObjectStorage
from mistralai.search.toolkit.plugins.vespa import VespaClientConfig
from vespa_app import app


async def ingest_from_s3():
    # Create S3 storage factory
    def s3_storage_factory():
        return S3ObjectStorage(
            bucket_name="your-bucket",
            region_name="us-east-1",
        )

    # Create FileLoader backed by S3
    file_loader = FileLoader(storage_factory=s3_storage_factory)

    # Create ingestion pipeline
    mistral_client = Mistral(api_key=os.environ.get("MISTRAL_API_KEY"))
    vespa_config = VespaClientConfig(
        endpoint=os.environ.get("VESPA_ENDPOINT", "http://localhost:8080"),
    )
    vector_store = app.get_search_index(vespa_config, collection_name="articles")
    pipeline = Pipeline(
        loader=file_loader,
        text_splitter=CharacterTextSplitter(chunk_size=512),
        embedder=MistralEmbedder(client=mistral_client, model_name=MODEL_1024_EMBEDDING),
        stores=vector_store,
    )

    # Ingest documents from S3
    num_chunks = await pipeline.run(documents=[
        "documents/document1.pdf",
        "documents/document2.pdf",
    ])
    print(f"Indexed {num_chunks} chunks")


asyncio.run(ingest_from_s3())
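The `storage_factory` argument above is a zero-argument function that returns a new storage object. The sketch below shows one way to build such a factory with `functools.partial` instead of a nested function. It rests on two assumptions: that `FileLoader` accepts any zero-argument callable (the example above only shows a plain function), and `StandInStorage`, a hypothetical stand-in class used here so the snippet runs without the plugin installed. `make_storage_factory` is likewise an illustrative name.

```python
from dataclasses import dataclass
from functools import partial
from typing import Callable, Optional


@dataclass
class StandInStorage:
    """Hypothetical stand-in for S3ObjectStorage; not part of the plugin."""
    bucket_name: str
    region_name: Optional[str] = None
    endpoint_url: Optional[str] = None


def make_storage_factory(storage_cls, **kwargs) -> Callable[[], object]:
    """Return a zero-argument callable that builds a fresh storage object."""
    return partial(storage_cls, **kwargs)


# Each call to the factory constructs a new, independent storage object.
factory = make_storage_factory(
    StandInStorage,
    bucket_name="your-bucket",
    region_name="us-east-1",
)
storage = factory()
```

In real use, `StandInStorage` would be replaced by `S3ObjectStorage` and the result passed as `FileLoader(storage_factory=factory)`.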
Configuration
Basic Setup
storage = S3ObjectStorage(
    bucket_name="your-bucket",
    region_name="us-east-1",
)
With Credentials
storage = S3ObjectStorage(
    bucket_name="your-bucket",
    region_name="us-east-1",
    access_key="your-access-key",
    secret_key="your-secret-key",
)
S3-Compatible Services
The plugin also works with MinIO, DigitalOcean Spaces, and other S3-compatible services. Pass the service's address as endpoint_url:
storage = S3ObjectStorage(
    bucket_name="bucket",
    endpoint_url="https://minio.example.com",
    access_key="minioadmin",
    secret_key="minioadmin",
)
Local Development
For testing without AWS, use MinIO:
docker run -p 9000:9000 -p 9001:9001 minio/minio server /data --console-address ":9001"
Then point the storage at the local MinIO instance:
storage = S3ObjectStorage(
    bucket_name="documents",
    endpoint_url="http://localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
)
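The configurations above differ only in which keyword arguments are passed, so one way to switch between AWS and a local MinIO without editing code is to read them from the environment. This is a minimal sketch under stated assumptions: `s3_storage_kwargs` is a hypothetical helper, and `S3_BUCKET` and `S3_ENDPOINT_URL` are made-up variable names. Only the parameter names (`bucket_name`, `region_name`, `endpoint_url`, `access_key`, `secret_key`) come from the examples in this README.

```python
import os
from typing import Mapping, Optional


def s3_storage_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for S3ObjectStorage from environment variables."""
    env = os.environ if env is None else env

    # S3_BUCKET is a hypothetical variable name; a missing value raises KeyError.
    kwargs = {"bucket_name": env["S3_BUCKET"]}

    # Optional settings are included only when the variable is set and non-empty.
    optional = {
        "region_name": "AWS_REGION",
        "endpoint_url": "S3_ENDPOINT_URL",  # hypothetical; set this for MinIO
        "access_key": "AWS_ACCESS_KEY_ID",
        "secret_key": "AWS_SECRET_ACCESS_KEY",
    }
    for param, var in optional.items():
        value = env.get(var)
        if value:
            kwargs[param] = value
    return kwargs
```

The result can then be unpacked into the constructor, as in `S3ObjectStorage(**s3_storage_kwargs())`. This also keeps credentials out of source files.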
License
This plugin is licensed under the Apache License 2.0.
Support
For Search Toolkit issues, refer to the Search Toolkit documentation.
For AWS S3 documentation, visit AWS S3 Docs.