Google Cloud Storage ObjectStorage plugin for mistralai-search-toolkit

Google Cloud Storage Plugin for Search Toolkit

Google Cloud Storage backend for mistralai-search-toolkit.

This plugin implements the Search Toolkit's ObjectStorage interface, enabling the ingestion pipeline to load files directly from Google Cloud Storage.

Installation

pip install mistralai-search-toolkit-storage-gcs

Or as an optional dependency of the core package:

pip install mistralai-search-toolkit[storage-gcs]
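
To confirm that the plugin is visible to the current interpreter after installation, a check along these lines may help. It is a minimal sketch that uses only the standard library; the helper name `installed_version` is hypothetical, and the distribution name is taken from the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("mistralai-search-toolkit-storage-gcs")
    print(version or "plugin is not installed in this environment")
```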

Quick Start: Load Files from GCS in Ingestion Pipeline

1. Upload a File to GCS

import asyncio
from mistralai.search.toolkit.plugins.storage.gcs import GCSObjectStorage

async def upload_file():
    storage = GCSObjectStorage(
        bucket_name="your-bucket",
        project_id="your-project",
    )

    # Upload a file
    with open("document.pdf", "rb") as f:
        data = f.read()

    await storage.put(key="documents/document.pdf", data=data)

asyncio.run(upload_file())
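
To upload more than one file, the same `put` call can be wrapped in a small helper. The sketch below is an illustration, not part of the plugin: `upload_directory` is a hypothetical name, and it assumes only that the storage object exposes `async put(key=..., data=...)` as in the example above.

```python
import asyncio
from pathlib import Path


async def upload_directory(storage, local_dir: str, prefix: str = "documents") -> list:
    """Upload every regular file under local_dir and return the keys written.

    `storage` is any object exposing `async put(key=..., data=...)`, which is
    the only method this sketch relies on.
    """
    root = Path(local_dir)
    paths = sorted(p for p in root.rglob("*") if p.is_file())

    async def upload_one(path: Path) -> str:
        # Read the file in a worker thread so it does not block the event loop.
        data = await asyncio.to_thread(path.read_bytes)
        key = f"{prefix}/{path.relative_to(root).as_posix()}"
        await storage.put(key=key, data=data)
        return key

    # gather() returns results in the same order as the sorted paths.
    return list(await asyncio.gather(*(upload_one(p) for p in paths)))
```

With the real plugin this would be called as `asyncio.run(upload_directory(storage, "./docs"))`. All uploads start at once, so for large directories a concurrency limit such as `asyncio.Semaphore` is probably worth adding.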

2. Load Files from GCS in Ingestion Pipeline

import asyncio
import os
from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.ingestion.pipelines import Pipeline
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
from mistralai.search.toolkit.embedders import MistralEmbedder, MODEL_1024_EMBEDDING
from mistralai.client import Mistral
from mistralai.search.toolkit.plugins.storage.gcs import GCSObjectStorage
from mistralai.search.toolkit.plugins.vespa import VespaClientConfig
from vespa_app import app

async def ingest_from_gcs():
    # Create GCS storage factory
    def gcs_storage_factory():
        return GCSObjectStorage(
            bucket_name="your-bucket",
            project_id="your-project",
        )

    # Create FileLoader backed by GCS
    file_loader = FileLoader(storage_factory=gcs_storage_factory)

    # Create ingestion pipeline
    # Fail fast with a KeyError if the API key is not set
    mistral_client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    vespa_config = VespaClientConfig(
        endpoint=os.environ.get("VESPA_ENDPOINT", "http://localhost:8080"),
    )
    vector_store = app.get_search_index(vespa_config, collection_name="articles")

    pipeline = Pipeline(
        loader=file_loader,
        text_splitter=CharacterTextSplitter(chunk_size=512),
        embedder=MistralEmbedder(client=mistral_client, model_name=MODEL_1024_EMBEDDING),
        stores=vector_store,
    )

    # Ingest documents from GCS
    num_chunks = await pipeline.run(documents=[
        "documents/document1.pdf",
        "documents/document2.pdf",
    ])

    print(f"Indexed {num_chunks} chunks")

asyncio.run(ingest_from_gcs())
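
In the example above, `FileLoader` receives a zero-argument callable rather than a storage instance. Assuming that is all the loader requires (the toolkit's exact contract is not stated here), the nested function could also be written with `functools.partial`. `StubStorage` below is a hypothetical stand-in so the sketch runs without GCP access.

```python
from functools import partial


class StubStorage:
    """Hypothetical stand-in for GCSObjectStorage; it only records its settings."""

    def __init__(self, bucket_name: str, project_id: str):
        self.bucket_name = bucket_name
        self.project_id = project_id


# Bind the constructor arguments once; each call builds a fresh instance.
storage_factory = partial(
    StubStorage,
    bucket_name="your-bucket",
    project_id="your-project",
)

first = storage_factory()
second = storage_factory()
print(first.bucket_name, first is second)  # your-bucket False
```

With the real plugin, `partial(GCSObjectStorage, bucket_name=..., project_id=...)` would take the place of `gcs_storage_factory`.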

Configuration

Basic Setup

storage = GCSObjectStorage(
    bucket_name="your-bucket",
    project_id="your-project",
)

Using Service Account

from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json"
)

storage = GCSObjectStorage(
    bucket_name="your-bucket",
    project_id="your-project",
    credentials=credentials,
)
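
When the key is supplied as a JSON string instead of a file (for example from a secret manager), google-auth's `service_account.Credentials.from_service_account_info` accepts the parsed dictionary. The sketch below assumes a hypothetical environment variable `GCS_SERVICE_ACCOUNT_JSON`; the helper names are illustrative and not part of the plugin.

```python
import json
import os

REQUIRED_FIELDS = {"type", "client_email", "private_key", "token_uri"}


def load_service_account_info(env_var: str = "GCS_SERVICE_ACCOUNT_JSON") -> dict:
    """Parse and sanity-check service account JSON stored in an environment variable."""
    raw = os.environ.get(env_var)
    if not raw:
        raise RuntimeError(f"{env_var} is not set")
    info = json.loads(raw)
    missing = REQUIRED_FIELDS - info.keys()
    if missing:
        raise ValueError(f"service account JSON is missing fields: {sorted(missing)}")
    if info["type"] != "service_account":
        raise ValueError(f"unexpected credential type: {info['type']!r}")
    return info


def credentials_from_env(env_var: str = "GCS_SERVICE_ACCOUNT_JSON"):
    """Build google-auth credentials from the parsed key material."""
    # Imported lazily so the parsing helper works without google-auth installed.
    from google.oauth2 import service_account

    return service_account.Credentials.from_service_account_info(
        load_service_account_info(env_var)
    )
```

The result of `credentials_from_env()` can be passed as the `credentials=` argument shown above.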

Authentication

Environment Variables

Set credentials using environment variables:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"

Or authenticate with gcloud CLI:

gcloud auth application-default login

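The plugin automatically uses credentials from:

  • The key file named by the GOOGLE_APPLICATION_CREDENTIALS environment variable
  • Application Default Credentials (created by the gcloud command above, or supplied by the attached service account when running in GCP)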
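
When authentication fails, it can help to see which local credential source is present. The sketch below is a diagnostic aid only, not the plugin's lookup logic: `describe_credential_source` is a hypothetical helper, and the file location it checks is the default written by `gcloud auth application-default login` on Linux and macOS (Windows uses a different directory).

```python
import os
from pathlib import Path


def describe_credential_source(environ=None, home=None) -> str:
    """Report which local credential source appears to be available."""
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)

    key_path = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path:
        if Path(key_path).is_file():
            return f"key file from GOOGLE_APPLICATION_CREDENTIALS: {key_path}"
        return f"GOOGLE_APPLICATION_CREDENTIALS points to a missing file: {key_path}"

    # Default location used by `gcloud auth application-default login`.
    adc_file = home / ".config" / "gcloud" / "application_default_credentials.json"
    if adc_file.is_file():
        return f"gcloud user credentials: {adc_file}"

    return "no local credentials found; only an attached GCP service account would work"


if __name__ == "__main__":
    print(describe_credential_source())
```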

License

This plugin is licensed under the Apache License 2.0.

Support

For Search Toolkit issues, refer to the Search Toolkit documentation.

For Google Cloud Storage documentation, visit GCS Docs.
