
Google Cloud Storage ObjectStorage plugin for mistralai-search-toolkit


Google Cloud Storage Plugin for Search Toolkit

Google Cloud Storage backend for mistralai-search-toolkit.

This plugin implements the Search Toolkit's ObjectStorage interface, enabling the ingestion pipeline to load files directly from Google Cloud Storage.

Installation

pip install mistralai-search-toolkit-storage-gcs

Or install it as an optional extra of the core package (the quotes keep shells such as zsh from interpreting the brackets):

pip install "mistralai-search-toolkit[storage-gcs]"

Quick Start: Load Files from GCS in Ingestion Pipeline

1. Upload a File to GCS

import asyncio
from mistralai.search.toolkit.plugins.storage.gcs import GCSBlobStorage

async def upload_file():
    storage = GCSBlobStorage(
        bucket_name="your-bucket",
        project_id="your-project",
    )

    # Read the local file into memory
    with open("document.pdf", "rb") as f:
        data = f.read()

    # Upload the bytes under the given object key
    await storage.put(key="documents/document.pdf", data=data)

asyncio.run(upload_file())
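
To upload more than one file, the same put call can be run concurrently. The following is a minimal sketch, not part of the plugin: upload_directory, prefix, and max_concurrency are hypothetical names, and the only storage method it relies on is the put(key=..., data=...) call shown above.

```python
import asyncio
from pathlib import Path


async def upload_directory(storage, local_dir, prefix="documents", max_concurrency=4):
    """Upload every regular file under local_dir and return the keys written.

    `storage` is any object exposing `async put(key=..., data=...)`,
    as GCSBlobStorage does in the example above.
    """
    semaphore = asyncio.Semaphore(max_concurrency)
    root = Path(local_dir)

    async def upload_one(path):
        # Build the object key from the prefix and the path relative to local_dir
        key = f"{prefix}/{path.relative_to(root).as_posix()}"
        async with semaphore:  # cap the number of uploads in flight
            await storage.put(key=key, data=path.read_bytes())
        return key

    files = sorted(p for p in root.rglob("*") if p.is_file())
    return list(await asyncio.gather(*(upload_one(p) for p in files)))
```

Call it with a GCSBlobStorage instance, for example keys = await upload_directory(storage, "./docs"). Each file is read fully into memory before upload, so keep max_concurrency low for large files.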

2. Load Files from GCS in Ingestion Pipeline

import asyncio
import os
from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.ingestion.pipelines import Pipeline
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
from mistralai.search.toolkit.embedders import MistralEmbedder, MODEL_1024_EMBEDDING
from mistralai.client import Mistral
from mistralai.search.toolkit.plugins.storage.gcs import GCSBlobStorage
from mistralai.search.toolkit.plugins.vespa import VespaClientConfig
from vespa_app import app  # not part of this plugin; assumed to be your own module defining the Vespa application

async def ingest_from_gcs():
    # Create GCS storage factory
    def gcs_storage_factory():
        return GCSBlobStorage(
            bucket_name="your-bucket",
            project_id="your-project",
        )

    # Create FileLoader backed by GCS
    file_loader = FileLoader(storage_factory=gcs_storage_factory)

    # Create ingestion pipeline
    mistral_client = Mistral(api_key=os.environ.get("MISTRAL_API_KEY"))
    vespa_config = VespaClientConfig(
        endpoint=os.environ.get("VESPA_ENDPOINT", "http://localhost:8080"),
    )
    vector_store = app.get_search_index(vespa_config, collection_name="articles")

    pipeline = Pipeline(
        loader=file_loader,
        text_splitter=CharacterTextSplitter(chunk_size=512),
        embedder=MistralEmbedder(client=mistral_client, model_name=MODEL_1024_EMBEDDING),
        stores=vector_store,
    )

    # Ingest documents from GCS
    num_chunks = await pipeline.run(documents=[
        "documents/document1.pdf",
        "documents/document2.pdf",
    ])

    print(f"Indexed {num_chunks} chunks")

asyncio.run(ingest_from_gcs())
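
For a long list of object keys, it can help to run the pipeline in batches so that a failure does not discard all progress. The sketch below is a hypothetical helper, not a toolkit API: ingest_in_batches and batch_size are illustrative names, and it assumes pipeline.run(documents=...) returns the number of indexed chunks, as the example above suggests.

```python
async def ingest_in_batches(pipeline, keys, batch_size=50):
    """Run the pipeline over `keys` in fixed-size batches.

    Returns the total chunk count, assuming `pipeline.run` returns
    the number of chunks indexed for each call.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    total = 0
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        # Batches run one after another, so an exception stops at the failing batch
        total += await pipeline.run(documents=batch)
    return total
```

Inside ingest_from_gcs, this would replace the single pipeline.run call: num_chunks = await ingest_in_batches(pipeline, keys).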

Configuration

Basic Setup

storage = GCSBlobStorage(
    bucket_name="your-bucket",
    project_id="your-project",
)
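
To avoid hard-coding the bucket and project, the two arguments can be read from the environment. This is a sketch under stated assumptions: the variable names GCS_BUCKET_NAME and GCS_PROJECT_ID and the helper gcs_settings_from_env are hypothetical and are not read by the plugin itself.

```python
import os


def gcs_settings_from_env(environ=None):
    """Build the GCSBlobStorage keyword arguments from environment variables.

    GCS_BUCKET_NAME and GCS_PROJECT_ID are illustrative names chosen for
    this sketch; the plugin does not define them.
    """
    env = os.environ if environ is None else environ
    required = ("GCS_BUCKET_NAME", "GCS_PROJECT_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early with a clear message instead of passing None to the client
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "bucket_name": env["GCS_BUCKET_NAME"],
        "project_id": env["GCS_PROJECT_ID"],
    }
```

The result can then be unpacked into the constructor: storage = GCSBlobStorage(**gcs_settings_from_env()).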

Using Service Account

from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json"
)

storage = GCSBlobStorage(
    bucket_name="your-bucket",
    project_id="your-project",
    credentials=credentials,
)

Authentication

Environment Variables

Point to a service account key file with an environment variable:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"

Or authenticate with the gcloud CLI:

gcloud auth application-default login

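The plugin will automatically use Application Default Credentials (ADC), which are looked up in this order:

  • The key file named by the GOOGLE_APPLICATION_CREDENTIALS environment variable
  • User credentials created by gcloud auth application-default login
  • The service account attached to the runtime (if running in GCP)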
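
When authentication fails, it helps to know which credential source would be picked up. The sketch below mirrors the usual Application Default Credentials lookup (the environment variable first, then the file written by gcloud auth application-default login, then the runtime's attached service account) using only the standard library. It is a simplified diagnostic, not the plugin's or google-auth's implementation; adc_source and its return labels are hypothetical.

```python
import os
from pathlib import Path

ADC_FILENAME = "application_default_credentials.json"


def default_gcloud_dir(env):
    """Return the gcloud configuration directory for this platform."""
    if env.get("CLOUDSDK_CONFIG"):
        return Path(env["CLOUDSDK_CONFIG"])
    if os.name == "nt":
        return Path(env.get("APPDATA", "")) / "gcloud"
    return Path.home() / ".config" / "gcloud"


def adc_source(environ=None, gcloud_dir=None):
    """Guess which credential source Application Default Credentials would use.

    Returns one of: "env_file", "env_missing", "gcloud_user", "runtime".
    """
    env = os.environ if environ is None else environ

    # 1. An explicit key file always takes precedence
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        return "env_file" if Path(explicit).is_file() else "env_missing"

    # 2. User credentials written by `gcloud auth application-default login`
    config_dir = Path(gcloud_dir) if gcloud_dir is not None else default_gcloud_dir(env)
    if (config_dir / ADC_FILENAME).is_file():
        return "gcloud_user"

    # 3. Otherwise only an attached service account (on GCP) can supply credentials
    return "runtime"
```

A result of "env_missing" means the variable is set but does not point to an existing file, which is a common cause of authentication errors.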

License

This plugin is licensed under the Apache License 2.0.

Support

For Search Toolkit issues, refer to the Search Toolkit documentation.

For Google Cloud Storage documentation, visit GCS Docs.
