Modular framework for building IR systems
Search Toolkit
Modular, backend-agnostic framework for building and evaluating Information Retrieval systems.
Overview
Search Toolkit provides plug-and-play, extensible components for building production-ready IR pipelines. Every component is swappable and customizable, so you can build exactly what your use case needs.
What's Included
Core Components
- Ingestion: Document loaders, extractors, text splitters, enrichment, and indexing pipelines
- Retrieval: Vector (semantic), keyword (BM25), and hybrid search with RRF fusion
- Query Processing: LLM reformulation and custom preprocessing
- Reranking: Rerank results to surface the most relevant information
- Embedders: Generate embeddings for documents and queries using Mistral's embedding models
- Storage: Abstract object storage interface for document persistence
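To make the "hybrid search with RRF fusion" item concrete, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is not the toolkit's implementation: the function name `reciprocal_rank_fusion`, the document IDs, and the constant `k=60` (a commonly used default) are assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the top hit of each list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from a vector retriever and a keyword retriever.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that rank well in both lists (here `doc_b` and `doc_a`) rise to the top, because RRF uses only rank positions and never compares raw scores across retrievers.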
Backend Agnostic
The toolkit is designed to work with different search backends through plugins. You can use it with any vector database or search engine by installing the appropriate plugin.
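One way to picture a backend-agnostic design is a small interface that pipeline code depends on, with each backend supplying its own implementation. The sketch below is hypothetical: `SearchBackend`, `InMemoryBackend`, and `run_query` are illustrative names, not part of the toolkit's API.

```python
from typing import Protocol


class SearchBackend(Protocol):
    """Hypothetical interface that any backend plugin would satisfy."""

    def index(self, doc_id: str, text: str) -> None: ...

    def search(self, query: str, top_k: int) -> list[str]: ...


class InMemoryBackend:
    """Toy backend that scores documents by query-term occurrence counts."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def index(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = text

    def search(self, query: str, top_k: int) -> list[str]:
        terms = query.lower().split()
        scored = [
            (sum(text.lower().count(term) for term in terms), doc_id)
            for doc_id, text in self._docs.items()
        ]
        # Keep only documents that match at least one term, best match first.
        hits = sorted((s, d) for s, d in scored if s > 0)
        hits.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for _, doc_id in hits[:top_k]]


def run_query(backend: SearchBackend, query: str, top_k: int = 3) -> list[str]:
    # Calling code only sees the interface, so backends can be swapped freely.
    return backend.search(query, top_k)


backend = InMemoryBackend()
backend.index("a", "Vespa is a search engine")
backend.index("b", "Object storage keeps documents")
backend.index("c", "Hybrid search combines vector search and keyword search")

hits = run_query(backend, "search")
print(hits)
```

Because `run_query` is typed against the protocol rather than a concrete class, replacing `InMemoryBackend` with another implementation requires no change to the calling code.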
Installation
Install the base package:
pip install mistralai-search-toolkit
Install optional components:
# Text extraction from PDFs (requires pymupdf-pro)
pip install mistralai-search-toolkit[extractor-pymupdf]
# HTML to markdown conversion
pip install mistralai-search-toolkit[html-converter-markdownify]
# Email extraction
pip install mistralai-search-toolkit[extractor-email]
# Spreadsheet parsing
pip install mistralai-search-toolkit[extractor-spreadsheet]
# LangChain text splitting
pip install mistralai-search-toolkit[text-splitter-langchain]
Quick Start
1. Load and Process Documents
import os
from mistralai.search.toolkit.ingestion.loaders import FilesystemFileLoader
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
from mistralai.client import Mistral
# Load documents from a directory
loader = FilesystemFileLoader()
documents = loader.load(path="/path/to/documents")
# Split into chunks
splitter = CharacterTextSplitter(chunk_size=512)
chunks = splitter.split(documents)
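For intuition about what `chunk_size=512` means, here is a minimal sketch of fixed-size character splitting in plain Python. It is an assumption-laden illustration, not the behaviour of `CharacterTextSplitter`: the function `split_by_characters` and its `overlap` parameter are hypothetical, and the real splitter may also honour separators or other options.

```python
def split_by_characters(text: str, chunk_size: int = 512, overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters, so each window starts
    chunk_size - overlap characters after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# A 1200-character text yields two full chunks and one shorter tail.
sample_chunks = split_by_characters("a" * 1200, chunk_size=512)
print([len(chunk) for chunk in sample_chunks])

# With overlap, neighbouring chunks repeat a few characters for context.
print(split_by_characters("abcdefghij", chunk_size=4, overlap=2))
```

Smaller chunks give more precise matches at query time, while larger chunks keep more context per embedding; the right value depends on your documents.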
2. Generate Embeddings
from mistralai.search.toolkit.embedders import MistralEmbedder, MODEL_1024_EMBEDDING
# Create embedder (uses Mistral's API)
mistral_client = Mistral(api_key=os.environ.get("MISTRAL_API_KEY", "your-api-key"))
embedder = MistralEmbedder(client=mistral_client, model_name=MODEL_1024_EMBEDDING)
# Embed your chunks
embedded_chunks = embedder.embed(chunks)
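Once chunks are embedded, semantic retrieval comes down to comparing a query vector with the stored chunk vectors. The sketch below ranks documents by cosine similarity using tiny hand-written 3-dimensional vectors; real embeddings from `MODEL_1024_EMBEDDING` have 1024 dimensions. The similarity metric a given backend actually uses is an assumption here, and `cosine_similarity` and `rank_by_similarity` are illustrative helpers rather than toolkit functions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as dissimilar to everything.
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs, top_k=2):
    """Return the top_k (doc_id, similarity) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings keyed by document ID.
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}

top = rank_by_similarity([1.0, 0.1, 0.0], doc_vecs)
for doc_id, similarity in top:
    print(f"{doc_id}: {similarity:.3f}")
```

The length check mirrors the requirement that the index's embedding dimensions match the embedder's output.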
3. Create an Index and Search
The toolkit supports multiple search backends through plugins. See the Vespa Plugin section below for a complete example.
Vespa Plugin: Creating a Search Index
Vespa is an open-source search engine that integrates seamlessly with the toolkit.
Prerequisites
- The Vespa plugin:
pip install mistralai-search-toolkit-plugins-vespa
- Docker for local development
Getting Started with Vespa
Step 1: Bootstrap Your Vespa Application
First, create the application structure with an initial migration:
uv run mistral-vespa generate-migration --app-dir ./vespa_app initial_schema
This creates ./vespa_app/ and generates a migration file. Fill it with your schema definition:
from mistralai.search.toolkit.plugins.vespa.app.schemas.app import SearchMode
from mistralai.search.toolkit.plugins.vespa.migration import VespaMigration, create_default_schema, set_app_name
class InitialSchema(VespaMigration):
    def migrate(self) -> None:
        set_app_name("articles")
        create_default_schema(
            name="articles",
            mode=SearchMode.INDEX,
            embedding_dimensions=1024,  # Match your embedder's dimensions
            schema_version=1,
        )
Step 2: Start a Local Vespa Instance
uv run mistral-vespa local up --query-port 18080 --config-port 19171 --name vespa-dev
Step 3: Deploy Your Application
Deploy the migrations to the running Vespa instance:
uv run mistral-vespa migrate \
--app-dir ./vespa_app \
--config-server http://localhost:19171 \
--query-port 18080
This generates the vespa_app module that you can import.
Step 4: Ingest and Search Documents
After deployment, use the generated vespa_app to index and search:
import asyncio
import os

from mistralai.search.toolkit.ingestion.pipelines import Pipeline
from mistralai.search.toolkit.ingestion.loaders import FilesystemFileLoader
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
from mistralai.search.toolkit.embedders import MistralEmbedder, MODEL_1024_EMBEDDING
from mistralai.client import Mistral
from mistralai.search.toolkit.plugins.vespa import VespaClientConfig
from mistralai.search.toolkit.retrieval import QueryEngine
from mistralai.search.toolkit.retrieval.retrievers import VectorRetriever
from vespa_app import app  # Generated by migration deployment

# Configuration
mistral_client = Mistral(api_key=os.environ.get("MISTRAL_API_KEY", "your-api-key"))
vespa_config = VespaClientConfig(
    endpoint=os.environ.get("VESPA_ENDPOINT", "http://localhost:18080"),
)
collection_name = "articles"


async def main() -> None:
    # Connect to Vespa
    vector_store = app.get_search_index(vespa_config, collection_name=collection_name)

    # One embedder serves both ingestion and retrieval
    embedder = MistralEmbedder(client=mistral_client, model_name=MODEL_1024_EMBEDDING)

    # INGESTION: Index your documents
    pipeline = Pipeline(
        loader=FilesystemFileLoader(),
        text_splitter=CharacterTextSplitter(chunk_size=512),
        embedder=embedder,
        stores=vector_store,
    )
    num_chunks = await pipeline.run(documents=["doc1.pdf", "doc2.pdf"])

    # RETRIEVAL: Search your documents
    query_engine = QueryEngine(
        retriever=[VectorRetriever(client=vector_store, embedder=embedder)],
    )
    results = await query_engine.search(query="What is RAG?", top_k=5)

    # Print results
    for result in results.results:
        print(f"Score: {result.score}")
        print(f"Content: {result.content}\n")


# pipeline.run and query_engine.search are awaited, so they need an event loop
asyncio.run(main())
Plugins
Extend the toolkit with specialized backends:
| Plugin | Package | Description |
|---|---|---|
| Vespa Plugin | mistralai-search-toolkit-plugins-vespa | Vespa search backend |
| AWS S3 Storage | mistralai-search-toolkit-storage-s3 | AWS S3 storage backend |
| Azure Blob Storage | mistralai-search-toolkit-storage-azure | Azure Blob Storage backend |
| Google Cloud Storage | mistralai-search-toolkit-storage-gcs | Google Cloud Storage backend |
License
This package is licensed under the Apache License 2.0.
Support
For more information and examples, see the Vespa documentation.