Skip to main content

Local RAG-based semantic document search with MCP server interface

Project description

ChunkSilo MCP Server

ChunkSilo is like a local Google for your documents. It uses semantic search — matching by meaning rather than exact keywords — so your LLM can find relevant information across all your files even when the wording differs from your query. Point it at your PDFs, Word docs, Markdown, and text files, and it builds a fully searchable index locally on your machine.

Overview

  • No permissions headache: Each user indexes only the files they already have access to. No centralized access-control system to build or maintain — document permissions stay exactly where they are.
  • No infrastructure required: Runs entirely on the user's own machine as an MCP server. Nothing to deploy, no servers to manage.
  • Easy to set up: Any user with an MCP-compatible LLM client can install, point at their document directories, and have everything indexed and searchable.
  • Works with what you have: Supports PDF, DOCX, DOC, Markdown, and TXT from local folders, network drives, or shared mounts.

Features

  • Local indexing and search: All indexing and search runs on your machine with bundled models — ChunkSilo itself makes no external network calls when offline: true. Note: search results are passed to your MCP client's LLM, which may be cloud-hosted.
  • Incremental indexing: Only reindexes new or changed files, so re-runs are fast even on large document collections.
  • Heading-aware navigation: Extracts headings from PDFs, Word docs, and Markdown so results include the full heading path (e.g. "Chapter 3 > Setup > Prerequisites").
  • Date filtering and recency boost: Search within a date range or let recent documents rank higher automatically.
  • Dual retrieval: Returns both meaning-based chunk matches and keyword-based filename matches separately, so file lookups don't get buried by unrelated content.
  • Multi-directory with per-folder rules: Index multiple directories with individual include/exclude glob patterns — useful for shared drives with mixed content.
  • Confluence integration: Optionally searches your Confluence instance alongside local files, with results returned in the same format.
  • Source links: Each result includes a clickable link back to the source file or Confluence page in supported MCP clients.

Installation

Option A: Install from PyPI (Recommended)

Requires Python 3.11 or later. Models are downloaded automatically on first run (~250MB). The first run may appear to pause while models download — this is normal.

pip install chunksilo

# Or with Confluence support:
pip install chunksilo[confluence]

Then:

  1. Create a config file at ~/.config/chunksilo/config.yaml (see Configuration)
  2. Build the index: chunksilo --build-index
  3. Configure your MCP client (see MCP Client Configuration)

Option B: Offline Bundle

A self-contained package with pre-downloaded models, ideal for air-gapped environments or systems without Python installed.

Download from the Releases page:

  1. Download the chunksilo-vX.Y.Z-manylinux_2_34_x86_64.tar.gz file
  2. Extract and install:
tar -xzf chunksilo-vX.Y.Z-manylinux_2_34_x86_64.tar.gz
cd chunksilo
./setup.sh
  1. Edit config.yaml to set your document directories
  2. Build the index: ./venv/bin/chunksilo --build-index
  3. Configure your MCP client (see MCP Client Configuration)

Configuration

ChunkSilo uses a single configuration file: config.yaml

Configuration File

Edit config.yaml to configure your settings:

# Indexing settings - used by chunksilo --build-index
indexing:
  directories:
    - "./data"
    - "/mnt/nfs/shared-docs"
    - path: "/mnt/samba/engineering"
      include: ["**/*.pdf", "**/*.md"]
      exclude: ["**/archive/**"]
  chunk_size: 1600
  chunk_overlap: 200

# Retrieval settings - used when searching
retrieval:
  embed_top_k: 20
  rerank_top_k: 5
  score_threshold: 0.1

# Confluence integration (optional)
confluence:
  url: "https://confluence.example.com"
  username: "your-username"
  api_token: "your-api-token"

# Storage paths (usually don't need to change)
storage:
  storage_dir: "./storage"
  model_cache_dir: "./models"

All settings are optional and have sensible defaults.

Configuration Reference

Indexing Settings

Setting Default Description
indexing.directories ["./data"] List of directories to index (strings or objects)
indexing.chunk_size 1600 Maximum size of text chunks
indexing.chunk_overlap 200 Overlap between adjacent chunks

Per-directory options (when using object format):

Option Default Description
path (required) Directory path to index
include ["**/*.pdf", "**/*.md", "**/*.txt", "**/*.docx", "**/*.doc"] Glob patterns for files to include
exclude [] Glob patterns for files to exclude
recursive true Whether to recurse into subdirectories
enabled true Whether to index this directory

Retrieval Settings

Setting Default Description
retrieval.embed_model_name BAAI/bge-small-en-v1.5 Embedding model for vector search
retrieval.embed_top_k 20 Candidates from vector search before reranking
retrieval.rerank_model_name ms-marco-MiniLM-L-12-v2 Reranker model
retrieval.rerank_top_k 5 Final results after reranking
retrieval.rerank_candidates 100 Maximum candidates sent to reranker
retrieval.score_threshold 0.1 Minimum score (0.0-1.0) for results
retrieval.recency_boost 0.3 Recency boost weight (0.0-1.0)
retrieval.recency_half_life_days 365 Days until recency boost halves
retrieval.bm25_similarity_top_k 10 Files returned by BM25 filename search
retrieval.offline false Prevent ML library network requests

Confluence Settings (optional)

Note: Confluence integration requires the optional dependency. Install with: pip install chunksilo[confluence]

Setting Default Description
confluence.url "" Confluence base URL (empty = disabled)
confluence.username "" Confluence username
confluence.api_token "" Confluence API token
confluence.timeout 10.0 Request timeout in seconds
confluence.max_results 30 Maximum results per search

SSL Settings (optional)

Setting Default Description
ssl.ca_bundle_path "" Path to custom CA bundle file

Storage Settings

Setting Default Description
storage.storage_dir ./storage Directory for vector index and state
storage.model_cache_dir ./models Directory for model cache

CLI Usage

The chunksilo command provides indexing, searching, and model management:

# Build or update the search index
chunksilo --build-index

# Search for documents
chunksilo "your search query"

# Search with date filtering
chunksilo "quarterly report" --date-from 2024-01-01 --date-to 2024-03-31

# Output results as JSON
chunksilo "search query" --json

# Show verbose output (model loading, search stats)
chunksilo "search query" --verbose

# Pre-download ML models (useful before going offline)
chunksilo --download-models

# Use a custom config file
chunksilo --build-index --config /path/to/config.yaml

CLI Options

Option Description
query Search query text (positional argument)
--build-index Build or update the search index, then exit
--download-models Download required ML models, then exit
--date-from Start date filter (YYYY-MM-DD format, inclusive)
--date-to End date filter (YYYY-MM-DD format, inclusive)
--json Output results as JSON instead of formatted text
-v, --verbose Show diagnostic messages (model loading, search stats)
--config Path to config.yaml (overrides auto-discovery)

MCP Client Configuration

Configure your MCP client to run ChunkSilo. Below are examples for common clients.

Note: For PyPI installs, use chunksilo-mcp directly. For offline bundles, use the full path /path/to/chunksilo/venv/bin/chunksilo-mcp. You can find the PyPI-installed binary location with which chunksilo-mcp.

Claude Code

Add chunksilo as an MCP server using the CLI:

PyPI install:

claude mcp add chunksilo --scope user -- chunksilo-mcp --config ~/.config/chunksilo/config.yaml

Offline bundle:

claude mcp add chunksilo --scope user -- /path/to/chunksilo/venv/bin/chunksilo-mcp --config /path/to/chunksilo/config.yaml

Verify it's connected:

claude mcp list

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

PyPI install:

{
  "mcpServers": {
    "chunksilo": {
      "command": "chunksilo-mcp",
      "args": ["--config", "/path/to/config.yaml"]
    }
  }
}

Offline bundle:

{
  "mcpServers": {
    "chunksilo": {
      "command": "/path/to/chunksilo/venv/bin/chunksilo-mcp",
      "args": ["--config", "/path/to/chunksilo/config.yaml"]
    }
  }
}

Cline (VS Code Extension)

Add to cline_mcp_settings.json (typically in ~/.config/Code/User/globalStorage/saoudrizwan.claude-dev/settings/):

PyPI install:

{
  "mcpServers": {
    "chunksilo": {
      "command": "chunksilo-mcp",
      "args": ["--config", "/path/to/config.yaml"],
      "disabled": false,
      "autoApprove": []
    }
  }
}

Offline bundle:

{
  "mcpServers": {
    "chunksilo": {
      "command": "/path/to/chunksilo/venv/bin/chunksilo-mcp",
      "args": ["--config", "/path/to/chunksilo/config.yaml"],
      "disabled": false,
      "autoApprove": []
    }
  }
}

Roo Code (VS Code Extension)

Add to mcp_settings.json (typically in ~/.config/Code/User/globalStorage/rooveterinaryinc.roo-cline/settings/):

PyPI install:

{
  "mcpServers": {
    "chunksilo": {
      "command": "chunksilo-mcp",
      "args": ["--config", "/path/to/config.yaml"]
    }
  }
}

Offline bundle:

{
  "mcpServers": {
    "chunksilo": {
      "command": "/path/to/chunksilo/venv/bin/chunksilo-mcp",
      "args": ["--config", "/path/to/chunksilo/config.yaml"]
    }
  }
}

Troubleshooting

  • Index missing: Run chunksilo --build-index (PyPI install) or ./venv/bin/chunksilo --build-index (offline bundle).
  • Retrieval errors: Check paths in your MCP client configuration.
  • Offline mode: PyPI installs default to offline: false (models auto-download). The offline bundle includes pre-downloaded models and sets offline: true. Set retrieval.offline: true in config.yaml to prevent network calls after initial model download.
  • Confluence Integration: Install with pip install chunksilo[confluence], then set confluence.url, confluence.username, and confluence.api_token in config.yaml.
  • Custom CA Bundle: Set ssl.ca_bundle_path in config.yaml for custom certificates.
  • Network mounts: Unavailable directories are skipped with a warning; indexing continues with available directories.
  • Legacy .doc files: Requires LibreOffice to be installed for automatic conversion to .docx. If LibreOffice is not found, .doc files are skipped with a warning. Full heading extraction is supported.

License

Apache-2.0. See LICENSE for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chunksilo-2.0.0.tar.gz (61.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

chunksilo-2.0.0-py3-none-any.whl (39.6 kB view details)

Uploaded Python 3

File details

Details for the file chunksilo-2.0.0.tar.gz.

File metadata

  • Download URL: chunksilo-2.0.0.tar.gz
  • Upload date:
  • Size: 61.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chunksilo-2.0.0.tar.gz
Algorithm Hash digest
SHA256 e8298c672550ae455670b95e4d4914df795b2b53cbaa0891da611cce9e745880
MD5 dd343bac4d10cfcc5eda99a3e59ee83d
BLAKE2b-256 f35007771d3383e683bb4a31a434fba6a50697de4409b4310543830d1af97fa6

See more details on using hashes here.

Provenance

The following attestation bundles were made for chunksilo-2.0.0.tar.gz:

Publisher: manual-release.yml on Chetic/chunksilo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file chunksilo-2.0.0-py3-none-any.whl.

File metadata

  • Download URL: chunksilo-2.0.0-py3-none-any.whl
  • Upload date:
  • Size: 39.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chunksilo-2.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 fd719614562908172076360e280c6cd5744117d8b6beb39f45c04d2d2eeafe0b
MD5 4864a1dd282c122afff329c22f5efbb1
BLAKE2b-256 c2a32ac9aee72f9e08491f91643f5e1115a09348681cb58a36e96b35ea38f183

See more details on using hashes here.

Provenance

The following attestation bundles were made for chunksilo-2.0.0-py3-none-any.whl:

Publisher: manual-release.yml on Chetic/chunksilo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page