Local RAG-based semantic document search with MCP server interface
ChunkSilo MCP Server
ChunkSilo is like a local Google for your documents. It uses semantic search — matching by meaning rather than exact keywords — so your LLM can find relevant information across all your files even when the wording differs from your query. Point it at your PDFs, Word docs, Markdown, and text files, and it builds a fully searchable index locally on your machine.
- Runs entirely on your machine — no servers, no infrastructure
- Semantic search + keyword filename matching across PDF, DOCX, DOC, Markdown, and TXT
- Incremental indexing — only reprocesses new or changed files
- Heading-aware results with source links back to the original file
- Date filtering and recency boosting
- Optional Confluence and Jira integrations
Example search_docs output
{
  "matched_files": [
    { "uri": "file:///docs/database-configuration.docx", "score": 0.8432 }
  ],
  "num_matched_files": 1,
  "chunks": [
    {
      "text": "To configure the database connection, set the DATABASE_URL environment variable...",
      "score": 0.912,
      "location": {
        "uri": "file:///docs/setup-guide.pdf",
        "page": 12,
        "line": null,
        "heading_path": ["Getting Started", "Configuration", "Database"]
      }
    }
  ],
  "num_chunks": 1,
  "query": "how to configure the database",
  "retrieval_time": "0.42s"
}
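As an illustration of how a client might consume this payload, the following minimal sketch turns each chunk into a one-line summary. The `summarize` helper is hypothetical and not part of ChunkSilo; it also assumes that either `page` or `line` may be null depending on the file type, which the example output suggests but does not guarantee.

```python
import json

# Trimmed copy of the example search_docs payload shown above.
SAMPLE = """
{
  "matched_files": [
    {"uri": "file:///docs/database-configuration.docx", "score": 0.8432}
  ],
  "chunks": [
    {
      "text": "To configure the database connection, set the DATABASE_URL environment variable...",
      "score": 0.912,
      "location": {
        "uri": "file:///docs/setup-guide.pdf",
        "page": 12,
        "line": null,
        "heading_path": ["Getting Started", "Configuration", "Database"]
      }
    }
  ]
}
"""


def summarize(result: dict) -> list[str]:
    """Return one readable line per chunk (hypothetical helper)."""
    lines = []
    for chunk in result.get("chunks", []):
        loc = chunk["location"]
        heading = " > ".join(loc.get("heading_path") or [])
        # Assumption: page and line are each optional and may be null.
        if loc.get("page") is not None:
            position = f"page {loc['page']}"
        elif loc.get("line") is not None:
            position = f"line {loc['line']}"
        else:
            position = "unknown position"
        lines.append(f"{chunk['score']:.3f}  {loc['uri']} ({position})  [{heading}]")
    return lines


for line in summarize(json.loads(SAMPLE)):
    print(line)
```

The `heading_path` list is joined into a breadcrumb so the result can be traced back to a section of the original file.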
Installation
Option A: Install from PyPI (Recommended)
Requires Python 3.11 or later. Models are downloaded automatically on first run (~250MB). The first run may appear to pause while models download — this is normal.
pip install chunksilo
# Or with Confluence support:
pip install chunksilo[confluence]
# Or with Jira support:
pip install chunksilo[jira]
# Or with both Confluence and Jira:
pip install chunksilo[confluence,jira]
Then:
- Create a config file at ~/.config/chunksilo/config.yaml (see Configuration)
- Build the index: chunksilo --build-index
- Configure your MCP client (see MCP Client Configuration)
Option B: Offline Bundle
A self-contained package with pre-downloaded models, ideal for air-gapped environments or systems without Python installed.
Download from the Releases page:
- Download the chunksilo-vX.Y.Z-manylinux_2_34_x86_64.tar.gz file
- Extract and install:
tar -xzf chunksilo-vX.Y.Z-manylinux_2_34_x86_64.tar.gz
cd chunksilo
./setup.sh
- Edit config.yaml to set your document directories
- Build the index: ./venv/bin/chunksilo --build-index
- Configure your MCP client (see MCP Client Configuration)
Configuration
ChunkSilo uses a single configuration file: config.yaml
Configuration File
Edit config.yaml to configure your settings:
# Indexing settings - used by chunksilo --build-index
indexing:
  directories:
    - "./data"
    - "/mnt/nfs/shared-docs"
    - path: "/mnt/samba/engineering"
      include: ["**/*.pdf", "**/*.md"]
      exclude: ["**/archive/**"]
  chunk_size: 1600
  chunk_overlap: 200

# Retrieval settings - used when searching
retrieval:
  embed_top_k: 20
  rerank_top_k: 5
  score_threshold: 0.1

# Confluence integration (optional)
confluence:
  url: "https://confluence.example.com"
  username: "your-username"
  api_token: "your-api-token"

# Storage paths (usually don't need to change)
storage:
  storage_dir: "./storage"
  model_cache_dir: "./models"
All settings are optional and have sensible defaults.
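To make the "optional with defaults" behaviour concrete, here is a minimal sketch of overlaying a partial user configuration on a table of defaults. The `merge_config` helper and the `DEFAULTS` subset are illustrative assumptions; how ChunkSilo actually loads and merges its configuration is not described in this README.

```python
import copy

# Subset of the documented defaults, transcribed by hand for illustration.
DEFAULTS = {
    "indexing": {"directories": ["./data"], "chunk_size": 1600, "chunk_overlap": 200},
    "retrieval": {"embed_top_k": 20, "rerank_top_k": 5, "score_threshold": 0.1},
    "storage": {"storage_dir": "./storage", "model_cache_dir": "./models"},
}


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user settings on defaults (hypothetical helper)."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested section: merge key by key instead of replacing it.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# A config file that sets a single value keeps every other default.
user_config = {"retrieval": {"rerank_top_k": 8}}
effective = merge_config(DEFAULTS, user_config)
print(effective["retrieval"])
```

With this kind of merge, a config file only needs to list the settings that differ from the defaults.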
Configuration Reference
Indexing Settings
| Setting | Default | Description |
|---|---|---|
| indexing.directories | ["./data"] | List of directories to index (strings or objects) |
| indexing.chunk_size | 1600 | Maximum size of text chunks |
| indexing.chunk_overlap | 200 | Overlap between adjacent chunks |
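The interaction between chunk_size and chunk_overlap can be sketched with a plain sliding window. This is a deliberate simplification: it assumes sizes are counted in characters, which this README does not state, and it ignores the heading-aware splitting that ChunkSilo advertises. The `chunk_text` function is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1600, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(4000))
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

With the default values each window advances by 1400 characters, so the tail of one chunk repeats at the head of the next and a sentence that straddles a boundary still appears whole in at least one chunk.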
Per-directory options (when using object format):
| Option | Default | Description |
|---|---|---|
| path | (required) | Directory path to index |
| include | ["**/*.pdf", "**/*.md", "**/*.txt", "**/*.docx", "**/*.doc"] | Glob patterns for files to include |
| exclude | [] | Glob patterns for files to exclude |
| recursive | true | Whether to recurse into subdirectories |
| enabled | true | Whether to index this directory |
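The sketch below shows one way the string and object forms of a directory entry could be normalized, and how include and exclude patterns might combine. The helper names are hypothetical, and the glob semantics (a leading `**/` matching zero or more directories, exclude taking precedence over include) are assumptions rather than documented ChunkSilo behaviour. The `recursive` option is carried through but not interpreted here.

```python
import re

DEFAULT_INCLUDE = ["**/*.pdf", "**/*.md", "**/*.txt", "**/*.docx", "**/*.doc"]


def normalize_directory(entry) -> dict:
    """Expand a string or object entry into a dict with every option set."""
    if isinstance(entry, str):
        entry = {"path": entry}
    if "path" not in entry:
        raise ValueError("directory entry needs a 'path'")
    return {
        "path": entry["path"],
        "include": list(entry.get("include", DEFAULT_INCLUDE)),
        "exclude": list(entry.get("exclude", [])),
        "recursive": entry.get("recursive", True),
        "enabled": entry.get("enabled", True),
    }


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob to a regex; '**/' spans zero or more directories."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")  # a single star stays within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_selected(relative_path: str, directory: dict) -> bool:
    """True if the path matches an include pattern and no exclude pattern."""
    def matches(patterns):
        return any(glob_to_regex(p).fullmatch(relative_path) for p in patterns)
    return directory["enabled"] and matches(directory["include"]) and not matches(directory["exclude"])


engineering = normalize_directory({
    "path": "/mnt/samba/engineering",
    "include": ["**/*.pdf", "**/*.md"],
    "exclude": ["**/archive/**"],
})
for candidate in ["specs/design.pdf", "specs/archive/2019/old.pdf", "notes.txt"]:
    print(candidate, is_selected(candidate, engineering))
```

Paths are expected relative to the directory root and with forward slashes.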
Retrieval Settings
| Setting | Default | Description |
|---|---|---|
| retrieval.embed_model_name | BAAI/bge-small-en-v1.5 | Embedding model for vector search |
| retrieval.embed_top_k | 20 | Candidates from vector search before reranking |
| retrieval.rerank_model_name | ms-marco-MiniLM-L-12-v2 | Reranker model |
| retrieval.rerank_top_k | 5 | Final results after reranking |
| retrieval.rerank_candidates | 100 | Maximum candidates sent to reranker |
| retrieval.score_threshold | 0.1 | Minimum score (0.0-1.0) for results |
| retrieval.recency_boost | 0.3 | Recency boost weight (0.0-1.0) |
| retrieval.recency_half_life_days | 365 | Days until recency boost halves |
| retrieval.bm25_similarity_top_k | 10 | Files returned by BM25 filename search |
| retrieval.offline | false | Prevent ML library network requests |
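To give a feel for how recency_boost, recency_half_life_days, score_threshold, and rerank_top_k could interact, here is a minimal sketch. The blending formula, the order of the steps, and the `Hit` type are all assumptions made for illustration; the README does not specify how ChunkSilo combines these values.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    uri: str
    score: float     # reranker score, assumed to lie in [0, 1]
    age_days: float  # days since the document was last modified


def recency_adjusted(score: float, age_days: float,
                     recency_boost: float = 0.3,
                     half_life_days: float = 365) -> float:
    """Blend a score with an exponential decay (assumed formula)."""
    decay = 0.5 ** (age_days / half_life_days)  # 1.0 when new, 0.5 after one half-life
    return score * ((1.0 - recency_boost) + recency_boost * decay)


def select(hits: list[Hit], rerank_top_k: int = 5, score_threshold: float = 0.1) -> list[tuple[float, Hit]]:
    """Apply the recency adjustment, drop weak hits, keep the best rerank_top_k."""
    scored = [(recency_adjusted(h.score, h.age_days), h) for h in hits]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:rerank_top_k]


hits = [
    Hit("old.md", 0.55, 3650),
    Hit("new.md", 0.50, 0),
    Hit("weak.md", 0.12, 3650),
]
for final_score, hit in select(hits):
    print(f"{final_score:.3f}  {hit.uri}")
```

Under this formula a brand-new document keeps its full score, while a very old one is scaled down by at most the recency_boost weight, so recency can reorder close results without overriding relevance.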
Confluence Settings (optional)
Note: Confluence integration requires the optional dependency. Install with:
pip install chunksilo[confluence]
| Setting | Default | Description |
|---|---|---|
| confluence.url | "" | Confluence base URL (empty = disabled) |
| confluence.username | "" | Confluence username |
| confluence.api_token | "" | Confluence API token |
| confluence.timeout | 10.0 | Request timeout in seconds |
| confluence.max_results | 30 | Maximum results per search |
Jira Settings (optional)
Note: Jira integration requires the optional dependency. Install with:
pip install chunksilo[jira]
| Setting | Default | Description |
|---|---|---|
| jira.url | "" | Jira base URL (empty = disabled) |
| jira.username | "" | Jira username/email |
| jira.api_token | "" | Jira API token |
| jira.timeout | 10.0 | Request timeout in seconds |
| jira.max_results | 30 | Maximum results per search |
| jira.projects | [] | Project keys to search (empty = all) |
| jira.include_comments | true | Include issue comments in search |
| jira.include_custom_fields | true | Include custom fields in search |
Creating a Jira API Token:
- Log into Jira
- Go to Account Settings > Security > API Tokens
- Click "Create API Token"
- Copy the token and add it to your config
SSL Settings (optional)
| Setting | Default | Description |
|---|---|---|
| ssl.ca_bundle_path | "" | Path to custom CA bundle file |
Storage Settings
| Setting | Default | Description |
|---|---|---|
| storage.storage_dir | ./storage | Directory for vector index and state |
| storage.model_cache_dir | ./models | Directory for model cache |
CLI Usage
The chunksilo command provides indexing, searching, and model management:
# Build or update the search index
chunksilo --build-index
# Search for documents
chunksilo "your search query"
# Search with date filtering
chunksilo "quarterly report" --date-from 2024-01-01 --date-to 2024-03-31
# Output results as JSON
chunksilo "search query" --json
# Show verbose output (model loading, search stats)
chunksilo "search query" --verbose
# Pre-download ML models (useful before going offline)
chunksilo --download-models
# Use a custom config file
chunksilo --build-index --config /path/to/config.yaml
CLI Options
| Option | Description |
|---|---|
| query | Search query text (positional argument) |
| --build-index | Build or update the search index, then exit |
| --download-models | Download required ML models, then exit |
| --date-from | Start date filter (YYYY-MM-DD format, inclusive) |
| --date-to | End date filter (YYYY-MM-DD format, inclusive) |
| --json | Output results as JSON instead of formatted text |
| -v, --verbose | Show diagnostic messages (model loading, search stats) |
| --config | Path to config.yaml (overrides auto-discovery) |
MCP Client Configuration
Configure your MCP client to run ChunkSilo. Below are examples for common clients.
Note: For PyPI installs, use chunksilo-mcp directly. For offline bundles, use the full path /path/to/chunksilo/venv/bin/chunksilo-mcp. You can find the PyPI-installed binary location with which chunksilo-mcp.
Claude Code
Add chunksilo as an MCP server using the CLI:
PyPI install:
claude mcp add chunksilo --scope user -- chunksilo-mcp --config ~/.config/chunksilo/config.yaml
Offline bundle:
claude mcp add chunksilo --scope user -- /path/to/chunksilo/venv/bin/chunksilo-mcp --config /path/to/chunksilo/config.yaml
Verify it's connected:
claude mcp list
Claude Desktop
Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):
PyPI install:
{
  "mcpServers": {
    "chunksilo": {
      "command": "chunksilo-mcp",
      "args": ["--config", "/path/to/config.yaml"]
    }
  }
}
Offline bundle:
{
  "mcpServers": {
    "chunksilo": {
      "command": "/path/to/chunksilo/venv/bin/chunksilo-mcp",
      "args": ["--config", "/path/to/chunksilo/config.yaml"]
    }
  }
}
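If the client config file already lists other MCP servers, the chunksilo entry has to be merged in rather than pasted over the file. A minimal sketch of that merge follows; the `register_server` helper is hypothetical, and the demo writes to a temporary directory instead of the real Claude Desktop path.

```python
import json
import tempfile
from pathlib import Path


def register_server(config_path: Path, command: str, chunksilo_config: str) -> dict:
    """Add or update the chunksilo entry while keeping other servers intact."""
    text = config_path.read_text(encoding="utf-8") if config_path.exists() else ""
    config = json.loads(text) if text.strip() else {}
    servers = config.setdefault("mcpServers", {})
    servers["chunksilo"] = {
        "command": command,
        "args": ["--config", chunksilo_config],
    }
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo against a throwaway file that already contains another server.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "claude_desktop_config.json"
    demo_path.write_text('{"mcpServers": {"other": {"command": "other-mcp"}}}', encoding="utf-8")
    updated = register_server(demo_path, "chunksilo-mcp", "/path/to/config.yaml")
    print(sorted(updated["mcpServers"]))
```

For an offline bundle, pass the full path to the bundled chunksilo-mcp binary as the command. Back up the real config file before pointing a script like this at it.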
Cline (VS Code Extension)
Add to cline_mcp_settings.json (typically in ~/.config/Code/User/globalStorage/saoudrizwan.claude-dev/settings/):
PyPI install:
{
  "mcpServers": {
    "chunksilo": {
      "command": "chunksilo-mcp",
      "args": ["--config", "/path/to/config.yaml"],
      "disabled": false,
      "autoApprove": []
    }
  }
}
Offline bundle:
{
  "mcpServers": {
    "chunksilo": {
      "command": "/path/to/chunksilo/venv/bin/chunksilo-mcp",
      "args": ["--config", "/path/to/chunksilo/config.yaml"],
      "disabled": false,
      "autoApprove": []
    }
  }
}
Roo Code (VS Code Extension)
Add to mcp_settings.json (typically in ~/.config/Code/User/globalStorage/rooveterinaryinc.roo-cline/settings/):
PyPI install:
{
  "mcpServers": {
    "chunksilo": {
      "command": "chunksilo-mcp",
      "args": ["--config", "/path/to/config.yaml"]
    }
  }
}
Offline bundle:
{
  "mcpServers": {
    "chunksilo": {
      "command": "/path/to/chunksilo/venv/bin/chunksilo-mcp",
      "args": ["--config", "/path/to/chunksilo/config.yaml"]
    }
  }
}
Troubleshooting
- Index missing: Run chunksilo --build-index (PyPI install) or ./venv/bin/chunksilo --build-index (offline bundle).
- Retrieval errors: Check paths in your MCP client configuration.
- Offline mode: PyPI installs default to offline: false (models auto-download). The offline bundle includes pre-downloaded models and sets offline: true. Set retrieval.offline: true in config.yaml to prevent network calls after initial model download.
- Confluence Integration: Install with pip install chunksilo[confluence], then set confluence.url, confluence.username, and confluence.api_token in config.yaml.
- Jira Integration: Install with pip install chunksilo[jira], then set jira.url, jira.username, and jira.api_token in config.yaml. Optionally configure jira.projects to restrict search to specific project keys.
- Custom CA Bundle: Set ssl.ca_bundle_path in config.yaml for custom certificates.
- Network mounts: Unavailable directories are skipped with a warning; indexing continues with available directories.
- Legacy .doc files: Requires LibreOffice to be installed for automatic conversion to .docx. If LibreOffice is not found, .doc files are skipped with a warning. Full heading extraction is supported.
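The network-mount behaviour described above can be pictured as a pre-flight filter over the configured directories. The sketch below is an assumption about the general approach, not ChunkSilo's code; `available_directories` and the logger name are hypothetical.

```python
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger("chunksilo.example")  # hypothetical logger name


def available_directories(paths: list[str]) -> list[Path]:
    """Return the directories that exist, warning about the rest."""
    usable = []
    for raw in paths:
        path = Path(raw).expanduser()
        if path.is_dir():
            usable.append(path)
        else:
            # Skip instead of failing so one missing mount does not block indexing.
            logger.warning("Skipping unavailable directory: %s", path)
    return usable


logging.basicConfig(level=logging.WARNING)
with tempfile.TemporaryDirectory() as present:
    missing = str(Path(present) / "not-mounted")
    found = available_directories([present, missing])
    print(len(found))
```

A check like this can be run before chunksilo --build-index to confirm which of the configured directories are currently reachable.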
License
Apache-2.0. See LICENSE for details.