DocSIF - Local hybrid search engine for personal knowledge bases
SIF
SIF is a local CLI search engine for indexing and searching markdown documents. It combines BM25 full-text search with embedding-based semantic search, and all data stays on your machine.
Features
- Full-Text Search: BM25 ranking via SQLite FTS5
- Semantic Search: Vector similarity using configurable embeddings
- Hybrid Search: Combines BM25 and vector search with RRF fusion
- Query Expansion: Automatic query enhancement via pseudo-relevance feedback
- Result Reranking: Cross-encoder reranking with Qwen3-Reranker
- MCP Server: Model Context Protocol support for AI assistants (stdio + HTTP)
- Local-First: All data stays on your machine
- Multiple Backends: sentence-transformers, llama-cpp-python, OpenAI-compatible API, ModelScope
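The README does not describe how SIF implements query expansion, so the following is only a generic sketch of pseudo-relevance feedback: run the original query, treat the top results as relevant, and append their most frequent new terms to the query. The function name `expand_query`, the `STOPWORDS` set, and the sample documents are all hypothetical.

```python
from collections import Counter

# Hypothetical stopword list; a real system would use a larger one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "for", "on", "with", "at"}


def expand_query(query_terms, top_docs, n_terms=2):
    """Append the n_terms most frequent new terms found in the top-ranked documents."""
    counts = Counter(
        term
        for doc in top_docs
        for term in doc
        if term not in STOPWORDS and term not in query_terms
    )
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return list(query_terms) + [term for term, _ in ranked[:n_terms]]


# Pretend these tokenized documents were the top hits for the original query.
top_docs = [
    "python decorators wrap functions with the at syntax".split(),
    "decorators are functions returning functions".split(),
]
expanded = expand_query(["python", "decorators"], top_docs)
```

The expanded query would then be run through the normal search path in place of the original one.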
Quick Start
Installation
pip install sif
For full functionality including embeddings:
pip install "sif[all]"
Basic Usage
# Add a collection
sif collection add ~/Documents/notes --name my-notes
# Index your documents
sif index update --collection my-notes
# Search
sif search query "python decorators"
Installation
From PyPI (Recommended)
pip install sif
From Source
git clone https://github.com/zhangtaolab/sif.git
cd sif
pip install -e ".[dev]"
Using pipx
pipx install sif
Usage
Collection Management
# List all collections
sif collection list
# Add a new collection
sif collection add ~/Work/docs --name work-docs --description "Work documentation"
# Show collection details
sif collection show my-notes
# Rename a collection
sif collection rename my-notes personal-notes
# Delete a collection
sif collection remove old-collection --force
Context Management
Add descriptive context to improve search relevance:
# Add context to a collection
sif context add collection my-notes "These are my personal notes about programming and technology."
# Add global context
sif context add global global "I am a software engineer interested in Python and machine learning."
# List all context
sif context list
Indexing
# Update the index
sif index update --collection my-notes
# Force full reindex
sif index update --collection my-notes --force
# Show indexing status
sif status
Search
# Basic search
sif search query "python decorators"
# Search in specific collection
sif search query "python decorators" -c my-notes
# BM25 keyword search
sif search search "python decorators"
# Limit results
sif search query "python decorators" --limit 20
# Search with explanation
sif search query "AI" --explain
MCP Server
Start the MCP server for integration with AI assistants:
# Start with stdio transport (default)
sif mcp stdio
# Start with HTTP transport
sif mcp http --port 8080
Configure in Claude Desktop:
{
"mcpServers": {
"sif": {
"command": "sif",
"args": ["mcp", "stdio"]
}
}
}
Configuration
SIF can be configured via environment variables or a .env file:
# Database location
SIF_DB_PATH=~/.local/share/sif/sif.db
# Embedding model
SIF_MODEL_NAME=Qwen/Qwen3-Embedding-0.6B
SIF_MODEL_PATH=~/models/embedding.gguf
# Chunking settings
SIF_CHUNK_SIZE=512
SIF_CHUNK_OVERLAP=128
# Logging
SIF_LOG_LEVEL=INFO
# MCP server
SIF_MCP_HOST=127.0.0.1
SIF_MCP_PORT=8080
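To illustrate how the chunking settings interact, here is a minimal sliding-window sketch. It assumes `SIF_CHUNK_SIZE` and `SIF_CHUNK_OVERLAP` are counted in tokens and that consecutive chunks share exactly `overlap` tokens; the README states neither, and `chunk_tokens` is a hypothetical helper, not part of SIF.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=128):
    """Split a token list into windows of chunk_size that overlap by `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this many tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end of the document
    return chunks


tokens = list(range(1000))  # stand-in for a tokenized document
chunks = chunk_tokens(tokens)
```

With the defaults above, each window starts 384 tokens after the previous one, so a larger overlap yields more chunks per document and a larger index.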
Architecture
SIF follows a layered architecture with clean separation of concerns:
┌────────────────────────────────────────────────────────────┐
│                     Presentation Layer                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │     CLI      │  │  MCP Server  │  │   API (Future)   │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
├────────────────────────────────────────────────────────────┤
│                     Application Layer                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │  Collection  │  │    Index     │  │      Search      │  │
│  │   Manager    │  │   Service    │  │     Service      │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
├────────────────────────────────────────────────────────────┤
│                        Domain Layer                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │  Collection  │  │   Document   │  │     Context      │  │
│  │    Entity    │  │    Entity    │  │      Entity      │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
├────────────────────────────────────────────────────────────┤
│                    Infrastructure Layer                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │  Repository  │  │  Embedding   │  │     Database     │  │
│  │   (SQLite)   │  │    Model     │  │    Connection    │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
└────────────────────────────────────────────────────────────┘
See docs/architecture.md for detailed architecture documentation.
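As a rough illustration of the layering, the sketch below wires a domain entity, a repository interface, and an application service together. Every class name here (`Document`, `DocumentRepository`, `InMemoryRepository`, `SearchService`) is hypothetical and chosen for the example; SIF's actual classes may look quite different.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Document:
    """Domain layer: plain data with no I/O."""
    path: str
    text: str


class DocumentRepository(Protocol):
    """Boundary that the infrastructure layer implements."""
    def all(self) -> list[Document]: ...


class InMemoryRepository:
    """Infrastructure layer stand-in; a real one might wrap SQLite."""
    def __init__(self, docs: list[Document]) -> None:
        self._docs = docs

    def all(self) -> list[Document]:
        return list(self._docs)


class SearchService:
    """Application layer: depends only on the repository interface."""
    def __init__(self, repo: DocumentRepository) -> None:
        self._repo = repo

    def search(self, term: str) -> list[str]:
        term = term.lower()
        return [d.path for d in self._repo.all() if term in d.text.lower()]


repo = InMemoryRepository([Document("a.md", "Python decorators"), Document("b.md", "Pasta")])
hits = SearchService(repo).search("python")
```

Because the service only sees the interface, a presentation layer such as a CLI or an MCP server could call the same service regardless of which storage backend sits behind it.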
Documentation
- Installation Guide - Detailed installation instructions
- Configuration - Configuration options
- Quick Start - Get started quickly
- CLI Reference - Complete CLI documentation
- API Reference - Python API documentation
- Architecture - System architecture
- MCP Server - MCP server documentation
- Search Algorithms - Search algorithm details
- Development Guide - Development setup
- Contributing - Contribution guidelines
Search Algorithms
BM25 (Full-Text Search)
BM25 is a probabilistic ranking function that scores documents by term frequency and inverse document frequency, normalized by document length.
Parameters:
- k1: Controls term frequency saturation (default: 1.5)
- b: Controls document length normalization (default: 0.75)
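To show what k1 and b do, here is a small pure-Python sketch of the standard Okapi BM25 formula with a smoothed IDF. It is an illustration only: SIF delegates ranking to SQLite FTS5, whose internals may differ, and `bm25_scores` is a hypothetical helper.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the +1 inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 caps the gain from repeated terms; b scales the length penalty.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "python decorators wrap functions".split(),
    "notes on cooking pasta".split(),
    "python python python tips".split(),
]
scores = bm25_scores(["python", "decorators"], docs)
```

In this toy corpus the first document outranks the third even though the third repeats "python" three times, because term frequency saturates and the rarer term "decorators" carries a higher IDF.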
Vector Search
Vector search embeds the query and the documents as vectors and ranks documents by vector similarity, so it can find semantically related documents even when they share no exact keywords with the query.
Supported Models:
- Sentence Transformers
- GGUF models via llama-cpp-python
- OpenAI-compatible API
- ModelScope Hub
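The ranking step can be sketched as follows, assuming cosine similarity as the metric (the README only says "vector similarity") and using toy three-dimensional vectors in place of real embeddings. `cosine_similarity` and `vector_search` are hypothetical helpers, not SIF functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def vector_search(query_vec, doc_vecs, limit=10):
    """Return up to `limit` (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
doc_vecs = {
    "decorators.md": [0.9, 0.1, 0.0],
    "pasta.md": [0.0, 0.2, 0.9],
    "closures.md": [0.7, 0.3, 0.1],
}
results = vector_search([1.0, 0.0, 0.0], doc_vecs, limit=2)
```

A brute-force scan like this is fine for small collections; larger ones typically need an approximate nearest-neighbor index.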
Hybrid Search
Hybrid search combines BM25 and vector search using Reciprocal Rank Fusion (RRF):
RRF_score(d) = Σ_i 1 / (k + rank_i(d))
Here k is a constant (default: 60), rank_i(d) is the rank of document d in result set i, and the sum runs over every result set i that contains d.
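The formula translates almost directly into code. The sketch below fuses two ranked lists of document IDs with 1-based ranks; `rrf_fuse` and the sample rankings are hypothetical and not taken from SIF's source.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs using Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based, so the top hit contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


bm25_ranking = ["a.md", "b.md", "c.md"]
vector_ranking = ["b.md", "d.md", "a.md"]
fused = rrf_fuse([bm25_ranking, vector_ranking])
```

Because only ranks are used, the BM25 and similarity scores never need to be put on a common scale, and a document that appears in both lists tends to beat one that ranks well in only one.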
Development
Setup
# Clone the repository
git clone https://github.com/zhangtaolab/sif.git
cd sif
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=sif
# Run specific test category
pytest -m unit
pytest -m integration
Code Quality
# Format code
ruff format .
# Lint code
ruff check .
# Type checking
mypy src/sif
Roadmap
- Web UI (browser-based search interface)
- Plugin system for custom parsers
- Multi-language support
- REST API
- Query suggestions
- File system watching for auto-indexing
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'feat: add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
MIT License - see LICENSE for details.
Acknowledgments
SIF is a Python reimplementation of the original TypeScript QMD (Query Markup Documents) project.
Support
- GitHub Issues: github.com/zhangtaolab/sif/issues
- GitHub Discussions: github.com/zhangtaolab/sif/discussions
Happy searching!