DocSIF - Local hybrid search engine for personal knowledge bases

SIF

Requires Python 3.9+. License: MIT. Code style: ruff.

SIF is a local CLI search engine for indexing and searching markdown documents. It combines BM25 full-text search with embedding-based semantic search and keeps all data on your machine.

Features

  • Full-Text Search: BM25 ranking via SQLite FTS5
  • Semantic Search: Vector similarity using configurable embeddings
  • Hybrid Search: Combines BM25 and vector search with RRF fusion
  • Query Expansion: Automatic query enhancement via pseudo-relevance feedback
  • Result Reranking: Cross-encoder reranking with Qwen3-Reranker
  • MCP Server: Model Context Protocol support for AI assistants (stdio + HTTP)
  • Local-First: All data stays on your machine
  • Multiple Backends: sentence-transformers, llama-cpp-python, OpenAI-compatible API, ModelScope

Quick Start

Installation

pip install docsif

For full functionality including embeddings:

pip install "docsif[all]"

Basic Usage

# Add a collection
sif collection add ~/Documents/notes --name my-notes

# Index your documents
sif index update --collection my-notes

# Search
sif search query "python decorators"

Installation

From PyPI (Recommended)

pip install docsif

From Source

git clone https://github.com/zhangtaolab/sif.git
cd sif
pip install -e ".[dev]"

Using pipx

pipx install docsif

Usage

Collection Management

# List all collections
sif collection list

# Add a new collection
sif collection add ~/Work/docs --name work-docs --description "Work documentation"

# Show collection details
sif collection show my-notes

# Rename a collection
sif collection rename my-notes personal-notes

# Delete a collection
sif collection remove old-collection --force

Context Management

Add descriptive context to improve search relevance:

# Add context to a collection
sif context add collection my-notes "These are my personal notes about programming and technology."

# Add global context
sif context add global global "I am a software engineer interested in Python and machine learning."

# List all context
sif context list

Indexing

# Update the index
sif index update --collection my-notes

# Force full reindex
sif index update --collection my-notes --force

# Show indexing status
sif status

Search

# Basic search
sif search query "python decorators"

# Search in specific collection
sif search query "python decorators" -c my-notes

# BM25 keyword search
sif search search "python decorators"

# Limit results
sif search query "python decorators" --limit 20

# Search with explanation
sif search query "AI" --explain

MCP Server

Start the MCP server for integration with AI assistants:

# Start with stdio transport (default)
sif mcp stdio

# Start with HTTP transport
sif mcp http --port 8080

To use SIF from Claude Desktop, add the server to its MCP configuration file (claude_desktop_config.json):

{
  "mcpServers": {
    "sif": {
      "command": "sif",
      "args": ["mcp", "stdio"]
    }
  }
}

Configuration

SIF can be configured via environment variables or a .env file:

# Database location
SIF_DB_PATH=~/.local/share/sif/sif.db

# Embedding model
SIF_MODEL_NAME=Qwen/Qwen3-Embedding-0.6B
SIF_MODEL_PATH=~/models/embedding.gguf

# Chunking settings
SIF_CHUNK_SIZE=512
SIF_CHUNK_OVERLAP=128

# Logging
SIF_LOG_LEVEL=INFO

# MCP server
SIF_MCP_HOST=127.0.0.1
SIF_MCP_PORT=8080
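
The chunking settings above can be illustrated with a short Python sketch. It is not SIF's actual loader or chunker: the function names `load_chunk_settings` and `chunk_words` are hypothetical, the fallback values are simply the ones shown in the listing, and the unit of a chunk (words here) is an assumption, since the listing does not say whether sizes are counted in tokens, words, or characters.

```python
import os


def load_chunk_settings(env=None):
    """Read chunk size and overlap from environment-style settings.

    Falls back to the values shown in the configuration listing
    (an assumption about the defaults, not a confirmed behavior).
    """
    env = os.environ if env is None else env
    size = int(env.get("SIF_CHUNK_SIZE", "512"))
    overlap = int(env.get("SIF_CHUNK_OVERLAP", "128"))
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than chunk size")
    return size, overlap


def chunk_words(words, size, overlap):
    """Split a list of words into windows of `size` that share `overlap` items."""
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        # Stop once the current window has reached the end of the input.
        if start + size >= len(words):
            break
    return chunks


# Small values make the overlap easy to see.
size, overlap = load_chunk_settings({"SIF_CHUNK_SIZE": "4", "SIF_CHUNK_OVERLAP": "1"})
chunks = chunk_words(list("abcdefghij"), size, overlap)
```

With a size of 4 and an overlap of 1, each chunk starts with the last item of the previous one, so text that straddles a chunk boundary still appears intact in at least one chunk.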

Architecture

SIF follows a layered architecture with clean separation of concerns:

┌─────────────────────────────────────────────────────────────┐
│                      Presentation Layer                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │     CLI      │  │  MCP Server  │  │  API (Future)    │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
├─────────────────────────────────────────────────────────────┤
│                      Application Layer                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │   Collection │  │    Index     │  │     Search       │  │
│  │   Manager    │  │   Service    │  │    Service       │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
├─────────────────────────────────────────────────────────────┤
│                      Domain Layer                            │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │  Collection  │  │   Document   │  │     Context      │  │
│  │   Entity     │  │   Entity     │  │     Entity       │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
├─────────────────────────────────────────────────────────────┤
│                      Infrastructure Layer                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │  Repository  │  │   Embedding  │  │    Database      │  │
│  │  (SQLite)    │  │    Model     │  │   Connection     │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
└─────────────────────────────────────────────────────────────┘

See docs/architecture.md for detailed architecture documentation.

Documentation

Search Algorithms

BM25 (Full-Text Search)

BM25 is a probabilistic ranking function that scores documents based on term frequency and inverse document frequency, normalized by document length.

Parameters:

  • k1: Controls term frequency saturation (default: 1.5)
  • b: Controls document length normalization (default: 0.75)
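
To show how k1 and b enter the score, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is an illustration only: SIF delegates ranking to SQLite FTS5, whose internal scoring may differ in detail, and the function name `bm25_scores` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b scales the penalty for documents longer than average;
            # k1 caps how much repeated occurrences can add.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    ["python", "decorators", "wrap", "functions"],
    ["python", "lists"],
    ["rust", "traits"],
]
scores = bm25_scores(["python", "decorators"], docs)
```

The first document matches both query terms and scores highest, the second matches only the common term "python", and the third matches nothing and scores zero.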

Vector Search

Vector search uses embeddings to find semantically similar documents, even if they don't share exact keywords.

Supported Models:

  • Sentence Transformers
  • GGUF models via llama-cpp-python
  • OpenAI-compatible API
  • ModelScope Hub
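
As a rough illustration of the ranking step, the sketch below scores documents by cosine similarity between a query vector and stored document vectors. Cosine similarity is an assumption here, since the text above only says "vector similarity". The function names and the three-dimensional toy vectors are hypothetical, and real embeddings would come from one of the backends listed above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the k (doc_id, similarity) pairs most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings.
doc_vecs = {
    "decorators.md": [0.9, 0.1, 0.0],
    "closures.md": [0.7, 0.3, 0.1],
    "recipes.md": [0.0, 0.2, 0.9],
}
results = top_k([1.0, 0.0, 0.0], doc_vecs, k=2)
```

Because the comparison is between vector directions rather than shared words, a document can rank highly without containing any of the query's exact keywords.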

Hybrid Search

Hybrid search combines BM25 and vector search using Reciprocal Rank Fusion (RRF):

RRF_score(d) = Σ_i 1 / (k + rank_i(d))

Where k is a constant (default: 60), rank_i(d) is the rank of document d in result set i, and the sum runs over the result sets that contain d (here, the BM25 and vector results).
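
The formula can be sketched in a few lines of Python. This is a minimal illustration, not SIF's implementation: the function name `rrf_fuse` and the file names are hypothetical, ranks are assumed to start at 1, and ties are broken alphabetically for determinism.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    `rankings` is a list of result lists, each ordered best first.
    Returns (doc_id, score) pairs sorted by descending fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for documents it contains.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["a.md", "b.md", "c.md"]
vector_hits = ["b.md", "d.md", "a.md"]
fused = rrf_fuse([bm25_hits, vector_hits])
```

Here b.md comes out on top because it ranks well in both lists, while c.md and d.md, each present in only one list, fall to the bottom. Since RRF uses only ranks, the BM25 and similarity scores never need to be put on a common scale.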

Development

Setup

# Clone the repository
git clone https://github.com/zhangtaolab/sif.git
cd sif

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=sif

# Run specific test category
pytest -m unit
pytest -m integration

Code Quality

# Format code
ruff format .

# Lint code
ruff check .

# Type checking
mypy src/sif

Roadmap

  • Web UI (browser-based search interface)
  • Plugin system for custom parsers
  • Multi-language support
  • REST API
  • Query suggestions
  • File system watching for auto-indexing

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'feat: add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

MIT License - see LICENSE for details.

Acknowledgments

SIF is a Python reimplementation of the original TypeScript QMD (Query Markup Documents) project.

Happy searching!
