Private semantic search engine for your local files, powered by Starbucks 2D Matryoshka embeddings
DeskSearch
Search your files by meaning. Faster than Spotlight. 100% private.
0.83ms search · 13MB binary · 100% offline · 50k+ files · 10× ONNX speedup
DeskSearch is a local semantic search engine that understands what you're looking for — not just the words you type. Index your documents, code, and emails in seconds. Search in under a millisecond. Everything stays on your machine.
Available as a Python package (pip install desksearch) and a standalone Rust binary (13MB, zero dependencies).
Features
| | Feature | What it means |
|---|---|---|
| 🔍 | Semantic search | Finds "Q3 earnings" when you search "quarterly revenue" — understands meaning, not just keywords |
| ⚡ | Sub-millisecond search | 0.83ms p50 latency powered by Starbucks 2D Matryoshka embeddings |
| 🔒 | 100% private | All processing runs locally. No cloud. No telemetry. No data ever leaves your machine |
| 📄 | 30+ file types | PDF, DOCX, PPTX, XLSX, code (20+ languages), email, Jupyter notebooks, archives |
| 👁 | Live reindexing | Built-in file watcher auto-detects changes — your index is always current |
| 🎨 | Beautiful web UI | React frontend with dark mode, live search, file preview, and keyboard shortcuts |
| 🦀 | Rust core | 13MB self-contained binary with PDF/DOCX/PPTX/XLSX parsers, hybrid BM25+dense search, embedded frontend |
| ☕ | 3 speed tiers | Fast (2-layer, 32d) · Middle (4-layer, 64d) · Pro (6-layer, 128d) — you pick the trade-off |
| 🔌 | Connector plugins | Local files, email (.eml/.mbox), Chrome bookmarks, Slack exports — extensible architecture |
| ⚙️ | ONNX acceleration | 10× embedding speedup (171 chunks/sec) with INT8 quantization support |
| 📊 | Advanced filters | Filter by file type, date range, size · Sort by relevance, date, size, name · Export as JSON/CSV/text |
| ⭐ | Favorites & recents | Bookmark important files and track recently opened documents |
Quick Start
Python
pip install desksearch
desksearch
# → opens http://localhost:3777
Rust
curl -fsSL https://github.com/wshuai190/desksearch/releases/latest/download/desksearch -o desksearch
chmod +x desksearch
./desksearch
# → opens http://localhost:3777
On first run, DeskSearch walks you through an onboarding wizard to pick folders and a speed tier. After that, one command is all you need.
CLI Examples
# Semantic search from the terminal
desksearch search "machine learning papers"
desksearch search "budget spreadsheet" --type xlsx --json
# Index specific folders
desksearch index ~/Projects ~/Research
# Check index health
desksearch status
desksearch doctor # full health check
# Manage watched folders
desksearch folders add ~/Notes
desksearch folders list
# Switch speed tier
desksearch config set search_speed pro
# Run as a background daemon
desksearch daemon start
desksearch daemon install # auto-start on login (macOS LaunchAgent)
# Benchmark your setup
desksearch benchmark --files 1000
All commands support --json for scripting and automation.
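Because every command accepts --json, the CLI can be driven from a script. The following is a minimal sketch, assuming the `desksearch` executable is on your PATH and that it prints a single JSON document to stdout. The helper names `build_search_command` and `run_search` are hypothetical, and the shape of the JSON output is not assumed here.

```python
import json
import subprocess


def build_search_command(query, file_type=None):
    """Build the argv list for `desksearch search <query> [--type T] --json`."""
    argv = ["desksearch", "search", query]
    if file_type is not None:
        # --type is the flag shown in the CLI examples above.
        argv += ["--type", file_type]
    argv.append("--json")
    return argv


def run_search(query, file_type=None):
    """Run the CLI and decode stdout as JSON (assumption: stdout is one JSON document)."""
    completed = subprocess.run(
        build_search_command(query, file_type),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

Calling `run_search("budget spreadsheet", file_type="xlsx")` runs the same command as the second CLI example and returns the decoded result.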
DeskSearch vs. Alternatives
| | DeskSearch | Spotlight | Everything | Alfred |
|---|---|---|---|---|
| Search type | Semantic + keyword | Keyword | Filename only | Keyword |
| Understands meaning | ✅ Yes | ❌ | ❌ | ❌ |
| File content search | ✅ 30+ formats | Limited | ❌ | Via plugins |
| Search latency | ~1ms | ~50ms | ~1ms | ~50ms |
| Privacy | 100% local | Local (Siri opt-in) | Local | Local |
| Code-aware | ✅ 20+ languages | Minimal | ❌ | ❌ |
| Extensible | Plugins + REST API | ❌ | ❌ | Workflows |
| Open source | ✅ MIT | ❌ | ❌ | ❌ |
| Cross-platform | macOS, Linux | macOS only | Windows only | macOS only |
Powered by Starbucks Embeddings
DeskSearch uses the Starbucks 2D Matryoshka embedding model, which enables flexible layer × dimension truncation for speed/quality trade-offs.
Paper: Starbucks: Improved Training for 2D Matryoshka Embeddings. Shengyao Zhuang*, Shuai Wang*, Fabio Zheng, Bevan Koopman, Guido Zuccon. ECIR 2026. (*equal contribution)
Unlike traditional embeddings that require fixed dimensions, Starbucks lets you choose both the number of transformer layers AND the embedding dimension at inference time — no retraining needed. DeskSearch leverages this to offer three speed tiers from a single model.
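To illustrate the dimension half of that trade-off, here is a minimal sketch of Matryoshka-style truncation: keep the leading components of a vector and rescale to unit length. The function name `truncate_embedding` is hypothetical, the toy vector is not real model output, and the renormalization step is an assumption (it is common when comparing by cosine similarity), not a confirmed detail of DeskSearch. Layer truncation happens inside the model and is not shown.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


# Toy 4-dimensional vector standing in for a full-size embedding.
full = [3.0, 4.0, 12.0, 0.0]
short = truncate_embedding(full, 2)  # first 2 components, rescaled to length 1
```

The same full-size vector can be cut to 32, 64, or 128 dimensions this way, which is what makes it possible to serve several tiers from one model.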
If you use DeskSearch or the Starbucks model in your research, please cite:
@inproceedings{wang2026starbucks,
title={Starbucks: Improved Training for 2D Matryoshka Embeddings},
author={Zhuang, Shengyao and Wang, Shuai and Zheng, Fabio and Koopman, Bevan and Zuccon, Guido},
booktitle={ECIR},
year={2026}
}
Architecture
DeskSearch runs hybrid retrieval: every query hits a Tantivy BM25 index and a FAISS dense vector index in parallel, and the two result lists are merged with Reciprocal Rank Fusion (RRF). Embeddings come from the Starbucks 2D Matryoshka model with layer and dimension truncation, running on ONNX Runtime for a 10× speedup over PyTorch (171 chunks/sec vs. 17). The indexing pipeline parses files across 6 parallel workers, chunks text at sentence boundaries, and embeds in batches of 256 to maximize throughput. A connector plugin system (v0.6.0) pulls in data from local files, email, Chrome bookmarks, and Slack exports through a unified API.
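The chunking step can be pictured with a small sketch. It assumes a simple punctuation-based sentence splitter and a 512-character budget per chunk. The function name `chunk_text` and the greedy packing strategy are illustrative assumptions, not the actual DeskSearch implementation.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_text(text, max_chars=512):
    """Greedily pack whole sentences into chunks of at most `max_chars` characters."""
    chunks = []
    current = ""
    for sentence in SENTENCE_END.split(text.strip()):
        if not sentence:
            continue
        # A single sentence longer than the budget is hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The sentence does not fit: close the current chunk, start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


example = chunk_text("One. Two. Three.", max_chars=9)
```

With the tiny 9-character budget used here, the first two sentences share a chunk and the third starts a new one.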
Your Files (PDF, DOCX, Markdown, Code, ...)
            │
            ▼  Parse → Chunk → Embed
┌──────────────────────────────────────┐
│  30+ parsers → 512-char chunks       │
│  → Starbucks 2D Matryoshka (ONNX)    │
└───────────┬──────────────────────────┘
            │
    ┌───────┴────────┐
    ▼                ▼
BM25 (Tantivy)   FAISS (dense)
keyword index    semantic index
    │                │
    └───────┬────────┘
            ▼
  Reciprocal Rank Fusion
            │
            ▼
Ranked Results + Snippets
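The fusion step at the bottom of the diagram can be sketched in a few lines. Reciprocal Rank Fusion gives each document a score of 1 / (k + rank) for every list it appears in and sums those scores. The constant k = 60 is a commonly used default and an assumption here, as are the function name and the sample file names. DeskSearch's actual parameters are not stated in this document.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs into one list of (doc_id, score) pairs."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_hits = ["a.pdf", "b.md", "c.txt"]   # keyword ranking (hypothetical)
dense_hits = ["b.md", "d.py", "a.pdf"]   # semantic ranking (hypothetical)
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Here `b.md` ends up first because it ranks highly in both lists, while documents found by only one index fall to the bottom. RRF uses ranks only, so BM25 scores and vector similarities never need to be put on a common scale.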
Connectors (v0.6.0)
DeskSearch supports pluggable data connectors to index content from multiple sources:
| Connector | What it does |
|---|---|
| Local files | File system scanning with scheduled sync and live re-indexing |
| Email | Parse .eml and .mbox files with sender, subject, and date extraction |
| Chrome bookmarks | Read your Chrome profile's bookmark hierarchy |
| Slack export | Import Slack ZIP exports with username resolution |
Manage connectors via the API (/api/connectors/v2/) or the web UI settings panel. The ConnectorRegistry handles discovery, configuration, and sync scheduling.
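As an illustration of what the email connector extracts, the sketch below pulls sender, subject, and date from a raw message using only the Python standard library. It is not the DeskSearch connector code. The function name `extract_email_fields`, the returned dictionary keys, and the sample message are all hypothetical.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def extract_email_fields(raw_message):
    """Return sender, subject, and date from the text of a single .eml message."""
    msg = message_from_string(raw_message)
    sender_name, sender_address = parseaddr(msg.get("From", ""))
    date_header = msg.get("Date")
    return {
        "sender": sender_address,
        "sender_name": sender_name,
        "subject": msg.get("Subject", ""),
        # Parsed into a datetime when the header is present, otherwise None.
        "date": parsedate_to_datetime(date_header) if date_header else None,
    }


SAMPLE = (
    "From: Ada Example <ada@example.com>\n"
    "Subject: Q3 earnings draft\n"
    "Date: Mon, 06 Oct 2025 09:30:00 +0000\n"
    "\n"
    "Numbers attached.\n"
)
fields = extract_email_fields(SAMPLE)
```

For .mbox archives, the standard library's `mailbox.mbox` class iterates over the contained messages, and the same header lookups apply to each one.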
Speed Tiers
| Tier | Layers | Dimensions | Best for |
|---|---|---|---|
| fast | 2 | 32 | Large corpora, older hardware |
| middle | 4 | 64 | Default: balanced speed and quality |
| pro | 6 | 128 | Best accuracy, research use |
desksearch config set search_speed pro
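To get a feel for what the dimension choice costs in storage, the sketch below encodes the tier table and estimates raw vector size. It assumes 4-byte float32 values stored uncompressed and ignores any index overhead; the real on-disk size depends on how the FAISS index is configured. `SpeedTier`, `TIERS`, and `vector_storage_bytes` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeedTier:
    layers: int  # transformer layers run at inference time
    dims: int    # embedding dimensions kept


# Values copied from the tier table above.
TIERS = {
    "fast": SpeedTier(layers=2, dims=32),
    "middle": SpeedTier(layers=4, dims=64),
    "pro": SpeedTier(layers=6, dims=128),
}


def vector_storage_bytes(tier, num_chunks, bytes_per_value=4):
    """Raw vector storage, assuming uncompressed float32 values (an assumption)."""
    return TIERS[tier].dims * bytes_per_value * num_chunks


# Estimate for one million indexed chunks in each tier.
estimates = {name: vector_storage_bytes(name, 1_000_000) for name in TIERS}
```

Under these assumptions, one million chunks take about 128 MB of vectors in the fast tier, 256 MB in middle, and 512 MB in pro, so each step up doubles the raw vector storage.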
API
DeskSearch exposes a REST API on localhost:3777:
curl "http://localhost:3777/api/search?q=quarterly+revenue&limit=5"
curl "http://localhost:3777/api/status"
curl "http://localhost:3777/api/health"
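The same search endpoint can be called from Python without third-party packages. This is a minimal sketch, assuming the server is running on localhost:3777 and that /api/search responds with a JSON body. The response schema is not documented here, so the result is returned as decoded JSON without further interpretation. `build_search_url` and `search` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:3777"


def build_search_url(query, limit=5, base_url=BASE_URL):
    """Build the /api/search URL with a properly encoded query string."""
    return f"{base_url}/api/search?{urlencode({'q': query, 'limit': limit})}"


def search(query, limit=5, timeout=5.0):
    """Call the search endpoint and decode the body (assumption: it is JSON)."""
    with urlopen(build_search_url(query, limit), timeout=timeout) as response:
        return json.load(response)
```

`build_search_url("quarterly revenue")` produces the same URL as the first curl command above, and `search("quarterly revenue")` sends the request.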
Python SDK
from desksearch import DeskSearch

with DeskSearch() as ds:
    results = ds.search("quarterly revenue", limit=5)
    for r in results:
        print(f"{r.rank}. {r.filename} ({r.score:.3f})")
        print(f"   {r.snippet}\n")
Contributing
Contributions welcome. DeskSearch is MIT-licensed.
git clone https://github.com/wshuai190/desksearch.git
cd desksearch
pip install -e ".[dev]"
pytest
License
MIT