Local semantic search over your files — BM25 + embeddings (OpenAI or 100% offline via fastembed) with smart chunking, document extraction, and real-time file watching.
LSS — Local Semantic Search
Hybrid semantic search over local files. BM25 + OpenAI embeddings fused with Reciprocal Rank Fusion. Real-time file watching. Runs on any machine.
lss "authentication JWT" # search current directory
lss "deploy kubernetes" ~/Projects # search a specific path
lss "rate limiting" --json # machine-readable output
0.93 NDCG@10 on our golden set. Beats ColBERTv2, Voyage-2, and Cohere embed-v3 on BEIR SciFact. See EVALS.md for full benchmarks.
Install
# One-liner (auto-detects pipx/uv/pip)
curl -fsSL https://raw.githubusercontent.com/kortix-ai/lss/main/install.sh | bash
Or install directly:
pipx install local-semantic-search # recommended — isolated install
pip install local-semantic-search # classic
uv tool install local-semantic-search # if you use uv
Set your OpenAI API key:
export OPENAI_API_KEY="sk-..." # add to ~/.zshrc or ~/.bashrc
That's it. No other dependencies, no GPU, no Docker.
Usage
Search
lss "Marko" # searches current directory
lss "Marko" ~/Documents # explicit path (last arg if it exists on disk)
lss "Marko" -p ~/Documents # explicit path with flag
lss "auth JWT" "deploy k8s" # multiple queries
lss "database connection" --json # JSON output for scripting
lss "config" -k 5 # top 5 results
lss "error handling" | head # pipe-friendly (colors auto-off)
First search auto-indexes the directory. Subsequent searches use cached embeddings.
Index
lss index ~/Projects # index without searching
lss index . # index current directory
lss index ~/Documents --yes # skip confirmation prompt
Manage
lss status # show DB stats, watched paths, config
lss ls # list all indexed files
lss sweep --clear-all # wipe the database
# Watch paths (for lss-sync daemon)
lss watch add ~/Documents
lss watch add ~/Projects
lss watch list
lss watch remove ~/Documents
# Exclude patterns
lss exclude add "*.log"
lss exclude add "*.min.js"
lss exclude list
File Watcher
lss-sync # watch paths from config
lss-sync --watch ~/Projects # watch specific path
lss-sync --watch ~/a --watch ~/b # multiple paths
Uses FSEvents (macOS) / inotify (Linux) to detect file changes and re-index in real time with debounced batching.
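To illustrate what "debounced batching" means here, the sketch below collects change events and flushes them as one batch once no new event has arrived for a short quiet period. It is a minimal illustration, not the lss-sync implementation: the `DebouncedBatcher` class, its method names, and the 0.5 s default are all hypothetical.

```python
import time
from typing import Callable, Optional, Set


class DebouncedBatcher:
    """Collect changed paths and flush them once the event stream goes quiet.

    Hypothetical helper for illustration; not part of the lss codebase.
    """

    def __init__(
        self,
        flush: Callable[[Set[str]], None],
        quiet_seconds: float = 0.5,  # assumed value, not an lss default
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._flush = flush
        self._quiet = quiet_seconds
        self._clock = clock
        self._pending: Set[str] = set()
        self._last_event: Optional[float] = None

    def notify(self, path: str) -> None:
        """Record a change event; repeated events for one path collapse."""
        self._pending.add(path)
        self._last_event = self._clock()

    def poll(self) -> bool:
        """Flush the batch if the quiet period has elapsed. True on flush."""
        if not self._pending or self._last_event is None:
            return False
        if self._clock() - self._last_event < self._quiet:
            return False
        batch, self._pending = self._pending, set()
        self._flush(batch)
        return True
```

A watcher callback would call `notify()` for every file event and a timer loop would call `poll()`; an editor that saves the same file several times in quick succession then triggers a single re-index.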
Evaluate
lss eval # run search quality evaluation
lss eval --json # machine-readable
How It Works
    query "JWT authentication"
              |
         ┌────┴─────┐
         v          v
       BM25      Embedding
     (FTS5 +     (OpenAI API +
      custom      cosine sim)
      rescore)
         |          |
         └────┬─────┘
              v
    Reciprocal Rank Fusion
              |
      Post-fusion boosts
    (Jaccard, phrase, digit)
              |
       MMR re-ranking
         (diversity)
              |
           results
- BM25 — SQLite FTS5 retrieves candidates by keyword, then our custom BM25 re-scorer ranks them with proper TF saturation and IDF weighting (k1=1.2, b=0.75).
- Embedding — Query and top documents are embedded via text-embedding-3-small (256 dims). Cached in SQLite + LRU, so repeated searches make zero API calls.
- RRF — Reciprocal Rank Fusion merges both ranked lists. No score calibration needed.
- Boosts — Jaccard overlap, phrase matching, and digit co-mention features fine-tune ordering.
- MMR — Maximal Marginal Relevance removes near-duplicate chunks for diverse results.
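The fusion step in the list above can be sketched in a few lines. This is the standard Reciprocal Rank Fusion formula, score(d) = Σ 1 / (k + rank(d)), with k = 60 matching the documented RRF_K default; the function name and the tie-breaking rule are assumptions made for the example, not lss internals.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document ids into one ranking.

    Only ranks are used, so BM25 scores and cosine similarities never
    need to be put on a common scale.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id (an illustrative choice).
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]
embedding_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
```

Here "b" ranks second and first, which beats "a" at first and third, so the fused order is b, a, c.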
See ARCHITECTURE.md for the full pipeline with timing data.
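For readers unfamiliar with Maximal Marginal Relevance, the following is a generic greedy MMR sketch. It demotes chunks that are too similar to ones already selected. The function names, the `lam` trade-off value of 0.7, and the use of plain cosine similarity are assumptions for illustration; the actual lss re-ranker may differ in all three.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(
    candidates: List[str],
    relevance: Dict[str, float],
    vectors: Dict[str, Sequence[float]],
    top_k: int,
    lam: float = 0.7,  # assumed trade-off: 1.0 = pure relevance
) -> List[str]:
    """Greedily pick items that are relevant but unlike those already picked."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < top_k:

        def mmr_score(doc_id: str) -> float:
            redundancy = max(
                (cosine(vectors[doc_id], vectors[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[doc_id] - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-identical chunks and one distinct chunk, the second pick goes to the distinct chunk even though its relevance is lower, which is the diversity effect described above.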
Search Quality
Golden Set (40 queries, 30-file project corpus)
| Method | NDCG@10 | MRR@10 | Recall@10 |
|---|---|---|---|
| hybrid | 0.932 | 1.000 | 0.948 |
| bm25 | 0.888 | 0.971 | 0.895 |
| embedding | 0.901 | 0.988 | 0.930 |
BEIR SciFact (5,183 docs, 300 queries) — NDCG@10
| System | NDCG@10 |
|---|---|
| lss hybrid | 0.729 |
| Cohere embed-v3 | 0.717 |
| Voyage-2 | 0.713 |
| text-embedding-3-small | 0.694 |
| ColBERTv2 | 0.693 |
| BM25 (Anserini) | 0.665 |
Full results and methodology: EVALS.md
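As a reading aid for the metrics above, here is a small sketch of how NDCG@k and reciprocal rank are commonly computed for a single query. It uses the linear-gain variant of DCG; whether `lss eval` uses this exact variant is not stated here, so treat the functions as illustrative rather than as the project's evaluation code.

```python
import math
from typing import Dict, List, Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: position i (0-based) is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal_dcg = dcg(sorted(relevance.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """1 / rank of the first relevant result within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if relevance.get(doc_id, 0.0) > 0:
            return 1.0 / rank
    return 0.0
```

The reported numbers are averages of such per-query values; an MRR@10 of 1.000 means the top result was relevant for every query in the set.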
Performance
| Scenario | Latency |
|---|---|
| Cold search (first query, no cache) | 400-800 ms |
| Warm search (embeddings cached in SQLite) | 100-200 ms |
| Hot search (all in LRU memory) | 50-150 ms |
| Re-index unchanged files | 0.2 ms/file |
| Index 500 files | ~4s |
The OpenAI API call is the bottleneck on a cold search. After the first search, embeddings are served from the cache.
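The cold, warm, and hot rows in the table correspond to three lookup tiers. The sketch below shows one way such a two-level cache could be wired: an in-memory LRU in front of a SQLite table, with the embedding function called only on a full miss. The class name, table schema, and LRU size are hypothetical, and `embed_fn` stands in for the real API call.

```python
import sqlite3
import struct
from collections import OrderedDict
from typing import Callable, List


class EmbeddingCache:
    """In-memory LRU in front of a SQLite table (illustrative sketch only)."""

    def __init__(
        self,
        embed_fn: Callable[[str], List[float]],
        db_path: str = ":memory:",
        lru_size: int = 1024,  # assumed size
    ) -> None:
        self._embed_fn = embed_fn
        self._lru: "OrderedDict[str, List[float]]" = OrderedDict()
        self._lru_size = lru_size
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS emb (text TEXT PRIMARY KEY, vec BLOB)"
        )

    def get(self, text: str) -> List[float]:
        if text in self._lru:  # hot: served from memory
            self._lru.move_to_end(text)
            return self._lru[text]
        row = self._db.execute(
            "SELECT vec FROM emb WHERE text = ?", (text,)
        ).fetchone()
        if row is not None:  # warm: served from SQLite
            vec = list(struct.unpack(f"{len(row[0]) // 8}d", row[0]))
        else:  # cold: compute once, then persist
            vec = self._embed_fn(text)
            blob = struct.pack(f"{len(vec)}d", *vec)
            self._db.execute("INSERT INTO emb VALUES (?, ?)", (text, blob))
            self._db.commit()
        self._lru[text] = vec
        if len(self._lru) > self._lru_size:
            self._lru.popitem(last=False)  # evict the least recently used entry
        return vec
```

Keying the cache on the text itself keeps the sketch short; a real store would more likely key on a content hash plus the model name and dimension.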
Configuration
Environment variables
| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | (required) | OpenAI API key |
| OPENAI_MODEL | text-embedding-3-small | Embedding model |
| OPENAI_DIM | 256 | Embedding dimensions |
| LSS_DIR | ~/.lss | Data directory |
| LSS_MAX_FILE_SIZE | 2097152 (2 MB) | Max file size to index |
| BM25_K1 | 1.2 | BM25 term frequency saturation |
| BM25_B | 0.75 | BM25 document length normalization |
| RRF_K | 60 | RRF smoothing constant |
| NO_COLOR | (unset) | Disable ANSI colors |
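To make the roles of BM25_K1 and BM25_B concrete, the sketch below reads both variables with their documented defaults and plugs them into the textbook Okapi BM25 formula. The `bm25_score` function and its arguments are assumptions for illustration; the re-scorer in lss may tokenize and weight terms differently.

```python
import math
import os
from typing import Dict, List

# Documented defaults; the environment overrides them when set.
K1 = float(os.environ.get("BM25_K1", "1.2"))
B = float(os.environ.get("BM25_B", "0.75"))


def bm25_score(
    query_terms: List[str],
    doc_terms: List[str],
    doc_freq: Dict[str, int],  # number of documents containing each term
    n_docs: int,
    avg_len: float,
    k1: float = K1,
    b: float = B,
) -> float:
    """Textbook Okapi BM25 score of one document for one query."""
    score = 0.0
    # b controls how strongly long documents are penalised.
    length_norm = 1.0 - b + b * len(doc_terms) / avg_len
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        # k1 controls saturation: each repeat of a term adds less than the last.
        score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
    return score
```

With k1 = 1.2, a term that appears twice in an average-length document scores about 1.4 times a single occurrence rather than 2 times, which is the saturation effect the variable description refers to.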
Config file (~/.lss/config.json)
{
"watch_paths": ["/home/user/Documents", "/home/user/Projects"],
"exclude_patterns": ["*.log", "*.min.js", "generated"]
}
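The following sketch shows one plausible reading of exclude_patterns: each pattern is a shell-style glob tested against every component of a path, so "*.log" matches files by name and "generated" matches a directory of that name. This matching rule is an assumption made for the example, and both helper functions are hypothetical; check the lss source for the exact semantics.

```python
import fnmatch
import json
from pathlib import PurePosixPath
from typing import Iterable, List


def load_exclude_patterns(config_text: str) -> List[str]:
    """Read exclude_patterns from the JSON config text (empty list if absent)."""
    return list(json.loads(config_text).get("exclude_patterns", []))


def is_excluded(path: str, patterns: Iterable[str]) -> bool:
    """True if any path component matches any glob pattern (assumed rule)."""
    parts = PurePosixPath(path).parts
    return any(
        fnmatch.fnmatchcase(part, pattern)
        for pattern in patterns
        for part in parts
    )


config_text = """
{
  "watch_paths": ["/home/user/Documents", "/home/user/Projects"],
  "exclude_patterns": ["*.log", "*.min.js", "generated"]
}
"""
patterns = load_exclude_patterns(config_text)
```

Under this rule, `is_excluded("generated/api.py", patterns)` is true because of the directory component, while `is_excluded("src/main.py", patterns)` is false.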
Programmatic Use
from semantic_search import semantic_search
from lss_store import ingest_many, discover_files

# Index a directory
files = discover_files("/path/to/project")
ingest_many(files)

# Search: queries are passed as a list; results[0] holds the hits for the first query
results = semantic_search("/path/to/project", ["JWT authentication"])
for hit in results[0]:
    print(f" {hit['score']:.3f} {hit['file']} {hit['text'][:80]}")
Project Layout
lss_config.py Config: paths, env vars, load/save
lss_store.py Indexing: file discovery, text extraction, FTS5 storage
lss_cli.py CLI: search, index, status, watch, exclude, eval
lss_sync.py File watcher daemon (watchdog + debounced indexing)
semantic_search.py Search engine: BM25, embeddings, RRF, PRF, MMR
ARCHITECTURE.md Full technical pipeline reference
EVALS.md Search quality benchmarks vs published systems
tests/ 90 tests (unit, e2e, benchmarks, search quality, BEIR)
License
Apache-2.0