Local semantic search over your files — BM25 + embeddings (OpenAI or 100% offline via fastembed) with smart chunking, document extraction, and real-time file watching.
LSS — Local Semantic Search
Hybrid semantic search over local files. BM25 + embeddings fused with Reciprocal Rank Fusion. Real-time file watching. Runs 100% offline or with OpenAI — your choice.
lss "authentication JWT" # search current directory
lss "deploy kubernetes" ~/Projects # search a specific path
lss "rate limiting" --json # machine-readable output
0.91 NDCG@10 on our golden set. Beats ColBERTv2, Voyage-2, and Cohere embed-v3 on BEIR SciFact. Works offline with local embeddings or with OpenAI for maximum quality. See EVALS.md for full benchmarks.
Install
# One-liner (auto-detects pipx/uv/pip)
curl -fsSL https://raw.githubusercontent.com/kortix-ai/lss/main/install.sh | bash
Or install directly:
pipx install local-semantic-search # recommended — isolated install
pip install local-semantic-search # classic
uv tool install local-semantic-search # if you use uv
Choose your embedding provider
Option A: 100% offline (no API key needed)
pip install 'local-semantic-search[local]' # installs fastembed
lss config provider local # switch to local embeddings
Uses BAAI/bge-small-en-v1.5 (384 dims, ~125 MB download on first run). No network calls, no API key, no cost.
Option B: OpenAI embeddings (higher quality)
export OPENAI_API_KEY="sk-..." # add to ~/.zshrc or ~/.bashrc
Uses text-embedding-3-small (256 dims). Cached permanently — repeated searches cost zero API calls.
Auto-detection: If no provider is configured, lss checks for OPENAI_API_KEY first, then falls back to fastembed if installed.
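The detection order described above can be sketched as follows. This is a minimal illustration, not the lss implementation: the function name `detect_provider` and its parameters are hypothetical, and the only facts taken from this README are the order (explicit setting, then `OPENAI_API_KEY`, then fastembed if installed).

```python
import importlib.util
import os


def detect_provider(env=None, configured=None, fastembed_available=None):
    """Pick an embedding provider in the order the README describes.

    Hypothetical helper: names and signature are assumptions, not lss internals.
    """
    env = os.environ if env is None else env
    if configured in ("openai", "local"):
        return configured  # an explicit setting wins over auto-detection
    if env.get("OPENAI_API_KEY"):
        return "openai"  # a key is present, so OpenAI is checked first
    if fastembed_available is None:
        # Probe for the optional dependency without importing it.
        fastembed_available = importlib.util.find_spec("fastembed") is not None
    if fastembed_available:
        return "local"
    return None  # nothing usable; the caller should report an error


print(detect_provider(env={"OPENAI_API_KEY": "sk-test"}))
print(detect_provider(env={}, fastembed_available=True))
```

Passing `env` and `fastembed_available` explicitly keeps the sketch testable without touching the real environment.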
Usage
Search
lss "Marko" # searches current directory
lss "Marko" ~/Documents # explicit path (last arg if it exists on disk)
lss "Marko" -p ~/Documents # explicit path with flag
lss "auth JWT" "deploy k8s" # multiple queries
lss "database connection" --json # JSON output for scripting
lss "config" -k 5 # top 5 results
lss "error handling" | head # pipe-friendly (colors auto-off)
# Query-time filters
lss "auth" -e .py -e .ts # only search Python and TypeScript files
lss "config" -E .json -E .yaml # exclude JSON and YAML files
lss "user data" -x '\d{4}-\d{2}-\d{2}' # exclude chunks matching regex
lss "auth" -e .py -x "test_" # combine: only .py, exclude test files
First search auto-indexes the directory. Subsequent searches use cached embeddings.
Index
lss index ~/Projects # index without searching
lss index . # index current directory
lss index ~/Documents --yes # skip confirmation prompt
Manage
lss status # show DB stats, watched paths, provider, config
lss ls # list all indexed files
lss sweep --clear-all # wipe the database
# Watch paths (for lss-sync daemon)
lss watch add ~/Documents
lss watch add ~/Projects
lss watch list
lss watch remove ~/Documents
# Exclude patterns
lss exclude add "*.log"
lss exclude add "*.min.js"
lss exclude list
# Include custom file extensions
lss include add .rst
lss include add .tex
lss include list
Search Filters
Filter results at query time without re-indexing:
# Extension include (-e / --ext): only return results from these file types
lss "authentication" -e .py # only Python files
lss "config" -e .yaml -e .toml -e .json # only config files
# Extension exclude (-E / --exclude-ext): exclude these file types
lss "database" -E .sql # everything except SQL
lss "error" -E .log -E .txt # skip logs and text
# Content regex exclude (-x / --exclude-pattern): filter out matching chunks
lss "user data" -x '\d{4}-\d{2}-\d{2}' # exclude date patterns
lss "auth" -x "test_" -x "mock_" # exclude test/mock code
lss "config" -x "(?i)deprecated" # case-insensitive exclude
# Combine all filters
lss "authentication" -e .py -e .ts -x "test_" -x "fixture"
Extension filters are applied in SQL (efficient, pre-scoring). Content regex exclusion is applied post-scoring.
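To make the two filter stages concrete, here is a small sketch using an in-memory SQLite database. The `chunks(path, ext, text, score)` table and the function `search_with_filters` are hypothetical stand-ins, since the real lss schema is not shown here. The stored `score` column simply plays the role of the ranking step.

```python
import re
import sqlite3


def search_with_filters(conn, include_exts=(), exclude_exts=(), exclude_patterns=()):
    """Filter by extension in SQL, then drop chunks whose text matches a regex.

    Hypothetical schema and function; only the two-stage idea comes from the README.
    """
    sql, params = "SELECT path, text, score FROM chunks WHERE 1=1", []
    if include_exts:
        sql += " AND ext IN (%s)" % ",".join("?" * len(include_exts))
        params += list(include_exts)
    if exclude_exts:
        sql += " AND ext NOT IN (%s)" % ",".join("?" * len(exclude_exts))
        params += list(exclude_exts)
    rows = conn.execute(sql + " ORDER BY score DESC", params).fetchall()

    # Second stage: regex exclusion runs only on rows that survived the SQL filter.
    compiled = [re.compile(p) for p in exclude_patterns]
    return [r for r in rows if not any(c.search(r[1]) for c in compiled)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (path TEXT, ext TEXT, text TEXT, score REAL)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?)",
    [
        ("auth.py", ".py", "def login(user): ...", 0.9),
        ("test_auth.py", ".py", "def test_login(): ...", 0.8),
        ("auth.sql", ".sql", "SELECT * FROM users", 0.7),
    ],
)
hits = search_with_filters(conn, include_exts=[".py"], exclude_patterns=["test_"])
print([h[0] for h in hits])
```

Note that the regex is matched against chunk text, not file names, which mirrors the description of `-x` as a content filter.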
Configuration
lss config show # display all configuration
lss config provider local # switch to local embeddings
lss config provider openai # switch to OpenAI embeddings
File Watcher
lss-sync # watch paths from config
lss-sync --watch ~/Projects # watch specific path
lss-sync --watch ~/a --watch ~/b # multiple paths
Uses FSEvents (macOS) / inotify (Linux) to detect file changes and re-index in real time with debounced batching.
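Debounced batching can be pictured roughly like this. The class below is an illustrative sketch only: `DebouncedBatch`, `quiet_seconds`, and the one-second window are assumptions, and the real daemon's event handling is not documented here. Timestamps are passed in explicitly so the behaviour is easy to follow.

```python
class DebouncedBatch:
    """Collect changed paths and release them once events go quiet.

    Hypothetical helper; names and the quiet window are assumptions.
    """

    def __init__(self, quiet_seconds=1.0):
        self.quiet_seconds = quiet_seconds
        self._pending = set()
        self._last_event = None

    def add(self, path, now):
        self._pending.add(path)  # repeated events for one file collapse into one
        self._last_event = now

    def flush_if_quiet(self, now):
        """Return the sorted batch if no event arrived recently, else an empty list."""
        if not self._pending or now - self._last_event < self.quiet_seconds:
            return []
        batch = sorted(self._pending)
        self._pending.clear()
        return batch


batcher = DebouncedBatch(quiet_seconds=1.0)
batcher.add("notes.md", now=0.0)
batcher.add("notes.md", now=0.2)  # the editor saved the same file twice
batcher.add("todo.md", now=0.4)
print(batcher.flush_if_quiet(now=0.9))  # still inside the quiet window
print(batcher.flush_if_quiet(now=1.5))  # quiet long enough, batch is released
```

The point of the pattern is that a burst of saves triggers one re-index pass per file rather than one per filesystem event.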
Evaluate
lss eval # run search quality evaluation
lss eval --json # machine-readable
Update
lss update # check for new version and upgrade
How It Works
        query "JWT authentication"
                  |
             ┌────┴─────┐
             v          v
           BM25     Embedding
         (FTS5 +    (OpenAI or
          custom    fastembed +
         rescore)   cosine sim)
             |          |
             └────┬─────┘
                  v
        Reciprocal Rank Fusion
                  |
          Post-fusion boosts
       (Jaccard, phrase, digit)
                  |
            MMR re-ranking
              (diversity)
                  |
               results
- BM25: SQLite FTS5 retrieves candidates by keyword, then our custom BM25 re-scorer ranks them with proper TF saturation and IDF weighting (k1=1.2, b=0.75).
- Embedding: Query and top documents are embedded via OpenAI text-embedding-3-small (256d) or local fastembed bge-small-en-v1.5 (384d). Cached in SQLite + LRU, so repeated searches hit zero API calls.
- RRF: Reciprocal Rank Fusion merges both ranked lists. No score calibration needed.
- Boosts: Jaccard overlap, phrase matching, and digit co-mention features fine-tune ordering.
- MMR: Maximal Marginal Relevance removes near-duplicate chunks for diverse results.
See ARCHITECTURE.md for the full pipeline with timing data.
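As an illustration of the fusion step, the sketch below implements standard Reciprocal Rank Fusion, where each list contributes `1 / (k + rank)` per document. The function name is hypothetical and this is not taken from the lss source. The default `k=60` matches the `RRF_K` default listed under Configuration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document ids using standard RRF.

    Hypothetical helper: each list adds 1 / (k + rank) for every id it contains.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["a.py", "b.py", "c.py"]
embedding_ranking = ["b.py", "d.py", "a.py"]
print(reciprocal_rank_fusion([bm25_ranking, embedding_ranking]))
# → ['b.py', 'a.py', 'd.py', 'c.py']
```

Because only ranks are used, the BM25 and cosine scores never need to be put on a common scale, which is why no score calibration is required.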
Document Formats
LSS indexes all common text, code, and document formats:
| Category | Extensions |
|---|---|
| Code | .py, .js, .ts, .go, .rs, .java, .c, .cpp, .rb, .php, .swift, .kt, and 40+ more |
| Markup | .md, .rst, .tex, .html, .xml, .yaml, .json, .toml |
| Documents | .pdf, .docx, .xlsx, .pptx, .html, .eml |
| Data | .csv, .jsonl, .tsv |
| Config | .env.example, .gitignore, .dockerignore, Makefile, Dockerfile |
Document extraction uses lightweight per-format libraries (pdfminer.six, python-docx, openpyxl, python-pptx, beautifulsoup4). All are optional — if a library isn't installed, that format is skipped silently.
Unknown extensions are skipped by default. Add custom extensions with lss include add .ext.
Smart Chunking
Files are chunked intelligently based on file type:
| File type | Strategy | Split boundaries |
|---|---|---|
| Markdown (.md) | Heading-aware | Splits on # heading lines, preserving document structure |
| Python (.py) | Definition-aware | Splits on def/class boundaries, keeping functions intact |
| Everything else | Word-window | 220 words per chunk, 200-word stride (20-word overlap) |
Each chunk gets an MD5 hash (for dedup and embedding cache), an FTS5 entry (for BM25), and a lazily-computed embedding vector.
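The word-window strategy and the per-chunk hash can be sketched as below. The window size (220) and stride (200) come from the table above; the function name, the dictionary layout, and the choice to hash the joined chunk text are assumptions made for illustration.

```python
import hashlib


def word_window_chunks(text, size=220, stride=200):
    """Split text into overlapping word windows and attach an MD5 per chunk.

    Hypothetical helper; only the 220/200 numbers come from the README.
    """
    words = text.split()
    chunks = []
    for start in range(0, len(words), stride):
        body = " ".join(words[start:start + size])
        chunks.append(
            {
                "start_word": start,
                "text": body,
                # The hash identifies identical chunk text (dedup, cache key).
                "md5": hashlib.md5(body.encode("utf-8")).hexdigest(),
            }
        )
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


doc = " ".join(f"w{i}" for i in range(500))
chunks = word_window_chunks(doc)
print([(c["start_word"], len(c["text"].split())) for c in chunks])
# → [(0, 220), (200, 220), (400, 100)]
```

With a 200-word stride and a 220-word window, the last 20 words of one chunk are repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.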
Search Quality
Golden Set (40 queries, 33-file project corpus)
OpenAI embeddings (text-embedding-3-small, 256d):
Method NDCG@10 MRR@10 Recall@10
───────────────────────────────────────────
hybrid 0.914 1.000 0.936
embedding 0.886 0.988 0.917
bm25 0.885 0.988 0.893
Local embeddings (bge-small-en-v1.5, 384d):
Method NDCG@10 MRR@10 Recall@10
───────────────────────────────────────────
hybrid 0.911 1.000 0.931
embedding 0.894 1.000 0.923
bm25 0.885 0.988 0.893
With hybrid search, local embeddings land within 0.003 NDCG@10 of OpenAI (0.911 vs 0.914, about 0.3%) and are 8x faster, since no network calls are involved.
BEIR SciFact (5,183 docs, 300 queries) — NDCG@10
lss hybrid 0.729
Cohere embed-v3 0.717
Voyage-2 0.713
text-embedding-3-small 0.694
ColBERTv2 0.693
BM25 (Anserini) 0.665
Full results and methodology: EVALS.md
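For readers unfamiliar with the headline metric, here is a minimal sketch of NDCG@10. It uses the common linear-gain form with a log2 position discount. Whether the lss evaluation harness uses linear or exponential gain is not stated here, so treat the exact formula as an assumption. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query; `relevance` maps doc id to a graded judgment."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg else 0.0


judgments = {"a": 1, "b": 1}
print(round(ndcg_at_k(["a", "b", "x"], judgments), 3))  # both relevant docs on top
print(round(ndcg_at_k(["a", "x", "b"], judgments), 3))  # one relevant doc pushed down
```

The reported figures are averages of this per-query value over the whole query set.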
Performance
| Scenario | OpenAI | Local |
|---|---|---|
| Cold search (first query, no cache) | 400-800 ms | 50-200 ms |
| Warm search (embeddings cached) | 100-200 ms | 50-150 ms |
| Hot search (all in LRU memory) | 50-150 ms | 30-100 ms |
| Re-index unchanged files | 0.2 ms/file | 0.2 ms/file |
| Index 500 files | ~4s | ~4s |
With OpenAI, the API call is the bottleneck on cold search. With local embeddings, everything runs on your machine — no network dependency.
File Filtering
LSS uses an inclusion-based approach — only known text/code/document extensions are indexed:
- ~80 known extensions: code, markup, config, documents (see INDEXED_EXTENSIONS in source)
- Known extensionless files: Makefile, Dockerfile, LICENSE, README, etc.
- Directory exclusions: node_modules/, .git/, __pycache__/, .venv/, dist/, build/, and ~70 more
- .gitignore parsing: respects .gitignore patterns in every subtree
- User config: lss exclude add "*.log" and lss include add .ext
- Max file size: 2 MB default (override with LSS_MAX_FILE_SIZE)
Unknown extensions are skipped by default. This prevents indexing binary blobs, generated files, and other junk.
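The inclusion rules listed above can be approximated with a short predicate. This is a simplified sketch: the sets below are tiny hypothetical samples rather than the real `INDEXED_EXTENSIONS` list, `should_index` is an invented name, and `.gitignore` parsing is left out entirely.

```python
import fnmatch
from pathlib import PurePosixPath

# Small illustrative samples; the real lists are much longer.
KNOWN_EXTENSIONS = {".py", ".md", ".json"}
KNOWN_BARE_NAMES = {"Makefile", "Dockerfile", "LICENSE", "README"}
EXCLUDED_DIRS = {"node_modules", ".git", "__pycache__", ".venv", "dist", "build"}
MAX_FILE_SIZE = 2 * 1024 * 1024  # 2 MB default from the README


def should_index(path, size_bytes, exclude_patterns=(), extra_extensions=()):
    """Return True if a file passes the inclusion-based checks (sketch only)."""
    p = PurePosixPath(path)
    if any(part in EXCLUDED_DIRS for part in p.parts[:-1]):
        return False  # the file lives under an excluded directory
    if size_bytes > MAX_FILE_SIZE:
        return False
    if any(fnmatch.fnmatch(p.name, pattern) for pattern in exclude_patterns):
        return False  # matches a user exclude pattern such as "*.log"
    if p.suffix:
        return p.suffix.lower() in (KNOWN_EXTENSIONS | set(extra_extensions))
    return p.name in KNOWN_BARE_NAMES  # extensionless files need a known name


for candidate in ["src/app.py", "node_modules/pkg/index.py", "data.bin", "Makefile"]:
    print(candidate, should_index(candidate, size_bytes=1024))
```

Checking against an allow-list rather than a deny-list is what keeps unknown binary formats out of the index by default.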
Configuration
Environment variables
| Variable | Default | Description |
|---|---|---|
| LSS_PROVIDER | (auto-detect) | Embedding provider: openai or local |
| OPENAI_API_KEY | (none) | OpenAI API key (required for openai provider) |
| OPENAI_MODEL | text-embedding-3-small | OpenAI embedding model |
| OPENAI_DIM | 256 | OpenAI embedding dimensions |
| LSS_DIR | ~/.lss | Data directory |
| LSS_MAX_FILE_SIZE | 2097152 (2 MB) | Max file size to index |
| BM25_K1 | 1.2 | BM25 term frequency saturation |
| BM25_B | 0.75 | BM25 document length normalization |
| RRF_K | 60 | RRF smoothing constant |
| NO_COLOR | (unset) | Disable ANSI colors |
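To show what `BM25_K1` and `BM25_B` control, here is a sketch of the textbook Okapi BM25 formula with those two parameters read from the environment. It is not the lss re-scorer: the function name, the token-list input, and the smoothed IDF variant are assumptions.

```python
import math
import os
from collections import Counter

# Defaults mirror the table above; the variable names come from this README.
K1 = float(os.environ.get("BM25_K1", "1.2"))
B = float(os.environ.get("BM25_B", "0.75"))


def bm25_scores(query_terms, docs, k1=K1, b=B):
    """Score each tokenized document against the query with Okapi BM25 (sketch)."""
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # k1 caps the benefit of repeated terms; b scales the length penalty.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    ["jwt", "auth", "login"],
    ["deploy", "kubernetes", "cluster"],
    ["jwt", "jwt", "jwt"],
]
print([round(s, 3) for s in bm25_scores(["jwt"], corpus, k1=1.2, b=0.75)])
```

In this example the third document repeats the query term three times but scores well under three times the first document, which is the saturation effect that k1 governs.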
Config file (~/.lss/config.json)
{
"embedding_provider": "local",
"watch_paths": ["/home/user/Documents", "/home/user/Projects"],
"exclude_patterns": ["*.log", "*.min.js", "generated"],
"include_extensions": [".rst", ".tex"]
}
Programmatic Use
from semantic_search import semantic_search
from lss_store import ingest_many, discover_files

# Index a directory
all_files, new_files, _ = discover_files("/path/to/project")
ingest_many(new_files)

# Search (results[0] holds the hits for the first query)
results = semantic_search("/path/to/project", ["JWT authentication"])
for hit in results[0]:
    print(f" {hit['score']:.3f} {hit['file']} {hit['text'][:80]}")
Project Layout
lss_config.py Config: paths, env vars, provider detection, load/save
lss_extract.py Document extractors: PDF, DOCX, XLSX, PPTX, HTML, EML, JSON, CSV
lss_store.py Indexing: file discovery, inclusion filtering, smart chunking, FTS5
lss_cli.py CLI: search, index, status, config, watch, include, exclude, eval, update
lss_sync.py File watcher daemon (watchdog + debounced indexing)
semantic_search.py Search: BM25, dual embedding providers, RRF, PRF, MMR
ARCHITECTURE.md Full technical pipeline reference
EVALS.md Search quality benchmarks vs published systems
tests/ 366 tests (unit, e2e, benchmarks, search quality, BEIR)
Tests
366 tests covering extraction, filtering, chunking, storage, CLI, e2e, file watching, embedding providers, and search quality.
python -m pytest tests/ -x -q # run all tests
python -m pytest tests/ -k "not beir" -q # skip BEIR (needs ir-datasets)
License
Apache-2.0