Local semantic search over your files — BM25 + embeddings (OpenAI or 100% offline via fastembed) with smart chunking, document extraction, and real-time file watching.
LSS — Local Semantic Search
Hybrid search over local files. BM25 + embeddings + Reciprocal Rank Fusion. Real-time file watching. Runs offline or with OpenAI.
lss "authentication JWT" # search current directory
lss "deploy kubernetes" ~/Projects # search a specific path
lss "rate limiting" --json # machine-readable output
0.91 NDCG@10 on our golden set. Beats ColBERTv2, Voyage-2, and Cohere embed-v3 on BEIR SciFact. See EVALS.md.
Install
# One-liner (auto-detects pipx/uv/pip)
curl -fsSL https://raw.githubusercontent.com/kortix-ai/lss/main/install.sh | bash
Or directly:
pipx install local-semantic-search # recommended
pip install local-semantic-search
uv tool install local-semantic-search
Embedding provider
Default: OpenAI — if OPENAI_API_KEY is set, lss uses it automatically.
export OPENAI_API_KEY="sk-..." # add to ~/.zshrc or ~/.bashrc
Offline alternative:
pip install 'local-semantic-search[local]'
lss config provider local
Uses bge-small-en-v1.5 (384d, ~125 MB). No API key, no network, no cost. Hybrid NDCG@10 on the golden set is 0.911 versus 0.914 with OpenAI (about 0.3% lower), and it is about 8x faster.
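The provider choice described above can be sketched as a small resolution function. This is an illustration only: `resolve_provider` is a hypothetical helper, not part of lss, and the fallback to the local provider when no API key is present is an assumption rather than documented behavior.

```python
import os
from typing import Mapping, Optional


def resolve_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick an embedding provider name from environment-style settings.

    Hypothetical helper for illustration; not the lss implementation.
    """
    env = os.environ if env is None else env

    # An explicit LSS_PROVIDER setting wins over auto-detection.
    explicit = env.get("LSS_PROVIDER", "").strip().lower()
    if explicit in ("openai", "local"):
        return explicit
    if explicit:
        raise ValueError(f"unknown provider: {explicit!r}")

    # Auto-detect: an API key in the environment selects OpenAI.
    if env.get("OPENAI_API_KEY"):
        return "openai"

    # Assumption: fall back to the offline provider when no key is set.
    return "local"
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise without touching the real environment.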
Usage
Search
lss "query" # current directory
lss "query" ~/Documents # explicit path
lss "auth JWT" "deploy k8s" # multiple queries
lss "config" --json # JSON output
lss "error" -k 5 # top 5
# Filters (applied without re-indexing)
lss "auth" -e .py -e .ts # only these extensions
lss "config" -E .json -E .yaml # exclude extensions
lss "user data" -x '\d{4}-\d{2}-\d{2}' # exclude chunks matching regex
lss "auth" -e .py -x "test_" # combine filters
First search auto-indexes. Subsequent searches use cached embeddings.
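For scripting, the flags listed above can be assembled into an argument list and the `--json` output decoded in Python. The sketch below is hedged: `build_lss_command` and `run_lss` are hypothetical helper names, the ordering of path and flags is assumed to follow the examples above, and nothing is assumed about the shape of the JSON that lss prints.

```python
import json
import subprocess
from typing import List, Optional, Sequence


def build_lss_command(
    query: str,
    path: Optional[str] = None,
    top_k: Optional[int] = None,
    include_ext: Sequence[str] = (),
    exclude_ext: Sequence[str] = (),
    exclude_regex: Optional[str] = None,
) -> List[str]:
    """Build an lss argument list using only flags shown in this README."""
    cmd = ["lss", query]
    if path is not None:
        cmd.append(path)
    if top_k is not None:
        cmd += ["-k", str(top_k)]      # top-k results
    for ext in include_ext:
        cmd += ["-e", ext]             # only these extensions
    for ext in exclude_ext:
        cmd += ["-E", ext]             # exclude these extensions
    if exclude_regex is not None:
        cmd += ["-x", exclude_regex]   # exclude chunks matching regex
    cmd.append("--json")               # machine-readable output
    return cmd


def run_lss(cmd: List[str]):
    """Run lss and decode stdout as JSON; the schema is left to the caller."""
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)
```

Because no shell is involved, a path such as `~/Projects` is not expanded automatically; pass it through `os.path.expanduser` first.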
Index & manage
lss index ~/Projects # index without searching
lss status # DB stats, provider, config
lss ls # list indexed files
lss sweep --clear-all # wipe database
lss watch add ~/Documents # for lss-sync daemon
lss exclude add "*.log" # glob exclusion
lss include add .rst # custom extension
lss config provider local # switch provider
lss eval # run quality benchmarks
lss update # check for updates
File watcher
lss-sync # watch configured paths
lss-sync --watch ~/Projects # watch specific path
How it works
query → BM25 (FTS5 + custom rescore) ─┐
→ Embedding (OpenAI or local) ──┤→ RRF → boosts → MMR → results
- BM25: FTS5 retrieves candidates; a custom re-scorer ranks them with TF saturation + IDF (k1=1.2, b=0.75)
- Embedding: OpenAI text-embedding-3-small (256d) or local bge-small-en-v1.5 (384d), cached permanently
- RRF: Reciprocal Rank Fusion merges both ranked lists
- Boosts: Jaccard overlap, phrase matching, digit co-mention
- MMR: Maximal Marginal Relevance for diversity
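To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of document IDs. It is not the lss implementation: the function name is hypothetical, and the constant `k=60` is the value commonly used in the RRF literature, assumed here because the README does not state which constant lss uses.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Merge ranked lists: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


bm25_ranking = ["a", "b", "c"]
embedding_ranking = ["b", "c", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
```

Documents that appear in both lists ("b" and "c") accumulate two contributions and move ahead of documents that only one retriever found, which is why RRF needs no score normalization between BM25 and embedding similarity.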
See ARCHITECTURE.md for full pipeline detail.
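The final diversity step can be illustrated with a small Maximal Marginal Relevance sketch. The assumptions are stated plainly: cosine similarity over embedding vectors is used for both relevance and redundancy, `lambda_` is a hypothetical trade-off parameter, and none of this is taken from the lss source.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int,
    lambda_: float = 0.5,
) -> List[int]:
    """Greedy MMR: trade off relevance to the query against similarity
    to documents that were already selected. Returns document indices."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected: List[int] = []
    remaining = list(range(len(doc_vecs)))

    def score(i: int) -> float:
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
        )
        return lambda_ * relevance[i] - (1.0 - lambda_) * redundancy

    while remaining and len(selected) < top_k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


docs = [(1.0, 0.9), (1.0, 0.8), (0.2, 1.0)]
diverse = mmr((1.0, 1.0), docs, top_k=2, lambda_=0.5)
relevance_only = mmr((1.0, 1.0), docs, top_k=2, lambda_=1.0)
```

With `lambda_=1.0` the selection is pure relevance and returns the two near-duplicate vectors; lowering it lets the less similar third vector displace the duplicate.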
Supported formats
| Category | Extensions |
|---|---|
| Code | .py, .js, .ts, .go, .rs, .java, .c, .cpp, .rb, .php, .swift, .kt, +40 more |
| Markup | .md, .rst, .tex, .html, .xml, .yaml, .json, .toml |
| Documents | .pdf, .docx, .xlsx, .pptx, .eml |
| Data | .csv, .jsonl, .tsv |
Extraction via pdfminer.six, python-docx, openpyxl, python-pptx, beautifulsoup4 (all optional — missing libs skip silently). Unknown extensions skipped by default; add with lss include add .ext.
Search quality
| Method | NDCG@10 | MRR@10 | Provider |
|---|---|---|---|
| hybrid | 0.914 | 1.000 | OpenAI |
| hybrid | 0.911 | 1.000 | Local |
| bm25 | 0.885 | 0.988 | — |
BEIR SciFact (5,183 docs, 300 queries):
| System | NDCG@10 |
|---|---|
| lss hybrid | 0.729 |
| Cohere embed-v3 | 0.717 |
| Voyage-2 | 0.713 |
| ColBERTv2 | 0.693 |
| BM25 (Anserini) | 0.665 |
Full results: EVALS.md
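For readers unfamiliar with the metrics in these tables, the sketch below shows one common way to compute NDCG@10 and MRR@10. It uses linear gains with a log2 rank discount; whether `lss eval` uses exactly this variant is an assumption, and the function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain: gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(
    ranked_gains: Sequence[float], judged_gains: Sequence[float], k: int = 10
) -> float:
    """DCG of the returned ranking divided by the DCG of an ideal ranking
    built from every judged-relevant document for the query."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank_at_k(relevant_flags: Sequence[bool], k: int = 10) -> float:
    """1 / rank of the first relevant result within the top k, else 0."""
    for rank, is_relevant in enumerate(relevant_flags[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    per_query_flags: Sequence[Sequence[bool]], k: int = 10
) -> float:
    """Average the reciprocal rank over all queries."""
    if not per_query_flags:
        return 0.0
    return sum(reciprocal_rank_at_k(f, k) for f in per_query_flags) / len(per_query_flags)
```

An MRR@10 of 1.000, as reported for the hybrid rows, means the first result was relevant for every query in the golden set.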
Performance
| Scenario | OpenAI | Local |
|---|---|---|
| Cold search (no cache) | 400-800 ms | 50-200 ms |
| Warm (embeddings cached) | 100-200 ms | 5-50 ms |
| Hot (all in LRU) | 50-150 ms | 2-10 ms |
Configuration
| Variable | Default | Description |
|---|---|---|
| OPENAI_API_KEY | — | Required for OpenAI provider |
| LSS_PROVIDER | auto-detect | openai or local |
| LSS_DIR | ~/.lss | Data directory |
| BM25_K1 / BM25_B | 1.2 / 0.75 | BM25 tuning |
| NO_COLOR | unset | Disable ANSI colors |
Config file: ~/.lss/config.json
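To show what the two BM25 tuning variables control, here is a sketch of a standard BM25 per-term score that reads `BM25_K1` and `BM25_B` with the defaults from the table. The formula is the textbook Okapi BM25 form with a smoothed IDF; the exact formula in the lss re-scorer, and the way it parses these variables, are assumptions here.

```python
import math
import os
from typing import Mapping, Optional, Tuple


def bm25_params(env: Optional[Mapping[str, str]] = None) -> Tuple[float, float]:
    """Read k1 and b from environment-style settings, with README defaults."""
    env = os.environ if env is None else env
    k1 = float(env.get("BM25_K1", "1.2"))
    b = float(env.get("BM25_B", "0.75"))
    return k1, b


def bm25_term_score(
    tf: float,
    doc_len: float,
    avg_doc_len: float,
    n_docs: int,
    doc_freq: int,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Score one query term in one document.

    k1 controls how quickly repeated occurrences saturate;
    b controls how strongly long documents are penalized.
    """
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)
```

Raising `BM25_K1` lets repeated terms keep adding to the score for longer, while setting `BM25_B` to 0 disables length normalization entirely.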
Programmatic use
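from semantic_search import semantic_search
from lss_store import ingest_many, discover_files
# Walk the project and split out the files that still need indexing
all_files, new_files, _ = discover_files("/path/to/project")
# Index only the new files
ingest_many(new_files)
# Search the indexed project; queries are passed as a list
results = semantic_search("/path/to/project", ["JWT authentication"])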
Tests
361+ tests covering extraction, filtering, chunking, CLI, e2e, file watching, providers, and search quality.
python -m pytest tests/ -x -q
License
Apache-2.0