
Local semantic search over your files — BM25 + OpenAI embeddings with real-time file watching.


LSS — Local Semantic Search


Hybrid semantic search over local files. BM25 + OpenAI embeddings fused with Reciprocal Rank Fusion. Real-time file watching. Runs on any machine.

lss "authentication JWT"              # search current directory
lss "deploy kubernetes" ~/Projects    # search a specific path
lss "rate limiting" --json            # machine-readable output

0.93 NDCG@10 on our golden set. Beats ColBERTv2, Voyage-2, and Cohere embed-v3 on BEIR SciFact. See EVALS.md for full benchmarks.


Install

# One-liner (auto-detects pipx/uv/pip)
curl -fsSL https://raw.githubusercontent.com/kortix-ai/lss/main/install.sh | bash

Or install directly:

pipx install local-semantic-search       # recommended — isolated install
pip install local-semantic-search         # classic
uv tool install local-semantic-search     # if you use uv

Set your OpenAI API key:

export OPENAI_API_KEY="sk-..."   # add to ~/.zshrc or ~/.bashrc

That's it. No other dependencies, no GPU, no Docker.


Usage

Search

lss "Marko"                          # searches current directory
lss "Marko" ~/Documents              # explicit path (last arg is treated as a path if it exists on disk)
lss "Marko" -p ~/Documents           # explicit path with flag
lss "auth JWT" "deploy k8s"          # multiple queries
lss "database connection" --json     # JSON output for scripting
lss "config" -k 5                    # top 5 results
lss "error handling" | head          # pipe-friendly (colors auto-off)

First search auto-indexes the directory. Subsequent searches use cached embeddings.

Index

lss index ~/Projects                 # index without searching
lss index .                          # index current directory
lss index ~/Documents --yes          # skip confirmation prompt

Manage

lss status                           # show DB stats, watched paths, config
lss ls                               # list all indexed files
lss sweep --clear-all                # wipe the database

# Watch paths (for lss-sync daemon)
lss watch add ~/Documents
lss watch add ~/Projects
lss watch list
lss watch remove ~/Documents

# Exclude patterns
lss exclude add "*.log"
lss exclude add "*.min.js"
lss exclude list

File Watcher

lss-sync                             # watch paths from config
lss-sync --watch ~/Projects          # watch specific path
lss-sync --watch ~/a --watch ~/b     # multiple paths

Uses FSEvents (macOS) / inotify (Linux) to detect file changes and re-index in real time with debounced batching.
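
As a rough illustration of debounced batching, the sketch below collects change events and releases them as a single batch once no new event has arrived for a quiet period. This is not the code in lss_sync.py: the class name `DebouncedBatcher`, the `quiet_seconds` parameter, and the default of 0.5 s are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, List


class DebouncedBatcher:
    """Collects changed paths and releases them as one batch after a quiet period."""

    def __init__(self, quiet_seconds: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.quiet_seconds = quiet_seconds  # hypothetical default, not an lss setting
        self._clock = clock
        self._pending: Dict[str, float] = {}  # path -> time of its latest event

    def record(self, path: str) -> None:
        # Repeated events for the same path collapse into one pending entry.
        self._pending[path] = self._clock()

    def drain(self) -> List[str]:
        """Return pending paths once no event has arrived for quiet_seconds."""
        if not self._pending:
            return []
        newest = max(self._pending.values())
        if self._clock() - newest < self.quiet_seconds:
            return []  # events are still arriving; keep waiting
        batch = sorted(self._pending)
        self._pending.clear()
        return batch
```

A watcher callback would call `record(path)` on every file event, and a background loop would call `drain()` periodically and re-index whatever it returns. An editor that saves the same file several times in a burst then triggers one re-index instead of several.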

Evaluate

lss eval                             # run search quality evaluation
lss eval --json                      # machine-readable

How It Works

query "JWT authentication"
        |
   ┌────┴────┐
   v          v
  BM25    Embedding
(FTS5 +   (OpenAI API +
 custom    cosine sim)
 rescore)
   |          |
   └────┬─────┘
        v
  Reciprocal Rank Fusion
        |
  Post-fusion boosts
  (Jaccard, phrase, digit)
        |
  MMR re-ranking
  (diversity)
        |
     results
  1. BM25: SQLite FTS5 retrieves candidates by keyword, then a custom BM25 re-scorer ranks them with TF saturation and IDF weighting (k1=1.2, b=0.75).
  2. Embedding: The query and the top documents are embedded via text-embedding-3-small (256 dims). Embeddings are cached in SQLite and in an in-memory LRU, so repeated searches make no API calls.
  3. RRF: Reciprocal Rank Fusion merges both ranked lists using rank positions only, so no score calibration is needed.
  4. Boosts: Jaccard overlap, phrase matching, and digit co-mention features fine-tune the ordering.
  5. MMR: Maximal Marginal Relevance down-ranks near-duplicate chunks so the results stay diverse.
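
The fusion step can be sketched with the standard RRF formula, where each document scores the sum of 1 / (k + rank) over the lists it appears in. This is a minimal sketch, not the code in semantic_search.py; the function name and the sample file names are made up for illustration, and k=60 matches the RRF_K default listed under Configuration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)), with rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical ranked lists from the two retrievers.
bm25_ranking = ["auth.py", "jwt.md", "deploy.yaml"]
embedding_ranking = ["jwt.md", "session.py", "auth.py"]
fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
```

Because only rank positions enter the formula, BM25 scores and cosine similarities never need to be put on a common scale. Documents found by both retrievers (here jwt.md and auth.py) rise above documents found by only one.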

See ARCHITECTURE.md for the full pipeline with timing data.
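
For the diversity step, a textbook greedy MMR loop looks roughly like the sketch below. It is an illustration under assumptions, not the project's implementation: the trade-off weight `lam`, its default of 0.7, and the use of cosine similarity between chunk vectors are choices made for the example.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(relevance: Dict[str, float], vectors: Dict[str, Sequence[float]],
        top_k: int, lam: float = 0.7) -> List[str]:
    """Greedy MMR: repeatedly pick the candidate that maximizes
    lam * relevance - (1 - lam) * (max similarity to anything already picked)."""
    selected: List[str] = []
    candidates = set(relevance)
    while candidates and len(selected) < top_k:
        def mmr_score(doc: str) -> float:
            redundancy = max((cosine(vectors[doc], vectors[s]) for s in selected),
                             default=0.0)
            return lam * relevance[doc] - (1.0 - lam) * redundancy
        # Iterate in sorted order so ties resolve deterministically.
        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical chunks: "a_dup" is nearly identical to "a", while "b" is unrelated.
relevance = {"a": 0.9, "a_dup": 0.85, "b": 0.6}
vectors = {"a": [1.0, 0.0], "a_dup": [0.99, 0.1], "b": [0.0, 1.0]}
picked = mmr(relevance, vectors, top_k=2)
```

With these inputs the near-duplicate is skipped in favor of the less relevant but distinct chunk. Note that MMR reorders rather than deletes: asking for all three results still returns the duplicate, just last.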


Search Quality

Golden Set (40 queries, 30-file project corpus)

Method       NDCG@10   MRR@10   Recall@10
───────────────────────────────────────────
hybrid         0.932    1.000       0.948
bm25           0.888    0.971       0.895
embedding      0.901    0.988       0.930
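
For readers unfamiliar with the column headers, the sketch below shows one common way to compute the three metrics for a single query. It assumes linear gain for NDCG and treats any positive grade as relevant; EVALS.md may use a different variant, and the function names here are illustrative. The table values would then be averages over all queries.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    # Position i (0-based) is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    gains = [relevance.get(doc, 0.0) for doc in ranked[:k]]
    ideal_dcg = dcg(sorted(relevance.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg else 0.0


def reciprocal_rank_at_k(ranked: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    # 1 / rank of the first relevant hit; MRR is the mean of this over all queries.
    for rank, doc in enumerate(ranked[:k], start=1):
        if relevance.get(doc, 0.0) > 0:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    relevant = {doc for doc, grade in relevance.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)
```

An MRR@10 of 1.000 therefore means that every query in the set had a relevant file at rank 1.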

BEIR SciFact (5,183 docs, 300 queries) — NDCG@10

lss hybrid                  0.729
Cohere embed-v3             0.717
Voyage-2                    0.713
text-embedding-3-small      0.694
ColBERTv2                   0.693
BM25 (Anserini)             0.665

Full results and methodology: EVALS.md


Performance

Scenario                                     Latency
──────────────────────────────────────────────────────────
Cold search (first query, no cache)          400-800 ms
Warm search (embeddings cached in SQLite)    100-200 ms
Hot search (all in LRU memory)               50-150 ms
Re-index unchanged files                     0.2 ms/file
Index 500 files                              ~4 s
The OpenAI API call is the bottleneck on a cold search. After the first search, the embeddings are cached, so repeating a query skips the API call.
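
The cold, warm, and hot tiers correspond to a two-level cache. The sketch below shows the general shape, assuming an in-memory LRU in front of a SQLite table. The class name, the table schema, and keying by raw text are all assumptions; the real store may key on a content hash and use a different layout.

```python
import sqlite3
import struct
from collections import OrderedDict
from typing import Callable, List


class EmbeddingCache:
    """In-memory LRU in front of a SQLite table; calls `embed` only on a full miss."""

    def __init__(self, db_path: str, embed: Callable[[str], List[float]],
                 lru_size: int = 1024) -> None:
        self._embed = embed  # stand-in for the embedding API call
        self._lru: "OrderedDict[str, List[float]]" = OrderedDict()
        self._lru_size = lru_size
        self._db = sqlite3.connect(db_path)
        # Hypothetical schema: vectors are stored as packed 64-bit floats.
        self._db.execute("CREATE TABLE IF NOT EXISTS emb (text TEXT PRIMARY KEY, vec BLOB)")

    def get(self, text: str) -> List[float]:
        if text in self._lru:  # hot: served from memory
            self._lru.move_to_end(text)
            return self._lru[text]
        row = self._db.execute("SELECT vec FROM emb WHERE text = ?", (text,)).fetchone()
        if row is not None:  # warm: served from SQLite
            vec = list(struct.unpack(f"{len(row[0]) // 8}d", row[0]))
        else:  # cold: compute once, then persist
            vec = self._embed(text)
            blob = struct.pack(f"{len(vec)}d", *vec)
            self._db.execute("INSERT INTO emb (text, vec) VALUES (?, ?)", (text, blob))
            self._db.commit()
        self._lru[text] = vec
        if len(self._lru) > self._lru_size:
            self._lru.popitem(last=False)  # evict the least recently used entry
        return vec
```

Only the cold branch calls the embedding function, which is why latency drops sharply after the first query. Entries evicted from the LRU are still served from SQLite without another API call.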


Configuration

Environment variables

Variable            Default                  Description
─────────────────────────────────────────────────────────────────────────────────
OPENAI_API_KEY      (required)               OpenAI API key
OPENAI_MODEL        text-embedding-3-small   Embedding model
OPENAI_DIM          256                      Embedding dimensions
LSS_DIR             ~/.lss                   Data directory
LSS_MAX_FILE_SIZE   2097152 (2 MB)           Max file size to index
BM25_K1             1.2                      BM25 term frequency saturation
BM25_B              0.75                     BM25 document length normalization
RRF_K               60                       RRF smoothing constant
NO_COLOR            (unset)                  Disable ANSI colors
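
To show what BM25_K1 and BM25_B control, here is a minimal sketch of one common BM25 variant (with the non-negative "log(1 + ...)" IDF). It is not the project's re-scorer, which may differ in tokenization and IDF details; the function name and the toy documents are illustrative.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query terms."""
    if not docs:
        return []
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of docs containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1.0 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # k1 caps the gain from repeated terms; b scales the length penalty.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Toy corpus, already tokenized.
docs = [["jwt", "auth", "jwt"], ["deploy", "k8s"], ["auth", "config", "file", "notes"]]
scores = bm25_scores(["jwt", "auth"], docs)
```

Raising k1 lets repeated terms keep adding to the score for longer before saturating. Setting b to 0 disables length normalization, and setting it to 1 applies it fully.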

Config file (~/.lss/config.json)

{
  "watch_paths": ["/home/user/Documents", "/home/user/Projects"],
  "exclude_patterns": ["*.log", "*.min.js", "generated"]
}
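
The exact matching rules for exclude_patterns are not spelled out here. The sketch below shows one plausible interpretation, in which a file is excluded if any component of its path matches any pattern as a shell-style glob. That rule is an assumption, as is the helper name `is_excluded`.

```python
import fnmatch
import json
from pathlib import PurePosixPath
from typing import Iterable

# Same shape as ~/.lss/config.json above, inlined so the example is self-contained.
config = json.loads("""
{
  "watch_paths": ["/home/user/Documents", "/home/user/Projects"],
  "exclude_patterns": ["*.log", "*.min.js", "generated"]
}
""")


def is_excluded(path: str, patterns: Iterable[str]) -> bool:
    """True if any path component matches any glob pattern (assumed semantics)."""
    parts = PurePosixPath(path).parts
    patterns = list(patterns)
    return any(fnmatch.fnmatchcase(part, pat) for part in parts for pat in patterns)


patterns = config["exclude_patterns"]
```

Under this interpretation a bare name such as "generated" excludes everything beneath a directory of that name, while "*.log" matches on the file name alone.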

Programmatic Use

from semantic_search import semantic_search
from lss_store import ingest_many, discover_files

# Index a directory
files = discover_files("/path/to/project")
ingest_many(files)

# Search
results = semantic_search("/path/to/project", ["JWT authentication"])
for hit in results[0]:
    print(f"  {hit['score']:.3f}  {hit['file']}  {hit['text'][:80]}")

Project Layout

lss_config.py          Config: paths, env vars, load/save
lss_store.py           Indexing: file discovery, text extraction, FTS5 storage
lss_cli.py             CLI: search, index, status, watch, exclude, eval
lss_sync.py            File watcher daemon (watchdog + debounced indexing)
semantic_search.py     Search engine: BM25, embeddings, RRF, PRF, MMR
ARCHITECTURE.md        Full technical pipeline reference
EVALS.md               Search quality benchmarks vs published systems
tests/                 90 tests (unit, e2e, benchmarks, search quality, BEIR)

License

Apache-2.0

