Local semantic search over your files — BM25 + embeddings (OpenAI or 100% offline via fastembed) with smart chunking, document extraction, and real-time file watching.

LSS — Local Semantic Search

Hybrid search over local files. BM25 + embeddings + Reciprocal Rank Fusion. Real-time file watching. Runs offline or with OpenAI.

lss "authentication JWT"              # search current directory
lss "deploy kubernetes" ~/Projects    # search a specific path
lss "rate limiting" --json            # machine-readable output

0.91 NDCG@10 on our golden set. On BEIR SciFact, lss hybrid reaches 0.729 NDCG@10, ahead of Cohere embed-v3 (0.717), Voyage-2 (0.713), and ColBERTv2 (0.693). See EVALS.md.


Install

# One-liner (auto-detects pipx/uv/pip)
curl -fsSL https://raw.githubusercontent.com/kortix-ai/lss/main/install.sh | bash

Or directly:

pipx install local-semantic-search       # recommended
pip install local-semantic-search
uv tool install local-semantic-search

Embedding provider

Default: OpenAI — if OPENAI_API_KEY is set, lss uses it automatically.

export OPENAI_API_KEY="sk-..."   # add to ~/.zshrc or ~/.bashrc

Offline alternative:

pip install 'local-semantic-search[local]'
lss config provider local

Uses bge-small-en-v1.5 (384d, ~125 MB). No API key, no network, no cost. On the golden set it scores 0.911 NDCG@10 against 0.914 for OpenAI (within 0.3%), and it is 8x faster.


Usage

Search

lss "query"                              # current directory
lss "query" ~/Documents                  # explicit path
lss "auth JWT" "deploy k8s"              # multiple queries
lss "config" --json                      # JSON output
lss "error" -k 5                         # top 5

# Filters (applied without re-indexing)
lss "auth" -e .py -e .ts                 # only these extensions
lss "config" -E .json -E .yaml           # exclude extensions
lss "user data" -x '\d{4}-\d{2}-\d{2}'  # exclude chunks matching regex
lss "auth" -e .py -x "test_"            # combine filters

First search auto-indexes. Subsequent searches use cached embeddings.
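
To call the CLI from a script, the flags above can be assembled into an argument list. This is a minimal sketch, not part of lss: `build_lss_command` and `run_lss` are hypothetical helper names, and it assumes the flags shown above can be combined freely and placed after the positional arguments. The JSON schema is not described here, so the parsed value is returned as-is.

```python
import json
import subprocess


def build_lss_command(query, path=None, top_k=None,
                      include_exts=(), exclude_exts=(), exclude_regex=None):
    """Assemble an argv list using only the flags documented above."""
    cmd = ["lss", query]
    if path is not None:
        cmd.append(path)                # explicit search path
    if top_k is not None:
        cmd += ["-k", str(top_k)]       # limit number of results
    for ext in include_exts:
        cmd += ["-e", ext]              # only these extensions
    for ext in exclude_exts:
        cmd += ["-E", ext]              # exclude these extensions
    if exclude_regex is not None:
        cmd += ["-x", exclude_regex]    # drop chunks matching the regex
    cmd.append("--json")                # machine-readable output
    return cmd


def run_lss(cmd):
    """Run lss and parse stdout as JSON (assumes lss is on PATH)."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)
```

Passing the arguments as a list avoids shell quoting problems with regexes such as `'\d{4}-\d{2}-\d{2}'`.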

Index & manage

lss index ~/Projects                     # index without searching
lss status                               # DB stats, provider, config
lss ls                                   # list indexed files
lss sweep --clear-all                    # wipe database

lss watch add ~/Documents                # for lss-sync daemon
lss exclude add "*.log"                  # glob exclusion
lss include add .rst                     # custom extension
lss config provider local                # switch provider
lss eval                                 # run quality benchmarks
lss update                               # check for updates

File watcher

lss-sync                                 # watch configured paths
lss-sync --watch ~/Projects              # watch specific path

How it works

query → BM25 (FTS5 + custom rescore) ─┐
      → Embedding (OpenAI or local)  ─┤→ RRF → boosts → MMR → results

  1. BM25 — FTS5 retrieves candidates, custom re-scorer ranks with TF saturation + IDF (k1=1.2, b=0.75)
  2. Embedding — OpenAI text-embedding-3-small (256d) or local bge-small-en-v1.5 (384d), cached permanently
  3. RRF — Reciprocal Rank Fusion merges both ranked lists
  4. Boosts — Jaccard overlap, phrase matching, digit co-mention
  5. MMR — Maximal Marginal Relevance for diversity
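
Step 3 can be illustrated with the textbook form of Reciprocal Rank Fusion, where each document scores the sum of 1 / (k + rank) over the lists it appears in. This is a sketch of the general technique, not lss's code: the function name is hypothetical, and k = 60 is the commonly used constant, assumed here because the value lss uses is not stated.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc ids into one fused ranking.

    k=60 is an assumed constant, not taken from lss.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["a", "b", "c"]
embedding_ranking = ["b", "c", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
# "b" and "c" appear in both lists, so they outrank "a" and "d"
```

Because RRF uses only ranks, it needs no score normalization between BM25 and cosine similarity.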

See ARCHITECTURE.md for full pipeline detail.
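
The diversity step (MMR) can be sketched in its standard greedy form: repeatedly pick the candidate with the best trade-off between relevance and similarity to what is already selected. The names `mmr` and `cosine`, the weight `lam=0.7`, and the use of cosine similarity are assumptions for illustration; ARCHITECTURE.md remains the reference for what lss does.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mmr(relevance, vectors, top_k, lam=0.7):
    """Greedy Maximal Marginal Relevance.

    relevance: doc id -> relevance score
    vectors:   doc id -> embedding vector
    lam:       1.0 = pure relevance, 0.0 = pure diversity (assumed value)
    """
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < top_k:
        def mmr_score(doc_id):
            # Penalize similarity to the closest already-selected result
            redundancy = max(
                (cosine(vectors[doc_id], vectors[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[doc_id] - (1.0 - lam) * redundancy

        best = max(sorted(remaining), key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


relevance = {"a": 0.9, "b": 0.85, "c": 0.5}
vectors = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
picked = mmr(relevance, vectors, top_k=3)
```

Here "b" is a near-duplicate of "a", so the less relevant but distinct "c" is picked second.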


Supported formats

| Category | Extensions |
| --- | --- |
| Code | .py, .js, .ts, .go, .rs, .java, .c, .cpp, .rb, .php, .swift, .kt, +40 more |
| Markup | .md, .rst, .tex, .html, .xml, .yaml, .json, .toml |
| Documents | .pdf, .docx, .xlsx, .pptx, .eml |
| Data | .csv, .jsonl, .tsv |

Extraction via pdfminer.six, python-docx, openpyxl, python-pptx, beautifulsoup4 (all optional — missing libs skip silently). Unknown extensions skipped by default; add with lss include add .ext.
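
One common way to make extraction libraries optional is to probe for them before use and skip formats whose library is absent. The sketch below shows that pattern only; it is not lss's code. The import names are the real ones for the packages listed above, but `available_extractors` and the pairing of extensions to libraries are assumptions.

```python
import importlib.util

# Extension -> importable module name.
# The extension pairing is an assumption made for illustration.
OPTIONAL_EXTRACTORS = {
    ".pdf": "pdfminer",    # installed by pdfminer.six
    ".docx": "docx",       # installed by python-docx
    ".xlsx": "openpyxl",
    ".pptx": "pptx",       # installed by python-pptx
    ".html": "bs4",        # installed by beautifulsoup4
}


def available_extractors(mapping=OPTIONAL_EXTRACTORS):
    """Return only the entries whose library can be imported."""
    return {
        ext: module
        for ext, module in mapping.items()
        if importlib.util.find_spec(module) is not None
    }
```

Running `available_extractors()` in the same environment as lss shows which document formats have their library installed.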


Search quality

Golden set:

| Method | NDCG@10 | MRR@10 | Provider |
| --- | --- | --- | --- |
| hybrid | 0.914 | 1.000 | OpenAI |
| hybrid | 0.911 | 1.000 | Local |
| bm25 | 0.885 | 0.988 | |

BEIR SciFact (5,183 docs, 300 queries):

| System | NDCG@10 |
| --- | --- |
| lss hybrid | 0.729 |
| Cohere embed-v3 | 0.717 |
| Voyage-2 | 0.713 |
| ColBERTv2 | 0.693 |
| BM25 (Anserini) | 0.665 |

Full results: EVALS.md
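
For readers unfamiliar with the two metrics, here is a minimal sketch of how NDCG@10 and MRR@10 are conventionally computed for a single query. It uses linear gain with a log2 discount, which is one common variant. The function names are hypothetical, and the exact formulation used by `lss eval` is documented in EVALS.md, not here.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k for one query; relevance maps doc id -> graded relevance."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def reciprocal_rank_at_k(ranked_ids, relevance, k=10):
    """1/rank of the first relevant result in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def mean(values):
    """Average per-query scores to get the reported metric."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# One relevant document, returned at rank 2
ranked = ["d1", "d2", "d3"]
relevance = {"d2": 1}
ndcg = ndcg_at_k(ranked, relevance)
rr = reciprocal_rank_at_k(ranked, relevance)
```

Under this definition, an MRR@10 of 1.000 means the first relevant result was at rank 1 for every query.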


Performance

| Scenario | OpenAI | Local |
| --- | --- | --- |
| Cold search (no cache) | 400-800 ms | 50-200 ms |
| Warm (embeddings cached) | 100-200 ms | 5-50 ms |
| Hot (all in LRU) | 50-150 ms | 2-10 ms |

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| OPENAI_API_KEY | | Required for OpenAI provider |
| LSS_PROVIDER | auto-detect | openai or local |
| LSS_DIR | ~/.lss | Data directory |
| BM25_K1 / BM25_B | 1.2 / 0.75 | BM25 tuning |
| NO_COLOR | unset | Disable ANSI colors |

Config file: ~/.lss/config.json
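
To show what BM25_K1 and BM25_B control, here is a sketch of the standard Okapi BM25 per-term score with those parameters read from the environment. It illustrates the general formula only: lss's custom re-scorer may differ, and the function names and the idea that the variables are read this way are assumptions.

```python
import math
import os


def bm25_params(env=os.environ):
    """Read k1 and b, falling back to the documented defaults."""
    k1 = float(env.get("BM25_K1", "1.2"))
    b = float(env.get("BM25_B", "0.75"))
    return k1, b


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                    k1=1.2, b=0.75):
    """Okapi BM25 contribution of one query term to one document.

    k1 controls term-frequency saturation; b controls how strongly
    long documents are penalized (0 = no length normalization).
    """
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


k1, b = bm25_params({})
once = bm25_term_score(1, 100, 100, 10, 1, k1, b)
ten_times = bm25_term_score(10, 100, 100, 10, 1, k1, b)
```

With the defaults, ten occurrences of a term score well under ten times a single occurrence, which is the saturation effect that k1 tunes.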


Programmatic use

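from semantic_search import semantic_search
from lss_store import ingest_many, discover_files

# Walk the project; the second return value holds the files that are new
all_files, new_files, _ = discover_files("/path/to/project")
# Ingest only the new files
ingest_many(new_files)
# Queries are passed as a list, so several can be searched in one call
results = semantic_search("/path/to/project", ["JWT authentication"])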

Tests

361+ tests covering extraction, filtering, chunking, CLI, e2e, file watching, providers, and search quality.

python -m pytest tests/ -x -q

License

Apache-2.0
