Local semantic search over your files — BM25 + embeddings (OpenAI or 100% offline via fastembed) with smart chunking, document extraction, and real-time file watching.

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

markokraemer.com

These details have not been verified by PyPI

Project description

LSS — Local Semantic Search

Quick Demo

Hybrid semantic search over local files. BM25 + embeddings fused with Reciprocal Rank Fusion. Real-time file watching. Runs 100% offline or with OpenAI — your choice.

lss "authentication JWT"              # search current directory
lss "deploy kubernetes" ~/Projects    # search a specific path
lss "rate limiting" --json            # machine-readable output

0.91 NDCG@10 on our golden set. Beats ColBERTv2, Voyage-2, and Cohere embed-v3 on BEIR SciFact. Works offline with local embeddings or with OpenAI for maximum quality. See EVALS.md for full benchmarks.

Install

# One-liner (auto-detects pipx/uv/pip)
curl -fsSL https://raw.githubusercontent.com/kortix-ai/lss/main/install.sh | bash

Or install directly:

pipx install local-semantic-search       # recommended — isolated install
pip install local-semantic-search         # classic
uv tool install local-semantic-search     # if you use uv

Choose your embedding provider

Option A: 100% offline (no API key needed)

pip install 'local-semantic-search[local]'    # installs fastembed
lss config provider local                     # switch to local embeddings

Uses BAAI/bge-small-en-v1.5 (384 dims, ~125 MB download on first run). No network calls, no API key, no cost.

Option B: OpenAI embeddings (higher quality)

export OPENAI_API_KEY="sk-..."   # add to ~/.zshrc or ~/.bashrc

Uses text-embedding-3-small (256 dims). Cached permanently — repeated searches cost zero API calls.

Auto-detection: If no provider is configured, lss checks for OPENAI_API_KEY first, then falls back to fastembed if installed.

Usage

Search

lss "Marko"                          # searches current directory
lss "Marko" ~/Documents              # explicit path (last arg if it exists on disk)
lss "Marko" -p ~/Documents           # explicit path with flag
lss "auth JWT" "deploy k8s"          # multiple queries
lss "database connection" --json     # JSON output for scripting
lss "config" -k 5                    # top 5 results
lss "error handling" | head          # pipe-friendly (colors auto-off)

# Query-time filters
lss "auth" -e .py -e .ts             # only search Python and TypeScript files
lss "config" -E .json -E .yaml       # exclude JSON and YAML files
lss "user data" -x '\d{4}-\d{2}-\d{2}'   # exclude chunks matching regex
lss "auth" -e .py -x "test_"         # combine: only .py, exclude test files

First search auto-indexes the directory. Subsequent searches use cached embeddings.

Index

lss index ~/Projects                 # index without searching
lss index .                          # index current directory
lss index ~/Documents --yes          # skip confirmation prompt

Manage

lss status                           # show DB stats, watched paths, provider, config
lss ls                               # list all indexed files
lss sweep --clear-all                # wipe the database

# Watch paths (for lss-sync daemon)
lss watch add ~/Documents
lss watch add ~/Projects
lss watch list
lss watch remove ~/Documents

# Exclude patterns
lss exclude add "*.log"
lss exclude add "*.min.js"
lss exclude list

# Include custom file extensions
lss include add .rst
lss include add .tex
lss include list

Search Filters

Filter results at query time without re-indexing:

# Extension include (-e / --ext): only return results from these file types
lss "authentication" -e .py              # only Python files
lss "config" -e .yaml -e .toml -e .json  # only config files

# Extension exclude (-E / --exclude-ext): exclude these file types
lss "database" -E .sql                   # everything except SQL
lss "error" -E .log -E .txt             # skip logs and text

# Content regex exclude (-x / --exclude-pattern): filter out matching chunks
lss "user data" -x '\d{4}-\d{2}-\d{2}'  # exclude date patterns
lss "auth" -x "test_" -x "mock_"        # exclude test/mock code
lss "config" -x "(?i)deprecated"         # case-insensitive exclude

# Combine all filters
lss "authentication" -e .py -e .ts -x "test_" -x "fixture"

Extension filters are applied in SQL (efficient, pre-scoring). Content regex exclusion is applied post-scoring.

Configuration

lss config show                      # display all configuration
lss config provider local            # switch to local embeddings
lss config provider openai           # switch to OpenAI embeddings

File Watcher

lss-sync                             # watch paths from config
lss-sync --watch ~/Projects          # watch specific path
lss-sync --watch ~/a --watch ~/b     # multiple paths

Uses FSEvents (macOS) / inotify (Linux) to detect file changes and re-index in real time with debounced batching.

Evaluate

lss eval                             # run search quality evaluation
lss eval --json                      # machine-readable

Update

lss update                           # check for new version and upgrade

How It Works

query "JWT authentication"
        |
   ┌────┴────┐
   v          v
  BM25    Embedding
(FTS5 +   (OpenAI or
 custom    fastembed +
 rescore)  cosine sim)
   |          |
   └────┬─────┘
        v
  Reciprocal Rank Fusion
        |
  Post-fusion boosts
  (Jaccard, phrase, digit)
        |
  MMR re-ranking
  (diversity)
        |
     results

BM25 — SQLite FTS5 retrieves candidates by keyword, then our custom BM25 re-scorer ranks them with proper TF saturation and IDF weighting (k1=1.2, b=0.75).
Embedding — Query and top documents are embedded via OpenAI text-embedding-3-small (256d) or local fastembed bge-small-en-v1.5 (384d). Cached in SQLite + LRU — repeated searches hit zero API calls.
RRF — Reciprocal Rank Fusion merges both ranked lists. No score calibration needed.
Boosts — Jaccard overlap, phrase matching, and digit co-mention features fine-tune ordering.
MMR — Maximal Marginal Relevance removes near-duplicate chunks for diverse results.

See ARCHITECTURE.md for the full pipeline with timing data.

Document Formats

LSS indexes all common text, code, and document formats:

Category	Extensions
Code	`.py`, `.js`, `.ts`, `.go`, `.rs`, `.java`, `.c`, `.cpp`, `.rb`, `.php`, `.swift`, `.kt`, and 40+ more
Markup	`.md`, `.rst`, `.tex`, `.html`, `.xml`, `.yaml`, `.json`, `.toml`
Documents	`.pdf`, `.docx`, `.xlsx`, `.pptx`, `.html`, `.eml`
Data	`.csv`, `.jsonl`, `.tsv`
Config	`.env.example`, `.gitignore`, `.dockerignore`, `Makefile`, `Dockerfile`

Document extraction uses lightweight per-format libraries (pdfminer.six, python-docx, openpyxl, python-pptx, beautifulsoup4). All are optional — if a library isn't installed, that format is skipped silently.

Unknown extensions are skipped by default. Add custom extensions with lss include add .ext.

Smart Chunking

Files are chunked intelligently based on file type:

File type	Strategy	Split boundaries
Markdown (`.md`)	Heading-aware	Splits on `# heading` lines, preserving document structure
Python (`.py`)	Definition-aware	Splits on `def`/`class` boundaries, keeping functions intact
Everything else	Word-window	220 words per chunk, 200-word stride (20-word overlap)

Each chunk gets an MD5 hash (for dedup and embedding cache), an FTS5 entry (for BM25), and a lazily-computed embedding vector.

Search Quality

Golden Set (40 queries, 33-file project corpus)

OpenAI embeddings (text-embedding-3-small, 256d):

Method       NDCG@10   MRR@10   Recall@10
───────────────────────────────────────────
hybrid         0.914    1.000       0.936
embedding      0.886    0.988       0.917
bm25           0.885    0.988       0.893

Local embeddings (bge-small-en-v1.5, 384d):

Method       NDCG@10   MRR@10   Recall@10
───────────────────────────────────────────
hybrid         0.911    1.000       0.931
embedding      0.894    1.000       0.923
bm25           0.885    0.988       0.893

Local embeddings are within 0.3% of OpenAI on NDCG@10 — and 8x faster (no network calls).

BEIR SciFact (5,183 docs, 300 queries) — NDCG@10

lss hybrid                  0.729
Cohere embed-v3             0.717
Voyage-2                    0.713
text-embedding-3-small      0.694
ColBERTv2                   0.693
BM25 (Anserini)             0.665

Full results and methodology: EVALS.md

Performance

Scenario	OpenAI	Local
Cold search (first query, no cache)	400-800 ms	50-200 ms
Warm search (embeddings cached)	100-200 ms	50-150 ms
Hot search (all in LRU memory)	50-150 ms	30-100 ms
Re-index unchanged files	0.2 ms/file	0.2 ms/file
Index 500 files	~4s	~4s

With OpenAI, the API call is the bottleneck on cold search. With local embeddings, everything runs on your machine — no network dependency.

File Filtering

LSS uses an inclusion-based approach — only known text/code/document extensions are indexed:

~80 known extensions — code, markup, config, documents (see INDEXED_EXTENSIONS in source)
Known extensionless files — Makefile, Dockerfile, LICENSE, README, etc.
Directory exclusions — node_modules/, .git/, __pycache__/, .venv/, dist/, build/, and ~70 more
.gitignore parsing — respects .gitignore patterns in every subtree
User config — lss exclude add "*.log" and lss include add .ext
Max file size — 2 MB default (override with LSS_MAX_FILE_SIZE)

Unknown extensions are skipped by default. This prevents indexing binary blobs, generated files, and other junk.

Configuration

Environment variables

Variable	Default	Description
`LSS_PROVIDER`	(auto-detect)	`openai` or `local` — embedding provider
`OPENAI_API_KEY`	—	OpenAI API key (required for openai provider)
`OPENAI_MODEL`	`text-embedding-3-small`	OpenAI embedding model
`OPENAI_DIM`	`256`	OpenAI embedding dimensions
`LSS_DIR`	`~/.lss`	Data directory
`LSS_MAX_FILE_SIZE`	`2097152` (2 MB)	Max file size to index
`BM25_K1`	`1.2`	BM25 term frequency saturation
`BM25_B`	`0.75`	BM25 document length normalization
`RRF_K`	`60`	RRF smoothing constant
`NO_COLOR`	(unset)	Disable ANSI colors

Config file (`~/.lss/config.json`)

{
  "embedding_provider": "local",
  "watch_paths": ["/home/user/Documents", "/home/user/Projects"],
  "exclude_patterns": ["*.log", "*.min.js", "generated"],
  "include_extensions": [".rst", ".tex"]
}

Programmatic Use

from semantic_search import semantic_search
from lss_store import ingest_many, discover_files

# Index a directory
all_files, new_files, _ = discover_files("/path/to/project")
ingest_many(new_files)

# Search
results = semantic_search("/path/to/project", ["JWT authentication"])
for hit in results[0]:
    print(f"  {hit['score']:.3f}  {hit['file']}  {hit['text'][:80]}")

Project Layout

lss_config.py          Config: paths, env vars, provider detection, load/save
lss_extract.py         Document extractors: PDF, DOCX, XLSX, PPTX, HTML, EML, JSON, CSV
lss_store.py           Indexing: file discovery, inclusion filtering, smart chunking, FTS5
lss_cli.py             CLI: search, index, status, config, watch, include, exclude, eval, update
lss_sync.py            File watcher daemon (watchdog + debounced indexing)
semantic_search.py     Search: BM25, dual embedding providers, RRF, PRF, MMR
ARCHITECTURE.md        Full technical pipeline reference
EVALS.md               Search quality benchmarks vs published systems
tests/                 366 tests (unit, e2e, benchmarks, search quality, BEIR)

Tests

366 tests covering extraction, filtering, chunking, storage, CLI, e2e, file watching, embedding providers, and search quality.

python -m pytest tests/ -x -q             # run all tests
python -m pytest tests/ -k "not beir" -q   # skip BEIR (needs ir-datasets)

License

Apache-2.0

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

markokraemer.com

These details have not been verified by PyPI

Release history Release notifications | RSS feed

0.5.7

Mar 21, 2026

0.5.5

Feb 25, 2026

0.5.4

Feb 14, 2026

0.5.2

Feb 9, 2026

This version

0.5.1

Feb 8, 2026

0.5.0

Feb 8, 2026

0.4.2

Feb 8, 2026

0.4.1

Feb 8, 2026

0.4.0

Feb 8, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

local_semantic_search-0.5.1.tar.gz (19.5 MB view details)

Uploaded Feb 8, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

local_semantic_search-0.5.1-py3-none-any.whl (60.2 kB view details)

Uploaded Feb 8, 2026 Python 3

File details

Details for the file local_semantic_search-0.5.1.tar.gz.

File metadata

Download URL: local_semantic_search-0.5.1.tar.gz
Upload date: Feb 8, 2026
Size: 19.5 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for local_semantic_search-0.5.1.tar.gz
Algorithm	Hash digest
SHA256	`5282147bfbcd887b6a38c7fdd0f1d8cccf5cceb5921e26c600d7ad0b8ac9bfa7`
MD5	`2c59d39779c3e2b59f1ce5d5d6b37040`
BLAKE2b-256	`819e70816bb61cd2ef6e430d5141288e65528a3eb619586b171ce4dfba13b497`

See more details on using hashes here.

Provenance

The following attestation bundles were made for local_semantic_search-0.5.1.tar.gz:

Publisher: publish.yml on kortix-ai/lss

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: local_semantic_search-0.5.1.tar.gz
- Subject digest: 5282147bfbcd887b6a38c7fdd0f1d8cccf5cceb5921e26c600d7ad0b8ac9bfa7
- Sigstore transparency entry: 929128533
- Sigstore integration time: Feb 8, 2026
Source repository:
- Permalink: kortix-ai/lss@6a31e1776921b39d6433aef35bb42097d23df89b
- Branch / Tag: refs/tags/v0.5.1
- Owner: https://github.com/kortix-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6a31e1776921b39d6433aef35bb42097d23df89b
- Trigger Event: push

File details

Details for the file local_semantic_search-0.5.1-py3-none-any.whl.

File metadata

Download URL: local_semantic_search-0.5.1-py3-none-any.whl
Upload date: Feb 8, 2026
Size: 60.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for local_semantic_search-0.5.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`63298a74fb834bc8eb72aff0dff6de21c2eafbd1111e72c0058da4fb84699000`
MD5	`0e8fcd58ab75c708c0a6c98c9f1db6b6`
BLAKE2b-256	`c82e4a5362a9ff1d8c7eecc4d7eae83ef45d6f5bf1e488128b1e550caee57f51`

See more details on using hashes here.

Provenance

The following attestation bundles were made for local_semantic_search-0.5.1-py3-none-any.whl:

Publisher: publish.yml on kortix-ai/lss

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: local_semantic_search-0.5.1-py3-none-any.whl
- Subject digest: 63298a74fb834bc8eb72aff0dff6de21c2eafbd1111e72c0058da4fb84699000
- Sigstore transparency entry: 929128535
- Sigstore integration time: Feb 8, 2026
Source repository:
- Permalink: kortix-ai/lss@6a31e1776921b39d6433aef35bb42097d23df89b
- Branch / Tag: refs/tags/v0.5.1
- Owner: https://github.com/kortix-ai
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6a31e1776921b39d6433aef35bb42097d23df89b
- Trigger Event: push

local-semantic-search 0.5.1

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

LSS — Local Semantic Search

Install

Choose your embedding provider

Usage

Search

Index

Manage

Search Filters

Configuration

File Watcher

Evaluate

Update

How It Works

Document Formats

Smart Chunking

Search Quality

Golden Set (40 queries, 33-file project corpus)

BEIR SciFact (5,183 docs, 300 queries) — NDCG@10

Performance

File Filtering

Configuration

Environment variables

Config file (~/.lss/config.json)

Programmatic Use

Project Layout

Tests

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance

Config file (`~/.lss/config.json`)