Semantic embedding validation tool for ADRs and code


gundog

Gundog is a local semantic retrieval engine for your high-volume corpus. It finds relevant code and documentation by matching the meaning of your query, not just its keywords.

Point it at your docs or code or both. It embeds everything into vectors, builds a similarity graph connecting related files, and combines semantic search with keyword matching. Ask "how does auth work?" and it retrieves the login handler, session middleware, and the ADR that explains why you chose JWT even if none of them contain the word "auth".

Why?

I wanted a clean map of all related data chunks from widespread data sources based on a natural language query. SeaGOAT instead provides a ranked, accurate, but flat list of pointers to specific data chunks from a single git repository. Basically, I wanted an Obsidian graph view of my docs driven by a natural language query without having to go through the pain of using... well... Obsidian.

Gundog builds these connections across repositories/data sources automatically. Vector search finds semantically related content, BM25 catches exact keyword matches, and graph expansion surfaces files you didn't know to look for.

Performance

Gundog uses ONNX Runtime and HNSW indexing by default for fast queries:

Metric	Value
Query latency	~15ms (after model warmup)
First query	~200-300ms (model loading)
Accuracy	96-100%
Index time	~1 min per 100 files

Based on personal testing with 60-120 files and 50 queries. Not extensively validated at scale. Your mileage may vary. See benchmark/BENCHMARK.md for details.

Install

pip install gundog

Or from source

git clone https://github.com/adhityaravi/gundog.git
cd gundog
uv sync
uv run gundog --help

Quick Start

1. Index your stuff:

gundog index

First run downloads the embedding model (~130MB) and converts it to ONNX format (cached at ~/.cache/gundog/onnx/ for reuse across projects using the same model). Subsequent runs are incremental and only re-index changed files.

2. Start the daemon and register your index:

gundog daemon start
gundog daemon add myproject .

3. Search:

gundog query "database connection pooling"

# stop the daemon when you are done
gundog daemon stop

Returns ranked results with file paths and relevance scores. The daemon keeps the model loaded for instant queries (~15ms).

Commands

`gundog index`

Scans your configured sources, embeds the content, and builds a searchable index.

gundog index                    # uses .gundog/config.yaml
gundog index -c /path/to.yaml   # custom config file
gundog index --rebuild          # fresh index from scratch

`gundog daemon`

Runs a persistent background service for fast queries. The daemon keeps the embedding model loaded in memory, making subsequent queries instant (~15ms vs ~300ms cold start).

gundog daemon start                           # start daemon (bootstraps config if needed)
gundog daemon start --foreground              # run in foreground (for debugging)
gundog daemon stop                            # stop daemon
gundog daemon status                          # check if daemon is running

# Index management
gundog daemon add myproject /path/to/project  # register an index
gundog daemon remove myproject                # unregister an index
gundog daemon list                            # list registered indexes

The daemon also serves a web UI at its configured host and port (127.0.0.1:7676 in the sample daemon config) for interactive queries with a visual graph. File links are auto-detected from git repos: files in a git repo with a remote get clickable links to GitHub/GitLab.

`gundog query`

Finds relevant files for a natural language query. Requires the daemon to be running.

gundog query "error handling strategy"
gundog query "authentication" --top 5        # limit results
gundog query "auth" --index myproject        # use specific registered index

Daemon settings are stored at ~/.config/gundog/daemon.yaml.

How It Works

Embedding: Files are converted to vectors using sentence-transformers. Similar concepts end up as nearby vectors.
Hybrid Search: Combines semantic (vector) search with keyword (BM25) search using Reciprocal Rank Fusion. Queries like "UserAuthService" find exact matches even when embeddings might miss them.
Storage: Vectors are stored locally, either as plain numpy arrays or in an HNSW index. No external services.
Two-Stage Ranking: Coarse retrieval via vector+BM25 fusion, then fine-grained ranking using per-line TF-IDF scores to pinpoint the best matching line within each chunk.
Graph: Documents above a similarity threshold get connected, enabling traversal from direct matches to related files.
Query: Your query gets embedded, compared against stored vectors, fused with keyword results, and ranked. Scores are rescaled so 0% = baseline, 100% = perfect match. Irrelevant queries return nothing.
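
The Hybrid Search step can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not gundog's actual code: the function name, the way `vector_weight` and `bm25_weight` are applied, and the constant `k=60` (a commonly used value) are assumptions made for illustration, as are the file names.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse several best-first ranked lists of document ids into one list.

    rankings: list of lists, each ordered from best to worst match.
    weights:  optional per-list weights (hypothetical stand-in for
              vector_weight / bm25_weight; not confirmed by gundog).
    k:        smoothing constant that dampens the influence of top ranks.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for a document.
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers for the same query.
vector_hits = ["auth/session.py", "auth/login.py", "docs/adr-007-jwt.md"]
bm25_hits = ["auth/login.py", "services/user_auth_service.py", "auth/session.py"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits], weights=[0.5, 0.5])
print(fused)
```

Because RRF only looks at ranks, the vector and BM25 scores never need to be on the same scale. A file that both retrievers rank highly (here `auth/login.py`) rises above one that only a single retriever found.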

Configuration

Gundog uses two config files:

File	Scope	Purpose
`.gundog/config.yaml`	Per-project	Index settings (sources, model, storage)
`~/.config/gundog/daemon.yaml`	Per-user	Daemon settings (host, port, registered indexes)

Project config

Each project has its own .gundog/config.yaml that defines what to index and how:

sources:
  - path: ./docs
    glob: "**/*.md"
  - path: ./src
    glob: "**/*.py"
    type: code                    # optional - for filtering with --type
    ignore_preset: python         # optional - predefined ignores
    ignore:                       # optional - additional patterns to skip
      - "**/test_*"
    use_gitignore: true           # default - auto-read .gitignore

embedding:
  # Any sentence-transformers model works: https://sbert.net/docs/sentence_transformer/pretrained_models.html
  model: BAAI/bge-small-en-v1.5   # default (~130MB), good balance of speed/quality
  enable_onnx: true               # default. forces ONNX conversion

storage:
  use_hnsw: true                  # default - O(log n) search, scales to millions. Uses numpy if false.
  path: .gundog/index

graph:
  similarity_threshold: 0.7  # min similarity to create edge
  expand_threshold: 0.5      # min edge weight for query expansion
  max_expand_depth: 2        # how far to traverse during expansion

hybrid:
  enabled: true       # combine vector + keyword search (default: on)
  bm25_weight: 0.5    # keyword search weight
  vector_weight: 0.5  # semantic search weight

recency:
  enabled: false      # boost recently modified files (opt-in, requires git)
  weight: 0.15        # how much recency affects score (0-1)
  half_life_days: 30  # days until recency boost decays to 50%

chunking:
  enabled: true       # default - split files into chunks for better precision
  max_tokens: 512     # tokens per chunk
  overlap_tokens: 50  # overlap between chunks

The type field is optional. If you want to filter results by category, assign types to your sources. Any string works.
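
To make the `graph` settings in the config above more concrete, here is a rough sketch of how a similarity graph could be built and then expanded from a set of direct matches. It is an illustration under assumptions, not gundog's implementation: the function names, the breadth-first traversal, and the sample similarity values are all hypothetical.

```python
from collections import deque


def build_graph(similarities, similarity_threshold=0.7):
    """Connect document pairs whose similarity reaches the threshold.

    similarities: dict mapping (doc_a, doc_b) -> similarity in [0, 1].
    Returns an undirected adjacency dict: doc -> {neighbour: weight}.
    """
    graph = {}
    for (a, b), sim in similarities.items():
        if sim >= similarity_threshold:
            graph.setdefault(a, {})[b] = sim
            graph.setdefault(b, {})[a] = sim
    return graph


def expand(graph, seeds, expand_threshold=0.5, max_expand_depth=2):
    """Breadth-first expansion from the direct matches (seeds).

    Returns a dict mapping each reached document to its hop distance.
    """
    depth_of = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if depth_of[node] == max_expand_depth:
            continue  # do not traverse beyond the configured depth
        for neighbour, weight in graph.get(node, {}).items():
            if weight >= expand_threshold and neighbour not in depth_of:
                depth_of[neighbour] = depth_of[node] + 1
                queue.append(neighbour)
    return depth_of


# Hypothetical pairwise similarities between four files.
sims = {
    ("login.py", "session.py"): 0.82,
    ("session.py", "adr-jwt.md"): 0.75,
    ("adr-jwt.md", "tokens.py"): 0.90,
    ("login.py", "README.md"): 0.40,  # below threshold, no edge
}
graph = build_graph(sims)
reached = expand(graph, ["login.py"])
print(reached)
```

In this example `tokens.py` is three hops from the seed, so it is left out with `max_expand_depth: 2`, and `README.md` never gets an edge at all.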

Embedding options

Option	Default	Description
`model`	`BAAI/bge-small-en-v1.5`	Any sentence-transformers model
`enable_onnx`	`true`	Use ONNX Runtime

Models are converted to ONNX automatically on first use and cached at ~/.cache/gundog/onnx/. This cache is shared across all your projects that use the same model.

Storage options

Option	Default	Description
`use_hnsw`	`true`	Use HNSW index for O(log n) search
`path`	`.gundog/index`	Where to store the index
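
For intuition on what `use_hnsw` trades off, the sketch below shows an exhaustive cosine-similarity scan, which is roughly what a non-HNSW search has to do: compare the query against every stored vector, so the cost grows linearly with the index size. It is written in plain Python to stay dependency-free. Gundog's numpy path is assumed, not confirmed, to be a vectorized version of the same idea, and the function names and toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_search(query, vectors, top=5):
    """Score every stored vector against the query and keep the best ones.

    vectors: dict mapping a document name to its embedding.
    Returns a list of (name, score) pairs, best match first.
    """
    scored = [(cosine(query, vec), name) for name, vec in vectors.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:top]]


# Toy 2-dimensional "embeddings"; real models produce hundreds of dimensions.
store = {"a.py": [1.0, 0.0], "b.py": [0.0, 1.0], "c.md": [1.0, 1.0]}
hits = brute_force_search([1.0, 0.1], store, top=2)
print(hits)
```

An HNSW index avoids the full scan by navigating a layered neighbour graph, which is where the O(log n) figure in the table comes from.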

Ignore patterns

Control which files are excluded from indexing:

ignore: List of glob patterns to skip (e.g., **/test_*, **/__pycache__/*)
ignore_preset: Predefined patterns for common languages: python, javascript, typescript, go, rust, java
use_gitignore: Auto-read .gitignore from source directory (default: true)
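
As a rough illustration of how `ignore` patterns filter paths, the sketch below uses Python's `fnmatch`. This is an approximation chosen for the example: gundog's actual matcher is not documented here and may treat `**` differently, and `is_ignored` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def is_ignored(rel_path, patterns):
    """Return True if rel_path matches any of the ignore patterns.

    Assumption: fnmatch-style matching, where '*' also matches '/'.
    With this matcher '**/test_*' needs at least one directory in front,
    so a top-level 'test_x.py' would NOT match; a real glob engine may.
    """
    return any(fnmatchcase(rel_path, pattern) for pattern in patterns)


patterns = ["**/test_*", "**/__pycache__/*"]
candidates = [
    "src/api.py",
    "src/tests/test_api.py",
    "src/__pycache__/api.cpython-312.pyc",
    "src/contest_entry.py",
]
kept = [path for path in candidates if not is_ignored(path, patterns)]
print(kept)
```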

Chunking

Enabled by default for better search precision. Instead of embedding whole files (which dilutes signal), chunking splits files into overlapping segments:

chunking:
  enabled: true
  max_tokens: 512   # ~2000 characters per chunk
  overlap_tokens: 50

Results are automatically deduplicated by file, showing the best-matching chunk with line numbers.
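
The overlapping-window idea can be sketched as follows. This is a simplified illustration, not gundog's chunker: it splits on whitespace instead of using the embedding model's tokenizer, and `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(tokens, max_tokens=512, overlap_tokens=50):
    """Split a token list into windows that share overlap_tokens tokens."""
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    step = max_tokens - overlap_tokens  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the file
    return chunks


# Whitespace "tokens" stand in for real model tokens in this sketch.
text = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_tokens(text.split())
print([len(chunk) for chunk in chunks])
```

The last 50 tokens of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole by at least one chunk.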

Recency boost

For codebases where recent changes matter more, enable recency boosting. Files modified recently get a score boost based on their git commit history:

recency:
  enabled: true
  weight: 0.15        # boost multiplier (0.15 = up to 15% boost)
  half_life_days: 30  # file modified 30 days ago gets 50% of max boost

Uses exponential decay: a file modified today gets full boost, one modified half_life_days ago gets half, and older files approach zero. Requires files to be in a git repository.
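
The decay described above works out to boost = weight × 0.5^(age_days / half_life_days). A small sketch, assuming the boost is applied multiplicatively to the base score (the README only says "up to 15% boost", so that detail and both function names are assumptions):

```python
def recency_boost(age_days, weight=0.15, half_life_days=30):
    """Exponentially decaying boost: full weight today, half after one half-life."""
    return weight * 0.5 ** (age_days / half_life_days)


def apply_recency(score, age_days, weight=0.15, half_life_days=30):
    # Assumption: the boost scales the base relevance score multiplicatively.
    return score * (1.0 + recency_boost(age_days, weight, half_life_days))


for age in (0, 30, 60, 300):
    print(age, round(recency_boost(age), 4))
```

With the default settings a file touched today gets the full 0.15, one touched 30 days ago gets 0.075, and after ten half-lives the boost is effectively gone.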

Daemon config

The daemon config at ~/.config/gundog/daemon.yaml controls the background service:

daemon:
  host: 127.0.0.1       # bind address
  port: 7676            # port number
  serve_ui: true        # serve web UI at root path
  auth:
    enabled: false      # require API key
    api_key: null       # set via GUNDOG_API_KEY env var or here
  cors:
    allowed_origins: [] # CORS origins (empty = allow all)

# Registered indexes (managed via `gundog daemon add/remove`)
indexes:
  myproject: /path/to/project/.gundog

default_index: myproject  # used when --index not specified

Development

1. Fork the repo
2. Create a PR to gundog's main
3. Make sure the CI passes
4. Profit

To run checks locally:

uv run tox               # run all checks (lint, fmt, static, unit)
uv run tox -e lint       # linting only
uv run tox -e fmt        # format check only
uv run tox -e static     # type check only
uv run tox -e unit       # tests with coverage

gundog 0.3.0
