Local-first semantic code search with vector and regex capabilities for AI assistants via MCP

ChunkHound

Deep Research for Code & Files

Transform your codebase into a searchable knowledge base for AI assistants. ChunkHound combines semantic search, built on the cAST chunking algorithm, with regex search, and integrates with AI assistants via the Model Context Protocol (MCP).

Features

  • cAST Algorithm - Research-backed semantic code chunking
  • Multi-Hop Semantic Search - Discovers interconnected code relationships beyond direct matches
  • Semantic search - Natural language queries like "find authentication code"
  • Regex search - Pattern matching without API keys
  • Local-first - Your code stays on your machine
  • 29 languages with structured parsing
    • Programming (via Tree-sitter): Python, JavaScript, TypeScript, JSX, TSX, Java, Kotlin, Groovy, C, C++, C#, Go, Rust, Haskell, Swift, Bash, MATLAB, Makefile, Objective-C, PHP, Vue, Zig
    • Configuration (via Tree-sitter): JSON, YAML, TOML, HCL, Markdown
    • Text-based (custom parsers): Text files, PDF
  • MCP integration - Works with Claude, VS Code, Cursor, Windsurf, Zed, etc.

Documentation

Visit chunkhound.github.io for complete guides.

Requirements

Installation

# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install ChunkHound
uv tool install chunkhound

Quick Start

  1. Create .chunkhound.json in the project root
{
  "embedding": {
    "provider": "openai",
    "api_key": "your-api-key-here"
  }
}
  2. Index your codebase
chunkhound index

For configuration, IDE setup, and advanced usage, see the documentation.

YAML Parsing Benchmarks

Use the reproducible benchmark harness to compare PyYAML, tree-sitter/cAST, and RapidYAML bindings on representative YAML workloads.

# Default synthetic cases with all available backends
uv run python scripts/bench_yaml.py

# Use your own fixtures or disable specific backends
uv run python scripts/bench_yaml.py \
  --cases-dir ./benchmarks/yaml \
  --backends pyyaml_safe_load tree_sitter_universal \
  --iterations 10

Real-Time Indexing

Automatic File Watching: MCP servers monitor your codebase and update the index automatically as you edit files. No manual re-indexing required.

Smart Content Diffs: Only changed code chunks get re-processed. Unchanged chunks keep their existing embeddings, making updates efficient even for large codebases.
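One way to picture this kind of diffing is a lookup keyed by chunk content. The sketch below is an illustration only, not ChunkHound's actual implementation: the function names and the hash-to-embedding index layout are assumptions made for the example.

```python
import hashlib


def chunk_key(text: str) -> str:
    """Stable content hash used to recognise unchanged chunks."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_update(old_index: dict, new_chunks: list):
    """Decide which chunks reuse an embedding and which need a new one.

    old_index maps content hash -> stored embedding (hypothetical layout).
    Returns (reuse, embed, stale):
      reuse - hash -> embedding for chunks whose content is unchanged
      embed - chunk texts that must be sent to the embedding provider
      stale - hashes in the old index that no longer exist in the file
    """
    reuse, embed = {}, []
    new_keys = set()
    for text in new_chunks:
        key = chunk_key(text)
        new_keys.add(key)
        if key in old_index:
            reuse[key] = old_index[key]
        else:
            embed.append(text)
    stale = set(old_index) - new_keys
    return reuse, embed, stale
```

With this shape, editing one function in a large file means only that function's chunk lands in embed; every other chunk keeps its stored vector.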

Seamless Branch Switching: When you switch git branches, ChunkHound automatically detects and re-indexes only the files that actually changed between branches.

Live Memory Systems: Index markdown notes or documentation that updates in real-time while you work, creating a dynamic knowledge base.

Why ChunkHound?

Research Foundation: Built on the cAST (Chunking via Abstract Syntax Trees) algorithm from Carnegie Mellon University, providing:

  • 4.3 point gain in Recall@5 on RepoEval retrieval
  • 2.67 point gain in Pass@1 on SWE-bench generation
  • Structure-aware chunking that preserves code meaning
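To make "structure-aware chunking" concrete, here is a simplified sketch of the split-then-merge idea using Python's standard ast module. It is an assumption-laden illustration, not ChunkHound's Tree-sitter-based implementation: the function names, the non-whitespace size measure, and the default budget are all chosen for the example, and the recursion drops class or function headers for brevity.

```python
import ast


def _size(text: str) -> int:
    # Measure chunks by non-whitespace characters so indentation does not count.
    return sum(1 for ch in text if not ch.isspace())


def chunk_nodes(source: str, nodes: list, max_size: int) -> list:
    """Merge sibling AST nodes greedily; split oversized nodes into their bodies."""
    chunks = []
    current = []

    def flush() -> None:
        if current:
            chunks.append("\n".join(current))
            current.clear()

    for node in nodes:
        text = ast.get_source_segment(source, node) or ""
        body = getattr(node, "body", None)
        if _size(text) > max_size and isinstance(body, list):
            # Too large for one chunk: recurse into the node's children.
            flush()
            chunks.extend(chunk_nodes(source, body, max_size))
        elif _size("\n".join(current + [text])) > max_size:
            # Adding this sibling would exceed the budget: start a new chunk.
            flush()
            current.append(text)
        else:
            current.append(text)
    flush()
    return chunks


def chunk_source(source: str, max_size: int = 30) -> list:
    """Chunk a Python module along top-level statement boundaries."""
    return chunk_nodes(source, ast.parse(source).body, max_size)
```

The point of the design is that chunk boundaries always fall between syntactic units, so a function is never cut in half the way a fixed line-count splitter might cut it.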

Local-First Architecture:

  • Your code never leaves your machine
  • Works offline with Ollama local models
  • No per-token charges for large codebases

Universal Language Support:

  • Structured parsing for 29 languages (Tree-sitter + custom parsers)
  • Same semantic concepts across all programming languages

Intelligent Code Discovery:

  • Multi-hop search follows semantic relationships to find related implementations
  • Automatically discovers complete feature patterns: find "authentication" to get password hashing, token validation, session management
  • Convergence detection prevents semantic drift while maximizing discovery
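The multi-hop idea can be sketched with toy vectors. This is a hypothetical illustration of hop-by-hop expansion with a simple stopping rule, not ChunkHound's algorithm: the function names, the k, min_score, and max_hops parameters, and the rule itself (stop when a hop finds nothing sufficiently similar to the original query) are assumptions.

```python
import math


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def multi_hop(query_vec, index: dict, k: int = 1,
              min_score: float = 0.5, max_hops: int = 3) -> list:
    """Expand from direct hits to their neighbours, hop by hop.

    index maps chunk name -> embedding (toy layout for the example).
    A neighbour is kept only if it is still similar to the ORIGINAL
    query, which limits drift; expansion stops when a hop adds nothing.
    """
    def top_k(vec, exclude):
        scored = sorted(
            ((cosine(vec, v), name) for name, v in index.items()
             if name not in exclude),
            reverse=True,
        )
        return [name for _, name in scored[:k]]

    found = top_k(query_vec, set())
    frontier = list(found)
    for _ in range(max_hops):
        new = []
        for name in frontier:
            for cand in top_k(index[name], set(found) | set(new)):
                if cosine(query_vec, index[cand]) >= min_score:
                    new.append(cand)
        if not new:  # converged: no further relevant neighbours
            break
        found.extend(new)
        frontier = new
    return found
```

In this toy setup, a query close to "login" can reach "validate_token" through "hash_password" even when k is too small for a single search to return all three, while an unrelated chunk is filtered out by the similarity check against the original query.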

License

MIT

Startup profile (discovery diagnostics)

Use --profile-startup to emit a JSON block with discovery and startup timing diagnostics to stderr. This works for both simulate and full index runs.

Examples:

# Simulate (discovery only) — file list on stdout, JSON profile on stderr
CHUNKHOUND_NO_RICH=1 \
chunkhound index --simulate . --sort path --profile-startup 2>profile.json

# Full run (no embeddings) — JSON profile on stderr
CHUNKHOUND_NO_RICH=1 \
chunkhound index . --no-embeddings --profile-startup 2>profile.json

Fields in startup_profile (JSON):

  • discovery_ms — discovery time in milliseconds
  • cleanup_ms — orphan cleanup time in milliseconds
  • change_scan_ms — change-scan time in milliseconds
  • resolved_backend — discovery backend actually used: python | git | git_only
  • resolved_reasons — reasons for the decision (e.g., no_repos, all_repos, mixed, explicit)
  • git_rows_tracked — number of paths from git ls-files (tracked)
  • git_rows_others — number of paths from git ls-files --others --exclude-standard
  • git_rows_total — sum of the two above
  • git_pathspecs — number of pathspecs (:(glob) ...) pushed down to Git for pre-filtering
    • Cap: controlled by CHUNKHOUND_INDEXING__GIT_PATHSPEC_CAP (default: 128). If the number of synthesized specs would exceed the cap, ChunkHound falls back to a subtree-only pathspec to guarantee coverage. The profile reflects the actual git_pathspecs used; an optional git_pathspecs_capped: true may appear.

Notes:

  • git_* counters appear only when the backend is git or git_only.
  • In auto mode, the backend is chosen heuristically (git_only for all‑repo trees, git for mixed trees, python when no repos are found).
  • For scripting, set CHUNKHOUND_NO_RICH=1 and read stderr; each JSON block appears on its own line near the end of the run.

Example snippet:

{
  "startup_profile": {
    "discovery_ms": 154.2,
    "cleanup_ms": 12.7,
    "change_scan_ms": 3.1,
    "resolved_backend": "git_only",
    "resolved_reasons": ["all_repos"],
    "git_rows_tracked": 420,
    "git_rows_others": 17,
    "git_rows_total": 437,
    "git_pathspecs": 4
  }
}
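For scripting, the profile can be pulled out of captured stderr with a few lines of Python. This is a minimal sketch that assumes, per the note above, that each JSON block is emitted on a single line; the function name is hypothetical and the field names are the ones listed in this section.

```python
import json
from typing import Optional


def read_startup_profile(stderr_text: str) -> Optional[dict]:
    """Return the last startup_profile block found in captured stderr, if any."""
    profile = None
    for line in stderr_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip ordinary log lines
        try:
            block = json.loads(line)
        except json.JSONDecodeError:
            continue  # a line that merely starts with "{" but is not JSON
        if isinstance(block, dict) and "startup_profile" in block:
            profile = block["startup_profile"]
    return profile


def git_rows_consistent(profile: dict) -> bool:
    """Check that git_rows_total equals tracked + others when git_* fields exist."""
    keys = ("git_rows_tracked", "git_rows_others", "git_rows_total")
    if not all(k in profile for k in keys):
        return True  # python backend: git_* counters are absent
    return profile["git_rows_total"] == (
        profile["git_rows_tracked"] + profile["git_rows_others"]
    )
```

A typical use would be reading profile.json written by the commands above and passing its contents to read_startup_profile.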

Exclusions (gitignore, config, defaults)

ChunkHound combines repository-aware ignores with safe defaults. The behavior depends on how you set indexing.exclude in .chunkhound.json:

  • Not set (default) → gitignore only
    • The .gitignore files inside repositories are honored (repo‑aware engine). Default ChunkHound excludes (e.g., .git/, node_modules/, .chunkhound/, caches) always apply to prevent self‑indexing and noise.
  • String sentinel .gitignore → gitignore only
    • Same as the default: only .gitignore rules are used as the exclusion source (plus ChunkHound’s default excludes).
  • Explicit list (array) → combined (gitignore + config) [default]
    • Your glob patterns in indexing.exclude are layered on top of .gitignore rather than replacing it. ChunkHound’s default excludes are also applied. This avoids surprising loss of .gitignore behavior when you accept prompts to add slow files to excludes.
    • To restore legacy behavior, set indexing.exclude_mode: "config_only". To force only gitignore even when a list exists, set indexing.exclude_mode: "gitignore_only" (rare).
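The rules above can be summarized as a small decision function. This sketch only restates the documented behavior; it is not ChunkHound's code, and the function name and the "defaults" / "gitignore" / "config" labels are invented for the illustration.

```python
def resolve_exclude_sources(indexing: dict) -> set:
    """Return which exclusion sources apply for a given "indexing" config block."""
    exclude = indexing.get("exclude")
    mode = indexing.get("exclude_mode")

    # ChunkHound's default excludes always apply, per the rules above.
    sources = {"defaults"}

    if exclude is None or exclude == ".gitignore":
        # Not set, or the string sentinel: gitignore only.
        sources.add("gitignore")
        return sources

    # Explicit list of glob patterns.
    if mode == "config_only":
        sources.add("config")          # legacy behavior: gitignore ignored
    elif mode == "gitignore_only":
        sources.add("gitignore")       # list present but not used
    else:
        sources |= {"gitignore", "config"}  # default: layered
    return sources
```

For example, a config with "exclude": ["**/dist/**"] and no exclude_mode resolves to all three sources, which matches the combined default described above.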

Workspace overlay for non‑repo paths (default: on)

  • When the directory you index contains non‑repo subtrees, ChunkHound can apply the root workspace .gitignore only to those non‑repo paths. This is controlled by indexing.workspace_gitignore_nonrepo (default: true).
  • Repository subtrees always use their own .gitignore and Git’s native semantics.

Examples

// Default: gitignore only (+ safe defaults)
{
  "indexing": {
    // exclude omitted
    "workspace_gitignore_nonrepo": true
  }
}

// Gitignore only (explicit sentinel)
{
  "indexing": {
    "exclude": ".gitignore",
    "workspace_gitignore_nonrepo": true
  }
}

// Explicit list layered ON TOP of gitignore (default)
{
  "indexing": {
    "exclude": ["**/dist/**", "**/*.min.js"],
    "workspace_gitignore_nonrepo": false
  }
}

// Legacy behavior: config only (gitignore ignored)
{
  "indexing": {
    "exclude": ["**/dist/**", "**/*.min.js"],
    "exclude_mode": "config_only"
  }
}

// Force gitignore-only even with a list (rare)
{
  "indexing": {
    "exclude": ["**/dist/**"],
    "exclude_mode": "gitignore_only"
  }
}

Root semantics for config patterns

  • Config include and exclude patterns are always evaluated relative to the ChunkHound (CH) root, that is, the directory you pass to chunkhound index.
  • Git’s own .gitignore patterns remain repo‑aware (anchored to their respective repository roots), but your config overlay applies uniformly from the CH root across all subtrees (including Git repos).
  • Examples:
    • CH root is /workspaces; a Git repo lives under /workspaces/monorepo. To exclude a file inside that repo using config, prefer a CH‑root‑relative path (e.g., "**/monorepo/path/inside/repo/file.txt").
    • When using anchored includes like "src/**/*.ts", ensure the anchor is correct from the CH root perspective (e.g., "monorepo/src/**/*.ts" when the repo is nested).
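Re-anchoring a repo-relative pattern to the CH root is mechanical, as the following sketch shows. The helper name is hypothetical and not part of ChunkHound; it only demonstrates how the nested-repo prefix from the examples above is derived.

```python
from pathlib import PurePosixPath


def to_ch_root_pattern(ch_root: str, repo_root: str, repo_pattern: str) -> str:
    """Rewrite a pattern anchored at a repo root so it is anchored at the CH root.

    Raises ValueError if repo_root is not inside ch_root.
    """
    rel = PurePosixPath(repo_root).relative_to(PurePosixPath(ch_root))
    if str(rel) == ".":
        # The repo is the CH root itself: the pattern is already correct.
        return repo_pattern
    return f"{rel.as_posix()}/{repo_pattern}"


# "src/**/*.ts" inside /workspaces/monorepo, indexed from /workspaces
print(to_ch_root_pattern("/workspaces", "/workspaces/monorepo", "src/**/*.ts"))
```

The printed pattern is the one to place in the config include or exclude list when indexing from /workspaces.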

CLI toggle for the workspace overlay

  • --nonrepo-gitignore enables the root .gitignore overlay for non‑repo paths for the current run.
  • To disable overlay persistently, set "workspace_gitignore_nonrepo": false in .chunkhound.json.

Simulate and diagnostics

Simulate a discovery run without writing to the database. Useful for verifying include/exclude rules, sorting, and sizes.

# List discovered files (sorted by path)
chunkhound index --simulate . --sort path

# Show sizes and sort by size (descending)
chunkhound index --simulate . --show-sizes --sort size_desc

# Emit JSON instead of plain text
chunkhound index --simulate . --json > files.json

# Add discovery timing/profile to stderr (JSON)
CHUNKHOUND_NO_RICH=1 chunkhound index --simulate . --profile-startup 2>profile.json

# Print debug info about ignores (to stderr): CH root, sources, first N defaults
chunkhound index --simulate . --debug-ignores

Diagnostics (ignore decisions):

# Compare ChunkHound’s ignore decision vs Git for the current tree
chunkhound index --check-ignores --vs git --json > ignore_diff.json

Notes:

  • When piping simulate output to tools like head, BrokenPipe is handled gracefully; prefer CHUNKHOUND_NO_RICH=1 for easy JSON parsing.
