
trailhead

Codebase indexing and semantic search using tree-sitter parsing, vector embeddings, and SQLite graph storage.

Command-line code indexing and semantic search tool. It parses source files into a property graph (modules, classes, functions, and their relationships) stored in SQLite, generates text embeddings with sentence-transformers, and exposes everything through a CLI and HTTP API.

  • Single CLI command: th
  • Text embeddings powered by sentence-transformers (models cached locally)
  • Polyglot code indexing via tree-sitter (Python built-in; 12 additional languages optional)
  • Property graph persisted in a single SQLite file with optional vector search
  • Warm-model FastAPI server keeps the embedding model loaded in memory
  • Background file watcher incrementally re-indexes on change
  • Interactive browser UI for querying and visualizing the code graph

Requirements

  • Python 3.10+

Install

pip install trailhead

Language support

Python is supported out of the box. Additional languages are installed as optional extras. Only the packages you install will be active; missing ones are silently skipped at startup.

Install individual languages:

pip install "trailhead[javascript]"
pip install "trailhead[typescript]"
pip install "trailhead[rust]"
pip install "trailhead[go]"
pip install "trailhead[java]"
pip install "trailhead[csharp]"
pip install "trailhead[c]"
pip install "trailhead[cpp]"
pip install "trailhead[ruby]"
pip install "trailhead[php]"
pip install "trailhead[bash]"
pip install "trailhead[html]"

Or install everything at once:

pip install "trailhead[all-languages]"

Development install

To get the latest unreleased changes, install directly from the repository:

git clone https://github.com/McIndi/trailhead.git
cd trailhead
python -m venv .venv
source .venv/bin/activate  # Windows: .\.venv\Scripts\Activate.ps1
pip install -e ".[dev]"

Extra              Language          File extensions
python (built-in)  Python            .py
javascript         JavaScript        .js .mjs .cjs
typescript         TypeScript / TSX  .ts .tsx
rust               Rust              .rs
go                 Go                .go
java               Java              .java
csharp             C#                .cs
c                  C                 .c .h
cpp                C++               .cpp .cc .cxx .hpp .hxx .h++
ruby               Ruby              .rb
php                PHP               .php
bash               Bash / Shell      .sh .bash
html               HTML              .html .htm

Quick start

The typical workflow is: index your source tree once, then serve and query it.

# 1. Index a project (the DB defaults to .trailhead/db.sqlite; overridden here)
th index . --sqlite-db ./.trailhead/graph.db --embed-model sentence-transformers/all-MiniLM-L6-v2

# 2. Start the server (watches for changes, keeps embeddings warm)
th serve . --sqlite-db ./.trailhead/graph.db --model sentence-transformers/all-MiniLM-L6-v2

# 3. Open the browser UI (Windows; use `open` on macOS or `xdg-open` on Linux)
start http://localhost:8000

# 4. Or query from the CLI or another terminal
th query similar "HTTP route registration"
curl "http://localhost:8000/api/query/similar?text=HTTP+route+registration"

The server re-indexes changed files automatically in the background. You do not need to re-run th index while the server is running.

Usage

embed

Generate an embedding for a piece of text:

th embed "A short sentence to embed"
th embed "A short sentence to embed" --model sentence-transformers/all-mpnet-base-v2

The command prints the embedding as a JSON array of floats.
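
Because the output is plain JSON, it is easy to consume from a script. A minimal parsing sketch (the sample vector below is illustrative, not real model output; in practice the string would come from `subprocess.run(["th", "embed", text], capture_output=True)`):

```python
import json

def parse_embedding(stdout: str) -> list[float]:
    """Parse the JSON array printed by `th embed` into a list of floats."""
    vec = json.loads(stdout)
    if not isinstance(vec, list):
        raise ValueError("expected a JSON array of floats")
    return [float(x) for x in vec]

sample = "[0.12, -0.03, 0.98]"  # illustrative, not real model output
print(parse_embedding(sample))  # [0.12, -0.03, 0.98]
```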

Optional cache override:

$env:TRAILHEAD_CACHE_DIR = "C:\models\cache"
th embed "A short sentence to embed"
th embed "A short sentence to embed" --cache-dir "C:\another\cache"

index

Index a directory of source files. The graph is persisted to .trailhead/db.sqlite by default (smart sync: full build on first run, incremental on subsequent runs):

th index .

Use --in-memory to build the graph without writing to disk and print a summary:

th index . --in-memory
th index . --in-memory --output json

Preview which files would be indexed without parsing or writing any SQLite state:

th index . --dry-run
th index . --dry-run --output json

Watch for file changes and reindex incrementally (Ctrl-C to stop):

th index . --watch
th index . --sqlite-db ./.trailhead/graph.db --embed-model sentence-transformers/all-MiniLM-L6-v2 --watch

Use a custom database path or add embeddings:

th index . --sqlite-db ./.trailhead/graph.db
th index . --sqlite-db ./.trailhead/graph.db --embed-model sentence-transformers/all-MiniLM-L6-v2
th index . --sqlite-db ./.trailhead/graph.db --embed-model sentence-transformers/all-MiniLM-L6-v2 --embed-cache-dir C:\models\cache

When sqlite-vector can be loaded, trailhead also initializes vector search for the vertex_embeddings.embedding column. If extension loading is unavailable on your platform build, embeddings are still stored as Float32 BLOBs in SQLite.
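
If you need to read those BLOBs outside trailhead, they can be decoded with the standard library. This sketch assumes the common packed little-endian float32 layout (an assumption; verify against your database before relying on it):

```python
import struct

def decode_f32_blob(blob: bytes) -> list[float]:
    """Decode a packed float32 vector (assumed little-endian) into Python floats."""
    n = len(blob) // 4
    return list(struct.unpack(f"<{n}f", blob))

# Round-trip demo with a hand-built blob
blob = struct.pack("<3f", 0.5, -1.0, 2.0)
print(decode_f32_blob(blob))  # [0.5, -1.0, 2.0]
```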

Source discovery respects .gitignore and .trailheadignore in the indexed directory. Both use gitignore-style patterns, and .trailheadignore is applied after .gitignore so Trailhead-specific rules take precedence. If you change either ignore file, delete .trailhead/db.sqlite and run th index again to rebuild the index with the new rules.

th index --dry-run --output json returns a file-preview payload with this schema:

{
  "root": "C:/path/to/project",
  "count": 2,
  "files": ["src/app.py", "src/lib/util.py"]
}
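
That payload is straightforward to consume from a script, for example to sanity-check what would be indexed before a full run (the payload literal below just mirrors the schema above):

```python
import json

payload = json.loads("""
{
  "root": "C:/path/to/project",
  "count": 2,
  "files": ["src/app.py", "src/lib/util.py"]
}
""")

# Basic checks against the documented schema
assert payload["count"] == len(payload["files"])
py_files = [f for f in payload["files"] if f.endswith(".py")]
print(py_files)  # ['src/app.py', 'src/lib/util.py']
```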

serve

Run the warm-model API server with a background indexer. The server watches the source tree, keeps the SQLite graph fresh, and reuses the loaded embedding model across index updates. The database defaults to .trailhead/db.sqlite under the watched directory:

th serve .
th serve . --model sentence-transformers/all-MiniLM-L6-v2
th serve . --sqlite-db ./.trailhead/graph.db --model sentence-transformers/all-MiniLM-L6-v2

The browser UI is available at http://localhost:8000 once the server starts.

query

Run a read-only SQL query against the SQLite database (defaults to ./.trailhead/db.sqlite):

th query sql --sql "SELECT label, COUNT(*) AS n FROM vertices GROUP BY label ORDER BY label"
th query sql --sqlite-db ./.trailhead/graph.db --sql "SELECT label, COUNT(*) AS n FROM vertices GROUP BY label ORDER BY label"

Run a semantic similarity query against stored vertex embeddings:

th query similar "find sqlite vector initialization code"
th query similar "find sqlite vector initialization code" --sqlite-db ./.trailhead/graph.db
th query similar "graph persistence" --sqlite-db ./.trailhead/graph.db --label function --k 5 --output json

HTTP API

When the server is running, the full API schema is available at:

http://localhost:8000/openapi.json
http://localhost:8000/docs

The schema documents every endpoint, parameter name, type, default, and constraint. Check it first before probing endpoints manually.

Endpoints

Method  Path                             Description
GET     /                                Browser UI
GET     /api/health                      Server status and configuration
POST    /api/embed                       Embed a single text string
POST    /api/embed/batch                 Embed multiple texts
POST    /api/query/sql                   Run a read-only SQL query
GET     /api/query/templates             List built-in query templates
GET     /api/query/templates/{name}      Get a template's SQL
POST    /api/query/templates/{name}/run  Run a template against the database
GET     /api/query/similar               Semantic similarity search (parameters: text, k)
GET     /api/graph/vertices              Search vertices by name, label, or path
GET     /api/graph/traverse              Traverse the graph from a vertex

SQL schema

The two core tables are:

vertices — one row per code symbol:

Column           Type     Notes
id               TEXT     UUID, used for graph traversal
label            TEXT     module, class, function, external
name             TEXT     Symbol name
path             TEXT     Absolute file path
line             INTEGER  Line number (null for modules)
complexity       INTEGER  McCabe complexity (functions only)
properties_json  TEXT     JSON blob with source, docstring, and all other properties

edges — relationships between vertices:

Column           Type  Notes
id               TEXT  UUID
label            TEXT  defines, has_method, imports, calls
out_v_id         TEXT  Source vertex id
in_v_id          TEXT  Target vertex id
properties_json  TEXT  Always {} currently

Edge labels and their meaning:

Label       Meaning
defines     Module → class or function it defines
has_method  Class → method
imports     Module → external symbol it imports
calls       Function → function it calls
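
Combined with the schema above, these edge labels answer typical graph questions in plain SQL. A self-contained sketch (an in-memory database seeded with illustrative rows, not trailhead's own loader):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE vertices (id TEXT, label TEXT, name TEXT, path TEXT,
                       line INTEGER, complexity INTEGER, properties_json TEXT);
CREATE TABLE edges (id TEXT, label TEXT, out_v_id TEXT, in_v_id TEXT,
                    properties_json TEXT);
INSERT INTO vertices VALUES
  ('v1', 'function', 'handler',  'app.py', 10, 1, '{}'),
  ('v2', 'function', 'register', 'app.py', 30, 2, '{}');
INSERT INTO edges VALUES ('e1', 'calls', 'v1', 'v2', '{}');
""")

# Who calls `register`? Follow `calls` edges inward.
rows = con.execute("""
SELECT caller.name
FROM edges e
JOIN vertices caller ON caller.id = e.out_v_id
JOIN vertices callee ON callee.id = e.in_v_id
WHERE e.label = 'calls' AND callee.name = 'register'
""").fetchall()
print(rows)  # [('handler',)]
```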

source and docstring live inside properties_json rather than as top-level columns. To filter on them in SQL, use json_extract:

-- Functions whose source mentions "HTTPException"
SELECT name, path, line
FROM vertices
WHERE label = 'function'
  AND json_extract(properties_json, '$.source') LIKE '%HTTPException%'

-- Functions with a docstring
SELECT name, path
FROM vertices
WHERE label = 'function'
  AND json_extract(properties_json, '$.docstring') IS NOT NULL

HTTP query examples

# Health check
curl http://localhost:8000/api/health

# Embed text
curl -X POST http://localhost:8000/api/embed \
  -H "Content-Type: application/json" \
  -d '{"text": "hello world"}'

# Semantic search — note the result-count parameter is "k", not "limit"
curl "http://localhost:8000/api/query/similar?text=route+registration&k=5"

# Filter semantic search to functions only
curl "http://localhost:8000/api/query/similar?text=route+registration&k=5&label=function"

# SQL query
curl -X POST http://localhost:8000/api/query/sql \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT label, COUNT(*) AS n FROM vertices GROUP BY label"}'

# Find a vertex by name, then get its id for traversal
curl "http://localhost:8000/api/graph/vertices?name=ui_dashboard&label=function"

# Traverse outward along call edges only (shows what a function calls)
curl "http://localhost:8000/api/graph/traverse?vertex_id=<id>&direction=out&depth=2&edge_labels=calls"

# Traverse inward along call edges only (shows what calls a function)
curl "http://localhost:8000/api/graph/traverse?vertex_id=<id>&direction=in&depth=2&edge_labels=calls"

# Run a built-in query template
curl http://localhost:8000/api/query/templates
curl -X POST http://localhost:8000/api/query/templates/function_complexity/run

Built-in query templates

Templates are pre-built SQL queries that you can run without writing any SQL yourself. Categories:

Category      Templates
quality       function_complexity, missing_docstrings, undocumented_public_api, todo_fixme_inventory
testing       symbols_not_represented_by_tests, test_coverage_ratio_by_file, largest_untested_symbols
architecture  duplicate_symbol_names, dependency_hotspots, external_dependency_pressure
calls         most_called_functions, call_graph_hubs
data_health   missing_source_for_functions, orphan_edges

Typical workflow for code exploration

  1. Find a starting point — use semantic search or /api/graph/vertices?name=... to locate a vertex and grab its id.
  2. Understand its call chain — traverse outward with edge_labels=calls to see what it calls; inward to see its callers.
  3. Understand its structure — traverse with edge_labels=defines,has_method to see what a module or class contains.
  4. Run quality checks — use the built-in templates for complexity, missing docs, or dependency hotspots without writing SQL.
  5. Ad-hoc queries — use /api/query/sql with json_extract to filter on source content, docstrings, or any property.

Tests

pytest

Project Layout

.
|-- pyproject.toml
|-- README.md
|-- src/
|   `-- trailhead/
|       |-- __init__.py
|       |-- __main__.py
|       |-- cli/
|       |   |-- __init__.py
|       |   |-- __main__.py
|       |   |-- app.py
|       |   `-- commands/
|       |       |-- __init__.py
|       |       |-- embed.py
|       |       |-- index.py
|       |       |-- query.py
|       |       `-- serve.py
|       |-- server/
|       |   |-- __init__.py
|       |   |-- __main__.py
|       |   |-- app.py
|       |   `-- templates/
|       |       `-- query_ui.html
|       `-- services/
|           |-- config/
|           |   `-- cache.py
|           |-- indexing/
|           |   |-- __init__.py
|           |   |-- graph.py
|           |   |-- graph_query.py
|           |   |-- live_indexer.py
|           |   |-- parser.py          # re-exports parse_python_file (backwards compat)
|           |   |-- query.py
|           |   |-- sqlite_store.py
|           |   |-- walker.py
|           |   `-- adapters/          # language adapter registry
|           |       |-- __init__.py    # auto-registers available adapters
|           |       |-- base.py        # LanguageAdapter ABC + shared utilities
|           |       |-- registry.py    # extension → adapter map, parse_file()
|           |       |-- python.py      # Python (built-in)
|           |       |-- javascript.py  # JavaScript (optional)
|           |       |-- typescript.py  # TypeScript / TSX (optional)
|           |       |-- rust.py        # Rust (optional)
|           |       |-- go.py          # Go (optional)
|           |       |-- java.py        # Java (optional)
|           |       |-- csharp.py      # C# (optional)
|           |       |-- c.py           # C (optional)
|           |       |-- cpp.py         # C++ (optional)
|           |       |-- ruby.py        # Ruby (optional)
|           |       |-- php.py         # PHP (optional)
|           |       |-- bash.py        # Bash / Shell (optional)
|           |       `-- html.py        # HTML (optional)
|           `-- embeddings/
|               |-- generator.py
|               `-- model_store.py
`-- tests/
    |-- conftest.py
    |-- test_indexing.py
    |-- test_query.py
    |-- test_server.py
    `-- test_smoke.py

Adding a custom language adapter

Any language with a tree-sitter Python binding can be supported in three steps:

# 1. Create your adapter (e.g. my_adapters/kotlin.py)
from trailhead.services.indexing.adapters.base import LanguageAdapter, _node_text, _complexity
from trailhead.services.indexing.graph import PropertyGraph, Vertex
from pathlib import Path

class KotlinAdapter(LanguageAdapter):
    extensions = frozenset({".kt", ".kts"})

    @classmethod
    def is_available(cls) -> bool:
        try:
            import tree_sitter_kotlin  # noqa: F401
            return True
        except ImportError:
            return False

    def parse(self, path: Path, graph: PropertyGraph) -> Vertex:
        import tree_sitter_kotlin as tskotlin
        from tree_sitter import Language, Parser
        source = path.read_bytes()
        language = Language(tskotlin.language())
        parser = Parser(language)
        tree = parser.parse(source)
        module_v = graph.add_vertex("module", name=path.stem, path=str(path))
        # ... walk tree and add vertices/edges ...
        return module_v

# 2. Register it at startup (e.g. in your app's __init__ or conftest)
from trailhead.services.indexing.adapters import register
register(KotlinAdapter())

# 3. Done — th index, serve, and query all pick it up automatically.

What each adapter should produce:

Vertex label  Meaning                             Required properties
module        One per source file                 name, path
class         Class / struct / interface / trait  name, path, line
function      Function / method                   name, path, line, source, complexity
external      Imported module name                name
Edges: defines (module→class, module→function), has_method (class→function), imports (module→external), calls (function→function).

Download files

Source distribution

trailhead-0.1.2.tar.gz (67.4 kB)

Built distribution

trailhead-0.1.2-py3-none-any.whl (93.0 kB)

File details

trailhead-0.1.2.tar.gz

  • Size: 67.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

Hashes for trailhead-0.1.2.tar.gz:

Algorithm    Hash digest
SHA256       94f09db7367f4f7b4699652bef2ec32c9987c66f86d0ebfaaa08d9eb3979f409
MD5          f107de85102dd4c1674615426b4cd991
BLAKE2b-256  3b84a93d5fe26a8d8426d29af7a3164b8bf5cd887816bdc112f89aa0375e62c8

Provenance

Attestation bundles for trailhead-0.1.2.tar.gz were published by publish.yml on McIndi/trailhead. Values reflect the state when the release was signed and may no longer be current.

File details

trailhead-0.1.2-py3-none-any.whl

  • Size: 93.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

Hashes for trailhead-0.1.2-py3-none-any.whl:

Algorithm    Hash digest
SHA256       e966de326c0bf494a2571af3df7f8ff81e1aca289aff53f6cf4eefea4f44b1f7
MD5          f1a26e1d406f98758184704a50b29d27
BLAKE2b-256  395fbd678b2eac4f7d44ca3974d1ec1b1f0436e9f11fcc07a22a2d609eb080a8

Provenance

Attestation bundles for trailhead-0.1.2-py3-none-any.whl were published by publish.yml on McIndi/trailhead. Values reflect the state when the release was signed and may no longer be current.
