# trailhead

Codebase indexing and semantic search using tree-sitter parsing, vector embeddings, and SQLite graph storage.

Command-line code indexing and semantic search tool. It parses source files into a property graph (modules, classes, functions, and their relationships) stored in SQLite, generates text embeddings with sentence-transformers, and exposes everything through a CLI and an HTTP API.
- Single CLI command: `th`
- Text embeddings powered by sentence-transformers (models cached locally)
- Polyglot code indexing via tree-sitter (Python built-in; 12 additional languages optional)
- Property graph persisted in a single SQLite file with optional vector search
- Warm-model FastAPI server keeps the embedding model loaded in memory
- Background file watcher incrementally re-indexes on change
- Interactive browser UI for querying and visualizing the code graph
## Requirements

- Python 3.10+

## Install

```bash
pip install trailhead
```
## Language support

Python is supported out of the box. Additional languages are installed as optional extras. Only the packages you install are active; missing ones are silently skipped at startup.

Install individual languages:

```bash
pip install "trailhead[javascript]"
pip install "trailhead[typescript]"
pip install "trailhead[rust]"
pip install "trailhead[go]"
pip install "trailhead[java]"
pip install "trailhead[csharp]"
pip install "trailhead[c]"
pip install "trailhead[cpp]"
pip install "trailhead[ruby]"
pip install "trailhead[php]"
pip install "trailhead[bash]"
pip install "trailhead[html]"
```

Or install everything at once:

```bash
pip install "trailhead[all-languages]"
```
## Development install

To get the latest unreleased changes, install directly from the repository:

```bash
git clone https://github.com/McIndi/trailhead.git
cd trailhead
python -m venv .venv
source .venv/bin/activate  # Windows: .\.venv\Scripts\Activate.ps1
pip install -e ".[dev]"
```
| Extra | Language | File extensions |
|---|---|---|
| python (built-in) | Python | .py |
| javascript | JavaScript | .js .mjs .cjs |
| typescript | TypeScript / TSX | .ts .tsx |
| rust | Rust | .rs |
| go | Go | .go |
| java | Java | .java |
| csharp | C# | .cs |
| c | C | .c .h |
| cpp | C++ | .cpp .cc .cxx .hpp .hxx .h++ |
| ruby | Ruby | .rb |
| php | PHP | .php |
| bash | Bash / Shell | .sh .bash |
| html | HTML | .html .htm |
## Quick start

The typical workflow is: index your source tree once, then serve and query it.

```bash
# 1. Index a project (writes .trailhead/db.sqlite by default)
th index . --sqlite-db ./.trailhead/graph.db --embed-model sentence-transformers/all-MiniLM-L6-v2

# 2. Start the server (watches for changes, keeps embeddings warm)
th serve . --sqlite-db ./.trailhead/graph.db --model sentence-transformers/all-MiniLM-L6-v2

# 3. Open the browser UI
start http://localhost:8000

# 4. Or query from the CLI or another terminal
th query similar "HTTP route registration"
curl "http://localhost:8000/api/query/similar?text=HTTP+route+registration"
```

The server re-indexes changed files automatically in the background. You do not need to re-run `th index` while the server is running.
## Usage

### embed

Generate an embedding for a piece of text:

```bash
th embed "A short sentence to embed"
th embed "A short sentence to embed" --model sentence-transformers/all-mpnet-base-v2
```

The command prints the embedding as a JSON array of floats.
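Because the output is a plain JSON array, downstream scripts can compare embeddings directly. A minimal sketch (the vectors below are invented for illustration, not real model output):

```python
import json
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Pretend each string came from a separate `th embed ...` invocation.
vec1 = json.loads("[0.1, 0.2, 0.3]")
vec2 = json.loads("[0.1, 0.2, 0.25]")

print(cosine_similarity(vec1, vec2))
```

Scores close to 1.0 indicate semantically similar inputs; this is the same comparison the server performs for `/api/query/similar`.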
Optional cache override (PowerShell example):

```powershell
$env:TRAILHEAD_CACHE_DIR = "C:\models\cache"
th embed "A short sentence to embed"
th embed "A short sentence to embed" --cache-dir "C:\another\cache"
```
### index

Index a directory of source files. The graph is persisted to `.trailhead/db.sqlite` by default (smart sync: full build on the first run, incremental on subsequent runs):

```bash
th index .
```

Use `--in-memory` to build the graph without writing to disk and print a summary:

```bash
th index . --in-memory
th index . --in-memory --output json
```

Preview which files would be indexed without parsing or writing any SQLite state:

```bash
th index . --dry-run
th index . --dry-run --output json
```

Watch for file changes and reindex incrementally (Ctrl-C to stop):

```bash
th index . --watch
th index . --sqlite-db ./.trailhead/graph.db --embed-model sentence-transformers/all-MiniLM-L6-v2 --watch
```

Use a custom database path or add embeddings:

```bash
th index . --sqlite-db ./.trailhead/graph.db
th index . --sqlite-db ./.trailhead/graph.db --embed-model sentence-transformers/all-MiniLM-L6-v2
th index . --sqlite-db ./.trailhead/graph.db --embed-model sentence-transformers/all-MiniLM-L6-v2 --embed-cache-dir C:\models\cache
```
When the sqlite-vector extension can be loaded, trailhead also initializes vector search for the `vertex_embeddings.embedding` column. If extension loading is unavailable in your platform's SQLite build, embeddings are still stored as Float32 BLOBs in SQLite.
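The Float32 BLOB layout can be read back with the standard library alone, so embeddings remain usable even without the extension. A sketch (the `vertex_embeddings` table shape follows the column named above; the sample vector is made up):

```python
import sqlite3
from array import array

def to_blob(vec: list[float]) -> bytes:
    """Pack a float vector as a contiguous run of float32 values."""
    return array("f", vec).tobytes()

def from_blob(blob: bytes) -> list[float]:
    """Unpack a float32 BLOB back into a list of Python floats."""
    return list(array("f", blob))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vertex_embeddings (vertex_id TEXT PRIMARY KEY, embedding BLOB)"
)
# 0.5, -1.25, and 3.0 are exactly representable in float32,
# so this round-trip is lossless.
conn.execute(
    "INSERT INTO vertex_embeddings VALUES (?, ?)",
    ("v1", to_blob([0.5, -1.25, 3.0])),
)

(blob,) = conn.execute(
    "SELECT embedding FROM vertex_embeddings WHERE vertex_id = 'v1'"
).fetchone()
print(from_blob(blob))  # → [0.5, -1.25, 3.0]
```

This is only the storage format; nearest-neighbor search over these BLOBs is what the sqlite-vector extension adds on top.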
Source discovery respects `.gitignore` and `.trailheadignore` in the indexed directory. Both use gitignore-style patterns, and `.trailheadignore` is applied after `.gitignore`, so Trailhead-specific rules take precedence. If you change either ignore file, delete `.trailhead/db.sqlite` and run `th index` again to rebuild the index with the new rules.
`th index --dry-run --output json` returns a file-preview payload with this schema:

```json
{
  "root": "C:/path/to/project",
  "count": 2,
  "files": ["src/app.py", "src/lib/util.py"]
}
```
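A script can consume that payload directly. A sketch using the schema above (the JSON is inlined here so the example runs standalone; in practice it would come from a `subprocess` call to `th`):

```python
import json

# In practice:
#   result = subprocess.run(["th", "index", ".", "--dry-run", "--output", "json"],
#                           capture_output=True, text=True)
#   payload = json.loads(result.stdout)
payload = json.loads("""
{
  "root": "C:/path/to/project",
  "count": 2,
  "files": ["src/app.py", "src/lib/util.py"]
}
""")

# `count` is the number of entries in `files`; paths are relative to `root`.
assert payload["count"] == len(payload["files"])
for rel_path in payload["files"]:
    print(rel_path)
```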
### serve

Run the warm-model API server with a background indexer. The server watches the source tree, keeps the SQLite graph fresh, and reuses the loaded embedding model across index updates. The database defaults to `.trailhead/db.sqlite` under the watched directory:

```bash
th serve .
th serve . --model sentence-transformers/all-MiniLM-L6-v2
th serve . --sqlite-db ./.trailhead/graph.db --model sentence-transformers/all-MiniLM-L6-v2
```

The browser UI is available at http://localhost:8000 once the server starts.
### query

Run a read-only SQL query against the SQLite database (defaults to `./.trailhead/db.sqlite`):

```bash
th query sql --sql "SELECT label, COUNT(*) AS n FROM vertices GROUP BY label ORDER BY label"
th query sql --sqlite-db ./.trailhead/graph.db --sql "SELECT label, COUNT(*) AS n FROM vertices GROUP BY label ORDER BY label"
```

Run a semantic similarity query against stored vertex embeddings:

```bash
th query similar "find sqlite vector initialization code"
th query similar "find sqlite vector initialization code" --sqlite-db ./.trailhead/graph.db
th query similar "graph persistence" --sqlite-db ./.trailhead/graph.db --label function --k 5 --output json
```
## HTTP API

When the server is running, the full API schema is available at:

- http://localhost:8000/openapi.json
- http://localhost:8000/docs

The schema documents every endpoint, parameter name, type, default, and constraint. Check it first before probing endpoints manually.
### Endpoints

| Method | Path | Description |
|---|---|---|
| GET | `/` | Browser UI |
| GET | `/api/health` | Server status and configuration |
| POST | `/api/embed` | Embed a single text string |
| POST | `/api/embed/batch` | Embed multiple texts |
| POST | `/api/query/sql` | Run a read-only SQL query |
| GET | `/api/query/templates` | List built-in query templates |
| GET | `/api/query/templates/{name}` | Get a template's SQL |
| POST | `/api/query/templates/{name}/run` | Run a template against the database |
| GET | `/api/query/similar` | Semantic similarity search (parameters: `text`, `k`) |
| GET | `/api/graph/vertices` | Search vertices by name, label, or path |
| GET | `/api/graph/traverse` | Traverse the graph from a vertex |
## SQL schema

The two core tables are:

`vertices` (one row per code symbol):

| Column | Type | Notes |
|---|---|---|
| id | TEXT | UUID, used for graph traversal |
| label | TEXT | module, class, function, external |
| name | TEXT | Symbol name |
| path | TEXT | Absolute file path |
| line | INTEGER | Line number (null for modules) |
| complexity | INTEGER | McCabe complexity (functions only) |
| properties_json | TEXT | JSON blob with source, docstring, and all other properties |

`edges` (relationships between vertices):

| Column | Type | Notes |
|---|---|---|
| id | TEXT | UUID |
| label | TEXT | defines, has_method, imports, calls |
| out_v_id | TEXT | Source vertex id |
| in_v_id | TEXT | Target vertex id |
| properties_json | TEXT | Always {} currently |
Edge labels and their meaning:

| Label | Meaning |
|---|---|
| defines | Module → class or function it defines |
| has_method | Class → method |
| imports | Module → external symbol it imports |
| calls | Function → function it calls |
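Because `calls` edges are plain rows, multi-hop questions ("what does this function reach, transitively?") can be answered with a recursive CTE. A sketch against a throwaway in-memory database that mirrors the `edges` columns above (the vertex ids are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edges (id TEXT, label TEXT, out_v_id TEXT, in_v_id TEXT,"
    " properties_json TEXT)"
)
# main -> handler -> helper, each via a `calls` edge
conn.executemany(
    "INSERT INTO edges VALUES (?, 'calls', ?, ?, '{}')",
    [("e1", "main", "handler"), ("e2", "handler", "helper")],
)

# Everything reachable from `main` along `calls` edges, up to 3 hops.
rows = conn.execute(
    """
    WITH RECURSIVE reachable(vid, depth) AS (
        SELECT 'main', 0
        UNION
        SELECT e.in_v_id, r.depth + 1
        FROM edges e JOIN reachable r ON e.out_v_id = r.vid
        WHERE e.label = 'calls' AND r.depth < 3
    )
    SELECT vid FROM reachable WHERE depth > 0 ORDER BY vid
    """
).fetchall()
print([vid for (vid,) in rows])  # → ['handler', 'helper']
```

Swapping the join to `e.in_v_id = r.vid` walks the graph inward instead, answering "what calls this function?"; the `/api/graph/traverse` endpoint covers the same ground over HTTP.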
`source` and `docstring` live inside `properties_json` rather than as top-level columns. To filter on them in SQL, use `json_extract`:

```sql
-- Functions whose source mentions "HTTPException"
SELECT name, path, line
FROM vertices
WHERE label = 'function'
  AND json_extract(properties_json, '$.source') LIKE '%HTTPException%';

-- Functions with a docstring
SELECT name, path
FROM vertices
WHERE label = 'function'
  AND json_extract(properties_json, '$.docstring') IS NOT NULL;
```
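The same filter works from any SQLite client. A runnable sketch with a toy `vertices` table matching the schema above (the two rows are invented for illustration):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vertices (id TEXT, label TEXT, name TEXT, path TEXT,"
    " line INTEGER, complexity INTEGER, properties_json TEXT)"
)
conn.executemany(
    "INSERT INTO vertices VALUES (?, 'function', ?, 'app.py', ?, 1, ?)",
    [
        ("v1", "raise_404", 10,
         json.dumps({"source": "raise HTTPException(404)", "docstring": None})),
        ("v2", "add", 20,
         json.dumps({"source": "return a + b", "docstring": "Add two numbers."})),
    ],
)

# Only the function whose stored source mentions HTTPException matches.
rows = conn.execute(
    """
    SELECT name FROM vertices
    WHERE label = 'function'
      AND json_extract(properties_json, '$.source') LIKE '%HTTPException%'
    """
).fetchall()
print([name for (name,) in rows])  # → ['raise_404']
```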
## HTTP query examples

```bash
# Health check
curl http://localhost:8000/api/health

# Embed text
curl -X POST http://localhost:8000/api/embed \
  -H "Content-Type: application/json" \
  -d '{"text": "hello world"}'

# Semantic search (note the parameter is "k", not "limit")
curl "http://localhost:8000/api/query/similar?text=route+registration&k=5"

# Filter semantic search to functions only
curl "http://localhost:8000/api/query/similar?text=route+registration&k=5&label=function"

# SQL query
curl -X POST http://localhost:8000/api/query/sql \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT label, COUNT(*) AS n FROM vertices GROUP BY label"}'

# Find a vertex by name, then get its id for traversal
curl "http://localhost:8000/api/graph/vertices?name=ui_dashboard&label=function"

# Traverse outward along call edges only (shows what a function calls)
curl "http://localhost:8000/api/graph/traverse?vertex_id=<id>&direction=out&depth=2&edge_labels=calls"

# Traverse inward along call edges only (shows what calls a function)
curl "http://localhost:8000/api/graph/traverse?vertex_id=<id>&direction=in&depth=2&edge_labels=calls"

# Run a built-in query template
curl http://localhost:8000/api/query/templates
curl -X POST http://localhost:8000/api/query/templates/function_complexity/run
```
## Built-in query templates

Templates are pre-built SQL queries you can run without writing any SQL. Categories:

| Category | Templates |
|---|---|
| quality | function_complexity, missing_docstrings, undocumented_public_api, todo_fixme_inventory |
| testing | symbols_not_represented_by_tests, test_coverage_ratio_by_file, largest_untested_symbols |
| architecture | duplicate_symbol_names, dependency_hotspots, external_dependency_pressure |
| calls | most_called_functions, call_graph_hubs |
| data_health | missing_source_for_functions, orphan_edges |
## Typical workflow for code exploration

- Find a starting point: use semantic search or `/api/graph/vertices?name=...` to locate a vertex and grab its `id`.
- Understand its call chain: traverse outward with `edge_labels=calls` to see what it calls; inward to see its callers.
- Understand its structure: traverse with `edge_labels=defines,has_method` to see what a module or class contains.
- Run quality checks: use the built-in templates for complexity, missing docs, or dependency hotspots without writing SQL.
- Ad-hoc queries: use `/api/query/sql` with `json_extract` to filter on source content, docstrings, or any property.
## Tests

```bash
pytest
```
## Project layout

```text
.
|-- pyproject.toml
|-- README.md
|-- src/
|   `-- trailhead/
|       |-- __init__.py
|       |-- __main__.py
|       |-- cli/
|       |   |-- __init__.py
|       |   |-- __main__.py
|       |   |-- app.py
|       |   `-- commands/
|       |       |-- __init__.py
|       |       |-- embed.py
|       |       |-- index.py
|       |       |-- query.py
|       |       `-- serve.py
|       |-- server/
|       |   |-- __init__.py
|       |   |-- __main__.py
|       |   |-- app.py
|       |   `-- templates/
|       |       `-- query_ui.html
|       `-- services/
|           |-- config/
|           |   `-- cache.py
|           |-- indexing/
|           |   |-- __init__.py
|           |   |-- graph.py
|           |   |-- graph_query.py
|           |   |-- live_indexer.py
|           |   |-- parser.py          # re-exports parse_python_file (backwards compat)
|           |   |-- query.py
|           |   |-- sqlite_store.py
|           |   |-- walker.py
|           |   `-- adapters/          # language adapter registry
|           |       |-- __init__.py    # auto-registers available adapters
|           |       |-- base.py        # LanguageAdapter ABC + shared utilities
|           |       |-- registry.py    # extension → adapter map, parse_file()
|           |       |-- python.py      # Python (built-in)
|           |       |-- javascript.py  # JavaScript (optional)
|           |       |-- typescript.py  # TypeScript / TSX (optional)
|           |       |-- rust.py        # Rust (optional)
|           |       |-- go.py          # Go (optional)
|           |       |-- java.py        # Java (optional)
|           |       |-- csharp.py      # C# (optional)
|           |       |-- c.py           # C (optional)
|           |       |-- cpp.py         # C++ (optional)
|           |       |-- ruby.py        # Ruby (optional)
|           |       |-- php.py         # PHP (optional)
|           |       |-- bash.py        # Bash / Shell (optional)
|           |       `-- html.py        # HTML (optional)
|           `-- embeddings/
|               |-- generator.py
|               `-- model_store.py
`-- tests/
    |-- conftest.py
    |-- test_indexing.py
    |-- test_query.py
    |-- test_server.py
    `-- test_smoke.py
```
## Adding a custom language adapter

Any language with a tree-sitter Python binding can be supported in three steps:

```python
# 1. Create your adapter (e.g. my_adapters/kotlin.py)
from pathlib import Path

from trailhead.services.indexing.adapters.base import LanguageAdapter, _node_text, _complexity
from trailhead.services.indexing.graph import PropertyGraph, Vertex


class KotlinAdapter(LanguageAdapter):
    extensions = frozenset({".kt", ".kts"})

    @classmethod
    def is_available(cls) -> bool:
        try:
            import tree_sitter_kotlin  # noqa: F401
            return True
        except ImportError:
            return False

    def parse(self, path: Path, graph: PropertyGraph) -> Vertex:
        import tree_sitter_kotlin as tskotlin
        from tree_sitter import Language, Parser

        source = path.read_bytes()
        language = Language(tskotlin.language())
        parser = Parser(language)
        tree = parser.parse(source)
        module_v = graph.add_vertex("module", name=path.stem, path=str(path))
        # ... walk tree and add vertices/edges ...
        return module_v


# 2. Register it at startup (e.g. in your app's __init__ or conftest)
from trailhead.services.indexing.adapters import register

register(KotlinAdapter())

# 3. Done: th index, serve, and query all pick it up automatically.
```
What each adapter should produce:

| Vertex label | Meaning | Required properties |
|---|---|---|
| module | one per source file | name, path |
| class | class / struct / interface / trait | name, path, line |
| function | function / method | name, path, line, source, complexity |
| external | imported module name | name |

Edges: `defines` (module → class, module → function), `has_method` (class → function), `imports` (module → external), `calls` (function → function).