KGModule for file tree knowledge graphs
Project description
FTreeKG — A Knowledge Graph for Filesystem Hierarchies with Semantic Indexing and Per-Format Metadata Extraction
Author: Eric G. Suchanek, PhD Flux-Frontiers, Liberty TWP, OH
Overview
FTreeKG turns any directory tree into a knowledge graph you can talk to. It walks the filesystem, classifies every entry as a file, directory, or symlink, captures the cheap stat the OS already exposes — size, mtime, mode, symlink target — and reaches one step further to lift per-format metadata (image EXIF today; audio, video, and PDF reserved) into a JSON blob that travels with each node. The skeleton is persisted to SQLite; a LanceDB vector index sits on top for semantic search.
The point is to make a filesystem askable. "Where do we keep
configuration?" "Which photos came from the iPhone in 2023?" "What
changed in src/ since the last release?" — questions that normally
require some combination of find, mdfind, manual inspection, and
guesswork get a single, ranked answer. The graph is small, fast to
build, and cheap to rebuild, so it works equally well as a one-shot
analysis tool, an LLM context source, and a structural complement to
codebase- and document-level knowledge graphs in the same workflow.
FTreeKG is a member of the KGRAG
family of knowledge graphs. It uses the same hybrid SQLite-plus-LanceDB
architecture as its sister projects PyCodeKG
(Python source) and DocKG
(document corpora), exposes itself to KGRAG's federated query layer with
the dedicated KGKind.FILETREE kind, and is built on the shared
KGModule primitives so it
slots cleanly into the same agents and pipelines.
The technical reading list:
- docs/SCHEMA.md — node kinds, edge types, node-ID format, full SQLite and LanceDB column reference, per-format metadata fields.
- docs/CHEATSHEET.md — query patterns, EXIF search recipes, snapshot workflows, common questions answered with one-liners.
- docs/CLI.md — flag-by-flag reference for every
ftreekgsubcommand, plus thepyproject.tomlconfiguration surface. - docs/pipeline.md — the build and query pipelines as flowing prose, with diagram hints suitable for PaperBanana or any other generator.
- docs/MCP.md — how the local
.mcp.jsonwires PyCodeKG and DocKG into AI agents working on this repo, and what a dedicated FTreeKG MCP server would look like.
Quick start
After installing the package, point ftreekg build at any directory.
The first run wipes any existing index and produces a fresh
.filetreekg/ folder with the SQLite graph and the LanceDB vector
index inside it. Subsequent commands operate against that store with no
further setup.
ftreekg build --repo /path/to/project # walk + extract + embed
ftreekg query "configuration files" # natural-language search
ftreekg query "iPhone photos from 2023" # EXIF-grounded search
ftreekg status # live dashboard
ftreekg analyze # full Markdown report
The query and pack commands are deliberately the primary surface:
query returns a ranked list of nodes, pack returns the same nodes
with their metadata rendered as paste-ready blocks for LLM context.
status is the orientation tool — run it whenever you want to know
what's in the index right now. analyze writes a longer report to
analysis/filetreekg_analysis.md with a summary table, an
ASCII size-by-top-directory bar chart, a depth-3 directory tree, and
per-kind/per-relation breakdowns.
For the full set of commands and flags see docs/CLI.md; for query recipes see docs/CHEATSHEET.md.
Installation
FTreeKG requires Python 3.12 or 3.13. The core install pulls Click,
Rich, LanceDB, Pillow, and the shared kgmodule-utils package:
pip install ftree-kg # core runtime
pip install 'ftree-kg[kgdeps]' # add PyCodeKG + DocKG for federation
poetry add ftree-kg # Poetry equivalent
The kgdeps extra is what you want if you're working in a repo where
PyCodeKG and DocKG are also indexing alongside FTreeKG — it pins
compatible versions of both. For a complete development setup
(linting, tests, pre-commit, all extras), see
docs/CLI.md#development-setup.
How it works
A build runs three meaningful passes. The first walks the tree with
Path.rglob, applies the include/exclude/dotdir rules, and inserts a
node row per entry plus a CONTAINS edge from each parent. The second
re-stats every file to fill in size_bytes. The third — the per-format
metadata pass — calls into the dispatcher in ftree_kg.metadata, which
opens images with Pillow and decodes camera, lens, capture timestamp,
GPS coordinates, and dimensions; the resulting dict is JSON-serialized
into the metadata column. A final embedding step builds a canonical
two-line text document for each node — "{kind} {basename} at {path}"
plus a keyword line that includes path components, basename token
splits, the file extension, and projected metadata tokens (camera
make/model, year, year-month, GPS) — embeds them in batches via
kg_utils.embedder, and writes the vectors to a single LanceDB table.
That metadata projection is what makes EXIF-grounded queries work
without any filename hints. A photo whose path is just
photos/IMG_0042.jpg ends up with an embed line that mentions
apple iphone 14 pro 2023 2023-07 gps:37.7749,-122.4194, so a query
like "iPhone photos from 2023" matches it directly. The schema doc
walks through the embed-text format end-to-end:
docs/SCHEMA.md#embed-text-format.
Querying is intentionally simple: the query string is embedded with the
same model used at build time, LanceDB returns the top-k nodes ranked
by cosine distance, and that's the answer. There is no graph expansion
phase — filesystem nodes have only CONTAINS, which is structural and
not semantically informative for hop-style retrieval. When the LanceDB
table is missing or the embedder fails to load, query falls back to a
substring LIKE search across qualname, kind, docstring, and
metadata, so it always returns something useful even on a freshly
extracted tree with no embeddings.
The full pipeline is described in flowing prose, layer by layer, in docs/pipeline.md, which doubles as the input format for diagram generators.
Python API
Everything the CLI does is one method call away on FileTreeKG:
from ftree_kg import FileTreeKG
kg = FileTreeKG(repo_root="/path/to/project")
kg.build() # wipe=True, embed=True, metadata=True
result = kg.query("configuration files", k=5)
for node in result.nodes:
print(f"{node['kind']:12} {node['qualname']} ({node['score']:.3f})")
stats = kg.stats()
print(f"{stats['total_nodes']:,} paths, {stats['total_size_bytes']:,} bytes")
print(kg.analyze()) # Markdown report as a string
kg.close()
build() accepts embed=False and metadata=False to skip the
expensive passes — useful when you want a fast structural index for
testing, or when the embedder isn't available in CI.
Configuration
Indexing scope is configurable from pyproject.toml. The block lives
under [tool.filetreekg] and has two keys, both optional:
[tool.filetreekg]
include = ["src", "docs"] # restrict to these top-level directories
exclude = ["archives"] # skip in addition to the built-in skip list
include is a whitelist — when it's non-empty, only paths under one of
the listed directories are indexed. exclude is additive on top of
the always-skipped names (venv, env, __pycache__, build,
dist, egg-info, node_modules). All dotdirs (.git, .venv,
.codekg, …) are skipped automatically unless you explicitly list them
in include.
CLI flags --include-dir and --exclude-dir override the config when
specified. Full precedence rules and per-command examples live in
docs/CLI.md.
Storage
A built tree gets a single hidden directory:
.filetreekg/
graph.sqlite # canonical knowledge graph (nodes + edges + metadata)
lancedb/ # derived vector index (kg_nodes.lance)
snapshots/ # temporal metric snapshots, keyed by git tree hash
manifest.json
<tree-hash>.json
SQLite is canonical — it is the source of truth. LanceDB is
derived and disposable: deleting .filetreekg/lancedb/ and
re-running ftreekg build reproduces it without re-walking the tree
(the embed pass reads from SQLite). Snapshots are append-only and
keyed by the git tree hash of the staged index, so they form a
deterministic timeline you can diff between commits or releases.
ftreekg install-hooks writes a pre-commit hook that captures a
snapshot on every commit.
For column-level details — node IDs, every SQLite column, every LanceDB column, every per-format metadata field — see docs/SCHEMA.md.
KGRAG federation
FileTreeKG.kind() returns "filetree" (the dedicated
KGKind.FILETREE enum value), and FileTreeKGAdapter exposes the
module to the KGRAG
federation layer. That means a single federated query can combine
filesystem context with code and document context from PyCodeKG and
DocKG indexed against the same repo:
from kg_rag import KGRAG
kgrag = KGRAG()
result = kgrag.query("how do we ship releases",
kinds=["code", "doc", "filetree"])
For working with the repo through MCP-compatible AI agents — including
the .mcp.json shipped in this checkout and the federated alternative
to a dedicated FTreeKG MCP server — see docs/MCP.md.
Citation
If you use FTreeKG in research or a project, please cite it:
APA
Suchanek, E. G. (2026). FTreeKG: Knowledge Graph for Filesystem Hierarchies (Version 0.8.0) [Software]. Flux-Frontiers. https://doi.org/10.5281/zenodo.1182124358
BibTeX
@software{suchanek_ftree_kg,
author = {Suchanek, Eric G.},
title = {{FTreeKG}: Knowledge Graph for Filesystem Hierarchies},
version = {0.8.0},
year = {2026},
publisher = {Flux-Frontiers},
url = {https://github.com/Flux-Frontiers/FTreeKG},
doi = {10.5281/zenodo.1182124358},
}
License
Elastic License 2.0 — free for non-commercial and internal use; commercial redistribution requires a license from Flux-Frontiers.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ftree_kg-0.8.0.tar.gz.
File metadata
- Download URL: ftree_kg-0.8.0.tar.gz
- Upload date:
- Size: 38.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.2 CPython/3.12.13 Darwin/25.4.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
503fcb306f27fe234a490c49ea0a650f0640a49ca0710f7551a06a27f2dd38da
|
|
| MD5 |
1db3b47dab4a737ed183c2405bbde8da
|
|
| BLAKE2b-256 |
f23940d84efd491210866ae7f40e1b92bc0d01134d5a3de6581846cab0bbe515
|
File details
Details for the file ftree_kg-0.8.0-py3-none-any.whl.
File metadata
- Download URL: ftree_kg-0.8.0-py3-none-any.whl
- Upload date:
- Size: 42.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.3.2 CPython/3.12.13 Darwin/25.4.0
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
58ee8673576620fe1edf63659396c92fee9452cce153c8ed9e778233954e44cd
|
|
| MD5 |
e81bc80b2e50208974a1bca252393a08
|
|
| BLAKE2b-256 |
79c31a3d3db7ccdbd48b893abb59f8a2473575ab05f2993c224bd80d5f4bc0ec
|