Language-agnostic Code Property Graph library — weave syntax trees into queryable, analyzable graphs

treeloom

A language-agnostic Code Property Graph (CPG) library for Python. treeloom parses source code via tree-sitter, builds a unified graph combining AST, control flow, data flow, and call graph layers, and provides query and analysis APIs on top of it.

Features

  • Multi-language parsing -- Python, JavaScript, TypeScript, Go, Java, C, C++, and Rust via tree-sitter grammars
  • Unified graph model -- AST structure, control flow, data flow, and call graphs in a single queryable graph
  • Taint analysis -- generic label-propagation engine for tracking data flow from sources to sinks, with sanitizer support
  • Pattern matching -- chain-based pattern queries for finding code patterns across the graph
  • Visualization -- export to JSON, Graphviz DOT, or interactive HTML (Cytoscape.js)
  • Consumer annotations -- attach arbitrary metadata to nodes without modifying the structural graph
  • Overlay system -- inject visual styling for domain-specific visualization (e.g., security analysis results)
  • Serialization -- full round-trip JSON serialization including annotations
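As a rough picture of what a chain-based pattern query does, the sketch below matches a sequence of node kinds along graph edges, with `*` as a wildcard step. It uses only the standard library and is not treeloom's `ChainPattern` implementation or API; `match_chain`, the `kinds` and `successors` dictionaries, and the node names are all hypothetical.

```python
from typing import Iterator


def match_chain(
    successors: dict[str, list[str]],
    kinds: dict[str, str],
    pattern: list[str],
) -> Iterator[list[str]]:
    """Yield node paths whose kinds match the pattern step by step; '*' matches any kind."""

    def extend(path: list[str], remaining: list[str]) -> Iterator[list[str]]:
        if not remaining:
            yield path
            return
        for nxt in successors.get(path[-1], []):
            if remaining[0] in ("*", kinds[nxt]):
                yield from extend(path + [nxt], remaining[1:])

    for node, kind in kinds.items():
        if pattern and pattern[0] in ("*", kind):
            yield from extend([node], pattern[1:])


# Toy graph: a parameter flows into a variable, which flows into a call
kinds = {"p": "parameter", "v": "variable", "c": "call", "l": "literal"}
successors = {"p": ["v"], "v": ["c"], "l": ["c"]}

matches = list(match_chain(successors, kinds, ["parameter", "*", "call"]))
print(matches)
```

Because the pattern has a fixed length, the search terminates even on cyclic graphs.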

Quick Start

from pathlib import Path
from treeloom import CPGBuilder, NodeKind, EdgeKind

# Build a CPG from a directory of source files
cpg = CPGBuilder().add_directory(Path("src/")).build()

# Inspect the graph
print(f"{cpg.node_count} nodes, {cpg.edge_count} edges")
print(f"Files: {[str(f) for f in cpg.files]}")

# Find all function definitions
for func in cpg.nodes(kind=NodeKind.FUNCTION):
    print(f"  {func.name} at {func.location}")

# Find all call sites targeting a specific function
for call in cpg.nodes(kind=NodeKind.CALL):
    if call.name == "eval":
        print(f"  eval() called at {call.location}")

# Query: what nodes are reachable from a function via data flow?
func_node = next(cpg.nodes(kind=NodeKind.FUNCTION))
reachable = cpg.query().reachable_from(
    func_node.id, edge_kinds=frozenset({EdgeKind.DATA_FLOWS_TO})
)
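
To make the reachability query above more concrete, here is a minimal sketch of reachability restricted to a set of edge kinds, run on a toy graph. It uses only the standard library and is not treeloom's implementation; the `Edge` tuple, this standalone `reachable_from` function, and the lowercase edge-kind strings are hypothetical names chosen for illustration.

```python
from collections import deque
from typing import NamedTuple


class Edge(NamedTuple):
    source: str
    target: str
    kind: str  # hypothetical stand-in for an edge kind such as DATA_FLOWS_TO


def reachable_from(edges: list[Edge], start: str, edge_kinds: frozenset[str]) -> set[str]:
    """Breadth-first search that follows only edges whose kind is in edge_kinds."""
    successors: dict[str, list[str]] = {}
    for edge in edges:
        if edge.kind in edge_kinds:
            successors.setdefault(edge.source, []).append(edge.target)

    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


toy_edges = [
    Edge("param:x", "var:y", "data_flows_to"),
    Edge("var:y", "call:eval", "data_flows_to"),
    Edge("func:f", "param:x", "has_parameter"),
    Edge("var:y", "var:z", "flows_to"),
]

data_reachable = reachable_from(toy_edges, "param:x", frozenset({"data_flows_to"}))
print(sorted(data_reachable))
```

Only `var:y` and `call:eval` are reported: `var:z` is connected by a control flow edge, which the edge-kind filter excludes.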

Installation

pip install treeloom              # core only (networkx + tree-sitter)
pip install treeloom[languages]   # with all language grammars
pip install treeloom[all]         # everything (grammars + dev tools)

For development:

git clone https://github.com/rdwj/treeloom.git
cd treeloom
pip install -e ".[all]"

Supported Languages

| Language   | Extensions       | Grammar Package        |
|------------|------------------|------------------------|
| Python     | .py, .pyi        | tree-sitter-python     |
| JavaScript | .js, .mjs, .cjs  | tree-sitter-javascript |
| TypeScript | .ts, .tsx        | tree-sitter-typescript |
| Go         | .go              | tree-sitter-go         |
| Java       | .java            | tree-sitter-java       |
| C          | .c, .h           | tree-sitter-c          |
| C++        | .cpp, .cc, ...   | tree-sitter-cpp        |
| Rust       | .rs              | tree-sitter-rust       |

Grammar packages are optional dependencies. The core library works without them, but a file can only be parsed when the grammar for its language is installed. A missing grammar produces a clear error message rather than a crash.
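
If you want to check which grammars are present before building a graph, a sketch along the following lines may help. It does not use treeloom at all. It assumes each grammar package installs an importable module named after the package with hyphens replaced by underscores (for example `tree_sitter_python`); `GRAMMAR_MODULES` and `available_grammars` are hypothetical names.

```python
from importlib.util import find_spec

# Assumed import names for the grammar packages listed above
GRAMMAR_MODULES = {
    "python": "tree_sitter_python",
    "javascript": "tree_sitter_javascript",
    "typescript": "tree_sitter_typescript",
    "go": "tree_sitter_go",
    "java": "tree_sitter_java",
    "c": "tree_sitter_c",
    "cpp": "tree_sitter_cpp",
    "rust": "tree_sitter_rust",
}


def available_grammars(modules: dict[str, str] = GRAMMAR_MODULES) -> dict[str, bool]:
    """Report, per language, whether the assumed grammar module can be imported."""
    return {lang: find_spec(module) is not None for lang, module in modules.items()}


for lang, installed in available_grammars().items():
    print(f"{lang:<12} {'installed' if installed else 'missing'}")
```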

Architecture

treeloom builds a Code Property Graph -- a single directed graph that unifies four views of source code.

AST layer. Module, class, function, parameter, variable, call, and literal nodes connected by containment edges (CONTAINS, HAS_PARAMETER). This gives you the structural hierarchy of the code.

Control flow layer. Statement-level flow between nodes within functions. FLOWS_TO edges represent sequential execution; BRANCHES_TO edges represent conditional or loop branching.

Data flow layer. Tracks where variables are defined and used, and how data propagates through assignments, function calls, and return values. Edges: DATA_FLOWS_TO, DEFINED_BY, USED_BY.

Call graph layer. Links call sites to their resolved function definitions. CALLS edges connect a call node to the function it invokes. Resolution is best-effort (no full type inference).
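
One way to picture the four layers sharing a single graph is to store every edge with its kind and derive each layer as a filtered view. The sketch below does that with the standard library only. It is not treeloom's internal representation; `ToyGraph`, `LAYER_OF`, and the node names are hypothetical, and only the edge-kind names come from the descriptions above.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical mapping from the edge kinds described above to their layer
LAYER_OF = {
    "CONTAINS": "ast",
    "HAS_PARAMETER": "ast",
    "FLOWS_TO": "control_flow",
    "BRANCHES_TO": "control_flow",
    "DATA_FLOWS_TO": "data_flow",
    "DEFINED_BY": "data_flow",
    "USED_BY": "data_flow",
    "CALLS": "call_graph",
}


@dataclass
class ToyGraph:
    """A single edge list; layers are views selected by edge kind."""

    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_edge(self, source: str, target: str, kind: str) -> None:
        if kind not in LAYER_OF:
            raise ValueError(f"unknown edge kind: {kind}")
        self.edges.append((source, target, kind))

    def layer(self, name: str) -> list[tuple[str, str, str]]:
        return [edge for edge in self.edges if LAYER_OF[edge[2]] == name]

    def layer_sizes(self) -> Counter:
        return Counter(LAYER_OF[kind] for _, _, kind in self.edges)


graph = ToyGraph()
graph.add_edge("module:app", "func:greet", "CONTAINS")
graph.add_edge("func:greet", "param:name", "HAS_PARAMETER")
graph.add_edge("param:name", "var:msg", "DATA_FLOWS_TO")
graph.add_edge("stmt:assign", "stmt:if", "FLOWS_TO")
graph.add_edge("stmt:if", "call:print", "BRANCHES_TO")
graph.add_edge("call:greet", "func:greet", "CALLS")

sizes = graph.layer_sizes()
print(dict(sizes))
```

Keeping all edges in one structure means a query can cross layers (for example, from a call site to the callee's parameters and onward through data flow) without joining separate graphs.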

API Overview

| Class / Function    | Purpose                                                                        |
|---------------------|--------------------------------------------------------------------------------|
| CPGBuilder          | Fluent builder -- add files/directories, call build()                          |
| CodePropertyGraph   | Central graph object -- node/edge access, annotations, traversal, serialization |
| GraphQuery          | Path queries, reachability, subgraph extraction, pattern matching              |
| TaintPolicy         | Consumer-defined source/sink/sanitizer callbacks                               |
| TaintResult         | Taint analysis output -- paths, labels, filtering                              |
| ChainPattern        | Declarative pattern for matching node chains                                   |
| Overlay             | Per-node/edge visual styling for HTML export                                   |
| to_json / from_json | JSON serialization with full round-trip support                                |
| to_dot              | Graphviz DOT export                                                            |
| generate_html       | Interactive HTML visualization with Cytoscape.js                               |

For full API details, see CLAUDE.md.

Taint Analysis

treeloom's taint engine propagates labels through data flow edges. It is generic -- the labels can represent anything (security-sensitive data, PII, environment variables). What they mean is up to you.

from treeloom import (
    CPGBuilder, CodePropertyGraph, TaintPolicy, TaintLabel, NodeKind,
)
from pathlib import Path

cpg = CPGBuilder().add_directory(Path("myapp/")).build()

# Define what constitutes a source, sink, and sanitizer
policy = TaintPolicy(
    sources=lambda node: (
        TaintLabel("user_input", node.id)
        if node.kind == NodeKind.PARAMETER and node.name == "user_data"
        else None
    ),
    sinks=lambda node: (
        node.kind == NodeKind.CALL and node.name in ("exec", "eval", "os.system")
    ),
    sanitizers=lambda node: (
        node.kind == NodeKind.CALL and node.name == "sanitize"
    ),
)

result = cpg.taint(policy)

for path in result.unsanitized_paths():
    print(f"Unsanitized: {path.source.name} -> {path.sink.name}")
    print(f"  Labels: {[l.name for l in path.labels]}")
    for node in path.intermediates:
        print(f"    {node.kind.value}: {node.name} at {node.location}")
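
For intuition about label propagation with sanitizers, here is a simplified worklist sketch on a toy data flow graph. It uses only the standard library and is not treeloom's engine: it tracks one sanitized flag per label rather than the per-origin handling mentioned in the changelog, and `propagate`, `unsanitized_hits`, and the node names are hypothetical.

```python
from collections import deque


def propagate(
    edges: dict[str, list[str]],
    sources: dict[str, str],
    sanitizers: set[str],
) -> dict[str, set[tuple[str, bool]]]:
    """Return {node: {(label, sanitized)}} for every node reached via data flow edges."""
    state: dict[str, set[tuple[str, bool]]] = {}
    worklist: deque[str] = deque()
    for node, label in sources.items():
        state.setdefault(node, set()).add((label, False))
        worklist.append(node)

    while worklist:
        node = worklist.popleft()
        for succ in edges.get(node, []):
            # Passing through a sanitizer marks the label as sanitized from there on
            outgoing = {
                (label, sanitized or succ in sanitizers)
                for label, sanitized in state[node]
            }
            known = state.setdefault(succ, set())
            if not outgoing <= known:
                known |= outgoing
                worklist.append(succ)  # revisit successors only when something changed
    return state


def unsanitized_hits(state: dict[str, set[tuple[str, bool]]], sinks: set[str]) -> list[tuple[str, str]]:
    """List (sink, label) pairs where a label arrives without having been sanitized."""
    return sorted(
        (sink, label)
        for sink in sinks
        for label, sanitized in state.get(sink, ())
        if not sanitized
    )


toy_edges = {
    "param:user_data": ["call:sanitize", "var:raw"],
    "call:sanitize": ["var:clean"],
    "var:clean": ["call:exec"],
    "var:raw": ["call:eval"],
}
state = propagate(toy_edges, {"param:user_data": "user_input"}, {"call:sanitize"})
hits = unsanitized_hits(state, {"call:exec", "call:eval"})
print(hits)
```

Here the path through `call:sanitize` reaches `call:exec` with the label marked sanitized, so only the `call:eval` path is reported. The loop terminates because each node's state can only grow and the set of (label, flag) pairs is finite.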

Export and Visualization

JSON

Full round-trip serialization, including annotations:

from treeloom import to_json, from_json

json_str = to_json(cpg)
restored = from_json(json_str)  # equivalent graph

Graphviz DOT

from treeloom import to_dot, EdgeKind

# Full graph
dot = to_dot(cpg)

# Only data flow edges
dot = to_dot(cpg, edge_kinds=frozenset({EdgeKind.DATA_FLOWS_TO}))

with open("graph.dot", "w") as f:
    f.write(dot)

Interactive HTML

Self-contained HTML with Cytoscape.js. Includes layer toggles, search, click-to-inspect, and overlay support.

from treeloom import generate_html, Overlay, OverlayStyle

html = generate_html(cpg, title="My Project CPG")

with open("cpg.html", "w") as f:
    f.write(html)

Development

Set up a local development environment:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[all]"

Run tests:

pytest
pytest --cov=treeloom --cov-report=html

Lint and type-check:

ruff check src/ tests/
mypy src/treeloom/

Changelog

Version 0.2.3

  • Fixed data flow through chained method calls (.format().fetchone() pattern)
  • New treeloom edges command for querying edges by kind, source/target name
  • treeloom diff --match-by-basename and --strip-prefix for cross-directory comparison
  • treeloom query --scope, --count, --annotation, --annotation-value filters
  • Fixed --json-errors flag (errors now propagate to main handler for JSON formatting)
  • Build --progress skips unsupported file types, --language filter restricts parsing
  • DOT --edge-kind filter prunes disconnected nodes
  • Import nodes hidden by default in HTML visualization (togglable "Imports" layer)
  • treeloom viz --exclude-kind for consumer-controlled node filtering
  • Large graph warning (>500 nodes) suggesting subgraph extraction
  • 821 tests

Version 0.2.2

  • Fixed data flow tracking through string formatting (.format(), % operator, f-strings)
  • Fixed parameter references not generating data flow edges (root cause of taint false negatives)
  • Implemented CFG edge generation (flows_to, branches_to) connecting statements within functions
  • Implemented inter-procedural data flow: call-site arguments flow to callee parameters, return values flow back
  • Taint analysis on vulpy (deliberately vulnerable Flask app) went from 0 to 12 findings including cross-file HTTP-input-to-SQL-injection traces
  • 776 tests

Version 0.2.1

  • New CLI commands: annotate, diff, pattern, subgraph, watch, serve, completions
  • --json-errors global flag for machine-readable error output
  • --progress flag for build command
  • Multiple --policy files for taint policy composition
  • TaintResult.apply_to(cpg) stamps taint annotations onto the graph
  • --apply flag for taint command writes annotated CPG directly
  • Fixed variable scoping in all visitors (ScopeStack replaces flat dict)
  • Fixed import alias capture in Python, JavaScript, TypeScript visitors
  • Fixed taint sanitizer tracking on convergent paths (per-origin intersection)
  • Shell completions for bash, zsh, fish
  • HTTP JSON API server (treeloom serve) with query, node, edges, subgraph endpoints
  • 750 tests

Version 0.2.0

  • CLI with 7 subcommands: build, info, query, taint, viz, dot, config
  • YAML-based taint policies for CLI-driven analysis (sources, sinks, sanitizers, propagators)
  • Project and user configuration via .treeloom.yaml and ~/.config/treeloom/config.yaml
  • Works with pip install treeloom, uvx treeloom, and uv tool install treeloom
  • 585 tests

Version 0.1.0

  • Initial release
  • Code Property Graph with four layers: AST, control flow, data flow, call graph
  • Language visitors: Python, JavaScript, TypeScript/TSX, Go, Java, C, C++, Rust
  • Worklist-based taint analysis engine with inter-procedural propagation
  • Pattern matching query API with wildcard support
  • Export to JSON (round-trip), Graphviz DOT, and interactive HTML (Cytoscape.js)
  • Consumer annotation and overlay system for domain-specific visualization
  • 539 tests

License

Apache-2.0
