
Export codebase structure and contents for AI/LLM context


TreeMapper


Ultimate Goal

CRITICAL: This is the guiding star of the entire project. Every feature, every design decision, every line of code must serve this goal. It is an asymptotic ideal — not a finish line to cross, but a direction to relentlessly pursue.

Maximize the speed and depth of understanding textual information — for any reader, in any scenario.

Whether the consumer is an LLM processing a context window or a human reviewing a code change, TreeMapper's job is the same: extract the maximum signal from a codebase and present it in the clearest, most information-dense form possible. Every design decision optimizes for comprehension-per-token — the ratio of understanding gained to attention spent. This metric is the single lens through which all trade-offs are evaluated.


Export your codebase for AI/LLM context in one command.

pip install treemapper                    # core (no native extensions)
pip install 'treemapper[tree-sitter]'     # + AST parsing for 10 languages
treemapper . -o context.yaml              # paste into ChatGPT/Claude

Why TreeMapper?

Unlike tree or find, TreeMapper exports structure + file contents in a format optimized for fast comprehension:

name: myproject
type: directory
children:
  - name: main.py
    type: file
    content: |
      def hello():
          print("Hello, World!")
  - name: utils/
    type: directory
    children:
      - name: helpers.py
        type: file
        content: |
          def add(a, b):
              return a + b

Usage

treemapper                            # current dir, YAML to stdout
treemapper .                          # YAML to stdout + token count
treemapper . -o tree.yaml             # save to file
treemapper . -o                       # save to tree.yaml (default)
treemapper . -o -                     # explicit stdout output
treemapper . -f json                  # JSON format
treemapper . -f txt                   # plain text with indentation
treemapper . -f md                    # Markdown with fenced code
treemapper . -f yml                   # YAML (alias)
treemapper . --no-content             # structure only
treemapper . --max-depth 3            # limit depth (0=root, 1=children)
treemapper . --max-file-bytes 10000   # skip files > 10KB (default: 10 MB)
treemapper . --max-file-bytes 0       # no limit
treemapper . -i custom.ignore         # custom ignore patterns
treemapper . --no-default-ignores     # disable .gitignore + defaults
treemapper . --log-level info         # log level (default: error)
treemapper . -c                       # copy to clipboard
treemapper . -c -o tree.yaml          # clipboard + save to file
treemapper -v                         # show version

Diff Context Mode

Smart context selection for git diffs — automatically finds the minimal set of code fragments needed to understand a change:

treemapper . --diff HEAD~1..HEAD      # recent changes
treemapper . --diff main..feature     # feature branch
treemapper . --diff HEAD~1 --budget 30000  # limit tokens
treemapper . --diff HEAD~1 --full     # all changed code

Uses graph-based relevance propagation (Personalized PageRank) to select the most important context. Output size is controlled by algorithm convergence (τ-stopping) by default, or by an explicit --budget token limit. Understands imports, type references, config dependencies, and co-change patterns across 15+ programming languages.

Output format:

name: myproject
type: diff_context
fragment_count: 5
fragments:
  - path: src/main.py
    lines: "10-25"
    kind: function
    symbol: process_data
    content: |
      def process_data(items):
          ...

Options:

Flag      Default  Description
--budget  none     Token limit (convergence-based by default)
--alpha   0.60     PPR damping factor
--tau     0.08     Stopping threshold
--full    false    Include all changed code

Token Counting

Token count and size are always displayed on stderr:

12,847 tokens (o200k_base), 52.3 KB

For large outputs (>1 MB), an approximate count is shown with a ~ prefix:

~125,000 tokens (o200k_base), 5.2 MB

Uses tiktoken with o200k_base encoding (GPT-4o tokenizer).
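
For outputs under the approximation threshold, the count is what tiktoken reports for the serialized text. A minimal sketch of the equivalent computation, reading a previously saved tree.yaml:

import tiktoken

# Reproduce the reported count: o200k_base is the GPT-4o tokenizer.
text = open("tree.yaml", encoding="utf-8").read()
encoding = tiktoken.get_encoding("o200k_base")
print(f"{len(encoding.encode(text)):,} tokens (o200k_base)")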

Clipboard Support

Copy output directly to clipboard with -c or --copy:

treemapper . -c                       # copy (no stdout)
treemapper . -c -o tree.yaml          # copy + save to file

System Requirements:

  • macOS: pbcopy (pre-installed)
  • Windows: clip (pre-installed)
  • Linux (Wayland): wl-copy
  • Linux (X11): xclip or xsel

Python API

from treemapper import map_directory
from treemapper import to_yaml, to_json, to_text, to_markdown

tree = map_directory(
    path,                    # directory path
    max_depth=None,          # limit traversal depth
    no_content=False,        # exclude file contents
    max_file_bytes=None,     # skip large files
    ignore_file=None,        # custom ignore file
    no_default_ignores=False, # disable default ignores
)

yaml_str = to_yaml(tree)
json_str = to_json(tree)
text_str = to_text(tree)
md_str = to_markdown(tree)
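
For example, a structure-only export written to disk, using only the calls documented above:

from pathlib import Path

from treemapper import map_directory, to_yaml

# Map two levels of the current directory, structure only, and save as YAML.
tree = map_directory(".", max_depth=2, no_content=True)
Path("tree.yaml").write_text(to_yaml(tree), encoding="utf-8")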

Ignore Patterns

Respects .gitignore and .treemapperignore automatically. Use --no-default-ignores to disable all ignore processing (.gitignore, .treemapperignore, and built-in defaults).

  • Hierarchical: nested ignore files at each directory level
  • Negation patterns: !important.log un-ignores a file
  • Anchored patterns: /root_only.txt matches only in root
  • Output file is always auto-ignored
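
Matching follows gitignore semantics via the pathspec library (see Technology Choices). A quick standalone illustration of negation and anchoring:

import pathspec

# Gitignore-style rules: negation un-ignores, a leading slash anchors to root.
spec = pathspec.PathSpec.from_lines("gitwildmatch", [
    "*.log",           # ignore all logs...
    "!important.log",  # ...except this one (negation)
    "/root_only.txt",  # anchored: matches only at the root
])
print(spec.match_file("debug.log"))          # True  (ignored)
print(spec.match_file("important.log"))      # False (un-ignored)
print(spec.match_file("sub/root_only.txt"))  # False (not at root)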

Content Placeholders

  • <file too large: N bytes> — exceeds --max-file-bytes
  • <binary file: N bytes> — binary file detected
  • <unreadable content: not utf-8> — not valid UTF-8
  • <unreadable content> — permission denied or I/O error
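
The decision order is size check, binary sniff, UTF-8 decode. A hedged sketch of that logic (the null-byte heuristic is illustrative, not necessarily TreeMapper's exact rule):

from pathlib import Path

def read_content(path: Path, max_bytes: int = 10 * 1024 * 1024) -> str:
    try:
        data = path.read_bytes()
    except OSError:
        return "<unreadable content>"                  # permission denied / I/O error
    if max_bytes and len(data) > max_bytes:
        return f"<file too large: {len(data)} bytes>"  # exceeds --max-file-bytes
    if b"\x00" in data[:8192]:                         # crude binary sniff
        return f"<binary file: {len(data)} bytes>"
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return "<unreadable content: not utf-8>"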

Development

pip install -e ".[dev,tree-sitter]"
pytest
pre-commit run --all-files

Testing

Integration tests only — everything runs against a real filesystem and real git repos. No mocking.

The diff context tests use a YAML-based declarative framework: each test case defines initial files, changed files, and expected output assertions. A dedicated test runner creates a real git repo per test, commits the files, runs the full diffctx pipeline, and verifies results.

Negative testing via garbage injection: every test case automatically includes ~10 unrelated "garbage" files with distinctive markers. Tests verify the algorithm excludes this noise, catching regressions in relevance filtering. Each garbage file uses unique prefixed identifiers (e.g. GARBAGE_*) so leaks are unambiguously detectable.
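
In spirit, each case reduces to a harness like the following sketch — names and structure are illustrative, not the actual runner or YAML schema:

import subprocess
import tempfile
from pathlib import Path

GIT = ["git", "-c", "user.email=test@example.com", "-c", "user.name=test"]

def run_case(initial: dict, changed: dict) -> str:
    """Create a real repo, commit initial then changed files, run the CLI."""
    repo = Path(tempfile.mkdtemp())
    subprocess.run(GIT + ["init", "-q"], cwd=repo, check=True)
    for stage, files in (("base", initial), ("change", changed)):
        for rel, text in files.items():
            target = repo / rel
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(text, encoding="utf-8")
        subprocess.run(GIT + ["add", "-A"], cwd=repo, check=True)
        subprocess.run(GIT + ["commit", "-qm", stage], cwd=repo, check=True)
    result = subprocess.run(["treemapper", ".", "--diff", "HEAD~1..HEAD"],
                            cwd=repo, capture_output=True, text=True, check=True)
    return result.stdout  # assert fragments present, GARBAGE_* markers absent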


Two Modes of Operation

TreeMapper operates in two fundamentally different modes that share output formatting, token counting, and file reading infrastructure:

Tree Mapping Mode (treemapper .) — Filesystem-focused. Walks the directory tree respecting hierarchical ignore patterns, reads file contents with binary/encoding detection, and serializes to YAML/JSON/text/Markdown. Deterministic, side-effect-free.

Diff Context Mode (treemapper . --diff) — Semantics-focused. Analyzes a git diff to intelligently select the minimal set of code fragments needed to understand a change. This is the core intellectual property of the project — a graph-based relevance engine described in detail below. For the formal theoretical foundation, see the research paper: Context-Selection for Git Diff.


Diff Context: Architecture & Design

The Problem

When reviewing a code change, the diff alone is rarely sufficient. A developer needs surrounding context: the function being called, the interface being implemented, the config driving deployment. But naively including "everything related" explodes the context window. The challenge is selecting the minimal, sufficient context within a token budget.

The Approach: Graph-Based Relevance Propagation

The diffctx engine models a codebase as a weighted directed graph where nodes are semantic code fragments and edges represent dependencies between them. Changed code seeds the graph, relevance propagates through edges via Personalized PageRank, and a budget-aware greedy algorithm selects the best fragments.

This approach was chosen over simpler alternatives (call-graph depth, grep-based expansion, file-level inclusion) because:

  • Transitive importance decays naturally — a function calling a modified function is relevant; a function calling that function is less so. PPR captures this without manual depth limits.
  • Heterogeneous relationships combine gracefully — imports, type references, config links, test patterns, and lexical similarity all contribute edges with different weights. No single signal captures all dependencies.
  • Budget optimization is principled — submodular utility maximization with lazy greedy selection gives near-optimal coverage per token spent.

Pipeline Stages

The engine operates as a 7-stage pipeline:

  1. Diff Parsing — Extract changed file paths and exact line ranges from git diff output.

  2. Core Fragment Identification — Break changed files into semantic units (functions, classes, config blocks, doc sections) using language-aware parsers, then identify which fragments cover the actual changed lines.

  3. Concept Extraction — Extract identifiers from added/removed diff lines. These "diff concepts" represent the vocabulary of the change and drive relevance scoring.

  4. Universe Expansion — Discover related files beyond those directly changed. Edge builders scan for imports, config references, naming patterns. Rare identifiers (appearing in ≤3 files) trigger targeted file discovery.

  5. Graph Construction — Build fragment-level dependency graph. 26 edge builders contribute weighted edges across 6 categories (see below). Edges are aggregated via max — if any builder thinks two fragments are related, the strongest signal wins. Hub suppression downweights over-connected nodes (e.g. common utilities) to prevent them from dominating the graph.

  6. Relevance Scoring (PPR) — Run Personalized PageRank seeded from core (changed) fragments. The damping factor α=0.60 controls propagation depth: 60% chance of following an edge, 40% chance of teleporting back to changed code. Convergence produces a relevance score per fragment (see the sketch after this list).

  7. Budget-Aware Selection — A lazy greedy algorithm selects fragments maximizing density (marginal utility per token). Core fragments are selected first, then expansion candidates ordered by a max-heap. A τ-based stopping threshold (relative to baseline density median) prevents noise accumulation. When no explicit --budget is set, τ-stopping alone controls output size — the algorithm converges naturally without a hard token cap.
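
To make stage 6 concrete, here is a minimal power-iteration PPR over a dict-based weighted graph — an illustrative sketch, not TreeMapper's implementation (dangling-node mass is simply dropped for brevity):

def personalized_pagerank(graph, seeds, alpha=0.60, tol=1e-8, max_iter=100):
    """graph: {node: {neighbor: weight}}; seeds: set of changed fragments."""
    # Teleport distribution concentrates on the seed (changed) fragments.
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(teleport)
    for _ in range(max_iter):
        nxt = {n: (1.0 - alpha) * teleport[n] for n in graph}
        for u, edges in graph.items():
            total = sum(edges.values())
            for v, w in edges.items():
                nxt[v] += alpha * rank[u] * (w / total)  # weight-proportional flow
        delta = sum(abs(nxt[n] - rank[n]) for n in graph)
        rank = nxt
        if delta < tol:
            break
    return rank

# Toy graph: relevance flows out from the changed fragment and decays hop by hop.
scores = personalized_pagerank(
    {"changed_fn": {"callee": 1.0}, "callee": {"helper": 0.5}, "helper": {}},
    seeds={"changed_fn"},
)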

Edge Taxonomy: Six Perspectives on Code Relationships

The system intentionally models relationships from multiple independent perspectives. Each catches blind spots the others miss.

Semantic Edges — Language-aware code dependencies. Import/export resolution, function calls, type references, symbol usage. 11 language-specific builders (Python, JavaScript/TypeScript, Go, Rust, Java/Kotlin/Scala, C/C++, C#/.NET, Ruby, PHP, Swift, Shell). Weights reflect type-system reliability: Rust symbol refs (0.95) are trusted more than Python calls (0.55) because static analysis is more reliable in strict type systems. All semantic edges are asymmetric — "A imports B" is a stronger signal than "B is imported by A" — modeled via reverse weight factors (0.4–0.7).

Configuration Edges — Infrastructure-to-code dependencies that don't appear in source. Docker COPY/FROM to source files, Kubernetes manifests to application code, Terraform modules to infrastructure scripts, CI/CD workflows to tested code, Helm templates to services, build system configs to compiled sources, generic config keys to code referencing them. 7 specialized builders covering the DevOps ecosystem.

Structural Edges — Filesystem and organizational proximity. Containment (parent-child directory nesting), test-code associations (naming heuristics like test_foo.py to foo.py), sibling files in the same directory. These are weak signals (0.05–0.60) that prevent blind spots in code without explicit imports.

Document Edges — Non-code content relationships. Section-to-section flow within Markdown, anchor link references, cross-document citations. Enable following documentation dependencies when docs change alongside code.

Similarity Edges — Content-based relationships via TF-IDF lexical matching. Finds code with similar vocabulary/structure even without explicit references. Weight bounds are language-specific: wider for dynamic languages (Python 0.20–0.35), narrower for typed (Rust 0.10–0.15) where semantic edges are more reliable.

History Edges — Temporal co-change patterns from git log. Files repeatedly committed together have implicit coupling. Capped at 500 recent commits with noise filtering (ignoring large commits with >30 files).
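
A hedged sketch of mining such pairs from git log --name-only, with the caps described above (500 commits, skip commits touching more than 30 files):

import itertools
import subprocess
from collections import Counter

def cochange_pairs(repo=".", max_commits=500, max_files=30):
    """Count how often file pairs appear in the same commit (illustrative)."""
    log = subprocess.run(
        ["git", "-C", repo, "log", f"-n{max_commits}",
         "--name-only", "--pretty=format:@@commit@@"],
        capture_output=True, text=True, encoding="utf-8", check=True,
    ).stdout
    pairs = Counter()
    for block in log.split("@@commit@@"):
        files = [line for line in block.splitlines() if line.strip()]
        if 1 < len(files) <= max_files:  # noise filter: skip mega-commits
            pairs.update(itertools.combinations(sorted(files), 2))
    return pairs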

Selection: Submodular Utility Maximization

The greedy selector optimizes a submodular utility function under a token budget constraint:

Concept coverage — Each diff concept (identifier from the change) has a "best coverage score" across selected fragments. Adding a fragment that covers new concepts yields high marginal gain; covering already-covered concepts yields diminishing returns (modeled via square-root scaling).

Relatedness bonus — High-PPR fragments receive minimum guaranteed utility even without concept overlap, ensuring structurally related code is included.

Density ordering — Candidates are ranked by utility-per-token (density), not raw utility. A 10-token fragment covering 2 concepts beats a 500-token fragment covering 3. Lazy heap evaluation avoids recomputing stale density values until a candidate is popped.

τ-stopping — After establishing a baseline from the first 5 selected fragments, stop when density drops below τ × median(baseline). This relative threshold adapts to the codebase: dense code triggers earlier stopping, sparse code allows broader inclusion.
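
Putting density ordering, lazy evaluation, and τ-stopping together — an illustrative sketch in which the fragment fields and the gain() function are assumptions, not TreeMapper's internal API:

import heapq
import statistics

def select(fragments, gain, budget=None, tau=0.08, baseline_n=5):
    """Lazy greedy selection by density (marginal utility per token)."""
    # Max-heap via negated density; index i breaks ties without comparing fragments.
    heap = [(-gain(f, []) / f.tokens, i, f) for i, f in enumerate(fragments)]
    heapq.heapify(heap)
    selected, spent, baseline = [], 0, []
    while heap:
        _, i, f = heapq.heappop(heap)
        density = gain(f, selected) / f.tokens       # lazy re-evaluation
        if heap and density < -heap[0][0]:           # stale entry: push back, retry
            heapq.heappush(heap, (-density, i, f))
            continue
        if budget is not None and spent + f.tokens > budget:
            continue
        if len(baseline) >= baseline_n and density < tau * statistics.median(baseline):
            break                                    # τ-stopping: density fell off
        selected.append(f)
        spent += f.tokens
        if len(baseline) < baseline_n:
            baseline.append(density)                 # baseline from first picks
    return selected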

Fragment Granularity

Files are decomposed into semantic fragments using a priority-ordered parser pipeline. Language-specific parsers (tree-sitter for 10 languages, Python AST, Mistune for Markdown) produce function/class/section-level fragments. Fallback parsers handle config files (key-value boundaries), text (sentence-aware splitting), and generic content (line-count limits). The granularity choice means PPR reasons at the right level — a changed line in a function selects that function as a unit, not the whole file.
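
For Python, the standard library's ast module alone is enough to sketch the idea (the real pipeline uses tree-sitter and the other parsers described above):

import ast

def fragments(source: str):
    """Yield (name, start_line, end_line) for each function/class fragment."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            yield node.name, node.lineno, node.end_lineno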

Key Design Decisions

Why Personalized PageRank over call-graph BFS? BFS requires arbitrary depth limits and treats all edges equally. PPR provides natural exponential decay, respects edge weights, and converges to a principled relevance distribution.

Why max-aggregation for edge combination? Multiple edge types often agree on the same relationship. Taking the max avoids inflating weights through redundant signals while preserving the strongest evidence from any perspective.
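
In code, max-aggregation is a fold over whatever each builder emits — a sketch assuming (src, dst, weight) tuples:

from collections import defaultdict

def aggregate(builder_outputs):
    """Keep the strongest edge weight per (src, dst) pair across all builders."""
    combined = defaultdict(float)
    for edges in builder_outputs:        # each builder yields (src, dst, weight)
        for src, dst, weight in edges:
            combined[(src, dst)] = max(combined[(src, dst)], weight)
    return combined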

Why submodular greedy over knapsack? Submodular functions guarantee that greedy gives (1 - 1/e) ≈ 63% of optimal. With lazy evaluation and density ordering, the algorithm runs in near-linear time while achieving strong coverage.

Why asymmetric edge weights? Code dependencies are directional. "A imports B" means A needs B for context; B doesn't necessarily need A. Reverse factors (0.4–0.7 of forward weight) enable bidirectional graph search while respecting this asymmetry.

Why hub suppression? Common utility modules (logging, helpers, config) receive edges from everywhere. Without dampening, they dominate PPR scores and pull in unrelated code. Log-scaled in-degree suppression at the 95th percentile keeps them accessible without letting them dominate.
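
A hedged sketch of that dampening — the log-scaled formula here is illustrative, not the exact one used:

import math
from collections import Counter

def suppress_hubs(edges):
    """Downweight edges into nodes above the 95th-percentile in-degree."""
    if not edges:
        return {}
    in_degree = Counter(dst for _, dst in edges)      # edges: {(src, dst): weight}
    degrees = sorted(in_degree.values())
    cutoff = degrees[int(0.95 * (len(degrees) - 1))]  # 95th percentile
    damped = {}
    for (src, dst), weight in edges.items():
        excess = in_degree[dst] - cutoff
        if excess > 0:
            weight /= 1.0 + math.log1p(excess)        # log-scaled suppression
        damped[(src, dst)] = weight
    return damped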

Tunable Parameters

Parameter  Default  Controls
--budget   none     Token limit (convergence-based by default)
--alpha    0.60     PPR damping (higher = broader propagation)
--tau      0.08     Stopping threshold (higher = stricter, less noise)
--full     false    Bypass smart selection

Technology Choices

Decision   Choice                Rationale
Output     YAML                  LLM-readable, literal blocks
Tokens     tiktoken o200k        GPT-4o standard, exact BPE
Ignores    pathspec              gitignore-compatible
Parsing    tree-sitter           10 languages, AST-level
Ranking    PPR                   Relevance with natural decay
Selection  Lazy greedy           Near-optimal, linear time
Git        subprocess (UTF-8)    Platform-safe, handles non-ASCII
Diff       git diff --unified=0  Exact line ranges

License

Apache 2.0
