Sparse geometry and proximity tools for tree ensemble models

ForestGeom 🌳

license: GPL-3.0-or-later · python: 3.10+ · pkg: uv · PyPI: forestgeom · tests · paper: arXiv 2601.02735

     x_i ● ─────────────┐     ┌──────────── ● x_j
                        ▼     ▼
               ┌─────────────────────────┐
               │     TREE ENSEMBLES      │
               └───────────┬─────────────┘
                           │
               ┌───────────┴───────────┐
               │                       │
               ▼                       ▼
      ┌─────────────────┐     ┌─────────────────┐
      │ same decision   │     │ divergent       │
      │ paths           │     │ decision paths  │
      │                 │     │                 │
      │        ●        │     │        ●        │
      │       / \       │     │       / \       │
      │      ●   ●      │     │      ●   ●      │
      │     /     \     │     │     /     \     │
      │    ●       ●    │     │    ●       ●    │
      │   / \     / \   │     │   / \     / \   │
      │  ●   ●   ●   ●  │     │  ●   ●   ●   ●  │
      │      ▲          │     │  ▲       ▲      │
      │     x_i         │     │ x_i     x_j     │
      │     x_j         │     │                 │
      └────────┬────────┘     └────────┬────────┘
               │                       │
               └───────────┬───────────┘
                           ▼
            unified forest-induced geometry

forestgeom implements the sparse leaf-incidence kernel framework developed in “Revisiting Forest Proximities via Sparse Leaf-Incidence Kernels” [1]. The package treats a fitted tree ensemble as a reusable geometric object: samples are encoded by the leaves they reach, and forest proximities are represented through sparse linear maps rather than dense pairwise matrices.
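
As a rough illustration of the leaf-incidence idea, the sketch below uses only scikit-learn and SciPy, not forestgeom itself. It encodes each sample by the leaves it reaches via apply(...), stacks those encodings into a sparse one-hot matrix Z, and recovers the standard forest kernel as Z Zᵀ / T. The variable names and the column layout (one column per tree node) are assumptions made for this example and may differ from the package's internals.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# leaf_ids[i, t] is the node index of the leaf that sample i reaches in tree t.
leaf_ids = forest.apply(X)
n_samples, n_trees = leaf_ids.shape

# Give every tree its own block of columns so leaf indices never collide.
node_counts = np.array([est.tree_.node_count for est in forest.estimators_])
offsets = np.concatenate(([0], np.cumsum(node_counts)[:-1]))
cols = (leaf_ids + offsets).ravel()
rows = np.repeat(np.arange(n_samples), n_trees)

# Sparse one-hot leaf-incidence matrix: exactly one nonzero per (sample, tree).
Z = sparse.csr_matrix(
    (np.ones(rows.size), (rows, cols)),
    shape=(n_samples, int(node_counts.sum())),
)

# Standard forest kernel: fraction of trees in which two samples share a leaf.
K = (Z @ Z.T) / n_trees
```

Z has only n_samples × n_trees nonzeros, so it can be kept and reused even when the dense n × n proximity matrix would be too large to store.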

Since their original formulation by Leo Breiman in the early 2000s, forest proximities have given already-powerful decision forest models a geometric perspective. This intuitive notion of semi-supervised relationships between points, based on decision-path similarity across trees, has long been treated as a fixed procedure whose direct computation is expensive in the number of samples. forestgeom removes this burden with efficient sparse linear algebra through sparse leaf-collision kernels; more importantly, it directly exposes sparse forest-leaf-induced maps for matrix-free, forest-guided downstream representation learning. See [1] for details.

Forest geometry has since been used in a wide range of applications that need a graded, context-aware notion of similarity beyond class-conditional Euclidean distances or black-box deep representation models. This makes it especially useful for modality-agnostic pipelines, tabular data, and sparse, noisy, high-dimensional settings such as single-cell analysis. Reference [1] provides a comprehensive literature overview, and this package aims to encourage further work in these directions by unifying a collection of forest models and geometric perspectives behind a single API.

The implementation includes several proximity constructions within this leaf-incidence view, including standard forest kernels, KeRF-style leaf-size-normalized kernels, boosted tree-weighted kernels, and GAP/OOB proximities from “Geometry- and Accuracy-Preserving Random Forest Proximities” [2].

The project is intended to evolve beyond leaf-incidence maps into a broader framework for forest-induced representation learning. Natural extensions include node/path-based geometry, additional base forest families, and GPU-accelerated pipelines.

Installation

The recommended installer is uv. To install into the active environment:

uv pip install forestgeom

Optional dependencies are grouped by feature:

# LightGBM and XGBoost adapters
uv pip install "forestgeom[boosted]"

# Visualization and embedding tools
uv pip install "forestgeom[viz]"

# Experiment dependencies
uv pip install "forestgeom[experiments]"

# Test dependencies
uv pip install "forestgeom[test]"

# Everything above
uv pip install "forestgeom[all]"
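
To check from Python whether the optional backends are importable in the current environment, a small helper along these lines can be used. It is not part of forestgeom; the helper name and the extras-to-modules mapping are assumptions for illustration, and only the boosted entry (lightgbm, xgboost) is taken from this README.

```python
from importlib.util import find_spec

# Hypothetical mapping from an extra to the modules it is expected to provide.
OPTIONAL_MODULES = {"boosted": ("lightgbm", "xgboost")}


def available_extras(optional_modules=OPTIONAL_MODULES):
    """Return {extra: True/False} depending on whether all its modules import."""
    return {
        extra: all(find_spec(name) is not None for name in modules)
        for extra, modules in optional_modules.items()
    }


print(available_extras())
```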

To try unreleased features from the GitHub repository, install directly from a branch, tag, or commit:

# latest main branch
uv pip install git+https://github.com/JakeSRhodesLab/ForestGeom.git

# specific branch or tag
uv pip install git+https://github.com/JakeSRhodesLab/ForestGeom.git@main

# GitHub install with extras
uv pip install "forestgeom[boosted] @ git+https://github.com/JakeSRhodesLab/ForestGeom.git@main"

If you are adding forestgeom to an existing uv-managed project, use uv add instead:

uv add forestgeom
uv add "forestgeom[boosted]"

For local development from a cloned checkout:

uv sync --extra test

pip also works:

pip install forestgeom
pip install "forestgeom[boosted]"
pip install "forestgeom[viz]"
pip install "forestgeom[experiments]"
pip install "forestgeom[test]"
pip install "forestgeom[all]"
pip install git+https://github.com/JakeSRhodesLab/ForestGeom.git

# editable install for local development from a cloned checkout
pip install -e ".[test]"

Architecture

ForestGeom is organized around one central object, ForestProximity. The class wraps a fitted tree ensemble and turns it into a reusable geometry object built from sparse leaf-incidence maps.

        RandomForest / ExtraTrees / GBT / LightGBM / XGBoost
                                    |
                                    v
   X_train, y_train --> +------------------------+
   fit(...)             |    ForestProximity     |
                        +------------------------+
                                    |
                                    v
                      fitted adapter + ForestCache
                                    |
                                    v
                  +------------------------------------+
                  | sparse forest representation       |
                  | leaf incidence + scheme weights    |
                  +------------------------------------+
                                    |
                      +-------------+-------------+
                      |                           |
                      v                           v
            +----------------------+    +------------------------+
            | separable schemes    |    | corrected schemes      |
            | P = Q W^T            |    | P = normalize(QW^T)    |
            +----------------------+    +------------------------+
                      |                           |
                      v                           |
            +----------------------+              |
            | Q: query_map(X=None) |              |
            | W: reference_map()   |              |
            +----------------------+              |
                      |                           |
                      +-------------+-------------+
                                    |
                                    v
                          +-------------------+
                          | transform(X_new)  |
                          | P(X_new, X_train) |
                          +-------------------+
                                    |
                                    v
                      forest-induced proximity geometry

The adapter layer hides backend-specific details such as leaf indexing, bootstrap masks, in-bag counts, and boosted tree weights. The map-building layer then uses those quantities to construct the sparse geometry for the selected weighting scheme (uniform, kerf, oob, gap, or boosted).

The important distinction is:

  • Symmetric schemes such as uniform, kerf, and boosted use the same leaf-incidence geometry on both sides, so the resulting proximity is a kernel.
  • Asymmetric schemes such as gap expose distinct query and reference maps, which induce a bilinear form P(i, j) = <Q(i), W(j)> rather than a kernel.
  • The true Breiman OOB scheme is pairwise-normalized and is computed directly as a sparse proximity matrix; it does not factor into a single reusable Q/W pair.
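
To see why the OOB scheme does not reduce to a single Q/W pair, the toy sketch below computes a Breiman-style OOB proximity with NumPy on hand-written leaf assignments. The data, the variable names, and the choice to return 0 for pairs that are never out-of-bag together are assumptions for illustration, not the package's implementation. The numerator and the denominator each factor as a matrix product, but their elementwise ratio does not.

```python
import numpy as np

# Toy forest: 3 trees, 4 samples, 2 leaves per tree (all values made up).
leaves = np.array([[0, 0, 1, 1],
                   [0, 1, 1, 0],
                   [1, 1, 1, 0]])            # leaves[t, i]: leaf of sample i in tree t
oob = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1]], dtype=float)  # oob[t, i] = 1 if sample i is OOB in tree t

n_trees, n_samples = leaves.shape
n_leaves = 2

# OOB-masked leaf incidence: one column per (tree, leaf) pair.
cols = leaves + n_leaves * np.arange(n_trees)[:, None]
rows = np.tile(np.arange(n_samples), (n_trees, 1))
A = np.zeros((n_samples, n_trees * n_leaves))
A[rows, cols] = oob

numer = A @ A.T      # trees where both samples are OOB and share a leaf
denom = oob.T @ oob  # trees where both samples are OOB

# Pairwise normalization: each entry is divided by its own pair count.
P_oob = np.divide(numer, denom, out=np.zeros_like(numer), where=denom > 0)
print(P_oob)
```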

Usage

ForestProximity wraps a tree ensemble estimator, which it clones and fits during fit(...). It supports a unified set of forest backends and weighting schemes.

Supported base forest classes include:

  • sklearn.ensemble.RandomForestClassifier
  • sklearn.ensemble.RandomForestRegressor
  • sklearn.ensemble.ExtraTreesClassifier
  • sklearn.ensemble.ExtraTreesRegressor
  • sklearn.ensemble.GradientBoostingClassifier
  • sklearn.ensemble.GradientBoostingRegressor
  • lightgbm.LGBMClassifier and lightgbm.LGBMRegressor with forestgeom[boosted]
  • xgboost.XGBClassifier and xgboost.XGBRegressor with forestgeom[boosted]

Supported leaf-weighting schemes include:

  • uniform: symmetric leaf co-occurrence factorization of the standard forest kernel.
  • kerf: symmetric leaf-size-normalized factorization of the KeRF kernel.
  • oob: pairwise-normalized Breiman OOB proximity computed directly in sparse form.
  • gap: asymmetric query/reference factorization that combines OOB-side query weights with in-bag reference weights to recover the GAP proximity definition.
  • boosted: symmetric tree-weighted leaf kernel for supported boosted ensembles.
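
The asymmetric gap factorization can be sketched on toy data. The example below follows the GAP definition from [2] as understood here: the query weight of sample i in a tree is its OOB indicator divided by the number of trees in which i is OOB, and the reference weight of sample j is its in-bag count divided by the total in-bag count of its leaf. The arrays, the names Q, W, and leaf_mass, and the column layout are assumptions for illustration; the package's own maps may be organized differently.

```python
import numpy as np

# Toy forest: 3 trees, 4 samples, 2 leaves per tree (all values made up).
leaves = np.array([[0, 0, 0, 1],
                   [0, 1, 1, 1],
                   [0, 0, 1, 1]])   # leaves[t, i]: leaf of sample i in tree t
inbag = np.array([[1, 0, 2, 1],
                  [1, 1, 0, 2],
                  [0, 2, 2, 0]])    # inbag[t, i]: bootstrap count of sample i in tree t

n_trees, n_samples = leaves.shape
n_leaves = 2
cols = leaves + n_leaves * np.arange(n_trees)[:, None]  # global (tree, leaf) column
rows = np.tile(np.arange(n_samples), (n_trees, 1))

# Total in-bag count of every leaf (nonzero for all leaves in this toy forest).
leaf_mass = np.zeros(n_trees * n_leaves)
np.add.at(leaf_mass, cols.ravel(), inbag.ravel())

# Query side: OOB indicator, averaged over the trees where the sample is OOB.
# Every toy sample is OOB at least once, so the division is safe here.
oob = inbag == 0
Q = np.zeros((n_samples, n_trees * n_leaves))
Q[rows, cols] = oob / oob.sum(axis=0)

# Reference side: in-bag multiplicity normalized by the leaf's in-bag total.
W = np.zeros_like(Q)
W[rows, cols] = inbag / leaf_mass[cols]

P_gap = Q @ W.T  # bilinear form P(i, j) = <Q(i), W(j)>
print(P_gap)
```

In this toy case each row of P_gap sums to 1 while P_gap[0, 1] and P_gap[1, 0] differ, which is the sense in which the scheme yields a bilinear form rather than a symmetric kernel.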

Not every estimator supports every weighting scheme. Random forest and ExtraTrees estimators always support uniform and kerf; they support oob and gap only when fitted with bootstrap=True. Boosted estimators support uniform, kerf, and boosted.
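
These compatibility rules can be written as a small lookup. The helper below is a hypothetical illustration, not a forestgeom function; the family labels "bagged" and "boosted" are assumed names for the two groups of estimators described above.

```python
# Hypothetical family labels: "bagged" covers RandomForest and ExtraTrees
# estimators, "boosted" covers the gradient-boosting backends.
BASE_SCHEMES = {
    "bagged": {"uniform", "kerf"},
    "boosted": {"uniform", "kerf", "boosted"},
}
BOOTSTRAP_ONLY = {"oob", "gap"}  # require bootstrap=True on bagged forests


def supported_schemes(family: str, bootstrap: bool = False) -> set[str]:
    """Return the weighting schemes usable for a given estimator family."""
    schemes = set(BASE_SCHEMES[family])
    if family == "bagged" and bootstrap:
        schemes |= BOOTSTRAP_ONLY
    return schemes


print(sorted(supported_schemes("bagged", bootstrap=True)))
```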

Use fit(...) when you want to train and keep the fitted geometry, and use fit_transform(...) when you want the fitted train-train proximity matrix right away. Use query_map(...) and reference_map(...) when you need the actual leaf-incidence factors Q and W for matrix-free applications, and use transform(...) for the proximity block from new samples to the fitted training set.
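
A minimal sketch of what "matrix-free" means in practice: given sparse factors Q and W, a product with the proximity P = Q Wᵀ can be evaluated as Q (Wᵀ Y) without ever forming P. Random sparse matrices stand in here for the maps that query_map(...) and reference_map(...) would return; the shapes and densities are arbitrary assumptions.

```python
import numpy as np
from scipy import sparse

n_query, n_ref, n_leaves = 200, 1000, 400

# Stand-ins for the query and reference leaf maps (random, for illustration only).
Q = sparse.random(n_query, n_leaves, density=0.02, format="csr", random_state=0)
W = sparse.random(n_ref, n_leaves, density=0.02, format="csr", random_state=1)

# Some per-reference-sample signal to propagate, e.g. labels or features.
Y = np.random.default_rng(0).normal(size=(n_ref, 3))

# Matrix-free: two sparse-dense products, no (n_query x n_ref) matrix is built.
smoothed = Q @ (W.T @ Y)

# Reference computation that does materialize P, only to verify the result.
dense_check = (Q @ W.T).toarray() @ Y
print(np.allclose(smoothed, dense_check))
```

The cost of the matrix-free route scales with the number of nonzeros in Q and W rather than with the number of sample pairs.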

For schemes that are not symmetric kernels, fit_transform(...) and fit(...).transform(...) are not necessarily the same. If you need the training geometry, use fit_transform(...) directly or call training_proximity(...) on the fitted estimator.

For symmetric weighting schemes such as uniform, kerf, and boosted, the query map is typically the leaf-space feature matrix. For asymmetric schemes such as gap, keep both Q and W if you want to work directly with the geometry. For oob, use training_proximity(...) or transform(...) directly; there is no separate query/reference factorization.

The sparse geometry can be used directly in proximity-based workflows such as manifold learning, dimensionality reduction, visualization, imputation, and custom downstream estimators.
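
As one example of such a workflow, a symmetric kernel K = Q Qᵀ can be embedded in low dimension from a truncated SVD of Q alone, because the leading eigenvectors of K are the leading left singular vectors of Q. The sketch below uses a synthetic leaf-incidence matrix built from random leaf ids rather than a fitted forest, and it skips the kernel centering that a full kernel PCA would apply; both are simplifying assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
n_samples, n_trees, leaves_per_tree = 60, 10, 8

# Synthetic leaf assignments standing in for a fitted forest's leaf indices.
leaf_ids = rng.integers(0, leaves_per_tree, size=(n_samples, n_trees))
cols = (leaf_ids + leaves_per_tree * np.arange(n_trees)).ravel()
rows = np.repeat(np.arange(n_samples), n_trees)

# Scaled one-hot leaf map, so that Q @ Q.T is the uniform forest kernel.
Q = sparse.csr_matrix(
    (np.full(rows.size, 1.0 / np.sqrt(n_trees)), (rows, cols)),
    shape=(n_samples, n_trees * leaves_per_tree),
)

# Truncated SVD of the sparse map; svds does not guarantee descending order.
U, s, _ = svds(Q, k=2)
order = np.argsort(s)[::-1]
embedding = U[:, order] * s[order]  # 2-D coordinates for every sample
print(embedding.shape)
```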

Quick Start

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

from forestgeom import ForestProximity

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
  X,
  y,
  test_size=0.2,
  stratify=y,
  random_state=0,
)

forest = RandomForestClassifier(
  n_estimators=200,
  bootstrap=True,
  random_state=0,
  n_jobs=-1,
)

geometry = ForestProximity(forest=forest, weight_scheme="uniform").fit(X_train, y_train)

# Query/reference maps define the symmetric geometry.
Q_train = geometry.query_map()
W_train = geometry.reference_map()  # Same as Q_train for symmetric schemes such as 'uniform', 'kerf', and 'boosted'.
Q_test = geometry.query_map(X_test)  # Leaf-incidence representations of the test set.

# Matrix-free forest kernel SVM using the leaf maps directly as sparse features.
svm = LinearSVC()
svm.fit(Q_train, y_train)
pred = svm.predict(Q_test)
print(f"leaf-map SVM accuracy: {accuracy_score(y_test, pred):.3f}")

# Comparison with the base forest classifier.
pred = geometry.forest_.predict(X_test)
print(f"base-forest accuracy: {accuracy_score(y_test, pred):.3f}")

# To run the boosted example, install optional dependencies first:
# uv pip install "forestgeom[boosted]"
from xgboost import XGBClassifier

forest = XGBClassifier(n_estimators=200, random_state=0)
boosted_geometry = ForestProximity(forest=forest, weight_scheme="boosted")
K_train = boosted_geometry.fit_transform(X_train, y_train)
K_test = boosted_geometry.transform(X_test)

Demos and Experiments

The repository includes notebook demos for common workflows:

  • demos/demo_iris.ipynb: general-purpose introduction on the Iris dataset.
  • demos/demo_leaf_pca.ipynb: matrix-free supervised manifold learning with leaf PCA using the leaf-incidence maps in kernel proximities.
  • demos/demo_boosted.ipynb: boosted-tree examples using the optional boosted adapters.

The experiments/ directory contains Python scripts and notebooks used to reproduce experiments and compile results from “Revisiting Forest Proximities via Sparse Leaf-Incidence Kernels”.

Citation

If you use this software in your research or experiments, please cite the leaf-incidence kernel framework paper [1]:

[1] Revisiting Forest Proximities via Sparse Leaf-Incidence Kernels.

@misc{aumon2026revisitingforestproximitiessparse,
      title={Revisiting Forest Proximities via Sparse Leaf-Incidence Kernels}, 
      author={Adrien Aumon and Guy Wolf and Kevin R. Moon and Jake S. Rhodes},
      year={2026},
      eprint={2601.02735},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.02735}}

If you specifically use the gap weighting scheme, please also cite the GAP proximity paper [2]:

[2] Geometry- and Accuracy-Preserving Random Forest Proximities.

@ARTICLE{10089875,
  author={Rhodes, Jake S. and Cutler, Adele and Moon, Kevin R.},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={Geometry- and Accuracy-Preserving Random Forest Proximities}, 
  year={2023},
  volume={45},
  number={9},
  pages={10947-10959},
  keywords={Random forests;Forestry;Geometry;Data visualization;Decision trees;Task analysis;Anomaly detection;Proximities;random forests;supervised learning},
  doi={10.1109/TPAMI.2023.3263774}}
