Sparse geometry and proximity tools for tree ensemble models
ForestGeom 🌳
    x_i ---------------+   +--------------- x_j
                       v   v
            +-------------------------+
            |     TREE ENSEMBLES      |
            +------------+------------+
                         |
            +------------+------------+
            |                         |
            v                         v
  +-------------------+     +-------------------+
  |   same decision   |     |     divergent     |
  |       paths       |     |  decision paths   |
  |                   |     |                   |
  |         o         |     |         o         |
  |        / \        |     |        / \        |
  |       o   o       |     |       o   o       |
  |      / \          |     |      / \          |
  |     o   o         |     |     o   o         |
  |    / \ / \        |     |    / \ / \        |
  |   o  o o  o       |     |   o  o o  o       |
  |      ^            |     |   ^       ^       |
  |     x_i           |     |  x_i     x_j      |
  |     x_j           |     |                   |
  +---------+---------+     +---------+---------+
            |                         |
            +------------+------------+
                         v
          unified forest-induced geometry
forestgeom implements the sparse leaf-incidence kernel framework developed in
“Revisiting Forest Proximities via Sparse Leaf-Incidence Kernels” [1]. The
package treats a fitted tree ensemble as a reusable geometric object: samples
are encoded by the leaves they reach, and forest proximities are represented
through sparse linear maps rather than dense pairwise matrices.
Since their original formulation by Leo Breiman in the early 2000s, forest
proximities have given already-powerful decision forest models a geometric
perspective. This intuitive notion of semi-supervised relationships between
points, based on decision-path similarity across trees, has long been treated as
a fixed procedure whose direct computation is expensive in the number of
samples. forestgeom removes this burden with efficient sparse linear algebra
through sparse leaf-collision kernels; more importantly, it directly exposes
sparse forest-leaf-induced maps for matrix-free, forest-guided downstream
representation learning. See [1] for details.
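To make the leaf-incidence view concrete, the sketch below builds a sparse leaf-incidence matrix by hand with scikit-learn and SciPy, then recovers the classic proximity as the fraction of trees in which two samples share a leaf. It illustrates the idea only and is not forestgeom's implementation. The names Z and P and the column layout (one column per tree node) are assumptions made for this example.

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# leaves[i, t] is the node id of the leaf that sample i reaches in tree t.
leaves = forest.apply(X)
n_samples, n_trees = leaves.shape

# Give every (tree, node) pair its own column by offsetting node ids per tree.
node_counts = np.array([est.tree_.node_count for est in forest.estimators_])
offsets = np.concatenate(([0], np.cumsum(node_counts)[:-1]))
cols = (leaves + offsets).ravel()
rows = np.repeat(np.arange(n_samples), n_trees)

# Sparse leaf-incidence matrix: exactly one nonzero per sample per tree.
Z = sparse.csr_matrix(
    (np.ones(rows.size), (rows, cols)),
    shape=(n_samples, int(node_counts.sum())),
)

# Classic proximity: fraction of trees in which two samples share a leaf.
# Materialized here only because the dataset is tiny.
P = (Z @ Z.T) / n_trees
```

Z stores n_samples * n_trees nonzeros regardless of how many leaves the forest has, which is what makes the sparse representation cheap compared with a dense pairwise matrix.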
Forest geometry has since been used in a wide range of applications that need a graded, context-aware notion of similarity beyond class-conditional Euclidean distances or black-box deep representation models. This makes it especially useful for modality-agnostic pipelines, tabular data, and sparse, noisy, high-dimensional settings such as single-cell analysis. Reference [1] provides a comprehensive literature overview, and this package aims to encourage further work in these directions by unifying a collection of forest models and geometric perspectives behind a single API.
The implementation includes several proximity constructions within this leaf-incidence view, including standard forest kernels, KeRF-style leaf-size-normalized kernels, boosted tree-weighted kernels, and GAP/OOB proximities from “Geometry- and Accuracy-Preserving Random Forest Proximities” [2].
The project is intended to evolve beyond leaf-incidence maps into a broader framework for forest-induced representation learning. Natural extensions include node/path-based geometry, additional base forest families, and GPU-accelerated pipelines.
Installation
The recommended installer is uv. To install
into the active environment:
uv pip install forestgeom
Optional dependencies are grouped by feature:
# LightGBM and XGBoost adapters
uv pip install "forestgeom[boosted]"
# Visualization and embedding tools
uv pip install "forestgeom[viz]"
# Experiment dependencies
uv pip install "forestgeom[experiments]"
# Test dependencies
uv pip install "forestgeom[test]"
# Everything above
uv pip install "forestgeom[all]"
To try unreleased features from the GitHub repository, install directly from a branch, tag, or commit:
# latest main branch
uv pip install git+https://github.com/JakeSRhodesLab/ForestGeom.git
# specific branch or tag
uv pip install git+https://github.com/JakeSRhodesLab/ForestGeom.git@main
# GitHub install with extras
uv pip install 'git+https://github.com/JakeSRhodesLab/ForestGeom.git@main#egg=forestgeom[boosted]'
If you are adding forestgeom to an existing uv-managed project, use uv add
instead:
uv add forestgeom
uv add "forestgeom[boosted]"
For local development from a cloned checkout:
uv sync --extra test
pip also works:
pip install forestgeom
pip install "forestgeom[boosted]"
pip install "forestgeom[viz]"
pip install "forestgeom[experiments]"
pip install "forestgeom[test]"
pip install "forestgeom[all]"
pip install git+https://github.com/JakeSRhodesLab/ForestGeom.git
pip install -e ".[test]"
Architecture
ForestGeom is organized around one central object, ForestProximity. The class
wraps a fitted tree ensemble and turns it into a reusable geometry object built
from sparse leaf-incidence maps.
       RandomForest / ExtraTrees / GBT / LightGBM / XGBoost
                                 |
                                 v
X_train, y_train --> +------------------------+
        fit(...)     |    ForestProximity     |
                     +------------------------+
                                 |
                                 v
                   fitted adapter + ForestCache
                                 |
                                 v
              +------------------------------------+
              |    sparse forest representation    |
              |  leaf incidence + scheme weights   |
              +------------------------------------+
                                 |
                   +-------------+-------------+
                   |                           |
                   v                           v
        +----------------------+   +------------------------+
        |  separable schemes   |   |   corrected schemes    |
        |      P = Q W^T       |   |  P = normalize(QW^T)   |
        +----------------------+   +------------------------+
                   |                           |
                   v                           |
        +----------------------+               |
        | Q: query_map(X=None) |               |
        | W: reference_map()   |               |
        +----------------------+               |
                   |                           |
                   +-------------+-------------+
                                 |
                                 v
                       +-------------------+
                       | transform(X_new)  |
                       | P(X_new, X_train) |
                       +-------------------+
                                 |
                                 v
                 forest-induced proximity geometry
The adapter layer hides backend-specific details such as leaf indexing,
bootstrap masks, in-bag counts, and boosted tree weights. The map-building layer
then uses those quantities to construct the sparse geometry for the selected
weighting scheme (uniform, kerf, oob, gap, or boosted).
The important distinction is:
- Symmetric schemes such as uniform, kerf, and boosted use the same leaf-incidence geometry on both sides, so the resulting proximity is a kernel.
- Asymmetric schemes such as gap expose distinct query and reference maps, which induce a bilinear form P(i, j) = <Q(i), W(j)> rather than a kernel.
- The true Breiman OOB scheme is pairwise-normalized and is computed directly as a sparse proximity matrix; it does not factor into a single reusable Q/W pair.
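For separable schemes, the factorization P = Q W^T means a product P v can be evaluated as Q (W^T v) without ever forming P. The sketch below checks this identity on random sparse stand-ins for Q and W. The matrices, their sizes, and their density are assumptions chosen for illustration; they are not produced by forestgeom.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_query, n_ref, n_leaves = 500, 400, 2000

# Stand-ins for sparse query/reference maps (random, for illustration only).
Q = sparse.random(n_query, n_leaves, density=0.01, format="csr", random_state=0)
W = sparse.random(n_ref, n_leaves, density=0.01, format="csr", random_state=1)
v = rng.standard_normal(n_ref)

# Matrix-free: two sparse mat-vecs; the (n_query x n_ref) block is never formed.
Pv_free = Q @ (W.T @ v)

# Reference: materialize P = Q W^T densely, then multiply.
P = (Q @ W.T).toarray()
Pv_dense = P @ v
```

The matrix-free route costs time proportional to the number of nonzeros in Q and W, while the dense route needs memory for every query/reference pair.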
Usage
ForestProximity wraps a tree ensemble estimator and clones/fits it during
fit(...). It supports a unified set of forest backends and weighting schemes:
Supported base forest classes include:
- sklearn.ensemble.RandomForestClassifier
- sklearn.ensemble.RandomForestRegressor
- sklearn.ensemble.ExtraTreesClassifier
- sklearn.ensemble.ExtraTreesRegressor
- sklearn.ensemble.GradientBoostingClassifier
- sklearn.ensemble.GradientBoostingRegressor
- lightgbm.LGBMClassifier and lightgbm.LGBMRegressor with forestgeom[boosted]
- xgboost.XGBClassifier and xgboost.XGBRegressor with forestgeom[boosted]
Supported leaf-weighting schemes include:
- uniform: symmetric leaf co-occurrence factorization of the standard forest kernel.
- kerf: symmetric leaf-size-normalized factorization of the KeRF kernel.
- oob: pairwise-normalized Breiman OOB proximity computed directly in sparse form.
- gap: asymmetric query/reference factorization that combines OOB-side query weights with in-bag reference weights to recover the GAP proximity definition.
- boosted: symmetric tree-weighted leaf kernel for supported boosted ensembles.
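One way to read the symmetric schemes is as nonnegative per-column weights d on a leaf-incidence matrix Z: the map Q = Z diag(sqrt(d)) gives Q Q^T = Z diag(d) Z^T, which is symmetric and positive semidefinite by construction. The sketch below uses toy leaf assignments and three illustrative weightings. The exact weights forestgeom uses for kerf and boosted are not stated here, so the leaf_size and tree_weighted variants, the random tree weights, and the helper scaled_map are all hypothetical.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_samples, n_trees, leaves_per_tree = 40, 6, 5

# Toy leaf assignments: leaves[i, t] in {0, ..., leaves_per_tree - 1}.
leaves = rng.integers(0, leaves_per_tree, size=(n_samples, n_trees))
cols = (leaves + leaves_per_tree * np.arange(n_trees)).ravel()
rows = np.repeat(np.arange(n_samples), n_trees)
Z = sparse.csr_matrix(
    (np.ones(rows.size), (rows, cols)),
    shape=(n_samples, n_trees * leaves_per_tree),
)

leaf_sizes = np.asarray(Z.sum(axis=0)).ravel()      # samples per (tree, leaf)
tree_weights = rng.uniform(0.1, 1.0, size=n_trees)  # hypothetical tree weights

# Per-column weights d >= 0 for three illustrative (assumed) schemes.
weights = {
    "uniform": np.full(Z.shape[1], 1.0 / n_trees),
    "leaf_size": 1.0 / (n_trees * np.maximum(leaf_sizes, 1.0)),
    "tree_weighted": np.repeat(tree_weights / tree_weights.sum(), leaves_per_tree),
}


def scaled_map(Z, d):
    """Return Q = Z diag(sqrt(d)), so that Q Q^T = Z diag(d) Z^T."""
    return Z @ sparse.diags(np.sqrt(d))


# Dense kernels, materialized only because the toy problem is small.
kernels = {
    name: (scaled_map(Z, d) @ scaled_map(Z, d).T).toarray()
    for name, d in weights.items()
}
```

Because every variant only rescales columns, Q keeps the sparsity pattern of Z.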
Not every estimator supports every weighting scheme. Random Forests and
ExtraTrees estimators support uniform and kerf; they support oob and
gap only when fitted with bootstrap=True. Boosted estimators support
uniform, kerf, and boosted.
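The bootstrap requirement for oob follows from the definition: the proximity only counts trees in which both samples are out-of-bag. The toy sketch below uses one common formulation (shared-leaf count over trees where both samples are OOB, divided by the number of such trees) with random leaf assignments and simulated bootstrap draws. It is dense for readability, whereas the package describes a sparse computation, and the array names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_trees, leaves_per_tree = 30, 50, 4

# Toy leaf assignments (random, not from a fitted forest).
leaves = rng.integers(0, leaves_per_tree, size=(n_samples, n_trees))

# Simulated bootstrap: sample i is out-of-bag in tree t if it was never drawn.
oob = np.stack(
    [
        np.bincount(rng.integers(0, n_samples, n_samples), minlength=n_samples) == 0
        for _ in range(n_trees)
    ],
    axis=1,
)

shared = np.zeros((n_samples, n_samples))    # both OOB and in the same leaf
both_oob = np.zeros((n_samples, n_samples))  # both OOB
for t in range(n_trees):
    o = oob[:, t].astype(float)
    same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
    pair_oob = np.outer(o, o)
    shared += pair_oob * same_leaf
    both_oob += pair_oob

# Pairwise normalization: every entry has its own denominator.
P_oob = np.divide(shared, both_oob, out=np.zeros_like(shared), where=both_oob > 0)
```

The entrywise denominator is what prevents a single reusable Q/W factorization: the numerator and the denominator each factor, but their ratio does not.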
Use fit(...) when you want to train and keep the fitted geometry, and use
fit_transform(...) when you want the fitted train-train proximity matrix right
away. Use query_map(...) and reference_map(...) when you need the actual
leaf-incidence factors Q and W for matrix-free applications, and use transform(...)
for the proximity block from new samples to the fitted training set.
For schemes that are not symmetric kernels, fit_transform(...) and
fit(...).transform(...) are not necessarily the same. If you need the
training geometry, use fit_transform(...) directly or call
training_proximity(...) on the fitted estimator.
For symmetric weighting schemes such as uniform, kerf, and boosted, the
query map is typically the leaf-space feature matrix. For asymmetric schemes
such as gap, keep both Q and W if you want to work directly with the
geometry. For oob, use training_proximity(...) or transform(...) directly;
there is no separate query/reference factorization.
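The asymmetry of a gap-style scheme can be sketched with two different maps on the same leaf columns. The example assumes the GAP definition in [2] takes the form P(i, j) = (1/|S_i|) * sum over trees t in S_i of c_j(t) * 1[j shares i's leaf] / |M_i(t)|, where S_i is the set of trees in which i is out-of-bag, c_j(t) is the in-bag count of j, and |M_i(t)| is the total in-bag count of i's leaf. Leaf assignments and bootstrap counts are simulated, and leaf_map is a hypothetical helper, not part of forestgeom.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_samples, n_trees, leaves_per_tree = 200, 40, 4
n_cols = n_trees * leaves_per_tree

# Toy stand-ins for what a fitted bagged forest would provide.
leaves = rng.integers(0, leaves_per_tree, size=(n_samples, n_trees))
inbag = np.stack(  # inbag[j, t] = number of times sample j was drawn for tree t
    [
        np.bincount(rng.integers(0, n_samples, n_samples), minlength=n_samples)
        for _ in range(n_trees)
    ],
    axis=1,
).astype(float)
oob = inbag == 0

rows = np.repeat(np.arange(n_samples), n_trees)
cols = (leaves + leaves_per_tree * np.arange(n_trees)).ravel()


def leaf_map(values):
    """Sparse map holding values[i, t] in the column of (tree t, leaf of i)."""
    return sparse.csr_matrix((values.ravel(), (rows, cols)), shape=(n_samples, n_cols))


# Query side: OOB indicator, averaged over the trees where the sample is OOB.
n_oob_trees = np.maximum(oob.sum(axis=1, keepdims=True), 1)
Q = leaf_map(oob / n_oob_trees)

# Reference side: in-bag count divided by the total in-bag count of the leaf.
C = leaf_map(inbag)
leaf_mass = np.asarray(C.sum(axis=0)).ravel()
W = C @ sparse.diags(1.0 / np.maximum(leaf_mass, 1.0))

# Bilinear form P(i, j) = <Q(i), W(j)>; dense only because the toy is small.
P_gap = (Q @ W.T).toarray()
```

Under these assumptions each row of P_gap sums to one and the diagonal is zero, since a sample that is out-of-bag in a tree has in-bag count zero there. P_gap is not symmetric, which is why both Q and W need to be kept.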
The sparse geometry can be used directly in proximity-based workflows such as manifold learning, dimensionality reduction, visualization, imputation, and custom downstream estimators.
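As one example of a matrix-free downstream step, a truncated SVD of a sparse symmetric-scheme map Q yields a low-dimensional embedding whose squared singular values are the leading eigenvalues of P = Q Q^T, without building P. The sketch below uses a toy uniform-weighted map built from random leaf assignments. It is an uncentered spectral embedding rather than the leaf PCA from the demos, and all names in it are assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
n_samples, n_trees, leaves_per_tree = 120, 20, 6

# Toy uniform-scheme map: each nonzero equals 1 / sqrt(n_trees).
leaves = rng.integers(0, leaves_per_tree, size=(n_samples, n_trees))
rows = np.repeat(np.arange(n_samples), n_trees)
cols = (leaves + leaves_per_tree * np.arange(n_trees)).ravel()
Q = sparse.csr_matrix(
    (np.full(rows.size, 1.0 / np.sqrt(n_trees)), (rows, cols)),
    shape=(n_samples, n_trees * leaves_per_tree),
)

# Truncated SVD of the sparse map: Q ~ U diag(s) V^T, never forming P = Q Q^T.
k = 3
U, s, Vt = svds(Q, k=k)
order = np.argsort(s)[::-1]  # svds does not guarantee descending order
U, s = U[:, order], s[order]
embedding = U * s            # k-dimensional coordinates, one row per sample

# Dense check (toy only): top eigenvalues of P equal the squared singular values.
top_eigs = np.linalg.eigvalsh((Q @ Q.T).toarray())[::-1][:k]
```

With a real fitted geometry, Q would come from query_map() instead of the toy construction above.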
Quick Start
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from forestgeom import ForestProximity
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=0,
)
forest = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,
    random_state=0,
    n_jobs=-1,
)
geometry = ForestProximity(forest=forest, weight_scheme="uniform").fit(X_train, y_train)
# Query/reference maps define the symmetric geometry.
Q_train = geometry.query_map()
W_train = geometry.reference_map()  # Same as Q_train for symmetric schemes such as 'uniform', 'kerf', and 'boosted'.
Q_test = geometry.query_map(X_test) # Leaf-incidence representations of the test set.
# Matrix-free forest kernel SVM using the leaf maps directly as sparse features.
svm = LinearSVC()
svm.fit(Q_train, y_train)
pred = svm.predict(Q_test)
print(f"leaf-map SVM accuracy: {accuracy_score(y_test, pred):.3f}")
# Comparison with the base forest classifier.
pred = geometry.forest_.predict(X_test)
print(f"base-forest accuracy: {accuracy_score(y_test, pred):.3f}")
# To run the boosted example, install optional dependencies first:
# uv pip install "forestgeom[boosted]"
from xgboost import XGBClassifier
forest = XGBClassifier(n_estimators=200, random_state=0)
boosted_geometry = ForestProximity(forest=forest, weight_scheme="boosted")
K_train = boosted_geometry.fit_transform(X_train, y_train)
K_test = boosted_geometry.transform(X_test)
Demos and Experiments
The repository includes notebook demos for common workflows:
- demos/demo_iris.ipynb: general-purpose introduction on the Iris dataset.
- demos/demo_leaf_pca.ipynb: matrix-free supervised manifold learning with leaf PCA using the leaf-incidence maps in kernel proximities.
- demos/demo_boosted.ipynb: boosted-tree examples using the optional boosted adapters.
The experiments/ directory contains Python scripts and notebooks used
to reproduce experiments and compile results from “Revisiting Forest Proximities via Sparse
Leaf-Incidence Kernels”.
Citation
If you use this software in your research or experiments, please cite the leaf-incidence kernel framework paper [1]:
[1] Revisiting Forest Proximities via Sparse Leaf-Incidence Kernels.
@misc{aumon2026revisitingforestproximitiessparse,
title={Revisiting Forest Proximities via Sparse Leaf-Incidence Kernels},
author={Adrien Aumon and Guy Wolf and Kevin R. Moon and Jake S. Rhodes},
year={2026},
eprint={2601.02735},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.02735}}
If you specifically use the gap weighting scheme, please also cite the GAP
proximity paper [2]:
[2] Geometry- and Accuracy-Preserving Random Forest Proximities.
@ARTICLE{10089875,
author={Rhodes, Jake S. and Cutler, Adele and Moon, Kevin R.},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={Geometry- and Accuracy-Preserving Random Forest Proximities},
year={2023},
volume={45},
number={9},
pages={10947-10959},
keywords={Random forests;Forestry;Geometry;Data visualization;Decision trees;Task analysis;Anomaly detection;Proximities;random forests;supervised learning},
doi={10.1109/TPAMI.2023.3263774}}