Interpretable machine learning on graph-structured data using path-based boosting.
Path Boost
Path Boost is a Python library for interpretable machine learning on graph-structured data. It implements the PathBoost and SequentialPathBoost algorithms, which iteratively construct features based on paths in graphs and use boosting to build predictive models. The library is designed for tasks where input data consists of collections of graphs (e.g., molecules, social networks) and supports variable importance analysis for interpretability.
Features
- PathBoost: Ensemble learning over graph paths, partitioned by anchor nodes.
- SequentialPathBoost: Boosting with path-based features, iteratively expanding the feature space.
- Variable Importance: Quantifies the importance of paths/features in prediction.
- Parallel Training: Supports multi-core training for large datasets.
- Evaluation and Visualization: Built-in tools for error tracking and variable importance plotting.
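To make the idea of path-based features concrete, here is a minimal sketch of how label sequences along paths starting at anchor nodes could be counted for a single graph. It is an illustration only, not the library's implementation: the function name `label_paths_from_anchors`, the plain-dict graph representation (standing in for a networkx graph so the sketch has no dependencies), and the convention that path length is counted in nodes are all assumptions.

```python
from collections import Counter


def label_paths_from_anchors(adjacency, labels, anchor_labels, max_path_length):
    """Count label sequences of simple paths that start at an anchor node.

    adjacency: dict mapping node -> iterable of neighbours (undirected graph)
    labels: dict mapping node -> node label
    anchor_labels: set of labels that mark a node as an anchor
    max_path_length: maximum number of nodes in a path (assumed convention)
    """
    counts = Counter()

    def extend(path):
        # Record the label sequence of the current path as one feature hit.
        counts[tuple(labels[n] for n in path)] += 1
        if len(path) == max_path_length:
            return
        for neighbour in adjacency[path[-1]]:
            if neighbour not in path:  # simple paths only, no revisits
                extend(path + [neighbour])

    for node, label in labels.items():
        if label in anchor_labels:
            extend([node])
    return counts


# Toy graph: chain 0 - 1 - 2 with an extra branch 1 - 3.
adjacency = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
labels = {0: "A", 1: "B", 2: "C", 3: "C"}
features = label_paths_from_anchors(
    adjacency, labels, anchor_labels={"A"}, max_path_length=3
)
```

Each key of `features` is a tuple of node labels, and each value is the number of times that labelled path occurs from an anchor. Stacking these counts over a collection of graphs gives a feature matrix that a boosting model can consume.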
Installation
Install from PyPI:
pip install path_boost
Usage Example
Below is a minimal example using the PathBoost model:
import numpy as np
import networkx as nx
from sklearn.model_selection import train_test_split

from path_boost import PathBoost
from path_boost.utils.datasets_for_examples.generate_example_dataset import generate_synthetic_graph_dataset

if __name__ == "__main__":
    # Generate synthetic dataset
    nx_graphs, y, true_paths, true_weights = generate_synthetic_graph_dataset()

    list_anchor_nodes_labels = [0, 1, 2]
    parameters_variable_importance: dict = {
        'criterion': 'absolute',
        'error_used': 'mse',
        'use_correlation': False,
        'normalize': True,
    }

    X_train, X_test, y_train, y_test = train_test_split(nx_graphs, y, test_size=0.25, random_state=42)
    eval_set = [(X_test, y_test)]

    path_boost = PathBoost(
        n_iter=50,  # Reduced for quicker example run
        max_path_length=5,
        learning_rate=0.1,
        n_of_cores=1,  # Set to >1 for parallel processing if desired
        verbose=True,
        parameters_variable_importance=parameters_variable_importance
    )

    # Fit the model
    # anchor_nodes_label_name must correspond to the feature storing node types ('feature_0')
    path_boost.fit(
        X=X_train,
        y=y_train,
        eval_set=eval_set,
        list_anchor_nodes_labels=list_anchor_nodes_labels,
        anchor_nodes_label_name="feature_0"  # Node types are in 'feature_0'
    )

    print(f"Generated {len(nx_graphs)} graphs.")
    print(f"Example y values: {y[:5]}")
    print(f"True paths definitions: {true_paths}")
    print(f"True path weights: {true_weights}")

    path_boost.plot_training_and_eval_errors(skip_first_n_iterations=0, plot_eval_sets_error=True)

    if path_boost.parameters_variable_importance is not None and hasattr(path_boost, 'variable_importance_'):
        path_boost.plot_variable_importance(top_n_features=10)
    else:
        print("Variable importance not computed or available.")

    print("Example run finished.")
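The `n_iter` and `learning_rate` arguments used above follow the usual boosting convention of many small, shrunken updates. As a rough intuition for what such a loop does on path-count features, the sketch below implements a generic componentwise least-squares boosting routine in pure Python. It is not the algorithm used by Path Boost: the function name `componentwise_boosting`, the base learner, and the importance measure (per-feature share of the total MSE reduction) are assumptions chosen for illustration.

```python
def componentwise_boosting(X, y, n_iter, learning_rate):
    """Greedy componentwise least-squares boosting on a feature matrix.

    X: list of rows (one per graph), each a list of path-count features
    y: list of numeric targets
    Returns (intercept, coefficients, train_mse, importance).
    """
    n_samples, n_features = len(X), len(X[0])
    intercept = sum(y) / n_samples
    coefficients = [0.0] * n_features
    importance = [0.0] * n_features
    residuals = [target - intercept for target in y]
    train_mse = []

    for _ in range(n_iter):
        best = None  # (squared-error reduction, feature index, slope)
        for j in range(n_features):
            column = [row[j] for row in X]
            norm = sum(v * v for v in column)
            if norm == 0.0:
                continue  # feature never occurs, nothing to fit
            dot = sum(v * r for v, r in zip(column, residuals))
            reduction = dot * dot / norm
            if best is None or reduction > best[0]:
                best = (reduction, j, dot / norm)
        if best is None:
            break
        _, j, slope = best
        step = learning_rate * slope  # shrink the least-squares update
        before = sum(r * r for r in residuals) / n_samples
        coefficients[j] += step
        residuals = [r - step * row[j] for r, row in zip(residuals, X)]
        after = sum(r * r for r in residuals) / n_samples
        importance[j] += before - after  # credit the MSE drop to feature j
        train_mse.append(after)

    total = sum(importance)
    if total > 0:
        importance = [v / total for v in importance]  # normalize to sum to 1
    return intercept, coefficients, train_mse, importance


# Toy data: two path-count features, target driven by the first one.
X = [[1, 0], [2, 1], [3, 0], [4, 1]]
y = [2.0, 4.0, 6.0, 8.0]
intercept, coefficients, train_mse, importance = componentwise_boosting(
    X, y, n_iter=20, learning_rate=0.1
)
```

In this sketch `train_mse` holds one value per iteration and never increases, which is the shape of curve one would expect `plot_training_and_eval_errors` to display for the training set.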
API Overview
PathBoost
- fit(X, y, anchor_nodes_label_name, list_anchor_nodes_labels, eval_set=None)
- predict(X)
- predict_step_by_step(X)
- evaluate(X, y)
- plot_training_and_eval_errors(skip_first_n_iterations=True)
- plot_variable_importance()
- Attributes:
  - train_mse_: Training error (MSE) at each iteration
  - mse_eval_set_: Evaluation set error (MSE) at each iteration (if eval_set is provided)
  - variable_importance_: Variable/path importance scores (if enabled)
  - is_fitted_: Whether the model is fitted
  - models_list_: List of fitted SequentialPathBoost models (one per anchor node)
  - (Each SequentialPathBoost in models_list_ exposes the attributes listed under SequentialPathBoost)
SequentialPathBoost
- fit(X, y, list_anchor_nodes_labels, name_of_label_attribute, eval_set=None)
- predict(X)
- predict_step_by_step(X)
- evaluate(X, y)
- plot_training_and_eval_errors(skip_first_n_iterations=True)
- plot_variable_importance()
- Attributes:
  - train_mse_: Training error (MSE) at each iteration
  - train_mae_: Training MAE at each iteration
  - eval_sets_mse_: Evaluation set error (MSE) at each iteration (if eval_set is provided)
  - eval_sets_mae_: Evaluation set MAE at each iteration (if eval_set is provided)
  - variable_importance_: Variable/path importance scores (if enabled)
  - paths_selected_by_epb_: Set of selected paths during boosting
  - columns_names_: Names of EBM columns/features used
  - is_fitted_: Whether the model is fitted
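Per-iteration error attributes such as train_mse_ and eval_sets_mse_ can be used to see where the evaluation error bottoms out. The helper below is a small sketch of that idea. The name `best_iteration` is hypothetical, and it assumes the errors are available as a flat sequence of floats; if the library stores one sequence per evaluation set, pass the sequence for a single set.

```python
def best_iteration(errors):
    """Return (index, value) of the smallest error in a per-iteration list."""
    if not errors:
        raise ValueError("errors must contain at least one value")
    index = min(range(len(errors)), key=errors.__getitem__)
    return index, errors[index]


# Illustrative values only, not output from the library.
eval_mse = [4.0, 2.5, 1.8, 1.9, 2.3]
index, value = best_iteration(eval_mse)
```

An evaluation error that starts rising again after its minimum, as in the illustrative values here, is a common sign that later iterations are overfitting the training set.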
Requirements
- Python 3.10+
- numpy
- pandas
- scikit-learn
- networkx
- matplotlib
(See requirements.txt for the full list.)
Citation
If you use this library in your research, please cite the corresponding paper (add citation here).
License
BSD 3-Clause License
File details
Details for the file path_boost-2.1.0.tar.gz.
File metadata
- Download URL: path_boost-2.1.0.tar.gz
- Upload date:
- Size: 177.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 65acc2dac849cef16c327d4da307efe0517dbcba1b6e8a4623aa7285ad09dc2a |
| MD5 | 9edf32166327411c1dca5c62d2ae532d |
| BLAKE2b-256 | 86e209c8bda4c70d4b3169271b6e74b92083948dde49da77c06e41a1ae468006 |
File details
Details for the file path_boost-2.1.0-py3-none-any.whl.
File metadata
- Download URL: path_boost-2.1.0-py3-none-any.whl
- Upload date:
- Size: 61.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9780ead44306cf4753091051b169d55419f70c1864fc42804758d182d66e4f16 |
| MD5 | 10b76c1bba780f895c805f5d9700e00e |
| BLAKE2b-256 | 8b00f479524af1e336106b5dffbfb31f0bc7aa3bdf4f1757e13e9a0364f9f516 |