A package for building customizable decision trees and random forests.
Project description
Custom Decision Trees is a Python package that lets you build Decision Trees / Random Forests models with advanced configuration :
Main Features
Splitting criteria customization
Define your own cutting criteria in Python language (documentation in the following sections).
This feature is particularly useful in "cost-dependent" scenarios. Examples:
- Trading Movements Classification: When the goal is to maximize economic profit, the metric can be set to economic profit, optimizing tree splitting accordingly.
- Churn Prediction: To minimize false negatives, metrics like F1 score or recall can guide the splitting process.
- Fraud Detection: Splitting can be optimized based on the proportion of fraudulent transactions identified relative to the total, rather than overall classification accuracy.
- Marketing Campaigns: The splitting can focus on maximizing expected revenue from customer segments identified by the tree.
Multi-conditional node splitting
Allow trees to split nodes with one or more simultaneous conditions.
Example of multi-condition splitting on the Titanic dataset:
Other features
- Supports classification and regression
- Supports multiclass classification
- Supports standard decision tree parameters (max_depth, min_samples_split, max_features, n_estimators, etc.)
- Supports STRING type explanatory variables
- Ability to control the number of variable splitting options when optimizing a split (i.e
nb_max_cut_options_per_varparameter). - Ability to control the maximum number of splits to be tested per node to avoid overly long calculations in multi-condition mode (i.e
nb_max_split_options_per_nodeparameters) - Possibility of parallelizing calculations (i.e
n_jobsparameters)
Reminder on splitting criteria
Splitting in a decision tree is achieved by optimizing a metric. For example, Gini optimization consists in maximizing the $\Delta_{Gini}$ :
- The Gini Index represents the impurity of a group of observations based on the observations of each class (0 and 1):
$$ I_{Gini} = 1 - p_0^2 - p_1^2 $$
- The metric to be maximized is $\Delta_{Gini}$, the difference between the Gini index on the parent node and the weighted average of the Gini index between the two child nodes ($L$ and $R$).
$$ \Delta_{Gini} = I_{Gini} - \frac{N_L * I_{Gini_L}}{N} - \frac{N_R * I_{Gini_R}}{N} $$
At each node, the tree algorithm finds the split that minimizes $\Delta$ over all possible splits and over all features. Once the optimal split is selected, the tree is grown by recursively applying this splitting process to the resulting child nodes.
Usage
See
./notebooks/folder for complete examples.
Installation
pip install custom-decision-trees
Define your metric
To integrate a specific measure, the user must define a class containing the compute_metric and compute_delta methods, then insert this class into the classifier.
Example of a class with the Gini index :
import numpy as np
from custom_decision_trees.metrics import MetricBase
class Gini(MetricBase):
def __init__(
self,
n_classes: int = 2,
) -> None:
self.n_classes = n_classes
self.max_impurity = 1 - 1 / n_classes
def compute_gini(
self,
metric_data: np.ndarray,
) -> float:
y = metric_data[:, 0]
nb_obs = len(y)
if nb_obs == 0:
return self.max_impurity
props = [(np.sum(y == i) / nb_obs) for i in range(self.n_classes)]
metric = 1 - np.sum([prop**2 for prop in props])
return float(metric)
def compute_metric(
self,
metric_data: np.ndarray,
mask: np.ndarray,
):
gini_parent = self.compute_gini(metric_data)
gini_side1 = self.compute_gini(metric_data[mask])
gini_side2 = self.compute_gini(metric_data[~mask])
delta = (
gini_parent -
gini_side1 * np.mean(mask) -
gini_side2 * (1 - np.mean(mask))
)
metadata = {"gini": round(gini_side1, 3)}
return float(delta), metadata
Train and predict
Once you have instantiated the model with your custom metric, all you have to do is use the .fit and .predict_proba methods:
from custom_decision_trees import DecisionTreeClassifier
gini = Gini()
decision_tree = DecisionTreeClassifier(
metric=gini,
max_depth=2,
nb_max_conditions_per_node=2 # Set to 1 for a traditional decision tree
)
decision_tree.fit(
X=X_train,
y=y_train,
metric_data=metric_data,
)
probas = model.predict_probas(
X=X_test
)
probas[:5]
>>> array([[0.75308642, 0.24691358],
[0.36206897, 0.63793103],
[0.75308642, 0.24691358],
[0.36206897, 0.63793103],
[0.90243902, 0.09756098]])
Print the tree
You can also display the decision tree, with the values of your metrics, using the print_tree method:
decision_tree.print_tree(
feature_names=features,
metric_name="MyMetric",
)
>>> [0] 712 obs -> MyMetric = 0.0
| [1] (x["Sex"] <= 0.0) AND (x["Pclass"] <= 2.0) | 157 obs -> MyMetric = 0.16
| | [3] (x["Age"] <= 2.0) AND (x["Fare"] > 26.55) | 1 obs -> MyMetric = 0.01
| | [4] (x["Age"] > 2.0) OR (x["Fare"] <= 26.55) | 156 obs -> MyMetric = 0.01
| [2] (x["Sex"] > 0.0) OR (x["Pclass"] > 2.0) | 555 obs -> MyMetric = 0.16
| | [5] (x["SibSp"] <= 2.0) AND (x["Age"] <= 8.75) | 27 obs -> MyMetric = 0.05
| | [6] (x["SibSp"] > 2.0) OR (x["Age"] > 8.75) | 528 obs -> MyMetric = 0.05
Plot the tree
decision_tree.plot_tree(
feature_names=features,
metric_name="delta gini",
)
Random Forest
Same with Random Forest Classifier :
from custom_decision_trees import RandomForestClassifier
random_forest = RandomForest(
metric=gini,
n_estimators=10,
max_depth=2,
nb_max_conditions_per_node=2,
)
random_forest.fit(
X=X_train,
y=y_train,
metric_data=metric_data
)
probas = random_forest.predict_probas(
X=X_test
)
Regression
The "regression" mode is used in exactly the same way as "classification", i.e., by specifying the metric from a Python class.
your_metric = YourMetric()
decision_tree_regressor = DecisionTreeRegressor(
metric=your_metric,
max_depth=2,
min_samples_split=2,
min_samples_leaf=1,
max_features=None,
nb_max_conditions_per_node=2,
nb_max_cut_options_per_var=10,
n_jobs=1
)
decision_tree_regressor.fit(
X=X,
y=y,
metric_data=metric_data,
)
See /notebooks folder for complete examples.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file custom_decision_trees-3.0.1.tar.gz.
File metadata
- Download URL: custom_decision_trees-3.0.1.tar.gz
- Upload date:
- Size: 23.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
29e3cb49ecb6d3d1912725da664c85aa3bc1fac59eb793fe6b534e05e13521f1
|
|
| MD5 |
67cc368ea6ccdc72172d74bfa0c80adf
|
|
| BLAKE2b-256 |
39d7d5592cc2f559a30b4f83cc4520fd13ff039dad8aee287723bf96797cb638
|
File details
Details for the file custom_decision_trees-3.0.1-py3-none-any.whl.
File metadata
- Download URL: custom_decision_trees-3.0.1-py3-none-any.whl
- Upload date:
- Size: 31.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2312542e53365d974e28dd73c1ad1daa6c9b4bf015d1f9664f44b73017c44600
|
|
| MD5 |
4cdf6cc5662ae174151bb68df82e754a
|
|
| BLAKE2b-256 |
fdbf9c5e646f1dc8c7cd7d17186251339f3c535a62790737f728c54d67560f81
|