binarybeech

Simplistic algorithms to train decision trees for regression and classification

Features

  • Create binary trees using the CART (Classification and Regression Tree) algorithm.
  • Train ensembles of trees using Gradient Boosting, Adaptive Boosting (AdaBoost) or Random Forest.
  • Process each data type with a data handler, either as provided or implemented to suit your needs. Just add your own implementation to the factory.
  • Metrics for different kinds of outcome variables are implemented analogously.
  • Features with high cardinality are handled by a simulated annealing solver that searches for the best combination of categories for a split.
  • No need for dummy encoding.
  • Train models using supervised or unsupervised learning.
  • Specify weights for unbalanced datasets.

NOTE: These pure python (and a bit of numpy) algorithms are many times slower than, e.g., sklearn or xgboost.

Install

pip install binarybeech[visualize]

The dependencies installed using the visualize option enable support for plotting and formatting trees.

Example

Load the Classification and Regression Tree model class

import pandas as pd
from binarybeech.binarybeech import CART
from binarybeech.extra import k_fold_split

get the data from a CSV file

df = pd.read_csv("data/titanic.csv")
[(df_train, df_test)] = k_fold_split(df, frac=0.75, random=True, replace=False)

grow a decision tree

c = CART(df=df_train, y_name="Survived", method="classification")
c.create_tree()

predict

c.predict(df_test)

validation metrics

c.validate(df=df_test)

Please have a look at the Jupyter notebooks in this repository for more examples. To try them out online, you can use Binder.

Usage

binarybeech.binarybeech.CART

CART(df, y_name, X_names=None, min_leaf_samples=1, min_split_samples=1, max_depth=10, method="regression", handle_missings="simple", attribute_handlers=None)

Class for a Classification and Regression Tree (CART) model.

  • Parameters
    • df: pandas dataframe with training data
    • y_name: name of the column with the output data/labels
    • X_names: list of names with the inputs to use for the modelling. If None, all columns except y_name are chosen. Default is None.
    • min_leaf_samples: If the number of training samples is lower than this, a terminal node (leaf) is created. Default is 1.
    • min_split_samples: If a split of the training data is proposed with at least one branch containing fewer samples than this, the split is rejected. Default is 1.
    • max_depth: Maximum number of sequential splits. This corresponds to the number of vertical layers of the tree. Default is 10, which corresponds to a maximum number of 1024 terminal nodes.
    • method: Metrics to use for the evaluation of split loss, etc. Can be either "classification", "logistic", "regression", or None. Default is "regression". If None is chosen, the method is deduced from the training dataframe.
    • handle_missings: Specifies how missing data is handled. Can be either None or "simple".
    • attribute_handlers: dict with attribute handler instances for each variable. The data handler determines, e.g., how splits of the dataset are made.
  • Methods
    • predict(df):
      • Parameters:
        • df: dataframe with inputs for predictions.
      • Returns:
        • array with predicted values/labels.
    • train(k=5, plot=True, slack=1.0):
      • Parameters:
        • k: number of different splits of the dataframe into training and test sets for k-fold cross-validation.
        • plot: flag for plotting a diagram of the loss over cost complexity parameter alpha using matplotlib.
        • slack: the amount of slack granted in choosing the best cost complexity parameter alpha. It is given as a multiplier for the standard deviation of alpha at minimum loss, and thus allows choosing an alpha that is somewhat larger, to account for the uncertainty in the k-fold cross-validation procedure.
      • Returns:
    • create_tree(leaf_loss_threshold=1e-12)
      • Returns
    • prune(alpha_max=None, test_set=None, metrics_only=False)
      • Parameters:
        • alpha_max: Stop the pruning procedure at this value of the cost complexity parameter alpha. If None, the tree is pruned down to its root giving the complete relationship between alpha and the loss. Default is None.
        • test_set: data set to use for the evaluation of the losses. If None, the training set is used. Default is None.
        • metrics_only: If True, pruning is performed on a copy of the tree, leaving the actual tree intact. Default is False.
    • validate(df=None)
      • Parameters:
        • df: dataframe to use for (cross-)validation. If None, the training set is used. Default is None.
      • Returns:
        • dict with metrics, e.g. accuracy or RSquared.
  • Attributes
    • tree:
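
For illustration, here is a minimal sketch of the typical workflow using only the constructor arguments and methods documented above (the Titanic dataframe and the "Survived" column are assumptions carried over from the example):

import pandas as pd
from binarybeech.binarybeech import CART

df = pd.read_csv("data/titanic.csv")

# grow a classification tree and train it with k-fold cross-validated pruning
c = CART(df=df, y_name="Survived", method="classification", max_depth=5)
c.train(k=5, plot=False, slack=1.0)

# evaluate on the training set (df=None) and predict
print(c.validate())
print(c.predict(df))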

binarybeech.binarybeech.GradientBoostedTree

GradientBoostedTree(df, y_name, X_names=None, sample_frac=1, n_attributes=None, learning_rate=0.1, cart_settings={}, init_method="logistic", gamma=None, handle_missings="simple", s=None)

Class for a Gradient Boosted Tree model.

  • Parameters
    • df: pandas dataframe with training data
    • y_name: name of the column with the output data/labels
    • X_names: list of names with the inputs to use for the modelling. If None, all columns except y_name are chosen. Default is None.
    • sample_frac: fraction (0, 1] of the training data to use for the training of an individual tree of the ensemble. Default is 1.
    • n_attributes: number of attributes (elements of the X_names list) to use for the training of an individual tree of the ensemble. Default is None which corresponds to all available attributes.
    • learning_rate: the shrinkage parameter used to "downweight" individual trees of the ensemble. Default is 0.1.
    • cart_settings: dict that is passed on to the constructor of the individual tree (binarybeech.binarybeech.CART). For details cf. above.
    • init_method: Metrics to use for the evaluation of split loss, etc., for the initial tree (stump). Can be either "classification", "logistic", "regression", or None. Default is "logistic". If None is chosen, the method is deduced from the training dataframe.
    • gamma: weight for individual trees of the ensemble. If None, the weight for each tree is chosen by line search minimizing the loss given by init_method.
    • handle_missings: Specifies how missing data is handled. Can be either None or "simple".
    • attribute_handlers: dict with data handler instances for each variable. The data handler determines, e.g., how splits of the dataset are made.
  • Methods
    • predict(df)
      • Parameters:
        • df: dataframe with inputs for predictions.
      • Returns:
        • array with predicted values/labels.
    • train(M)
      • Parameters:
        • M: Number of individual trees to create for the ensemble.
      • Returns:
    • validate(df=None)
      • Parameters:
        • df: dataframe to use for (cross-)validation. If None, the training set is used. Default is None.
      • Returns:
        • dict with metrics, e.g. accuracy or RSquared.
  • Attributes
    • trees
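
As a short sketch of how the documented interface fits together (dataset and column name are assumptions carried over from the CART example; the cart_settings keys are CART constructor arguments as described above):

import pandas as pd
from binarybeech.binarybeech import GradientBoostedTree

df = pd.read_csv("data/titanic.csv")

# boosted ensemble of shallow trees; each tree's contribution is shrunk by learning_rate
gbt = GradientBoostedTree(df=df, y_name="Survived", learning_rate=0.1,
                          init_method="logistic", cart_settings={"max_depth": 3})
gbt.train(50)           # M = 50 individual trees
print(gbt.validate())   # dict with metrics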

binarybeech.binarybeech.AdaBoostTree

AdaBoostTree(training_data=None, df=None, y_name=None, X_names=None, sample_frac=1, n_attributes=None, cart_settings={}, method="classification", handle_missings="simple", attribute_handlers=None, seed=None, algorithm_kwargs={})

Class for an AdaBoost model using CARTs as weak learners.

  • Parameters:
    • training_data: Preprocessed instance of class TrainingData.
    • df: pandas dataframe with training data
    • y_name: name of the column with the output data/labels
    • X_names: list of names with the inputs to use for the modelling. If None, all columns except y_name are chosen. Default is None.
    • sample_frac: fraction (0, 1] of the training data to use for the training of an individual tree of the ensemble. Default is 1.
    • n_attributes: number of attributes (elements of the X_names list) to use for the training of an individual tree of the ensemble. Default is None which corresponds to all available attributes.
    • cart_settings: dict that is passed on to the constructor of the individual tree (binarybeech.binarybeech.CART). For details cf. above.
    • method: Metrics to use for the evaluation of split loss, etc. Can be either "classification", "logistic", "regression", or None. Default is "classification". If None is chosen, the method is deduced from the training dataframe.
    • handle_missings: Specifies how missing data is handled. Can be either None or "simple".
    • attribute_handlers: dict with attribute handler instances for each variable. The data handler determines, e.g., how splits of the dataset are made.
  • Methods
    • predict(df)
      • Parameters:
        • df: dataframe with inputs for predictions.
      • Returns:
        • array with predicted values/labels.
    • train(M)
      • Parameters:
        • M: Number of individual trees to create for the ensemble.
      • Returns:
    • validate(df=None)
      • Parameters:
        • df: dataframe to use for (cross-)validation. If None, the training set is used. Default is None.
      • Returns:
        • dict with metrics, e.g. accuracy or RSquared.
    • variable_importance():
      • Returns:
        • dict with normalized importance values.
  • Attributes
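
A minimal usage sketch based only on the methods listed above (data and column names are assumptions from the earlier example):

import pandas as pd
from binarybeech.binarybeech import AdaBoostTree

df = pd.read_csv("data/titanic.csv")

ada = AdaBoostTree(df=df, y_name="Survived", method="classification")
ada.train(25)                      # M = 25 weak learners
print(ada.validate())              # dict with metrics
print(ada.variable_importance())   # dict with normalized importance values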

binarybeech.binarybeech.RandomForest

RandomForest(df, y_name, X_names=None, verbose=False, sample_frac=1, n_attributes=None, cart_settings={}, method="regression", handle_missings="simple", attribute_handlers=None)

Class for a Random Forest model.

  • Parameters
    • df: pandas dataframe with training data
    • y_name: name of the column with the output data/labels
    • X_names: list of names with the inputs to use for the modelling. If None, all columns except y_name are chosen. Default is None.
    • verbose: if set to True, status messages are sent to stdout. Default is False.
    • sample_frac: fraction (0, 1] of the training data to use for the training of an individual tree of the ensemble. Default is 1.
    • n_attributes: number of attributes (elements of the X_names list) to use for the training of an individual tree of the ensemble. Default is None which corresponds to all available attributes.
    • cart_settings: dict that is passed on to the constructor of the individual tree (binarybeech.binarybeech.CART). For details cf. above.
    • method: Metrics to use for the evaluation of split loss, etc. Can be either "classification", "logistic", "regression", or None. Default is "regression". If None is chosen, the method is deduced from the training dataframe.
    • handle_missings: Specifies how missing data is handled. Can be either None or "simple".
    • attribute_handlers: dict with attribute handler instances for each variable. The data handler determines, e.g., how splits of the dataset are made.
  • Methods
    • predict(df)
      • Parameters:
        • df: dataframe with inputs for predictions.
      • Returns:
        • array with predicted values/labels.
    • train(M)
      • Parameters:
        • M: Number of individual trees to create for the ensemble.
      • Returns:
    • validate(df=None)
      • Parameters:
        • df: dataframe to use for (cross-)validation. If None, the training set is used. Default is None.
      • Returns:
        • dict with metrics, e.g. accuracy or RSquared.
    • validate_oob():
      • Returns:
        • dict with metrics, e.g. accuracy or RSquared.
    • variable_importance():
      • Returns:
        • dict with normalized importance values.
  • Attributes
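
Again a minimal sketch using only the documented interface (dataset, column name, and the chosen values for sample_frac and n_attributes are assumptions for illustration):

import pandas as pd
from binarybeech.binarybeech import RandomForest

df = pd.read_csv("data/titanic.csv")

# each tree sees a random subset of rows (sample_frac) and attributes (n_attributes)
rf = RandomForest(df=df, y_name="Survived", method="classification",
                  sample_frac=0.8, n_attributes=3)
rf.train(100)                     # M = 100 trees
print(rf.validate_oob())          # out-of-bag metrics
print(rf.variable_importance())   # dict with normalized importance values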

Principle

Decision trees are, by design, data type agnostic. With only a few methods, such as a splitter for the input variables and a meaningful quantification of the loss, any data type can be used. In this code, this is implemented using a factory pattern for data handling and metrics, which makes decision tree learning simple and versatile.
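
To illustrate the idea, here is a generic sketch of such a factory; the names are hypothetical and not binarybeech's actual classes:

# generic factory-pattern sketch; names are hypothetical, not binarybeech's API
import pandas as pd

class NumericalHandler:
    """Proposes threshold splits for numerical columns."""
    @staticmethod
    def handles(series):
        return pd.api.types.is_numeric_dtype(series)

class NominalHandler:
    """Proposes subset splits for nominal (categorical) columns."""
    @staticmethod
    def handles(series):
        return not pd.api.types.is_numeric_dtype(series)

# the factory: a registry of handlers, queried per column
HANDLERS = [NumericalHandler, NominalHandler]

def handler_for(series):
    # pick the first registered handler that claims the column's data type
    for h in HANDLERS:
        if h.handles(series):
            return h()
    raise TypeError(f"no handler for dtype {series.dtype}")

Supporting a new data type then amounts to appending another handler class to the registry; the tree-growing logic itself stays unchanged.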

For more information please feel free to take a look at the code.

Contributions

Contributions in the form of pull requests are always welcome.
