Object-oriented implementations of decision tree variants
This repository will contain several variants of decision tree and ensemble classification algorithms, written in an object-oriented style. My immediate goal is to reproduce some of the results from this paper on canonical correlation forests, testing against the same datasets used there.
One major difference from scikit-learn is that datasets and their attributes are treated as first-class objects. Additionally, all classifiers must be initialized with their training dataset (as opposed to calling fit). For example:
```python
from oo_trees.dataset import Dataset
from oo_trees.decision_tree import DecisionTree
from oo_trees.random_forest import RandomForest

X = examples  # numpy 2D numeric array
y = outcomes  # numpy 1D array

dataset = Dataset(X, y)
training_dataset, test_dataset = dataset.random_split(0.75)

d_tree = DecisionTree(training_dataset)
forest = RandomForest(training_dataset)

print(d_tree.classify(test_dataset.X))
print(forest.classify(test_dataset.X))

d_tree_confusion_matrix = d_tree.performance_on(test_dataset)
forest_confusion_matrix = forest.performance_on(test_dataset)

print(d_tree_confusion_matrix.accuracy)
print(forest_confusion_matrix.accuracy)
```
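The `performance_on` call returns a confusion matrix object. As a rough sketch of the idea (not this repository's actual implementation; the class name and counts layout below are assumptions), accuracy is the fraction of examples that land on the matrix's diagonal:

```python
import numpy as np

# Hypothetical sketch of a confusion matrix, assuming a square counts
# array with actual classes as rows and predicted classes as columns.
class ConfusionMatrixSketch:
    def __init__(self, counts):
        self.counts = np.asarray(counts)

    @property
    def accuracy(self):
        # correctly classified examples lie on the diagonal
        return self.counts.trace() / self.counts.sum()

cm = ConfusionMatrixSketch([[40, 10],
                            [5, 45]])
print(cm.accuracy)  # (40 + 45) / 100 = 0.85
```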
When initializing datasets, we assume all attributes of the training examples are categorical. If that is not the case, you can pass in an additional list of attribute objects at initialization:
```python
from oo_trees.dataset import Dataset
from oo_trees.attribute import NumericAttribute, CategoricalAttribute

X = examples
y = outcomes

attributes = [
    NumericAttribute(index=0, name='age'),
    CategoricalAttribute(index=1, name='sex'),
    NumericAttribute(index=2, name='income'),
]

dataset = Dataset(X, y, attributes)
```
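Each attribute's `index` refers to the corresponding column of `X`; the `name` appears to be there for display purposes.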
The logic for finding the best split differs for each attribute type, and in the future there may be additional type-specific parameters (such as importance or number-to-name mappings) useful for classification or display.
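For intuition, here is a minimal sketch of how candidate splits could differ by type; the function names and return format are illustrative assumptions, not this repository's API:

```python
# Illustrative only: how candidate splits might be enumerated per type.
def categorical_splits(column):
    # one candidate branch per distinct value
    return [("==", value) for value in sorted(set(column))]

def numeric_splits(column):
    # candidate thresholds at midpoints between consecutive sorted values
    values = sorted(set(column))
    return [("<=", (a + b) / 2) for a, b in zip(values, values[1:])]

print(categorical_splits(["m", "f", "m"]))  # [('==', 'f'), ('==', 'm')]
print(numeric_splits([3, 1, 2]))            # [('<=', 1.5), ('<=', 2.5)]
```

Each candidate split would then typically be scored with an impurity measure such as Gini impurity or information gain, and the best-scoring one chosen.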