A Python module for constructing a decision tree from multidimensional training data and for using the decision tree for classifying new data
Version 2.2.6 fixes a couple of serious bugs in the module, one in the method that reads training data from a ‘.dat’ file and the other in the method used for calculating the probability of a feature acquiring a particular value. The consequences of the latter bug are serious if your training data file has a large number of zero values for the features.
Version 2.2.5 changes are limited to the part of the module that generates synthetic data for experimenting with decision tree construction and classification. There were a couple of bugs in this part of the module that are now fixed.
Version 2.2.4 should prove more robust when the probability distribution for the values of a feature is expected to be heavy-tailed; that is, when the supposedly rare observations can occur with significant probability.
As for the purpose of the module: once you have placed your training data in a CSV file, you only need to supply the name of the file to the module and it does the rest; classifying a new data sample then takes little effort on your part.
A decision tree classifier consists of feature tests arranged in the form of a tree. The feature test associated with the root node is the one that can be expected to maximally disambiguate the different possible class labels for a new data record. From the root node hangs a child node for each possible outcome of the feature test at the root. This maximal class-label disambiguation rule is applied recursively at the child nodes until you reach the leaf nodes. A leaf node corresponds either to the maximum depth desired for the decision tree or to the case when there is nothing further to gain by a feature test at the node.
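The recursive rule described above can be illustrated with a small, self-contained Python 3 sketch. It is not the module's own code: it assumes purely symbolic features, uses the weighted class entropy that remains after a split as the measure of "disambiguation", and treats an entropy below a threshold as "nothing further to gain". All function names and the toy records are hypothetical.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def split_by_feature(records, feature):
    """Group (features_dict, label) records by the value of one feature."""
    groups = {}
    for features, label in records:
        groups.setdefault(features[feature], []).append((features, label))
    return groups

def best_feature(records, features):
    """Pick the feature whose test leaves the lowest weighted class entropy."""
    def remaining_entropy(feature):
        groups = split_by_feature(records, feature)
        return sum(len(g) / len(records) * class_entropy([lbl for _, lbl in g])
                   for g in groups.values())
    return min(features, key=remaining_entropy)

def build_tree(records, features, max_depth, entropy_threshold=0.01):
    """Return a leaf (class label) or an internal node (feature, children)."""
    labels = [label for _, label in records]
    # Leaf: depth limit reached, no features left, or labels already (nearly) pure.
    if max_depth == 0 or not features or class_entropy(labels) < entropy_threshold:
        return Counter(labels).most_common(1)[0][0]
    feature = best_feature(records, features)
    rest = [f for f in features if f != feature]
    # One child per observed outcome of the chosen feature test.
    children = {value: build_tree(group, rest, max_depth - 1, entropy_threshold)
                for value, group in split_by_feature(records, feature).items()}
    return (feature, children)

def classify(tree, sample):
    """Follow the feature tests from the root down to a leaf label.

    Raises KeyError if the sample has a feature value never seen in training.
    """
    while isinstance(tree, tuple):
        feature, children = tree
        tree = children[sample[feature]]
    return tree

# Toy, made-up training records: "shape" alone determines the class.
records = [
    ({"shape": "round",  "color": "red"},  "A"),
    ({"shape": "round",  "color": "blue"}, "A"),
    ({"shape": "square", "color": "red"},  "B"),
    ({"shape": "square", "color": "blue"}, "B"),
]
tree = build_tree(records, ["color", "shape"], max_depth=8)
print(tree)                                                  # ('shape', {'round': 'A', 'square': 'B'})
print(classify(tree, {"shape": "square", "color": "red"}))   # B
```

In this toy data the test on "shape" removes all class uncertainty, so it is chosen at the root and both children become leaves immediately. Numeric features, which the module also handles, would need a threshold-based test and are left out of this sketch.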
Typical usage syntax:
    import DecisionTree

    training_datafile = "stage3cancer.csv"
    dt = DecisionTree.DecisionTree(
             training_datafile = training_datafile,
             csv_class_column_index = 2,
             csv_columns_for_features = [3,4,5,6,7,8],
             entropy_threshold = 0.01,
             max_depth_desired = 8,
             symbolic_to_numeric_cardinality_threshold = 10,
         )

    # Read the training data and compute the probabilities the tree needs
    dt.get_training_data()
    dt.calculate_first_order_probabilities()
    dt.calculate_class_priors()
    dt.show_training_data()

    # Build the tree and display it
    root_node = dt.construct_decision_tree_classifier()
    root_node.display_decision_tree("   ")

    # Classify a new sample, given as 'feature = value' strings
    test_sample = ['g2 = 4.2',
                   'grade = 2.3',
                   'gleason = 4',
                   'eet = 1.7',
                   'age = 55.0',
                   'ploidy = diploid']
    classification = dt.classify(root_node, test_sample)
    print("Classification: " + str(classification))