Classification Trees

Dataset:

Throughout this website, we will be using a particular dataset to both illustrate the software and explain the process. This dataset is from a paper Applied Logistic Regression by Hosmer, D.W. and Lemeshow, S and was collected at the Baystate Medical Center, Springfield, Massachusetts in 1986. Data from 180 pregnant women was collected and for each the following 10 attributes were recorded:

The attributes are as follows:


id: case id or label for the instance.
weight: with values low/normal.
age: mother's age in years
lwt: mother's weight in pounds at last menstrual period
smoke: smoking status during pregnancy (1 = yes, 0 = no)
ptl: number of previous premature labours (0,1,2,..)
ht: history of hypertension (1 = yes, 0 = no)
ui: presence of uterine irritability (1 = yes, 0 = no)
ftv: number of physician visits during the first trimester
bwt: birth weight in grams

The class attribute is weight and a value of ‘low’ is an indicator of birth weight less than 2.5kg. The last attribute, bwt, will not be used and the idea is to be able to predict the class attribute (weight) by using the other attributes. There were a total of 50 cases with low birth weight babies and the rest had normal birth weight babies.

Methodology

Decision trees are one of the most widely used and practical forms of machine learning and data mining. They have been widely researched and applied to a large variety of data mining problems. Decision tree models are built by a process that is known as recursive partitioning.

Building the Tree Model

First, the original data is broken up into 2 or more non-overlapping sub-samples. The original sample is also known as the root node and each of the sub-samples is referred to as a node. The partitioning is done based on one of the independent variables known as the splitting attribute. Branches are drawn for different values of this splitting attribute. Each instance in the root node is sent down one of the branches (depending on its value for the splitting attribute) into one of the nodes. The choice of splitting attribute is done by picking the attribute that will partition the original sample into sub-samples that are as homogenous as possible in relation to the class variable. It is a similar idea when choosing on what values of the splitting attribute to perform the partitioning.

As a second step, the process is repeated for each of the nodes created in the first step. Each node is partitioned by considering only the cases in that node and new nodes are created from instances in that node alone. This process is repeated for each node until some stopping-rule is violated. When this happens, the node is not further partitioned and this node is referred to as leaf node. The whole process is terminated when there are only leaf nodes left.

The process described above is how tree-modeling algorithms actually work but there are 3 issues we have not looked at. The first is how the splitting attribute is selected, the second is deciding upon what stopping rules need to be in place and the last is how nodes are assigned to classes.

Different algorithms have different approaches to dealing with each of these issues. We will next consider some tree building algorithms and look at how these remaining issues are dealt with.

Implementation

The ultimate goal of building a tree model is to end up with the smallest tree that has the purest leaf nodes. The purer a leaf node is, the more precise its classification is. A leaf node where all the instances in it are correctly classified is clearly superior to one where just of over half are correctly classified.

The benefit of having a smaller tree model is twofold. First, a smaller tree model is easier to interpret and the second is that such a model will tend to have larger leaf nodes. A leaf with 100 instances in it will have more predictive power than one that has just 1 or 2 instances. These two goals of tree size and leaf purity are often contradictory, where one furthers one goal at the cost of the other. There is of course an optimal tree model that best achieves both goals. In practice, though, it is often not possible to find the ideal tree model as the solution space of all tree models an algorithm has to search in is simply too large.

Different algorithms will tend to behave better or worse on some data as opposed to other data. Another thing to consider is that compared with other data mining methods, tree modeling is computationally expensive so while some algorithms are more powerful than others, they may not be practical for large problems where simpler and faster (but less powerful) algorithms will be more suitable.

The Algorithms

As mentioned before, decision tree models are one of the most widely researched and used methods for data mining and analysis. This has led to many different approaches and algorithms to deal with the problem of building a decision tree model.

An early and pioneering method of doing this is the ID3 algorithm, developed by Ross Quinlan of the University of Sydney. He first presented it in 1975 in his book, Machine Learning, vol. 1, no 1. A series of modifications and improvements on the ID3 algorithm culminated into the popular C4.5 algorithm. This algorithm was presented along with full source code in a book by Ross Quinlan (1993) entitled, "C4.5: Programs for Machine Learning". Another popular algorithm we will look at is the CART algorithm. CART stands for Classification and Regression Trees a statistical procedure introduced by Leo Breiman, Jerome Friedman, Richard Olsen and Charles Stone in 1984.

Splitting a Node

The goal of splitting up a sample is to get sub-samples that are more pure than the original sample. Purity in this case is referring to how homogenous the sub-samples are in relation to the class variable. The ideal situation is when each sub-sample consists of instances that have the same value for the class attribute i.e. completely pure nodes.

If there are M attributes, there will be a total of M splits to consider.

  • For numerical attributes the splits are binary in nature and the test is of the form {is Xm <= c?}.
  • For nominal attributes the splits can be binary or n-ary in nature and are of the form {is Xm in (c1, c2,...)}.

A commonly used technique is to choose a split that will create the largest and purest child nodes by only looking at the instances in that node. This technique is referred to the ‘greedy’ or the ‘local optimization’ approach. Its advantage is that it is computationally efficient regardless of the size of the problem.

Greedy Approach

For each node do the following:

  • Search each attribute to find the best split for it
  • Each of these splits is a candidate split, and there will be M candidate splits for the M attributes being considered
  • Compare the M splits and pick the best split
  • Some algorithms keep the second and third best splits as surrogate splits in reserve. These splits are ranked in order in how close they resemble the behavior of the primary splitting rule. These surrogate splits are used when predicting new instances that have missing values for the splitting attribute.

Splitting attributes are chosen based on a goodness of split. If we define an impurity function l(t) where t is any given node, then the goodness of split is defined to be the decrease in impurity resulting from the split. The next picture shows a candidate split s that will create child node T1 and T2 from T.

The goodness of split will be the difference between the impurity of node T and the sum of the impurities for the child nodes of T (in this case T1 and T2). The goal is to find the split with the greatest reduction in impurity.

In the split shown above, the goodness of split for split s defined as:

∆l(s,t) = l(t) –p1l(t1) - p2l(t2)

Where p1 and p2 are the proportions of the instances of t that go into t1 and t2 respectively and l(t) is the impurity function.

Exhaustive Approach

An alternative approach is to consider not just the children of the current node but all the sub-trees that result from the picking of a particular splitting attribute and pick the one that has the best leaf nodes. While this technique will guarantee the best tree models being generated, it is too computationally expensive to be practical as the number of sub-trees to consider at each step grows exponentially. In general, algorithms take the greedy approach or will look at most 1 or two levels of the resulting sub-trees to pick the splitting attribute. All the algorithms we will look at in this paper are local optimizers.

The Impurity Function

Before proceeding we need to define a conditional probability p(j|t). If there are j classes in all, the conditional probability p(j|t) is the probability of having a class j in node t. An estimate of this conditional probability is Nj/N. Where N is the total number of instances in the node and Nj is the number of class j instances in the node. The impurity of a node is a function of this conditional probability.

The two ways of defining this impurity function that we will look at are entropy and the gini index. The first is used by C4.5 and ID3 algorithms and the second is used in the CART algorithm.

Entropy

 The entropy impurity function is defined as follows:

This function will reach a maximum when all classes are evenly distributed in the node and it will be at a minimum if all instances in the node belong to one class. There is a bias in this measure of impurity in that it tends to favor highly branching attributes, i.e. nominal attributes with many levels. An extreme example of this is the ‘id’ attribute from the birth weight data.

If there are N instances in a given node, using this attribute to perform the split will generate N branches into N child nodes. Each node will be a pure node by virtue of the fact it is the only one there and so each will have entropy value of 0. The reduction of impurity (often referred to as the information gain or just gain) will thus be a maximum as the N child nodes will have a sum entropy of zero compared with the original node. This behavior is not surprising when one considers that the ‘id’ attribute will determine the class of an instance with no ambiguity. This is of course not desirable as the ‘id’ attribute does not, and is not intended to have any predictive power on new instances. It is usually common to omit such useless attributes from analysis, but the bias toward smaller nodes exists and the C4.5 algorithm takes a few extra steps to correct this.

The C4.5 algorithm does not use the information gain function as shown but instead uses something known as the gain ratio. The gain ratio takes into account the number and size of the generated child nodes from a candidate split. This is done by a measure that is called intrinsic information. If a given split will branch a node into X child nodes, the intrinsic information is calculated as follows:

 

Nx is the number of instances in child node X.

For a given split s on the node t, the gain ratio is then defined as the follows:

Intrinsic_Info(t) will be at maximum for the ‘id’ attribute but in general will increase as the number of branches and the number of small child nodes increase. The gain ratio measure will thus penalize highly branching attributes and will avoid the bias associated with just using the information gain measure.

In many cases though, just using the gain ratio measure alone can lead to problems as it can overcompensate for the bias. A given split can be selected just because it has a small value for the intrinsic information. A common workaround is to consider first, only splits that have above average information gain and then compare their gain ratios.

Gini Index

The Gini index impurity function is defined as follows:

Like the entropy function, this measure will reach a maximum when all classes are evenly distributed in the node and it will be at a minimum if all instances in the node belong to one class. There are no issues with the bias discussed previously as the CART algorithm that uses the gini index is a binary split algorithm and does not have to deal with highly branching splits.

Assigning Classes to Tree Nodes

Every node in a tree carries with it a particular classification. This classification is usually determined by a simple majority. In a given node, the class attached to it will be the class that is most well represented by the instances in the node. Leaf nodes are different in that their classification is final, and it is from them that a model’s predictive performance is determined.

Each node will have an error rate, say e, which is the proportion of misclassified instances in it. The probability that a particular classification will be correct is then simply 1-e. The probability of a correct prediction from the model is then the weighted average of these probabilities from each leaf. These estimates can be based on training data or on a separate and independent test data used to validate the model.

Stopping Rules and Building the Final Model

Both C4.5 and CART are backward pruning algorithms. This means that they will grow a tree until it is not possible to grow it any further and thus the only stopping rule is when there are only 2 instances left in a node. When all nodes are like this, the tree growing process will end. This will of course lead to very large trees that over-fit the data.

Pruning will be necessary to build smaller tree models that perform better on new data and not just on the training data. The idea is to remove leafs that have a high error rate. There are two methods of pruning that both CART and C4.5 algorithms use. The first is to use an independent testing sample, usually made by holding back a proportion of the data in reserve and building the model with the remaining data. The testing data is then used to estimate the error rate for each node.

Working back from leafs upwards, each node’s error rate is compared with the weighted average of the error rates of all the leaf nodes in its subtree. If the error rate of the node is lower, the whole subtree is removed and the node in question is changed to be a leaf node. If it is not lower then the subtree is left intact and the node above will be examined next. This continues until the root node is reached. In this way, the tree model is reduced in size up to a point where further reduction will not yield lower error rates.

The method described above is referred to as reduced-error pruning[1]. Its main disadvantage is that some of the data is being held back for pruning and that this data cannot be used in helping build a better tree model. This can be a serious problem when dealing with small datasets.

An alternative method is to use the training data to do the pruning. Each node in the tree model has a certain number of instances that are misclassified, say E out of a total of N instances in the node. The training error rate (we will call f) for each node is then simply E/N. It is obvious that since we are using training data to estimate the error rate that it will always underestimate the true error rate.

A workaround this problem is to make a pessimistic estimate for the error rate by taking a confidence interval, c, and using the upper bound of this interval as the estimate for the true error rate (we will call q). It is pessimistic because we are taking the upper range of the interval as the estimate instead of stating the estimate as a confidence range. In the C4.5 algorithm, the value for c defaults to 25%. It is thus as follows:

The value of z that corresponds with a c=25% is z=0.69. The formula to calculate the pessimistic estimate of q from the training error estimate f is thus shown below.

So if a given node had 30 instances in it and a training error estimate of 33% the pessimistic estimate for the true error rate is then

Once the error estimates for each node have been calculated, pruning will proceed as described before. With this approach to pruning, what happens is that the model is built two times.  The first time a portion of the data is held back for estimating the model performance and the model is built with the remaining data. Pruning on this model is done with this training data alone. The pruned model is then validated using the validation portion of the data and the estimate for the expected model error rate is then obtained using this validation data.

The final model is then built again using the entire dataset and pruned the same way as in the first time. The error rate quoted for this model, however, is the one obtained from the validation of the first model. In this way the final model benefits from having more data for training and at the same time the estimate for the expected error rate for the final model is reasonably close to the true error rate for the model.

These pessimistic error estimates serve only as a heuristic as they stand on some shaky assumptions. However, the estimates have a correct qualitative behavior and in practice seem to perform well[2].

Estimating the error rate using V-Fold Cross Validation

An even more data parsimonious use of data is to use V-Fold cross validation to estimate the error rate. Usually a 10-fold cross validation is used and in this case, there will be a total of 10 tree models generated and validated. The final model will be created using the entire data set, but the error rate quoted for this final model will be the average of the 10 models previously constructed. This allows the use of the maximum amount of data to create the model at the same time getting a good estimate for the expected error rate. The error estimate obtained using the cross validation is usually a better estimate of the true error rate than that obtained by just having 1 validation dataset as the whole dataset will be used to estimate the true error rate.

Moving From Trees to Rules

It is possible to deconstruct a tree model into a set of rules and the way is to trace every possible path from the root node to a leaf. This set of rules is then pruned to remove unnecessary complexity and/or redundancy. The rule set is thus reduced to a core set of rules.

Data Mining with Decision Trees

Decision tree learning is generally best suited to problems where instances are represented by attribute-value pairs and are described by a fixed set of attributes (e.g., temperature) and their values (e.g., hot).

The easiest situation for decision tree learning occurs when each attribute takes on a small number of disjoint possible values (e.g., hot, mild, cold) though other type of attributes are handled quite easily.

Tree models are also very good at handling noisy training data where the errors may lay in the independent variables, in the dependent variable or in both. They also offer a good deal of flexibility in handling datasets with missing values.

Advantages of Tree models

A more in depth overview of the advantages that tree models offer over several other methods (such as generalized linear models) is shown below:

  • Decision tree methods tend to produce models that are easy to interpret. At each non-terminal node, a decision is based upon just one predictor variable and this makes it easy to follow. For example, to explain particular classification one need only look at the series of simple decisions that led to it. The final tree model can in-fact be cast into a set of rules one can follow to classify a given case. In comparison, generalized linear models use linear combinations of variables that can be difficult to interpret or explain.
  • Tree models make no assumptions about the distribution of the underlying data and they are thus a non-parametric procedure. This can be especially useful if the distribution of the data is indeed unknown, something that happens a lot in practice.
  • Decision tree methods are easily able to handle both categorical and continuous variables.
  • Decision tree methods have a built-in feature selection method that makes them immune to the presence of useless variables. Such variables are ignored and they do no affect the tree building process. This is a common problem with over-parametized datasets.
  • Tree models are very adept at revealing complex interactions between variables. Each branch of a tree can contain different combinations of variables and the same variable can appear more than once in different parts of the tree. This can reveal how a variable can depend on another and in what specific context this dependency exists.
  • Decision tree models are extremely robust to the effect of outliers. This is so because the models are constructed in a frequency-based technique where one is counting the instances in a split. Outliers that occur in the independent variables do not affect the tree growing process because the values used to split each node are not likely to be on the outlier values. Outliers in the dependent variable go into their own nodes and do not affect the rest of tree.
  • Tree models offer several ways of dealing with missing values that can often minimize or eliminate the effect of such values on model performance. In many other methods the whole set of instances with missing values would be omitted from analysis. This advantage is present in both the when training the tree model or when applying the model on new data for class prediction.
    • In cases where a missing value is significant in itself, such as when the missing entries were caused by some systematic problem and not just random occurrence), the missing value can be treated as another possible value of the variable. In this case, the model can be constructed as usual.
    • In the case where the missing value is not significant in itself, it may still not pose a problem in the training process if there are no instances where with a particular instance reached a node where it is missing the splitting variable.
    • If the splitting attribute value is missing, the instance with the missing value can be re-weighted so that pieces of it go down each branch in proportion to the number of cases going down each branch. This has the effect not making a decision on this instance, but its information will be used on all the lower nodes[3]. This may lead to nodes have 23.4 instances in them, but this not a problem as the measures of impurity used can be adapted to handle pieces of instances as opposed to only whole instances.
    • When using the model for predictions, new data with missing values can still be handled. Several tree-building algorithms, such as CART, keep surrogate splits, which are alternative splits to the chosen splitting attribute. If a case in question is missing a relevant splitting attribute on a given node, then the second best split (or third best and so on…) is used to process it[4].

Disadvantages of Decision Tree Models:

Decision trees are indeed powerful data mining tools, but they are not perfect and are suited to certain types of problems. The main weaknesses in decision tree modeling are listed below:

  • Classification trees can be unstable and small variations in the data (such as that made by randomization) can cause very different looking trees being generated. This is especially so in cases where the goodness of splits for the different variables are close to each other in value. In this case, a small variation in the data is enough to influence the picking of on split over another.
    • Many of these problems are dealt with by using stratified sampling when creating testing and training datasets. In this case, the training and testing datasets have the same class distribution.
    • Such problems are also a symptom of a dataset that is too small. In such cases, it is important to use the entire dataset to train and test the model. This is the idea behind the cross-validation technique. Since the final model is built on the whole dataset, randomization of this same data will not cause too much instability.
  • Some the tree models generated may be very large and complex. This can make the model difficult to interpret, understand or justify.
  • Tree models really shine when applied to classification problems; unfortunately, they are not very good at estimation tasks(as in regression), i.e. having numerical dependent variable.
  • Decision tree models are computationally expensive to train. Big ’O’ notation is a way of expressing algorithmic complexity given a certain size of problem. For example, if an algorithm needs to do n operations to handle a problem of size n it is said to have an order complexity of O(n). This means that the complexity of an algorithm grows nearly linearly with the size of the data.

The C4.5 algorithm, when constructing a tree with a dataset of size n and each instance having m attributes is said to have an order complexity of [5]. Therefore, if for a dataset with 100 instances and 10 attributes, it needs to do roughly 6726 operations. For a dataset with 10,000 instances and 10 attributes, it will need to do 1,769,337 operations. Therefore, as the size of the dataset increases the number of operations needed to build the tree model grows much faster.


[1] Ian H. Witten and  Eibe Frank, Data Mining : Practical Machine Learning Tools and Techniques with Java Implementations, by  (Paperback - October 11, 1999) 162-164
[2] Witten, Frank 167
[3] Witten, Frank 161-162
[4] Beullens Martine, Decision Trees, 2003, Notes of a short course at Ludit, K.U.Leuven, Surrogate Splits, pg 43, 34,25
[5] Witten, Frank 168