Shih Data Miner

Installation And Starting Up

The program is written in Java, so it should run on most operating systems. It is offered free of charge and can be downloaded from the website http://www.Shih.be. The software is being actively developed, and features are added regularly. As a result, each release expires monthly, and you will need to download the latest version on the 21st of each month. An auto-update feature makes this easy to do.

 

The install process is very easy: unzip the file to a directory and run the file “runShihWindows.bat”, which starts the program. To start building a tree model, the user goes to the File menu and selects Tree->Set Parameters and build, as shown at right.

 



Model Creation: Setup Parameters and Options

Section 1: Data Input

In this section, a user will specify the data files to be used to construct the data model. Shih accepts only CSV files. A user can specify separate training and testing data sets, or can use the same data set for both, in which case the data is partitioned appropriately. If a model has been created and saved previously, it can be reused with the new data. In this case, the user would specify the model file to load.

 

Section 2: Tree Type Selection

Three types of trees can be constructed using the Shih software: regression, classification and probability trees. The type of tree depends on the type of target variable. If the target (dependent) variable is categorical, the tree generated will be a classification or probability tree. If the dependent variable is continuous or interval type, a regression tree will be built.

Section 3: Splitting Algorithm Selection

In this section, a user can select which splitting algorithm will be used in the tree building process. For classification trees, a user can select from the following algorithms: Gini, Symmetric-Gini, Entropy and Twoing. For regression trees, the user can pick either the ordinary least squares (OLS) or the least absolute deviation (LAD) algorithm.
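To make these criteria concrete, here is a minimal Python sketch of their textbook definitions. It is not Shih's own code: all function names are hypothetical, Symmetric-Gini is omitted, and the regression costs assume that OLS measures squared deviation from the node mean and LAD measures absolute deviation from the node median.

```python
import math
import statistics
from collections import Counter


def class_proportions(labels):
    """Return the fraction of cases that fall in each class."""
    total = len(labels)
    return {cls: n / total for cls, n in Counter(labels).items()}


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    return -sum(p * math.log2(p) for p in class_proportions(labels).values())


def impurity_decrease(parent, left, right, impurity):
    """Goodness of split: parent impurity minus weighted child impurity."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


def twoing(left, right):
    """Twoing criterion: (pL * pR / 4) * (sum over classes |p(j|L) - p(j|R)|)^2."""
    n = len(left) + len(right)
    pl, pr = class_proportions(left), class_proportions(right)
    diff = sum(abs(pl.get(c, 0.0) - pr.get(c, 0.0)) for c in set(pl) | set(pr))
    return (len(left) / n) * (len(right) / n) / 4.0 * diff ** 2


def ols_cost(values):
    """Regression node cost (assumed): sum of squared deviations from the mean."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** 2 for v in values)


def lad_cost(values):
    """Regression node cost (assumed): sum of absolute deviations from the median."""
    median = statistics.median(values)
    return sum(abs(v - median) for v in values)
```

For a node with four cases of class A and four of class B, a split that separates the classes perfectly recovers the whole parent impurity: `impurity_decrease(parent, left, right, gini)` equals the parent Gini value of 0.5.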

Section 4: Prior Probability Options

In the birth weight data, about 30 percent of the cases were low birth weight. The question arises whether this proportion is representative of the population at large. This section allows the user to specify what the correct proportions should be.

 

By default, all classes are treated as equally likely. Where this is not appropriate, the user can have the probabilities estimated from the data, using the whole data set, the training set, the test set or a mixture of the two. Where the sample class proportions are not representative of the population class probabilities, the user can instead specify the correct prior class probabilities. This option applies only to classification and probability trees.
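The effect of the prior setting on a single node can be illustrated with the usual CART-style reweighting, in which each class count in the node is scaled by the class prior divided by the class count in the sample. This is a sketch under that assumption; the source does not state which formula Shih uses, and the function and parameter names are hypothetical.

```python
def prior_adjusted_probs(node_counts, sample_counts, priors):
    """Class probabilities in a node after reweighting by prior probabilities.

    node_counts[c]   -- number of class c cases in the node
    sample_counts[c] -- number of class c cases in the whole sample
    priors[c]        -- assumed population probability of class c
    """
    # Weight of class c in the node: prior * (share of class c reaching the node).
    weights = {c: priors[c] * node_counts[c] / sample_counts[c] for c in priors}
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}
```

With a sample of 30 low and 70 normal birth weight cases and a node holding 15 of each, priors taken from the data (0.3 and 0.7) reproduce the raw node proportions of one half each, whereas equal priors raise the low birth weight probability in the node to 0.7.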

Section 5: Test Settings

If the user wishes to perform model validation, they will need to specify where to obtain the data used to test the model. Shih first builds a maximal tree and then uses backward pruning to obtain the tree that predicts best. To do this, it requires separate training and testing data sets. The testing data set can be obtained by partitioning the data into training and testing sets, or the user can specify a separate testing data file.

 

When the data set is not large enough to be separated into training and testing data sets, the user can choose the cross-validation technique. This creates N partitions of the data. N trees are then built, each using N-1 of the partitions, and each tree is validated on the one partition that was left out of its training. As mentioned earlier, cross-validation is a more parsimonious use of data and leads to better estimates of model error. In Shih, the final model is then trained using the entire data set and pruned using the error estimate obtained from the cross-validation.
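The partitioning step described above can be sketched in a few lines of Python. This is a generic illustration of N-fold cross-validation, not Shih's implementation; the function names and the `seed` parameter are hypothetical.

```python
import random


def k_fold_partitions(n_cases, n_folds, seed=0):
    """Shuffle the case indices and deal them into n_folds partitions."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validation_splits(n_cases, n_folds, seed=0):
    """Yield (train, test) index lists; each partition is held out exactly once."""
    folds = k_fold_partitions(n_cases, n_folds, seed)
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test
```

Changing `seed` changes which cases land in which partition, which is the kind of randomization that leads to slightly different cross-validation error estimates from run to run.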

 

In general, tree model instability is a result of a class variable that is inherently difficult to predict. This results in cases where many of the splitting attributes are very close to each other in terms of goodness of split, and small variations in the data are enough to result in different splits being picked. When random split sampling is used, this instability can be made much worse, for two main reasons. The first is that the final model is built with less data, so there may be types of cases that are never considered when the model is trained. The second is that, as different cases end up in the train and test samples, very different looking trees result.

 

This is far less of a problem when using the cross-validation technique. Since the final model is built using all the data, the final unpruned tree will always be the same. In Shih, however, the final pruned tree will differ slightly as different randomizations of the data are applied, because Shih uses the cross-validation estimate of the error to prune the final tree model.

 

Different randomizations will tend to produce slightly different cross-validation estimates of the error rate. Since this estimated error rate is what is used to prune the final tree, the pruning will differ depending on the estimate. The result is that the final model will differ slightly with different randomizations of the data.

 

This type of instability is, however, not a serious problem as the performance of the model is quite similar with different randomizations of the data and the decision rules of the different tree models will be comparable.

Section 6: Create Model

Once the necessary parameters have been set, the user will push this button to start the automatic tree building process.

Section 7: Tree Growth Control & Randomization

Shih will first try to build a maximal tree. The option “Minimum parent node size” provides a stopping rule: the user sets the minimum number of cases a node must contain for a split search to be carried out on it.

 

Section 8: Look ahead

By default, Shih will select the splitting variable by using the “greedy” approach. This means that at every step it will pick the best variable to perform the split by using information available only at that node. This is known as local optimization. It does not look forward to see how this choice of variable will affect future splits.

 

At each level of the tree, the number of child nodes grows exponentially and thus as one looks forward, the information required to process also grows exponentially. The greedy approach is thus chosen for the sake of computational efficiency. This is especially the case for large classification problems.

 

Still, Shih allows the user to specify whether it should look 1 or 2 levels ahead. This increases the computation time and the memory required, but it can yield better tree models.

 

Section 9

Variable selection and the setting of custom misclassification costs and priors are done by selecting the appropriate tab at the bottom of the window shown above. This is shown below.

Section 9(a): Variable Selection

Variable selection is accessed by clicking on the “Variables” tab. In this section, a user chooses the dependent variable and the predictor variables. For each variable, the user must specify whether it is categorical. Shih does not detect this reliably, so it is important for the user to check it. See below.

Section 9(b): Misclassification Costs and Priors

To access this area, the user should click on the class parameters tab in section 9. By default, Shih assumes equal prior probabilities and equal misclassification costs. Both these settings affect how precise a tree will be for each class, and they have a large influence on the tree building process.

 

It is often the case that the costs of misclassification are not equal. One such case is medical diagnosis, where it is often more dangerous to wrongly classify an ill patient as healthy than it is to wrongly classify a healthy patient as ill. In such cases more precision is often required when classifying a patient as healthy, even at the expense of misclassifying more healthy patients as ill. By default, equal misclassification costs are given to every class of the target variable, i.e. each has a cost of 1. The user can change this to suit the need. This will influence how precise the tree model tries to be with certain classes as opposed to others.
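A short Python sketch shows how unequal costs can change the class assigned to a node. It uses the standard minimum expected cost rule and is not taken from Shih itself; the function name and the layout of the cost table (indexed as true class, then predicted class) are assumptions made for the illustration.

```python
def assign_class(class_probs, costs):
    """Return the predicted class with the lowest expected misclassification cost.

    class_probs[c]     -- probability of true class c in the node
    costs[true][pred]  -- cost of predicting pred when the true class is true
                          (0 on the diagonal)
    """
    def expected_cost(predicted):
        return sum(p * costs[true][predicted] for true, p in class_probs.items())

    return min(class_probs, key=expected_cost)


node = {"ill": 0.3, "healthy": 0.7}

# Default: every misclassification costs 1.
equal_costs = {
    "ill": {"ill": 0, "healthy": 1},
    "healthy": {"ill": 1, "healthy": 0},
}

# Hypothetical setting: calling an ill patient healthy is five times as costly.
unequal_costs = {
    "ill": {"ill": 0, "healthy": 5},
    "healthy": {"ill": 1, "healthy": 0},
}
```

With equal costs the node is labelled healthy (expected cost 0.3 against 0.7). With the unequal costs, labelling it healthy has an expected cost of 1.5 against 0.7 for ill, so the same node is labelled ill.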

 

In cases where the class proportions in the sample do not reflect the real prior probabilities, the user can choose to enter the real prior probabilities. The user must choose “User Specified” in section 4 in order to be able to change these priors.

 

Model Reporting and Testing Features

This program features good interactive and reporting facilities. A user can interactively grow and prune the tree and has very fine control of the tree building process. This part of the program features a large window that is divided into four separate sections: the Tree Panel, the Splitting Center, the Pruning Sequence Panel and the Zoom View (see next).

 

 

Section 1: Tree Panel

This panel has four different modes of display. These are tree topology, lift curves, classification info and scored data. To switch between these modes, a user can click on the tab at the bottom as shown below.

Section 1a: Tree Topology

Tree topology shows the tree in detail. For each node and leaf of the tree, the following information is displayed (see picture below):

The first row displays the number of cases from the train and testing data sets. The second row displays the classification of the node. The third row displays the percentages of cases correctly classified in both the train and test data sets. The fourth row shows the percentages of misclassified cases. The last row in the node shows the variable chosen to perform the split. Below each node are 2 bars, which show the misclassification rate graphically. The top bar is for the training data set and the second for the testing data set. The misclassified proportion appears as the red portion of the bar.

 

At each node, the user can choose to split the node either manually or automatically. This is done by right clicking on a node and selecting from the menu as shown below:

Section 1b: Lift Curves

In the lift-curves mode, lift and gain charts can be displayed for all the levels (classes) of the target variable. See below:

 

In the picture above there are 2 classes (A and B) and 12 leaves, and class A is being shown. The table above the charts displays the following information for each leaf (terminal node) of the tree:

  • Profile: Each leaf is given a reference number and corresponds to a profile given by the rule that defines the path from the root node to the leaf node.
  • Class # cases: Number of cases in the leaf that belong to the class A.
  • Class % Profile: Percentage of cases in the leaf that belong to class A.
  • Class % in Pop.: Number of class A cases in the node as a percentage of the total number of class A cases in the data (in this case in the train data since train is selected).
  • Cum % Class: Cumulative number of class A cases as a percentage of the total number of class A cases in the population (in this case in the train population since train is selected).
  • Cum % Pop.: Cumulative percentage of the total number of cases in the population.
  • Pop % profile: Percentage of cases in the profile (node) with respect to the population.
  • Cases in profile: Total number of cases in the node.
  • Cum. lift: Cumulative percentage of class A cases divided by the cumulative share of the total number of cases, i.e. Cum % Class / Cum % Pop.
  • Lift index: Percentage of all cases of a given class (such as low birth weight cases) that fall in the node, divided by the percentage of all cases that fall in the node, i.e. Class % in Pop. / Pop % profile
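The column definitions above can be worked through with a small Python sketch. It is an illustration only: the function name and dictionary keys are hypothetical, the leaf counts are made up, and Cum % Class is assumed to be measured against the total number of class A cases, which is what makes the cumulative lift end at 1.

```python
def lift_table(leaves):
    """Build lift-table rows from (profile, class_cases, cases_in_profile) tuples.

    Rows are ordered by the share of the leaf that belongs to the class,
    highest first. Every leaf is assumed to contain at least one case.
    """
    total_class = sum(class_cases for _, class_cases, _ in leaves)
    total_cases = sum(cases for _, _, cases in leaves)
    ordered = sorted(leaves, key=lambda leaf: leaf[1] / leaf[2], reverse=True)

    rows, cum_class, cum_cases = [], 0, 0
    for profile, class_cases, cases in ordered:
        cum_class += class_cases
        cum_cases += cases
        class_pct_in_pop = 100.0 * class_cases / total_class
        pop_pct_profile = 100.0 * cases / total_cases
        cum_pct_class = 100.0 * cum_class / total_class
        cum_pct_pop = 100.0 * cum_cases / total_cases
        rows.append({
            "profile": profile,
            "class_cases": class_cases,
            "class_pct_profile": 100.0 * class_cases / cases,
            "class_pct_in_pop": class_pct_in_pop,
            "cum_pct_class": cum_pct_class,
            "cum_pct_pop": cum_pct_pop,
            "pop_pct_profile": pop_pct_profile,
            "cases_in_profile": cases,
            "cum_lift": cum_pct_class / cum_pct_pop,
            "lift_index": class_pct_in_pop / pop_pct_profile,
        })
    return rows


# Made-up example: three leaves, 50 class A cases among 200 cases in total.
rows = lift_table([(1, 30, 40), (2, 15, 60), (3, 5, 100)])
```

In this example the first leaf holds 60% of the class A cases in 20% of the population, so its lift index is 3. The cumulative lift falls from 3 to 1.8 and reaches 1 once every leaf has been included.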

This table can be sorted according to any of the columns. In the picture above the rows are ordered according to the percentage of cases in the leaf that belong to class A. This means the leaf node with the highest proportion of class A cases appears first and the leaf node with the lowest proportion appears last.

 

There are three charts displayed below the table, as shown above. These are the Gain chart, the Lift chart and the Cumulative lift chart. The selected data set is the training data set, so its leaves are displayed in orange and the testing data set’s leaves are displayed in green.

 

The gain chart has the percentage of the total population on the x-axis and the percentage of the selected class on the y-axis. The picture above shows how the class A cases relate to the training data set as a whole. In this example, the gain chart shows that around 60% of the population reaches 75% of class A. The points in the graph indicate the leaf nodes in the tree.

 

The lift chart shows, for a given leaf node, the ratio of the percentage of the class that falls in the node to the percentage of the population that falls in the node. If, for example, this value is 3, class A cases are three times as concentrated in that node as in the data set as a whole.

 

The cumulative lift chart shows the ratio of the cumulative percentage of the class to the cumulative percentage of the population. It is just like the lift chart, but it accumulates the cases reached so far instead of looking at every node separately. At the end it always converges to 1, since 100% of a class will be reached when 100% of the population is taken into account.

Section 1c: Classification Info

In classification info mode, information is displayed showing how well each class is classified. This is reported for both the train and test data sets, and in each case the absolute numbers of correctly classified and misclassified cases are shown along with percentages.

 

Section 1d: Scored Data

In scored data mode, the real and predicted values are displayed for each case in the data set, together with the rule that was used for the classification. This rule is identified by the leaf (Profile number) that was used to classify the case. The last two columns display the probabilities associated with classifying a case in this leaf, which are related to the misclassification rate of the node.

 

 

 

Section 2: Splitting Center

By default the tree model is generated automatically, but the user is free to prune the tree and build it interactively. The splitting center lies at the heart of Shih’s interactive tree building process. This process happens on a node-by-node basis. When the user clicks on a node in the tree topology panel, information about the node as well as the whole model is displayed in this panel.

The list of predictor variables is shown on the top left. Information about the model as a whole is shown on the top right. The colored bars just below show the proportions of correctly classified and misclassified cases in the node. This is shown for both the train and test cases.

 

Below this is shown the splitting algorithm that was used in this node. When the tree is first generated, the same splitting algorithm is used on all the nodes. However, when interactively growing or modifying the tree, the user is free to change the splitting algorithm on a node-by-node basis.

 

The variable that will be used to perform the split can be modified in this section. To assist in this, the program will display a 'goodness of split' for each variable. This information is displayed by pressing the ‘G.O.S’ (goodness of split) button. When this is done, the variable list is shown as below.

 

The list of variables is now shown in order of the goodness of split. At this point, the user can pick the desired variable and push the split button.

 

For continuous or interval variables the user has the option of selecting the value on which to perform the split. To aid in selecting an appropriate cutoff value, the graph on the left is displayed. It shows the goodness of split for every possible cutoff value. The user is free to enter any cutoff value. Having done so, they can click on ‘GOS CO’ to display the graph on the left around the chosen cutoff value, or they can click on ‘SPLIT CO’ to perform the actual split on the chosen cutoff value. See the picture below.
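The idea behind the cutoff graph can be sketched in Python: evaluate the goodness of split at each candidate cutoff and pick the best one. This is an illustration only. It assumes Gini impurity decrease as the goodness of split and midpoints between adjacent distinct values as the candidate cutoffs; neither choice is confirmed for Shih, and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def cutoff_scan(values, labels):
    """Return (cutoff, goodness) pairs for a continuous predictor.

    Cases with a value below the cutoff go left, the rest go right.
    Goodness is the parent impurity minus the weighted child impurity.
    """
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    n = len(pairs)
    parent = gini([label for _, label in pairs])
    results = []
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no cutoff can separate two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        results.append(((low + high) / 2, parent - weighted))
    return results


scan = cutoff_scan([1, 2, 3, 4], ["A", "A", "B", "B"])
best_cutoff, best_goodness = max(scan, key=lambda item: item[1])
```

Here the scan considers the cutoffs 1.5, 2.5 and 3.5, and 2.5 is the best because it separates the two classes completely.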

 

 

Section 3: Pruning Sequence Panel

When the tree is generated automatically, this panel displays a graph. It shows the relative cost of the tree as it was pruned from the full tree down to a single node. The tree with the lowest cost is chosen automatically, but the user can pick other trees as well. It is also possible to save a selected tree by pushing the store button. This does not save the tree to disk but just keeps it in memory. A user can store several of these trees and switch between them to compare them.

 

Section 4: Zoom View

To help visualize large trees, a zoom view is displayed. A user can jump to different parts of the tree using it. Clicking anywhere on this pane moves a rectangular box to that place. This rectangle represents what is visible in the tree topology pane (section 1a), and moving it changes what is visible there.

 

Advantages & Disadvantages

The main strength of Shih lies in its inherent simplicity. It is a small software package and does what it does well. The interface is well designed for interactive tree building and the visualization options are quite good. The main weakness of Shih is also that it is a small package and the user will be quite limited in the number and variety of data mining tools at their disposal.