Classification

Weka offers a comprehensive set of classification tools. Many of these algorithms are very new and reflect an area of active development. We examine only the tree-based classifiers, which are a small part of the classification methods available in Weka: there are 11 tree algorithms out of 71 algorithms in all.

Data Analysis Steps in Explorer

To analyze data in the Explorer, follow these steps:

  1. Select an algorithm
  2. Set up its options
  3. Set up the sampling options
  4. Set up the output options
  5. Choose the class variable
  6. Analyze the output

Selecting an Algorithm

No single algorithm is an obvious choice. Different algorithms perform differently depending on the characteristics of the data, which is why Weka offers so many. Some algorithms can be used for both regression and classification, while others handle only one of the two. Some can handle only nominal attributes, while others can handle both nominal and ordinal or continuous attributes. The Explorer interface is limited to running one algorithm at a time and offers no tools for comparing methods; such comparisons are done in the Experimenter and KnowledgeFlow interfaces.

The 71 algorithms available in Weka are grouped into six categories: Bayes (Bayesian algorithms), Functions (function algorithms such as logistic regression and SVMs), Lazy (lazy algorithms, or instance-based learners), Meta (algorithms that combine several models, in some cases models from different algorithms), Trees (classification and regression tree algorithms), and Rules (rule-based algorithms). Short descriptions of these algorithms, along with explanations of the available options, are given in Reference Section 3.2. These descriptions have been extracted from the online help available within the Weka environment.

Setting Up Algorithm Options:

The options for a chosen algorithm are viewed and modified in the same way as for filters. Clicking on the name of the chosen algorithm opens a pop-up window in which the available options can be viewed and modified.

Sampling:

The choice of sampling method depends on the size of the dataset. Some methods, such as cross-validation, use the data more efficiently than a simple random split. In all cases, it is advisable to test the model on data that was not used for training, so that overfitting can be detected.

Weka supports both cross-validation and random split sampling. In both cases, stratification is used to ensure that each sub-sample has the same proportions for each of the levels of the class variable. In our example dataset, stratification would ensure that the proportions of normal birth weights and low birth weights are the same in each sub-sample as in the whole sample. Weka also supports specifying a separate, user-supplied test dataset. The default choice in the Explorer is 10-fold stratified cross-validation.
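
To make the idea of stratification concrete, the following is a minimal sketch in plain Java of how instances could be assigned to folds so that each fold keeps the class proportions of the whole sample. It is an illustration only, not Weka's own implementation; the class name StratifiedFolds, the method assignFolds, and the 70/30 birth weight data are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class StratifiedFolds {

    /**
     * Assigns each instance to one of k folds so that every fold has
     * roughly the same class proportions as the full dataset.
     * labels[i] is the class of instance i; the result holds its fold index.
     * (Illustrative sketch, not Weka's implementation.)
     */
    static int[] assignFolds(String[] labels, int k, long seed) {
        // Group the instance indices by class label.
        Map<String, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.length; i++) {
            byClass.computeIfAbsent(labels[i], c -> new ArrayList<>()).add(i);
        }
        Random rng = new Random(seed);
        int[] fold = new int[labels.length];
        int next = 0; // running counter, so fold sizes stay balanced across classes
        for (List<Integer> indices : byClass.values()) {
            Collections.shuffle(indices, rng); // random order within each class
            for (int idx : indices) {
                fold[idx] = next % k; // deal the class out to the folds in turn
                next++;
            }
        }
        return fold;
    }

    public static void main(String[] args) {
        // Hypothetical data: 70 normal and 30 low birth weights.
        String[] labels = new String[100];
        for (int i = 0; i < 100; i++) {
            labels[i] = i < 70 ? "normal" : "low";
        }
        int[] fold = assignFolds(labels, 10, 1L);
        for (int f = 0; f < 10; f++) {
            int size = 0;
            int low = 0;
            for (int i = 0; i < 100; i++) {
                if (fold[i] == f) {
                    size++;
                    if (labels[i].equals("low")) {
                        low++;
                    }
                }
            }
            System.out.println("fold " + f + ": size=" + size + ", low=" + low);
        }
    }
}
```

With this made-up data, every fold receives 10 instances, 3 of which are low birth weights, matching the 30% proportion in the whole sample. The seed changes which instances land in which fold but not the proportions.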

When the sample size is very small, bootstrapping can be used. Bootstrapping is a sampling-with-replacement procedure: instances are drawn at random from the original dataset, according to specified class probabilities, and placed into a new dataset, so the same instance may be drawn more than once. The resulting dataset can be many times larger than the original. The procedure is similar to the pre-processing we did with the PKIDiscretize filter.

In this case, we would use the Resample filter. This filter comes in both supervised and unsupervised versions. It is usually better to use the supervised filter so that the resampled data has the same class distribution as the original data. In the options for the filter, one can control the size of the resampled dataset as well as whether to leave its class distribution as is or to make it uniform.
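
The resampling described above can be sketched in plain Java as follows. This is a simplified illustration of sampling with replacement within each class, not the code of Weka's Resample filter; the class name BootstrapResample, the method names, the parameters sizePercent and uniform, and the 15/5 example data are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class BootstrapResample {

    /**
     * Draws a class-aware bootstrap sample (sampling with replacement).
     * labels[i] is the class of instance i; the result lists the drawn indices.
     * sizePercent: output size relative to the input (200 = twice as large).
     * uniform: if true, every class gets an equal share of the output;
     *          if false, each class keeps its original share.
     * Per-class quotas are rounded, so the total can be off by a few instances.
     * (Illustrative sketch, not Weka's Resample filter.)
     */
    static List<Integer> resample(String[] labels, double sizePercent,
                                  boolean uniform, long seed) {
        // Group the instance indices by class label.
        Map<String, List<Integer>> byClass = new LinkedHashMap<>();
        for (int i = 0; i < labels.length; i++) {
            byClass.computeIfAbsent(labels[i], c -> new ArrayList<>()).add(i);
        }
        int total = (int) Math.round(labels.length * sizePercent / 100.0);
        Random rng = new Random(seed);
        List<Integer> sample = new ArrayList<>();
        for (List<Integer> members : byClass.values()) {
            double share = uniform
                    ? 1.0 / byClass.size()
                    : (double) members.size() / labels.length;
            int quota = (int) Math.round(total * share);
            for (int n = 0; n < quota; n++) {
                // With replacement: the same instance may be drawn repeatedly.
                sample.add(members.get(rng.nextInt(members.size())));
            }
        }
        return sample;
    }

    /** Counts how many drawn instances belong to the given class. */
    static int countClass(String[] labels, List<Integer> sample, String cls) {
        int count = 0;
        for (int idx : sample) {
            if (labels[idx].equals(cls)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Hypothetical small dataset: 15 normal and 5 low birth weights.
        String[] labels = new String[20];
        for (int i = 0; i < 20; i++) {
            labels[i] = i < 15 ? "normal" : "low";
        }
        List<Integer> kept = resample(labels, 200.0, false, 42L);
        List<Integer> flat = resample(labels, 200.0, true, 42L);
        System.out.println("original distribution: size=" + kept.size()
                + ", low=" + countClass(labels, kept, "low"));
        System.out.println("uniform distribution:  size=" + flat.size()
                + ", low=" + countClass(labels, flat, "low"));
    }
}
```

With this made-up data and a size of 200%, both calls produce 40 instances. Keeping the original distribution yields 10 low birth weights (25%), while the uniform setting yields 20 (50%).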

Output Options:

It is possible to control what output the Explorer produces as a result of its analysis of the data. These options affect both the graphical and the text output generated by the Explorer. They are accessed by clicking on the “More Options” button just below the “Test Options” section. A description of each option when using tree models follows:

  • Output Model: This option controls whether the tree model generated will be displayed.
  • Output Per Class Stats: This option controls whether accuracy statistics will be displayed for each class level.
  • Output Entropy evaluation measures: Outputs measures such as K&B Information Score, Class Complexity and so on.
  • Output confusion matrix: This option controls whether a confusion matrix will be displayed.
  • Store predictions for visualizations: This option should be turned on if one intends to use the various graphical visualization tools available to analyze the generated model.
  • Cost Sensitive Evaluations: This controls whether all errors are treated equally (the default) or whether misclassifying certain classes is treated as more serious than misclassifying others. This option does not affect the training process, so the generated model will be the same. It affects only the prediction probability thresholds applied to the model, which are chosen to minimize expected cost. By default all misclassifications are treated equally, so in a two-class problem the threshold is effectively 50%: a class is predicted when its probability exceeds one half. If one type of misclassification is more costly, the threshold is shifted accordingly in order to minimize expected cost.
  • Random seed for XVal /% Split: This is used to enter a seed value that changes how the data are randomized.
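
The effect of misclassification costs on the prediction threshold can be illustrated with a short sketch in plain Java. It shows the general minimum expected cost rule, not Weka's internal code; the class name MinExpectedCost, the method names, the probabilities, and the cost values are assumptions made for the example. The sketch assumes a cost matrix whose rows are the actual class and whose columns are the predicted class.

```java
class MinExpectedCost {

    /**
     * Returns the index of the class with the lowest expected cost.
     * probs[j] is the model's probability for class j;
     * cost[j][k] is the cost of predicting class k when the true class is j.
     * (Illustrative sketch of the general rule, not Weka's implementation.)
     */
    static int predict(double[] probs, double[][] cost) {
        int best = 0;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int k = 0; k < probs.length; k++) {
            double expected = 0.0;
            for (int j = 0; j < probs.length; j++) {
                expected += probs[j] * cost[j][k];
            }
            if (expected < bestCost) {
                bestCost = expected;
                best = k;
            }
        }
        return best;
    }

    /**
     * Two-class case: the positive class is predicted when its probability
     * exceeds this threshold (correct predictions are assumed to cost nothing).
     */
    static double threshold(double costFalsePositive, double costFalseNegative) {
        return costFalsePositive / (costFalsePositive + costFalseNegative);
    }

    public static void main(String[] args) {
        // Hypothetical model output: P(normal) = 0.7, P(low) = 0.3.
        double[] probs = {0.7, 0.3};
        // Equal costs: every misclassification costs 1.
        double[][] equal = {{0, 1}, {1, 0}};
        // Assumed costs: missing a low birth weight is 5 times as costly.
        double[][] skewed = {{0, 1}, {5, 0}};

        System.out.println("equal costs  -> class " + predict(probs, equal));
        System.out.println("skewed costs -> class " + predict(probs, skewed));
        System.out.println("threshold, equal costs:  " + threshold(1, 1));
        System.out.println("threshold, skewed costs: " + threshold(1, 5));
    }
}
```

With equal costs the threshold on the probability of a low birth weight is 0.5, so the instance is predicted as class 0 (normal). With the assumed 5-to-1 costs the threshold drops to 1/6, and the same probabilities lead to class 1 (low). The model's probabilities are identical in both cases; only the decision rule changes.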

Choosing a class variable:

Weka makes it easy to use any attribute as the class variable. To do this, click on the dropdown button. A list of attributes is displayed, along with an indication of whether each one is nominal (Nom) or numeric (Num).

Depending on the type of learning algorithm chosen, the dependent variable may be restricted to nominal or numeric. For example, the REPTree algorithm can accept either nominal or numeric variables as the dependent variable while ID3 can only handle nominal dependent variables.