Weka Data Miner

Introduction:

Weka is a comprehensive set of advanced data mining and analysis tools. Its main strength lies in classification, where it covers many of the most current machine learning (ML) approaches. This project uses Weka version 3-4-4.

 

At its simplest, Weka provides a quick and easy way to explore and analyze data. It is also suitable for large data sets, where the resources of many computers and/or multi-processor computers can be used in parallel. We will examine different aspects of the software, with a focus on its decision tree classification features.

 

Data Handling:

Weka currently supports three external file formats, namely CSV, binary and C4.5. Weka also allows data to be pulled directly from database servers as well as web servers. Its native data format is known as the ARFF format. It is essentially a CSV (comma-separated values) format with some extra header lines that specify the type of each attribute (numeric or nominal, the latter including binary attributes). As will be shown next, it is very easy to add these headers manually and convert a CSV file into an ARFF file.

Below is an example of a small datafile in ARFF format:

 1.  %This is the weather data file
 2.  @relation weather
 3.  @attribute outlook {sunny, overcast, rainy}
 4.  @attribute temperature real
 5.  @attribute humidity real
 6.  @attribute windy {TRUE, FALSE}
 7.  @attribute play {yes, no}
 8.  @data
 9.  sunny,85,85,FALSE,no
10.  sunny,80,90,TRUE,no
11.  overcast,83,86,FALSE,yes
12.  rainy,70,96,FALSE,yes
13.  rainy,68,80,FALSE,yes
14.  rainy,65,70,TRUE,no
15.  overcast,64,65,TRUE,yes
16.  sunny,72,95,FALSE,no
17.  sunny,69,70,FALSE,yes
18.  rainy,75,80,FALSE,yes

 

All header commands start with ‘@’ and all comment lines start with ‘%’. Comment and blank lines are ignored. Line 1 is a comment line. Line 2 is a header command that names the dataset; in this case the data set is called ‘weather’.

 

Lines 3-7 define all the attributes in the data set. All such commands start with ‘@attribute’ followed by the name of the attribute and then the type of attribute. There are two main types of attributes, numeric and nominal. Numeric attributes are defined as either ‘real’, ‘integer’ or just ‘numeric’. Nominal attributes are defined by listing, in braces, all the possible values an attribute can take. In the above example, the attribute outlook is defined as nominal and can take the values: ‘sunny’, ‘overcast’ or ‘rainy’. Line 8 signifies that the header section is finished and the data section will begin. From this point on the format is exactly the same as a CSV file.
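
The manual conversion described above can also be sketched in a few lines of Python. This is only an illustration and not part of Weka: the function names are hypothetical, the type inference is deliberately naive (a column is treated as numeric only if every value parses as a number), and values that would need quoting in ARFF (for example, values containing spaces or commas) are not handled.

```python
import csv
import io


def infer_attribute_type(values):
    """Return 'numeric' if every value parses as a number, else a nominal spec."""
    try:
        for v in values:
            float(v)
        return "numeric"
    except ValueError:
        # Nominal: list the distinct values in order of first appearance.
        distinct = list(dict.fromkeys(values))
        return "{" + ", ".join(distinct) + "}"


def csv_to_arff(csv_text, relation):
    """Convert CSV text (first row = attribute names) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    names, data = rows[0], rows[1:]
    lines = [f"@relation {relation}"]
    for i, name in enumerate(names):
        column = [row[i] for row in data]
        lines.append(f"@attribute {name} {infer_attribute_type(column)}")
    lines.append("@data")
    # The data section is the CSV body, unchanged.
    lines.extend(",".join(row) for row in data)
    return "\n".join(lines)


weather_csv = """
outlook,temperature,humidity,windy,play
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
"""

print(csv_to_arff(weather_csv, "weather"))
```

Note that the nominal values are collected from the data itself, so a value that never occurs in the file would have to be added to the header by hand.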

 

The ARFF format does not specify which attribute is the class attribute. This is intentional: in some cases, such as clustering applications, there is no class attribute at all. In other cases there are many class variables, as with association rules, where one would like to test how well each attribute can be predicted from the other attributes.

Pre-Processing:

A key strength of the Weka system lies in what are called filters. Filters are algorithms that allow one to modify a dataset. They can be used to convert numerical attributes to nominal ones or nominal attributes into binary ones, to standardize numerical values, to remove instances with incorrect or missing values, to remove misclassified instances, and a lot more. Moreover, filters can be applied on top of each other. Weka’s filters give one a powerful set of tools to clean and prepare data for analysis.

 

Weka supports many filters, and for convenience they are organized according to whether they take class information into account (Supervised/Unsupervised) and whether they act on instances or attributes. A filter can thus be one of four types, i.e., supervised instance filter, unsupervised instance filter, supervised attribute filter and lastly, unsupervised attribute filter.

 

Supervised filters take class information into account, while unsupervised filters do not. Good examples of this are the two filters supervised discretization and unsupervised discretization. Both filters are designed to convert numerical attributes into nominal ones; however, the unsupervised filter does not take class information into account when grouping values into intervals. With such a filter there is always a risk that distinctions between instances that matter for the class are wiped out.

 

The supervised filter does not have the same problem because it takes the class information into account and tries to maintain the class distinctions in the grouped instances. The usefulness of class information and thus supervised filters will of course depend on the context and either type of filter is useful in many cases.
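
To make the risk concrete, here is a minimal Python sketch of equal-width binning, one common unsupervised discretization strategy. It is an assumption-laden illustration, not Weka's own implementation: the function name is hypothetical, and the humidity and play columns are copied from the weather data shown earlier. The bin boundaries are chosen from the attribute values alone, so the class mix inside each bin is whatever it happens to be.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Humidity and play columns from the weather data.
humidity = [85, 90, 86, 96, 80, 70, 65, 95, 70, 80]
play = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes"]

bins = equal_width_bins(humidity, 2)
for b in sorted(set(bins)):
    labels = [p for p, bin_id in zip(play, bins) if bin_id == b]
    print(f"bin {b}: yes={labels.count('yes')} no={labels.count('no')}")
```

A supervised method would instead place the cut points where they best separate the ‘yes’ and ‘no’ instances.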

The filters described above are examples of attribute filters, as they are designed to act upon specific attributes. Another example of an attribute filter is the ReplaceMissingValues filter. This filter scans all (or selected) nominal and numerical attributes and replaces missing values with the mode (for nominal attributes) or the mean (for numerical attributes).
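
The idea behind mean and mode replacement can be sketched as follows. This is a simplified Python illustration rather than Weka code; the function name and the per-column interface are assumptions made for the example. The only detail taken from the ARFF format is that a missing value is written as a question mark.

```python
from collections import Counter
from statistics import mean

MISSING = "?"  # ARFF marks a missing value with a question mark


def replace_missing(column, numeric):
    """Fill missing entries with the mean (numeric) or the mode (nominal) of the known values."""
    known = [v for v in column if v != MISSING]
    if not known:
        return list(column)  # nothing to estimate a replacement from
    if numeric:
        fill = mean(float(v) for v in known)
        return [fill if v == MISSING else float(v) for v in column]
    fill = Counter(known).most_common(1)[0][0]
    return [fill if v == MISSING else v for v in column]


print(replace_missing(["85", "?", "86", "96"], numeric=True))
print(replace_missing(["sunny", "?", "rainy", "sunny"], numeric=False))
```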

 

Weka also has another type of filter, called instance filters. An instance filter operates on whole instances (rows) of the data, for example by removing or resampling them, whereas an attribute filter operates on attributes (columns) across all instances.

 

Instance filters are also of the supervised or unsupervised types. A simple example of a supervised instance filter is the RemoveMisclassified filter. It takes a classification algorithm as a parameter and removes from the data the instances that were misclassified.
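
The following Python sketch shows the general shape of such a filter: it receives a training procedure as a parameter, trains a classifier, and keeps only the instances that the classifier labels correctly. It is a rough illustration under stated assumptions, not Weka's implementation. The function names are hypothetical, the classifier is a deliberately simple one-attribute majority rule, and the classifier is trained and evaluated on the same data. The instances use the outlook, windy and play values from the weather data.

```python
from collections import Counter, defaultdict


def train_one_attribute_rule(instances, attr_index):
    """Learn 'attribute value -> majority class' from (attributes, label) pairs."""
    counts = defaultdict(Counter)
    for attrs, label in instances:
        counts[attrs[attr_index]][label] += 1
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return lambda attrs: rule.get(attrs[attr_index])


def remove_misclassified(instances, train):
    """Keep only the instances that the trained classifier labels correctly."""
    predict = train(instances)
    return [(attrs, label) for attrs, label in instances if predict(attrs) == label]


# (outlook, windy) -> play
data = [
    (("sunny", "FALSE"), "no"),
    (("sunny", "TRUE"), "no"),
    (("overcast", "FALSE"), "yes"),
    (("rainy", "FALSE"), "yes"),
    (("rainy", "FALSE"), "yes"),
    (("rainy", "TRUE"), "no"),
    (("overcast", "TRUE"), "yes"),
    (("sunny", "FALSE"), "no"),
    (("sunny", "FALSE"), "yes"),
    (("rainy", "FALSE"), "yes"),
]

# Classify on outlook only (attribute index 0).
kept = remove_misclassified(data, lambda d: train_one_attribute_rule(d, 0))
print(len(data), "->", len(kept))
```

Because the classifier is passed in as a function, any other training procedure with the same interface could be substituted without changing the filter itself.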

 

To further illustrate the difference between supervised and unsupervised instance filters, we can compare the supervised filter StratifiedRemoveFolds and its related unsupervised filter RemoveFolds. Both filters allow one to select a specific cross-validation fold of the data. The supervised filter takes class information into account and makes sure that the selected fold is stratified appropriately, that is, that it roughly preserves the class proportions of the full data set. The unsupervised filter produces ordinary, unstratified folds.
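
The difference can be seen in a small Python sketch. Both functions below are hypothetical simplifications (there is no shuffling or seeding, for instance) and are not taken from Weka: the first assigns instances to folds by position only, the second deals the instances of each class across the folds in turn.

```python
from collections import Counter, defaultdict


def plain_folds(labels, n_folds):
    """Assign instance i to fold i % n_folds, ignoring the class."""
    return [i % n_folds for i in range(len(labels))]


def stratified_folds(labels, n_folds):
    """Deal the instances of each class round-robin across the folds."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    assignment = [0] * len(labels)
    next_fold = 0
    for indices in by_class.values():
        for i in indices:
            assignment[i] = next_fold
            next_fold = (next_fold + 1) % n_folds
    return assignment


def class_counts(labels, folds, fold):
    """Count the class labels of the instances assigned to one fold."""
    return dict(Counter(l for l, f in zip(labels, folds) if f == fold))


labels = ["yes", "no", "yes", "no", "yes", "no", "yes", "yes"]

for name, folds in [("plain", plain_folds(labels, 2)),
                    ("stratified", stratified_folds(labels, 2))]:
    for fold in range(2):
        print(name, "fold", fold, class_counts(labels, folds, fold))
```

With this particular ordering of the labels, the plain assignment places only ‘yes’ instances in the first fold, while the stratified assignment gives each fold a mix of both classes.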

 

Descriptions of all the filters available in Weka have been extracted from the online help in the Weka environment and are listed in Reference Section 3.1. The descriptions also include an explanation of the options available for each filter.

 

Knowledge Representation in Weka:

“Knowledge Representation” is a general term used to define the way structural patterns in data are represented. The form and function of the different representations can vary a great deal as the methods or algorithms that generate them also vary in their form and function.

 

These algorithms can, however, be broadly divided into three groups, namely clustering, classification and association rule algorithms. Clustering algorithms try to place each instance in a group with the instances most similar to it. As such, there is no class variable involved in such applications. Classification algorithms try to predict which class an instance belongs to based on its attribute values.

 

Association rule algorithms are similar to classification algorithms, except that they can predict any attribute rather than a single class attribute. Using our birth weight example, an association rule learner would explore not only how the class variable weight relates to the attributes but also how all the attributes are related to each other. So in this case, it would explore whether low birth weights occurred more among older women, whether smokers tended to be younger women, and so on.

 

We will only be looking at the classification algorithms in this project and we will focus on decision tree based classifiers.

 

Weka Classification Schemes:

Weka has a very long list of classification algorithms or schemes available. For convenience, the algorithms are grouped into six groups: Bayes, Functions, Lazy, Meta, Trees and Rules. We will be examining only the tree classifiers in this project. Descriptions of the different tree algorithms available in Weka can be found in the Reference Section.