Data Retrieval and Preparation

Getting the data

There are three ways of loading data into the Explorer: from a file, from a database connection, or from a file on a web server. We will load the data from a locally stored file.

 

Weka supports four file formats: CSV, C4.5, flat binary files, and the native ARFF format. To demonstrate the functionality of the Explorer environment, we will load a CSV file and then, in the following section, preprocess the data to prepare it for analysis. To open a local data file, click the “Open File” button and select the desired data file in the window that follows.

 

Pre-Processing and Visualizing the Data

 

In general, before applying filters to a dataset, one should first examine the data carefully and use tools to help visualize it.

 

The picture on the left shows the attributes of the “birth.csv” data file. By default, Weka selects the last attribute as the class variable; however, the user is free to choose any variable as the class variable. Attributes can also be selected and removed here. In our example we will remove the id and bwt attributes, as they are not used in the analysis.

 

When the user selects an attribute, information about it is displayed in the right section of the window. The two pictures on the right show information on the smoke attribute. The first shows smoke as a numerical attribute with 2 distinct values. We will convert it to a nominal variable later using filters.
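To make the attribute summary concrete, the following is a minimal sketch of how such information could be computed outside Weka. It is not Weka's implementation. The rows are copied from the ARFF excerpt later in this section, and the header line is an assumption that the CSV columns carry the same names as the attributes; `summarize` and `is_number` are hypothetical helper names.

```python
import csv
import io

# Rows copied from the ARFF excerpt in this section. The header line is an
# assumption: we suppose the CSV columns are named like the attributes.
SAMPLE_CSV = """id,age,lwt,smoke,ptl,ht,ui,ftv,bwt,weight
85,19,182,0,0,0,1,0,2523,low
86,33,155,0,0,0,0,3,2551,low
87,20,105,1,0,0,0,1,2557,low
88,21,108,1,0,0,1,2,2594,low
89,18,107,1,0,0,1,0,2600,low
91,21,124,0,0,0,0,0,2622,low
"""


def is_number(text):
    """Return True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def summarize(csv_text, attribute):
    """Guess the type of one attribute and count its distinct values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [row[attribute] for row in rows]
    # Without extra information, a column of numeric codes looks numeric.
    kind = "numeric" if all(is_number(v) for v in values) else "nominal"
    return {
        "name": attribute,
        "type": kind,
        "count": len(values),
        "distinct": len(set(values)),
    }


summary = summarize(SAMPLE_CSV, "smoke")
print(summary)
```

On these rows, smoke is reported as numeric with 2 distinct values, which mirrors why the attribute needs to be converted before analysis.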

 

The picture below this shows how the class attribute, weight, is distributed across the two levels of smoke. It can be seen that, as a proportion, there are more low birth weights among smokers than among non-smokers. As stated before, the user can select any attribute as the class attribute. This is done by using the drop-down list and selecting another variable.
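The breakdown of the class attribute by the levels of another attribute can be sketched as a simple proportion table. This is only an illustration of the calculation, not what Weka runs internally; `class_distribution` is a hypothetical helper. The six (smoke, weight) pairs come from the ARFF excerpt later in this section, and because all six happen to be low, they are far too few to reproduce the pattern seen in the full data.

```python
from collections import Counter, defaultdict


def class_distribution(pairs):
    """Return {attribute_value: {class_value: proportion}}."""
    counts = defaultdict(Counter)
    for value, label in pairs:
        counts[value][label] += 1
    return {
        value: {label: n / sum(c.values()) for label, n in c.items()}
        for value, c in counts.items()
    }


# (smoke, weight) pairs taken from the six rows of the ARFF excerpt.
rows = [
    ("0", "low"), ("0", "low"), ("1", "low"),
    ("1", "low"), ("1", "low"), ("0", "low"),
]
dist = class_distribution(rows)
print(dist)
```

Applied to the complete file, the same function would give the proportion of low and normal weights within each level of smoke.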

 

Weka offers a quick way to visualize all the attributes at once: clicking the “Visualize All” button. All attributes are then shown in one window, each with a color-coded breakdown of the class variable for the different values of the attribute.

 

We already briefly described filters in the introduction; we will now be using a filter to prepare the data for analysis. Continuing our example, we loaded the birth data file. It contains numerical, nominal and binary attributes.

 

The nominal and binary attributes are stored as numerical codes, and since the CSV format contains no information about attribute types, the system cannot tell how to treat them. By default they are treated as numerical attributes. Before we analyze the data we need to convert these attributes to nominal attributes. To do this, we will use the PKIDiscretize filter.

 

We click the “Choose” button and select the PKIDiscretize filter from the list of unsupervised attribute filters. Because these attributes are intended to be nominal, discretizing them loses no information, so the unsupervised filter is sufficient.
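The end result of the conversion can be pictured as follows: each distinct numeric code becomes one level of a nominal attribute. The sketch below shows only that outcome and makes no claim about how PKIDiscretize works internally; `nominal_levels` and `arff_attribute` are hypothetical helpers, and the values are the smoke and ftv columns of the six rows in the ARFF excerpt later in this section.

```python
def nominal_levels(values):
    """Return the distinct integer codes in ascending order."""
    return sorted(set(values), key=int)


def arff_attribute(name, values):
    """Build an ARFF-style nominal declaration from the observed codes."""
    return "@attribute %s {%s}" % (name, ",".join(nominal_levels(values)))


# Columns taken from the six rows of the ARFF excerpt.
smoke = ["0", "0", "1", "1", "1", "0"]
ftv = ["0", "3", "1", "2", "0", "0"]

smoke_decl = arff_attribute("smoke", smoke)
ftv_decl = arff_attribute("ftv", ftv)
print(smoke_decl)
print(ftv_decl)
```

Note that the declaration only lists levels that actually occur in the values supplied. On these six rows ftv yields the levels 0 to 3, whereas the full data file declares 0 to 6.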

 

Filters are intended to make changes to the data, and this brings us to a concept in Weka referred to as relations. This is shown in the picture. A relation is a current instance of a data file. In this case, it is showing the relation called “birth”, which is the name of the data file, along with information about the number of instances and attributes. After a filter has been applied to the relation (or, in fact, after any change to the relation), its name changes to reflect what change was applied to the data. This new relation is in a sense a new data file and can be saved to file. This allows one to apply multiple filters to a data file in sequence and, conversely, to step back through previous relations in case of an error. To do this, the user clicks on the undo button.
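The idea of a relation whose name records each change, together with an undo facility, can be sketched as a small history stack. This is an illustration of the concept only: the class `RelationHistory`, its methods, and the naming scheme (appending a label after a hyphen) are assumptions made for this example and are not Weka's actual classes or naming format.

```python
class RelationHistory:
    """Keep every version of a relation so that changes can be undone."""

    def __init__(self, name, rows):
        self._states = [(name, rows)]

    @property
    def name(self):
        return self._states[-1][0]

    @property
    def rows(self):
        return self._states[-1][1]

    def apply(self, label, transform):
        """Apply a change and record it in the relation name (assumed scheme)."""
        name, rows = self._states[-1]
        self._states.append((name + "-" + label, transform(rows)))

    def undo(self):
        """Go back to the previous relation, if there is one."""
        if len(self._states) > 1:
            self._states.pop()


# Two rows (id, age) from the excerpt, reduced to two columns for brevity.
history = RelationHistory("birth", [["85", "19"], ["86", "33"]])
history.apply("remove-first", lambda rows: [row[1:] for row in rows])
name_after_filter = history.name
rows_after_filter = history.rows
history.undo()
print(name_after_filter, rows_after_filter, history.name)
```

Because each state is kept rather than overwritten, several changes can be applied in sequence and undone one at a time.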

 

To continue with our example, we first choose the PKIDiscretize filter. Options for this filter, as well as help on it, are accessed by clicking on its name. In the window that follows we set it to be applied to attributes 4-8 and apply the filter. We then select the attributes id and bwt and click on the “Remove” button.
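As a quick check of which attributes the range 4-8 refers to, the sketch below maps a 1-based range onto the attribute names. The hypothetical `parse_range` helper handles only plain numbers and simple "first-last" ranges separated by commas; it does not attempt to cover the full range syntax Weka accepts. The attribute order is taken from the ARFF excerpt later in this section.

```python
def parse_range(spec):
    """Turn a 1-based spec such as "4-8" or "1,9" into 0-based indices."""
    indices = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            indices.extend(range(int(first) - 1, int(last)))
        else:
            indices.append(int(part) - 1)
    return indices


# Attribute order as listed in the ARFF excerpt.
header = ["id", "age", "lwt", "smoke", "ptl", "ht", "ui", "ftv", "bwt", "weight"]

discretized = [header[i] for i in parse_range("4-8")]
removed = [header[i] for i in parse_range("1,9")]
print(discretized)
print(removed)
```

The range 4-8 therefore covers smoke, ptl, ht, ui and ftv, which are exactly the coded attributes, while positions 1 and 9 are the id and bwt attributes that are removed.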

At this point we would normally select each of the nominal variables and then label the different levels appropriately. Unfortunately, there is a bug in the software that prevents this.

 

To get around this problem we will modify the data by hand and label all the categorical variables properly. This is not difficult as the ARFF format is easy to work with. A portion of this new data file is shown below:



@relation birth
@attribute id numeric
@attribute age numeric
@attribute lwt numeric
@attribute smoke {0,1}
@attribute ptl {0,1,2,3}
@attribute ht {0,1}
@attribute ui {0,1}
@attribute ftv {0,1,2,3,4,5,6}
@attribute bwt numeric
@attribute weight {low,normal}

@data
85,19,182,0,0,0,1,0,2523,low
86,33,155,0,0,0,0,3,2551,low
87,20,105,1,0,0,0,1,2557,low
88,21,108,1,0,0,1,2,2594,low
89,18,107,1,0,0,1,0,2600,low
91,21,124,0,0,0,0,0,2622,low

After this, we drop the id and bwt attributes as discussed previously. This concludes our pre-processing of the data, and we can save the relation to a data file at this point. To save the relation to an external data file, simply click the “Save” button at the top.
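For readers who prefer to script this last step, the following is a minimal sketch of dropping the two attributes and producing ARFF text. It is not how Weka performs the operation. The helpers `remove_attributes` and `to_arff` are hypothetical, the attribute types are copied from the header shown above, and keeping the relation name as “birth” is an assumption, since Weka itself would change the name after a filter is applied.

```python
def remove_attributes(header, rows, names):
    """Drop the named columns from the header and from every row."""
    keep = [i for i, h in enumerate(header) if h not in names]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


def to_arff(relation, header, types, rows):
    """Render a relation as ARFF text."""
    lines = ["@relation " + relation]
    lines += ["@attribute %s %s" % (h, types[h]) for h in header]
    lines.append("@data")
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


# Attribute types copied from the ARFF header shown above.
types = {
    "id": "numeric", "age": "numeric", "lwt": "numeric",
    "smoke": "{0,1}", "ptl": "{0,1,2,3}", "ht": "{0,1}", "ui": "{0,1}",
    "ftv": "{0,1,2,3,4,5,6}", "bwt": "numeric", "weight": "{low,normal}",
}
header = list(types)
rows = [
    ["85", "19", "182", "0", "0", "0", "1", "0", "2523", "low"],
    ["86", "33", "155", "0", "0", "0", "0", "3", "2551", "low"],
]

new_header, new_rows = remove_attributes(header, rows, {"id", "bwt"})
# Assumption: the relation name is kept as "birth" for simplicity.
arff_text = to_arff("birth", new_header, types, new_rows)
print(arff_text)
```

The resulting text can be written to a file with the .arff extension and opened again in the Explorer.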