Data Retrieval and Preparation
Getting the data
There are three ways of loading data into the Explorer: from a file, from a database connection, and from a file on a web server. We will load the data from a locally stored file.
Weka supports four file formats: CSV, C4.5, flat binary files, and the native ARFF format. To demonstrate the Explorer environment we will load a CSV file, and in the following section we will preprocess the data to prepare it for analysis. To open a local data file, click the “Open File” button and select the desired data file in the window that follows.
Pre-Processing and Visualizing the Data
In general, before applying filters to a dataset, one should first examine the data carefully and use the available tools to visualize it.
The picture on the left shows the attributes of the “birth.csv” data file. By default, Weka selects the last attribute as the class variable; however, the user is free to choose any variable as the class variable. Attributes can also be selected and removed here. In our example we will remove the id and bwt attributes, as they are not used in the analysis.
When the user selects an attribute, information about it is displayed in the right section of the window. The two pictures on the right show information on the smoke attribute. The first shows smoke as a numerical attribute with two distinct values. We will convert it to a nominal variable later using filters.
The picture below this shows how the class attribute, weight, is distributed on the two levels of smoke. It can be seen here that, as a proportion, there are more low birth weights among smokers than among non-smokers. As stated before, the user can select any attribute as the class attribute. This is done by using the drop-down button and selecting another variable.
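The breakdown that the histogram displays can be approximated with a small Python sketch. It is only an illustration of the counting involved, not Weka's own code; the function name class_proportions is hypothetical, and the input is limited to the smoke and weight values of the six records listed later in this section.

```python
from collections import Counter

# (smoke, weight) pairs from the six-record excerpt listed later in this
# section, in the order the records appear there.
PAIRS = [("0", "low"), ("0", "low"), ("1", "low"),
         ("1", "low"), ("1", "low"), ("0", "low")]


def class_proportions(pairs):
    """Return {attribute value: {class value: proportion}}."""
    totals = Counter(value for value, _ in pairs)  # records per attribute value
    joint = Counter(pairs)                         # records per (value, class)
    result = {}
    for (value, cls), count in joint.items():
        result.setdefault(value, {})[cls] = count / totals[value]
    return result


proportions = class_proportions(PAIRS)
print(proportions)
```

Because the excerpt contains only low-weight records, both groups come out at a proportion of 1.0 here; the full birth data file is needed to reproduce the comparison between smokers and non-smokers described above.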
Weka offers a quick way to visualize all the attributes at once: click the “Visualize All” button. All attributes are then shown in one window, each with a color-coded breakdown of the class variable for the different values of the attribute.
We already briefly described filters in the introduction; we will now be using a filter to prepare the data for analysis. Continuing our example, we loaded the birth data file. It contains numerical, nominal and binary attributes.
The nominal and binary attributes are stored as numerical codes, and since the CSV format carries no information about attribute types, the system cannot tell how to treat them. By default they are treated as numerical attributes. Before we analyze the data we need to convert them to nominal attributes, which we will do with the PKIDiscretize filter.
Click the Choose button and select the PKIDiscretize filter from the list of unsupervised attribute filters. Since the attributes are intended to be nominal, discretizing them loses no information, so the unsupervised filter is sufficient.
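Why coded attributes end up numeric can be seen from a rough Python sketch of type guessing. This is an assumption about how a CSV reader might decide on types, not a description of Weka's loader; the column names in the header line are assumed to match the attribute names used in this section, and the three records come from the excerpt shown later.

```python
import csv
import io

# Assumed header line plus the first three records of the data excerpt
# shown later in this section.
CSV_TEXT = """id,age,lwt,smoke,ptl,ht,ui,ftv,bwt,weight
85,19,182,0,0,0,1,0,2523,low
86,33,155,0,0,0,0,3,2551,low
87,20,105,1,0,0,0,1,2557,low
"""


def is_number(text):
    """Return True if the text parses as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def guess_types(csv_text):
    """Guess 'numeric' or 'nominal' per column from the values alone."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = list(zip(*reader))  # transpose the records into columns
    return {
        name: "numeric" if all(is_number(v) for v in col) else "nominal"
        for name, col in zip(header, columns)
    }


types = guess_types(CSV_TEXT)
print(types["smoke"])   # numeric
print(types["weight"])  # nominal
```

With nothing but the values to go on, a code such as 0 or 1 is indistinguishable from a measurement, which is why the conversion to nominal has to be requested explicitly.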
Filters are intended to make changes to the data, and this brings us to a concept in Weka referred to as relations. This is shown in the picture. A relation is the current instance of a data file. In this case, it is showing the relation called “birth”, which is the name of the data file, along with the number of instances and attributes. After a filter has been applied to the relation (or indeed after any change to the relation), its name changes to reflect the change that was applied to the data. This new relation is in a sense a new data file and can be saved to file. This allows one to apply multiple filters to a data file in sequence and, conversely, to step back through previous relations in case of an error. To go back, the user clicks the Undo button.
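One way to picture relations and undo is as a stack of versions. The Python sketch below is a hypothetical model and not Weka's implementation; the class RelationHistory, its methods, and the scheme of appending a label to the relation name are all assumptions made for illustration.

```python
class RelationHistory:
    """Keep every version of a relation so that changes can be undone."""

    def __init__(self, name, rows):
        self._versions = [(name, rows)]

    @property
    def current(self):
        """The most recent (name, rows) pair."""
        return self._versions[-1]

    def apply(self, label, transform):
        """Apply a change and record the result as a new relation."""
        name, rows = self.current
        # Appending the label to the name is an assumed naming scheme.
        self._versions.append((f"{name}-{label}", transform(rows)))

    def undo(self):
        """Return to the previous relation; the original is never dropped."""
        if len(self._versions) > 1:
            self._versions.pop()
        return self.current


history = RelationHistory("birth", [[85, 19, 182], [86, 33, 155]])
history.apply("drop_first", lambda rows: [row[1:] for row in rows])
print(history.current[0])  # birth-drop_first
history.undo()
print(history.current[0])  # birth
```

Each change produces a new named version rather than overwriting the old one, which is what makes stepping back through several changes possible.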
To continue with our example, we first choose the PKIDiscretize filter. Its options, as well as help on it, are accessed by clicking on its name. In the window that follows we set it to be applied to attributes 4-8 and apply the filter. We then select the attributes id and bwt and click the Remove button.
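The intended outcome of this step, treating each distinct code in attributes 4-8 as one nominal level, can be sketched in Python. This is not the binning procedure that PKIDiscretize itself performs; the helper names parse_range and nominal_levels are hypothetical, and the records are the six shown in the excerpt later in this section.

```python
def parse_range(spec):
    """Turn a 1-based range such as '4-8' into 0-based column indices."""
    first, last = (int(part) for part in spec.split("-"))
    return list(range(first - 1, last))


def nominal_levels(rows, spec):
    """Map each selected 1-based attribute number to its sorted distinct values."""
    return {
        index + 1: sorted({row[index] for row in rows})
        for index in parse_range(spec)
    }


# The six records from the data excerpt in this section.
ROWS = [
    [85, 19, 182, 0, 0, 0, 1, 0, 2523, "low"],
    [86, 33, 155, 0, 0, 0, 0, 3, 2551, "low"],
    [87, 20, 105, 1, 0, 0, 0, 1, 2557, "low"],
    [88, 21, 108, 1, 0, 0, 1, 2, 2594, "low"],
    [89, 18, 107, 1, 0, 0, 1, 0, 2600, "low"],
    [91, 21, 124, 0, 0, 0, 0, 0, 2622, "low"],
]

levels = nominal_levels(ROWS, "4-8")
print(levels[4])  # [0, 1]        (smoke)
print(levels[8])  # [0, 1, 2, 3]  (ftv)
```

The levels found this way cover only the values present in these six records, so they can be narrower than the levels declared for the full data file.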

At this point we would normally select each of the nominal variables and then label the different levels appropriately. Unfortunately, there is a bug in the software that prevents this.
To get around this problem we will modify the data by hand and label all the categorical variables properly. This is not difficult as the ARFF format is easy to work with. A portion of this new data file is shown below:
@relation birth
@attribute id numeric
@attribute age numeric
@attribute lwt numeric
@attribute smoke {0,1}
@attribute ptl {0,1,2,3}
@attribute ht {0,1}
@attribute ui {0,1}
@attribute ftv {0,1,2,3,4,5,6}
@attribute bwt numeric
@attribute weight {low,normal}
@data
85,19,182,0,0,0,1,0,2523,low
86,33,155,0,0,0,0,3,2551,low
87,20,105,1,0,0,0,1,2557,low
88,21,108,1,0,0,1,2,2594,low
89,18,107,1,0,0,1,0,2600,low
91,21,124,0,0,0,0,0,2622,low
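To show how little structure the format has, here is a minimal Python sketch of a reader for a listing like the one above. It is a simplification written for this example, not Weka's parser: it assumes unquoted values, one declaration per line, and dense (non-sparse) data, and the function name parse_arff is hypothetical.

```python
ARFF_TEXT = """@relation birth
@attribute id numeric
@attribute age numeric
@attribute lwt numeric
@attribute smoke {0,1}
@attribute ptl {0,1,2,3}
@attribute ht {0,1}
@attribute ui {0,1}
@attribute ftv {0,1,2,3,4,5,6}
@attribute bwt numeric
@attribute weight {low,normal}
@data
85,19,182,0,0,0,1,0,2523,low
86,33,155,0,0,0,0,3,2551,low
"""


def parse_arff(text):
    """Return (relation name, [(attribute name, type)], data rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and % comments
            continue
        keyword = line.lower()
        if in_data:
            rows.append(line.split(","))
        elif keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation)       # birth
print(attributes[3])  # ('smoke', '{0,1}')
```

The header carries exactly the type information that the CSV file lacked: numeric attributes are marked as such, and each nominal attribute lists its permitted levels in braces.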
After this, we drop the id and bwt attributes as discussed previously. This concludes our pre-processing of the data, and we can now save the relation to an external data file by clicking the “Save” button at the top.
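The effect of removing the two attributes and saving can be sketched in Python as follows. The helpers drop_attributes and to_arff are hypothetical names introduced for this example and only mimic the result of the Remove and Save buttons; two records from the excerpt are used as input.

```python
def drop_attributes(attributes, rows, names):
    """Remove the named attributes from both the header and every row."""
    keep = [i for i, (name, _) in enumerate(attributes) if name not in names]
    return ([attributes[i] for i in keep],
            [[row[i] for i in keep] for row in rows])


def to_arff(relation, attributes, rows):
    """Render a relation as ARFF text (unquoted values, dense data only)."""
    lines = [f"@relation {relation}"]
    lines += [f"@attribute {name} {kind}" for name, kind in attributes]
    lines.append("@data")
    lines += [",".join(row) for row in rows]
    return "\n".join(lines) + "\n"


ATTRIBUTES = [
    ("id", "numeric"), ("age", "numeric"), ("lwt", "numeric"),
    ("smoke", "{0,1}"), ("ptl", "{0,1,2,3}"), ("ht", "{0,1}"),
    ("ui", "{0,1}"), ("ftv", "{0,1,2,3,4,5,6}"), ("bwt", "numeric"),
    ("weight", "{low,normal}"),
]
ROWS = [
    ["85", "19", "182", "0", "0", "0", "1", "0", "2523", "low"],
    ["86", "33", "155", "0", "0", "0", "0", "3", "2551", "low"],
]

attributes, rows = drop_attributes(ATTRIBUTES, ROWS, {"id", "bwt"})
arff_text = to_arff("birth", attributes, rows)
print(arff_text)
```

Writing arff_text to a file with the .arff extension would give a data file with eight attributes, with weight still in the last position where Weka looks for the class by default.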