Running the Experiment
Running
The next step is to run the experiment, which is done from the Run tab at the top of the window. There is not much involved in this step: all that is needed is to click on the Start button. The progress of the experiment is displayed in the Status area at the bottom of the window, and any errors are displayed in the Log area. Once the experiment has been run, the next step is to analyze the results.
Analyzing the output
When the Experimenter has finished running the experiment, what is produced is a dataset that contains results of the experiment. In our example, we were using 1 dataset and 2 algorithms. We also used 10-fold cross-validation for sampling/testing. This means that for every algorithm, 1 result is produced for each fold of each dataset. Since we used 10 folds and had 1 dataset, 10 results were produced for each algorithm for a total of 20 results. The data was re-randomized and the experiment repeated 10 times so 200 results were produced.
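The arithmetic above can be checked with a short sketch. The function name and its parameters are hypothetical and are not part of the Experimenter; they only restate the counting rule described in the text.

```python
def count_results(datasets, algorithms, folds, runs):
    """Number of results: one per algorithm, per fold, per dataset, per run."""
    return datasets * algorithms * folds * runs

# Values from the example: 1 dataset, 2 algorithms, 10 folds.
per_run = count_results(datasets=1, algorithms=2, folds=10, runs=1)
# The experiment was repeated 10 times.
total = count_results(datasets=1, algorithms=2, folds=10, runs=10)
print(per_run, total)
```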
Each result contains the algorithm, dataset and sample that produced it, as well as various performance measures. A result set specifies a certain model based on a certain dataset. Since we only have 1 dataset in this example, this means that the number of models and the number of result sets are equal.
Unlike the Explorer interface, there are no graphical analysis tools available within the Experimenter environment. Analysis in this environment is limited to numerical analysis and significance testing. It is also possible to output the experiment results into a dataset so that they can be used in an external program such as Excel to generate visual displays.
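As a rough illustration of working with exported results outside the Experimenter, the sketch below reads a small CSV excerpt and computes the mean and standard deviation of one measure per scheme. The column names (Key_Scheme, Key_Run, Key_Fold, Percent_correct), the scheme names and all the values are assumptions made up for the example; check the header of your own export before adapting it.

```python
import csv
import io
import statistics
from collections import defaultdict

# Hypothetical excerpt of an exported results file. Column names and values
# are illustrative assumptions, not actual Experimenter output.
SAMPLE_CSV = """Key_Scheme,Key_Run,Key_Fold,Percent_correct
SchemeA,1,1,70.0
SchemeA,1,2,80.0
SchemeB,1,1,60.0
SchemeB,1,2,70.0
"""

def summarize(csv_text, scheme_col="Key_Scheme", value_col="Percent_correct"):
    """Return {scheme: (mean, sample standard deviation)} for one measure."""
    values = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        values[row[scheme_col]].append(float(row[value_col]))
    return {s: (statistics.mean(v), statistics.stdev(v)) for s, v in values.items()}

summary = summarize(SAMPLE_CSV)
for scheme, (mean, std) in summary.items():
    print(f"{scheme}: mean={mean:.2f} std={std:.2f}")
```

To read a real export, the in-memory string would be replaced by an opened file passed to csv.DictReader.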
The Interface
Please refer to reference section 3.6.2
The interface of the Analysis panel in the Experimenter is composed of four different areas. The area at the top, called ‘Source’, is where the user will be able to load the experiment results. The user has the option of loading up the results of the last experiment run, from a file or directly from a database.
Below this, on the left, is the ‘Configure test’ area. This is where the user chooses how to compare the different models built in the experiment. The ‘Row’ and ‘Column’ buttons determine how the results are grouped together so that they may be compared against each other. The Column button controls which fields are used to identify a particular learning scheme. By default, these fields are Scheme, Scheme_options and Scheme_version_ID. The Row button controls which fields identify the dataset used; by default, this is the dataset field. Together these fields uniquely identify every model built on a given dataset in the experiment and serve to group the results accordingly so that the models may be compared with each other. It is advisable to leave these default settings.
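The grouping described above can be pictured as building a table keyed first by the row fields and then by the column fields. The following is a minimal sketch of that idea, not the Experimenter's implementation; the record layout, the "Dataset" and "score" field names, and all the values are hypothetical.

```python
from collections import defaultdict

# Hypothetical result records; every value here is a placeholder.
results = [
    {"Dataset": "d1", "Scheme": "A", "Scheme_options": "-x", "Scheme_version_ID": "1", "score": 1.0},
    {"Dataset": "d1", "Scheme": "A", "Scheme_options": "-x", "Scheme_version_ID": "1", "score": 2.0},
    {"Dataset": "d1", "Scheme": "B", "Scheme_options": "-y", "Scheme_version_ID": "7", "score": 3.0},
]

ROW_FIELDS = ("Dataset",)
COLUMN_FIELDS = ("Scheme", "Scheme_options", "Scheme_version_ID")

def group_results(records, row_fields=ROW_FIELDS, column_fields=COLUMN_FIELDS):
    """Group scores into table[row_key][column_key] -> list of scores."""
    table = defaultdict(lambda: defaultdict(list))
    for rec in records:
        row_key = tuple(rec[f] for f in row_fields)        # identifies the dataset
        col_key = tuple(rec[f] for f in column_fields)     # identifies the scheme
        table[row_key][col_key].append(rec["score"])
    return table

table = group_results(results)
for row_key, columns in table.items():
    for col_key, scores in columns.items():
        print(row_key, col_key, scores)
```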
Below this is the ‘Comparison Field’ dropdown, where the user selects the basis on which the different models will be compared. There are many choices here: the user can compare the accuracy of the models, their training times, their efficiency and their complexity. Next is a field that allows the user to set the significance level of the test. The button below this is used to select one of the models built in the experiment as the baseline model. All other models will be compared with this baseline model when performing the test. When the user clicks on this button, a window pops up that allows the user to select one of the models. This window is shown in the picture on the right. As shown, the user has the option of selecting the J48 model or the REPTree model as the baseline. It is also possible to rank the models or to show a summary.
The ‘Displayed Columns’ button below this allows the user to choose which of the result sets (models) will be used in the test; by default, all result sets are selected. Lastly, the ‘Output Format’ button gives more options for controlling the test output. In the window that pops up, the user can control how many decimal places are shown for the mean and standard deviation, and in what format the test information will be displayed. The test output can be saved in plain text (default), LaTeX or CSV format.
The large area to the right of this is the ‘Test output’ area. It is here that the details of the test will be displayed. Lastly, at the bottom left, is the ‘Results list’ area. Any action, such as loading results or performing a test, produces some output that is displayed in the ‘Test output’ area. Past results can be recalled from the ‘Results list’ area. The newest result is always shown at the bottom of this list.
Analyzing the Experiment
We will first compare the two models on the percentage of test cases each model got correct and check whether there was any statistically significant difference in their performance. We shall then rank them based on their training time. Of course, many other tests can be conducted. However, since the information presented for each test is the same, these examples will suffice to explain how to conduct any of the other possible tests.
Comparing Model Performance
To begin the analysis, we first load the experiment results by pushing the Experiment button (Source area). This will load the last experiment’s results. Continuing our example with the birth weight data, 200 results will be loaded. To compare the two models on their percentage correct score, we set the Comparison field to Percentage_correct and set the baseline model to be the J48 model. We also check the ‘Show std. deviations’ checkbox. We perform the test by clicking on the ‘Perform test’ button, and the test results are displayed in the test output area on the right.
The picture on the left shows the output produced from the test. The first 5 lines of the output show basic information about the test.
Below this is a table that contains the main information about the test. This table’s rows and columns are controlled via the ‘Row’ and ‘Column’ buttons we looked at before. By default, every row of the table will contain test scores from a particular dataset, and every column will contain test scores from a particular learning scheme (algorithm).
The picture shows a 3x3 table with the first row serving to identify the different schemes, and the first column identifying the dataset that the test score came from, with the number of instances in it shown in brackets. Since in our example there were only 1 dataset and 2 algorithms, there will be only 2 test scores.
The dataset name ‘birth-weka.filters.unsup (100)’ is a partial name, and the number shown in brackets is the number of instances in it. This dataset does not refer to the birth weight data set, but to the dataset produced by the Experimenter. The 100 instances come from the fact that we conducted a 10-fold cross-validation, so each experiment run produced 10 results per algorithm. We repeated the experiment 10 times, which resulted in 100 instances per algorithm.
The first algorithm shown is always the baseline, and in this case it is the J48 algorithm. The model generated had a mean Percentage_correct score of 72.78% with a std. deviation of 8.63. The next model, based on the REPTree algorithm, had a score of 72.11% with a std. deviation of 6.17. The result of the comparisons is shown in the last row.
The Experimenter uses a series of pairwise t-tests to compare the different models and uses a special notation to describe the comparisons. Each scheme is given three numbers in brackets, in the form (#/#/#). These numbers are counts of the number of times the scheme was statistically better than, equal to and worse than the baseline model, respectively. We can see the REPTree scheme got a ‘(0/1/0)’, meaning it was never better than the J48 scheme, it was equal once and it was never worse. For the J48 scheme we see a ‘(V/ /*)’ instead of numbers because it is the baseline scheme.
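To make the notation more concrete, the sketch below computes a textbook paired t statistic for one scheme against a baseline and tallies the outcome as (better, equal, worse). This is an illustration under assumptions, not the Experimenter's own procedure: the text does not say which variant of the t-test is applied, the function names and scores are hypothetical, and the critical value is supplied by hand from a t table (4.303 is the two-sided 0.05 value for 2 degrees of freedom).

```python
import math
import statistics

def paired_t_statistic(scheme_scores, baseline_scores):
    """Textbook paired t statistic for scheme minus baseline.

    Assumes both lists hold scores from the same samples in the same order,
    and that the differences are not all identical (non-zero spread).
    """
    if len(scheme_scores) != len(baseline_scores):
        raise ValueError("score lists must have the same length")
    diffs = [s - b for s, b in zip(scheme_scores, baseline_scores)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)
    return mean_diff / (std_diff / math.sqrt(len(diffs)))

def tally(t_values, critical_value):
    """Count (better, equal, worse), one t value per dataset.

    Assumes a measure where higher is better, so a large positive t
    means the scheme beat the baseline.
    """
    better = sum(1 for t in t_values if t > critical_value)
    worse = sum(1 for t in t_values if t < -critical_value)
    return better, len(t_values) - better - worse, worse

# Hypothetical scores on one dataset with three paired samples.
t = paired_t_statistic([71.0, 74.0, 77.0], [70.0, 72.0, 74.0])
print("(%d/%d/%d)" % tally([t], critical_value=4.303))
```

With one dataset, the three counts always sum to 1, which is why the example in the text shows a single 1 in the middle position.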
Rank and Summary Views
At times, it is more convenient to view the experiment by ranking the different models according to the performance criterion chosen, and at other times it is sufficient to show a summary of the experiment. A user can switch to either of these by clicking on the ‘Select Base’ button and choosing either summary or ranking from the window that pops up.
The ranking test has a special notation that requires some elaboration. The picture on the left is from the test in which we ranked the two schemes based on their training time. There are 3 columns in this table, titled ‘>-<’, ‘>’ and ‘<’. Numbers in the last 2 columns are counts of the number of times the scheme was significantly more and significantly less than the other schemes. Numbers in the first column represent the difference between these two counts. There are 2 rows in this table, one row for each learning scheme.
In this case, we can see that the J48 scheme had a training time that was significantly more than that of the other learning scheme tested only once. Since shorter training times are more desirable, the scheme with the lowest number in the first column is the winner, and in this case it is the REPTree scheme.
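The three ranking columns can be reproduced with a small sketch. It assumes the significance decisions have already been made and are passed in as (higher, lower) pairs; the function name and input layout are hypothetical, and the single pair below mirrors the training-time example, where J48 was significantly higher than REPTree once.

```python
def ranking_table(schemes, significant_pairs):
    """Build rows of (scheme, '>-<', '>', '<'), sorted by '>-<' descending.

    significant_pairs holds (higher, lower) tuples: one entry for each time
    the first scheme's value was significantly more than the second's.
    """
    more = {s: 0 for s in schemes}
    less = {s: 0 for s in schemes}
    for higher, lower in significant_pairs:
        more[higher] += 1
        less[lower] += 1
    rows = [(s, more[s] - less[s], more[s], less[s]) for s in schemes]
    return sorted(rows, key=lambda r: r[1], reverse=True)

rows = ranking_table(["J48", "REPTree"], [("J48", "REPTree")])
for scheme, diff, wins, losses in rows:
    print(f"{diff:>3} {wins:>3} {losses:>3}  {scheme}")

# For training time, lower is better, so the smallest '>-<' value wins.
fastest = min(rows, key=lambda r: r[1])[0]
print("Shortest training time:", fastest)
```

For a measure where higher is better, such as percentage correct, the scheme with the largest value in the first column would be preferred instead.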