Weka - More than a bird in New Zealand

Have you ever heard of the tiny flightless birds called Weka, which have its roots in New Zealand. Probably not. However, what you as a software and technology interested person pretty sure have heard of, these days even more often than ever before, is the buzzword Machine Learning. You may now be a bit confused and might think what the hell I am talking about, but there is a good reason why I am mentioning this specific bird species in my introduction. As some of you might suppose, Weka also has something to do with Machine Learning, in essence it is a popular Data Mining and Machine Learning framework written in Java. It was developed by the University of Waikato in New Zealand and is issued under the GNU GPL. I have been actively working with Weka recently and although I am not yet an expert in it, I want to give you an overview on what the software is capable of and will show you some basic stuff to easily master your first few Machine Learning tasks more easily in practice. Before getting started I want to remind you that this blog post will not be a comprehensive lecture about Machine Learning concepts and algorithms, it rather focuses on applying algorithms in practice and analyzing the results. Since I want to elaborate a bit more in depth on Machine Learning and Weka, I will continue with at least one follow up post where I want to share some more of my experiences and provide hints on how you can improve your results and get more out of your Machine Learning task.

Installing and running Weka

Since Weka is written in Java it can be conveniently installed on Windows, Mac and Linux. If you have installed Java just download the latest version depending on your operating system from the offical siteof the project, run the executable and you are good to go.

How Weka can be used

If you install Weka the way I described it, you are provided with a GUI with which you can apply the provided algorithms on a data set directly. We will focus on the capabilities of the GUI in this post. After running the application you should see the initial window as shown below.

The GUI has quite a lot of functionality you can use with a few clicks. However, you can access the functionality of Weka from your Java code as well. I will show you more on that in my upcoming post since it gives you more flexibility in preparing your data set and enables you to extend some functionality as well as to automize repetitive things.

Loading your data set into Weka

Before you can apply any of the provided algorithms you need to feed some data into Weka. We hereby talk about a data set which is usually a table where the rows represent observations in the data set and the columns represent different features of that data. Weka uses a file format called ARFF (Attribute-Relation File Format) to load such a data set. In Weka terms the rows are named instances and the features are called attributes. An ARFF file is basically a standard CSV file containing some metadata describing the attributes and its types followed by the actual data. Since such an ARFF file is usually not just created from scratch but rather originates from some other source, Weka provides functionality to convert an existing data set which may be a CSV or JSON file to the ARFF format and thus make it processable by Weka. You can do this by clicking on Tools -> Arff Viewer and selecting the desired CSV file. There is an often used example data set in the Machine Learning field called iris data set which serves as perfect starting point for trying out things. You can get ithere. After loading the CSV file you should see a window showing you the content like illustrated in the figure below. I encourage you to use the iris data set if you want to play around with it in Weka alongside reading this blog post.

We can see that the different attributes have a data type which was automatically recognized by Weka. Note that the last attribute “species” has type Nominal which means that it can only take specific values whereas the others are of type Numeric and can therefore contain an arbitrary continuous number. Since there is only one nominal attribute in the data set, Weka automatically uses this attribute as class attribute. The class attribute has a special meaning since it will be used as the attribute to predict when it comes to classification algorithms. In this data set for instance the use case is to predict the species by finding patterns in the remaining numeric attributes. You now have a basic understanding of the ARFF file and know what a class attribute means, so we can finally save the data set as ARFF file and inspect it after the conversion.

  
    
@relation iris

@attribute sepal_length numeric
@attribute sepal_width numeric
@attribute petal_length numeric
@attribute petal_width numeric
@attribute species {setosa,versicolor,virginica}

@data
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5,3.6,1.4,0.2,setosa

The generated file starts with a name (automatically derived from the filename) followed by some metadata describing each attribute as mentioned earlier and finally shows the data comma separated in the@datasection. Not that complex right. We are now ready to do something with this data set. Let’s head on.

Weka GUI Chooser

After starting Weka you will be prompted with a simple GUI with some buttons letting you decide where to go. I will briefly explain the different applications Weka offers. The Explorer lets you load an ARFF file, work with the data and apply Machine Learning algorithms on it. It gives you the best possibility to get started and try algorithms easily. I will only talk about the Explorer in this post. The Experimenter provides an enhanced GUI which lets you compare different algorithms on different data sets. It is very powerful when you want to dive deeper and analyze which particular algorithms work better on the data set than others. The KnowledgeFlow and Workbench have more or less the same capabilities as the Explorer and the Experimenter. The only difference is that components can be built together by drag and drop and the visualization is a bit different. If you are very curious and cannot wait to learn more details on the other GUI opportunities, I refer to the official documents. Since the next few paragraphs work with the Weka Explorer, let’s click on Explorer and load the data set with Open file… and selecting the produced ARFF file.

Preprocessing

I am definitely not the last person who will tell you one of the golden rules in Machine Learning, namely that garbage put in on the one side of the algorithm will most probably produce garbage coming out on the other side. Therefore preprocessing a data set before passing it to a Machine Learning task can save you much time and most probably make your results far better. Preprocessing in Weka is mainly done via filtering whereby Weka provides a huge collection of different filters you can apply to a data set. To get a better overview it is important to separate filters depending on whether they consider the class distribution (then we talk about supervised filters) or not (in this case we talk about unsupervised filters). The other separation is done based on the fact of whether filtering is done on attributes or instances. To make this a bit more clear I will provide two useful examples.

Class Balancer: This filter is a supervised instance-based filter which is used to combat the problem of imbalanced class distribution in a data set. This is a quite common problem since large data sets usually do not have each class represented equally. Think about a training data set about lot of people where the class attribute indicates whether the person has cancer or not. It is expected that only a minority falls into the “yes” category. The resulting problem is that most classifier algorithms will overfit into the direction of the majority class. The Class Balancer reweights the instances based on the class distribution to balance the data.
ReplaceMissingValues: Another common problem is the sparsity of data sets. Data instances often lack of some feature values which could not be collected in the data preparation phase. This filter is an unsupervised attribute-based filter which automatically replaces missing values by the average of the present values regarding this particular attribute among all instances.

There are many more filters as you can see when expanding the filtering tree. I hope you now have a basic understanding of the possibilities Weka filtering provides you with and be able to distinguish between supervised and unsupervised as well as instance and attribute based filters. Applying a filter is now as easy as selecting one and clicking on “Apply”.

Attribute selection

After preprocessing one could claim that we can now feed the data set into some classifier letting it to some magic for us and getting a perfect result. However, the reality is that Data Mining and Machine Learning requires a lot of experimenting until you find the needle in the huge data haystack. A recommended way is trying different combinations of attributes to find out which of them performs best. Even if you have a very good understanding of the present attributes in the data set, it makes sense to use the attribute selection functions of Weka. By clicking on the tab Select attributes you can now choose between a set of so called Attribute evaluators. These let you decide on a method how attributes should be selected. A popular evaluator for instance is CorrelationAttributeEval which computes the Pearson correlation coefficient between each attribute and the class attribute to determine the contribution of an attribute to the class prediction. The search method on the other hand specifies whether attributes should be ranked (allows for manual selection by the user) or automatically selected by the algorithm for the user. The results after running an attribute selection like the correlation based one will look like following.

Supervised and Unsupervised algorithms

Now we have talked about preprocessing our data in a way that it makes more sense and we know how to determine which attributes work well together. So it’s time to apply an actual Machine Learning algorithm to solve a problem. In Machine Learning we basically differentiate between supervised and unsupervised algorithms whereby the first category works on a training data set where instances have been assigned some class which should be predicted for another set of unknown instances, called test data. Two common types of supervised algorithms are regression, which is capable of predicting a continuous numeric target attribute, and classification, which purpose is to assign one of a set of predefined classes. (or categories if you will) Since the iris data set has the species as nominal class attribute with three different values, a pure classification algorithm better fits the problem. To apply such an algorithm, click on the tab Classify and then on Choose. A very popular and fast algorithm family is those of decision trees. If you want to know more about decision trees and want to understand how they work checkout the Internet for more theoretical background. A good starting point might be Wikipedia. A well performing algorithm available in Weka is J48 (a Java implementation of the C4.5 decision tree algorithm). Let’s choose this one for now. Before clicking the start button make sure that you have selected the species attribute in the combo box. In the Test options section, you have the opportunity to choose the evaluation method. You can either test on the training set, supply a test set (recommended after training a classifier successfully), split the training set for validation or use cross-validation. 10-fold cross-validation is the recommended strategy since it is most reliable for training. It randomly uses 90% of the data set for training the classifier and 10% for validation. This done 10 times always using different instances. The average performance tells us how good the choosen classifier performs. Let’s choose cross-validation on the iris data set. Since the data set is rather small, the classification is done within seconds.

Analyzing results

After applying a classification algorithm you will be provided a results overview as shown in the next figure.

  
    
=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances         144               96      %
Incorrectly Classified Instances         6                4      %
Kappa statistic                          0.94  
Mean absolute error                      0.035 
Root mean squared error                  0.1586
Relative absolute error                  7.8705 %
Root relative squared error             33.6353 %
Total Number of Instances              150     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.980    0.000    1.000      0.980    0.990      0.985    0.990     0.987     setosa
                 0.940    0.030    0.940      0.940    0.940      0.910    0.952     0.880     versicolor
                 0.960    0.030    0.941      0.960    0.950      0.925    0.961     0.905     virginica
Weighted Avg.    0.960    0.020    0.960      0.960    0.960      0.940    0.968     0.924     

=== Confusion Matrix ===

  a  b  c   <-- classified as
 49  1  0 |  a = setosa
  0 47  3 |  b = versicolor
  0  2 48 |  c = virginica

Most important, the first look will be how many instances were correctly classified. In our case this is 96% which is a very good result. Following are different error statistics before a separate table tells use about detailed accuracy by class. Interesting metrics are the precision indicating how many percent of the total instances with that class were classified as that class. The recall is the same as the percentage of correctly classified instances. The F-Measure is the weighted average of precision and recall. Do not forget the confusion matrix which gives a simple overview on how the distribution of correctly and incorrectly classified instances is.

Conclusion

This blog post should have given you a brief introduction into the practical appliance of Machine Learning in a Java environment. It definitely serves as a good starting point if you are a beginner in Machine Learning and used to Java. As said in the beginning, the post is no comprehensive introduction into Machine Learning so I highly recommend to search for other sources to get more detailed theoretical background on different Machine Learning algorithms and concepts. It will help you for your further work. There are plenty of good resources out there so Google will be definitely your friend here. I hope you learned something new after reading this post and ready to apply Machine Learning with Weka in Java.