Lecture 12
Main Features
Weka is freely available software developed at the University of Waikato, New Zealand. You can download it from the following link:
https://www.cs.waikato.ac.nz/ml/weka/
Weka contains tools for data pre-processing, classification,
clustering, association rules, and visualization (the Weka
Knowledge Explorer).
It also provides an environment for comparing learning
algorithms (the Experimenter).
It is also well-suited for developing new data mining or
machine learning schemes.
WEKA: versions
There are several versions of WEKA:
WEKA 3.0: “command-line”
WEKA 3.2: “GUI version” adds graphical user
interfaces
WEKA 3.3: “development version” with lots of
improvements
These slides use a mixture of snapshots of
WEKA 3.3 (soon to be WEKA 3.4) and WEKA 3.9.
WEKA Knowledge Explorer
Preprocess: Choose and modify the data
Classify: Train and test learning schemes that classify
Cluster: Learn clusters for the data
Associate: Learn association rules for the data
Select attributes: Find the most relevant attributes in the data
Visualize: View an interactive 2D plot of the data
WEKA Explorer: Pre-processing the Data
Data can be imported from a file in various
formats: ARFF, CSV, C4.5, binary
Data can also be read from a URL or from
an SQL database (using JDBC)
Pre-processing tools in WEKA are called
“filters”
WEKA contains filters for:
Discretization, normalization, attribute
selection, transforming, …
WEKA only deals with “flat” files: the data must be
converted to ARFF format before any algorithm can be
applied (see the sketch below).
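As a rough sketch of these steps through WEKA's Java API (class names are from the 3.x releases; the file names are only placeholders), a CSV file can be imported, passed through a discretization filter, and saved in ARFF format:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class PreprocessSketch {
    public static void main(String[] args) throws Exception {
        // Import a "flat" CSV file (placeholder name)
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("weather.csv"));
        Instances data = loader.getDataSet();

        // Apply a pre-processing "filter": unsupervised discretization
        Discretize discretize = new Discretize();
        discretize.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, discretize);

        // Write the result out in ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(discretized);
        saver.setFile(new File("weather.arff"));
        saver.writeBatch();
    }
}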
The dataset’s name is declared with @relation
The attribute information is declared with @attribute
The data section begins with @data
Data: a list of instances, with attribute values separated by
commas.
By default, the class is the last attribute in the
ARFF file.
Numeric Attributes and Missing Values
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny, 85, 85, FALSE, no
sunny, 80, 90, TRUE, no
overcast, 83, 86, FALSE, yes
rainy, 70, 96, FALSE, yes
...
Numeric Attributes and Missing Values
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny, 85, 85, FALSE, no
sunny, 80, 90, TRUE, no
overcast, 83, 86, FALSE, ?
rainy, 70, 96, ?, yes
...
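A minimal sketch of loading such a file through the Java API and counting the “?” entries per attribute (the file name weather.arff is an assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MissingValueSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // placeholder file name
        data.setClassIndex(data.numAttributes() - 1);     // class = last attribute (play)

        // Count how many values of each attribute are marked "?" in the data section
        for (int i = 0; i < data.numAttributes(); i++) {
            int missing = data.attributeStats(i).missingCount;
            System.out.println(data.attribute(i).name() + ": " + missing + " missing");
        }
    }
}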
Explorer: building “classifiers”
Classifiers in WEKA are models for
predicting nominal or numeric quantities
Implemented learning schemes include:
Decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons,
logistic regression, Bayes nets, …
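As an illustration (a sketch, not the Explorer GUI itself), one of these schemes, the J48 decision tree learner, can be trained and evaluated with 10-fold cross-validation through the Java API; the dataset name is a placeholder:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifySketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // placeholder dataset
        data.setClassIndex(data.numAttributes() - 1);     // nominal class: play

        J48 tree = new J48();                             // C4.5-style decision tree
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1)); // 10-fold CV
        System.out.println(eval.toSummaryString());

        tree.buildClassifier(data);                       // train on all data to print the tree
        System.out.println(tree);
    }
}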
Explorer: clustering data
WEKA contains “clusterers” for finding
groups of similar instances in a dataset
Implemented schemes are:
k-Means, EM, Cobweb, X-means, FarthestFirst
Clusters can be visualized
Evaluation is based on the log-likelihood if the
clustering scheme produces a probability
distribution
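As a sketch of the same idea in code (assuming a placeholder dataset name), the probabilistic EM clusterer can be built and evaluated; because EM produces a probability distribution, the evaluation reports a log-likelihood. The class attribute is removed first, since clustering is unsupervised:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClusterSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // placeholder dataset

        // Drop the class attribute (last) before clustering
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances input = Filter.useFilter(data, remove);

        EM em = new EM();          // probabilistic clusterer
        em.setNumClusters(2);      // fix the number of clusters (optional)
        em.buildClusterer(input);

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(em);
        eval.evaluateClusterer(input);
        System.out.println(eval.clusterResultsToString());
        System.out.println("Log-likelihood: " + eval.getLogLikelihood());
    }
}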
Explorer: finding associations
WEKA contains an implementation of the
Apriori algorithm for learning association rules
Works only with discrete data
Can identify statistical dependencies between
groups of attributes:
milk, butter ⇒ bread, eggs (with confidence 0.9)
Apriori can compute all rules that have a
given minimum support and exceed a given
confidence
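A minimal sketch of running Apriori through the Java API with a given minimum support and confidence; the dataset name is a placeholder, and the data must contain only nominal (discrete) attributes:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AssociationSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff"); // placeholder nominal dataset

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.2); // minimum support
        apriori.setMinMetric(0.9);            // minimum confidence
        apriori.setNumRules(10);              // report the 10 best rules
        apriori.buildAssociations(data);

        System.out.println(apriori);          // prints the discovered rules
    }
}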
Explorer: attribute selection
This panel can be used to investigate which (subsets of)
attributes are the most predictive
Attribute selection methods consist of two parts:
A search method: best-first, forward selection, random,
exhaustive, genetic algorithm, ranking
An evaluation method: correlation-based, wrapper,
information gain, chi-squared, …
Very flexible: WEKA allows (almost) arbitrary
combinations of these two
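As one example of such a combination (a sketch with a placeholder dataset name), a correlation-based evaluator (CfsSubsetEval) can be paired with best-first search through the Java API:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeSelectionSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // placeholder dataset
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());  // correlation-based evaluation
        selector.setSearch(new BestFirst());         // best-first search
        selector.SelectAttributes(data);             // note the capital S (legacy API)

        System.out.println(selector.toResultsString());
        // selectedAttributes() returns the chosen indices (the class index is appended last)
        for (int index : selector.selectedAttributes()) {
            System.out.println("Selected: " + data.attribute(index).name());
        }
    }
}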
Explorer: data visualization
Visualization is very useful in practice: e.g. it
helps to determine the difficulty of the learning
problem
WEKA can visualize single attributes (1-d)
and pairs of attributes (2-d)
To do: rotating 3-d visualizations (Xgobi-style)
Color-coded class values
“Jitter” option to deal with nominal attributes
(and to detect “hidden” data points)
Performing experiments
The Experimenter makes it easy to compare
the performance of different learning
schemes
For classification and regression problems
Results can be written to a file or a database
Evaluation options: cross-validation,
learning curve
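The Experimenter itself is a GUI tool; as a rough programmatic stand-in (not the Experimenter), two schemes can be compared on the same cross-validation folds with the Evaluation class. The dataset name and the choice of schemes are assumptions:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareSchemesSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // placeholder dataset
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] schemes = { new J48(), new NaiveBayes() };
        for (Classifier scheme : schemes) {
            Evaluation eval = new Evaluation(data);
            // Same seed => same folds for both schemes
            eval.crossValidateModel(scheme, data, 10, new Random(1));
            System.out.printf("%s: %.2f%% correct%n",
                    scheme.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}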
Resources:
WEKA is available at
http://www.cs.waikato.ac.nz/ml/weka
The site also has a list of projects based on WEKA
Tutorial:
http://prdownloads.sourceforge.net/weka/weka.ppt