Data Mining at UVA
New Horizons in Teaching and Learning Conference
May 21-24, 2007
Kathy Gerber, ITC Research Computing
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data (e.g., GEOSS)
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in Hypothesis Formation
Mining Large Data Sets - Motivation
• There is often information “hidden” in the data that is
not readily evident
• Human analysts may take weeks to discover useful
information
• Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000.]
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
What is Data Mining?
• Many Definitions
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Summary of SAS DM Process - SEMMA
• Sample the data by creating one or more data tables.
The sample should be large enough to contain the
significant information, yet small enough to process.
• Explore the data by searching for anticipated
relationships, unanticipated trends, and anomalies in
order to gain understanding and ideas.
• Modify the data by creating, selecting, and transforming
the variables to focus the model selection process.
• Model the data by using the analytical tools to search for
a combination of the data that reliably predicts a desired
outcome.
• Assess the data by evaluating the usefulness and
reliability of the findings from the data mining process.
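The five SEMMA steps can be sketched in a few lines of code. This is only an illustration under stated assumptions: it is written in Python rather than SAS Enterprise Miner, the customer data, the variable names (`population`, `monthly spend`, `churn`) and the 50-unit threshold are all invented, and the "model" is a single hand-written rule, assessed on the same sample for brevity.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of (monthly_spend, is_churner) records.
# The values are invented for illustration and are not from the talk.
population = [(spend, spend < 40) for spend in range(1, 201)]

# Sample: a subset large enough to be informative, small enough to process.
sample = random.sample(population, 50)

# Explore: simple summaries to gain understanding and ideas.
spends = [s for s, _ in sample]
print("mean spend:", statistics.mean(spends))
print("churn rate:", sum(c for _, c in sample) / len(sample))

# Modify: derive a variable that focuses the model (a low-spend flag).
modified = [(s < 50, c) for s, c in sample]

# Model: a one-rule model that predicts churn from the derived flag.
def predict(low_spend):
    return low_spend

# Assess: how often does the rule agree with the known outcome?
accuracy = sum(predict(flag) == c for flag, c in modified) / len(modified)
print("accuracy:", accuracy)
```

In a real project each step would be far richer, and the assessment would normally use data that was not used to build the model.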
What is (not) Data Mining?
What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure: Data Mining shown at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems.]
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for class attribute as a function of the
values of other attributes.
• Goal: previously unseen records should be assigned a
class as accurately as possible.
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
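The train/test procedure described above can be sketched as follows. This is a minimal illustration, not any particular tool's method: it uses Python, the helper names (`split`, `learn_majority_class`, `accuracy`) and the records are hypothetical, and the "model" is the simplest possible baseline, which always predicts the most common class in the training set.

```python
import random
from collections import Counter

def split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and divide it into training and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def learn_majority_class(training_set):
    """Build a trivial model: always predict the most common training class."""
    majority = Counter(cls for _, cls in training_set).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(model(attrs) == cls for attrs, cls in test_set)
    return correct / len(test_set)

# Hypothetical records of the form (attributes, class); values are invented.
records = [({"income": 40 + 10 * i}, "Yes" if i % 3 == 0 else "No")
           for i in range(20)]

training_set, test_set = split(records)
model = learn_majority_class(training_set)
print(f"accuracy on unseen records: {accuracy(model, test_set):.2f}")
```

Any real classifier should beat this baseline on the test set; if it does not, the attributes are probably not carrying useful information about the class.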
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Illustrating Classification Task
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: Induction: a learning algorithm learns a model from the training set. Deduction: the model is applied to the test set.]
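The induction and deduction steps can be tried on the records from this slide. The sketch below is an assumption-laden illustration: it is written in Python, the learner is a simple one-rule method (pick the single categorical attribute whose value-to-majority-class rule makes the fewest training errors), which the slide does not prescribe, and it ignores the numeric Attrib3 to stay short. The function names are hypothetical.

```python
from collections import Counter, defaultdict

# Training set from the slide: (Tid, Attrib1, Attrib2, Attrib3 in K, Class).
training_set = [
    (1, "Yes", "Large", 125, "No"),   (2, "No", "Medium", 100, "No"),
    (3, "No", "Small", 70, "No"),     (4, "Yes", "Medium", 120, "No"),
    (5, "No", "Large", 95, "Yes"),    (6, "No", "Medium", 60, "No"),
    (7, "Yes", "Large", 220, "No"),   (8, "No", "Small", 85, "Yes"),
    (9, "No", "Medium", 75, "No"),    (10, "No", "Small", 90, "Yes"),
]

# Test set from the slide; the class is unknown.
test_set = [
    (11, "No", "Small", 55),  (12, "Yes", "Medium", 80),
    (13, "Yes", "Large", 110), (14, "No", "Small", 95),
    (15, "No", "Large", 67),
]

CATEGORICAL = {"Attrib1": 1, "Attrib2": 2}  # attribute name -> column index

def learn_one_rule(records):
    """Induction: choose the attribute whose value -> majority-class rule
    misclassifies the fewest training records."""
    best = None
    for name, col in CATEGORICAL.items():
        counts = defaultdict(Counter)
        for rec in records:
            counts[rec[col]][rec[-1]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rec[-1] != rule[rec[col]] for rec in records)
        if best is None or errors < best[0]:
            best = (errors, name, col, rule)
    return best

def apply_model(model, record):
    """Deduction: look up the record's attribute value in the learned rule.
    Raises KeyError for a value never seen in training."""
    _, _, col, rule = model
    return rule[record[col]]

model = learn_one_rule(training_set)
predictions = {rec[0]: apply_model(model, rec) for rec in test_set}
print("chosen attribute:", model[1], "training errors:", model[0])
print(predictions)
```

On this data the rule is built on Attrib2 and misclassifies 2 of the 10 training records. A decision tree that also used Attrib1 and Attrib3 could fit the training set more closely, at the cost of a more complex model.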
Software Demonstrations
• SAS Enterprise Miner
• R Rattle
• Weka
SAS Enterprise Miner
Screenshot – EM Tutorial Workflow
R Rattle
• Install R 2.5.0
• > source("http://www.ggobi.org/downloads/install.r")
• > install("rattle", dep=TRUE)
Weka
Slide Credits
• R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
• SAS Enterprise Miner tutorial
• Eibe Frank, “Machine Learning with Weka”
• Tan, Steinbach, Kumar, “Introduction to Data Mining”
Versions and References for Software Used Today
• SAS 9.1.3 EAS with Enterprise Miner
– UVA licensed software
– http://rescomp.virginia.edu
• R 2.5.0 with Rattle (open source)
• Weka (open source)
– Ian Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
• Not demonstrated, but see also Insightful Miner and Orange