Data Mining at UVA: New Horizons in Teaching and Learning Conference

This document summarizes a presentation on data mining at UVA. It discusses the commercial and scientific motivations for data mining, including finding patterns in large datasets. Data mining can help discover useful information that humans may miss. The presentation covers classification techniques like decision trees and neural networks. It provides examples of software for data mining, demonstrating SAS Enterprise Miner, R with the Rattle package, and Weka.

Data Mining at UVA
New Horizons in Teaching and Learning Conference
May 21-24, 2007
Kathy Gerber, ITC Research Computing
[email protected]
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data (e.g., GEOSS)
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation
Mining Large Data Sets - Motivation
• There is often information “hidden” in the data that is not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 plotted against the number of analysts, 1995-1999; vertical axis from 0 to 4,000,000.]
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
What is Data Mining?
• Many definitions
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Summary of SAS DM Process - SEMMA
• Sample the data by creating one or more data tables. The sample should be large enough to contain the significant information, yet small enough to process.
• Explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.
• Modify the data by creating, selecting, and transforming the variables to focus the model selection process.
• Model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.
• Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.
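
The five SEMMA steps can be sketched in a few lines of plain Python. This is only an illustration of the workflow, not what SAS Enterprise Miner does internally: the data are synthetic, and every name in it (`make_records`, `learn_threshold`, the cutoff of 50) is an assumption invented for the example.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def make_records(n):
    """Synthetic (hypothetical) data: one numeric input, class 1 when it exceeds 50."""
    return [(x, int(x > 50)) for x in (rng.uniform(0, 100) for _ in range(n))]

population = make_records(5000)

# Sample: a subset large enough to be informative, small enough to process.
sample = rng.sample(population, 500)
train, holdout = sample[:350], sample[350:]

# Explore: summary statistics of the input and the class balance.
xs = [x for x, _ in train]
summary = {
    "min": min(xs),
    "mean": statistics.mean(xs),
    "max": max(xs),
    "share_class_1": sum(y for _, y in train) / len(train),
}

# Modify: rescale the input to [0, 1] using the training range only.
lo, hi = min(xs), max(xs)

def scale(x):
    return (x - lo) / (hi - lo)

# Model: choose the cutoff on the scaled input with the fewest training errors.
def learn_threshold(rows):
    candidates = sorted(scale(x) for x, _ in rows)

    def errors(t):
        return sum((scale(x) > t) != bool(y) for x, y in rows)

    return min(candidates, key=errors)

threshold = learn_threshold(train)

# Assess: accuracy on records the model has not seen.
def accuracy(rows, t):
    return sum((scale(x) > t) == bool(y) for x, y in rows) / len(rows)

holdout_accuracy = accuracy(holdout, threshold)
print(summary)
print(f"threshold={threshold:.3f}  holdout accuracy={holdout_accuracy:.3f}")
```

Keeping the holdout records out of the Explore, Modify, and Model steps is the design choice that makes the Assess figure meaningful.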
What is (not) Data Mining?
What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
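
As a rough illustration of this definition, the sketch below divides a small labeled data set into training and test sets, builds a model from the training set only, and measures accuracy on the test set. The records, the attribute names `outlook` and `windy`, and the one-attribute rule learner are all hypothetical choices made for this example; none of them come from the presentation.

```python
import random
from collections import Counter, defaultdict

def label_for(outlook, windy):
    """Hypothetical rule used only to generate example class labels."""
    if outlook == "rainy" or (outlook == "overcast" and windy):
        return "stay in"
    return "go out"

# Hypothetical records: a dict of attributes plus a class label.
records = [
    ({"outlook": o, "windy": w}, label_for(o, w))
    for o in ("sunny", "overcast", "rainy")
    for w in (True, False)
    for _ in range(5)
]

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

def learn_one_attribute_rule(rows, attribute):
    """Model: map each value of one attribute to its most common class."""
    by_value = defaultdict(Counter)
    for attrs, label in rows:
        by_value[attrs[attribute]][label] += 1
    default = Counter(label for _, label in rows).most_common(1)[0][0]
    table = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    return lambda attrs: table.get(attrs[attribute], default)

def accuracy(model, rows):
    """Fraction of rows whose predicted class matches the recorded class."""
    return sum(model(attrs) == label for attrs, label in rows) / len(rows)

train, test = train_test_split(records)
model = learn_one_attribute_rule(train, "outlook")  # built from the training set only
print(f"train accuracy: {accuracy(model, train):.2f}")
print(f"test accuracy:  {accuracy(model, test):.2f}")
```

Accuracy on the training set tends to be optimistic, which is why the definition asks for a separate test set.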
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Illustrating Classification Task

Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: Training Set -> Learning algorithm (Induction: Learn Model) -> Model -> Apply Model (Deduction) -> Test Set]
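
The induction and deduction steps can be tried on the two tables above. The slide does not say which learning algorithm is used, so the sketch below swaps in a small decision tree that picks splits by Gini impurity; the function names (`induce`, `deduce`, `candidate_tests`) and the choice of Gini are assumptions made for illustration.

```python
from collections import Counter

# Training and test sets from the slide; Attrib3 is in thousands (125K -> 125).
TRAIN = [
    ("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"), ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"), ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"), ("No", "Small", 90, "Yes"),
]
TEST = [("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
        ("No", "Small", 95), ("No", "Large", 67)]

def gini(rows):
    """Gini impurity of the class labels (last field) in rows."""
    counts = Counter(row[-1] for row in rows)
    return 1.0 - sum((c / len(rows)) ** 2 for c in counts.values())

def candidate_tests(rows):
    """Yield (attribute index, value, is_numeric) for every possible binary test."""
    for i in range(len(rows[0]) - 1):
        values = sorted({row[i] for row in rows})
        if isinstance(values[0], str):
            for v in values:
                yield i, v, False             # test: attribute == v
        else:
            for a, b in zip(values, values[1:]):
                yield i, (a + b) / 2, True    # test: attribute <= midpoint

def passes(record, test):
    i, v, numeric = test
    return record[i] <= v if numeric else record[i] == v

def induce(rows):
    """Induction: grow a tree until each leaf holds a single class."""
    majority = Counter(r[-1] for r in rows).most_common(1)[0][0]
    if gini(rows) == 0.0:
        return majority                       # leaf: class label
    best = None
    for test in candidate_tests(rows):
        left = [r for r in rows if passes(r, test)]
        right = [r for r in rows if not passes(r, test)]
        if left and right:
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, test, left, right)
    if best is None:                          # records cannot be told apart
        return majority
    _, test, left, right = best
    return (test, induce(left), induce(right))

def deduce(tree, record):
    """Deduction: follow the tests down to a leaf and return its class."""
    while isinstance(tree, tuple):
        test, yes_branch, no_branch = tree
        tree = yes_branch if passes(record, test) else no_branch
    return tree

model = induce(TRAIN)
predictions = [deduce(model, record) for record in TEST]
for tid, (record, label) in enumerate(zip(TEST, predictions), start=11):
    print(tid, record, "->", label)
```

Because the tree is grown until every leaf is pure, it reproduces the training labels exactly; with only ten records that says little about how well it generalizes to the test set.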
Software Demonstrations
• SAS Enterprise Miner
• R Rattle
• Weka
SAS Enterprise Miner
Screenshot – EM Tutorial Workflow
R Rattle
• Install R 2.5.0
• > source("http://www.ggobi.org/downloads/install.r")
• > install("rattle", dep=TRUE)
Weka
Slide Credits
• R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
• SAS Enterprise Miner tutorial
• Eibe Frank, “Machine Learning with Weka”
• Tan, Steinbach, Kumar, “Introduction to Data Mining”

Versions and References for Software Used Today
• SAS 9.1.3 EAS with Enterprise Miner
– UVA licensed software
– http://rescomp.virginia.edu
• R 2.5.0 with Rattle
– Open source
• Weka (open source)
– Ian Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
• Not demonstrated but also see Insightful Miner and Orange
