Informatics: Lecture 6 - Processing Information

This document provides an overview of processing data as part of a lecture on data mining. It discusses common issues with data like errors, outliers, and calibration needs. For analog data like audio and video, preprocessing techniques are described like stretching, equalizing, and various filtering methods. The goals of preprocessing are to clean the data and prepare it for more advanced data mining techniques later in the analysis process.

Uploaded by

Colin

Informatics

Lecture 6: Processing Information

Introduction
We have no shortage of data about almost
anything of interest
A well designed database can make that data
easy to access
The use of SQL can do simple interrogations of
the data
A huge amount of useful information lies
hidden, however: hence the need for data mining

Introduction
So in this lecture we will look at the elements
of data mining
We will begin, however, by looking at simple
ways in which our original data may be
processed so that the more complex stages
later on are not compromised

Processing data
Regardless of the source of the data we can
encounter a number of issues:
Errors: some data is wrong due to a fault or a
simple transcription error.
Outliers: some data is very different to the
rest; can be significant if true
Calibration: the data may need to be
converted to a physical quantity to check

Processing data
Test artifact: it is sometimes possible to
include an object in the data collection whose
properties are well known; we can then
check what has been recorded

Processing data
With data that begins as analogue, especially
audio and video, there are a number of
processing methods that can be used to prepare
the data for later stages:
Stretch: if the data can range from 0-100 but
we only record 0-20 we can stretch the data
to use the whole range
Equalise: we can modify a range of 20-60 to
use 0-100
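
Both operations above can be read as a linear remap from one range onto another. The following is a minimal sketch under that reading, not a definitive implementation: the function name `stretch`, its parameters and the sample readings are all illustrative assumptions and do not come from the lecture.

```python
def stretch(values, in_min, in_max, out_min=0.0, out_max=100.0):
    """Linearly map values from [in_min, in_max] onto [out_min, out_max]."""
    if in_max == in_min:
        raise ValueError("input range must have non-zero width")
    scale = (out_max - out_min) / (in_max - in_min)
    return [out_min + (v - in_min) * scale for v in values]


# Stretch: hypothetical readings recorded in 0-20 are spread over 0-100.
stretched = stretch([0, 5, 10, 20], in_min=0, in_max=20)

# Equalise (as described on the slide): readings in 20-60 are remapped to 0-100.
equalised = stretch([20, 30, 40, 60], in_min=20, in_max=60)

print(stretched)
print(equalised)
```

The same helper covers both cases because only the input range differs; the output range defaults to 0-100 to match the slide.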

Processing data
Filtering
Low pass filter: hiss and noise
High pass filter: rumble and hum
Band pass: selective filtering
Averaging: to smooth noisy data and prevent
data spikes
Enhancements: a huge range in images for
deblur, distortion and feature extraction
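
The averaging step can be illustrated with a simple moving average, which also behaves as a crude low pass filter. This is a sketch only: the name `moving_average`, the window size and the sample readings are assumptions made for illustration.

```python
def moving_average(samples, window=3):
    """Smooth a sequence by averaging each run of `window` consecutive samples."""
    if window < 1 or window > len(samples):
        raise ValueError("window must be between 1 and len(samples)")
    return [
        sum(samples[i:i + window]) / window
        for i in range(len(samples) - window + 1)
    ]


# Hypothetical readings containing a single spike.
noisy = [10, 10, 40, 10, 10]
smoothed = moving_average(noisy, window=3)
print(smoothed)
```

A wider window smooths more strongly but also blurs genuine rapid changes, so the window size is a trade-off rather than a fixed choice.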

Examples

What is data mining?

The non-trivial extraction of implicit,
previously unknown and potentially useful
knowledge from data
KDD: a process of Knowledge Discovery in
Databases
Associated areas are Statistics, SQL, Machine
Learning, AI and Expert Systems

Knowledge is power
Remember the hierarchy that we aspire to work
through:

Data: facts and figures; accuracy important
Information: organised data for analysis
Knowledge: interpretation to inform action

Application areas

Insurance: claim analysis and risk
Medical: diagnosis and preventative medicine
Banking: identifying fraud
Marketing: new customers and sales
Science: human genome project
Security: identify behaviours
Business intelligence: trends and threats

Scope of data mining

Data mining can try to use data in a variety of
ways using sophisticated mathematical
techniques:
Classification
Estimation
Clustering
Association

Classification
Use data to predict the category of an object
e.g. someone to lend money to or perhaps
arrest or perhaps someone who will make a
certain kind of purchase etc.
The result of a classification problem can be a
decision tree which shows how a new object
can be classified on the basis of the existing
data

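Classification
Data

age         cartype      risk
[missing]   saloon       low
[missing]   sports       low
[missing]   saloon       low
[missing]   hatchback    high
[missing]   saloon       low
[missing]   hatchback    high
[missing]   hatchback    low
[missing]   sports       high
[missing]   saloon       low

Decision tree (figure)
Age: branches <= 25 and > 25
Car Type: saloon -> low risk; sports, hatchback -> high risk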

Estimation
Similar to classification in that a model is
created
The model allows the output of a continuous
variable to be predicted
The model could be a mathematical function
to predict a value or could be a theorem
which then also predicts a value or perhaps
even a behaviour.
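
One simple case of "a mathematical function to predict a value" is a straight line fitted by least squares. The sketch below is an illustration under that assumption; the lecture does not name a specific model, and `fit_line` and the observations are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Use the fitted model to estimate the continuous output for an unseen input.
predicted = slope * 5 + intercept
print(slope, intercept, predicted)
```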

Clustering
Can we analyse the data for a set of objects
and identify sub-groups and their membership?
We may know the sub-groups and some
existing members and want to know what
data helps identify which cluster a new object
will belong to.

Clustering

Reproduced from Adriaans and Zantinge

Clustering: K-means example

The general idea of a clustering technique is to divide
the population into partitions
Starts with an initial random selection of K partitions
Then points are moved into each partition using a
centroid calculation and a similarity measure in an
iterative process until the final set of clusters stabilises
The final set is then evaluated
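
The steps above can be sketched in a few lines. This is a minimal illustration, not the lecture's own implementation: it assumes 2-D points, takes K randomly chosen points as the initial centroids (one common way to make the "initial random selection"), and uses squared Euclidean distance as the similarity measure. The function name `k_means`, the `seed` parameter and the sample points are hypothetical.

```python
import random


def k_means(points, k, seed=0, max_iter=100):
    """Cluster 2-D points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial random selection
    labels = None
    for _ in range(max_iter):
        # Similarity measure: squared Euclidean distance to each centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        if new_labels == labels:  # the clusters have stabilised
            break
        labels = new_labels
        # Centroid calculation: mean of the points assigned to each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels


# Hypothetical data: two well separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

The final evaluation step mentioned on the slide is not shown; in practice the result depends on the random starting points, so several runs are usually compared.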

Association
Seeking co-occurrences of groups of data
items in a data set
Association can be in time i.e. a sequential
pattern
Can be very popular with retailers to target
advertising for related purchases and for store
layouts

Association rules
Rules are of the form X => Y
where X and Y are distinct sets of items
Importance of a rule described by its
support and its confidence
Support: % of transactions containing X
and Y
Confidence: % of transactions with X that
also contain Y

Association rules
Venn diagram (figure): within all transactions, the transactions with X overlap the transactions with Y; the overlap is the transactions with X and Y

Support of X=>Y = Support of Y=>X = 3/10 = 30%
Confidence of X=>Y = 3/4 = 75%
Confidence of Y=>X = 3/5 = 60%

Association rules example

Transaction   Items bought
1             milk, eggs, tea
2             butter, milk, sugar, tea
3             biscuits, sugar, eggs
4             tea, coffee, eggs
5             coffee, chocolate, sugar

Rule                       Support, Confidence
Milk => Eggs               20%, 50%
Eggs => Tea                40%, 66.7%
sugar => {butter, milk}    20%, 33.3%
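
The figures in this example can be checked directly from the definitions of support and confidence. The sketch below uses the five transactions from the slide; the helper names `support` and `confidence` are illustrative and not taken from any particular library.

```python
transactions = [
    {"milk", "eggs", "tea"},
    {"butter", "milk", "sugar", "tea"},
    {"biscuits", "sugar", "eggs"},
    {"tea", "coffee", "eggs"},
    {"coffee", "chocolate", "sugar"},
]


def support(x, y, transactions):
    """Fraction of all transactions that contain every item of X and Y."""
    both = set(x) | set(y)
    return sum(both <= t for t in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    x, y = set(x), set(y)
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        raise ValueError("no transaction contains X")
    return sum(y <= t for t in with_x) / len(with_x)


rules = [
    ({"milk"}, {"eggs"}),
    ({"eggs"}, {"tea"}),
    ({"sugar"}, {"butter", "milk"}),
]
for x, y in rules:
    print(
        sorted(x), "=>", sorted(y),
        f"support={support(x, y, transactions):.1%}",
        f"confidence={confidence(x, y, transactions):.1%}",
    )
```

Support is symmetric in X and Y while confidence is not, which is why a rule and its reverse can share a support value yet differ in confidence.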

Association - issues
Number of rules grows exponentially with number
of items
User to specify
Minimum Support (e.g. 10%) and
Minimum Confidence (e.g. 70%) levels
Which rules are interesting? Define "interesting"
Negative rules can also be interesting
e.g. 70% buying crisps => do not buy cream
But absence implies millions of useless rules!
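
To give a feel for the exponential growth in the number of rules, the sketch below counts every candidate rule X => Y in which X and Y are non-empty and share no items. The brute-force count is compared with the closed form 3^d - 2^(d+1) + 1 for d distinct items, a standard combinatorial result that is not stated in the lecture. The function names are illustrative.

```python
from itertools import combinations


def count_rules(items):
    """Count rules X => Y where X and Y are non-empty, disjoint subsets of items."""
    items = list(items)
    total = 0
    for size_x in range(1, len(items)):
        for x in combinations(items, size_x):
            remaining = [i for i in items if i not in x]
            # Y can be any non-empty subset of the items not already in X.
            total += 2 ** len(remaining) - 1
    return total


def count_rules_closed_form(d):
    """Closed form for the same count: 3^d - 2^(d+1) + 1."""
    return 3 ** d - 2 ** (d + 1) + 1


for d in (3, 6, 10, 20):
    print(d, count_rules_closed_form(d))
```

Minimum support and minimum confidence thresholds are what keep this candidate space manageable in practice.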

Hierarchies
Items are grouped
e.g. pen, pencil are writing tools
Can have different rules for groups than for
individual items
e.g., strong positive association between
crisps and biscuits, but negative
associations lower in hierarchy
Use the hierarchy to define what is interesting
e.g. rules across groups can be more
interesting than rules within groups

Hierarchies
[Figure: hierarchy diagram linking Crisps and Biscuits, with associations labelled +ve and -ve]

Process
Cleansing, quality
Input data from repository -> Data pre-processing -> Mining patterns -> Data post-processing -> Output patterns
Redrawn from Du, p14

Pre-processing
We need to understand the
data that we are using: type
and quality
This will inform the mining
technique to be used
Data visualisation can also
inform the mining process

Target
Precise, inaccurate, biased
Precise, accurate, unbiased
Imprecise, inaccurate, biased
Imprecise, accurate, unbiased

DM vs. Query Tools

If you know what you want, use SQL (the database
query language)
SQL finds data under known constraints
SQL cannot readily find hidden knowledge
DM finds hidden nuggets
DM can find interesting patterns, irregularities and
optimal clusters
DM can use repeated SQL queries
DM gives more possibilities
DM requires a good foundation in the data

Reading
Hongbo Du (generally online resource)
Adriaans and Zantinge (a small book)
Witten & Frank (the WEKA software)
Christopher Westphal: Data mining for intelligence, fraud, & criminal detection: advanced analytics & information sharing technologies
Marcus Maloof: Machine Learning and Data Mining for Computer Security (e-book on Dawsonera)
