Informatics
Lecture 6
Processing Information
Introduction
We have no shortage of data about almost anything of interest.
A well-designed database can make that data easy to access.
SQL can be used for simple interrogations of the data.
A huge amount of useful information lies hidden, however: hence the need for data mining.
Introduction
So in this lecture we will look at the elements of data mining.
We will begin, however, by looking at simple ways in which our original data may be processed so that the more complex stages later on are not compromised.
Processing data
Regardless of the source of the data, we can encounter a number of issues:
Errors: some data is wrong due to a fault or a simple transcription error.
Outliers: some data is very different to the rest; can be significant if true.
Calibration: the data may need to be converted to a physical quantity to check.
Processing data
Test artifact: it is sometimes possible to include an object in the data collection whose properties are well known; we can then check what has been recorded.
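As an illustration of the outlier issue listed above, here is a minimal sketch that flags values lying far from the median. The function name flag_outliers, the threshold of 5 and the sample readings are all assumptions made for this example; they are not from the lecture. Flagged values still need a human decision, since an outlier can be significant if true.

```python
from statistics import median

def flag_outliers(values, threshold=5.0):
    """Return values further than threshold * MAD from the median.

    MAD is the median absolute deviation, a spread measure that is not
    itself distorted by the outliers we are looking for.
    """
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return []  # no spread at all, so nothing can be judged unusual
    return [v for v in values if abs(v - centre) > threshold * mad]

# Hypothetical sensor readings with one suspicious value.
readings = [10.1, 9.9, 10.0, 10.2, 55.0, 9.8]
outliers = flag_outliers(readings)
print(outliers)
```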
Processing data
With data that begins as analogue, especially audio and video, there are a number of processing methods that can be used to prepare the data for later stages:
Stretch: if the data can range from 0-100 but we only record 0-20, we can stretch the data to use the whole range.
Equalise: we can modify a range of 20-60 to use 0-100.
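Both of the range adjustments above can be sketched as one linear rescaling. The function name stretch and the sample values are illustrative assumptions; this is a minimal sketch of the idea, not a full image or audio processing routine.

```python
def stretch(values, in_min, in_max, out_min=0.0, out_max=100.0):
    """Linearly rescale values from [in_min, in_max] onto [out_min, out_max]."""
    if in_max == in_min:
        raise ValueError("input range must have non-zero width")
    scale = (out_max - out_min) / (in_max - in_min)
    return [out_min + (v - in_min) * scale for v in values]

# Data recorded in 0-20 is stretched to use the whole 0-100 range.
stretched = stretch([0, 5, 10, 20], 0, 20)

# A recorded range of 20-60 is remapped onto 0-100.
remapped = stretch([20, 40, 60], 20, 60)

print(stretched)  # [0.0, 25.0, 50.0, 100.0]
print(remapped)   # [0.0, 50.0, 100.0]
```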
Processing data
Filtering
  Low-pass filter: removes hiss and noise
  High-pass filter: removes rumble and hum
  Band-pass: selective filtering
Averaging: to smooth noisy data and prevent data spikes
Enhancements: a huge range in images for deblur, distortion and feature extraction
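The averaging idea above can be sketched as a centred moving average, which also behaves as a very simple low-pass filter. The function name moving_average, the window size and the sample data are assumptions chosen for illustration only.

```python
def moving_average(samples, window=3):
    """Centred moving average; at the edges only the available neighbours are used."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        neighbourhood = samples[lo:hi]
        smoothed.append(sum(neighbourhood) / len(neighbourhood))
    return smoothed

noisy = [10, 10, 40, 10, 10]  # hypothetical signal with a single spike
smooth = moving_average(noisy, window=3)
print(smooth)  # [10.0, 20.0, 20.0, 20.0, 10.0]
```

The spike of 40 is reduced to 20 but is also spread over its neighbours, which is the usual trade-off of averaging.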
Examples
What is data mining?
The non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data.
KDD: a process of Knowledge Discovery in Databases.
Associated areas are Statistics, SQL, Machine Learning, AI and Expert Systems.
Knowledge is power
Remember the hierarchy that we aspire to work through:
Data: facts and figures; accuracy important
Information: organised data for analysis
Knowledge: interpretation to inform action
Application areas
Insurance: claim analysis and risk
Medical: diagnosis and preventative medicine
Banking: identifying fraud
Marketing: new customers and sales
Science: human genome project
Security: identify behaviours
Business intelligence: trends and threats
Scope of data mining
Data mining can try to use data in a variety of ways using sophisticated mathematical techniques:
Classification
Estimation
Clustering
Association
Classification
Use data to predict the category of an object, e.g. someone to lend money to, or perhaps arrest, or perhaps someone who will make a certain kind of purchase, etc.
The result of a classification problem can be a decision tree, which shows how a new object can be classified on the basis of the existing data.
Classification
Data
age  cartype    risk
23   saloon     low
30   sports     low
36   saloon     low
25   hatchback  high
30   saloon     low
23   hatchback  high
30   hatchback  low
25   sports     high
18   saloon     low

Decision tree
Age > 25: low risk
Age <= 25:
  Car Type = saloon: low risk
  Car Type = sports, hatchback: high risk
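The decision tree above can be written as a small function and checked against the nine records in the table. The names RECORDS and classify_risk are illustrative assumptions; the tree here is typed in by hand rather than learned by a mining algorithm.

```python
# (age, cartype, risk) rows copied from the slide's data table.
RECORDS = [
    (23, "saloon", "low"), (30, "sports", "low"), (36, "saloon", "low"),
    (25, "hatchback", "high"), (30, "saloon", "low"), (23, "hatchback", "high"),
    (30, "hatchback", "low"), (25, "sports", "high"), (18, "saloon", "low"),
]

def classify_risk(age, cartype):
    """Follow the tree: test age first, then car type for younger drivers."""
    if age > 25:
        return "low"
    if cartype == "saloon":
        return "low"
    return "high"  # age <= 25 with a sports car or hatchback

# Count how many existing records the tree reproduces.
correct = sum(classify_risk(age, car) == risk for age, car, risk in RECORDS)
print(f"{correct} of {len(RECORDS)} records classified correctly")

# Classify a new, previously unseen object.
print(classify_risk(22, "sports"))  # high
```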
Estimation
Similar to classification in that a model is created.
The model allows the output of a continuous variable to be predicted.
The model could be a mathematical function to predict a value, or could be a theorem which then also predicts a value or perhaps even a behaviour.
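One simple case of a mathematical function used as an estimation model is a straight line fitted by least squares. The sketch below assumes that technique as an example; the function name fit_line and the age and premium figures are made up for illustration and do not come from the lecture.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: driver age against annual premium.
ages = [20, 30, 40, 50]
premiums = [500, 450, 400, 350]

slope, intercept = fit_line(ages, premiums)

# Use the model to estimate a continuous value for a new object.
predicted = slope * 35 + intercept
print(slope, intercept, predicted)
```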
Clustering
Can we analyse the data for a set of objects and identify sub-groups and their membership?
We may know the sub-groups and some existing members and want to know what data helps identify which cluster a new object will belong to.
Clustering
Reproduced from Adriaans and Zantinge
Clustering
K-means example
The general idea of a clustering technique is to divide the population into partitions.
It starts with an initial random selection of K partitions.
Points are then moved into each partition using a centroid calculation and a similarity measure, in an iterative process, until the final set of clusters stabilises.
The final set is then evaluated.
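The iterative process described above can be sketched in a few lines. This is a minimal illustration, assuming squared Euclidean distance as the similarity measure; the function names, the optional initial parameter and the sample points are assumptions made here. By default K random points are chosen as starting centroids, but the demo fixes them so the result is reproducible.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, initial=None, seed=0, max_iter=100):
    """Return (centroids, clusters) once the assignment stops changing."""
    rng = random.Random(seed)
    centroids = list(initial) if initial is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # clusters have stabilised
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    clusters = [[p for p, a in zip(points, assignment) if a == c] for c in range(k)]
    return centroids, clusters

# Hypothetical data: two visibly separate groups of 2-D points.
group_a = [(1, 1), (1, 2), (2, 1)]
group_b = [(8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(group_a + group_b, k=2, initial=[(1, 1), (9, 8)])
print(clusters)
```

Evaluating the final set, the last step on the slide, is left out; a common choice is to sum the squared distances from each point to its centroid.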
Association
Seeking co-occurrences of groups of data items in a data set.
Association can be in time, i.e. a sequential pattern.
Can be very popular with retailers to target advertising for related purchases and for store layouts.
Association rules
Rules are of the form X => Y, where X and Y are distinct sets of items.
Importance of a rule is described by its support and its confidence.
Support: % of transactions containing X and Y
Confidence: % of transactions with X that also contain Y
Association rules
[Venn diagram: within the set of all transactions, the transactions with X overlap the transactions with Y; the overlap is the transactions with X and Y]
Support of X=>Y = Support of Y=>X = 3/10 = 30%
Confidence of X=>Y = 3/4 = 75%
Confidence of Y=>X = 3/5 = 60%
Association rules example
Transaction  Items bought
1            milk, eggs, tea
2            butter, milk, sugar, tea
3            biscuits, sugar, eggs
4            tea, coffee, eggs
5            coffee, chocolate, sugar

Rule                      Support, Confidence
Milk => Eggs              20%, 50%
Eggs => Tea               40%, 66.7%
sugar => {butter, milk}   20%, 33.3%
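The figures in the example above can be reproduced directly from the definitions of support and confidence. The sketch below is illustrative; the names TRANSACTIONS, support and confidence are assumptions, and item names are written in lower case throughout.

```python
# The five example transactions, each as a set of items.
TRANSACTIONS = [
    {"milk", "eggs", "tea"},
    {"butter", "milk", "sugar", "tea"},
    {"biscuits", "sugar", "eggs"},
    {"tea", "coffee", "eggs"},
    {"coffee", "chocolate", "sugar"},
]

def support(x, y, transactions):
    """Fraction of all transactions that contain every item of X and Y."""
    both = x | y
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    with_x = [t for t in transactions if x <= t]
    if not with_x:
        return 0.0
    return sum(y <= t for t in with_x) / len(with_x)

rules = [
    ({"milk"}, {"eggs"}),
    ({"eggs"}, {"tea"}),
    ({"sugar"}, {"butter", "milk"}),
]
for x, y in rules:
    s = support(x, y, TRANSACTIONS)
    c = confidence(x, y, TRANSACTIONS)
    print(f"{sorted(x)} => {sorted(y)}: support {s:.1%}, confidence {c:.1%}")
```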
Association - issues
The number of rules grows exponentially with the number of items.
User to specify Minimum Support (e.g. 10%) and Minimum Confidence (e.g. 70%) levels.
Which rules are interesting? Define "interesting".
Negative rules can also be interesting
  e.g. 70% buying crisps => do not buy cream
  absence implies millions of useless rules!
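To give a feel for the exponential growth mentioned above, the sketch below counts candidate rules X => Y in which X and Y are both non-empty and share no items. Reading "distinct sets" as sets with no items in common is an assumption made here, and the function names are illustrative. Each item is either in X, in Y, or unused, which gives the closed form 3^n - 2^(n+1) + 1 for n items.

```python
from itertools import product

def count_rules_brute_force(n_items):
    """Try every labelling of the items and keep those with a non-empty X and Y."""
    count = 0
    for labels in product(("X", "Y", "unused"), repeat=n_items):
        if "X" in labels and "Y" in labels:
            count += 1
    return count

def count_rules(n_items):
    """Closed form: all labellings, minus those with empty X or empty Y."""
    return 3 ** n_items - 2 ** (n_items + 1) + 1

# The two counts agree on small cases.
for n in range(1, 7):
    assert count_rules_brute_force(n) == count_rules(n)

# The shopping example has 8 distinct items.
rules_for_8_items = count_rules(8)
print(rules_for_8_items)  # 6050
print(count_rules(20))    # already in the billions for 20 items
```

This is why minimum support and minimum confidence levels are needed: they prune most candidates before they are ever reported.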
Hierarchies
Items are grouped
e.g. pen, pencil are writing tools
Can have different rules for groups than for
individual items
e.g., strong positive association between
crisps and biscuits, but negative
associations lower in hierarchy
use to define interesting
e.g. rules across groups can be more
interesting than rules within groups
Hierarchies
[Figure: hierarchy diagram with nodes including Crisps and Biscuits, and links labelled +ve and -ve]
Process
Input data from repository -> Data pre-processing (cleansing, quality) -> Mining patterns -> Data post-processing -> Output patterns
Redrawn from Du, p14
Pre-processing
We need to understand the data that we are using: type and quality.
This will inform the mining technique to be used.
Data visualisation can also inform the mining process.
Target
Precise, inaccurate, biased
Precise, accurate, unbiased
imprecise, inaccurate, biased
imprecise, accurate, unbiased
DM vs. Query Tools
If you know what you want, use SQL (the database
query language)
SQL finds data under known constraints
SQL cannot readily find hidden knowledge
DM finds hidden nuggets
DM can find interesting patterns, irregularities and
optimal clusters
DM can use repeated SQL queries
DM gives more possibilities
DM requires a good foundation in the data
Reading
Hongbo Du (generally online resource)
Adriaans and Zantinge (a small book)
Witten & Frank (the WEKA software)
Christopher Westphal: Data Mining for Intelligence, Fraud, & Criminal Detection: Advanced Analytics & Information Sharing Technologies
Marcus Maloof: Machine Learning and Data Mining for Computer Security (e-book on Dawsonera)