Artificial Intelligence
Lab 8
Machine Learning Algorithms
ID3
DBSCAN
Agenda
Decision tree.
• ID3
Clustering
• DBSCAN Algorithm.
Decision Trees
• The idea is to partition input space into a
disjoint set of regions and to use a very
simple predictor for each region.
• For classification simply predict the most
frequent class in the region
Play tennis training data
• Hard to guess.
• Divide & Conquer:
• Split into subsets.
• Are they pure? (all yes or all no)
• If yes: stop.
• If no: repeat.
• See which subset new data falls into.
New Data
D15 Rain High Weak ?
Decision Tree Representation
• Each internal node tests an attribute.
• Each branch corresponds to attribute
value.
• Each leaf node makes a prediction.
Outlook
• Branches: Sunny, Overcast, Rain
Outlook
• Sunny: Humidity (High, Normal)
• Overcast
• Rain: Wind (Weak, Strong)
Decision tree with class counts (Yes/No):
Outlook (9/5)
• Sunny (2/3): Humidity
  • High (0/3): No
  • Normal (2/0): Yes
• Overcast (4/0): Yes
• Rain (3/2): Wind
  • Weak (3/0): Yes
  • Strong (0/2): No
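To make the representation concrete, here is a minimal Python sketch of the play-tennis tree stored as nested dictionaries, used to classify the new example D15 (Rain, High, Weak). The dictionary layout and the names `tree`, `predict` and `d15` are illustrative assumptions, not part of the lab template.

```python
# Internal node: {"attribute": name, "branches": {value: subtree}}.
# Leaf: the predicted class label as a string.
tree = {
    "attribute": "Outlook",
    "branches": {
        "Sunny": {"attribute": "Humidity",
                  "branches": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"attribute": "Wind",
                 "branches": {"Weak": "Yes", "Strong": "No"}},
    },
}


def predict(node, example):
    """Follow the branch matching the example's value until a leaf is reached."""
    while isinstance(node, dict):
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node


# New data from the slides: D15 Rain High Weak ?
d15 = {"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}
print(predict(tree, d15))  # Yes
```

D15 falls into the Rain subset, and within it into the pure Weak subset, so the prediction is Yes.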
Which attribute to split on
Entropy
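• $H(\text{Outlook}) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14}$
• $H(\text{Sunny}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}$
• $H(\text{Overcast}) = -\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4}$
• $H(\text{Rain}) = -\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}$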
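The four entropies above can be evaluated numerically with a short Python sketch. The helper name `entropy` and the count-list input format are assumptions made for illustration; the sketch uses the usual convention that a class with zero examples contributes 0, which is what makes the Overcast subset come out as exactly 0.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a list of class counts, e.g. [9, 5] for 9 Yes / 5 No.

    Classes with a zero count are skipped (convention: 0 * log2(0) = 0).
    """
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


# [Yes, No] counts taken from the slides.
subsets = [("Outlook", [9, 5]), ("Sunny", [2, 3]),
           ("Overcast", [4, 0]), ("Rain", [3, 2])]
for name, counts in subsets:
    print(f"H({name}) = {entropy(counts):.3f}")
```

The whole set gives about 0.940 bits, Sunny and Rain about 0.971 bits each, and the pure Overcast subset 0.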
Information Gain
Want many items in pure sets.
Expected drop in entropy after split:
Wind Example
H(S strong)
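• $H(\text{Outlook}) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14}$
• $\text{Gain}(\text{Outlook}) = H(\text{Outlook}) - \sum_{v \in \text{Outlook}} \frac{|S_v|}{|S|} H(S_v)$
• $\text{Gain}(\text{Outlook}) = H(\text{Outlook}) - \left(\frac{5}{14} H(\text{Sunny}) + \frac{4}{14} H(\text{Overcast}) + \frac{5}{14} H(\text{Rain})\right)$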
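As a quick check of the Gain(Outlook) expression, the sketch below plugs in the Yes/No counts of the three Outlook subsets. The function names `entropy` and `information_gain` and the count-based interface are illustrative assumptions.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a list of class counts (zero counts contribute 0)."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


def information_gain(parent_counts, subset_counts):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    total = sum(parent_counts)
    remainder = sum(sum(s) / total * entropy(s) for s in subset_counts)
    return entropy(parent_counts) - remainder


# [Yes, No] counts for each Outlook value, as given on the slides.
outlook_subsets = {"Sunny": [2, 3], "Overcast": [4, 0], "Rain": [3, 2]}
gain = information_gain([9, 5], list(outlook_subsets.values()))
print(f"Gain(Outlook) = {gain:.4f}")
```

This evaluates to roughly 0.2467, which the slides report as 0.246.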
Similarly,
• Gain(Outlook) = 0.246
• Gain(Humidity) = 0.151
• Gain(Wind) = 0.048
Note: Highest gain is always selected, so choose Outlook to split on.
ID3 Algorithm
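The following is a minimal Python sketch of ID3 as described in the earlier slides: stop when a subset is pure, otherwise split on the attribute with the highest information gain and repeat on each subset. The training table is not reproduced in the slide text, so the 14 rows below are the standard play-tennis dataset and are assumed to match the table used in the lab; the function names and the nested-dict tree format are also illustrative assumptions.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
# Assumed data: the standard 14-day play-tennis table (D1..D14).
ROWS = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No""".splitlines()
DATA = [dict(zip(ATTRS + ["Play"], row.split())) for row in ROWS]


def entropy(rows, target):
    """Entropy in bits of the target labels in rows."""
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return sum(c / n * log2(n / c) for c in counts.values())


def gain(rows, attr, target):
    """Expected drop in entropy after splitting rows on attr."""
    n = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(rows, target) - remainder


def id3(rows, attrs, target):
    """Return a leaf label or a nested dict {attribute: {value: subtree}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:          # pure subset: stop
        return labels[0]
    if not attrs:                      # no attribute left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], rest, target)
                   for value in sorted({r[best] for r in rows})}}


tree = id3(DATA, ATTRS, "Play")
print(tree)
```

On this data the sketch splits on Outlook at the root, then on Humidity under Sunny and on Wind under Rain, which is the tree shown in the earlier slides.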
Example: tearRate (IG = 0.548)
• Branches: Normal (0), Reduced (1)
• Output: No contact lenses (0)
What is Clustering?
In general, a grouping of objects such that the objects in a group (cluster) are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• Intra-cluster distances are minimized.
• Inter-cluster distances are maximized.
DBSCAN: Density-Based Clustering
DBSCAN is a density-based clustering algorithm.
Reminder: In density-based clustering we partition points into dense regions separated by not-so-dense regions.
Important Questions:
• How do we measure density?
• What is a dense region?
DBSCAN:
• Density at point p: number of points within a circle of radius Eps
• Dense Region: A circle of radius Eps that contains at least
MinPts points
DBSCAN model parameters
Eps: the radius of the neighborhood around a point x, called the epsilon-neighborhood of x.
MinPts: the minimum number of neighbors within the Eps radius.
Figure: Eps radius around a point, with MinPts = 4.
DBSCAN
Characterization of points
Density = number of points within a specified radius r (Eps)
• A point is a core point if it has at least a specified number of points (MinPts) within Eps.
• These points belong to a dense region and are in the interior of a cluster.
• A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point.
• A noise point is any point that is neither a core point nor a border point.
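The three point types can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lab template: the function name `classify_points` is hypothetical, distances are Euclidean, a point counts as a member of its own Eps-neighborhood, and "core" means at least MinPts points within Eps.

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    # Eps-neighborhood of each point (assumption: the point itself is included).
    neighbors = [[j for j, q in enumerate(points) if dist(p, q) <= eps]
                 for p in points]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}

    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):   # near a core point, but not dense itself
            labels.append("border")
        else:
            labels.append("noise")
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
print(classify_points(points, eps=1.5, min_pts=4))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point (2, 1) has only three points within Eps, itself included, but lies next to core points, so it is a border point; (8, 8) has no core point nearby and is noise.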
DBSCAN: Core, Border, and Noise Points
Figure panels: Original Points; Point types: core, border and noise. Eps = 10, MinPts = 4.
Density-Connected points
Density edge
• We place an edge between two core points q and p if they are within distance Eps.
Density-connected
• A point p is density-connected to a point q if there is a path of edges from p to q.
DBSCAN Algorithm
Label points as core, border and noise
Eliminate noise points
For every core point p that has not been
assigned to a cluster
• Create a new cluster with the point p and all
the points that are density-connected to p.
Assign border points to the cluster of the
closest core point.
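The steps above can be sketched in Python as follows. This is one possible reading of the slide, not the lab template's DBSCAN and Expand functions: the name `dbscan`, the use of -1 for noise, Euclidean distance, and counting a point in its own Eps-neighborhood are all assumptions.

```python
from math import dist


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    n = len(points)
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    # Step 1: label core points (border and noise are resolved at the end).
    core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n
    cluster = 0

    # Step 2: each unassigned core point starts a cluster that grows over
    # all core points density-connected to it.
    for p in range(n):
        if not core[p] or labels[p] != -1:
            continue
        labels[p] = cluster
        stack = [p]
        while stack:
            q = stack.pop()
            for r in neighbors[q]:
                if core[r] and labels[r] == -1:
                    labels[r] = cluster
                    stack.append(r)
        cluster += 1

    # Step 3: border points join the cluster of their closest core point;
    # points with no core point within Eps stay labelled -1 (noise).
    for b in range(n):
        if not core[b]:
            near = [c for c in neighbors[b] if core[c]]
            if near:
                closest = min(near, key=lambda c: dist(points[b], points[c]))
                labels[b] = labels[closest]
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8),
          (5, 5), (5, 6), (6, 5), (6, 6)]
print(dbscan(points, eps=1.5, min_pts=4))
# [0, 0, 0, 0, 0, -1, 1, 1, 1, 1]
```

The pairwise neighborhood computation is quadratic in the number of points, which is acceptable for a small lab dataset.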
When DBSCAN Works Well
Original Points
Clusters
• Resistant to Noise
• Can handle clusters of different shapes and sizes
Advantages & Disadvantages of DBSCAN
Advantages:
• Unlike K-means, DBSCAN does not require the number of clusters to be specified in advance.
• Finds clusters of any shape.
• Can identify outliers.
Disadvantages:
• Does not work well with high-dimensional datasets.
• Parameter selection (Eps and MinPts) is tricky.
Hands on
Open the DBSCAN algorithm template and complete the DBSCAN and Expand functions.
Questions?