Unsupervised ML: SOM
and how to choose a
Data Science method
Quick Review of the methods we learned
> Statistical analysis
> Supervised ML
– Linear regression
– NN
– KNN
– Decision Tree
– SVM
> Unsupervised ML
– K-Means Clustering
– PCA
– … Why another one?
Why another method: SOM?
> Demonstrate some data science methods that are not
widely used or well known, but can still be very useful
for materials informatics studies
> Introduce a method I have used and feel is well suited
to the unique character of many materials-study
applications
> Demonstrate how various data science methods can
be used together to drive improved results
> Demonstrate a few projects using the same methods
so that we can understand the methods from a user's
point of view
What is a Self-Organizing Map (SOM)?
> An Unsupervised ML method
> Dimensionality reduction, enabling powerful
visualizations of the data:
– K-Means does clustering, but performs neither dimensionality
reduction nor visualization
– PCA does dimensionality reduction, enabling visualization to a
certain degree (not applicable if the first 2-3 principal
components do not represent the data well); however, it
does not perform clustering, and the visualization does
not preserve the original topological information
> Gives some insight into how the data are clustered in high
dimensions
What is SOM
> You can think of SOM as an artificial neural network
with a single neuronal layer, whose neurons are
arranged in a two-dimensional matrix.
– The 2D matrix can be seen as a position map that
captures the characteristics of the data
> Merits of SOM
– Effective for training on big datasets
– Since the map is a 2D matrix, visualization of the resulting map
is possible
– Preserves the topology of the original data
– Makes it possible to present the Euclidean distances between
data points
Algorithm of SOM
– Normalization of the input data, so that all features contribute in a
more balanced way
– Initialization: each (x,y) position in the map is assigned a weight for each
input neuron, thus associating a weight vector with each map position.
– Iteration:
> Choose a sample from the dataset
> Calculate the Euclidean distance between that sample and each weight vector
> The (x,y) position "closest" to the sample is declared the Best Matching Unit (BMU)
> The weight vector of the BMU gets adjusted to more closely match the sample.
The amount of adjustment (learning) decreases as we go through the iterations
> The weight vectors of the BMU's neighbors also get adjusted, to a lesser extent.
How many neighbors are affected and how much they get adjusted also depend on
hyperparameters and the number of iterations.
– Convergence:
> Maximum number of iterations
> Monitoring of the topological error
(a minimal sketch of this training loop is given after this list)
– Reference: Kohonen (1982), https://link.springer.com/article/10.1007/BF00337288
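The following is a minimal NumPy sketch of the training loop described above; the grid size, linear decay schedules, and Gaussian neighborhood are illustrative assumptions, not the exact settings of the packages introduced later in this lecture.

```python
import numpy as np

def train_som(data, grid_h=10, grid_w=10, n_iter=5000,
              lr0=0.5, sigma0=3.0, seed=0):
    """Minimal SOM training loop following the steps listed above."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = data.shape

    # Normalization: zero mean, unit variance for every feature
    data = (data - data.mean(axis=0)) / (data.std(axis=0) + 1e-12)

    # Initialization: one weight vector per (x, y) map position
    weights = rng.random((grid_h, grid_w, n_features))
    yy, xx = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")

    for t in range(n_iter):
        # Learning rate and neighborhood radius shrink with the iterations
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1e-3

        # Pick a sample and find the Best Matching Unit (BMU)
        x = data[rng.integers(n_samples)]
        dist = np.linalg.norm(weights - x, axis=2)          # Euclidean distance
        bmu = np.unravel_index(np.argmin(dist), dist.shape)

        # Gaussian neighborhood: the BMU moves most, its neighbors move less
        grid_dist2 = (yy - bmu[0]) ** 2 + (xx - bmu[1]) ** 2
        influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
        weights += lr * influence[..., None] * (x - weights)

    return weights
```

Calling train_som on an (n_samples, n_features) array returns the trained weight grid that the later sketches refer to.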
Self-Organizing Map (SOM)
How does it work?
> Each input sample is a feature vector x_i = (a_i, b_i, c_i, d_i, e_i, f_i)
> Two-dimensional mesh structure
> Each connection can deform
[Figure: a sequence of animation frames showing the 2D mesh of nodes (1-12) being dragged toward the data points a-f as the iterations proceed]
Self-Organizing Map (SOM) Algorithm
> Dragging Nodes
> “Flattening a crumpled paper”
U-matrix and how to use it to get insights for
clustering
> After training, the nodes of the 2D map are not evenly
distributed in feature space: adjacent nodes might not be
similar to each other in the higher-dimensional space.
> The U-matrix uses the concept of a heatmap to illustrate the
distances between neighboring nodes in Euclidean space
(a sketch of how to compute it is given below).
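A minimal sketch of the U-matrix computation, assuming a trained weight grid of shape (grid_h, grid_w, n_features) such as the one returned by the earlier train_som sketch; MiniSom exposes the same idea as distance_map().

```python
import numpy as np
import matplotlib.pyplot as plt

def u_matrix(weights):
    """Average Euclidean distance from each node to its 4 grid neighbors."""
    h, w, _ = weights.shape
    u = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            dists = [np.linalg.norm(weights[i, j] - weights[ni, nj])
                     for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                     if 0 <= ni < h and 0 <= nj < w]
            u[i, j] = np.mean(dists)
    return u

# Example: large values mark boundaries between clusters,
# small values mark nodes sitting inside a cluster.
# weights = train_som(my_data)   # from the earlier sketch, or som.get_weights() in MiniSom
# plt.imshow(u_matrix(weights), cmap="bone_r"); plt.colorbar(); plt.show()
```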
Using SOM in conjunction with other methods
> Since SOM is a dimensionality-reduction method, for smaller
datasets you can initialize your SOM map using the first two
principal components, essentially the 2D PCA map
> K-Means can also be run on the same dataset, and the
corresponding clusters can be visualized on the SOM map.
K-Means clustering and U-Matrix
They can be compared to validate the results!
> SOM can provide a means to visualize K-Means!
> If the cluster boundaries match well, the training is
successful (a sketch of PCA initialization and the K-Means
overlay is given below)
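A minimal sketch of both ideas using scikit-learn; spreading the grid along the first two principal components and the cluster count are illustrative assumptions, not the recipe used by the augmented SOMPY package.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def pca_initial_weights(data, grid_h=10, grid_w=10):
    """Spread the initial weight vectors over the plane of the first two PCs."""
    pca = PCA(n_components=2).fit(data)
    spans = np.sqrt(pca.explained_variance_)        # scale along each component
    weights = np.zeros((grid_h, grid_w, data.shape[1]))
    for i, a in enumerate(np.linspace(-1, 1, grid_h)):
        for j, b in enumerate(np.linspace(-1, 1, grid_w)):
            weights[i, j] = (data.mean(axis=0)
                             + a * spans[0] * pca.components_[0]
                             + b * spans[1] * pca.components_[1])
    return weights

def kmeans_on_map(data, weights, n_clusters=4):
    """K-Means label for each sample plus its BMU, ready to color the SOM map."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(data)
    bmus = [np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=2)),
                             weights.shape[:2]) for x in data]
    return labels, bmus
```

Coloring each BMU by its K-Means label and overlaying that on the U-matrix lets the two sets of cluster boundaries be compared directly.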
Different Implementations of SOM
> SOM is just an algorithm; there are many
packages you can use that implement it
> We will introduce
– An augmented version of SOMPY, a version our group has
contributed to
– MiniSOM
The uniqueness and functions of augmented
SOMPY
https://github.com/DataScienceUWMSE/SOM
> Utilizes PCA for initialization, and includes a K-Means
clustering overlay
> "Heat maps" provide a way to visualize each
feature after training
> The projection function helps users find additional
correlations or patterns among features,
including for categorical data
“heatmap” concept
> Map each node's weight for one input variable onto the 2D map
> The number of heat maps equals the number of input variables
(a plotting sketch is given below)
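A minimal matplotlib sketch of the heat-map idea, assuming a trained weight grid of shape (grid_h, grid_w, n_features) such as the one from the earlier train_som sketch or MiniSom's get_weights(); the feature names in the usage comment are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_heatmaps(weights, feature_names):
    """One heat map per input variable: the k-th weight of every map node."""
    n_features = weights.shape[2]
    fig, axes = plt.subplots(1, n_features, figsize=(3 * n_features, 3))
    for k, ax in enumerate(np.atleast_1d(axes)):
        im = ax.imshow(weights[:, :, k], cmap="viridis")
        ax.set_title(feature_names[k])
        fig.colorbar(im, ax=ax, shrink=0.8)
    plt.tight_layout()
    plt.show()

# Example (hypothetical feature names):
# plot_heatmaps(weights, ["density", "modulus", "hardness", "price", "Tmax"])
```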
Example of utilizing the
heatmap on materials research
Example 1 Granta Data Set: Experimental Commercial Materials
Property Dataset
> Training data set
contains 398 commercial
materials and 21
numerical properties
Example of utilizing the heatmap on materials
research
Example 1 Granta Data Set: Experimental Commercial Materials
Property Dataset (continued)
Projection function concept
> Overlay one specific data property onto the SOM; even
categorical values can be used
> Patterns are easy to identify (a sketch is given below)
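A minimal sketch of the projection idea: each sample is placed at its BMU and annotated with one chosen property value. The jitter and figure size are illustrative assumptions; MiniSom's winner() can be used the same way to locate BMUs.

```python
import numpy as np
import matplotlib.pyplot as plt

def project_property(data, weights, labels, rng=np.random.default_rng(0)):
    """Write each sample's property value (numeric or categorical) at its BMU."""
    plt.figure(figsize=(6, 6))
    for x, label in zip(data, labels):
        dist = np.linalg.norm(weights - x, axis=2)
        i, j = np.unravel_index(np.argmin(dist), dist.shape)
        # small jitter keeps samples that share one node readable
        plt.text(j + 0.6 * rng.random(), i + 0.6 * rng.random(),
                 str(label), fontsize=8)
    plt.xlim(-1, weights.shape[1])
    plt.ylim(weights.shape[0], -1)   # match the usual row-major map orientation
    plt.show()
```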
Example of utilizing the projection function on
materials research
Example 1 Granta Data Set: Experimental Commercial Materials
Property Dataset (continued): finding what makes the outliers unique
Example of utilizing the projection function on
materials research
Example 2 OPV materials study using an experimental dataset
Reference: Y. Huang, J. Phys. Chem. C 2020, 124, 12871−12882
> The dataset includes 1203 donor polymers of Donor-Acceptor
pairs, with properties related to the efficiency of the
charge transfer.
Molecular Descriptors
Python Packages for Molecular Descriptors
> There are Python tools to extract molecular
structural or geometrical information from a
molecule's notation, such as SMILES (Simplified
Molecular-Input Line-Entry System)
> We will introduce Mordred (covered in the hands-
on session); a minimal usage sketch is given below
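A minimal Mordred sketch; the SMILES strings are placeholders, and the exact number of descriptor columns depends on the Mordred and RDKit versions installed.

```python
# pip install rdkit mordred
from rdkit import Chem
from mordred import Calculator, descriptors

smiles_list = ["c1ccccc1", "CCO"]                  # placeholders: benzene, ethanol
mols = [Chem.MolFromSmiles(s) for s in smiles_list]

# All 2D descriptors; ignore_3D avoids the need for 3D conformers
calc = Calculator(descriptors, ignore_3D=True)
df = calc.pandas(mols)                             # one row per molecule
print(df.shape)
```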
The advantage of using MiniSOM
> SOMPY is not as easy to use as the other packages
introduced in this class.
– The Augmented SOMPY has contributions from a few
Materials Science researchers in our group, including
your TA Jimin, Qian
> MiniSOM is relatively easy to use, well
documented and actively maintained, and
has a basic implementation of the SOM
algorithm
What MiniSOM provides
> It has:
– The core implementation of SOM
– Visualization
– The U-Matrix ("distance map" in MiniSOM)
– Projection of a chosen feature onto the SOM
> It doesn't have:
– PCA initialization
– Heat maps for each feature
– K-Means clustering
(a minimal usage sketch is given below)
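A minimal MiniSom usage sketch with placeholder data; the map size, sigma, learning rate, and iteration count are illustrative choices.

```python
# pip install minisom
import numpy as np
from minisom import MiniSom

data = np.random.rand(200, 5)              # placeholder: 200 samples, 5 features

som = MiniSom(10, 10, input_len=5, sigma=1.5, learning_rate=0.5, random_seed=0)
som.random_weights_init(data)
som.train_random(data, 5000)               # 5000 iterations with randomly picked samples

u = som.distance_map()                     # the U-matrix ("distance map")
bmu = som.winner(data[0])                  # (row, col) of the BMU for one sample
weights = som.get_weights()                # shape (10, 10, 5); can be sliced per feature
```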
Hyperparameters of SOM
> Length of input vectors (the number of properties)
> Map size, the most important one
> Map topology – rectangular or hexagonal
– Important in defining the notion of “neighbors”
> Sigma – spread of the neighborhood function
> Learning Rate – initial learning rate, decreases with the
number of iterations
> Decay function – defines how much learning rate and sigma
decrease with the number of iterations
> Neighborhood function – defines how much the neighbors of
the BMU get adjusted at each iteration (e.g., gaussian,
bubble, …)
> Activation distance function (e.g., Euclidean distance)
> Initialization method – random or PCA
(most of these map directly onto the MiniSom constructor, as sketched below)
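The sketch below maps the hyperparameters above onto the MiniSom constructor; the values are illustrative, and some keyword arguments (topology, activation_distance) may only be available in recent MiniSom releases.

```python
from minisom import MiniSom

som = MiniSom(
    x=15, y=15,                          # map size, often the most impactful choice
    input_len=21,                        # length of the input vectors (number of properties)
    sigma=2.0,                           # initial spread of the neighborhood function
    learning_rate=0.5,                   # initial learning rate, decays over the iterations
    neighborhood_function="gaussian",    # e.g. "gaussian" or "bubble"
    topology="rectangular",              # or "hexagonal"
    activation_distance="euclidean",
    random_seed=42,
)
# The decay function (how fast sigma and the learning rate shrink)
# defaults to MiniSom's asymptotic decay and can also be overridden.
```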
Hands-on session and HW for this week