
SPLEX TME 2

Clustering

The goal of this TME is to learn how to use some popular clustering methods (unsupervised
learning) and how to interpret their results.
We will use the scikit-learn Python library (http://scikit-learn.org), which is already installed
on the computers.

Data (simulated data sets + data sets of TME 1)


We explore two data sets downloadable from the UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/index.php):

• Breast Cancer Wisconsin (Diagnostic) Data Set
  (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic))

• Mice Protein Expression Data Set
  (https://archive.ics.uci.edu/ml/datasets/Mice+Protein+Expression)

Libraries
You will need to load the following packages:

import matplotlib.pyplot as plt

from sklearn import cluster
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_classification
from sklearn.datasets import make_blobs
from sklearn.datasets import make_moons

Analysis

Before running the analysis on the Breast cancer and Mice data sets, we will analyse three
simulated data sets to better understand what the different clustering methods do, and why they
produce different clusterings. Generate and visualize the artificial data as follows:

# First simulated data set
plt.title("Two informative features, one cluster per class", fontsize='small')
X1, Y1 = make_classification(n_samples=200, n_features=2, n_redundant=0,
                             n_informative=2, n_clusters_per_class=1)
plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1, s=25, edgecolor='k')

# Second simulated data set
plt.title("Three blobs", fontsize='small')
X2, Y2 = make_blobs(n_samples=200, n_features=2, centers=3)
plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=Y2, s=25, edgecolor='k')

# Third simulated data set
plt.title("Non-linearly separated data sets", fontsize='small')
X3, Y3 = make_moons(n_samples=200, shuffle=True, noise=None, random_state=None)
plt.scatter(X3[:, 0], X3[:, 1], marker='o', c=Y3, s=25, edgecolor='k')
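Run as written above, each call to plt.title and plt.scatter draws into the same figure, so the three data sets overwrite one another. One possible sketch placing them side by side in a single figure (the fixed random seeds are an addition here, used only to make the figure reproducible):

```python
# Show the three simulated data sets side by side for comparison.
# Seeds are fixed here only so the figure is reproducible; the
# handout's snippets leave them random.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_blobs, make_moons

X1, Y1 = make_classification(n_samples=200, n_features=2, n_redundant=0,
                             n_informative=2, n_clusters_per_class=1,
                             random_state=0)
X2, Y2 = make_blobs(n_samples=200, n_features=2, centers=3, random_state=0)
X3, Y3 = make_moons(n_samples=200, noise=0.05, random_state=0)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (X, Y, title) in zip(axes, [(X1, Y1, "One cluster per class"),
                                    (X2, Y2, "Three blobs"),
                                    (X3, Y3, "Two moons")]):
    ax.scatter(X[:, 0], X[:, 1], marker='o', c=Y, s=25, edgecolor='k')
    ax.set_title(title, fontsize='small')
plt.tight_layout()
```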

Apply the following clustering methods to the three simulated data sets.
Clustering Methods

1. K-means
http://scikit-learn.org/stable/modules/clustering.html#k-means
An example of k-means clustering (where k is the number of clusters you want to produce,
and X is the data matrix):

km = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
km.fit(X)

You can also visualize the clustering (and compare it with the true class labels):

plt.scatter(X[:, 0], X[:, 1], s=10, c=km.labels_)
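A possible way to make that comparison concrete is to plot the K-means labels next to the true labels in one figure. The sketch below uses the first simulated data set, regenerated with a fixed seed so the snippet runs on its own:

```python
# K-means clusters (left) next to the true labels (right) for the
# first simulated data set. The fixed seeds are an addition, used
# only to make the snippet self-contained and reproducible.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

X1, Y1 = make_classification(n_samples=200, n_features=2, n_redundant=0,
                             n_informative=2, n_clusters_per_class=1,
                             random_state=0)
km = KMeans(n_clusters=2, init='k-means++', max_iter=100, n_init=1,
            random_state=0).fit(X1)

fig, (ax_pred, ax_true) = plt.subplots(1, 2, figsize=(10, 4))
ax_pred.scatter(X1[:, 0], X1[:, 1], s=10, c=km.labels_)
ax_pred.set_title("K-means labels")
ax_true.scatter(X1[:, 0], X1[:, 1], s=10, c=Y1)
ax_true.set_title("True labels")
```

Note that cluster numbering is arbitrary: K-means may call a class "0" that the ground truth calls "1", which is why the evaluation metrics in question 4 are label-permutation invariant.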

2. Hierarchical clustering
http://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
An example of hierarchical clustering (where k is the number of clusters you want to produce,
and X is the data matrix):

for linkage in ('ward', 'average', 'complete'):
    clustering = AgglomerativeClustering(linkage=linkage, n_clusters=k)
    clustering.fit(X)
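To see how the choice of linkage matters, one possible sketch scores each linkage against the true labels of the blob data (regenerated here with a fixed seed so the snippet runs on its own; the Adjusted Rand Index is introduced in question 4):

```python
# Compare the three linkage criteria on the "Three blobs" data set
# by their Adjusted Rand Index against the true labels.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn import metrics

X2, Y2 = make_blobs(n_samples=200, n_features=2, centers=3, random_state=0)

results = {}
for linkage in ('ward', 'average', 'complete'):
    clustering = AgglomerativeClustering(linkage=linkage, n_clusters=3)
    clustering.fit(X2)
    results[linkage] = metrics.adjusted_rand_score(Y2, clustering.labels_)
    print("%-8s ARI = %.3f" % (linkage, results[linkage]))
```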

3. Spectral clustering
http://scikit-learn.org/stable/modules/clustering.html#spectral-clustering
An example of spectral clustering (where k is the number of clusters you want to produce,
and X is the data matrix):

spectral = cluster.SpectralClustering(n_clusters=k, eigen_solver='arpack',
                                      affinity="nearest_neighbors")
spectral.fit(X)

4. Analyse the results of clustering in terms of:

   • Homogeneity: metrics.homogeneity_score()
   • Completeness: metrics.completeness_score()
   • V-measure: metrics.v_measure_score()
   • Adjusted Rand Index: metrics.adjusted_rand_score()
   • Silhouette Coefficient: metrics.silhouette_score()
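A possible sketch computing all five scores for one clustering (K-means with k = 3 on the blob data, regenerated with a fixed seed so the snippet runs on its own). Note that the first four scores compare the clustering with the true labels, while the silhouette coefficient uses only the data and the predicted labels:

```python
# Evaluate one clustering with the five metrics listed above.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn import metrics

X, Y_true = make_blobs(n_samples=200, n_features=2, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

scores = {
    "homogeneity":   metrics.homogeneity_score(Y_true, km.labels_),
    "completeness":  metrics.completeness_score(Y_true, km.labels_),
    "v-measure":     metrics.v_measure_score(Y_true, km.labels_),
    "adjusted rand": metrics.adjusted_rand_score(Y_true, km.labels_),
    # The silhouette coefficient does not use the true labels:
    "silhouette":    metrics.silhouette_score(X, km.labels_),
}
for name, value in scores.items():
    print("%-14s %.3f" % (name, value))
```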

5. What is an optimal clustering method for each simulated data set?

6. Re-run the clustering methods on the Breast cancer and Mice data sets. Do not include the
class variables in your clustering analysis, but compare the obtained clusterings with the true
class labels.
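One possible sketch of this workflow on the Breast cancer data. For convenience it uses the copy of the data set bundled with scikit-learn (sklearn.datasets.load_breast_cancer) rather than the UCI download; the class labels are kept aside and used only for evaluation, and the standardization step is an addition, not required by the handout:

```python
# Cluster the Breast cancer features without the class variable,
# then compare the clustering with the true labels.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn import metrics

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)  # features only, standardized
y = data.target                                # true labels, evaluation only

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Adjusted Rand: %.3f" % metrics.adjusted_rand_score(y, km.labels_))
print("Silhouette:    %.3f" % metrics.silhouette_score(X, km.labels_))
```

The same pattern applies to the Mice Protein Expression data set, which must be downloaded from the UCI repository and whose missing values need handling before clustering.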
