Data Analytics Lab


The Laboratory shall have a multicore server running Hadoop or any similar platform.

Alternatively, a desktop with a multicore processor and 64/128 GB RAM can be used to install
Hadoop. R shall also be available in the Laboratory.
Exercises:
A. Hadoop
1. Install, configure and run Hadoop and HDFS
Hadoop is a widely used, open-source software framework, written mainly in Java with some
native C code and shell scripts. It can effectively manage large volumes of data, in both
structured and unstructured formats, on clusters of computers using simple programming
models.
Install Hadoop

Step 1: Download the Java 8 package (jdk-8u101). Save this file in your home directory.

Step 2: Extract the Java Tar File.


Command: tar -xvf jdk-8u101-linux-i586.tar.gz

Fig: Hadoop Installation – Extracting Java Files


Step 3: Download the Hadoop 2.7.3 Package.
Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

Fig: Hadoop Installation – Downloading Hadoop


Step 4: Extract the Hadoop tar File.
Command: tar -xvf hadoop-2.7.3.tar.gz

Fig: Hadoop Installation – Extracting Hadoop Files


Step 5: Add the Hadoop and Java paths to the bash file (.bashrc).
Open the .bashrc file and add the Hadoop and Java paths as shown below.
Command: vi .bashrc
Fig: Hadoop Installation – Setting Environment Variable
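A typical set of entries is sketched below, assuming the archives were extracted into the home directory as jdk1.8.0_101 and hadoop-2.7.3; adjust the paths to your actual locations.

export JAVA_HOME=$HOME/jdk1.8.0_101
export HADOOP_HOME=$HOME/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin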
Then, save the bash file and close it.
To apply these changes to the current terminal, execute the source command.
Command: source .bashrc

Fig: Hadoop Installation – Refreshing environment variables


To make sure that Java and Hadoop have been properly installed on your system and can be
accessed through the Terminal, execute the java -version and hadoop version commands.
Command: java -version

Fig: Hadoop Installation – Checking Java Version


Command: hadoop version

Fig: Hadoop Installation – Checking Hadoop Version


Step 6: Edit the Hadoop Configuration files.
Command: cd hadoop-2.7.3/etc/hadoop/
Command: ls
All the Hadoop configuration files are located in hadoop-2.7.3/etc/hadoop directory as you
can see in the snapshot below:

Fig: Hadoop Installation – Hadoop Configuration Files


Step 7: Open core-site.xml and edit the property mentioned below inside the configuration tag:
core-site.xml informs Hadoop daemon where NameNode runs in the cluster. It contains
configuration settings of Hadoop core such as I/O settings that are common to HDFS &
MapReduce.
Command: vi core-site.xml
Fig: Hadoop Installation – Configuring core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Step 8: Edit hdfs-site.xml and add the property mentioned below inside the configuration tag:

hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode,


Secondary NameNode). It also includes the replication factor and block size of HDFS.

Command: vi hdfs-site.xml

Fig: Hadoop Installation – Configuring hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>
Step 9: Edit the mapred-site.xml file and add the property mentioned below inside the configuration tag:
mapred-site.xml contains configuration settings of MapReduce application like number of
JVM that can run in parallel, the size of the mapper and the reducer process, CPU cores
available for a process, etc.

In some cases, the mapred-site.xml file is not available, so we have to create it from the
mapred-site.xml template.

Command: cp mapred-site.xml.template mapred-site.xml

Command: vi mapred-site.xml

Fig: Hadoop Installation – Configuring mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Step 10: Edit yarn-site.xml and add the property mentioned below inside the configuration tag:

yarn-site.xml contains configuration settings for the ResourceManager and NodeManager, such as
application memory limits and the auxiliary services used for the shuffle.

Command: vi yarn-site.xml
Fig: Hadoop Installation – Configuring yarn-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
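Before using HDFS for the first time, the NameNode is normally formatted and the daemons started. A typical sequence (assuming the Hadoop bin and sbin directories are on the PATH, as set in .bashrc above) is:

Command: hdfs namenode -format

Command: start-dfs.sh

Command: start-yarn.sh

Command: jps (to verify that NameNode, DataNode, ResourceManager and NodeManager are running)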

2. Implement word count / frequency programs using MapReduce

In Hadoop, MapReduce is a computation that decomposes large manipulation jobs into


individual tasks that can be executed in parallel across a cluster of servers. The results of tasks
can be joined together to compute final results.

MapReduce consists of 2 steps:

Map Function – It takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (Key-Value pair).

Example – (Map function in Word Count)

Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN

Output (converted into another set of (Key, Value) pairs):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Reduce Function – It takes the output from Map as input and combines those data tuples into a
smaller set of tuples.

Example – (Reduce function in Word Count)

Input (set of tuples, i.e. the output of the Map function):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Output (converted into a smaller set of tuples):
(BUS,7), (CAR,7), (TRAIN,4)

Work Flow of the Program

Workflow of MapReduce consists of 5 steps:


1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma,
semicolon, or even by a new line (‘\n’).

2. Mapping – as explained above.

3. Intermediate splitting (shuffle) – the whole process runs in parallel on different nodes of the
cluster; in order to group records in the Reduce phase, all values with the same KEY must end up
on the same node.

4. Reduce – essentially a group-by-key and aggregation phase.

5. Combining – the last phase, where all the data (the individual result sets from each node) is
combined together to form the final result.
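As a minimal sketch of a word-count job (exercise 2), the pair of Python scripts below can be run with Hadoop Streaming instead of a Java job; the file names mapper.py and reducer.py, the HDFS paths and the streaming JAR location are illustrative and depend on your installation.

• Python3

# mapper.py - reads lines from standard input and emits (word, 1) pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("{}\t1".format(word.lower()))

• Python3

# reducer.py - Hadoop Streaming sorts the mapper output by key,
# so all counts for the same word arrive one after another
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("{}\t{}".format(current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("{}\t{}".format(current_word, current_count))

Command:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
  -files mapper.py,reducer.py \
  -mapper "python mapper.py" \
  -reducer "python reducer.py" \
  -input /input.txt -output /wordcount_output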

3. Implement an MR program that processes a weather or similar dataset. Dataset needs to be


found and used.

Here, we will write a Map-Reduce program for analyzing weather datasets to understand its
data processing programming model. Weather sensors are collecting weather information
across the globe in a large volume of log data. This weather data is semi-structured and
record-oriented.
This data is stored in a line-oriented ASCII format, where each row represents a single
record. Each row has many fields, such as longitude, latitude, daily max/min temperature, daily
average temperature, etc. For simplicity, we will focus on the main element, i.e. temperature.
We will use data from the National Centers for Environmental Information (NCEI). It has
a massive amount of historical weather data that we can use for our data analysis.
Problem Statement:
Analyzing weather data of Fairbanks, Alaska to find cold and hot
days using MapReduce Hadoop.
Step 1:
We can download the dataset from the NCEI site for various cities and years. Choose a
year of your choice and select any one of the data text files for analysis. In my case, I have
selected the CRND0103-2020-AK_Fairbanks_11_NE.txt dataset for the analysis of hot and cold
days in Fairbanks, Alaska.
We can get information about the data from the README.txt file available on the NCEI website.

Step 2:
Below is an example of our dataset, where column 6 and column 7 show the maximum and
minimum temperature, respectively.

Step 3:
Create a project in Eclipse with the following steps:

First open Eclipse -> then select File -> New -> Java Project -> name it MyProject -> then
select "Use an execution environment" -> choose JavaSE-1.8 -> Next -> Finish.

In this project, create a Java class with the name MyMaxMin -> then click Finish.
Copy the source code below into this MyMaxMin Java class.

JAVA

// importing Libraries
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;

public class MyMaxMin {

// Mapper

/*MaxTemperatureMapper class is static


* and extends Mapper abstract class
* having four Hadoop generics type
* LongWritable, Text, Text, Text.
*/

public static class MaxTemperatureMapper extends


Mapper<LongWritable, Text, Text, Text> {

/**
 * @method map
 * This method takes one line of the input file as a Text value.
 * Leaving out the leading fields, the 6th field is taken as
 * temp_max and the 7th field as temp_min. Days with
 * temp_max > 30 or temp_min < 15 are passed to the reducer.
 */

// records carrying this value contain
// missing/inconsistent data
public static final int MISSING = 9999;

@Override
public void map(LongWritable arg0, Text Value, Context context)
throws IOException, InterruptedException {

// Convert the single row(Record) to


// String and store it in String
// variable name line

String line = Value.toString();

// Check for the empty line


if (!(line.length() == 0)) {

// from character 6 to 14 we have


// the date in our dataset
String date = line.substring(6, 14);

// similarly we have taken the maximum


// temperature from 39 to 45 characters
float temp_Max = Float.parseFloat(line.substring(39, 45).trim());

// similarly we have taken the minimum


// temperature from 47 to 53 characters

float temp_Min = Float.parseFloat(line.substring(47, 53).trim());

// if maximum temperature is
// greater than 30, it is a hot day
if (temp_Max > 30.0) {

// Hot day
context.write(new Text("The Day is Hot Day :" + date),
new Text(String.valueOf(temp_Max)));
}

// if the minimum temperature is


// less than 15, it is a cold day
if (temp_Min < 15) {

// Cold day
context.write(new Text("The Day is Cold Day :" + date),
new Text(String.valueOf(temp_Min)));
}
}
}

// closing the MaxTemperatureMapper class
}

// Reducer

/*MaxTemperatureReducer class is static


and extends Reducer abstract class
having four Hadoop generics type
Text, Text, Text, Text.
*/

public static class MaxTemperatureReducer extends


Reducer<Text, Text, Text, Text> {

/**
* @method reduce
* This method takes the input as key and
* list of values pair from the mapper,
* it does aggregation based on keys and
* produces the final context.
*/

@Override
public void reduce(Text Key, Iterable<Text> Values, Context context)
throws IOException, InterruptedException {

// write out the first temperature
// reported for this key
String temperature = Values.iterator().next().toString();
context.write(Key, new Text(temperature));
}

// closing the MaxTemperatureReducer class;
// main() below belongs to MyMaxMin
}

/**
* @method main
* This method is used for setting
* all the configuration properties.
* It acts as a driver for map-reduce
* code.
*/

public static void main(String[] args) throws Exception {

// reads the default configuration of the


// cluster from the configuration XML files
Configuration conf = new Configuration();

// Initializing the job with the


// default configuration of the cluster
Job job = new Job(conf, "weather example");

// Assigning the driver class name


job.setJarByClass(MyMaxMin.class);

// Key type coming out of mapper


job.setMapOutputKeyClass(Text.class);

// value type coming out of mapper


job.setMapOutputValueClass(Text.class);

// Defining the mapper class name


job.setMapperClass(MaxTemperatureMapper.class);

// Defining the reducer class name


job.setReducerClass(MaxTemperatureReducer.class);

// Defining input Format class which is


// responsible to parse the dataset
// into a key value pair
job.setInputFormatClass(TextInputFormat.class);

// Defining output Format class which is


// responsible to parse the dataset
// into a key value pair
job.setOutputFormatClass(TextOutputFormat.class);

// setting the second argument


// as a path in a path variable
Path OutputPath = new Path(args[1]);

// Configuring the input path


// from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));

// Configuring the output path from


// the filesystem into the job
FileOutputFormat.setOutputPath(job, new Path(args[1]));

// deleting the output path automatically
// from hdfs so that we don't have
// to delete it explicitly
OutputPath.getFileSystem(conf).delete(OutputPath);

// exit with status 0 if the job
// completes successfully, 1 otherwise
System.exit(job.waitForCompletion(true) ? 0 : 1);

}
}

Now we need to add external JARs for the packages that we have imported. Download the JAR
packages Hadoop Common and Hadoop MapReduce Core according to your Hadoop version.
You can check the Hadoop version with:

hadoop version

Now we add these external JARs to our MyProject. Right-click on MyProject -> then
select Build Path -> click on Configure Build Path, select Add External JARs…, add the
JARs from their download location and then click Apply and Close.

Now export the project as a JAR file. Right-click on MyProject, choose Export…, go to Java ->
JAR file, click Next, choose your export destination and click Next.
Choose the Main Class as MyMaxMin by clicking Browse, then click Finish -> Ok.
Step 4:
Start our Hadoop Daemons
start-dfs.sh
start-yarn.sh
Step 5:
Move your dataset to the Hadoop HDFS.
Syntax:

hdfs dfs -put /file_path /destination


In the command below, / is the root directory of our HDFS.

hdfs dfs -put /home/dikshant/Downloads/CRND0103-2020-AK_Fairbanks_11_NE.txt /

Check that the file was copied to our HDFS.

hdfs dfs -ls /

Step 6:
Now run your JAR file with the command below and produce the output in the MyOutput directory.
Syntax:
hadoop jar /jar_file_location /dataset_location_in_HDFS /output_directory_name
Command:
hadoop jar /home/dikshant/Documents/Project.jar /CRND0103-2020-AK_Fairbanks_11_NE.txt /MyOutput

Step 7:

Now go to localhost:50070/ in a browser; under Utilities select Browse the file system and
download part-r-00000 from the /MyOutput directory to see the result.
Step 8:

See the result in the downloaded file.

In the above image, you can see the top 10 results showing the cold days. The second column
is the date in yyyymmdd format. For example, 20200101 means
year = 2020, month = 01 and day = 01.

B. R/ Python
4. Implement Linear and logistic Regression
Linear Regression (Python Implementation)
This section discusses the basics of linear regression and its implementation in the Python
programming language.
Linear regression is a statistical method for modelling the relationship between a dependent
variable and a given set of independent variables.
Note: In this section, we refer to dependent variables as responses and independent variables
as features, for simplicity.
In order to provide a basic understanding of linear regression, we start with the most basic
version of linear regression, i.e. Simple linear regression.
Simple Linear Regression
Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function
that predicts the response value(y) as accurately as possible as a function of the feature or
independent variable(x).
Let us consider a dataset where we have a value of response y for every feature x:
For generality, we define:
x as the feature vector, i.e. x = [x_1, x_2, …, x_n],
y as the response vector, i.e. y = [y_1, y_2, …, y_n]
for n observations (in the example used below, n = 10).
A scatter plot of the above dataset looks like this:

Now, the task is to find a line that fits best in the above scatter plot so that we can predict the
response for any new feature values. (i.e a value of x not present in a dataset)
This line is called a regression line.
The equation of the regression line is represented as:

h(x_i) = b_0 + b_1 * x_i

Here,
• h(x_i) represents the predicted response value for ith observation.
• b_0 and b_1 are regression coefficients and represent y-
intercept and slope of regression line respectively.
• To create our model, we must “learn” or estimate the values of regression
coefficients b_0 and b_1. And once we’ve estimated these coefficients, we can
use the model to predict responses!
In this article, we are going to use the principle of Least Squares.
Now consider:

e_i = y_i - h(x_i)

Here, e_i is the residual error in the ith observation.


So, our aim is to minimize the total residual error.
We define the squared error or cost function J as:

J(b_0, b_1) = (1 / 2n) * Σ e_i^2   (summing over i = 1 … n)

and our task is to find the values of b_0 and b_1 for which J(b_0, b_1) is minimum.
Without going into the mathematical details, we present the result here:

b_1 = SS_xy / SS_xx
b_0 = mean(y) - b_1 * mean(x)

where SS_xy is the sum of cross-deviations of y and x:

SS_xy = Σ (x_i - mean(x)) * (y_i - mean(y)) = Σ x_i * y_i - n * mean(x) * mean(y)

and SS_xx is the sum of squared deviations of x:

SS_xx = Σ (x_i - mean(x))^2 = Σ x_i^2 - n * mean(x)^2

Note: The complete derivation of the least squares estimates in simple linear
regression can be found in any standard statistics textbook.
• Code: Python implementation of above technique on our small dataset

• Python

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m",
                marker = "o", s = 30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color = "g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \
          \nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output:
Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
And the graph obtained looks like this:
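Exercise 4 also asks for logistic regression. The snippet below is a minimal sketch using scikit-learn's LogisticRegression; the small hours-studied/pass-fail dataset is made up purely for illustration and is not part of the dataset above.

• Python3

import numpy as np
from sklearn.linear_model import LogisticRegression

# hours studied (feature) and exam result (response: 0 = fail, 1 = pass)
X = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# fit the logistic model P(y = 1 | x) = 1 / (1 + exp(-(b_0 + b_1 * x)))
model = LogisticRegression()
model.fit(X, y)

print("Coefficient (b_1):", model.coef_[0][0])
print("Intercept (b_0):", model.intercept_[0])
print("P(pass) for 2.75 hours of study:", model.predict_proba([[2.75]])[0][1])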
5. Implement SVM / Decision tree classification techniques
Introduction to SVMs: In machine learning, support vector machines (SVMs, also support
vector networks) are supervised learning models with associated learning algorithms that
analyze data used for classification and regression analysis. A Support Vector Machine
(SVM) is a discriminative classifier formally defined by a separating hyperplane. In other
words, given labeled training data (supervised learning), the algorithm outputs an optimal
hyperplane which categorizes new examples.
What is Support Vector Machine?
An SVM model is a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as wide as possible. In
addition to performing linear classification, SVMs can efficiently perform a non-linear
classification, implicitly mapping their inputs into high-dimensional feature spaces.
What does SVM do?
Given a set of training examples, each marked as belonging to one of two categories, an SVM
training algorithm builds a model that assigns new examples to one category or the other,
making it a non-probabilistic binary linear classifier. Make sure you have a basic
understanding of SVMs before you proceed further. Here I'll discuss an example of SVM
classification on the cancer UCI dataset using scikit-learn in Python.
Prerequisites: NumPy, Pandas, matplotlib, scikit-learn. Let's
have a quick example of support vector classification. First we need to create a dataset:

• python3
# importing make_blobs from scikit-learn and the plotting libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# creating dataset X containing n_samples
# and Y containing two classes
X, Y = make_blobs(n_samples=500, centers=2,
                  random_state=0, cluster_std=0.40)

# plotting scatters
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
plt.show()

Output: What support vector machines do is not only draw a line between the two classes here,
but also consider a region of some given width about that line. Here's an example of what it
can look like:

• python3

# creating linspace between -1 to 3.5
xfit = np.linspace(-1, 3.5)

# plotting scatter
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')

# plot a line between the different sets of data
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
                     color='#AAAAAA', alpha=0.4)

plt.xlim(-1, 3.5)
plt.show()
Importing datasets
This is the intuition of support vector machines, which optimize a linear discriminant model
that maximizes the perpendicular distance (margin) between the two classes. Now let's train the
classifier using our training data. Before training, we need to import the cancer dataset as a
CSV file, from which we will use two features out of all the features.

• python3

# importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# reading csv file and extracting class column to y.
x = pd.read_csv("C:\...\cancer.csv")
a = np.array(x)
y = a[:,30] # classes having 0 and 1

# extracting two features
x = np.column_stack((x.malignant, x.benign))

# 569 samples and 2 features
x.shape

print(x, y)

[[ 122.8  1001.  ]
 [ 132.9  1326.  ]
 [ 130.   1203.  ]
 ...,
 [ 108.3   858.1 ]
 [ 140.1  1265.  ]
 [  47.92  181.  ]]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 1.,
        1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 1.,
        ....,
        1.])
Fitting a Support Vector Machine
Now we'll fit a Support Vector Machine classifier to these points. While the mathematical
details of the underlying model are interesting, we'll leave those for you to read about
elsewhere. Instead, we'll just treat the scikit-learn algorithm as a black box that
accomplishes the above task.

• python3

# import support vector classifier
# "Support Vector Classifier"
from sklearn.svm import SVC
clf = SVC(kernel='linear')

# fitting x samples and y classes
clf.fit(x, y)

After being fitted, the model can then be used to predict new values:

• python3

clf.predict([[120, 990]])

clf.predict([[85, 550]])

Each call returns an array containing the predicted class label (0 or 1) for the given pair of
feature values.
Let's have a look at the graph to see how this looks. The plot is obtained by drawing the
fitted model's optimal hyperplane over the data with matplotlib, as in the sketch below.
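The sketch below shows one way to draw that region; it reuses the x, y and clf variables from the snippets above, and the axis labels are placeholders since the columns are simply the two extracted features.

• python3

import numpy as np
import matplotlib.pyplot as plt

# build a grid covering the range of the two features
xx, yy = np.meshgrid(np.linspace(x[:, 0].min(), x[:, 0].max(), 200),
                     np.linspace(x[:, 1].min(), x[:, 1].max(), 200))

# signed distance of every grid point to the separating hyperplane
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.scatter(x[:, 0], x[:, 1], c=y, s=30, cmap='spring')
# decision boundary (level 0) and the two margins (levels -1 and +1)
plt.contour(xx, yy, Z, levels=[-1, 0, 1],
            linestyles=['--', '-', '--'], colors='k')
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.show()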

6. Implement clustering techniques


k-means, DBSCAN and HAC are 3 very popular clustering algorithms which all take very
different approaches to creating clusters.
Cluster Analysis
Imagine we have some data. In cluster analysis, we want to separate different groups based on
the data, in an unsupervised manner (no a priori information).
Looking at a plot of the above data, we can say that it fits into 2 different groups – a cluster of
points in the bottom left and a larger, elongated cluster on the top right. When we give this
data to a clustering algorithm, it will create a split. Algorithms like k-means need to be told
how many clusters we want. In some cases, we don’t need to specify the number of clusters.
DBSCAN for instance is smart enough to figure out how many clusters there are in the data.
The data above is from the IRIS data set. This was collected by the famous statistician R.A.
Fisher, who measured properties such as petal width, petal length, sepal width and sepal length
for three different species of flowers. Since we are doing clustering, we have removed the
class labels from the data set, as that grouping is exactly what the clustering algorithm is
trying to recover, i.e. which data points belong together.
Clustering
Grouping data into clusters so that the data in each cluster has similar attributes or properties.
For example the data in the small cluster in the above plot have small petal length and small
petal width.
There are several applications of clustering analysis used across a variety of fields:
Market analysis and segmentation
Medical imaging – Xrays, MRIs, fMRIs
Recommender systems – such as those used on Amazon.com
Geospatial data – longitudinal coordinates etc
Anomaly detection
People have used clustering algorithms to detect brain anomalies. We see below various brain
images. C1 – C7 are various clusters. This is an example of clustering in the medical domain.

Another example is a spatial analysis of user-generated ratings of venues on yelp.com


Cluster Analysis
A set of points X1….Xn are taken in, analyzed and out comes a set of mappings from each
point to a cluster (X1 -> C1, X2 -> C2 etc).

There are several such algorithms that will detect clusters. Some of these algorithms have
additional parameters e.g. Number of clusters. These parameters vary for each algorithm.
The input however is a set of data points X1…Xn in any dimensionality i.e. 2D, 3D, 100D
etc. For our purposes, we will stick with 2D points. It is hard to visualize data of higher
dimensions though there are dimensionality reduction techniques that reduce say 100
dimensions to 2 so that they can be plotted.
The output is a cluster assignment where each point either belongs to a cluster or could be an
outlier (noise).
Cluster analysis is a kind of unsupervised machine learning technique, as in general, we do
not have any labels. There may be some techniques that use class labels to do clustering but
this is generally not the case.
Summary
We discussed what clustering analysis is, various clustering algorithms, what are the inputs
and outputs of these algorithms. We discussed various applications of clustering – not
necessarily in the data science field.
Part 2
In this part, we will look at probably the most popular clustering algorithm, i.e. k-means
clustering.

This is a very popular, simple and easy-to-implement algorithm.


In this algorithm, we separate the data into k disjoint clusters. These clusters are defined such
that they minimize the within-cluster sum-of-squares. We'll discuss this more when we look at
k-means convergence. Disjoint here means that one point cannot belong to more than one cluster.
There is only 1 parameter for this algorithm i.e. k (the number of clusters). We need to have
an idea of k before we run the algorithm. Sometimes it is obvious how many clusters we
should have, but sometimes it is not that clear. We will discuss later how to make this choice.
This algorithm is a baseline algorithm in clustering.
The cluster center/centroid is a point that represents the cluster. The figure above has a red
and a blue cluster. X is the centroid – the average of the x and y coordinates. In the blue
cluster the average of the x and y coordinates is somewhere in the middle represented by the
X in the middle of the square.
K-means Clustering Algorithm
1. Randomly initialize the cluster centers. For example, in the above diagram, we pick 2
random points to initialize the clusters.
2. Assign each point to its nearest cluster center using a distance measure such as the Euclidean distance.
3. Update the cluster centroids using the mean of the points assigned to it.
4. Go back to 2 until convergence (the cluster centroids stop moving or they move small
imperceptible amounts).
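A compact NumPy sketch of these four steps is shown below; the two-blob data set and k = 2 are made up for illustration, and empty clusters are not handled in this simple version.

• Python3

import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k of the data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# two obvious blobs of 2-D fake data
data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(data, k=2)
print(centers)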
Let’s go through an example (fake data) and discuss some interesting things about this
algorithm.
Visually, we can see that there are 2 clusters present.
Let’s randomly assign the cluster centers.

Let’s now assign each point to the closest cluster.


The points are now colored blue or red depending on which centroid they are closer to.
Next we need to update our cluster centroids. For the blue centroid, we take the average of all
the x-coordinates – this will be the new x-coordinate for the centroid. Similarly we look at all
the y-coordinates for the blue points, take their average and this becomes the new y-
coordinate for the centroid. Likewise for the red points.
When I do this, the centroids shift over.

Once again I need to figure out which centroid each point is close to, which gives me the
following.
Once again, I update my cluster centroids as before.

If we try to do another shift, the centroids won’t move again. This is evident from the last 2
figures where the same points are assigned to the cluster centroids.

At this point, we say that k-means has converged.

Convergence
Convergence means that the cluster centroids don't move at all, or move only a very small
amount. We use a threshold value: if no centroid moves by more than that threshold, k-means has
converged.

Mathematically, k-means is guaranteed to converge in a finite number of iterations (assigning


point to a cluster and shifting). It may take a long time, but will eventually converge. It does
not say anything about best or optimal clustering, just that it will converge.

K-means is sensitive to where you initialize the centroids. There are a few techniques to do
this:

• Assign each cluster center to a random data point.
• Choose k points that are as far away from each other as possible within
the bounds of the data.
• Repeat k-means several times with different random initializations and
keep the best clustering.
• A more advanced approach, k-means++, picks each new initial center with
probability proportional to its squared distance from the centers already
chosen. We won't be getting into it here.
Choosing k (How many clusters to use)

One way is to plot the data points and try different values to see what works the best. Another
technique is called the elbow method.

Elbow method
Steps:

Choose some values of k and run the clustering algorithm

For each cluster, compute the within-cluster sum-of-squares between the centroid and each
data point.

Sum up for all clusters, plot on a graph

Repeat for different values of k, keep plotting on the graph.

Then pick the elbow of the graph.

This is a popular method supported by several libraries.
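A short sketch of the elbow method with scikit-learn is shown below; the blob data is made up for illustration, and KMeans exposes the within-cluster sum-of-squares as the inertia_ attribute.

• Python3

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# fake 2-D data with two obvious groups
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])

ks = range(1, 10)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum-of-squares for this k

plt.plot(list(ks), wcss, 'o-')
plt.xlabel('k (number of clusters)')
plt.ylabel('within-cluster sum-of-squares')
plt.show()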

Advantages Of k-means

It is a widely known and used algorithm.

It's also fairly simple to understand and easy to implement.

It is also guaranteed to converge.

Disadvantages of k-means

It can be slow, i.e. it may take a long time to converge.

It may converge to a local minimum rather than the globally optimal solution.

It's also not very robust against varying cluster shapes, e.g. it may not perform very well for
elongated clusters. This is because we use the same parameters for each cluster.

This was a quick overview of k-means clustering. Let's now look at how it performs on
different kinds of data sets.

7. Visualize data using any plotting framework


In today's world, a lot of data is being generated on a daily basis. Analyzing
this data for trends and patterns can become difficult if the data is in its
raw format. To overcome this, data visualization comes into play. Data
visualization provides a good, organized pictorial representation of the data,
which makes it easier to understand, observe and analyze. In this tutorial, we
will discuss how to visualize data using Python.
Python provides various libraries that come with different features for
visualizing data. All these libraries come with different features and can
support various types of graphs. In this tutorial, we will be discussing four such
libraries.
• Matplotlib
• Seaborn
• Bokeh
• Plotly
We will discuss these libraries one by one and will plot some most commonly
used graphs.
Note: If you want to learn in-depth information about these libraries you can
follow their complete tutorial.
Before diving into these libraries, we first need a dataset to plot. We will be
using the tips dataset throughout this tutorial. Let's see a brief overview of
this dataset.
Database Used
Tips Database
The tips dataset is a record of the tips given by customers in a restaurant
over two and a half months in the early 1990s. It contains seven columns:
total_bill, tip, sex, smoker, day, time and size.
The tips dataset is widely available online (it also ships with the Seaborn library).
Example:
• Python3
import pandas as pd

# reading the database
data = pd.read_csv("tips.csv")

# printing the top 10 rows
# (display() is available in Jupyter/IPython;
#  use print(data.head(10)) in a plain script)
display(data.head(10))

Output:
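As a first plot, the sketch below uses Matplotlib on the same tips.csv file (column names as described above) to draw a scatter plot of the tip against the total bill:

• Python3

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("tips.csv")

# scatter plot of total bill vs. tip
plt.scatter(data['total_bill'], data['tip'])
plt.title("Tips dataset")
plt.xlabel('total_bill')
plt.ylabel('tip')
plt.show()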

8. Implement an application that stores big data in Hbase / MongoDB / Pig using
Hadoop / R.

To get started with MongoDB, you'll first need to sign up for a free MongoDB Atlas
account. Once you have created your account, you will be prompted to name your
organization, name your project, and choose the language for code samples and help.
Next, choose the type of account you need.

I chose the free option for this example. It’s worth noting that the free tier here remains free,
as opposed to other products which might offer a free trial period only.

Next, create a cluster. Unless you want to modify the cluster, you can choose the default and
click Create Cluster.

It will take a few minutes for the cluster to provision. Once complete, you will see a screen
like the one below:

Click on the Connect button to start setting up your connection. Here you will have to add your
local IP address and create a user for your database. The IP address will auto-populate with
your local IP address. Add a description if you want and click Add IP Address. Then add a
Username and Password and click Create Database User. After that, click the Choose a
connection method button, which will now be active.
The next screen will give you the option to choose how you will connect to your new database.
Since you are going to be connecting with Python, choose Connect your application.
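Once Atlas shows the connection string, a minimal sketch of connecting and storing a document with PyMongo looks like the following (install it with pip install pymongo; the connection string, database and collection names below are placeholders to be replaced with your own values):

• Python3

from pymongo import MongoClient

# paste the connection string from the Atlas "Connect your application" screen here;
# mongodb+srv URIs also require the dnspython package
client = MongoClient("mongodb+srv://<username>:<password>@<cluster-url>/test")

db = client["mydatabase"]          # placeholder database name
collection = db["measurements"]    # placeholder collection name

# insert one document and read it back
collection.insert_one({"sensor": "s1", "temperature": 21.5})
print(collection.find_one({"sensor": "s1"}))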
