Data Analytics Lab


The Laboratory shall have a multicore server running Hadoop or any similar platform.

Alternatively, a desktop with a multicore processor and 64/128 GB RAM can be used to install
Hadoop. R shall also be available in the Laboratory.
Exercises:
A. Hadoop
1. Install, configure and run Hadoop and HDFS
Hadoop is a widely used, open-source software framework, written mainly in Java with some
native C code and shell scripts. It can effectively manage large volumes of data, in both
structured and unstructured formats, on clusters of computers using simple programming
models.
Install Hadoop

Step 1: Download the Java 8 package (jdk-8u101). Save this file in your home directory.

Step 2: Extract the Java Tar File.


Command: tar -xvf jdk-8u101-linux-i586.tar.gz

Fig: Hadoop Installation – Extracting Java Files


Step 3: Download the Hadoop 2.7.3 Package.
Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz

Fig: Hadoop Installation – Downloading Hadoop


Step 4: Extract the Hadoop tar File.
Command: tar -xvf hadoop-2.7.3.tar.gz

Fig: Hadoop Installation – Extracting Hadoop Files


Step 5: Add the Hadoop and Java paths to the bash file (.bashrc).
Open the .bashrc file and add the Hadoop and Java paths as shown below.
Command: vi .bashrc
Fig: Hadoop Installation – Setting Environment Variable
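A typical set of entries is sketched below, assuming the archives were extracted into the home directory as jdk1.8.0_101 and hadoop-2.7.3; adjust the paths to your actual locations.

export JAVA_HOME=$HOME/jdk1.8.0_101
export HADOOP_HOME=$HOME/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin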
Then, save the bash file and close it.
To apply these changes to the current terminal, execute the source command.
Command: source .bashrc

Fig: Hadoop Installation – Refreshing environment variables


To make sure that Java and Hadoop have been properly installed on your system and can be
accessed through the Terminal, execute the java -version and hadoop version commands.
Command: java -version

Fig: Hadoop Installation – Checking Java Version


Command: hadoop version

Fig: Hadoop Installation – Checking Hadoop Version


Step 6: Edit the Hadoop Configuration files.
Command: cd hadoop-2.7.3/etc/hadoop/
Command: ls
All the Hadoop configuration files are located in hadoop-2.7.3/etc/hadoop directory as you
can see in the snapshot below:

Fig: Hadoop Installation – Hadoop Configuration Files


Step 7: Open core-site.xml and edit the property mentioned below inside the configuration tag:
core-site.xml informs Hadoop daemon where NameNode runs in the cluster. It contains
configuration settings of Hadoop core such as I/O settings that are common to HDFS &
MapReduce.
Command: vi core-site.xml
Fig: Hadoop Installation – Configuring core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Step 8: Edit hdfs-site.xml and add the property mentioned below inside the configuration tag:

hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode,


Secondary NameNode). It also includes the replication factor and block size of HDFS.

Command: vi hdfs-site.xml

Fig: Hadoop Installation – Configuring hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>
Step 9: Edit the mapred-site.xml file and add the property mentioned below inside the configuration tag:
mapred-site.xml contains configuration settings of MapReduce application like number of
JVM that can run in parallel, the size of the mapper and the reducer process, CPU cores
available for a process, etc.

In some cases, the mapred-site.xml file is not available, so we have to create it from the
mapred-site.xml template.

Command: cp mapred-site.xml.template mapred-site.xml

Command: vi mapred-site.xml

Fig: Hadoop Installation – Configuring mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Step 10: Edit yarn-site.xml and add the property mentioned below inside the configuration tag:

yarn-site.xml contains configuration settings for the ResourceManager and NodeManager, such as
application memory limits and the auxiliary services used for the shuffle.

Command: vi yarn-site.xml
Fig: Hadoop Installation – Configuring yarn-site.xml

<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
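Before using HDFS for the first time, the NameNode is normally formatted and the daemons started. A typical sequence (assuming the Hadoop bin and sbin directories are on the PATH, as set in .bashrc above) is:

Command: hdfs namenode -format

Command: start-dfs.sh

Command: start-yarn.sh

Command: jps (to verify that NameNode, DataNode, ResourceManager and NodeManager are running)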

2. Implement word count / frequency programs using MapReduce

In Hadoop, MapReduce is a computation that decomposes large manipulation jobs into


individual tasks that can be executed in parallel across a cluster of servers. The results of tasks
can be joined together to compute final results.

MapReduce consists of 2 steps:

Map Function – It takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (Key-Value pair).

Example – (Map function in Word Count)

Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN

Output (converted into another set of (Key, Value) pairs):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Reduce Function – It takes the output from Map as input and combines those data tuples into a
smaller set of tuples.

Example – (Reduce function in Word Count)

Input (set of tuples, i.e. the output of the Map function):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Output (converted into a smaller set of tuples):
(BUS,7), (CAR,7), (TRAIN,4)

Work Flow of the Program

Workflow of MapReduce consists of 5 steps:


1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma,
semicolon, or even by a new line (‘\n’).

2. Mapping – as explained above.

3. Intermediate splitting (shuffle) – the whole process runs in parallel on different nodes of the
cluster; in order to group records in the Reduce phase, all values with the same KEY must end up
on the same node.

4. Reduce – essentially a group-by-key and aggregation phase.

5. Combining – the last phase, where all the data (the individual result sets from each node) is
combined together to form the final result.
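As a minimal sketch of a word-count job (exercise 2), the pair of Python scripts below can be run with Hadoop Streaming instead of a Java job; the file names mapper.py and reducer.py, the HDFS paths and the streaming JAR location are illustrative and depend on your installation.

• Python3

# mapper.py - reads lines from standard input and emits (word, 1) pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("{}\t1".format(word.lower()))

• Python3

# reducer.py - Hadoop Streaming sorts the mapper output by key,
# so all counts for the same word arrive one after another
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("{}\t{}".format(current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("{}\t{}".format(current_word, current_count))

Command:
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
  -files mapper.py,reducer.py \
  -mapper "python mapper.py" \
  -reducer "python reducer.py" \
  -input /input.txt -output /wordcount_output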

3. Implement an MR program that processes a weather or similar dataset. Dataset needs to be


found and used.

Here, we will write a Map-Reduce program for analyzing weather datasets to understand its
data processing programming model. Weather sensors are collecting weather information
across the globe in a large volume of log data. This weather data is semi-structured and
record-oriented.
This data is stored in a line-oriented ASCII format, where each row represents a single
record. Each row has many fields, such as longitude, latitude, daily max/min temperature, daily
average temperature, etc. For simplicity, we will focus on the main element, i.e. temperature.
We will use data from the National Centers for Environmental Information (NCEI). It has
a massive amount of historical weather data that we can use for our data analysis.
Problem Statement:
Analyzing weather data of Fairbanks, Alaska to find cold and hot
days using MapReduce Hadoop.
Step 1:
We can download the dataset from the NCEI site for various cities and years. Choose a
year of your choice and select any one of the data text files for analysis. In my case, I have
selected the CRND0103-2020-AK_Fairbanks_11_NE.txt dataset for the analysis of hot and cold
days in Fairbanks, Alaska.
We can get information about the data from the README.txt file available on the NCEI website.

Step 2:
Below is an example of our dataset, where column 6 and column 7 show the maximum and
minimum temperature, respectively.

Step 3:
Create a project in Eclipse with the following steps:

First open Eclipse -> then select File -> New -> Java Project -> name it MyProject -> then
select "Use an execution environment" -> choose JavaSE-1.8 -> Next -> Finish.

In this project, create a Java class with the name MyMaxMin -> then click Finish.
Copy the source code below into this MyMaxMin Java class.

JAVA

// importing Libraries
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;

public class MyMaxMin {

// Mapper

/*MaxTemperatureMapper class is static


* and extends Mapper abstract class
* having four Hadoop generics type
* LongWritable, Text, Text, Text.
*/

public static class MaxTemperatureMapper extends


Mapper<LongWritable, Text, Text, Text> {

/**
 * @method map
 * This method takes one line of the input file as a Text value.
 * Leaving out the leading fields, the 6th field is taken as
 * temp_max and the 7th field as temp_min. Days with
 * temp_max > 30 or temp_min < 15 are passed to the reducer.
 */

// records carrying this value contain
// missing/inconsistent data
public static final int MISSING = 9999;

@Override
public void map(LongWritable arg0, Text Value, Context context)
throws IOException, InterruptedException {

// Convert the single row(Record) to


// String and store it in String
// variable name line

String line = Value.toString();

// Check for the empty line


if (!(line.length() == 0)) {

// from character 6 to 14 we have


// the date in our dataset
String date = line.substring(6, 14);

// similarly we have taken the maximum


// temperature from 39 to 45 characters
float temp_Max = Float.parseFloat(line.substring(39, 45).trim());

// similarly we have taken the minimum


// temperature from 47 to 53 characters

float temp_Min = Float.parseFloat(line.substring(47, 53).trim());

// if maximum temperature is
// greater than 30, it is a hot day
if (temp_Max > 30.0) {

// Hot day
context.write(new Text("The Day is Hot Day :" + date),
new Text(String.valueOf(temp_Max)));
}

// if the minimum temperature is


// less than 15, it is a cold day
if (temp_Min < 15) {

// Cold day
context.write(new Text("The Day is Cold Day :" + date),
new Text(String.valueOf(temp_Min)));
}
}
}

// closing the MaxTemperatureMapper class
}

// Reducer

/*MaxTemperatureReducer class is static


and extends Reducer abstract class
having four Hadoop generics type
Text, Text, Text, Text.
*/

public static class MaxTemperatureReducer extends


Reducer<Text, Text, Text, Text> {

/**
* @method reduce
* This method takes the input as key and
* list of values pair from the mapper,
* it does aggregation based on keys and
* produces the final context.
*/

@Override
public void reduce(Text Key, Iterable<Text> Values, Context context)
throws IOException, InterruptedException {

// write out the first temperature
// reported for this key
String temperature = Values.iterator().next().toString();
context.write(Key, new Text(temperature));
}

// closing the MaxTemperatureReducer class;
// main() below belongs to MyMaxMin
}

/**
* @method main
* This method is used for setting
* all the configuration properties.
* It acts as a driver for map-reduce
* code.
*/

public static void main(String[] args) throws Exception {

// reads the default configuration of the


// cluster from the configuration XML files
Configuration conf = new Configuration();

// Initializing the job with the


// default configuration of the cluster
Job job = new Job(conf, "weather example");

// Assigning the driver class name


job.setJarByClass(MyMaxMin.class);

// Key type coming out of mapper


job.setMapOutputKeyClass(Text.class);

// value type coming out of mapper


job.setMapOutputValueClass(Text.class);

// Defining the mapper class name


job.setMapperClass(MaxTemperatureMapper.class);

// Defining the reducer class name


job.setReducerClass(MaxTemperatureReducer.class);

// Defining input Format class which is


// responsible to parse the dataset
// into a key value pair
job.setInputFormatClass(TextInputFormat.class);

// Defining output Format class which is


// responsible to parse the dataset
// into a key value pair
job.setOutputFormatClass(TextOutputFormat.class);

// setting the second argument


// as a path in a path variable
Path OutputPath = new Path(args[1]);

// Configuring the input path


// from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));

// Configuring the output path from


// the filesystem into the job
FileOutputFormat.setOutputPath(job, new Path(args[1]));

// deleting the output path automatically
// from hdfs so that we don't have
// to delete it explicitly
OutputPath.getFileSystem(conf).delete(OutputPath);

// exit with status 0 if the job
// completes successfully, 1 otherwise
System.exit(job.waitForCompletion(true) ? 0 : 1);

}
}

Now we need to add external JARs for the packages that we have imported. Download the JAR
packages Hadoop Common and Hadoop MapReduce Core according to your Hadoop version.
You can check the Hadoop version with:

hadoop version

Now we add these external JARs to our MyProject. Right-click on MyProject -> then
select Build Path -> click on Configure Build Path, select Add External JARs…, add the
JARs from their download location and then click Apply and Close.

Now export the project as a JAR file. Right-click on MyProject, choose Export…, go to Java ->
JAR file, click Next, choose your export destination and click Next.
Choose the Main Class as MyMaxMin by clicking Browse, then click Finish -> Ok.
Step 4:
Start our Hadoop Daemons
start-dfs.sh
start-yarn.sh
Step 5:
Move your dataset to the Hadoop HDFS.
Syntax:

hdfs dfs -put /file_path /destination


In the command below, / is the root directory of our HDFS.

hdfs dfs -put /home/dikshant/Downloads/CRND0103-2020-AK_Fairbanks_11_NE.txt /

Check that the file was copied to our HDFS.

hdfs dfs -ls /

Step 6:
Now run your JAR file with the command below and produce the output in the MyOutput directory.
Syntax:
hadoop jar /jar_file_location /dataset_location_in_HDFS /output_directory_name
Command:
hadoop jar /home/dikshant/Documents/Project.jar /CRND0103-2020-AK_Fairbanks_11_NE.txt /MyOutput

Step 7:

Now go to localhost:50070/ in a browser; under Utilities select Browse the file system and
download part-r-00000 from the /MyOutput directory to see the result.
Step 8:

See the result in the downloaded file.

In the above image, you can see the top 10 results showing the cold days. The second column
is the date in yyyymmdd format. For example, 20200101 means
year = 2020, month = 01 and day = 01.

B. R/ Python
4. Implement Linear and logistic Regression
Linear Regression (Python Implementation)
This section discusses the basics of linear regression and its implementation in the Python
programming language.
Linear regression is a statistical method for modelling the relationship between a dependent
variable and a given set of independent variables.
Note: In this section, we refer to dependent variables as responses and independent variables
as features, for simplicity.
In order to provide a basic understanding of linear regression, we start with the most basic
version of linear regression, i.e. Simple linear regression.
Simple Linear Regression
Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function
that predicts the response value(y) as accurately as possible as a function of the feature or
independent variable(x).
Let us consider a dataset where we have a value of response y for every feature x:
For generality, we define:
x as the feature vector, i.e. x = [x_1, x_2, …, x_n],
y as the response vector, i.e. y = [y_1, y_2, …, y_n]
for n observations (in the example used below, n = 10).
A scatter plot of the above dataset looks like this:

Now, the task is to find a line that fits best in the above scatter plot so that we can predict the
response for any new feature values. (i.e a value of x not present in a dataset)
This line is called a regression line.
The equation of the regression line is represented as:

h(x_i) = b_0 + b_1 * x_i

Here,
• h(x_i) represents the predicted response value for ith observation.
• b_0 and b_1 are regression coefficients and represent y-
intercept and slope of regression line respectively.
• To create our model, we must “learn” or estimate the values of regression
coefficients b_0 and b_1. And once we’ve estimated these coefficients, we can
use the model to predict responses!
In this article, we are going to use the principle of Least Squares.
Now consider:

e_i = y_i - h(x_i)

Here, e_i is the residual error in the ith observation.


So, our aim is to minimize the total residual error.
We define the squared error or cost function J as:

J(b_0, b_1) = (1 / 2n) * Σ e_i^2   (summing over i = 1 … n)

and our task is to find the values of b_0 and b_1 for which J(b_0, b_1) is minimum.
Without going into the mathematical details, we present the result here:

b_1 = SS_xy / SS_xx
b_0 = mean(y) - b_1 * mean(x)

where SS_xy is the sum of cross-deviations of y and x:

SS_xy = Σ (x_i - mean(x)) * (y_i - mean(y)) = Σ x_i * y_i - n * mean(x) * mean(y)

and SS_xx is the sum of squared deviations of x:

SS_xx = Σ (x_i - mean(x))^2 = Σ x_i^2 - n * mean(x)^2

Note: The complete derivation of the least squares estimates in simple linear
regression can be found in any standard statistics textbook.
• Code: Python implementation of above technique on our small dataset

• Python

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)

    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)

    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x

    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x

    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color = "m",
                marker = "o", s = 30)

    # predicted response vector
    y_pred = b[0] + b[1]*x

    # plotting the regression line
    plt.plot(x, y_pred, color = "g")

    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')

    # function to show plot
    plt.show()

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {} \
          \nb_1 = {}".format(b[0], b[1]))

    # plotting regression line
    plot_regression_line(x, y, b)

if __name__ == "__main__":
    main()

Output:
Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
And the graph obtained looks like this:
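Exercise 4 also asks for logistic regression. The snippet below is a minimal sketch using scikit-learn's LogisticRegression; the small hours-studied/pass-fail dataset is made up purely for illustration and is not part of the dataset above.

• Python3

import numpy as np
from sklearn.linear_model import LogisticRegression

# hours studied (feature) and exam result (response: 0 = fail, 1 = pass)
X = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# fit the logistic model P(y = 1 | x) = 1 / (1 + exp(-(b_0 + b_1 * x)))
model = LogisticRegression()
model.fit(X, y)

print("Coefficient (b_1):", model.coef_[0][0])
print("Intercept (b_0):", model.intercept_[0])
print("P(pass) for 2.75 hours of study:", model.predict_proba([[2.75]])[0][1])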
5. Implement SVM / Decision tree classification techniques
Introduction to SVMs: In machine learning, support vector machines (SVMs, also support
vector networks) are supervised learning models with associated learning algorithms that
analyze data used for classification and regression analysis. A Support Vector Machine
(SVM) is a discriminative classifier formally defined by a separating hyperplane. In other
words, given labeled training data (supervised learning), the algorithm outputs an optimal
hyperplane which categorizes new examples.
What is Support Vector Machine?
An SVM model is a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as wide as possible. In
addition to performing linear classification, SVMs can efficiently perform a non-linear
classification, implicitly mapping their inputs into high-dimensional feature spaces.
What does SVM do?
Given a set of training examples, each marked as belonging to one of two categories, an SVM
training algorithm builds a model that assigns new examples to one category or the other,
making it a non-probabilistic binary linear classifier. Make sure you have a basic
understanding of SVMs before you proceed further. Here I'll discuss an example of SVM
classification on the cancer UCI dataset using scikit-learn in Python.
Prerequisites: NumPy, Pandas, matplotlib, scikit-learn. Let's
have a quick example of support vector classification. First we need to create a dataset:

• python3
# importing make_blobs from scikit-learn and the plotting libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# creating dataset X containing n_samples
# and Y containing two classes
X, Y = make_blobs(n_samples=500, centers=2,
                  random_state=0, cluster_std=0.40)

# plotting scatters
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
plt.show()

Output: What support vector machines do is not only draw a line between the two classes here,
but also consider a region of some given width about that line. Here's an example of what it
can look like:

• python3

# creating linspace between -1 to 3.5
xfit = np.linspace(-1, 3.5)

# plotting scatter
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')

# plot a line between the different sets of data
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none',
                     color='#AAAAAA', alpha=0.4)

plt.xlim(-1, 3.5)
plt.show()
Importing datasets
This is the intuition of support vector machines, which optimize a linear discriminant model
that maximizes the perpendicular distance (margin) between the two classes. Now let's train the
classifier using our training data. Before training, we need to import the cancer dataset as a
CSV file, from which we will use two features out of all the features.

• python3

# importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# reading csv file and extracting class column to y.
x = pd.read_csv("C:\...\cancer.csv")
a = np.array(x)
y = a[:,30] # classes having 0 and 1

# extracting two features
x = np.column_stack((x.malignant, x.benign))

# 569 samples and 2 features
x.shape

print(x, y)

[[ 122.8  1001.  ]
 [ 132.9  1326.  ]
 [ 130.   1203.  ]
 ...,
 [ 108.3   858.1 ]
 [ 140.1  1265.  ]
 [  47.92  181.  ]]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 1.,
        1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 1.,
        ....,
        1.])
Fitting a Support Vector Machine
Now we'll fit a Support Vector Machine classifier to these points. While the mathematical
details of the underlying model are interesting, we'll leave those for you to read about
elsewhere. Instead, we'll just treat the scikit-learn algorithm as a black box that
accomplishes the above task.

• python3

# import support vector classifier
# "Support Vector Classifier"
from sklearn.svm import SVC
clf = SVC(kernel='linear')

# fitting x samples and y classes
clf.fit(x, y)

After being fitted, the model can then be used to predict new values:

• python3

clf.predict([[120, 990]])

clf.predict([[85, 550]])

Each call returns an array containing the predicted class label (0 or 1) for the given pair of
feature values.
Let's have a look at the graph to see how this looks. The plot is obtained by drawing the
fitted model's optimal hyperplane over the data with matplotlib, as in the sketch below.
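The sketch below shows one way to draw that region; it reuses the x, y and clf variables from the snippets above, and the axis labels are placeholders since the columns are simply the two extracted features.

• python3

import numpy as np
import matplotlib.pyplot as plt

# build a grid covering the range of the two features
xx, yy = np.meshgrid(np.linspace(x[:, 0].min(), x[:, 0].max(), 200),
                     np.linspace(x[:, 1].min(), x[:, 1].max(), 200))

# signed distance of every grid point to the separating hyperplane
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.scatter(x[:, 0], x[:, 1], c=y, s=30, cmap='spring')
# decision boundary (level 0) and the two margins (levels -1 and +1)
plt.contour(xx, yy, Z, levels=[-1, 0, 1],
            linestyles=['--', '-', '--'], colors='k')
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.show()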

6. Implement clustering techniques


k-means, DBSCAN and HAC are 3 very popular clustering algorithms which all take very
different approaches to creating clusters.
Cluster Analysis
Imagine we have some data. In cluster analysis, we want to separate different groups based on
the data, in an unsupervised manner (no a priori information).
Looking at a plot of the above data, we can say that it fits into 2 different groups – a cluster of
points in the bottom left and a larger, elongated cluster on the top right. When we give this
data to a clustering algorithm, it will create a split. Algorithms like k-means need to be told
how many clusters we want. In some cases, we don’t need to specify the number of clusters.
DBSCAN for instance is smart enough to figure out how many clusters there are in the data.
The data above is from the IRIS data set. This was collected by the famous statistician R.A.
Fisher, who measured properties such as petal width, petal length, sepal width and sepal length
for three different species of flowers. Since we are doing clustering, we have removed the
class labels from the data set, as that grouping is exactly what the clustering algorithm is
trying to recover, i.e. which data points belong together.
Clustering
Grouping data into clusters so that the data in each cluster has similar attributes or properties.
For example the data in the small cluster in the above plot have small petal length and small
petal width.
There are several applications of clustering analysis used across a variety of fields:
Market analysis and segmentation
Medical imaging – Xrays, MRIs, fMRIs
Recommender systems – such as those used on Amazon.com
Geospatial data – longitudinal coordinates etc
Anomaly detection
People have used clustering algorithms to detect brain anomalies. We see below various brain
images. C1 – C7 are various clusters. This is an example of clustering in the medical domain.

Another example is a spatial analysis of user-generated ratings of venues on yelp.com


Cluster Analysis
A set of points X1….Xn are taken in, analyzed and out comes a set of mappings from each
point to a cluster (X1 -> C1, X2 -> C2 etc).

There are several such algorithms that will detect clusters. Some of these algorithms have
additional parameters e.g. Number of clusters. These parameters vary for each algorithm.
The input however is a set of data points X1…Xn in any dimensionality i.e. 2D, 3D, 100D
etc. For our purposes, we will stick with 2D points. It is hard to visualize data of higher
dimensions though there are dimensionality reduction techniques that reduce say 100
dimensions to 2 so that they can be plotted.
The output is a cluster assignment where each point either belongs to a cluster or could be an
outlier (noise).
Cluster analysis is a kind of unsupervised machine learning technique, as in general, we do
not have any labels. There may be some techniques that use class labels to do clustering but
this is generally not the case.
Summary
We discussed what clustering analysis is, various clustering algorithms, what are the inputs
and outputs of these algorithms. We discussed various applications of clustering – not
necessarily in the data science field.
Part 2
In this part, we will look at probably the most popular clustering algorithm, i.e. k-means
clustering.

This is a very popular, simple and easy-to-implement algorithm.


In this algorithm, we separate the data into k disjoint clusters. These clusters are defined such
that they minimize the within-cluster sum-of-squares. We'll discuss this more when we look at
k-means convergence. Disjoint here means that one point cannot belong to more than one cluster.
There is only 1 parameter for this algorithm i.e. k (the number of clusters). We need to have
an idea of k before we run the algorithm. Sometimes it is obvious how many clusters we
should have, but sometimes it is not that clear. We will discuss later how to make this choice.
This algorithm is a baseline algorithm in clustering.
The cluster center/centroid is a point that represents the cluster. The figure above has a red
and a blue cluster. X is the centroid – the average of the x and y coordinates. In the blue
cluster the average of the x and y coordinates is somewhere in the middle represented by the
X in the middle of the square.
K-means Clustering Algorithm
1. Randomly initialize the cluster centers. For example, in the above diagram, we pick 2
random points to initialize the clusters.
2. Assign each point to its nearest cluster center using a distance measure such as the Euclidean distance.
3. Update the cluster centroids using the mean of the points assigned to it.
4. Go back to 2 until convergence (the cluster centroids stop moving or they move small
imperceptible amounts).
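A compact NumPy sketch of these four steps is shown below; the two-blob data set and k = 2 are made up for illustration, and empty clusters are not handled in this simple version.

• Python3

import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k of the data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# two obvious blobs of 2-D fake data
data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(data, k=2)
print(centers)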
Let’s go through an example (fake data) and discuss some interesting things about this
algorithm.
Visually, we can see that there are 2 clusters present.
Let’s randomly assign the cluster centers.

Let’s now assign each point to the closest cluster.


The points are now colored blue or red depending on which centroid they are closer to.
Next we need to update our cluster centroids. For the blue centroid, we take the average of all
the x-coordinates – this will be the new x-coordinate for the centroid. Similarly we look at all
the y-coordinates for the blue points, take their average and this becomes the new y-
coordinate for the centroid. Likewise for the red points.
When I do this, the centroids shift over.

Once again I need to figure out which centroid each point is close to, which gives me the
following.
Once again, I update my cluster centroids as before.

If we try to do another shift, the centroids won’t move again. This is evident from the last 2
figures where the same points are assigned to the cluster centroids.

At this point, we say that k-means has converged.

Convergence
Convergence means that the cluster centroids don't move at all, or move only a very small
amount. We use a threshold value: if no centroid moves by more than that threshold, k-means has
converged.

Mathematically, k-means is guaranteed to converge in a finite number of iterations (assigning


point to a cluster and shifting). It may take a long time, but will eventually converge. It does
not say anything about best or optimal clustering, just that it will converge.

K-means is sensitive to where you initialize the centroids. There are a few techniques to do
this:

• Assign each cluster center to a random data point.
• Choose k points that are as far away from each other as possible within
the bounds of the data.
• Repeat k-means several times with different random initializations and
keep the best clustering.
• A more advanced approach, k-means++, picks each new initial center with
probability proportional to its squared distance from the centers already
chosen. We won't be getting into it here.
Choosing k (How many clusters to use)

One way is to plot the data points and try different values to see what works the best. Another
technique is called the elbow method.

Elbow method
Steps:

Choose some values of k and run the clustering algorithm

For each cluster, compute the within-cluster sum-of-squares between the centroid and each
data point.

Sum up for all clusters, plot on a graph

Repeat for different values of k, keep plotting on the graph.

Then pick the elbow of the graph.

This is a popular method supported by several libraries.
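A short sketch of the elbow method with scikit-learn is shown below; the blob data is made up for illustration, and KMeans exposes the within-cluster sum-of-squares as the inertia_ attribute.

• Python3

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# fake 2-D data with two obvious groups
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])

ks = range(1, 10)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum-of-squares for this k

plt.plot(list(ks), wcss, 'o-')
plt.xlabel('k (number of clusters)')
plt.ylabel('within-cluster sum-of-squares')
plt.show()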

Advantages Of k-means

It is a widely known and used algorithm.

It's also fairly simple to understand and easy to implement.

It is also guaranteed to converge.

Disadvantages of k-means

It can be slow, i.e. it may take a long time to converge.

It may converge to a local minimum rather than the globally optimal solution.

It's also not very robust against varying cluster shapes, e.g. it may not perform very well for
elongated clusters. This is because we use the same parameters for each cluster.

This was a quick overview of k-means clustering. Let's now look at how it performs on
different kinds of data sets.

7. Visualize data using any plotting framework


In today's world, a lot of data is being generated on a daily basis. Analyzing
this data for trends and patterns can become difficult if the data is in its
raw format. To overcome this, data visualization comes into play. Data
visualization provides a good, organized pictorial representation of the data,
which makes it easier to understand, observe and analyze. In this tutorial, we
will discuss how to visualize data using Python.
Python provides various libraries that come with different features for
visualizing data. All these libraries come with different features and can
support various types of graphs. In this tutorial, we will be discussing four such
libraries.
• Matplotlib
• Seaborn
• Bokeh
• Plotly
We will discuss these libraries one by one and will plot some most commonly
used graphs.
Note: If you want to learn in-depth information about these libraries you can
follow their complete tutorial.
Before diving into these libraries, we first need a dataset to plot. We will be
using the tips dataset throughout this tutorial. Let's see a brief overview of
this dataset.
Database Used
Tips Database
The tips dataset is a record of the tips given by customers in a restaurant
over two and a half months in the early 1990s. It contains seven columns:
total_bill, tip, sex, smoker, day, time and size.
The tips dataset is widely available online (it also ships with the Seaborn library).
Example:
• Python3
import pandas as pd

# reading the database
data = pd.read_csv("tips.csv")

# printing the top 10 rows
# (display() is available in Jupyter/IPython;
#  use print(data.head(10)) in a plain script)
display(data.head(10))

Output:
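As a first plot, the sketch below uses Matplotlib on the same tips.csv file (column names as described above) to draw a scatter plot of the tip against the total bill:

• Python3

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("tips.csv")

# scatter plot of total bill vs. tip
plt.scatter(data['total_bill'], data['tip'])
plt.title("Tips dataset")
plt.xlabel('total_bill')
plt.ylabel('tip')
plt.show()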

8. Implement an application that stores big data in Hbase / MongoDB / Pig using
Hadoop / R.

To get started with MongoDB, you'll first need to sign up for a free MongoDB Atlas
account. Once you have created your account, you will be prompted to name your
organization, name your project, and choose the language for code samples and help.
Next, choose the type of account you need.

I chose the free option for this example. It’s worth noting that the free tier here remains free,
as opposed to other products which might offer a free trial period only.

Next, create a cluster. Unless you want to modify the cluster, you can choose the default and
click Create Cluster.

It will take a few minutes for the cluster to provision. Once complete, you will see a screen
like the one below:

Click on the Connect button to start setting up your connection. Here you will have to add your
local IP address and create a user for your database. The IP address will auto-populate with
your local IP address. Add a description if you want and click Add IP Address. Then add a
Username and Password and click Create Database User. After that, click the Choose a
connection method button, which will now be active.
The next screen will give you the option to choose how you will connect to your new database.
Since you are going to be connecting with Python, choose Connect your application.
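Once Atlas shows the connection string, a minimal sketch of connecting and storing a document with PyMongo looks like the following (install it with pip install pymongo; the connection string, database and collection names below are placeholders to be replaced with your own values):

• Python3

from pymongo import MongoClient

# paste the connection string from the Atlas "Connect your application" screen here;
# mongodb+srv URIs also require the dnspython package
client = MongoClient("mongodb+srv://<username>:<password>@<cluster-url>/test")

db = client["mydatabase"]          # placeholder database name
collection = db["measurements"]    # placeholder collection name

# insert one document and read it back
collection.insert_one({"sensor": "s1", "temperature": 21.5})
print(collection.find_one({"sensor": "s1"}))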
