Data Analytics Lab
Alternatively, a desktop with a multicore processor and 64/128 GB RAM can be used to install
Hadoop. R shall also be available in the laboratory.
Exercises:
A. Hadoop
1. Install, configure and run Hadoop and HDFS
Hadoop is a widely used, open-source software framework, written mainly in Java with some
native code in C and shell scripts. It can effectively manage large data, in both structured
and unstructured formats, on clusters of computers using simple programming models.
Install Hadoop
Step 1: Download the Java 8 package and save the file in your home directory.
Command: ls
All the Hadoop configuration files are located in the hadoop-2.7.3/etc/hadoop directory.
Edit core-site.xml and add the property below inside the configuration tag:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Step 8: Edit hdfs-site.xml and edit the property mentioned below inside the configuration tag:
Command: vi hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
</configuration>
Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside the
configuration tag:
mapred-site.xml contains the configuration settings for MapReduce applications, such as the
number of JVMs that can run in parallel, the sizes of the mapper and reducer processes, the
CPU cores available to a process, etc.
In some cases, the mapred-site.xml file is not present, so we have to create it from the
mapred-site.xml.template file.
Command: cp mapred-site.xml.template mapred-site.xml
Command: vi mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Step 10: Edit yarn-site.xml and edit the property mentioned below inside the configuration tag:
Command: vi yarn-site.xml
Fig: Hadoop Installation – Configuring yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
Map Function – It takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (Key-Value pair).
Input set of data:
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
Final output (word counts), e.g.:
(BUS,7), (TRAIN,4)
5. Combining – The last phase, where all the data (the individual result sets from each node)
is combined together to form the final result.
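To make the map and combining phases concrete, here is a small Python sketch that simulates the same word-count idea outside Hadoop; the input string is only illustrative.

from collections import defaultdict

# map phase: emit a (key, 1) tuple for every word, normalizing case
text = "Bus Car bus car train car bus car train bus TRAIN BUS buS caR CAR car BUS TRAIN"
mapped = [(word.lower(), 1) for word in text.split()]

# combine/reduce phase: group the tuples by key and sum the counts
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))  # {'bus': 7, 'car': 7, 'train': 4}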
Here, we will write a MapReduce program for analyzing weather datasets to understand its
data processing programming model. Weather sensors collect weather information across the
globe in a large volume of log data. This weather data is semi-structured and record-oriented.
The data is stored in a line-oriented ASCII format, where each row represents a single
record. Each row has many fields, such as longitude, latitude, daily maximum and minimum
temperature, daily average temperature, etc. For simplicity, we will focus on the main
element, i.e., temperature. We will use data from the National Centers for Environmental
Information (NCEI). It has a massive amount of historical weather data that we can use for
our data analysis.
Problem Statement:
Analyzing weather data of Fairbanks, Alaska, to find cold and hot days using MapReduce Hadoop.
Step 1:
We can download the dataset from this link, which provides data for various cities in
different years. Choose the year of your choice and select any one of the data text files
for analysis. In my case, I have selected the CRND0103-2020-AK_Fairbanks_11_NE.txt dataset
for the analysis of hot and cold days in Fairbanks, Alaska.
We can get information about the data from the README.txt file available on the NCEI website.
Step 2:
Below is an example of our dataset, where column 6 and column 7 show the maximum and
minimum temperature, respectively.
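Before writing the MapReduce job, the record layout can be sanity-checked with a few lines of Python; the sample line below is a made-up illustration that only follows the column-6/column-7 convention described above.

# hypothetical whitespace-separated record; fields 6 and 7 hold the
# daily maximum and minimum temperature, as described above
line = "23583 20200101 2.422 -147.51 64.97 -12.5 -20.1"
fields = line.split()
date = fields[1]                 # assumed date field (e.g. 20200101)
temp_max = float(fields[5])      # column 6: daily maximum temperature
temp_min = float(fields[6])      # column 7: daily minimum temperature
print(date, temp_max, temp_min)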
Step 3:
Create a project in Eclipse with the steps below:
First open Eclipse -> then select File -> New -> Java Project -> name it MyProject -> then
select "use an execution environment" -> choose JavaSE-1.8 -> Next -> Finish.
In this project, create a Java class with the name MyMaxMin -> then click Finish.
Copy the source code below into this MyMaxMin Java class.
JAVA
// importing Libraries
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyMaxMin {

    // Mapper
    /**
     * @method map
     * This method takes the input as a text data type.
     * Leaving the first five tokens of each record, the
     * 6th token is taken as temp_max and the
     * 7th token is taken as temp_min. Days with
     * temp_max > 30 or temp_min < 15 are
     * passed to the reducer.
     */
    public static class MaxMinMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        public void map(LongWritable arg0, Text Value, Context context)
                throws IOException, InterruptedException {

            String line = Value.toString();
            if (line.trim().isEmpty()) {
                return;
            }
            // NOTE: the field positions below assume the NCEI daily layout
            // described above (date in field 2, max temperature in field 6,
            // min temperature in field 7); adjust them to match your dataset.
            String[] tokens = line.trim().split("\\s+");
            String date = tokens[1];
            float temp_Max = Float.parseFloat(tokens[5]);
            float temp_Min = Float.parseFloat(tokens[6]);

            // if maximum temperature is
            // greater than 30, it is a hot day
            if (temp_Max > 30.0) {
                // Hot day
                context.write(new Text("The Day is Hot Day :" + date),
                        new Text(String.valueOf(temp_Max)));
            }
            // if minimum temperature is
            // lower than 15, it is a cold day
            if (temp_Min < 15) {
                // Cold day
                context.write(new Text("The Day is Cold Day :" + date),
                        new Text(String.valueOf(temp_Min)));
            }
        }
    }

    // Reducer
    /**
     * @method reduce
     * This method takes the input as key and
     * list of values pair from the mapper,
     * it does aggregation based on keys and
     * produces the final context.
     */
    public static class MaxMinReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text Key, Iterable<Text> Values, Context context)
                throws IOException, InterruptedException {
            // each key is unique per day, so emit the key with its temperature value
            for (Text value : Values) {
                context.write(Key, value);
            }
        }
    }

    /**
     * @method main
     * This method is used for setting
     * all the configuration properties.
     * It acts as a driver for map-reduce
     * code.
     */
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "weather example");
        job.setJarByClass(MyMaxMin.class);
        job.setMapperClass(MaxMinMapper.class);
        job.setReducerClass(MaxMinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Now we need to add external JARs for the packages that we have imported. Download the JAR
packages Hadoop Common and Hadoop MapReduce Core according to your Hadoop version.
You can check Hadoop Version:
hadoop version
Now we add these external JARs to MyProject. Right-click on MyProject -> select Build Path ->
Configure Build Path -> Add External JARs..., add the JARs from their download location, and
then click Apply and Close.
Now export the project as a JAR file. Right-click on MyProject, choose Export..., go to
Java -> JAR file, click Next, choose your export destination, and click Next.
Choose the Main Class as MyMaxMin by clicking Browse, and then click Finish -> Ok.
Step 4:
Start our Hadoop Daemons
start-dfs.sh
start-yarn.sh
Step 5:
Move your dataset to the Hadoop HDFS.
Syntax:
hdfs dfs -put /local_dataset_location /HDFS_destination_path
Step 6:
Now run your JAR file with the command below to produce the output in the MyOutput directory.
Syntax:
hadoop jar /jar_file_location /dataset_location_in_HDFS /output_file_name
Command:
hadoop jar /home/dikshant/Documents/Project.jar /CRND0103-2020-AK_Fairbanks_11_NE.txt /MyOutput
Step 7:
Now go to localhost:50070/, under Utilities select "Browse the file system", and download
part-r-00000 from the /MyOutput directory to see the result.
Step 8:
In the output, you can see the top 10 results showing the cold days. The second column is the
date in yyyymmdd format. For example, 20200101 means
year = 2020, month = 01 (January) and day = 01.
B. R/ Python
4. Implement Linear and Logistic Regression
Linear Regression (Python Implementation)
This section discusses the basics of linear regression and its implementation in the Python
programming language.
Linear regression is a statistical method for modeling the relationship between a dependent
variable and a given set of independent variables.
Note: In this article, we refer to dependent variables as responses and independent variables
as features for simplicity.
In order to provide a basic understanding of linear regression, we start with the most basic
version of linear regression, i.e. Simple linear regression.
Simple Linear Regression
Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function
that predicts the response value(y) as accurately as possible as a function of the feature or
independent variable(x).
Let us consider a dataset where we have a value of response y for every feature x:
x: 0 1 2 3 4 5 6 7 8 9
y: 1 3 2 5 7 8 8 9 10 12
For generality, we define:
x as the feature vector, i.e., x = [x_1, x_2, …, x_n],
y as the response vector, i.e., y = [y_1, y_2, …, y_n]
for n observations (in the above example, n = 10).
A scatter plot of the above dataset looks like this:
Now, the task is to find the line that best fits the above scatter plot so that we can predict
the response for any new feature value (i.e., a value of x not present in the dataset).
This line is called a regression line.
The equation of the regression line is represented as:
h(x_i) = b_0 + b_1*x_i
Here,
• h(x_i) represents the predicted response value for the ith observation.
• b_0 and b_1 are regression coefficients and represent the y-intercept and slope of the
regression line, respectively.
• To create our model, we must “learn” or estimate the values of regression
coefficients b_0 and b_1. And once we’ve estimated these coefficients, we can
use the model to predict responses!
In this section, we are going to use the principle of least squares.
Now consider the cost function, the sum of squared residuals:
J(b_0, b_1) = Σ (y_i - h(x_i))^2
and our task is to find the values of b_0 and b_1 for which J(b_0, b_1) is minimum!
Without going into the mathematical details, we present the result here:
b_1 = SS_xy / SS_xx, where SS_xy = Σ (x_i - x̄)(y_i - ȳ) and SS_xx = Σ (x_i - x̄)^2,
b_0 = ȳ - b_1 * x̄,
with x̄ and ȳ denoting the means of x and y.
Note: The complete derivation for finding least squares estimates in simple linear
regression can be found here.
• Code: Python implementation of the above technique on our small dataset
• Python
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations and means of x and y
    n = np.size(x)
    m_x, m_y = np.mean(x), np.mean(y)
    # cross-deviation and deviation about x
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # least-squares estimates of the regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting the observations and the fitted regression line
    plt.scatter(x, y, marker='o', s=30)
    plt.plot(x, b[0] + b[1] * x)
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    plt.show()

if __name__ == "__main__":
    main()
Output:
Estimated coefficients:
b_0 = -0.0586206896552
b_1 = 1.45747126437
The script also displays a scatter plot of the data together with the fitted regression line.
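As an optional cross-check (not part of the code above), the same least-squares coefficients can also be obtained with scikit-learn's LinearRegression:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]).reshape(-1, 1)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

model = LinearRegression().fit(x, y)
# intercept_ corresponds to b_0 and coef_[0] to b_1
print(model.intercept_, model.coef_[0])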
5. Implement SVM / Decision tree classification techniques
Introduction to SVMs: In machine learning, support vector machines (SVMs, also support
vector networks) are supervised learning models with associated learning algorithms that
analyze data used for classification and regression analysis. A Support Vector Machine
(SVM) is a discriminative classifier formally defined by a separating hyperplane. In other
words, given labeled training data (supervised learning), the algorithm outputs an optimal
hyperplane which categorizes new examples.
What is Support Vector Machine?
An SVM model is a representation of the examples as points in space, mapped so that the
examples of the separate categories are divided by a clear gap that is as wide as possible. In
addition to performing linear classification, SVMs can efficiently perform a non-linear
classification, implicitly mapping their inputs into high-dimensional feature spaces.
What does SVM do?
Given a set of training examples, each marked as belonging to one of two categories, an SVM
training algorithm builds a model that assigns new examples to one category or the other,
making it a non-probabilistic binary linear classifier. You should have a basic understanding
of SVMs before you proceed further. Here I'll discuss an example of SVM classification of the
cancer UCI dataset using machine learning tools, i.e., scikit-learn with Python.
Prerequisites: NumPy, Pandas, Matplotlib, scikit-learn. Let's
have a quick example of support vector classification. First we need to create a dataset:
• python3
# importing scikit-learn's make_blobs to create a toy two-class dataset
from sklearn.datasets import make_blobs
# the make_blobs parameters below are illustrative choices
X, Y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=0.40)
• python3
# plotting the scatter of the generated points
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=Y, s=50, cmap='spring')
plt.xlim(-1, 3.5)
plt.show()
Importing datasets
This is the intuition behind support vector machines: they optimize a linear discriminant
model that maximizes the perpendicular distance (margin) between the decision boundary and
the nearest points of each class. Now let's train the classifier using our training data.
Before training, we need to import the cancer dataset as a CSV file, from which
we will train two features out of all the features.
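The CSV-loading step itself is not shown in this handout. As a minimal sketch, assuming scikit-learn's built-in breast cancer dataset is used instead of a local CSV file, the two features (mean perimeter and mean area) and the class labels could be obtained as follows:

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
# feature columns 2 and 3 are 'mean perimeter' and 'mean area'
x = cancer.data[:, [2, 3]]
y = cancer.target.astype(float)   # 0/1 class labels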
• python3
print(x)
print(y)
[[ 122.8  1001.  ]
 [ 132.9  1326.  ]
 [ 130.   1203.  ]
 ...,
 [ 108.3   858.1 ]
 [ 140.1  1265.  ]
 [  47.92  181.  ]]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 1.,
        1., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 1.,
        ...,
        1.])
Fitting a Support Vector Machine
Now we'll fit a Support Vector Machine classifier to these points. While the mathematical
details of the likelihood model are interesting, we'll leave you to read about those elsewhere.
Instead, we'll just treat the scikit-learn algorithm as a black box which accomplishes the
above task.
• python3
# fitting a support vector classifier to the two selected features
# (a linear kernel is assumed here, matching the linear classifier described above)
from sklearn.svm import SVC
clf = SVC(kernel='linear')
clf.fit(x, y)
After being fitted, the model can then be used to predict new values:
• python3
clf.predict([[120, 990]])
clf.predict([[85, 550]])
array([...])
Let's have a look at the graph to see how this works.
There are several such algorithms that will detect clusters. Some of these algorithms have
additional parameters, e.g., the number of clusters. These parameters vary for each algorithm.
The input, however, is a set of data points X1…Xn in any dimensionality, i.e., 2D, 3D, 100D,
etc. For our purposes, we will stick with 2D points. It is hard to visualize data of higher
dimensions, though there are dimensionality reduction techniques that reduce, say, 100
dimensions to 2 so that the data can be plotted (see the sketch below).
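As a rough illustration of that last point (the data here is random and purely for demonstration), such a reduction to two dimensions can be done with PCA:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 100)                 # 200 illustrative points in 100 dimensions
X_2d = PCA(n_components=2).fit_transform(X)  # project onto the top 2 principal components
print(X_2d.shape)                            # (200, 2): now plottable as a 2D scatter plot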
The output is a cluster assignment where each point either belongs to a cluster or could be an
outlier (noise).
Cluster analysis is a kind of unsupervised machine learning technique, as in general, we do
not have any labels. There may be some techniques that use class labels to do clustering but
this is generally not the case.
Summary
We discussed what cluster analysis is, the various clustering algorithms, and the inputs and
outputs of these algorithms. We also discussed various applications of clustering, not
necessarily in the data science field.
Part 2
In this part, we will look at probably the most popular clustering algorithm, i.e., k-means
clustering.
We again determine which centroid each point is closest to, and assign each point to its
nearest centroid.
Once again, we update the cluster centroids as before.
If we try to do another shift, the centroids won't move again, because the same points remain
assigned to the same cluster centroids.
Convergence
Convergence means that the cluster centroids don't move at all, or move only by a very small
amount. We use a threshold value: if no centroid moves by more than that amount, k-means
has converged.
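A minimal NumPy sketch of this assign-and-update loop with a convergence threshold is shown below (it is illustrative only and assumes no cluster ever becomes empty):

import numpy as np

def kmeans(X, k, tol=1e-4, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assignment step: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # convergence check: stop when no centroid moves by more than tol
        if np.all(np.linalg.norm(new_centroids - centroids, axis=1) < tol):
            break
        centroids = new_centroids
    return centroids, labels

For example, kmeans(X, 3) on a 2D NumPy array X returns the final centroids and a cluster label for each point.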
K-means is sensitive to where you initialize the centroids and to the number of clusters you
choose. There are a few techniques for choosing the number of clusters:
One way is to plot the data points and try different values to see what works best. Another
technique is called the elbow method.
Elbow method
Steps:
1. For each candidate number of clusters k, run k-means and, for each cluster, compute the
within-cluster sum-of-squares between the centroid and each data point; sum these over all
clusters.
2. Plot the total within-cluster sum-of-squares against k; the "elbow" of the curve, where the
decrease levels off, suggests a good number of clusters.
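A short sketch of the elbow method using scikit-learn's KMeans (the synthetic blobs are only for illustration; the within-cluster sum-of-squares is read from the fitted model's inertia_ attribute):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)   # total within-cluster sum-of-squares for this k

plt.plot(list(ks), wcss, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('within-cluster sum-of-squares')
plt.show()   # the "elbow" of this curve suggests a good value of k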
Advantages Of k-means
Disadvantages of k-means
• It may converge to a local minimum, which may not be the globally optimal solution.
• It is also not very robust against varying cluster shapes, e.g., it may not perform very
well for elongated cluster shapes. This is because we use the same parameters for each cluster.
This was a quick overview of k-means clustering. Let's now look at how it performs on
different kinds of datasets.
import pandas as pd
# reading the tips dataset and displaying the first 10 rows
data = pd.read_csv("tips.csv")
display(data.head(10))  # display() works in a notebook; use print() in a plain script
Output:
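One possible continuation (the choice of the total_bill and tip columns, and of three clusters, is an assumption made here for illustration) is to cluster the tips data with k-means:

import pandas as pd
from sklearn.cluster import KMeans

data = pd.read_csv("tips.csv")
features = data[["total_bill", "tip"]]          # two numeric columns
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
data["cluster"] = km.labels_                    # cluster label for each row
print(data.head(10))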
8. Implement an application that stores big data in Hbase / MongoDB / Pig using
Hadoop / R.
To get started with MongoDB, you'll first need to sign up for a free MongoDB Atlas account.
Once you have created your account, you will be prompted to name your organization, name your
project, and choose the language for code samples and help.
Next, choose the type of account you need.
I chose the free option for this example. It’s worth noting that the free tier here remains free,
as opposed to other products which might offer a free trial period only.
Next, create a cluster. Unless you want to modify the cluster, you can choose the default and
click Create Cluster.
It will take a few minutes for the cluster to provision. Once complete, you will see a screen
like the one below:
Click on the Connect button to start setting up your connection. Here you will have to add your
local IP address and create a user for your database. The IP address will auto-populate with
your local IP address. Add a description if you want and click Add IP Address. Then add a
Username and Password and click Create Database User. After that, click the Choose a
connection method button, which will now be active.
The next screen will give you the option to choose how you will connect to your new database.
Since you are going to be connecting with Python, choose Connect your application.
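Once Atlas shows the connection string, a minimal pymongo sketch looks like the following; the URI placeholders and the database/collection names here are hypothetical and must be replaced with your own values.

from pymongo import MongoClient

# placeholder connection string from the Atlas "Connect your application" screen
client = MongoClient("mongodb+srv://<username>:<password>@<cluster-url>/?retryWrites=true&w=majority")

db = client["bigdata_lab"]        # hypothetical database name
collection = db["records"]        # hypothetical collection name

# insert one document and read it back
collection.insert_one({"sensor": "station-1", "temperature": 21.5})
print(collection.find_one({"sensor": "station-1"}))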