
MGM’s COLLEGE OF ENGINEERING AND TECHNOLOGY
Navi Mumbai-410209

DEPARTMENT OF COMPUTER ENGINEERING

Regulations: 2022 Batch: 2024-25

Year: BE Semester: VII

CSL7012 – Big Data Analytics Lab

Prepared by

Ms Vrushali Thakur

LAB INCHARGE HOD

Ms Vrushali Thakur Dr. Rajesh Kadu


INSTITUTE VISION & MISSION

VISION:
To become one of the outstanding Engineering Institutes in India by providing a
conducive and vibrant environment to achieve excellence in the field of Technology.
MISSION:
To empower the aspiring professional students to be prudent enough to explore the
world of technology and mould them to be proficient to reach the pinnacle of success in the
competitive global economy.

COMPUTER ENGINEERING DEPARTMENT


VISION:
 To motivate and empower the students of computer engineering to become globally
competent citizens with ethics to serve and lead the society.
 To provide a stimulating educational environment for computer engineering
graduates to face tomorrow's challenges and to inculcate social responsibility in them.
MISSION:
 To provide an excellent academic environment by adopting innovative teaching
techniques through a well-developed curriculum.
 To foster a self-learning atmosphere for students to provide ethical solutions for
societal challenges.
 To establish Centers of Excellence in various domains of Computer Engineering and
promote active research and development.
 To enhance the competency of the faculty in the latest technology through continuous
development programs.
 To foster networking with alumni and industries.
PROGRAM EDUCATIONAL OBJECTIVES (PEOs)
PEO 1: To prepare the Learner with a sound foundation in the mathematical, scientific and
engineering fundamentals.
PEO 2: To motivate the Learner in the art of self-learning and to use modern tools for solving
real life problems.
PEO 3: To equip the Learner with broad education necessary to understand the impact of
Computer Science and Engineering in a global and social context.
PEO 4: To encourage, motivate and prepare the Learners for lifelong learning.
PEO 5: To inculcate professional and ethical attitude, good leadership qualities and commitment
to social responsibilities in the Learner's thought process.

PROGRAM OUTCOMES (POs)

PO1 Basic Engineering Knowledge: An ability to apply the fundamental knowledge in
mathematics, science and engineering to solve problems in Computer engineering.
PO2 Problem Analysis: Identify, formulate, research literature and analyze computer
engineering problems, reaching substantiated conclusions using first principles of
mathematics, natural sciences and computer engineering and sciences.
PO3 Design/Development of Solutions: Design solutions for complex computer engineering
problems and design system components or processes that meet specified needs with
appropriate consideration for public health and safety, and cultural, societal and
environmental considerations.
PO4 Conduct Investigations of Complex Engineering Problems: Use research-based
knowledge and research methods including design of experiments, analysis and
interpretation of data and synthesis of information to provide valid conclusions.
PO5 Modern Tool Usage: Create, select and apply appropriate techniques, resources and
modern computer engineering and IT tools, including prediction and modeling, to complex
engineering activities with an understanding of the limitations.
PO6 The Engineer and Society: Apply reasoning informed by contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to computer engineering practice.
PO7 Environment and Sustainability: Understand the impact of professional computer
engineering solutions in societal and environmental contexts and demonstrate knowledge
of and need for sustainable development.
PO8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of computer engineering practice.
PO9 Individual and Team Work: Function effectively as an individual, and as a member or
leader in diverse teams and in multidisciplinary settings.
PO10 Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as being able to comprehend and
write effective reports and design documentation, make effective presentations and give
and receive clear instructions.
PO11 Project Management and Finance: Demonstrate knowledge and understanding of
computer engineering and management principles and apply these to one's own work, as a
member and leader in a team, to manage projects and in multidisciplinary environments.
PO12 Life-long Learning: Recognize the need for and have the preparation and ability to
engage in independent and lifelong learning in the broadest context of technological change.
PROGRAM SPECIFIC OUTCOMES (PSOs)

PSO1: Acquire skills to design, analyze and develop algorithms and implement them using
high-level programming languages.
PSO2: Contribute their engineering skills in computing and information engineering domains
like network design and administration, database design and knowledge engineering.
PSO3: Develop strong skills in systematic planning, developing, testing, implementing and
providing IT solutions for different domains which help in the betterment of life.

Lab Objectives
1. Solve Big Data problems using the MapReduce technique and apply it to various algorithms.
2. Apply streaming analytics to real time applications.

Lab Outcomes
1. To interpret business models and scientific computing paradigms, and apply software tools for
big data analytics.
2. To implement algorithms that use MapReduce and apply them to structured and unstructured data.
3. To gain hands-on experience with NoSQL databases such as Cassandra, HBase, MongoDB, etc.
4. To implement various data stream algorithms.
5. To develop and analyze social network graphs with data visualization techniques.
MGM’s COLLEGE OF ENGINEERING AND TECHNOLOGY

NAVI MUMBAI-410209

DEPARTMENT OF COMPUTER ENGINEERING

CSL7012 – Big Data Analytics Lab

List of Experiments

Sr. No.  Title

1. Hadoop HDFS Practical:
HDFS Basics, Hadoop Ecosystem Tools Overview.
Installing Hadoop.
Copying File to Hadoop.
Copy from Hadoop File system and deleting file.
Moving and displaying files in HDFS.
Programming exercises on Hadoop.

2. Write a program to implement a word count program using MapReduce.

3. Use of Sqoop tool to transfer data between Hadoop and relational database servers.
a. Sqoop - Installation.
b. To execute basic commands of Hadoop eco system component Sqoop.

4. To install and configure MongoDB/ Cassandra/ HBase/ Hypertable to execute NoSQL
commands.

5. Experiment on Hadoop Map-Reduce / PySpark:
Implementing simple algorithms in Map-Reduce: Matrix multiplication, Aggregates, Joins,
Sorting, Searching, etc.

6. Create HIVE Database and Descriptive analytics-basic statistics.

7. Implementing DGIM algorithm using any programming language / Implement Bloom Filter
using any programming language / Implement Flajolet-Martin algorithm using any
programming language.

8. Social Network Analysis using R (for example: Community Detection Algorithm).

9. Data Visualization using Hive/PIG/R/Tableau.

10. Exploratory Data Analysis using Spark/ PySpark.

11. Mini Project: One real life large data application to be implemented (Use standard Datasets
available on the web).
- Streaming data analysis: use Flume for data capture, HIVE/PySpark for analysis of twitter
data, chat data, weblog analysis etc.
- Recommendation System (for example: Health Care System, Stock Market Prediction,
Movie Recommendation, etc.) / Spatio-Temporal Data Analytics.

Hardware and Software requirements

H/W Requirements: Ubuntu operating system, minimum 64-bit OS with 4 GB RAM.

S/W Requirements: Hadoop, VM, Java. Hadoop is written in Java; the recommended Java
version is the JDK 1.6 release and the recommended minimum revision is 31 (v 1.6.31).

How to install
Step 1 – Install VM Player

Step 2 – Setup ubuntu Virtual Machine

Step 3 – Install Hadoop

OR
Step 1 – Install Oracle VirtualBox and the Cloudera virtual machine

Step 2 – Setup Ubuntu(64 bit) Virtual Machine

Step 3 – Start working on Hadoop.


Running Hadoop programs in Cloudera
1. Open Eclipse.
2. Open File --> New --> Project --> Java Project -->
Give a name to the project, e.g. Word1, and Finish.
3. Right click on Word1 --> New --> Package -->
Give a name to the package, e.g. Pkg1, and Finish.
4. Right click on Pkg1 --> New --> Class -->
Give a name to the class, e.g. Word12, and Finish.
5. Copy and paste your program here.
6. Right click on Word12.java
Select Build Path --> Configure Build Path --> Libraries
Add External JAR
Select Hadoop --> hadoop-core.jar
7. Right click on Word12.java
Select Build Path --> Configure Build Path --> Libraries
Add External JAR
Select Hadoop --> lib --> commons-cli-1.2.jar
8. Right click on Word12.java
Click --> Export --> JAR file
Give the name Wordc

Create an input text file in the home directory, e.g. Inputf.

Open terminal

Copy input file from local to HDFS


$ hadoop fs -copyFromLocal <local path> <hadoop fs path>
$ hadoop fs -copyFromLocal /training/Inputf /user/training/inputfc/

Run jar file


$ hadoop jar <filename>.jar <package_name>.<class_name> <input directory> <output directory>
$ hadoop jar Wordc.jar wc.Word /user/training/inputfc/ /user/training/outputfc/
Experiment No.01
Aim: To install Hadoop and practice HDFS Commands.

Theory:
What Is Hadoop?

Hadoop is an open-source software platform designed to store and process quantities of data that
are too large for just one particular device or server. Hadoop's strength lies in its ability to scale
across thousands of commodity servers that don't share memory or disk space.

Hadoop Distributed File System (HDFS)

HDFS is the "secret sauce" that enables Hadoop to store huge files. It's a scalable file system
that distributes and stores data across all machines in a Hadoop cluster (a group of servers). Each
HDFS cluster contains the following:

NameNode: Runs on a "master node" that tracks and directs the storage of the cluster.
DataNode: Runs on "slave nodes," which make up the majority of the machines within
a cluster. The NameNode instructs data files to be split into blocks, each of which is
replicated three times and stored on machines across the cluster. These replicas ensure
the entire system won't go down if one server fails or is taken offline, known as "fault
tolerance."
Client machine: Neither a NameNode nor a DataNode, client machines have Hadoop
installed on them. They're responsible for loading data into the cluster, submitting
MapReduce jobs and viewing the results of the job once complete.

MapReduce
MapReduce is the system used to efficiently process the large amounts of data Hadoop stores in
HDFS. Originally created by Google, its strength lies in the ability to divide a single large data
processing job into smaller tasks. All MapReduce jobs are written in Java, but other languages can
be used via the Hadoop Streaming API, which is a utility that comes with Hadoop.

JobTracker: The JobTracker oversees how MapReduce jobs are split up into tasks and
divided among nodes within the cluster.
TaskTracker: The TaskTracker accepts tasks from the JobTracker, performs the work and
alerts the JobTracker once it's done. TaskTrackers and DataNodes are located on the same
nodes to improve performance.
Hadoop 2.6.5 - Installing on Ubuntu 16.04 (Single-Node Cluster)
Installing Java

# Update the source list


staff_111_05@staff-111-05:~$ sudo apt-get update
# The OpenJDK project is the default version of Java
# that is provided from a supported Ubuntu repository.
staff_111_05@staff-111-05:~$ sudo apt-get install default-jdk
staff_111_05@staff-111-05:~$ java -version
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2ubuntu0.16.04.2-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
Adding a dedicated Hadoop user

staff_111_05@staff-111-05:~$ sudo addgroup hadoop


Adding group `hadoop' (GID 1001) ...
Done.
staff_111_05@staff-111-05:~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the defaultFull
Name []:
Room Number []:
Work Phone []:
Home Phone []:
Other []:
Is the information correct? [Y/n] Y
We can check whether the hadoop group and the hduser user were created:

$ groups hduser
hduser : hadoop

Installing SSH

staff_111_05@staff-111-05:~$ sudo apt-get install ssh

staff_111_05@staff-111-05:~$ which ssh


/usr/bin/ssh
staff_111_05@staff-111-05:~$ which sshd
/usr/sbin/sshd
This will install ssh on our machine. If we get output similar to the above, we can assume it is
set up properly.

Create and Setup SSH Certificates

staff_111_05@staff-111-05:~$ su hduser
Password:
hduser@staff-111-05:/home/staff_111_05$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:/M18Dv+ku5js8npZvYi45Fr4F84SzoqXBUO5xAfo+/8 hduser@laptop
The key's randomart image is:
+---[RSA 2048] --- +
| o.o |
| .=. |
| .oo |
| .= |
| .S .|
| . .+ + o|

| ..=o* * .oo|
| .+== *.B++ |
| ..o+==EB*B+.|
+----[SHA256]----+
hduser@staff-111-05:/home/staff_111_05$ cat $HOME/.ssh/id_rsa.pub >>
$HOME/.ssh/authorized_keys

The second command adds the newly created key to the list of authorized keys so that Hadoop
can use ssh without prompting for a password.

We can check if ssh works:

hduser@staff-111-05:/home/staff_111_05$ ssh localhost


The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is
SHA256:e8SM2INFNu8NhXKzdX9bOyKIKbMoUSK4dXKonloN8JY.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 16.04.1 LTS (GNU/Linux 4.4.0-47-generic x86_64)...

Install Hadoop

hduser@staff-111-05:~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
hduser@staff-111-05:~$ tar xvzf hadoop-2.6.5.tar.gz

We want to move the Hadoop installation to the /usr/local/hadoop directory. So, we should create
the directory first:

hduser@staff-111-05:~$sudo mkdir -p /usr/local/hadoop


[sudo] password for hduser:

hduser is not in the sudoers file. This incident will be reported.

We can check that hduser is not in the sudo group:

hduser@staff-111-05:~$sudo -v
Sorry, user hduser may not run sudo on staff-111-05.

This can be resolved by switching to a user with sudo privileges and then adding hduser to the sudo group:

hduser@staff-111-05:~$ cd hadoop-2.6.5
hduser@staff-111-05:~/hadoop-2.6.5$ su staff_111_05
Password:
hduser@staff-111-05:/home/hduser$ sudo adduser hduser sudo
[sudo] password for k:
Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.

Now that hduser has sudo privileges, we can move the Hadoop installation to
the /usr/local/hadoop directory without any problem:

staff_111_05@staff-111-05:~/hadoop-2.6.5$ sudo mv /home/hduser/hadoop-2.6.5/* /usr/local/hadoop
mv: target '/usr/local/hadoop/' is not a directory
hduser@staff-111-05:~$ sudo mkdir /usr/local/hadoop
staff_111_05@staff-111-05:~/hadoop-2.6.5$ sudo mv /home/hduser/hadoop-2.6.5/* /usr/local/hadoop
hduser@staff-111-05:~$ sudo chown -R hduser:hadoop /usr/local/hadoop

Setup Configuration Files


1. ~/.bashrc: Before editing the .bashrc file in hduser's home directory, we need to find the path
where Java has been installed to set the JAVA_HOME environment variable using the following
command:

hduser@staff-111-05:~$ update-alternatives --config java


There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-8-
openjdk-amd64/jre/bin/java
Nothing to configure.

Now we can append the following to the end of ~/.bashrc:

hduser@staff-111-05:~$ vi ~/.bashrc
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END



hduser@staff-111-05:~$ source ~/.bashrc

2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh

We need to set JAVA_HOME by modifying hadoop-env.sh file.

hduser@staff-111-05:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Adding the above statement in the hadoop-env.sh file ensures that the value of JAVA_HOME
variable will be available to Hadoop whenever it is started up.

3. /usr/local/hadoop/etc/hadoop/core-site.xml:

The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration properties that


Hadoop uses when starting up.
This file can be used to override the default settings that Hadoop starts with.

hduser@staff-111-05:~$ sudo mkdir -p /app/hadoop/tmp


hduser@staff-111-05:~$ sudo chown hduser:hadoop /app/hadoop/tmp

Open the file and enter the following in between the <configuration></configuration> tag:

hduser@staff-111-05:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>



</property>

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>

4. /usr/local/hadoop/etc/hadoop/mapred-site.xml
By default, the /usr/local/hadoop/etc/hadoop/ folder contains
/usr/local/hadoop/etc/hadoop/mapred-site.xml.template
file which has to be renamed/copied with the name mapred-site.xml:

hduser@staff-111-05:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
/usr/local/hadoop/etc/hadoop/mapred-site.xml

The /usr/local/hadoop/etc/hadoop/mapred-site.xml file is used to specify which framework is


being used for MapReduce.
We need to enter the following content in between the <configuration></configuration> tag:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
</configuration>

5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
The /usr/local/hadoop/etc/hadoop/hdfs-site.xml file needs to be configured for each host in the
cluster that is being used.
It specifies the directories which will be used as the namenode and the datanode on that host.
Before editing this file, we need to create two directories which will contain the namenode and
the datanode for this Hadoop installation.
This can be done using the following commands:

hduser@staff-111-05:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode


hduser@staff-111-05:~$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
hduser@staff-111-05:~$ sudo chown -R hduser:hadoop /usr/local/hadoop_store

Open the file and enter the following content in between the <configuration></configuration>
tag:

hduser@staff-111-05:~$ vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.



</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>

Format the New Hadoop Filesystem


Now, the Hadoop file system needs to be formatted so that we can start to use it.
The format command should be issued with write permission since it creates current directory
under /usr/local/hadoop_store/hdfs/namenode folder:

hduser@staff-111-05:~$ hadoop namenode -format


DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
16/11/10 13:07:15 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = laptop/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.6.5
...
...
...
16/11/10 13:07:23 INFO util.ExitUtil: Exiting with status 0



16/11/10 13:07:23 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at laptop/127.0.1.1
************************************************************/

Note that hadoop namenode -format command should be executed once before we start using
Hadoop.
If this command is executed again after Hadoop has been used, it'll destroy all the data on the
Hadoop file system.

Starting Hadoop
Now it's time to start the newly installed single node cluster.
We can use start-all.sh or (start-dfs.sh and start-yarn.sh)

hduser@staff-111-05:~$ cd /usr/local/hadoop/sbin
hduser@staff-111-05:/usr/local/hadoop/sbin$ ls
distribute-exclude.sh start-all.cmd stop-balancer.sh
hadoop-daemon.sh start-all.sh stop-dfs.cmd
hadoop-daemons.sh start-balancer.sh stop-dfs.sh
hdfs-config.cmd start-dfs.cmd stop-secure-dns.sh
hdfs-config.sh start-dfs.sh stop-yarn.cmd
httpfs.sh start-secure-dns.sh stop-yarn.sh
kms.sh start-yarn.cmd yarn-daemon.sh
mr-jobhistory-daemon.sh start-yarn.sh yarn-daemons.sh
slaves.sh stop-all.sh
hduser@staff-111-05:/usr/local/hadoop/sbin$ start-all.sh

We can check if it's really up and running:

hduser@ staff-111-05:/usr/local/hadoop/sbin$ jps


14689 DataNode



14660 ResourceManager
14505 SecondaryNameNode
14205 NameNode
14765 NodeManager
15166 Jps

Stopping Hadoop
In order to stop all the daemons running on our machine, we can run stop-all.sh or (stop-
dfs.sh and stop-yarn.sh) :

hduser@ staff-111-05:/usr/local/hadoop/sbin$ stop-all.sh


stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop

Hadoop Web Interfaces


Type http://localhost:50070/ into our browser, then we'll see the web UI of the NameNode
daemon
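With the daemons running, the basic HDFS file operations listed for this experiment (copying,
moving, displaying and deleting files) can be practiced from the shell. A few representative
commands, assuming a local file named sample.txt in hduser's home directory (the file and
directory names here are only placeholders):

hduser@staff-111-05:~$ hadoop fs -mkdir -p /user/hduser/input
hduser@staff-111-05:~$ hadoop fs -copyFromLocal sample.txt /user/hduser/input/
hduser@staff-111-05:~$ hadoop fs -ls /user/hduser/input
hduser@staff-111-05:~$ hadoop fs -cat /user/hduser/input/sample.txt
hduser@staff-111-05:~$ hadoop fs -mv /user/hduser/input/sample.txt /user/hduser/sample.txt
hduser@staff-111-05:~$ hadoop fs -copyToLocal /user/hduser/sample.txt /home/hduser/copy.txt
hduser@staff-111-05:~$ hadoop fs -rm /user/hduser/sample.txt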

Conclusion:
Thus we studied the Hadoop ecosystem and successfully installed Hadoop.



Experiment No.02
Aim: Write a program to implement word count program using MapReduce.
Theory:
WordCount example reads text files and counts how often words occur. The input is text files
and the output is text files, each line of which contains a word and the count of how often it
occurred, separated by a tab. Each mapper takes a line as input and breaks it into words. It then
emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits
a single key/value with the word and sum.

All of the files in the input directory are read and the counts of words in the input are written to
the output directory. It is assumed that both inputs and outputs are stored in HDFS. If your input
is not already in HDFS, but is rather in a local file system somewhere, you need to copy the data
into HDFS first (for example with hadoop fs -copyFromLocal, as shown in the previous section).

Steps to run WordCount Application in Eclipse


Step-1
Download Eclipse if you don't have it.
Step-2
Open Eclipse and make a Java project.
In Eclipse click on the File menu -> New -> Java Project. Enter your project
name there; here it is WordCount. Make sure the Java version is 1.6 or above.
Click on Finish.



Step-3
Make a Java class file and write the code.
Click on the WordCount project. There will be an 'src' folder. Right click on the 'src' folder
-> New -> Class. Write the class file name; here it is WordCount. Click on Finish.

Copy and paste the code below into WordCount.java. Save it.

You will get lots of errors, but don't panic. They occur because the MapReduce program
requires external Hadoop libraries, which are added in Step 4.
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

Step-4
Add external libraries from hadoop.
Right click on the WordCount project -> Build Path -> Configure Build Path -> click on
Libraries -> click on the 'Add External JARs...' button.



Select below files from hadoop folder.
In my case:- /usr/local/hadoop/share/hadoop
4.1 Add jar files from /usr/local/hadoop/share/hadoop/common folder.

4.2 Add jar files from /usr/local/hadoop/share/hadoop/common/lib folder.

4.3 Add jar files from /usr/local/hadoop/share/hadoop/mapreduce folder (no need to
add hadoop-mapreduce-examples-2.7.3.jar).

4.4 Add jar files from /usr/local/hadoop/share/hadoop/yarn folder.



Click on ok. Now you can see, all error in code is gone.
Step 5
Running the MapReduce code.
5.1 Make an input file for the WordCount project.
Right click on the WordCount project -> New -> File. Write the file name and click on OK.
You can copy and paste the contents below into your input file.
car bus bike
bike bus
aeroplane
truck car bus
5.2 Right click on the WordCount project -> click on Run As -> click on Run
Configurations. Make a new configuration by clicking on 'new launch configuration'.
Set the Configuration Name, Project Name and Class file name.

Output of the WordCount application and the output logs appear in the console.

Refresh the WordCount project: right click on the project -> click on Refresh. You can find
the 'out' directory in the project explorer. Open the 'out' directory. There will be a 'part-r-00000'
file. Double click to open it.

Conclusion: Word count example is implemented in Hadoop.



Experiment No.03
Aim: Use of Sqoop tool to transfer data between Hadoop and relational database servers.
a. Sqoop - Installation.
b. To execute basic commands of Hadoop eco system component Sqoop.

Theory:
What is Sqoop?
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and
external datastores such as relational databases, enterprise data warehouses.
Sqoop is used to import data from external datastores into Hadoop Distributed File System or
related Hadoop eco-systems like Hive and HBase. Similarly, Sqoop can also be used to extract data
from Hadoop or its eco-systems and export it to external datastores such as relational databases,
enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza,
Oracle, MySQL, Postgres etc.
Why is Sqoop used?
For Hadoop developers, the interesting work starts after data is loaded into HDFS. Developers play
around with the data in order to find the insights concealed in that Big Data. For this, data
residing in relational database management systems needs to be transferred to HDFS, worked on,
and possibly transferred back to the relational database management systems. Sqoop automates
most of this process and depends on the database to describe the schema of the data to be
imported. Sqoop uses the MapReduce framework to import and export the data, which provides a
parallel mechanism as well as fault tolerance. Sqoop makes developers' lives easy by providing a
command line interface. Developers just need to provide basic information like source, destination
and database authentication details in the sqoop command. Sqoop takes care of the remaining part.



Implementation:
Step 1- Sqoop Installation
To extract the Sqoop tarball and move it to the /usr/lib/sqoop directory, we use the
following commands.
$ tar -xvf sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz
$ su
password:

# mv sqoop-1.4.4.bin__hadoop-2.0.4-alpha /usr/lib/sqoop

# exit

Step 2- Configuring bashrc

We set up the Sqoop environment by appending the following lines to the ~/.bashrc file:
#Sqoop
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
Now, to execute the ~/.bashrc file we use the following command.
$ source ~/.bashrc

Step 3: Configuring Sqoop


We need to edit the sqoop-env.sh file, which is placed in the $SQOOP_HOME/conf
directory, in order to configure Sqoop with Hadoop. Using the following commands, go to the
Sqoop config directory and copy the template file:
$ cd $SQOOP_HOME/conf
$ mv sqoop-env-template.sh sqoop-env.sh
Also, open sqoop-env.sh and edit the following lines
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
Step 4: Download and Configure MySQL-connector-java
We can download the mysql-connector-java-5.1.30.tar.gz file from the MySQL Connector/J download page.
In addition, to extract the MySQL-connector-java tarball and move mysql-connector-java-5.1.30-
bin.jar to the /usr/lib/sqoop/lib directory we use the following commands.
$ tar -zxf mysql-connector-java-5.1.30.tar.gz
$ su
password:
# cd mysql-connector-java-5.1.30
# mv mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib
Step 5: Verifying Sqoop
At last, to verify the Sqoop version we use the following command.
$ cd $SQOOP_HOME/bin
$ sqoop-version
Expected output
14/12/17 14:52:32 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5
Sqoop 1.4.5 git commit id 5b34accaca7de251fc91161733f906af2eddbe83
Compiled by abe on Fri Aug 1 11:19:26 PDT 2014
Hence, the Sqoop installation is complete.



Basic Commands and Syntax for Sqoop

1. Sqoop-Import
Sqoop import command imports a table from an RDBMS to HDFS. Each record from a table is
considered as a separate record in HDFS. Records can be stored as text files, or in binary
representation as Avro or SequenceFiles.
Generic Syntax:
$ sqoop import (generic args) (import args)
The Hadoop specific generic arguments must precede any import arguments, and the import
arguments can be of any order.

2. Importing a Table into HDFS


Syntax:

$ sqoop import --connect --table --username --password --target-dir


--connect Takes JDBC url and connects to database
--table Source table name to be imported
--username Username to connect to database
--password Password of the connecting user
--target-dir Imports data to the specified directory
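
For example, a hypothetical import from a local MySQL database named testdb with a table
employees (the connection string, credentials and target directory below are placeholders):

$ sqoop import --connect jdbc:mysql://localhost/testdb --table employees --username root --password hadoop --target-dir /user/hduser/employees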

3. Importing Selected Data from Table


Syntax:

$ sqoop import --connect --table --username --password --columns --where


--columns Selects subset of columns
--where Retrieves the data which satisfies the condition

4. Importing Data from Query


Syntax:

$ sqoop import --connect --table --username --password --query

--query Executes the SQL query provided and imports the results

5. Incremental Exports
Syntax:

$ sqoop import --connect --table --username --password --incremental --check-column --last-value

Sqoop import supports two types of incremental imports:


1. Append
2. Lastmodified.



6. Sqoop-Eval
Sqoop-eval command allows users to quickly run simple SQL queries against a database and the
results are printed on to the console. Generic Syntax:

$ sqoop eval (generic args) (eval args)

7. Sqoop-List-Database
Used to list all the databases available on the RDBMS server. Generic Syntax:

$ sqoop list-databases (generic args) (list databases args)


$ sqoop-list-databases (generic args) (list databases args)

Syntax:

$ sqoop list-databases --connect

8. Sqoop-List-Tables
Used to list all the tables in a specified database. Generic Syntax:

$ sqoop list-tables (generic args) (list tables args)


$ sqoop-list-tables (generic args) (list tables args)

Syntax:

$ sqoop list-tables --connect

Conclusion:
Thus we installed Sqoop successfully and studied the basic commands.



Experiment No.04
Aim: To install and configure MongoDB/ Cassandra/ HBase/ Hypertable to execute NoSQL
commands

Theory:
NoSQL stands for "Not only SQL," to emphasize the fact that NoSQL databases are an
alternative to SQL and can, in fact, apply SQL-like query concepts.
NoSQL covers any database that is not a traditional relational database management system
(RDBMS). The motivation behind NoSQL is mainly simplified design, horizontal scaling, and finer
control over the availability of data. NoSQL databases are more specialized for types of data,
which makes them more efficient and better performing than RDBMS servers in most instances.

Document Store Databases


Document store databases apply a document-oriented approach to storing data. The idea is that
all the data for a single entity can be stored as a document, and documents can be stored together
in collections.
A document can contain all the necessary information to describe an entity. This includes the
capability to have subdocuments, which in RDBMS are typically stored as an encoded string or
in a separate table. Documents in the collection are accessed via a unique key.

Key-Value Databases
The simplest type of NoSQL database is the key-value stores. These databases store data in a
completely schema-less way, meaning that no defined structure governs what is being stored. A
key can point to any type of data, from an object, to a string value, to a programming language
function.
The advantage of key-value stores is that they are easy to implement and add data to. That makes
them great to implement as simple storage for storing and retrieving data based on a key. The
downside is that you cannot find elements based on the stored values.

Column Store Databases


Column store databases store data in columns within a key space. The key space is based on a
unique name, value, and timestamp. This is similar to the key-value databases; however, column
store databases are geared toward data that uses a timestamp to differentiate valid content from
stale content. This provides the advantage of applying aging to the data stored in the database.

Graph Store Databases


Graph store databases are designed for data that can be easily represented as a graph. This means
that elements are interconnected with an undetermined number of relations between them, as in
examples such as family and social relations, airline route topology, or a standard road map.
# MongoDB_Lab - Installation Steps
Steps to install mongodb



1. Update ubuntu system
sudo apt-get update

2. Upgrade softwares
sudo apt-get upgrade

3. Install mongodb
sudo apt-get install mongodb

4. Download following json dataset


https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/dataset.json

Rename it as primer-dataset.json

save it in the home directory

5. Start mongodb services


sudo service mongodb start

6. Import restaurants dataset (primer-dataset.json) into mongodb database

sudo mongoimport --db test --collection restaurants --drop --file primer-dataset.json

7. Get into mongo prompt for db query

mongo

# End of File

# MongoDB Operations
# Save primer-dataset.json in the home directory
# Start mongodb services

sudo service mongodb start

# Import restaurants dataset (primer-dataset.json) into mongodb database

sudo mongoimport --db test --collection restaurants --drop --file primer-dataset.json

# Get into mongo prompt for db query

mongo

# Insert a Document



# Insert a document into a collection named restaurants. The operation will create the collection
if the collection does not currently exist.

db.restaurants.insert(
{
"address" : {
"street" : "2 Avenue",
"zipcode" : "10075",
"building" : "1480",
"coord" : [ -73.9557413, 40.7720266 ]
},
"borough" : "Manhattan",
"cuisine" : "Italian",
"grades" : [
{
"grade" : "A",
"score" : 11
},
{
"grade" : "B",
"score" : 17
}
],
"name" : "Vella",
"restaurant_id" : "41704620" }
)
# Find or Query Data
# You can use the find() method to issue a query to retrieve data from a collection in MongoDB.
All queries in MongoDB have the scope of a single collection.
# Queries can return all documents in a collection or only the documents that match a specified
filter or criteria. You can specify the filter or criteria in a document and pass as a parameter to
the find() method.
# The result set contains all documents in the restaurants collection.
db.restaurants.find()

# The result set includes only the matching documents.


db.restaurants.find( { "borough": "Manhattan" } )

# The following operation specifies an equality condition on the zipcode field in the address
embedded document.
# The result set includes only the matching documents.
db.restaurants.find( { "address.zipcode": "10075" } )

# Update Data
# You can use the update() method to update documents of a collection. The method accepts as
its parameters:



1. a filter document to match the documents to update,
2. an update document to specify the modification to perform, and
3. an options parameter (optional).

1. db.restaurants.update({'cuisine':'American'},{$set:{'cuisine':'Indian'}})
2. db.restaurants.update(
{ "name" : "Juni" },
{
$set: { "cuisine": "American (New)" }
})
3. db.restaurants.update(
{ "restaurant_id" : "41156888" },
{ $set: { "address.street": "East 31st Street" } }
)

# Remove Data
# You can use the remove() method to remove documents from a collection. The method takes a
conditions document that determines the documents to remove.
db.restaurants.remove( { "borough": "Manhattan" } )

# End of File

Conclusion:
Mongodb is successfully installed and basic commands are executed.



Experiment No.05
Aim: Implementing simple algorithms in Map- Reduce - Matrix multiplication, Aggregates,
joins, sorting, searching etc.

Theory:
If M is a matrix with element m_ij in row i and column j, and N is a matrix with element n_jk
in row j and column k, then the product:

P = MN

is the matrix P with element p_ik in row i and column k, where:

p_ik = Σ_j m_ij n_jk

We can think of the matrices M and N as relations with three attributes: the row number, the
column number, and the value in that row and column, i.e.:

M(I, J, V) and N(J, K, W)

with the following tuples, respectively:

(i, j, m_ij) and (j, k, n_jk).

• In case of sparsity of M and N, this relational representation is very efficient in terms of
space.
• The product MN is almost a natural join followed by grouping and aggregation.

The join and the aggregation can be done with two MapReduce steps:

• Map: Send each matrix element m_ij to the key-value pair
(j, (M, i, m_ij)).
Analogously, send each matrix element n_jk to the key-value pair
(j, (N, k, n_jk)).
• Reduce: For each key j, examine its list of associated values. For each value that comes from
M, say (M, i, m_ij), and each value that comes from N, say (N, k, n_jk), produce the tuple
(i, k, v = m_ij n_jk).
The output of the Reduce function is a key j paired with the list of all the tuples of this form that
we get from j:

(j, [(i1, k1, v1), (i2, k2, v2), . . . , (ip, kp, vp)]).

A second MapReduce step then groups these tuples by the pair (i, k) and sums the associated
values v to obtain p_ik.

Alternatively, the multiplication can be done in a single MapReduce step:

• Map: For each element m_ij of M, produce a key-value pair

((i, k), (M, j, m_ij)),

for k = 1, 2, . . ., up to the number of columns of N.

Also, for each element n_jk of N, produce a key-value pair

((i, k), (N, j, n_jk)),

for i = 1, 2, . . ., up to the number of rows of M.

• Reduce: Each key (i, k) will have an associated list with all the values (M, j, m_ij) and
(N, j, n_jk), for all possible values of j. We connect the two values on the list that have the same
value of j, for each j:
We sort by j the values that begin with M and sort by j the values that begin with N, in separate
lists. The jth values on each list must have their third components, m_ij and n_jk, extracted and
multiplied.
Then, these products are summed and the result is paired with (i, k) in the output of the Reduce
function.
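
The following is a minimal Java MapReduce sketch of the single-step method above, written in
the same style as the WordCount program of Experiment No.02. The input format (lines of the
form M,i,j,value or N,j,k,value) and the configuration keys rowsM and colsN used to pass the
matrix dimensions are assumptions made for illustration, not part of any standard API.

package org.myorg;

import java.io.IOException;
import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiply {

    // Map: element m_ij is sent to every key (i,k); element n_jk is sent to every key (i,k).
    public static class MatrixMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            int rowsM = conf.getInt("rowsM", 0);      // number of rows of M
            int colsN = conf.getInt("colsN", 0);      // number of columns of N
            String[] t = value.toString().split(","); // e.g. "M,0,1,3.5"
            if (t[0].equals("M")) {
                int i = Integer.parseInt(t[1]);
                int j = Integer.parseInt(t[2]);
                for (int k = 0; k < colsN; k++)
                    context.write(new Text(i + "," + k), new Text("M," + j + "," + t[3]));
            } else {
                int j = Integer.parseInt(t[1]);
                int k = Integer.parseInt(t[2]);
                for (int i = 0; i < rowsM; i++)
                    context.write(new Text(i + "," + k), new Text("N," + j + "," + t[3]));
            }
        }
    }

    // Reduce: for key (i,k), pair up the M and N values that share the same j and sum the products.
    public static class MatrixReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            HashMap<Integer, Double> mRow = new HashMap<Integer, Double>();
            HashMap<Integer, Double> nCol = new HashMap<Integer, Double>();
            for (Text v : values) {
                String[] t = v.toString().split(",");
                int j = Integer.parseInt(t[1]);
                double val = Double.parseDouble(t[2]);
                if (t[0].equals("M")) mRow.put(j, val); else nCol.put(j, val);
            }
            double sum = 0.0;
            for (Integer j : mRow.keySet())
                if (nCol.containsKey(j)) sum += mRow.get(j) * nCol.get(j);
            context.write(key, new Text(Double.toString(sum)));  // key "i,k", value p_ik
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("rowsM", 2);   // dimensions of the sample matrices, adjust as needed
        conf.setInt("colsN", 2);
        Job job = new Job(conf, "matrixmultiply");
        job.setJarByClass(MatrixMultiply.class);
        job.setMapperClass(MatrixMapper.class);
        job.setReducerClass(MatrixReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

For sparse matrices, entries with value zero can simply be omitted from the input; the reducer
only multiplies pairs that share the same j.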

CONCLUSION:
An algorithm in MAP Reduce is implemented.



Experiment No.06
Aim: Create HIVE Database and Descriptive analytics-basic statistics, visualization using
Hive/PIG/R.

Theory:
What is Hive?
Apache Hive is a data warehouse system which is built to work on Hadoop. It is used for querying
and managing large datasets residing in distributed storage. Before becoming an open source
project of Apache Hadoop, Hive originated at Facebook. It provides a mechanism to project
structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL
(HQL). Hive is used because the tables in Hive are similar to tables in a relational database. Many
users can simultaneously query the data using Hive-QL.

What is HQL?
Hive defines a simple SQL-like query language for querying and managing large datasets called
Hive-QL (HQL). It's easy to use if you're familiar with SQL. Hive also allows programmers
who are familiar with MapReduce to plug in custom mappers and reducers to perform more
sophisticated analysis.

Uses of Hive:
1. The Apache Hive data warehouse facilitates querying and managing large datasets residing in
distributed storage.

2. Hive provides tools to enable easy data extract/transform/load (ETL).

3. It provides structure on a variety of data formats.

4. By using Hive, we can access files stored in the Hadoop Distributed File System (HDFS) or in
other data storage systems such as Apache HBase.

Hive Commands:
Data Definition Language (DDL)

DDL statements are used to build and modify the tables and other objects in the database.

Go to the Hive shell by giving the command sudo hive and enter the command 'create database
<database name>' to create a new database in Hive.

To list the databases in the Hive warehouse, enter the command 'show databases'.

The command to use a database is USE <database name>.

DESCRIBE provides information about the schema of the table.

Data Manipulation Language (DML)

DML statements are used to retrieve, store, modify, delete, insert and update data in the database.

Example:

LOAD, INSERT Statements.

Syntax:

LOAD DATA <LOCAL> INPATH <file path> INTO TABLE [tablename]

The LOAD operation is used to move the data into the corresponding Hive table. If the keyword
LOCAL is specified, the load command takes a local file system path. If the keyword LOCAL is
not specified, we have to use the HDFS path of the file.

The INSERT command is used to load data into a Hive table. Inserts can be done to a table or a
partition.

• INSERT OVERWRITE is used to overwrite the existing data in the table or partition.

• INSERT INTO is used to append the data to the existing data in a table. (Note: the INSERT INTO
syntax works from version 0.8 onwards.)
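
A minimal HiveQL sketch that combines these DDL and DML statements for a simple
descriptive-statistics query; the database name studentdb, the table marks and the local file
/home/hduser/marks.csv are placeholders used only for illustration:

hive> CREATE DATABASE studentdb;
hive> SHOW DATABASES;
hive> USE studentdb;
hive> CREATE TABLE marks (roll INT, name STRING, score INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> DESCRIBE marks;
hive> LOAD DATA LOCAL INPATH '/home/hduser/marks.csv' INTO TABLE marks;
hive> SELECT COUNT(*), MIN(score), MAX(score), AVG(score) FROM marks;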

CONCLUSION:
Hive is demonstrated successfully.



Experiment No.07
Aim: Implementing DGIM algorithm using any Programming Language/ Implement Bloom
Filter using any programming language.

Theory:
The Datar-Gionis-Indyk-Motwani (DGIM) Algorithm
To begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has
timestamp 1, the second has timestamp 2, and so on. Since we only need to distinguish positions
within the window of length N, we shall represent timestamps modulo N, so they can be represented
by log2 N bits. If we also store the total number of bits ever seen in the stream (i.e., the most recent
timestamp) modulo N, then we can determine from a timestamp modulo N where in the current
window the bit with that timestamp is.

We divide the window into buckets consisting of:


1. The timestamp of its right (most recent) end.
2. The number of 1's in the bucket. This number must be a power of 2, and we refer to the
number of 1's as the size of the bucket.
To represent a bucket, we need log2 N bits to represent the timestamp (modulo N) of its right end.
To represent the number of 1's we only need log2 log2 N bits. The reason is that we know this
number i is a power of 2, say 2^j, so we can represent i by coding j in binary. Since j is at most
log2 N, it requires log2 log2 N bits. Thus, O(log N) bits suffice to represent a bucket.
There are six rules that must be followed when representing a stream by buckets.
1. The right end of a bucket is always a position with a 1.
2. Every position with a 1 is in some bucket.
3. No position is in more than one bucket.
4. There are one or two buckets of any given size, up to some maximum size.
5. All sizes must be a power of 2.
6. Buckets cannot decrease in size as we move to the left (back in time).

Figure 7: The bit-stream ..1011011000101110110010110 divided into buckets following the
DGIM rules (reading from the right: two buckets of size 1, one bucket of size 2, two buckets of
size 4, and at least one larger bucket further to the left).

Example: The figure above shows a bit stream divided into buckets in a way that satisfies the DGIM rules.
At the right (most recent) end we see two buckets of size 1. To its left we see one bucket of size
2. Note that this bucket covers four positions, but only two of them are 1. Proceeding left, we see
two buckets of size 4, and we suggest that a bucket of size 8 exists further left.

Notice that it is OK for some 0's to lie between buckets. Also, observe from the figure above that the
buckets do not overlap; there are one or two of each size up to the largest size, and sizes only increase
moving left.
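
The following is a minimal Java sketch of the bucket structure described above. For clarity it
stores absolute timestamps instead of timestamps modulo N, and the class and method names are
only illustrative:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class Dgim {
    private static class Bucket {
        long timestamp; // timestamp of the bucket's right (most recent) end
        long size;      // number of 1's in the bucket, always a power of 2
        Bucket(long t, long s) { timestamp = t; size = s; }
    }

    private final long n;                                    // window length N
    private final List<Bucket> buckets = new ArrayList<Bucket>(); // newest first
    private long time = 0;                                   // current timestamp

    public Dgim(long n) { this.n = n; }

    // Process the next bit of the stream.
    public void add(int bit) {
        time++;
        // Discard the oldest bucket once it slides completely out of the window.
        if (!buckets.isEmpty() && buckets.get(buckets.size() - 1).timestamp <= time - n) {
            buckets.remove(buckets.size() - 1);
        }
        if (bit == 0) return;                                // only 1's create buckets
        buckets.add(0, new Bucket(time, 1));
        // Restore the invariant: at most two buckets of any given size.
        int i = 0;
        while (i + 2 < buckets.size() && buckets.get(i).size == buckets.get(i + 2).size) {
            // Three buckets of the same size: merge the two oldest of them.
            Bucket newer = buckets.get(i + 1);
            Bucket older = buckets.get(i + 2);
            newer.size += older.size;                        // doubled size, still a power of 2
            newer.timestamp = Math.max(newer.timestamp, older.timestamp);
            buckets.remove(i + 2);
            i++;                                             // the doubled bucket may cascade
        }
    }

    // Estimate the number of 1's among the last k <= N bits: sum the sizes of all
    // buckets that overlap the range, counting only half of the oldest such bucket.
    public long estimate(long k) {
        long sum = 0, oldest = 0;
        for (Bucket b : buckets) {
            if (b.timestamp > time - k) {
                sum += b.size;
                oldest = b.size;
            }
        }
        return sum - oldest / 2;
    }

    public static void main(String[] args) {
        Dgim d = new Dgim(1000);
        Random r = new Random(42);
        for (int i = 0; i < 5000; i++) d.add(r.nextInt(2));
        System.out.println("Estimated 1's in the last 1000 bits: " + d.estimate(1000));
    }
}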

CONCLUSION:
One data streaming algorithm is explained and implemented.



Experiment No.08

Aim: Streaming data analysis – use flume for data capture, HIVE/PYSpark for analysis of twitter
data, chat data, weblog analysis etc.

Theory: Apache Flume is a tool for data ingestion into HDFS. It collects, aggregates and transports
large amounts of streaming data, such as log files and events from various sources like network
traffic, social media and email messages, to HDFS. Flume is highly reliable and distributed.

The main idea behind Flume's design is to capture streaming data from various web servers
to HDFS. It has a simple and flexible architecture based on streaming data flows. It is fault-tolerant
and provides reliability mechanisms for fault tolerance and failure recovery.

A Flume agent ingests the streaming data from various data sources into HDFS; the web server
acts as the data source. Twitter is among the famous sources of streaming data.
The Flume agent has 3 components: source, sink and channel.

1. Source: It accepts the data from the incoming stream and stores the data in the channel.
2. Channel: In general, the reading speed is faster than the writing speed. Thus, we need some
buffer to match the read and write speed difference. Basically, the buffer acts as an
intermediary storage that temporarily stores the data being transferred and therefore
prevents data loss. The channel acts as this local or temporary storage between the
source of the data and the persistent data in HDFS.
3. Sink: Our last component, the sink, collects the data from the channel and commits or
writes the data to HDFS permanently.

Step 1:
Login to the twitter account



Step 2:
Go to the following link and click the 'create new app' button:
https://apps.twitter.com/app Then, create an application.
After creating this application, you will find the key & access token. Copy the key and the access
token. We will pass these tokens in our Flume configuration file to connect to this application.

Now create a flume.conf file in Flume's root directory.

As discussed for Flume's architecture, we will configure our Source, Sink and Channel.
Our Source is Twitter, from where we are streaming the data, and our Sink is HDFS, where we are
writing the data.
In the source configuration we pass the Twitter source type as
org.apache.flume.source.twitter.TwitterSource. Then we pass all four tokens which we received
from Twitter. Finally, in the source configuration we pass the keywords on which we are going to
fetch the tweets.

In the sink configuration we configure the HDFS properties. We set the HDFS path, write format,
file type, batch size etc. Finally, we set up the memory channel, as shown in the sketch below.
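
As a sketch, a flume.conf for this setup could look like the following; the agent name
TwitterAgent, the keywords, the HDFS path and all credential values are placeholders to be
replaced with your own (the property names follow the standard Flume TwitterSource and HDFS
sink configuration):

# Name the source, channel and sink of the agent
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source: credentials and keywords are placeholders
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics

# HDFS sink: path, write format, file type and batch size
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:54310/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000

# Memory channel acting as the buffer between source and sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

When starting the agent, the agent name defined in the configuration is passed with the -n option
(for example -n TwitterAgent for the sketch above).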

Now we are all set for execution.


$FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f $FLUME_HOME/flume.conf -n <agent name>
After this command has executed for a while, you can exit the terminal using CTRL+C. Then
you can go to your Hadoop directory and check the mentioned path, whether the file has been
created or not.
Download the file and open it.

CONCLUSION:
Thus the Streaming data from twitter is successfully analyzed using flume



Experiment No.09

Aim: To study and implement basic functions and commands in R Programming and Implement
predictive Analytics techniques (regression / time series, etc.) using R.

Theory:
Predictive models are extremely useful, when learning the R language, for forecasting future outcomes
and estimating metrics that are impractical to measure. For example, data scientists could use
predictive models to forecast crop yields based on rainfall and temperature, or to determine
whether patients with certain traits are more likely to react badly to a new medication. Before we
talk about linear regression specifically, let's remind ourselves what a typical data science
workflow might look like. A lot of the time, we'll start with a question we want to answer, and do
something like the following:
1. Collect some data relevant to the problem (more is almost always better).
2. Clean, augment, and preprocess the data into a convenient form, if needed.
3. Conduct an exploratory analysis of the data to get a better sense of it.
4. Using what you find as a guide, construct a model of some aspect of the data.
5. Use the model to answer the question you started with, and validate your results.

Linear regression is one of the simplest and most common supervised machine learning algorithms
that data scientists use for predictive modeling. Here, we'll use linear regression to build a
model that predicts cherry tree volume from metrics that are much easier for folks who study trees
to measure.
We'll use R to explore this data set and learn the basics of linear regression. If you're new to the
R language, completing an introductory R course first will help. It will also help to have some very
basic statistics knowledge, but if you know what a mean and standard deviation are, you'll be able
to follow along. If you want to practice building the models and visualizations yourself, we'll be
using the following R packages:
datasets: This package contains a wide variety of practice data sets. We'll be using one of them,
"trees", to learn about building linear regression models.
ggplot2: We'll use this popular data visualization package to build plots of our models.
GGally: This package extends the functionality of ggplot2. We'll be using it to create a plot
matrix as part of our initial exploratory data visualization.
scatterplot3d: We'll use this package for visualizing more complex linear regression models
with multiple predictors.

Procedure: # Install rstudio in Ubuntu


# sudo dpkg -i /path/to/deb/file

sudo dpkg -i rstudio-1.0.136-i386.deb



sudo apt-get -f install

sudo apt-get install r-cran-testthat

# add sources.list
# deb https://cran.csiro.au/ trusty-backports main restricted universe
# deb https://cran.csiro.au/bin/linux/ubuntu trusty/
sudo apt-get update
sudo apt-get install r-base

# Users who need to compile R packages from source [e.g. package maintainers, or anyone
installing packages with install.packages()] should also install the r-base-dev package:

sudo apt-get install r-base-dev
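
Once R and RStudio are installed, a minimal R sketch of the linear regression described above,
using the built-in trees data set (the girth and height values in the prediction are arbitrary
examples):

library(datasets)        # provides the built-in "trees" data set
data(trees)
summary(trees)           # basic descriptive statistics of Girth, Height and Volume

model <- lm(Volume ~ Girth + Height, data = trees)   # fit the linear regression model
summary(model)           # coefficients, residuals and R-squared

# Predict the volume of a hypothetical tree with girth 12 in and height 80 ft
predict(model, data.frame(Girth = 12, Height = 80))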

Conclusion:
Thus the predictive analysis is demonstrated in R.



Experiment No.10
Aim: Mini Project: One real life large data application to be implemented (Use standard
Datasets available on the web) a) Twitter data analysis b) Fraud Detection c) Text Mining etc.

Theory:
Text Mining is the discovery by computer of new, previously unknown information, by
automatically extracting information from different written resources. A key element is the linking
together of the extracted information to form new facts or new hypotheses to be explored
further by more conventional means of experimentation.

Text mining is different from what we're familiar with in web search. In search, the user is typically
looking for something that is already known and has been written by someone else. The problem is
pushing aside all the material that currently isn't relevant to your needs in order to find the relevant
information.

The difference between regular data mining and text mining is that in text mining the patterns are
extracted from natural language text rather than from structured databases of facts. Databases are
designed for programs to process automatically; text is written for people to read. We do not have
programs that can "read" text and will not have such for the foreseeable future. Many researchers
think it will require a full simulation of how the mind works before we can write programs that read
the way people do.

Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal
of discovering useful information, suggesting conclusions, and supporting decision-making. Data
analysis has multiple facets and approaches, encompassing diverse techniques under a variety of
names, in different business, science, and social science domains.

Data mining is a particular data analysis technique that focuses on modeling and knowledge
discovery for predictive rather than purely descriptive purposes. Business intelligence covers data
analysis that relies heavily on aggregation, focusing on business information. In statistical
applications, some people divide data analysis into descriptive statistics, exploratory data analysis
(EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the
data and CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on
application of statistical models for predictive forecasting or classification, while text analytics
applies statistical, linguistic, and structural techniques to extract and classify information from
textual sources, a species of unstructured data. All are varieties of data analysis.

Social data analysis comprises two main constituent parts: 1) data generated from social networking
sites (or through social applications), and 2) sophisticated analysis of that data, in many cases
requiring real-time (or near real-time) data analytics, measurements which understand and
appropriately weigh factors such as influence, reach, and relevancy, an understanding of the context
of the data being analyzed, and the inclusion of time horizon considerations. In short, social data
analytics involves the analysis of social media in order to understand and surface insights which are
embedded within the data.



When talking about social data analytics, there are a number of factors it's important to keep in
mind (which we noted earlier):
Sophisticated Data Analysis: what distinguishes social data analytics from sentiment
analysis is the depth of the analysis. Social data analysis takes into consideration a
number of factors (context, content, sentiment) to provide additional insight.
Time consideration: windows of opportunity are significantly limited in the field of social
networking. What's relevant one day (or even one hour) may not be the next. Being able to
quickly execute and analyze the data is an imperative.
Influence Analysis: understanding the potential impact of specific individuals can be key in
understanding how messages might be resonating. It's not just about quantity, it's also very
much about quality.
Network Analysis: social data is also interesting in that it migrates, grows (or dies) based
on how the data is propagated throughout the network. It's how viral activity starts and spreads.

CONCLUSION:

A mini project is introduced and implemented.
