HADOOP AND BIG DATA

EXERCISE-1
AIM:
To implement the following Data Structures in Java
a) Linked Lists b) Stack c) Queues d) Set e) Map

ALGORITHM:

Steps:
Linked List: Create a LinkedList of strings, add elements with add(), and print the list.
Stack: Create a Stack and push elements onto it with push(), which internally uses the Vector class's addElement(item) method.
Queue: Create a Queue backed by a LinkedList, add elements with offer(), read the head element with peek(), and remove it with poll().
Set: Create HashSet objects, add elements, and combine two sets with addAll() to obtain their union.
Map: Create a HashMap and add key/value pairs with put(). A Map cannot be traversed directly, so use keySet(), values() or entrySet(); remove an entry with remove().

PROGRAM:
import java.util.*;

class Main {
    public static void main(String[] args) {
        // Linked List
        LinkedList<String> lAnimals = new LinkedList<>();
        lAnimals.add("Dog");
        lAnimals.add("Cat");
        lAnimals.add("Cow");
        System.out.println("-----LINKED-----");
        System.out.println("LinkedList: " + lAnimals + "\n");

        // Stack
        Stack<String> sAnimals = new Stack<>();
        sAnimals.push("Dog");
        sAnimals.push("Horse");
        sAnimals.push("Cat");
        System.out.println("-----STACK-----");
        System.out.println("Stack: " + sAnimals + "\n");

        // Queue
        Queue<Integer> numbers = new LinkedList<>();
        numbers.offer(1);
        numbers.offer(2);
        numbers.offer(3);
        System.out.println("-----QUEUE-----");
        System.out.println("Queue: " + numbers);
        int accessedNumber = numbers.peek();   // look at the head element
        System.out.println("Accessed Element: " + accessedNumber);
        int removedNumber = numbers.poll();    // remove the head element
        System.out.println("Removed Element: " + removedNumber);
        System.out.println("Updated Queue: " + numbers + "\n");

        // Set
        Set<Integer> set1 = new HashSet<>();
        set1.add(2);
        set1.add(3);
        System.out.println("-----SET-----");
        System.out.println("Set1: " + set1);
        Set<Integer> set2 = new HashSet<>();
        set2.add(1);
        set2.add(2);
        System.out.println("Set2: " + set2);
        set2.addAll(set1);                     // union of set1 and set2
        System.out.println("Union is: " + set2 + "\n");

        // Map
        Map<String, Integer> mNumbers = new HashMap<>();
        mNumbers.put("One", 1);
        mNumbers.put("Two", 2);
        System.out.println("-----MAP-----");
        System.out.println("Map: " + mNumbers);
        System.out.println("Keys: " + mNumbers.keySet());
        System.out.println("Values: " + mNumbers.values());
        System.out.println("Entries: " + mNumbers.entrySet());
        int value = mNumbers.remove("Two");
        System.out.println("Removed Value: " + value);
    }
}

OUTPUT:
EXERCISE-2

AIM:
To perform setting up and Installing Hadoop in its three operating modes:

• Standalone
• Pseudo Distributed
• Fully Distributed

ALGORITHM:

STEPS INVOLVED IN INSTALLING HADOOP IN STANDALONE MODE:

1. Install ssh with the command "sudo apt-get install ssh".
2. Generate a key pair with the command ssh-keygen -t rsa -P "".
3. Append the public key to the authorized keys by using the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
4. Extract Java by using the command tar xvfz jdk-8u60-linux-i586.tar.gz.
5. Extract Eclipse by using the command tar xvfz eclipse-jee-mars-R-linux-gtk.tar.gz
6. Extract Hadoop by using the command tar xvfz hadoop-2.7.1.tar.gz
7. Move Java to /usr/lib/jvm/ and Eclipse to /opt/. Configure the Java path in the eclipse.ini file.
8. Export the Java path and Hadoop path in ~/.bashrc.
9. Check whether the installation is successful by checking the java version and hadoop version.
10. Check whether the Hadoop instance in standalone mode works correctly by running the built-in example jar on a word count job (see the sample commands after these steps).
11. If the word count is displayed correctly in the part-r-00000 file, standalone mode is installed successfully.
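A minimal way to run the standalone word count check of steps 10 and 11, assuming a stock hadoop-2.7.1 layout (the example jar path and the input/output directory names below are assumptions; adjust them to your installation):

$ mkdir input
$ cp $HADOOP_HOME/etc/hadoop/*.xml input/
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount input output
$ cat output/part-r-00000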

STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO DISTRIBUTED MODE:

1. In order to install pseudo distributed mode, we need to configure the Hadoop configuration files residing in the directory /home/lendi/hadoop-2.7.1/etc/hadoop.
2. First configure the hadoop-env.sh file by changing the Java path.
3. Configure core-site.xml, which contains a property tag with a name and a value: name fs.defaultFS and value hdfs://localhost:9000 (see the sample snippet after these steps).
4. Configure hdfs-site.xml.
5. Configure yarn-site.xml.
6. Configure mapred-site.xml; before configuring it, copy mapred-site.xml.template to mapred-site.xml.
7. Now format the name node by using the command hdfs namenode -format.
8. Type the commands start-dfs.sh and start-yarn.sh to start the daemons: NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
9. Run jps, which lists all running daemons. Create a directory in HDFS by using the command hdfs dfs -mkdir /csedir, enter some data into lendi.txt using the command nano lendi.txt, copy it from the local directory to HDFS using the command hdfs dfs -copyFromLocal lendi.txt /csedir/, and run the sample wordcount jar to check whether pseudo distributed mode is working or not.
10. Display the contents of the output file by using the command hdfs dfs -cat /newdir/part-r-00000.
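For reference, a minimal core-site.xml for step 3 is shown below, together with a matching hdfs-site.xml. Setting dfs.replication to 1 is an assumption commonly made for a single-node setup, not something mandated by the steps above.

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>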

FULLY DISTRIBUTED MODE INSTALLATION:

1. Stop any single node cluster that is running
   $ stop-all.sh
2. Decide on one machine as the NameNode (master) and the remaining machines as DataNodes (slaves).
3. Copy the public key to all three hosts to get password-less SSH access
   $ ssh-copy-id -i $HOME/.ssh/id_rsa.pub lendi@l5sys24
4. Configure the configuration files to name the master and slave nodes.
   $ cd $HADOOP_HOME/etc/hadoop
   $ nano core-site.xml
   $ nano hdfs-site.xml
5. Add the slave hostnames to the file slaves and save it.
   $ nano slaves
6. Configure yarn-site.xml
   $ nano yarn-site.xml
7. On the master node, run
   $ hdfs namenode -format
   $ start-dfs.sh
   $ start-yarn.sh
8. The NameNode must be formatted (step 7) before the daemons are started for the first time.
9. Verify that the daemons have started on the master and slave nodes.
10. END

INPUT :

ubuntu@localhost> jps

OUTPUT:

• DataNode
• NameNode
• SecondaryNameNode
• NodeManager
• ResourceManager
EXERCISE-3

AIM:

Implement the following file management tasks in Hadoop:

• Adding files and directories


• Retrieving files
• Deleting Files

ALGORITHM:

SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS

Step-1 : Adding Files and Directories to HDFS


hadoop fs -mkdir -p /vimal/Hadoop
hadoop fs -put example.txt /vimal/Hadoop
Step-2 : Retrieving Files from HDFS
hadoop fs -cat /vimal/Hadoop/example.txt
Step-3 : Deleting Files from HDFS
hadoop fs -rm /vimal/Hadoop/example.txt
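The same three tasks can also be performed programmatically through Hadoop's Java FileSystem API. The sketch below is a minimal illustration only, assuming Hadoop 2.x client libraries on the classpath and the same /vimal/Hadoop path used above; it is not part of the original exercise commands.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/vimal/Hadoop");
        fs.mkdirs(dir);                                    // adding a directory
        fs.copyFromLocalFile(new Path("example.txt"),      // adding a file (equivalent to -put)
                             new Path(dir, "example.txt"));

        // retrieving: copy the file back to the local file system (equivalent to -get)
        fs.copyToLocalFile(new Path(dir, "example.txt"), new Path("example_copy.txt"));

        // deleting the file (equivalent to -rm); 'false' means non-recursive
        fs.delete(new Path(dir, "example.txt"), false);

        fs.close();
    }
}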
INPUT:

The input can be data in any format: structured, unstructured or semi-structured.

EXPECTED OUTPUT:
EXERCISE-4

AIM:

To run a basic Word Count MapReduce program to understand the MapReduce paradigm.

ALGORITHM :

MAPREDUCE PROGRAM
Our implementation consists of three main parts:
1. Mapper
2. Reducer
3. Driver
Step-1: Write a Mapper
Step-2: Write a Reducer
Step-3: Write Driver

STEPS:

1. Mapper:
Pseudo-code:
void Map (key, value)
{
    for each word x in value:
        output.collect(x, 1);
}

2. Reducer:
Pseudo-code:
void Reduce (keyword, <list of value>)
{
    sum = 0;
    for each x in <list of value>:
        sum += x;
    final_output.collect(keyword, sum);
}

3. Driver:
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:
Job Name: name of this job
Executable (Jar) Class: the main executable class; here, WordCount.
Mapper Class: the class which overrides the "map" function; here, Map.
Reducer Class: the class which overrides the "reduce" function; here, Reduce.
Output Key: the type of the output key; here, Text.
Output Value: the type of the output value; here, IntWritable.
File Input Path
File Output Path
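Below is a compact Java version of the three parts, kept close to the pseudo-code above. It is a sketch based on the standard Hadoop 2.x MapReduce API; the use of command-line arguments for the input and output paths is an assumption.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in the input line
    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures the job name, classes, key/value types and paths
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // File Input Path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // File Output Path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}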

INPUT:-
A set of data related to Shakespeare's comedies, glossary and poems.

OUTPUT:-
EXERCISE-5:

AIM:
To write a MapReduce program that mines weather data.

ALGORITHM :

Step-1: Write a Mapper

Pseudo-code:
void Map (key, value)
{
    for each max_temp x in value:
        output.collect(x, 1);
}
void Map (key, value)
{
    for each min_temp x in value:
        output.collect(x, 1);
}

Step-2: Write a Reducer

Pseudo-code:
void Reduce (max_temp, <list of value>)
{
    sum = 0;
    for each x in <list of value>:
        sum += x;
    final_output.collect(max_temp, sum);
}
void Reduce (min_temp, <list of value>)
{
    sum = 0;
    for each x in <list of value>:
        sum += x;
    final_output.collect(min_temp, sum);
}

3. Write Driver

Job Name: name of this job
Executable (Jar) Class: the main executable class of this job.
Mapper Class: the class which overrides the "map" function; here, Map.
Reducer Class: the class which overrides the "reduce" function; here, Reduce.
Output Key: the type of the output key; here, Text.
Output Value: the type of the output value; here, IntWritable.
File Input Path
File Output Path
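As a concrete illustration of the pseudo-code, here is a minimal Java mapper and reducer that mine the maximum temperature per year. The record layout (a year followed by a temperature reading, separated by whitespace) is an assumption made only for this sketch; a real weather data set would need its own parsing logic.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

    // Mapper: emits (year, temperature) for each record
    public static class MaxTempMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // assumed record format: "<year> <temperature>", e.g. "1950 22"
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length >= 2) {
                context.write(new Text(fields[0]),
                              new IntWritable(Integer.parseInt(fields[1])));
            }
        }
    }

    // Reducer: keeps the maximum temperature seen for each year
    public static class MaxTempReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable val : values) {
                max = Math.max(max, val.get());
            }
            context.write(key, new IntWritable(max));
        }
    }
    // The driver is configured exactly as in the word count example,
    // substituting MaxTempMapper and MaxTempReducer.
}
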
INPUT:
Set of Weather Data over the years

OUTPUT:

EXERCISE-6:

AIM:

To write a Map Reduce Program that implements Matrix Multiplication.

ALGORITHM:

We have the following input parameters:

The path of the input file or directory for matrix A.
The path of the input file or directory for matrix B.
The path of the directory for the output files for matrix C.
strategy = 1, 2, 3 or 4.
R = the number of reducers.
I = the number of rows in A and C.
K = the number of columns in A and rows in B.
J = the number of columns in B and C.
IB = the number of rows per A block and C block.
KB = the number of columns per A block and rows per B block.
JB = the number of columns per B block and C block.

Steps:

1. setup ()
2. var NIB = (I-1)/IB+1
3. var NKB = (K-1)/KB+1
4. var NJB = (J-1)/JB+1
5. map (key, value)
6. if from matrix A with key=(i,k) and value=a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key=(k,j) and value=b(k,j)
10. for 0 <= ib < NIB: emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))

Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb,
then by jb, then by m. Note that m = 0 for A data and m = 1 for B data.

The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:

11. r = ((ib*JB + jb)*KB + kb) mod R

12. These definitions for the sorting order and partitioner guarantee that each
reducer R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb],
with the data for the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IBxKB
14. var B = new matrix of dimension KBxJB
15. var sib = -1
16. var skb = -1

Reduce (key, valueList)

    if key is (ib, kb, jb, 0)
        // Save the A block.
        sib = ib
        skb = kb
        Zero matrix A
        for each value = (i, k, v) in valueList
            A(i,k) = v
    if key is (ib, kb, jb, 1)
        if ib != sib or kb != skb return // A[ib,kb] must be zero!
        // Build the B block.
        Zero matrix B
        for each value = (k, j, v) in valueList
            B(k,j) = v
        // Multiply the blocks and emit the result.
        ibase = ib*IB
        jbase = jb*JB
        for 0 <= i < row dimension of A
            for 0 <= j < column dimension of B
                sum = 0
                for 0 <= k < column dimension of A = row dimension of B
                    sum += A(i,k)*B(k,j)
                if sum != 0
                    emit (ibase+i, jbase+j), sum
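To connect the pseudo-code of steps 5 to 10 to Hadoop's Java API, a mapper might look like the sketch below. It assumes, purely for illustration, that each input line has the form "A,i,k,value" or "B,k,j,value" and that IB, KB, JB, NIB and NJB are passed in through the job Configuration; none of these names come from the algorithm description above.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits block keys (ib, kb, jb, m) with m = 0 for A entries and m = 1 for B entries.
public class BlockMatrixMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int IB, KB, JB, NIB, NJB;

    @Override
    protected void setup(Context context) {
        // Block sizes and block counts are read from the job configuration (assumed keys).
        IB  = context.getConfiguration().getInt("IB", 1);
        KB  = context.getConfiguration().getInt("KB", 1);
        JB  = context.getConfiguration().getInt("JB", 1);
        NIB = context.getConfiguration().getInt("NIB", 1);
        NJB = context.getConfiguration().getInt("NJB", 1);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed input format: "A,i,k,a(i,k)" or "B,k,j,b(k,j)"
        String[] f = value.toString().split(",");
        if (f[0].equals("A")) {
            int i = Integer.parseInt(f[1]), k = Integer.parseInt(f[2]);
            for (int jb = 0; jb < NJB; jb++) {
                context.write(new Text((i / IB) + "," + (k / KB) + "," + jb + ",0"),
                              new Text((i % IB) + "," + (k % KB) + "," + f[3]));
            }
        } else {
            int k = Integer.parseInt(f[1]), j = Integer.parseInt(f[2]);
            for (int ib = 0; ib < NIB; ib++) {
                context.write(new Text(ib + "," + (k / KB) + "," + (j / JB) + ",1"),
                              new Text((k % KB) + "," + (j % JB) + "," + f[3]));
            }
        }
    }
}

A production job would use a custom WritableComparable key together with the partitioner of step 11 so that keys sort numerically and land on the right reducer; plain Text keys are used here only to keep the sketch short.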

INPUT:

Data sets taken as the rows and columns of the input matrices.

OUTPUT:
EXERCISE-7:

AIM:

To install and run Pig, then write Pig Latin scripts to sort, group, join, project and filter data.

ALGORITHM:

STEPS FOR INSTALLING APACHE PIG:

1) Extract pig-0.15.0.tar.gz and move it to the home directory.
2) Set the environment of PIG in the bashrc file.
3) Pig can run in two modes, Local Mode and Hadoop Mode:
   pig -x local and pig
4) Grunt Shell
   grunt>
5) LOADING Data into the Grunt Shell
   DATA = LOAD <CLASSPATH> USING PigStorage(DELIMITER) AS (ATTRIBUTE1 : DataType1, ATTRIBUTE2 : DataType2, ...);
6) Describe Data
   DESCRIBE DATA;
7) DUMP Data
   DUMP DATA;
8) FILTER Data
   FDATA = FILTER DATA BY ATTRIBUTE == VALUE;
9) GROUP Data
   GDATA = GROUP DATA BY ATTRIBUTE;
10) Iterating Data
   FOR_DATA = FOREACH GDATA GENERATE group, AGGREGATE_FUNCTION(DATA.ATTRIBUTE);
11) Sorting Data
   SORT_DATA = ORDER DATA BY ATTRIBUTE ASC|DESC;
12) LIMIT Data
   LIMIT_DATA = LIMIT DATA COUNT;
13) JOIN Data
   JDATA = JOIN DATA1 BY (ATTRIBUTE1, ATTRIBUTE2, ...), DATA2 BY (ATTRIBUTE3, ..., ATTRIBUTEN);
A worked example over click count data is given after these steps.
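The following short script is a worked illustration of statements 5 to 12 over the click count input; the file name clicks.txt and its (user, url, clicks) schema are assumptions made only for this example.

-- Load tab-separated click data: user, url, clicks (assumed schema)
clicks = LOAD 'clicks.txt' USING PigStorage('\t')
         AS (user:chararray, url:chararray, clicks:int);
-- Project the needed columns and filter out rows with fewer than 10 clicks
projected = FOREACH clicks GENERATE user, clicks;
busy = FILTER projected BY clicks >= 10;
-- Group by user and sum the clicks per user
grouped = GROUP busy BY user;
per_user = FOREACH grouped GENERATE group AS user, SUM(busy.clicks) AS total_clicks;
-- Sort in descending order of total clicks and keep the top 5 users
sorted = ORDER per_user BY total_clicks DESC;
top5 = LIMIT sorted 5;
DUMP top5;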

INPUT:

Input as Website Click Count Data

OUTPUT:
EXERCISE-8:

AIM:

To install and run Hive, then use Hive to create, alter and drop databases, tables, views, functions and indexes.

ALGORITHM:

APACHE HIVE INSTALLATION STEPS

1) Install MySQL Server
   sudo apt-get install mysql-server
2) Configure the MySQL username and password
3) Create a user and grant all privileges
   mysql -u root -proot
   CREATE USER '<USER_NAME>' IDENTIFIED BY '<PASSWORD>';
4) Extract and configure Apache Hive
   tar xvfz apache-hive-1.0.1.bin.tar.gz
5) Move Apache Hive from the local directory to the home directory
6) Set the CLASSPATH in bashrc
   export HIVE_HOME=/home/apache-hive
   export PATH=$PATH:$HIVE_HOME/bin
7) Configure hive-default.xml by adding the MySQL Server credentials
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hadoop</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hadoop</value>
</property>
8) Copy mysql-connector-java.jar to the hive/lib directory.

SYNTAX for HIVE Database Operations

DATABASE Creation

CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>

Drop Database Statement


DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

Creating and Dropping Table in HIVE

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
Loading Data into a table
Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
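For instance, a web server log table could be created and loaded as shown below; the log_data column list, the tab delimiter and the local path are assumptions chosen only for illustration.

CREATE TABLE IF NOT EXISTS log_data (
  ip_address STRING,
  request_time STRING,
  url STRING,
  status_code INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/home/lendi/logs/access_log.txt' OVERWRITE INTO TABLE log_data;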

Alter Table in HIVE

Syntax:
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Creating and Dropping View

CREATE VIEW [IF NOT EXISTS] view_name
[(column_name [COMMENT column_comment], ...)]
[COMMENT table_comment]
AS SELECT ...

Dropping View
Syntax:

DROP VIEW view_name

Functions in HIVE

Mathematical Functions: round(), ceil(), floor() etc.
String Functions: substr(), upper(), lower(), trim() etc.
Date and Time Functions: year(), month(), day(), to_date() etc.
Aggregate Functions: sum(), min(), max(), count(), avg() etc.

INDEXES
CREATE INDEX index_name ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]

Creating Index
CREATE INDEX index_ip ON TABLE log_data (ip_address)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

Altering and Inserting Index

ALTER INDEX index_ip ON log_data REBUILD;

Storing Index Data in Metastore

SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;

SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

Dropping Index

DROP INDEX INDEX_NAME on TABLE_NAME;

INPUT

Input as Web Server Log Data


OUTPUT
EXERCISE-9:

AIM:

To solve some real life big data problems.

ALGORITHM:

Step 1: No organization can function without data these days

Step 2: All this data gets piled up in a huge data set that is referred to as Big Data.

Step 3: This data needs to be analyzed to enhance decision making

Step 4: There are several challenges in working with Big Data.

Step 5: Two such challenges, data security and confusion while selecting Big Data tools, are discussed below.

INPUT:

Data Security

Security can be one of the most daunting Big Data challenges especially for organizations
that have sensitive company data or have access to a lot of personal user information.
Vulnerable data is an attractive target for cyberattacks and malicious hackers.

When it comes to data security, most organizations believe that they have the right security
protocols in place that are sufficient for their data repositories. Only a few organizations
invest in additional measures exclusive to Big Data such as identity and access authority, data
encryption, data segregation, etc. Often, organizations are more immersed in activities
involving data storage and analysis. Data security is usually put on the back burner, which is
not a wise move at all as unprotected data can fast become a serious problem. Stolen records
can cost an organization millions.
Confusion while Big Data tool selection
Companies often get confused while selecting the best tool for Big Data analysis and
storage. Is HBase or Cassandra the best technology for data storage? Is Hadoop
MapReduce good enough or will Spark be a better option for data analytics and storage?

These questions bother companies and sometimes they are unable to find the answers. They
end up making poor decisions and selecting inappropriate technology. As a result, money,
time, efforts and work hours are wasted.

EXPECTED OUTPUT:

Solution for Data Security:

The following are the ways how an enterprise can tackle the security challenges of Big Data:

• Recruiting more cybersecurity professionals
• Data encryption and segregation
• Identity and access authorization control
• Endpoint security
• Real-time monitoring
• Using Big Data security tools such as IBM Guardium

Solution for Confusion while Big Data tool selection

The best way to go about it is to seek professional help. You can either hire experienced professionals who know much more about these tools, or go for Big Data consulting, where consultants recommend the best tools for your company's scenario. Based on their advice, you can work out a strategy and then select the best tool for you.
