HADOOP AND BIG DATA
EXERCISE-1
AIM:
To implement the following Data Structures in Java
a) Linked Lists b) Stack c) Queues d) Set e) Map
ALGORITHM:
Steps:
Linked List: Create a LinkedList, add elements with add(), and print the list.
Stack: Create a Stack (which extends Vector) and push elements onto it with push(item),
where item is the element to be pushed onto the stack.
Set: Demonstrate Set using HashSet; create sets, add elements, and form their union with addAll().
Map: A Map cannot be iterated over directly, so obtain a Set view of it using the keySet() or
entrySet() method; values() returns the stored values.
Queues: To add an element to a queue, use the add() or offer() method; peek() reads the head and
poll() removes it. Note that insertion order is not retained in a PriorityQueue.
PROGRAM:
import java.util.*;

class Main {
    public static void main(String[] args) {
        // Linked List
        LinkedList<String> lAnimals = new LinkedList<>();
        lAnimals.add("Dog");
        lAnimals.add("Cat");
        lAnimals.add("Cow");
        System.out.println("-----LINKED-----");
        System.out.println("LinkedList: " + lAnimals + "\n");

        // Stack
        Stack<String> sAnimals = new Stack<>();
        sAnimals.push("Dog");
        sAnimals.push("Horse");
        sAnimals.push("Cat");
        System.out.println("-----STACK-----");
        System.out.println("Stack: " + sAnimals + "\n");

        // Queue
        Queue<Integer> numbers = new LinkedList<>();
        numbers.offer(1);
        numbers.offer(2);
        numbers.offer(3);
        System.out.println("-----QUEUE-----");
        System.out.println("Queue: " + numbers);
        int accessedNumber = numbers.peek();   // read the head without removing it
        System.out.println("Accessed Element: " + accessedNumber);
        int removedNumber = numbers.poll();    // remove and return the head
        System.out.println("Removed Element: " + removedNumber);
        System.out.println("Updated Queue: " + numbers + "\n");

        // Set
        Set<Integer> set1 = new HashSet<>();
        set1.add(2);
        set1.add(3);
        System.out.println("-----SET-----");
        System.out.println("Set1: " + set1);
        Set<Integer> set2 = new HashSet<>();
        set2.add(1);
        set2.add(2);
        System.out.println("Set2: " + set2);
        set2.addAll(set1);                     // union of set1 and set2
        System.out.println("Union is: " + set2 + "\n");

        // Map
        Map<String, Integer> mNumbers = new HashMap<>();
        mNumbers.put("One", 1);
        mNumbers.put("Two", 2);
        System.out.println("-----MAP-----");
        System.out.println("Map: " + mNumbers);
        System.out.println("Keys: " + mNumbers.keySet());
        System.out.println("Values: " + mNumbers.values());
        System.out.println("Entries: " + mNumbers.entrySet());
        int value = mNumbers.remove("Two");    // remove a mapping and capture its value
        System.out.println("Removed Value: " + value);
    }
}
OUTPUT:
EXERCISE-2
AIM:
To set up and install Hadoop in its three operating modes:
Standalone
Pseudo Distributed
Fully Distributed
ALGORITHM:
STEPS INVOLVED IN INSTALLING HADOOP IN STANDALONE MODE:
1. Install ssh with the command: sudo apt-get install ssh
2. Generate a key pair with the command: ssh-keygen -t rsa -P ""
3. Append the public key to the authorized keys file by using the command: cat $HOME/.ssh/id_rsa.pub >>
$HOME/.ssh/authorized_keys
4. Extract Java by using the command: tar xvfz jdk-8u60-linux-i586.tar.gz
5. Extract Eclipse by using the command: tar xvfz eclipse-jee-mars-R-linux-gtk.tar.gz
6. Extract Hadoop by using the command: tar xvfz hadoop-2.7.1.tar.gz
7. Move Java to /usr/lib/jvm/ and Eclipse to /opt/. Configure the Java path in the
eclipse.ini file.
8. Export the Java path and the Hadoop path in ~/.bashrc.
9. Check whether the installation is successful by checking the java version and the hadoop
version.
10. Check whether the Hadoop instance works correctly in standalone mode by running the
WordCount example from the bundled Hadoop examples jar.
11. If the word count is displayed correctly in the part-r-00000 file, standalone mode has
been installed successfully.
STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO DISTRIBUTED MODE:
1. To install pseudo distributed mode, we need to configure the Hadoop configuration
files that reside in the directory /home/lendi/hadoop-2.7.1/etc/hadoop.
2. First configure the hadoop-env.sh file by setting the Java path.
3. Configure core-site.xml, which contains a property tag with a name and a value:
name fs.defaultFS and value hdfs://localhost:9000.
4. Configure hdfs-site.xml.
5. Configure yarn-site.xml.
6. Configure mapred-site.xml; before configuring it, copy mapred-site.xml.template to
mapred-site.xml.
7. Format the NameNode by using the command: hdfs namenode -format
8. Run start-dfs.sh and start-yarn.sh to start the daemons: NameNode, DataNode,
SecondaryNameNode, ResourceManager and NodeManager.
9. Run jps to view all daemons. Create a directory in HDFS by using the command
hdfs dfs -mkdir /csedir, enter some data into lendi.txt using the command nano lendi.txt,
copy it from the local directory to HDFS using the command hdfs dfs -copyFromLocal
lendi.txt /csedir/, and run the sample WordCount jar file to check whether pseudo
distributed mode is working or not.
10. Display the contents of the output file by using the command: hdfs dfs -cat /newdir/part-r-00000
FULLY DISTRIBUTED MODE INSTALLATION:
1. Stop all single node clusters
$stop-all.sh
2. Decide one host as the NameNode (Master) and the remaining hosts as DataNodes (Slaves).
3. Copy the public key to all three hosts to get password-less SSH access
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub lendi@l5sys24
4. Configure all Configuration files, to name Master and Slave Nodes.
$cd $HADOOP_HOME/etc/hadoop
$nano core-site.xml
$ nano hdfs-site.xml
5. Add hostnames to file slaves and save it.
$ nano slaves
6. Configure $ nano yarn-site.xml
7. On the Master Node, run:
$ hdfs namenode -format
$ start-dfs.sh
$ start-yarn.sh
8. Format the NameNode (hdfs namenode -format, as above).
9. Verify that the daemons have started on the Master and Slave Nodes.
10. END
INPUT :
ubuntu@localhost> jps
OUTPUT:
NameNode
DataNode
SecondaryNameNode
NodeManager
ResourceManager
EXERCISE-3
AIM:
Implement the following file management tasks in Hadoop:
• Adding files and directories
• Retrieving files
• Deleting Files
ALGORITHM:
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step-1 : Adding Files and Directories to HDFS
hadoop fs -mkdir -p /vimal/Hadoop
hadoop fs -put example.txt /vimal/Hadoop
Step-2 : Retrieving Files from HDFS
hadoop fs -cat /vimal/Hadoop/example.txt
Step-3 : Deleting Files from HDFS
hadoop fs -rm /vimal/Hadoop/example.txt
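The same file management tasks can also be performed programmatically through the HDFS Java
API. The sketch below is a minimal illustration, not part of the prescribed commands; it
assumes the NameNode address hdfs://localhost:9000 from Exercise 2, reuses the example paths
above, and needs the Hadoop client libraries on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileTasks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address from the pseudo-distributed setup in Exercise 2.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Adding a directory and a file (equivalent to -mkdir and -put).
        Path dir = new Path("/vimal/Hadoop");
        fs.mkdirs(dir);
        fs.copyFromLocalFile(new Path("example.txt"), new Path(dir, "example.txt"));

        // Retrieving a file (equivalent to -cat).
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(dir, "example.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        // Deleting a file (equivalent to -rm).
        fs.delete(new Path(dir, "example.txt"), false);
        fs.close();
    }
}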
INPUT:
Input as any data format of type structured, Unstructured or Semi Structured
EXPECTED OUTPUT:
EXERCISE-4
AIM:
To run a basic Word Count Map Reduce Program to understand the Map Reduce Paradigm
ALGORITHM :
MAPREDUCE PROGRAM
Our implementation consists of three main parts:
1. Mapper
2. Reducer
3. Driver
Step-1: Write a Mapper
Step-2: Write a Reducer
Step-3: Write Driver
STEPS:
1.Mapper:
Pseudo-code void Map (key, value)
{
for each word x in value:
output.collect(x, 1);
}
2.Reducer:
Pseudo-code void Reduce (keyword, <list of value>)
{
sum = 0;
for each x in <list of value>:
sum += x;
final_output.collect(keyword, sum);
}
3.Driver:
The Driver program configures and runs the MapReduce job. We use the main program to
perform basic configurations such as the following (a compact Java sketch of all three
parts is given after this list):
Job Name: the name of this Job
Executable (Jar) Class: the main executable class. Here, WordCount.
Mapper Class: the class which overrides the "map" function. Here, Map.
Reducer Class: the class which overrides the "reduce" function. Here, Reduce.
Output Key: the type of the output key. Here, Text.
Output Value: the type of the output value. Here, IntWritable.
File Input Path
File Output Path
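A compact Java sketch of the three parts is given below. It is a minimal version of the
standard WordCount pattern described above, with Mapper class Map, Reducer class Reduce,
driver class WordCount, Text output keys and IntWritable output values; the input and
output paths are taken from the command-line arguments.

import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Mapper: emit (word, 1) for every word in the input line.
    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configure the job name, classes, key/value types and I/O paths.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // file input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // file output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}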
INPUT:-
Set of Data Related Shakespeare Comedies, Glossary, Poems
OUTPUT:-
EXERCISE-5:
AIM:
To Write a Map Reduce Program that mines Weather Data
ALGORITHM :
Step-1: Write a Mapper
Pseudo-code void Map (key, value)
{
for each max_temp x in value:
output.collect(x, 1);
}
void Map (key, value)
{
for each min_temp x in value:
output.collect(x, 1);
}
Step-2: Write a Reducer
Pseudo-code void Reduce (max_temp, <list of value>)
{
sum = 0;
for each x in <list of value>:
sum += x;
final_output.collect(max_temp, sum);
}
void Reduce (min_temp, <list of value>)
{
sum = 0;
for each x in <list of value>:
sum += x;
final_output.collect(min_temp, sum);
}
Step-3: Write a Driver (a Java sketch follows the list below)
Job Name: the name of this Job
Executable (Jar) Class: the main executable (driver) class for this job.
Mapper Class: the class which overrides the "map" function. Here, Map.
Reducer Class: the class which overrides the "reduce" function. Here, Reduce.
Output Key: the type of the output key. Here, Text.
Output Value: the type of the output value. Here, IntWritable.
File Input Path
File Output Path
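As an illustration, the sketch below adapts the WordCount skeleton from Exercise 4 to
weather data, directly following the pseudo-code above: the mapper extracts the
maximum-temperature field from each record and emits (max_temp, 1), and the reducer sums
the counts for each temperature value. The class name WeatherCount and the record layout
(comma-separated fields with the maximum temperature in the second field) are assumptions
for the example; adjust the parsing to the actual data set. The same pattern applies to
min_temp.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeatherCount {
    // Mapper: pull the max_temp field out of each record and emit (max_temp, 1),
    // mirroring the Map pseudo-code above.
    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text maxTemp = new Text();
        @Override
        protected void map(Object key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 2) {           // assumed layout: date,max_temp,min_temp
                maxTemp.set(fields[1].trim());
                context.write(maxTemp, ONE);
            }
        }
    }

    // Reducer: sum the occurrences of each max_temp value.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "weather count");
        job.setJarByClass(WeatherCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}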
INPUT:
Set of Weather Data over the years
OUTPUT:
EXERCISE-6:
AIM:
To write a Map Reduce Program that implements Matrix Multiplication.
ALGORITHM:
We have the following input parameters:
The path of the input file or directory for matrix A.
The path of the input file or directory for matrix B.
The path of the directory for the output files for matrix C.
strategy = 1, 2, 3 or 4.
R = the number of reducers.
I = the number of rows in A and C.
K = the number of columns in A and rows in B.
J = the number of columns in B and C.
IB = the number of rows per A block and C block.
KB = the number of columns per A block and rows per B block.
JB = the number of columns per B block and C block.
Steps:
1. setup ()
2. var NIB = (I-1)/IB+1
3. var NKB = (K-1)/KB+1
4. var NJB = (J-1)/JB+1
5. map (key, value)
6. if from matrix A with key=(i,k) and value=a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key=(k,j) and value=b(k,j)
10. for 0 <= ib < NIB emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb,
then by jb, then by m. Note that m = 0 for A data and m = 1 for B data.
The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
11. r = ((ib*JB + jb)*KB + kb) mod R
12. These definitions for the sorting order and partitioner guarantee that each
reducer R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb],
with the data for the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IBxKB
14. var B = new matrix of dimension KBxJB
15. var sib = -1
16. var skb = -1
Reduce (key, valueList)
if key is (ib, kb, jb, 0)
// Save the A block.
sib = ib
skb = kb
Zero matrix A
for each value = (i, k, v) in valueList
A(i,k) = v
if key is (ib, kb, jb, 1)
if ib != sib or kb != skb return // A[ib,kb] must be zero!
// Build the B block.
Zero matrix B
for each value = (k, j, v) in valueList
B(k,j) = v
// Multiply the blocks and emit the result.
ibase = ib*IB
jbase = jb*JB
for 0 <= i < row dimension of A
for 0 <= j < column dimension of B
sum = 0
for 0 <= k < column dimension of A = row dimension of B
sum += A(i,k)*B(k,j)
if sum != 0 emit (ibase+i, jbase+j), sum
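The block algorithm above is fairly detailed, so as a smaller hedged illustration the
sketch below implements the simpler element-wise (single-job) MapReduce matrix
multiplication rather than the block strategy: every A(i,k) is replicated to each column j
of C, every B(k,j) to each row i of C, and each reduce call computes one cell C(i,j). The
"A,i,k,value" / "B,k,j,value" input format and the dimensions I and J read from the third
and fourth command-line arguments are assumptions made for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiply {
    // Mapper: each input line is "A,i,k,value" or "B,k,j,value".
    // A(i,k) is replicated to every output column j; B(k,j) to every output row i.
    public static class Map extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            int numRows = conf.getInt("I", 0);   // rows of A and C
            int numCols = conf.getInt("J", 0);   // columns of B and C
            String[] t = value.toString().split(",");
            if (t[0].equals("A")) {
                for (int j = 0; j < numCols; j++) {
                    context.write(new Text(t[1] + "," + j),
                                  new Text("A," + t[2] + "," + t[3]));
                }
            } else {
                for (int i = 0; i < numRows; i++) {
                    context.write(new Text(i + "," + t[2]),
                                  new Text("B," + t[1] + "," + t[3]));
                }
            }
        }
    }

    // Reducer: key is "i,j"; combine A(i,k) and B(k,j) over the shared index k.
    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws java.io.IOException, InterruptedException {
            java.util.Map<Integer, Double> a = new java.util.HashMap<>();
            java.util.Map<Integer, Double> b = new java.util.HashMap<>();
            for (Text v : values) {
                String[] t = v.toString().split(",");
                if (t[0].equals("A")) a.put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
                else                  b.put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
            }
            double sum = 0;
            for (java.util.Map.Entry<Integer, Double> e : a.entrySet()) {
                sum += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            }
            if (sum != 0) context.write(key, new Text(Double.toString(sum)));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("I", Integer.parseInt(args[2]));  // rows in A and C
        conf.setInt("J", Integer.parseInt(args[3]));  // columns in B and C
        Job job = Job.getInstance(conf, "matrix multiply");
        job.setJarByClass(MatrixMultiply.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}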
INPUT:
Set of Data sets over different Clusters are taken as Rows and Columns
OUTPUT:
EXERCISE-7:
AIM:
To install and Run Pig then write Pig Latin scripts to sort, group, join, project and
filter the data.
ALGORITHM:
STEPS FOR INSTALLING APACHE PIG:
1) Extract the pig-0.15.0.tar.gz and move to home directory
2) Set the environment of PIG in bashrc file.
3) Pig can run in two modes: Local Mode and Hadoop (MapReduce) Mode.
pig -x local (local mode) and pig (Hadoop mode)
4) Grunt Shell
grunt>
5) LOADING Data into Grunt Shell
DATA = LOAD '<PATH>' USING PigStorage('<DELIMITER>') AS
(ATTRIBUTE1 : DataType1, ATTRIBUTE2 : DataType2, ...);
6) Describe Data
Describe DATA;
7) DUMP Data
Dump DATA;
8) FILTER Data
FDATA = FILTER DATA BY ATTRIBUTE == VALUE;
9) GROUP Data
GDATA = GROUP DATA by ATTRIBUTE;
10) Iterating Data
FOR_DATA = FOREACH GDATA GENERATE group AS GROUP_FUN,
AGGREGATE_FUNC(DATA.ATTRIBUTE);
11) Sorting Data
SORT_DATA = ORDER DATA BY ATTRIBUTE [ASC|DESC];
12) LIMIT Data
LIMIT_DATA = LIMIT DATA COUNT;
13) JOIN Data
JOIN_DATA = JOIN DATA1 BY (ATTRIBUTE1, ATTRIBUTE2, ...),
DATA2 BY (ATTRIBUTE3, ..., ATTRIBUTEN);
INPUT:
Input as Website Click Count Data
OUTPUT:
EXERCISE-8:
AIM:
Install and Run Hive then use Hive to Create, alter and drop databases, tables, views,
functions and Indexes
ALGORITHM:
APACHE HIVE INSTALLATION STEPS
1) Install MySQL Server
sudo apt-get install mysql-server
2) Configure the MySQL username and password
3) Create a user and grant all privileges
mysql -uroot -proot
CREATE USER '<USER_NAME>' IDENTIFIED BY '<PASSWORD>';
GRANT ALL PRIVILEGES ON *.* TO '<USER_NAME>';
4) Extract and Configure Apache Hive
tar xvfz apache-hive-1.0.1.bin.tar.gz
5) Move Apache Hive from Local directory to Home directory
6) Set HIVE_HOME and PATH in ~/.bashrc
export HIVE_HOME=/home/apache-hive
export PATH=$PATH:$HIVE_HOME/bin
7) Configure hive-default.xml by adding the MySQL Server credentials
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
8) Copy the MySQL JDBC driver jar (mysql-connector-java.jar) to the hive/lib directory.
SYNTAX for HIVE Database Operations
DATABASE Creation
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Drop Database Statement
DROP DATABASE StatementDROP (DATABASE|SCHEMA) [IF EXISTS]
database_name [RESTRICT|CASCADE];
Creating and Dropping Table in HIVE
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS]
[db_name.]table_name [(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format]
Loading Data into table log_data
Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
Alter Table in HIVE
Syntax:
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Creating and Dropping View
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT
column_comment], ...) ] [COMMENT table_comment] AS SELECT ...
Dropping View
Syntax:
DROP VIEW view_name
Functions in HIVE
String and Mathematical Functions:- round(), ceil(), substr(), upper(), regexp_replace() etc
Date and Time Functions:- year(), month(), day(), to_date() etc
Aggregate Functions :- sum(), min(), max(), count(), avg() etc
INDEXES
CREATE INDEX index_name ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Creating Index
CREATE INDEX index_ip ON TABLE log_data(ip_address) AS
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED
REBUILD;
Altering and Inserting Index
ALTER INDEX index_ip ON log_data REBUILD;
Storing Index Data in Metastore
SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping Index
DROP INDEX INDEX_NAME ON TABLE_NAME;
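Besides the hive shell, the same kinds of DDL statements can be issued from Java over JDBC.
The sketch below is only an illustration and assumes a running HiveServer2 at
jdbc:hive2://localhost:10000 with the hive-jdbc driver on the classpath; the database name
weblogs, the log_data columns and the credentials are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDdlDemo {
    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint; adjust host, port and credentials to your setup.
        String url = "jdbc:hive2://localhost:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(url, "hadoop", "hadoop");
             Statement stmt = con.createStatement()) {
            // Database and table DDL corresponding to the syntax listed above.
            stmt.execute("CREATE DATABASE IF NOT EXISTS weblogs");
            stmt.execute("CREATE TABLE IF NOT EXISTS weblogs.log_data "
                       + "(ip_address STRING, request STRING, status INT) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE");
            stmt.execute("ALTER TABLE weblogs.log_data ADD COLUMNS (referrer STRING)");
            // A simple query through the same connection.
            try (ResultSet rs = stmt.executeQuery("SHOW TABLES IN weblogs")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
            stmt.execute("DROP TABLE IF EXISTS weblogs.log_data");
            stmt.execute("DROP DATABASE IF EXISTS weblogs CASCADE");
        }
    }
}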
INPUT
Input as Web Server Log Data
OUTPUT
EXERCISE-9:
AIM:
To solve some real life big data problems.
ALGORITHM:
Step 1: No organization can function without data these days.
Step 2: All this data piles up into huge data sets referred to as Big Data.
Step 3: This data needs to be analyzed to enhance decision making.
Step 4: Big Data brings a number of challenges with it.
Step 5: Two such challenges are data security and confusion during Big Data tool selection.
INPUT:
Data Security
Security can be one of the most daunting Big Data challenges, especially for organizations
that hold sensitive company data or have access to a lot of personal user information.
Vulnerable data is an attractive target for cyberattacks and malicious hackers.
When it comes to data security, most organizations believe that they have the right security
protocols in place that are sufficient for their data repositories. Only a few organizations
invest in additional measures exclusive to Big Data such as identity and access authority, data
encryption, data segregation, etc. Often, organizations are more immersed in activities
involving data storage and analysis. Data security is usually put on the back burner, which is
not a wise move at all as unprotected data can fast become a serious problem. Stolen records
can cost an organization millions.
Confusion while Big Data tool selection
Companies often get confused while selecting the best tool for Big Data analysis and
storage. Is HBase or Cassandra the best technology for data storage? Is Hadoop
MapReduce good enough or will Spark be a better option for data analytics and storage?
These questions bother companies and sometimes they are unable to find the answers. They
end up making poor decisions and selecting inappropriate technology. As a result, money,
time, efforts and work hours are wasted.
EXPECTED OUTPUT:
Solution for Data Security:
The following are the ways how an enterprise can tackle the security challenges of Big Data:
Recruiting more cybersecurity professionals
Data encryption and segregation
Identity and access authorization control
Endpoint security
Real-time monitoring
Using Big Data security tools such as IBM Guardium
Solution for Confusion while Big Data tool selection
The best way to go about it is to seek professional help. You can either hire experienced
professionals who know much more about these tools, or go for Big Data consulting. Here,
consultants will recommend the best tools based on your company's scenario. Based on their
advice, you can work out a strategy and then select the best tool for you.