Faculty of Engineering & Technology
Big Data Analytics (203105348)
B. Tech CSE 4th Year 7th Semester
PRACTICAL 2
Aim: Write a Word Count program in MapReduce over HDFS.
Description:
MapReduce is a framework for processing large datasets using a large number of computers
(nodes), collectively referred to as a cluster. Processing can occur on data stored in a
distributed file system such as HDFS. It is a method for distributing computation across
multiple nodes: each node processes the data that is stored at that node.
MapReduce consists of two main phases:
Mapper phase
Reducer phase
The input data set is split into independent blocks, which are processed in parallel. Each
input split is converted into key-value pairs. The mapper logic processes each key-value pair
and produces intermediate key-value pairs based on the implementation logic. The resulting
key-value pairs can be of a different type from the input key-value pairs. The output of the
mapper is the input to the reducer. The intermediate key-value pairs are sorted and grouped
by key, and the reducer logic is then applied to them to produce the output in the desired
format. The output is stored in HDFS.
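The phases described above can be illustrated with a minimal in-memory sketch. It is not a Hadoop program: it only imitates the map, sort/group, and reduce steps on a list of lines held in memory. The function names mapper, shuffle_and_sort, reducer, and word_count are chosen here for illustration and are not part of any Hadoop API.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit an intermediate (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group intermediate values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce phase: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an iterable of text lines."""
    intermediate = []
    for line in lines:
        intermediate.extend(mapper(line))
    return [reducer(key, values) for key, values in shuffle_and_sort(intermediate)]


if __name__ == "__main__":
    # Assumed sample input, used only to show the data flow
    sample = ["the quick brown fox", "the lazy dog", "The end"]
    for word, count in word_count(sample):
        print(word, count)
```

In a real cluster each call to mapper would run on the node that holds the corresponding input split, and the grouping step would be carried out by the framework rather than by user code.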
Enrollment No.: 2203051057106
Roll Number: 68
Div: 7A9(CSE)
Code:
import urllib.request

# URL of the text file whose words are counted
story = 'http://sixty-north.com/c/t.txt'

# Map step: read the file line by line and collect every word
each_word = []
with urllib.request.urlopen(story) as response:
    for line in response:
        # Each line arrives as bytes; decode it before splitting on whitespace
        line_words = line.decode('utf-8').split()
        for word in line_words:
            each_word.append(word)

# Reduce step: count the occurrences of each word, ignoring case
same_words = {}
for words in each_word:
    key = words.lower()
    if key not in same_words:
        same_words[key] = 1
    else:
        same_words[key] += 1

# Print every distinct word with its count
for each in same_words:
    print("word =", each, "count =", same_words[each])
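The program above counts words in a single Python process. To run the same logic as separate map and reduce steps over HDFS, one common option is Hadoop Streaming, which passes input lines to a script on standard input and reads tab-separated key/value records from its standard output. The sketch below shows only the two functions under that assumption; the names map_lines and reduce_lines are hypothetical, and the job submission command is omitted because it depends on the cluster setup.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: yield one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_records):
    """Reducer: sum the counts of consecutive records that share a word.

    Assumes the records arrive sorted by key, which is how the
    framework delivers mapper output to a reducer.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

The two steps can be checked locally without a cluster by sorting the mapper output before passing it to the reducer, for example list(reduce_lines(sorted(map_lines(["a b a", "B c"])))). In a streaming job, each function would instead be fed from sys.stdin and its records printed to standard output.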
Output: