CS8091 - Big Data Analytics - Unit 1
Contents
1.1 Introduction
1.2 Evolution of Big data
1.3 Best Practices for Big Data Analytics
1.4 Big Data Characteristics
1.5 Validating the Promotion of the Value of Big Data
1.6 Big Data Use Cases
1.7 Characteristics of Big Data Applications
1.8 Perception and Quantification of Value
1.9 Understanding Big Data Storage
1.10 A General Overview Of High-Performance Architecture
1.11 Architecture of Hadoop
1.12 Hadoop Distributed File System (HDFS)
1.13 Architecture of HDFS
1.14 Map Reduce and YARN
1.15 Map Reduce Programming Model
Summary
Two Marks Questions with Answers [Part - A Questions]
Part - B Questions
1.1 Introduction
Due to massive digitalization, a large amount of data is being generated by the web applications and social networking sites that many organizations run on the internet. In today's technological world, high computational power and large storage capacity are basic needs, and both have increased significantly over time. Organizations today produce huge amounts of data at a rapid rate; according to the global internet usage report cited by Wikipedia, about 51% of the world's population uses the internet for their day-to-day activities. Most of them use the internet for web surfing, online shopping, or interacting on social media sites such as Facebook, Twitter or LinkedIn. These websites generate massive amounts of data through the uploading and downloading of videos, pictures and text messages, whose size is almost unpredictable given the large number of users.
A recent survey on data generation says that Facebook produces 600 TB of data per day and analyzes 30+ petabytes of user-generated data, a Boeing jet airplane generates more than 10 TB of data per flight including geo-maps and other information, Walmart handles more than 1 million customer transactions every hour amounting to an estimated 2.5 petabytes of data per day, about 0.4 million tweets are generated on Twitter per minute, and 400 hours of new video are uploaded to YouTube, which is accessed by 4.1 million users. Therefore, it becomes necessary to manage such a huge amount of data, generally called "Big data", from the perspective of its storage, processing and analytics.
In big data, the data is generated in many formats, namely structured, semi-structured and unstructured. Structured data has a fixed pattern or schema and can be stored and managed using tables in an RDBMS. Semi-structured data does not have a rigid pre-defined structure or pattern; examples include scientific or bibliographic data, which can be represented using graph data structures. Unstructured data likewise has no standard structure, pattern or schema; examples of unstructured data are videos, audio, images, PDFs, compressed files, log files and JSON files. Traditional database management techniques are incapable of storing, processing, handling and analyzing big data in these various formats, which include images, audio, videos, maps, text, XML and so on.
The processing of big data using a traditional database management system is very difficult because of its four characteristics, called the 4 Vs of Big data, shown in Fig. 1.1.1. In Big data, Volume refers to the amount of data being generated every minute or second, Variety refers to the types of data generated, including structured, unstructured and semi-structured data, Velocity refers to the speed at which data is generated per minute or per second, and Veracity refers to the uncertainty of the data being generated.
Because of the above four Vs, it becomes more and more difficult to capture, store, organize, process and analyze the data generated by various web applications and websites. In a traditional analytics system, cleansed, meaningful data is collected, moved into a data warehouse through Extract, Transform and Load (ETL) operations, and stored in an RDBMS for analysis. Such a system supports only cleansed, structured data used for batch processing, and parallel processing of this data with traditional analytics was costly because of the expensive hardware required. Therefore, big data analytics solutions came into the picture, and they have many advantages over traditional analytics solutions : they support both real-time and batch processing, analyze different formats of data, can process uncleansed or uncertain data, do not require expensive hardware, support huge volumes of data generated at any velocity, and perform data analytics at low cost.
Therefore, it is best to begin with a definition of big data. The analyst firm Gartner can be credited with the most frequently used (and perhaps somewhat abused) definition : Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Therefore, every successful Big Data project tends to start with smaller data sets and targeted goals.
2) Think big for scalability : While defining a Big data system, always follow a futuristic approach. That means estimating how much data will be collected six months from now and how many more servers will be needed to handle it. This approach allows applications to be scaled easily without running into bottlenecks.
3) Avoid bad practices : There are many potential reasons for Big Data project failure. To make a Big data project successful, the following bad practices must be avoided :
a) Rather than blindly adopting and deploying a technology, first understand the business purpose it is meant to serve, so that the right analytics tools are implemented for the job at hand. Without a solid understanding of the business requirements, the project will end up without its intended outcome.
b) Do not assume that the software will have all of the solutions to your problem, as the business requirements, environment and inputs/outputs vary from project to project.
c) Do not consider the solution of one problem relevant for every problem; each problem has unique requirements and needs a unique solution, which cannot simply be reused to solve other problems. As a result, new methods and tools might be required to capture, cleanse, store and process at least some of your Big Data.
d) Do not appoint the same person to handle every type of analytical operation, as a lack of the requisite business and analytical expertise may lead to failure of the project. Big data projects require analytics professionals with statistical, actuarial and other sophisticated skills, with expertise in advanced analytics operations.
4) Treat a big data problem as a scientific experiment : In a Big data project, collecting and analyzing the data is only part of the procedure; analytics produces business value only when its findings are incorporated into the business processes they are intended to improve. Therefore, every Big data problem requires a feedback loop that passes back the success of the actions taken as a result of the analytical findings, followed by improvement of the analytical models based on the business results.
5) Decide what data to include and what to leave out : Although Big Data analytics projects involve large data sets, that does not mean all the data generated by a system should be analyzed. Therefore, it is necessary to select the appropriate datasets for analysis based on their value and expected outcomes.
6) Must have a periodic maintenance plan : The success of a Big Data analytics initiative requires regular maintenance of the analytics programs on top of changes in business requirements.
7) In-memory processing : In-memory processing of large datasets should be considered for improvements in data-processing speed and in the volume of data handled. It can give performance hundreds of times better than older technologies, better price-to-performance ratios, reductions in the cost of central processing units and memory, and the ability to handle rapidly expanding volumes of information.
• Does not need high-end servers, as it can run on commodity hardware
Some further benefits are promoted by incorporating business intelligence and data warehouse tools with big data, such as enhanced business planning with product analysis, optimized supply chain management, and detection and analysis of fraud, waste and abuse.
3. Large data volumes : Due to the huge volume of data, the analytical application must sustain high rates of data creation and delivery.
4. Significant data variety : Due to the diversity of applications, the generated data may include a variety of types, structured as well as unstructured, produced by different data sources.
5. Data parallelization : As a big data application needs to process a huge amount of data, the application's runtime can be improved through task- or thread-level parallelization applied to independent data segments.
Some of the big data applications and their characteristics are given in Table 1.7.1.
b) Whether the Big data system is "Lowering the costs" of the organization's spending, such as capital expenses (Capex) and operational expenses (Opex).
c) Whether the Big data system is "Increasing the productivity" by speeding up the process of execution with efficient results.
d) Whether the Big data system is "Reducing the risk"; for example, a big data platform collecting data from streams of automated sensors can provide full visibility into operations.
To get a better understanding of the architecture of a big data platform, we will examine the Apache Hadoop software stack, since it is a collection of open source projects that are combined to enable a software-based big data appliance. A general overview of the high-performance architecture of Hadoop is shown in Fig. 1.10.1.
The different components of the Hadoop ecosystem are explained below.
1) HDFS : The Hadoop Distributed File System splits data into blocks and distributes them among the servers for processing. It stores several copies (replicas) of each data block across the nodes of the cluster, which can be used in case a failure occurs.
2) Map reduce : A programming model for processing big data, comprising two programs usually written in Java, the mapper and the reducer. The mapper extracts data from HDFS and puts it into maps, while the reducer aggregates the results generated by the mappers.
3) Zookeeper : A centralized service used for maintaining configuration information and providing distributed synchronization and coordination.
3) HDFS Client : In the Hadoop distributed file system, user applications access the file system using the HDFS client. Like any other file system, HDFS supports operations to read, write and delete files, and operations to create and delete directories. The user references files and directories by paths in the namespace. The user application does not need to be aware that file system metadata and storage are on different servers, or that blocks have multiple replicas. When an application reads a file, the HDFS client first asks the name node for the list of data nodes that host replicas of the blocks of the file. The client then contacts a data node directly and requests the transfer of the desired block. When a client writes, it first asks the name node to choose data nodes to host replicas of the first block of the file, organizes a pipeline from node to node and sends the data. When the first block is filled, the client requests new data nodes to be chosen to host replicas of the next block; the choice of data nodes for each block is likely to be different.
4) HDFS Blocks : In general, user data is stored in HDFS in terms of blocks; the files in the file system are divided into one or more segments called blocks. The default size of an HDFS block is 64 MB, which can be increased as per need.
HDFS is fault tolerant : if a data node fails, the block currently being written on that data node is re-replicated to another node. The block size, number of replicas and replication factor are specified in the Hadoop configuration file. Synchronization between the name node and the data nodes is achieved through heartbeat messages, which are periodically sent by each data node to the name node. A small code sketch illustrating client-side file access is given after this list.
Apart from the above components, a job tracker and task trackers are used when a map reduce application runs over HDFS. Hadoop core consists of one master job tracker and several task trackers. The job tracker runs on the name node as a master, while the task trackers run on the data nodes as slaves.
The job tracker is responsible for taking requests from a client and assigning them to task trackers with tasks to be performed. The job tracker always tries to assign tasks to a task tracker on the data node where the data is locally present. If for some reason the node fails, the job tracker assigns the task to another task tracker where a replica of the data exists, since the data blocks are replicated across the data nodes. This ensures that the job does not fail even if a node fails within the cluster.
HDFS can be manipulated using the command line. All commands used for manipulating HDFS through the command line interface begin with "hadoop fs". Most of the Linux file commands are supported over HDFS and start with a "-" sign.
For example, the command for listing the files in a hadoop directory is
#hadoop fs -ls
The general syntax of HDFS command line manipulation is
#hadoop fs -<command>
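For instance, a few commonly used commands are shown below (the paths are placeholders used only for illustration) :
#hadoop fs -mkdir /user/demo (creates a directory in HDFS)
#hadoop fs -put sample.txt /user/demo (copies a local file into HDFS)
#hadoop fs -cat /user/demo/sample.txt (displays the contents of an HDFS file)
#hadoop fs -get /user/demo/sample.txt . (copies a file from HDFS to the local file system)
#hadoop fs -rm /user/demo/sample.txt (deletes a file from HDFS)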
The most popular HDFS commands are given in Table 1.13.1
In YARN, resource management has been centralized in a Resource Manager, while the management of resources at each node is performed by a local Node Manager. In addition, an Application Master is associated with every application; it negotiates directly with the central Resource Manager for resource allocation and effective scheduling to improve node utilization, and it provides progress monitoring and status tracking.
Last, the YARN approach enables applications to be better aware of how data is allocated across the topology of the resources within a cluster. This awareness allows improved colocation of compute and data resources, reducing data movement and thus lessening the delays associated with data access latencies. The outcome should be increased scalability and performance.
• Error handling : The map reduce engine provides fault tolerance mechanisms in case of failure. When tasks are running on different cluster nodes and a failure occurs, the map reduce engine finds the incomplete tasks and reschedules them for execution on different nodes.
• Scheduling : Map reduce involves map and reduce operations that divide large problems into smaller chunks, which are run in parallel by different machines; the different tasks therefore need to be scheduled on computational nodes on a priority basis, and this is taken care of by the map reduce engine.
The shuffle and sort phases are components of the reducer side. Shuffling is the process of partitioning the mapped output and moving it to the reducers, where the intermediate keys are assigned to a reducer. Each partition is called a subset, and each subset becomes input to a reducer. In general, the shuffle phase ensures that the partitioned splits reach the appropriate reducers, where each reducer uses the HTTP protocol to retrieve its own partition from the mappers.
The sort phase is responsible for sorting the intermediate keys on a single node automatically before they are presented to the reducer. The shuffle and sort phases occur simultaneously, while the mapped output is being fetched and merged.
The reducer reduces the set of intermediate values that share a key to a smaller set of values. The reducer uses the sorted input to generate the final output, which is written by the reducer using a record writer into an output file with the standard output format.
The final output of each map reduce program is a set of key-value pairs written to an output file, which is written back to the HDFS store.
For example, the word count process using map reduce, with all phases of execution, is illustrated in Fig. 1.15.3.
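To make the model concrete, a minimal word-count mapper and reducer written against the org.apache.hadoop.mapreduce API is sketched below. The class names are illustrative and the driver class that configures and submits the job is omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper : emits (word, 1) for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer : sums the counts received for each word (kept in the same source
// file as a non-public class only to keep the sketch compact).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}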
The split size can be calculated using the computeSplitSize() method provided by FileInputFormat. Each input split is associated with a record reader, which loads data from the source and converts it into the key-value pairs defined by the InputFormat.
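As a rough sketch, in mainstream Hadoop releases computeSplitSize() essentially clamps the block size between the configured minimum and maximum split sizes; the exact code may differ across versions and should be checked against the source of the release in use.

// Sketch of the split-size computation in FileInputFormat :
// the block size is clamped between the minimum and maximum split sizes.
long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}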
The input to the mapper is an input key, an input value and a context that gives information about the current task, while the input to the reducer is a key with its set of values; an output collector collects the key-value pairs emitted by the mapper or reducer, and the reporter is used to report progress, update counters and provide status information for a map reduce job.
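For reference, a small sketch of a mapper written against the older org.apache.hadoop.mapred API is given below; it shows the OutputCollector and Reporter parameters mentioned above (in the newer org.apache.hadoop.mapreduce API these are replaced by a single Context object). The class name and the emitted output are illustrative only.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old-API mapper : emits each input line with a count of 1.
public class OldApiLineMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        reporter.setStatus("processing offset " + key.get());   // status information
        reporter.incrCounter("demo", "lines", 1);                // update a counter
        output.collect(value, new IntWritable(1));               // collect (key, value) output
    }
}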
Summary
• Due to massive digitalization, a large amount of data is being generated by the web applications and social networking sites that many organizations run on the internet; such data is called Big data.
• In big data, the data is generated in many formats like structured, semi structured
or unstructured.
• Structured data has a fixed pattern or schema and can be stored and managed using tables in an RDBMS; semi-structured data does not have a rigid pre-defined structure or pattern (for example, scientific or bibliographic data, which can be represented using graph data structures); unstructured data has no standard structure, pattern or schema at all.
• The processing of big data using traditional database management system is very
difficult because of its four characteristics called 4 Vs of Big data those are
Volume, Variety, Velocity and Veracity.
• The first Big Data challenge came into the picture at the US Census Bureau in 1880, where information concerning approximately 60 million people had to be collected, classified and reported; that processing took more than 10 years.
• The Apache Hadoop is the open source framework for solving Big data problem.
• The common Big data use cases are business intelligence through querying, reporting and searching of datasets; using big data analytics tools for report generation, trend analysis, search optimization and information retrieval; and performing predictive, prescriptive and descriptive analytics.
• The popular applications of Big data analytics are Fraud detection, Data profiling,
Clustering, Price modelling and Recommendation System.
Two Marks Questions with Answers [Part A - Questions]
Q.1 Define Big data and also enlist the advantages of Big data analytics.
Ans. : Big data is high-volume, high-velocity and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced insight
and decision making.
• Does not need high-end servers, as it can run on commodity hardware
1. Volume - The quantity of data that is generated is very important in this context. It
is the size of the data which determines the value and potential of the data under
consideration and whether it can actually be considered Big Data or not. The name
'Big Data' itself contains a term which is related to size and hence the characteristic.
2. Variety - The next aspect of Big Data is its variety. This means that the category to which Big Data belongs is also an essential fact that needs to be known by the data analysts. This helps the people who closely analyze the data, and are associated with it, to use the data effectively to their advantage, thus upholding the importance of the Big Data.
3. Velocity - The term 'velocity' in the context refers to the speed of generation of data
or how fast the data is generated and processed to meet the demands and the
challenges which lie ahead in the path of growth and development.
4. Variability - This is a factor which can be a problem for those who analyze the data.
This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
5. Veracity - The quality of the data being captured can vary greatly. Accuracy of
analysis depends on the veracity of the source data.
6. Complexity - Data management can become a very complex process, especially
when large volumes of data come from multiple sources. These data need to be
linked, connected and correlated in order to be able to grasp the information that is
supposed to be conveyed by these data. This situation is therefore termed the 'complexity' of Big Data.
Q.3 What are the features of HDFS ? AU : May-17
Ans. : The features of HDFS are given as follows :
• It supports file operations such as read, write, delete and append, but not in-place update.
• It provides Java APIs and a command line interface to interact with HDFS.
• It provides file permissions and authentication for files stored on HDFS.
The shuffle and sort phases are components of the reducer side. Shuffling is the process of partitioning the mapped output and moving it to the reducers, where the intermediate keys are assigned to a reducer. Each partition is called a subset, and each subset becomes input to a reducer. In general, the shuffle phase ensures that the partitioned splits reach the appropriate reducers, where each reducer uses the HTTP protocol to retrieve its own partition from the mappers. The sort phase is responsible for sorting the intermediate keys on a single node automatically before they are presented to the reducer. The shuffle and sort phases occur simultaneously, while the mapped output is being fetched and merged.
The reducer reduces the set of intermediate values that share a key to a smaller set of values. The reducer uses the sorted input to generate the final output, which is written by the reducer using a record writer into an output file with the standard output format. The final output of each map reduce program is a set of key-value pairs written to an output file, which is written back to the HDFS store.
Part - B Questions
Q.1 Explain Big data characteristics along with their use cases. AU : May-17
Q.2 What are the 4 V's of Big data ? Also explain best practices for Big data analytics.
Q.3 Explain generalized architecture of ?
Q.4 Explain architecture of Hadoop along with components of High-performance
architecture of Big data.
Q.5 Explain the functionality of map-Reduce Programming model. AU : May-17
Q.6 Explain the functionality of HDFS and map-reduce in detail. AU : Nov.-18
❑❑❑