CIS721 - Big Data Introduction


BIG DATA MANAGEMENT
Mohammed Shatnawi
Background Information
 The amount of data produced by mankind is growing rapidly due to:
 the advent of new technologies and devices
 new means of communication
• The amount of data produced from the beginning of time until 2003 was 5 billion GB.
• The same amount was created every two days in 2011,
• and every ten minutes in 2013.
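
To put these figures on a common scale, the short sketch below converts them to daily rates. Only the 5 billion GB total and the two periods come from the slide; the conversion and the helper name `gb_per_day` are our own illustration.

```python
# Figures quoted on the slide; the conversion to daily rates is our own arithmetic.
TOTAL_GB = 5_000_000_000          # 5 billion GB, produced up to 2003
MINUTES_PER_DAY = 24 * 60


def gb_per_day(period_minutes: float) -> float:
    """Daily rate if TOTAL_GB is produced once every `period_minutes`."""
    return TOTAL_GB * MINUTES_PER_DAY / period_minutes


rate_2011 = gb_per_day(2 * MINUTES_PER_DAY)   # once every two days
rate_2013 = gb_per_day(10)                    # once every ten minutes

print(f"2011: {rate_2011:.3g} GB/day")
print(f"2013: {rate_2013:.3g} GB/day")
print(f"growth factor 2011 -> 2013: {rate_2013 / rate_2011:.0f}x")
```

Taken at face value, the slide's numbers mean about 2.5 billion GB per day in 2011 and about 720 billion GB per day in 2013, a 288-fold increase in two years.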
What is Big Data?

 Big data is a collection of large datasets that cannot be processed using traditional computing techniques.

 It is not a single technique or a tool.


What Comes Under Big Data?
 Big data involves the data produced by different devices and applications:
 Black Box Data − The black box is a component of helicopters, airplanes, and jets that records data during flight.

 Social Media Data − Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.

 Stock Exchange Data − Stock exchange data holds information about the ‘buy’ and ‘sell’ decisions.

 Power Grid Data − Power grid data holds information about the power consumed by a particular node with respect to a base station.

 Transport Data − Transport data includes the model, capacity, distance, and availability of a vehicle.

 Search Engine Data − Search engines retrieve lots of data from different databases.
Types of Data
 Structured data − Relational data.

 Semi-structured data − XML data.

 Unstructured data − Word, PDF, Text, Media Logs.
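
As a rough illustration of the three types, the sketch below reads the same value from a relational table, an XML fragment, and a line of free text. The table, tag names, and log line are made up for this example; only the three categories come from the slide.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Structured: a relational table with a fixed schema, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (node TEXT, kwh REAL)")
conn.execute("INSERT INTO readings VALUES ('n1', 4.2)")
structured = conn.execute(
    "SELECT kwh FROM readings WHERE node = 'n1'"
).fetchone()[0]

# Semi-structured: XML carries its own tags, but no fixed schema is enforced.
doc = ET.fromstring("<reading><node>n1</node><kwh>4.2</kwh></reading>")
semi_structured = float(doc.findtext("kwh"))

# Unstructured: free text; the value has to be dug out by ad hoc parsing.
log_line = "node n1 reported 4.2 kWh at noon"
words = log_line.split()
unstructured = float(words[words.index("kWh") - 1])

print(structured, semi_structured, unstructured)
```

The less structure the data carries, the more of the parsing logic has to live in the application rather than in the storage system.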
Benefits of Big Data

 Marketing agencies learn about the response to their campaigns, promotions, and other advertising mediums.

 Production planning.

 Better and quicker service.


Big Data Technologies

 Operational Big Data: This includes systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.

 Analytical Big Data: This includes systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities.
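
The difference between the two workload styles can be sketched with a plain in-memory dictionary. This is a toy model, not MongoDB or an MPP system; the record fields and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical order records; a dict stands in for an operational store.
orders = {}  # order_id -> record


def put_order(order_id, region, amount):
    """Operational write: capture and store one record."""
    orders[order_id] = {"region": region, "amount": amount}


def get_order(order_id):
    """Operational read: fetch one record by its key."""
    return orders[order_id]


def revenue_by_region():
    """Analytical query: scan every record and aggregate."""
    totals = defaultdict(float)
    for record in orders.values():
        totals[record["region"]] += record["amount"]
    return dict(totals)


put_order(1, "north", 20.0)
put_order(2, "south", 5.0)
put_order(3, "north", 7.5)

print(get_order(2))
print(revenue_by_region())
```

Operational calls touch one record at a time, while the analytical query reads the whole collection, which is why the two are usually served by different systems.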
Operational vs. Analytical Systems
Big Data Challenges

The major challenges associated with big data are as follows:
 Capturing data
 Curation
 Storage
 Searching
 Sharing
 Transfer
 Analysis
 Presentation
Traditional Approach
Limitation of Traditional Approach

 The traditional approach works fine for applications that process less voluminous data,

 or data volumes up to the limit of the processor that is processing the data.
Google’s Solution
Hadoop

 Using the solution provided by Google, Doug Cutting and his team developed an open-source project called Hadoop.
 Hadoop runs applications using the MapReduce algorithm,
 where the data is processed in parallel across the machines of a cluster.
 In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data.
Hadoop
Hadoop Architecture

 At its core, Hadoop has two major layers:
 Processing/computation layer (MapReduce), and
 Storage layer (Hadoop Distributed File System).
Hadoop
MapReduce

 MapReduce is a parallel programming model for writing distributed applications.
 It was devised at Google for efficient processing of large amounts of data (multi-terabyte data sets) on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
 MapReduce programs run on Hadoop, which is an Apache open-source framework.
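
The programming model can be sketched with the classic word-count example. This is a single-process illustration in plain Python, not the Hadoop API: the function names are our own, and on a real cluster the map and reduce calls would run as separate tasks on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, so the map calls could run in parallel.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    # Each key is reduced independently, so the reduce calls could too.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data is big", "data is processed in parallel"]
counts = word_count(lines)
print(counts)
```

Because neither phase shares state between inputs, the framework is free to spread the work over many machines and to rerun a failed task without affecting the others.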
Hadoop Distributed File System

 The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS).
 It provides a distributed file system that is designed to run on commodity hardware.
 It has many similarities with existing distributed file systems.
 However, the differences from other distributed file systems are significant: it is highly fault-tolerant and is designed to be deployed on low-cost hardware.
 It provides high-throughput access to application data and is suitable for applications with large datasets.
Hadoop Modules
The Hadoop framework also includes the following two modules:
 Hadoop Common − These are Java libraries and utilities required by other Hadoop modules.
 Hadoop YARN − This is a framework for job scheduling and cluster resource management.
