MODULE-I
Data Analytics
INTRODUCTION
The Definition
Data Analytics (DA) is the process of examining data
sets in order to find trends and draw conclusions
about the information they contain.
Data analytics is the science of analyzing raw data
to make conclusions about that information.
Data analytics helps individuals and organizations
make sense of data. DA typically analyzes raw data
for insights and trends.
Data analytics helps a business optimize its
performance, maximize profit, or make more
strategically guided decisions.
MODULE-I DATA ANALYTICS 2
Why Big Data ?
MODULE-I DATA ANALYTICS 3
Why Big Data?
According to IBM, “90% of the data in the world
today was created in the last two years.”
As data continues to grow, so does the need to
organize it. Collecting such a huge amount of
data would just be a waste of time, effort, and
storage space if it could not be put to any
logical use.
The need to sort, organize, analyze, and offer
this critical data in a systematic manner has led
to the rise of the much-discussed term, Big Data.
MODULE-I DATA ANALYTICS 4
Source of Data
Sensors used to collect climate information
Posts to social media sites
Digital pictures and videos
Purchase transaction records
Cell phone GPS signals
Web logs
Chat history
MODULE-I DATA ANALYTICS 5
Examples
Mobile devices (tracking all objects all the time)
Social media and networks (all of us are generating data)
Scientific instruments (collecting all sorts of data)
Sensor technology and networks (measuring all kinds of data)
MODULE-I DATA ANALYTICS 6
Real World Examples
Consumer product companies and retail
organizations are observing data on social
media websites such as Facebook and Twitter.
Customer behaviour, preferences, and product
perception are analyzed, and the companies
can accordingly line up their products to gain
profits.
MODULE-I DATA ANALYTICS 7
Real World Examples
Manufacturers are also monitoring social networks,
but with a different goal: they are using them to
detect after-market support issues before a
warranty failure becomes publicly detrimental.
Financial service organizations are using data
mined from customer interactions to slice and
dice their users into finely tuned segments. This
enables these financial institutions to create
increasingly relevant and sophisticated offers.
MODULE-I DATA ANALYTICS 8
Real World Examples
Advertising and marketing agencies are
tracking social media to understand
responsiveness to campaigns, promotions,
and other advertising media.
Insurance companies are using data analysis to
see which home insurance applications can be
immediately processed and which ones need
a validating in-person visit from an agent.
MODULE-I DATA ANALYTICS 9
Real World Examples
Hospitals are analyzing medical data and
patient records to predict which patients are
likely to seek readmission within a few
months of discharge. The hospital can then
intervene in the hope of preventing another
costly hospital stay.
Health bands / personal fitness devices
MODULE-I DATA ANALYTICS 10
Real World Examples
Google Analytics
Google Analytics is a free web analytics tool offered
by Google to help you analyze your website traffic.
If you are running any marketing activities such as
search ads or social media ads, your users are most
likely going to visit your website somewhere along
their user journey.
Google Analytics is a free tool that can help you
track your digital marketing effectiveness.
MODULE-I DATA ANALYTICS 11
Google Analytics (Customer Behaviour Analytics)
Google Analytics puts several lines of tracking code (scripts)
into the code of your website. The code records various
activities of your users when they visit your website, along with
the attributes (such as age, gender, interests) of those users. It
then sends all that information to the GA (Google Analytics)
server once the user exits your website.
Metrics include the number of users, bounce rates, average
session durations, items added to cart, goal completions,
pages per session, impulsive buying, etc.
Next, Google Analytics aggregates the data collected from your website in
multiple ways, primarily by four levels:
User level (related to actions by each user)
Session level (each individual visit)
Page-view level (each individual page visited)
Event level (button clicks, video views, etc.)
The GA reports group these into Audience, Acquisition, and Behaviour views,
with metrics such as bounce rate and sessions.
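To make these levels concrete, here is a minimal sketch (plain Python with hypothetical event fields, not the Google Analytics API) of rolling raw page-view events up into session-level and user-level metrics such as bounce rate:

    from collections import defaultdict

    # Hypothetical raw page-view events: (user_id, session_id, page)
    events = [
        ("u1", "s1", "/home"), ("u1", "s1", "/cart"),
        ("u2", "s2", "/home"),                    # single-page session -> a bounce
        ("u1", "s3", "/home"), ("u1", "s3", "/checkout"),
    ]

    # Session level: count page views per session
    pages_per_session = defaultdict(int)
    for user_id, session_id, page in events:
        pages_per_session[session_id] += 1

    # Bounce rate = share of sessions with exactly one page view
    bounces = sum(1 for n in pages_per_session.values() if n == 1)
    bounce_rate = bounces / len(pages_per_session)

    # User level: number of sessions per user
    sessions_per_user = defaultdict(set)
    for user_id, session_id, _ in events:
        sessions_per_user[user_id].add(session_id)

    print(dict(pages_per_session))   # {'s1': 2, 's2': 1, 's3': 2}
    print(round(bounce_rate, 2))     # 0.33
    print({u: len(s) for u, s in sessions_per_user.items()})   # {'u1': 2, 'u2': 1}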
MODULE-I DATA ANALYTICS 12
Uber/Ola - Data Analytics
All of this data is collected, crunched, analyzed,
and used to predict everything from the
customer’s wait time and Estimated Time of Arrival
(ETA) to recommending where drivers should
place themselves, via heat maps, in order to take
advantage of the best fares and the most passengers.
MODULE-I DATA ANALYTICS 13
Amazon Uses Data Analytics
Personalized Recommendation System
Book Recommendations from Kindle Highlighting
One-Click Ordering
MODULE-I DATA ANALYTICS 14
The Vs of Big Data
MODULE-I DATA ANALYTICS 15
Volume
MODULE-I DATA ANALYTICS 16
Volume
The Earthscope is the world's largest science
project. Designed to track North America's
geological evolution, this observatory records data
over 3.8 million square miles, amassing 67
terabytes of data. It analyzes seismic slips in the
San Andreas fault, sure, but also the plume of
magma underneath Yellowstone and much, much
more.
MODULE-I DATA ANALYTICS 17
The 5 Vs
MODULE-I DATA ANALYTICS 18
Variety
MODULE-I DATA ANALYTICS 19
Variety
Data is generated from internal, external,
social, and behavioural sources.
It comes in different formats, such as images,
text, videos, etc.
Data Source | Definition | Source | Application
Internal | Structured data | CRM, ERP | Support daily business operations
External | Unstructured data | Internet | Understand customers, competitors, markets
MODULE-I DATA ANALYTICS 20
Variety
MODULE-I DATA ANALYTICS 21
Structuring Big Data
In simple terms, structuring is arranging the available data in a
format such that it becomes easy to study,
analyze, and derive conclusions from it.
MODULE-I DATA ANALYTICS 22
Why is structuring required?
In our daily life, you may have come across
questions like,
‒ How do I use to my advantage the vast amount of
data and information I come across?
‒ Which news articles should I read of the thousands I
come across?
‒ How do I choose a book of the millions available on
my favorite sites or stores?
‒ How do I keep myself updated about new events,
sports, inventions, and discoveries taking place
across the globe?
MODULE-I DATA ANALYTICS 23
Structuring Big Data
Solutions to these questions can be found in information
processing systems.
Analysis can be done based on:
What you searched
What you looked at
How long you remained at a particular website
Structuring data helps in understanding user
behaviour, requirements, and preferences to make
personalized recommendations for every individual.
MODULE-I DATA ANALYTICS 24
Characteristics of Data
Composition: deals with the structure of the data, i.e., the source,
the granularity, the type, and the nature (static or real-time streaming).
Condition: deals with the state of the data, i.e., its usability for
analysis – does it require cleaning for further enhancement and enrichment?
Context: deals with “where has it been generated”, “why was this
generated”, “how sensitive is this”, “what are the associated events”,
and so on.
MODULE-I DATA ANALYTICS 25
Classification of Digital Data
Digital data is classified into the following categories:
Structured data
Semi-structured data
Unstructured data
Approximate percentage distribution of digital data
MODULE-I DATA ANALYTICS 26
Structured Data
It is defined as the data that has a well-defined repeating pattern and this
pattern makes it easier for any program to sort, read, and process the data.
This data is in an organized form (e.g., in rows and columns) and can be easily
used by a computer program.
Relationships exist between entities of data.
Structured data:
Organizes data in a pre-defined format.
Is stored in tabular form.
Is data that resides in fixed fields within a record or file.
Is formatted data that has entities and their attributes mapped.
Is used to query and report against predetermined data types.
Sources: relational databases, multidimensional databases, legacy databases, and flat files.
MODULE-I DATA ANALYTICS 27
Ease with Structured Data
Insert/ DML operations provide the required ease with data
input, storage, access, process , analysis etc.
Update/Delete
Encryption and tokenization solution to warrant the
security of information throughout life cycle.
Security Organization able to retain control and maintain
compliance adherence by ensuring that only authorized
are able to decrypt and view sensitive information.
Indexing speed up the data retrieval operation at the
Structured data Indexing cost of additional writes and storage space, but the
benefits that ensure in search operation are worth the
additional writes and storage spaces.
The storage and processing capabilities of the traditional
Scalability DBMS can be easily be scaled up by increasing the
horsepower of the database server.
Transaction RDBMS has support of ACID properties of transaction
to ensure accuracy, completeness and data integrity.
Processing
MODULE-I DATA ANALYTICS 28
MODULE-I DATA ANALYTICS 29
Semi-structured Data
Semi-structured data, also known as having a schema-less or self-describing
structure, refers to a form of data which does not conform to a data model as in a
relational database, but has some structure.
In other words, the data is not stored consistently in rows and columns of a
database.
However, it is not in a form which can be used easily by a computer program.
Examples: emails, XML, markup languages like HTML, etc. Metadata for this data
is available but is not sufficient.
Sources: XML, JSON, other markup languages, and web data in the form of cookies.
MODULE-I DATA ANALYTICS 30
XML, JSON, BSON format
Source (XML & JSON): http://sqllearnergroups.blogspot.com/2014/03/how-to-get-json-format-through-sql.html
Source (JSON & BSON): http://www.expert-php.fr/mongodb-bson/
MODULE-I DATA ANALYTICS 31
JSON & XML
JSON (JavaScript Object Notation) example:
var myObj = {name: "John", age: 31, city: "New York"};
JSON is used primarily to transmit data between a server
and a web application.
XML (eXtensible Markup Language)
XML is a software- and hardware-independent tool
for storing and transporting data.
Refer: https://www.w3schools.com/xml/xml_whatis.asp
Example of XML :
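As an illustration (a minimal sketch; the <person> element name is chosen arbitrarily), the same data as the JSON object above could be written in XML as:

    <person>
      <name>John</name>
      <age>31</age>
      <city>New York</city>
    </person>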
MODULE-I DATA ANALYTICS 32
Characteristics of Semi-structured Data
Inconsistent structure
Self-describing (label/value pairs)
Other schema information is blended with the data values
Data objects may have different attributes that are not known beforehand
MODULE-I DATA ANALYTICS 33
Cookies
Cookies allow us to understand who has seen which
pages, and to determine the most popular areas of
our web site.
We also use cookies to store visitors’ preferences,
and to record session information, such as length of
visit.
Depending on the type of cookie we use, cookies also
allow us to make our web site more user-friendly; for
example, permanent cookies allow us to save your
password so that you do not have to re-enter it every
time you visit our web site.
MODULE-I DATA ANALYTICS 34
Cookies
These cookies are used to collect information
about how visitors use our site.
We use the information to compile reports and
to help us improve the site. The cookies collect
information in an anonymous form, including
the number of visitors to the site , where
visitors have come to the site from and the
pages they visited.
MODULE-I DATA ANALYTICS 35
Web Data
It refers to the data that is publicly available on
websites.
Web data includes documents in PDF, DOC, and plain text,
as well as images, music, and videos.
The most widely used and best-known source of big
data today is the detailed data collected from web
sites.
The data is unstructured and inappropriate for access
by software applications, and hence is converted to
either a semi-structured or structured format that is
well suited for both humans and machines.
MODULE-I DATA ANALYTICS 36
Unstructured Data
Unstructured data is a set of data that might or might not have any logical or
repeating patterns and is not recognized in a pre-defined manner.
About 80 percent of enterprise data consists of unstructured content.
Unstructured data:
Typically consists of metadata, i.e., additional information related to data.
Comprises inconsistent data, such as data obtained from files, social
media websites, satellites, etc.
Consists of data in different formats, such as e-mails, text, audio, video, or
images.
Sources: body of emails; chats and text messages; text both internal and external
to the organization; mobile data; social media data; images, audio, and videos.
MODULE-I DATA ANALYTICS 37
Unstructured Data
The CCTV footage in a supermarket is thoroughly
analyzed to identify:
The routes customers take to navigate through the store.
Customer behaviour during a bottleneck in store traffic.
Places where customers typically halt while shopping.
This unstructured data is combined with the structured data,
comprising the details obtained from the bill counters:
products sold, the amount, the nature of payment, etc.
This helps the management to provide a pleasant
shopping experience to customers as well as improve
sales figures.
MODULE-I DATA ANALYTICS 38
MODULE-I DATA ANALYTICS 39
Challenges associated with
Unstructured data
Working with unstructured data poses certain challenges, which are as follows:
Identifying the unstructured data that can be processed.
Sorting, organizing, and arranging unstructured data in different sets and
formats.
Combining and linking unstructured data in a more structured format to
derive any logical conclusions out of the available information.
Costs, in terms of storage space and the human resources needed to deal with the
exponential growth of unstructured data.
Data Analysis of Unstructured Data
The complexity of unstructured data lies within the language that created it. Human
language is quite different from the language used by machines, which prefer
structured information. Unstructured data analysis refers to the process of
analyzing data objects that do not follow a predefined data model and/or are
unorganized. It is the analysis of any data that is stored over time within an
organizational data repository without any intent for its orchestration, pattern, or
categorization.
MODULE-I DATA ANALYTICS 40
Dealing with Unstructured data
Techniques for dealing with unstructured data:
Data Mining (DM)
Natural Language Processing (NLP)
Text Analytics (TA)
Noisy Text Analytics
MODULE-I DATA ANALYTICS 41
Velocity (Speed)
MODULE-I DATA ANALYTICS 42
Velocity
eBay analyzes around 5 million transactions per day in real
time to detect and prevent fraud arising from the use of
PayPal.
Velocity also shows in social media messages going viral in
minutes and in the speed at which credit card transactions
are checked for fraudulent activities.
Big data technology now allows us to analyze the data
while it is being generated, without ever putting it into
databases.
MODULE-I DATA ANALYTICS 43
Real-Time Analytics
MODULE-I DATA ANALYTICS 44
Veracity
Veracity refers to the messiness or trustworthiness
(quality) of data. With many forms of big data,
quality and accuracy are less controllable – for
example, Twitter posts with hashtags,
abbreviations, and typos.
Big data and analytics technology now allows
us to work with these types of data. The
volumes often make up for the lack of quality
or accuracy.
MODULE-I DATA ANALYTICS 45
Veracity
It refers to the uncertainty of data –
is the data correct and consistent?
Big data is messy in nature – it arrives in
unstructured and semi-structured forms.
The data must be cleaned before further analysis.
MODULE-I DATA ANALYTICS 46
Value
But all the volumes of fast-moving data of
different variety and veracity have to be
turned into value!
This is why value is the one V of big data that
matters the most.
MODULE-I DATA ANALYTICS 47
Value
Value is defined as the usefulness of data for an
enterprise.
The value characteristic is intuitively related to the
veracity characteristic in that the higher the data
fidelity, the more value it holds for the business.
Value is also dependent on how long data processing
takes, because analytics results have a shelf-life; for
example, a stock quote that is 20 minutes old has little to
no value for making a trade compared to a quote that
is 20 milliseconds old.
Data that has high veracity and can be analyzed quickly
has more value to a business.
MODULE-I DATA ANALYTICS 48
The Vs (Extended)
MODULE-I DATA ANALYTICS 49
The 7 Vs
Visualization – representation of data: data clustering, or using tree maps,
sunbursts, parallel coordinates, circular network diagrams, or cone trees.
MODULE-I DATA ANALYTICS 50
Definition of Big Data
Big Data is high-volume, high-velocity,
and high-variety information assets that
demand cost-effective, innovative forms
of information processing for enhanced
insight and decision making.
Source: Gartner IT Glossary
MODULE-I DATA ANALYTICS 51
What is Big Data?
Think of the following:
Every second, there are around 822 tweets on Twitter.
Every minute, nearly 510 comments are posted, 293K statuses are updated,
and 136K photos are uploaded on Facebook.
Every hour, Walmart, a global discount department store chain, handles more
than 1 million customer transactions.
Every day, consumers make around 11.5 million payments by using PayPal.
In the digital world, data is increasing rapidly because of the ever-increasing use of
the internet, sensors, and heavy machines. The sheer volume, variety, velocity, and
veracity of such data is signified by the term ‘Big Data’.
Structured data + semi-structured data + unstructured data → Big Data
MODULE-I DATA ANALYTICS 52
Why Big Data?
More data for analysis will result in greater analytical accuracy and greater
confidence in the decisions based on the analytical findings. This entails a
greater positive impact in terms of enhancing operational efficiencies, reducing cost and
time, innovating on new products and new services, and optimizing existing services.
More data
→ More accurate analysis
→ Greater confidence in decision making
→ Greater operational efficiencies, cost reduction, time reduction,
new product development, and optimized offerings, etc.
MODULE-I DATA ANALYTICS 53
Challenges of Traditional Systems
The main challenge for traditional computing systems in managing ‘Big Data’ is the
immense speed and volume at which it is generated. Some of the challenges are:
The traditional approach cannot work on unstructured data efficiently.
The traditional approach is built on top of the relational data model; relationships
between the subjects of interest have been created inside the system, and the
analysis is done based on them. This approach is not adequate for big data.
The traditional approach is batch-oriented, and one needs to wait for nightly ETL
(extract, transform and load) and transformation jobs to complete before the
required insight is obtained.
Traditional data management, warehousing, and analysis systems fail to analyze
this type of data. Due to its complexity, big data is processed with parallelism.
Parallelism in a traditional system is achieved through costly hardware like
MPP (Massively Parallel Processing) systems.
Inadequate support for aggregated summaries of data.
MODULE-I DATA ANALYTICS 54
Challenges of Traditional Systems
cont’d
Other challenges can be categorized as:
Data Challenges:
Volume, velocity, veracity, variety
Data discovery and comprehensiveness
Scalability
Process challenges
Capturing Data
Aligning data from different sources
Transforming data into suitable form for data analysis
Modeling data (mathematical modeling, simulation)
Management Challenges:
Security
Privacy
Governance
Ethical issues
MODULE-I DATA ANALYTICS 55
Evolution of Analytics Scalability
As the amount of data organizations process continues to
increase, the world of big data requires new levels of
scalability. Organizations need to update their technology to
provide a higher level of scalability.
Luckily, there are multiple technologies available that address
different aspects of the process of taming big data and making
use of it in analytic processes.
The technologies are:
MPP (massively parallel processing)
Cloud computing (Appendix)
Grid computing
MapReduce (Hadoop)
MODULE-I DATA ANALYTICS 56
Traditional Analytics Architecture
[Diagram: Database 1, Database 2, Database 3, …, Database n → Extract → Analytic Server]
The heavy processing occurs in the analytic environment. This may even be a PC.
MODULE-I DATA ANALYTICS 57
Modern In-Database Analytics Architecture
[Diagram: Database 1 … Database n → Consolidate → Enterprise Data Warehouse (EDW);
the Analytic Server submits requests to the EDW]
In an in-database environment, the processing stays in the database where the data
has been consolidated. EDWs collect and aggregate data from multiple sources, acting
as a repository for most or all organizational data to facilitate broad access and analysis.
The user’s machine just submits the request; it doesn’t do the heavy lifting.
MODULE-I DATA ANALYTICS 58
Distributed vs. Parallel Computing
Parallel Computing | Distributed Computing
Shared memory system | Distributed memory system
Multiple processors share a single bus and memory unit | Autonomous computer nodes connected via a network
Inter-processor communication on the order of Tbps | Inter-node communication on the order of Gbps
Limited scalability | Better scalability, and cheaper
Note: distributed computing in a local network is called cluster computing; distributed
computing in a wide-area network is called grid computing.
MODULE-I DATA ANALYTICS 59
EDW & MPP
Enterprise Data Warehouse: An enterprise data warehouse (EDW) is a database,
or collection of databases, that centralizes a business's information from
multiple sources and applications, and makes it available for analytics and use
across the organization. EDWs can be housed in an on-premise server or in the
cloud. The data stored in this type of digital warehouse can be one of a business’s
most valuable assets, as it represents much of what is known about the business, its
employees, its customers, and more.
Massively Parallel Processing (MPP): It is a storage structure designed to handle
the coordinated processing of program operations by multiple processors. This
coordinated processing can work on different parts of a program, with each
processor using its own operating system and memory. This allows MPP
databases to handle massive amounts of data and provide much faster analytics
based on large datasets.
MODULE-I DATA ANALYTICS 60
MPP Analytics Architecture
Massively parallel processing (MPP) database systems are the most mature, proven, and
widely deployed mechanism for storing and analyzing large amounts of data. An MPP
database spreads data out into independent pieces managed by independent
storage and central processing unit (CPU) resources. Conceptually, it is like having
pieces of data loaded onto multiple network-connected personal computers
around a house. The data in an MPP system gets split across a variety of disks managed
by a variety of CPUs spread across a number of servers.
Instead of a single overloaded database (a single overloaded server), an MPP database
breaks the data into independent chunks with independent disk and CPU
(multiple lightly loaded servers).
MODULE-I DATA ANALYTICS 61
MPP Database Example
[Diagram: a one-terabyte table is split into ten 100-gigabyte chunks. A traditional
database queries the one-terabyte table one row at a time; an MPP database runs
10 simultaneous 100-gigabyte queries.]
An MPP database is based on the principle of SHARE THE WORK!
•An MPP database spreads data out across multiple sets of CPU and disk space.
•This allows much faster query execution, since many independent smaller queries run
simultaneously instead of just one big query.
•If more processing power and more speed are required, just bolt on additional capacity
in the form of additional processing units.
•MPP systems build in redundancy to make recovery easy, and have resource
management tools to manage the CPU and disk space.
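To illustrate the share-the-work principle, here is a minimal sketch (plain Python using multiprocessing; purely illustrative of the idea, not an MPP database engine) in which a total is computed as ten independent partial queries over data chunks and then consolidated:

    from multiprocessing import Pool

    # Hypothetical table split into 10 independent chunks of (row_id, amount) rows
    chunks = [[(i, i * 0.5) for i in range(start, start + 1000)]
              for start in range(0, 10000, 1000)]

    def partial_sum(chunk):
        # Each worker scans only its own chunk (its own "disk + CPU")
        return sum(amount for _, amount in chunk)

    if __name__ == "__main__":
        with Pool(processes=len(chunks)) as pool:
            partials = pool.map(partial_sum, chunks)   # 10 smaller queries run in parallel
        total = sum(partials)                          # results are consolidated
        print(total)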
MODULE-I DATA ANALYTICS 62
MPP Database Example cont’d
An MPP system breaks the job into pieces and allows the different sets of CPU and
disk to run the process concurrently.
[Diagram: single-threaded process vs. parallel process]
MODULE-I DATA ANALYTICS 63
Grid Computing
Grid Computing can be defined as a network of computers working
together to perform a task that would rather be difficult for a single
machine.
The task that they work on may include analyzing huge datasets or
simulating situations which require high computing power.
Computers on the network contribute resources like processing
power and storage capacity to the network.
Grid Computing is a subset of distributed computing, where a
virtual supercomputer comprises machines on a network
connected by some bus, mostly Ethernet or sometimes the Internet.
It can also be seen as a form of parallel computing where,
instead of many CPU cores on a single machine, it contains
multiple cores spread across various locations.
MODULE-I DATA ANALYTICS 64
How does Grid Computing work?
MODULE-I DATA ANALYTICS 65
Hadoop
Hadoop is an open-source project of the Apache Foundation. Apache
Hadoop is written in Java and is a collection of open-source software utilities
that facilitate using a network of many computers to solve problems involving
massive amounts of data and computation. It provides a software
framework for distributed storage and processing of big data and uses
Google’s MapReduce and Google File System as its foundation.
Hadoop
Apache open-source software framework
Inspired by:
- Google MapReduce
- Google File System
Hadoop provides various tools and technologies, collectively termed the Hadoop
ecosystem, to enable the development and deployment of Big Data solutions. It
accomplishes two tasks, namely: i) massive data storage, and ii) faster data
processing.
66
Flood of data/ Source of Big Data
A few statistics give an idea of the data that gets generated every day, every minute, and
every second.
Every day
NYSE generates 1.5 billion shares and trade data
Facebook stores 2.7 billion comments and likes
Google processes about 24 petabytes of data
Every minute
Facebook users share nearly 2.5 million pieces of content
Amazon generates over $80,000 in online sales
Twitter users tweet nearly 300,000 times
Instagram users post nearly 220,000 new photos
Apple users download nearly 50,000 apps
Email users send over 2000 million messages
YouTube users upload 72 hours of new video content
Every second
Banking applications process more than 10,000 credit card transactions.
67
Data Challenges
To process, analyze, and make sense of these different kinds of data, a system is
needed that scales and addresses the challenges shown below:
“I am flooded with data.” How to store terabytes of mounting data?
“I have data in various sources. I have data that is rich in variety – structured,
semi-structured and unstructured.” How to work with data that is so very different?
“I need this data to be processed quickly. My decision is pending.” How to access
the information quickly?
68
Why Hadoop
Its capability to handle massive amounts of data, and different categories of data,
fairly quickly.
Considerations
69
Hadoop History
Hadoop was created by Doug Cutting, the creator of Apache Lucene (a text search
library). Hadoop was part of Apache Nutch (an open-source web search engine, a
Yahoo! project) and also part of the Lucene project. The name Hadoop is not an
acronym; it’s a made-up name.
70
Key Aspects of Hadoop
71
Hadoop Components
72
Hadoop Components cont’d
Hadoop Core Components:
HDFS
Storage component
Distributes data across several nodes
Natively redundant
MapReduce
Computational Framework
Splits a task across multiple nodes
Processes data in parallel
Hadoop Ecosystems: These are support projects to enhance the functionality
of Hadoop Core components. The projects are as follows:
Hive, Pig, Sqoop, Flume, Oozie, Mahout, HBase
73
Hadoop Ecosystem
Data Management
Data Access
Data Processing
Data Storage
74
Version of Hadoop
There are 3 versions of Hadoop available: Hadoop 1.x, Hadoop 2.x, and Hadoop 3.x.
YARN (Yet Another Resource Negotiator) is the resource management (allocating
resources to various applications) and job/task scheduling technology.
Hadoop 1.x vs. Hadoop 2.x
Hadoop 1.x: MapReduce (data processing & resource management) running on
HDFS (distributed file storage; redundant, reliable storage).
Hadoop 2.x: MapReduce and other data processing frameworks (data processing)
running on YARN (resource management), which runs on HDFS2 (distributed file
storage; redundant, highly-available, reliable storage).
75
Hadoop 2.x vs. Hadoop 3.x
Characteristics | Hadoop 2.x | Hadoop 3.x
Minimum supported version of Java | Java 7 | Java 8
Fault tolerance | Handled by replication (which wastes space) | Handled by erasure coding
Data balancing | Uses HDFS balancer | Uses intra-data-node balancer, invoked via the HDFS disk balancer CLI
Storage scheme | Uses 3x replication scheme, e.g., if there are 6 blocks, 18 blocks occupy space because of the replication scheme | Supports erasure encoding in HDFS, e.g., if there are 6 blocks, 9 blocks occupy space: 6 blocks and 3 for parity
Scalability | Scales up to 10,000 nodes per cluster | Scales to more than 10,000 nodes per cluster
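To make the storage-scheme row concrete, here is a small back-of-the-envelope sketch (assuming 128 MB blocks and the 6-data/3-parity erasure-coding layout implied by the table):

    BLOCK_MB = 128               # default HDFS block size in Hadoop 2.x/3.x
    data_blocks = 6

    # Hadoop 2.x: 3x replication -> every block is stored three times
    replicated_blocks = data_blocks * 3

    # Hadoop 3.x: erasure coding with 6 data blocks + 3 parity blocks
    parity_blocks = 3
    erasure_blocks = data_blocks + parity_blocks

    print(f"3x replication : {replicated_blocks} blocks = {replicated_blocks * BLOCK_MB} MB (200% overhead)")
    print(f"Erasure coding : {erasure_blocks} blocks = {erasure_blocks * BLOCK_MB} MB (50% overhead)")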
76
High Level Hadoop 2.0 Architecture
Hadoop has a distributed master-slave architecture, combining distributed data
storage (HDFS) with distributed data processing (YARN); clients interact with both.
HDFS master node: Active NameNode, Standby NameNode, Secondary NameNode
YARN master node: Resource Manager
HDFS slave nodes: DataNode 1 … DataNode n
YARN slave nodes: Node Manager 1 … Node Manager n
High Level Hadoop 2.0 Architecture cont’d
[Diagram: across the cluster, YARN runs one Resource Manager plus a Node Manager
on each node, while HDFS runs one NameNode plus a DataNode on each node]
78
Hadoop HDFS
The Hadoop Distributed File System (HDFS) is the primary data storage
system used by Hadoop applications.
HDFS holds a very large amount of data and employs a NameNode and
DataNode architecture to implement a distributed file system that provides
high-performance access to data across highly scalable Hadoop clusters.
To store such huge data, the files are stored across multiple machines.
These files are stored in redundant fashion to rescue the system from
possible data losses in case of failure.
It runs on commodity hardware.
Unlike other distributed systems, HDFS is highly fault-tolerant and
designed using low-cost hardware.
79
Hadoop HDFS Key points
Some key points of HDFS are as follows:
1. Storage component of Hadoop.
2. Distributed File System.
3. Modeled after Google File System.
4. Optimized for high throughput (HDFS leverages large block size and
moves computation where data is stored).
5. One can replicate a file a configurable number of times, which makes it tolerant in
terms of both software and hardware failures.
6. Re-replicates data blocks automatically when nodes fail.
7. Sits on top of the native file system.
80
HDFS Physical Architecture
Key components of HDFS are as follows:
1. NameNode 3. Secondary NameNode
2. DataNodes 4. Standby NameNode
Blocks: Generally, user data is stored in the files of HDFS. HDFS breaks a
large file into smaller pieces called blocks. In other words, the minimum
amount of data that HDFS can read or write is called a block. By default,
the block size is 128 MB in Hadoop 2.x and 64 MB in Hadoop 1.x, but it can
be changed as needed in the HDFS configuration.
Example – a 200 MB file (abc.txt): in Hadoop 2.x it is split into Block 1 = 128 MB
and Block 2 = 72 MB; in Hadoop 1.x, with 64 MB blocks, it would be split into
four blocks (64 + 64 + 64 + 8 MB).
Why is the block size large?
1. To reduce the cost of seek time, and 2. For proper usage of storage space.
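A minimal sketch (illustrative Python, not actual HDFS code) of how a file of a given size is divided into blocks:

    def split_into_blocks(file_size_mb, block_size_mb=128):
        # Return the list of block sizes a file would occupy in HDFS
        blocks = []
        remaining = file_size_mb
        while remaining > 0:
            blocks.append(min(block_size_mb, remaining))
            remaining -= block_size_mb
        return blocks

    print(split_into_blocks(200, 128))   # Hadoop 2.x -> [128, 72]
    print(split_into_blocks(200, 64))    # Hadoop 1.x -> [64, 64, 64, 8]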
81
Rack
A rack is a collection of 30 or 40 nodes that are physically stored close together
and are all connected to the same network switch. The network bandwidth between
any two nodes in a rack is greater than the bandwidth between two nodes on
different racks. A Hadoop cluster is a collection of racks.
[Diagram: Rack 1, Rack 2, …, Rack N — each rack contains Node 1 … Node N connected
to the rack’s own switch, and the rack switches connect to a common switch]
82
NameNode
1. NameNode is the centerpiece of HDFS.
2. NameNode is also known as the Master.
3. NameNode only stores the metadata of HDFS – the directory tree of all files in the
file system, and tracks the files across the cluster.
4. NameNode does not store the actual data or the dataset. The data itself is actually
stored in the DataNodes
5. NameNode knows the list of the blocks and its location for any given file in HDFS.
With this information NameNode knows how to construct the file from blocks.
6. NameNode is usually configured with a lot of memory (RAM).
7. NameNode is so critical to HDFS that when the NameNode is down, the HDFS/Hadoop
cluster is inaccessible and considered down.
8. NameNode is a single point of failure in Hadoop cluster.
Configuration
Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 128 GB
Disk: 6 x 1TB SATA
Network: 10 Gigabit Ethernet
83
NameNode Metadata
1. Metadata stored about the file consists of file name, file path, number of
blocks, block Ids, replication level.
2. This metadata information is stored on the local disk. Namenode uses two
files for storing this metadata information.
FsImage EditLog
3. The NameNode in HDFS also keeps in its memory the location of the DataNodes
that store the blocks for any given file. Using that information, the NameNode
can reconstruct the whole file by getting the location of all the blocks
of a given file.
Example
(File Name, numReplicas, rack-ids, machine-ids, block-ids, …)
/user/in4072/data/part-0, 3, r:3, M3, {1, 3}, …
/user/in4072/data/part-1, 3, r:2, M1, {2, 4, 5}, …
/user/in4072/data/part-2, 3, r:1, M2, {6, 9, 8}, …
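A rough sketch (plain Python with illustrative field names; not the actual FsImage format) of the kind of per-file metadata the NameNode keeps in memory and how it locates a file's blocks:

    # Hypothetical in-memory view of NameNode metadata for one file
    namenode_metadata = {
        "/user/in4072/data/part-0": {
            "replication": 3,
            "blocks": [1, 3],                     # block ids that make up the file
            "block_locations": {                  # block id -> DataNodes holding a replica
                1: ["datanode-1", "datanode-4", "datanode-7"],
                3: ["datanode-2", "datanode-5", "datanode-8"],
            },
        },
    }

    # Reconstructing the file means reading its blocks, in order, from any replica
    file_info = namenode_metadata["/user/in4072/data/part-0"]
    for block_id in file_info["blocks"]:
        replicas = file_info["block_locations"][block_id]
        print(f"block {block_id}: read from one of {replicas}")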
84
DataNode
1. DataNode is responsible for storing the actual data in HDFS.
2. DataNode is also known as the Slave
3. NameNode and DataNode are in constant communication.
4. When a DataNode starts up, it announces itself to the NameNode along with
the list of blocks it is responsible for.
5. When a DataNode is down, it does not affect the availability of data or the
cluster. NameNode will arrange for replication for the blocks managed
by the DataNode that is not available.
6. DataNode is usually configured with a lot of hard disk space. Because the
actual data is stored in the DataNode.
Configuration
Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 64 GB
Disk: 12-24 x 1TB SATA
Network: 10 Gigabit Ethernet
85
Secondary NameNode
1. The Secondary NameNode in Hadoop is more of a helper to the NameNode; it is not
a backup NameNode server which can quickly take over in case of
NameNode failure.
2. EditLog– All the file write operations done by client applications are first
recorded in the EditLog.
3. FsImage– This file has the complete information about the file system
metadata when the NameNode starts. All the operations after that are
recorded in EditLog.
4. When the NameNode is restarted, it first takes the metadata information
from the FsImage and then applies all the transactions recorded in the
EditLog. A NameNode restart doesn’t happen that frequently, so the EditLog
grows quite large. That means merging the EditLog into the FsImage at the time of
startup takes a lot of time, keeping the whole file system offline during that
process.
5. The Secondary NameNode takes over this job of merging the FsImage and EditLog and
keeps the FsImage current, saving a lot of time. Its main function is to checkpoint the file
system metadata stored on the NameNode.
Secondary NameNode cont’d
The process followed by Secondary NameNode to periodically merge the
fsimage and the edits log files is as follows:
1.Secondary NameNode pulls the latest FsImage and EditLog files from the
primary NameNode.
2.Secondary NameNode applies each transaction from EditLog file to FsImage to
create a new merged FsImage file.
3.Merged FsImage file is transferred back to primary NameNode.
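A toy sketch of the merge the Secondary NameNode performs (illustrative Python; the real FsImage and EditLog are binary files maintained by HDFS):

    # FsImage: snapshot of file-system metadata; EditLog: operations recorded since then
    fsimage = {"/data/a.txt": {"replication": 3}, "/data/b.txt": {"replication": 3}}
    editlog = [
        ("create", "/data/c.txt", {"replication": 3}),
        ("delete", "/data/b.txt", None),
    ]

    def checkpoint(fsimage, editlog):
        # Apply every logged transaction to the FsImage and return the merged image
        merged = dict(fsimage)
        for op, path, meta in editlog:
            if op == "create":
                merged[path] = meta
            elif op == "delete":
                merged.pop(path, None)
        return merged

    new_fsimage = checkpoint(fsimage, editlog)   # sent back to the NameNode; the EditLog can then shrink
    print(new_fsimage)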
[Diagram: “It’s been an hour, provide your metadata” — (1) the Secondary NameNode
pulls the FsImage and EditLog from the NameNode, (2) merges them into a new FsImage,
and (3) transfers the merged FsImage back to the NameNode]
87
Standby NameNode
With Hadoop 2.0, built into the platform, HDFS now has automated failover
with a hot standby, with full stack resiliency.
1.Automated Failover: Hadoop pro-actively detects NameNode host and
process failures and will automatically switch to the standby NameNode to
maintain availability for the HDFS service. There is no need for human
intervention in the process – System Administrators can sleep in peace!
2.Hot Standby: Both Active and Standby NameNodes have up to date HDFS
metadata, ensuring seamless failover even for large clusters – which means no
downtime for your HDP cluster!
3.Full Stack Resiliency: The entire Hadoop stack (MapReduce, Hive, Pig,
HBase, Oozie etc.) has been certified to handle a NameNode failure scenario
without losing data or the job progress. This is vital to ensure long running jobs
that are critical to complete on schedule will not be adversely affected during a
NameNode failure scenario.
88
Replication
HDFS provides a reliable way to store huge data in a distributed environment as
data blocks. The blocks are also replicated to provide fault tolerance. The
default replication factor is 3, which is configurable. Therefore, if a 128 MB file is
stored in HDFS using the default configuration, it occupies 384 MB (3 × 128 MB) of
space, as the blocks are replicated three times and each replica resides on a
different DataNode.
89
Rack Awareness
All machines in a rack are connected using the same network switch, and if that
network goes down, then all machines in that rack will be out of service. Rack
Awareness was introduced by Apache Hadoop to overcome this issue. With Rack
Awareness, the NameNode chooses a DataNode which is closer, in the same rack
or a nearby rack. The NameNode maintains the rack id of each DataNode to achieve
this rack information. Thus, this concept chooses DataNodes based on the rack
information. The NameNode in Hadoop ensures that all the replicas
are not stored on the same rack or a single rack. The default replication factor
is 3. Therefore, according to the Rack Awareness algorithm:
When the Hadoop framework creates a new block, it places the first replica on the
local node, the second one on a node in a different (remote) rack, and the third one
on a different node in the same remote rack.
When re-replicating a block, if the number of existing replicas is one, place the
second on a different rack.
When the number of existing replicas is two, and the two replicas are in the same
rack, place the third one on a different rack.
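A simplified sketch of the placement policy described above (illustrative Python, not the actual HDFS BlockPlacementPolicy; rack and node names are hypothetical):

    import random

    # Hypothetical cluster layout: rack id -> DataNodes in that rack
    racks = {
        "r1": ["dn1", "dn2", "dn3"],
        "r2": ["dn4", "dn5", "dn6"],
        "r3": ["dn7", "dn8", "dn9"],
    }

    def place_replicas(local_rack, local_node):
        # 1st replica: on the local (writer's) node
        first = local_node
        # 2nd replica: on a node in a different (remote) rack
        remote_rack = random.choice([r for r in racks if r != local_rack])
        second = random.choice(racks[remote_rack])
        # 3rd replica: on a different node in the same remote rack
        third = random.choice([n for n in racks[remote_rack] if n != second])
        return [first, second, third]

    print(place_replicas("r1", "dn1"))   # e.g. ['dn1', 'dn5', 'dn6']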
90
Rack Awareness & Replication
[Diagram: a file with blocks B1, B2, and B3 (replication factor 3) spread across DataNodes
in Rack 1, Rack 2, and Rack 3 — no block has all of its replicas on a single rack]
91
Hadoop Ecosystem
Following are the components that collectively form a Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming based Data Processing
Spark: In-memory data processing
PIG, HIVE: query based processing of data services
HBase: NoSQL Database
Mahout, Spark MLLib: Machine Learning algorithm libraries
Solr, Lucene: Searching and Indexing
Zookeeper: Managing cluster
Oozie: Job Scheduling
Sqoop: Data transfer between Hadoop and RDBMS or mainframes
HCatalog: Metadata services
92
Hadoop Ecosystem cont…
PIG
It was developed by Yahoo. It works on the Pig Latin language, a query-based
language similar to SQL.
It is a platform for structuring the data flow, and for processing and analyzing huge
data sets.
Pig does the work of executing commands, and in the background all the activities
of MapReduce are taken care of. After the processing, Pig stores the result in
HDFS.
The Pig Latin language is specially designed for this framework and runs on Pig
Runtime, just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
HBase
It is a NoSQL database which supports all kinds of data and is thus capable of handling
anything in a Hadoop database. It provides the capabilities of Google’s BigTable, and is
thus able to work on Big Data sets effectively.
At times when we need to search or retrieve the occurrences of something small in
a huge database, the request must be processed within a short span of time. At
such times, HBase comes in handy, as it gives us a tolerant way of storing limited data.
93
Hadoop Ecosystem cont…
HBase is a distributed column-oriented database built on top
of the Hadoop file system. It is an open-source project and is
horizontally scalable.
HBase is a data model similar to Google’s BigTable,
designed to provide quick random access to huge amounts of
structured data.
Storage Mechanism :
Table is a collection of rows.
Row is a collection of column families.
Column family is a collection of columns.
Column is a collection of key value pairs.
HBase maps (rowkey, column family, column, timestamp) to a value.
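A rough sketch (plain Python, not the HBase client API) of this storage mechanism, where a value is addressed by (row key, column family, column, timestamp) and a read returns the latest version by default:

    # Hypothetical HBase-style cell map: (rowkey, column_family, column, timestamp) -> value
    table = {
        ("user1", "info", "name", 1700000000): "John",
        ("user1", "info", "name", 1700000500): "Johnny",    # newer version of the same cell
        ("user1", "stats", "logins", 1700000100): "42",
    }

    def get_latest(table, rowkey, cf, col):
        # Return the most recent version of a cell (highest timestamp)
        versions = [(ts, val) for (rk, f, c, ts), val in table.items()
                    if (rk, f, c) == (rowkey, cf, col)]
        return max(versions)[1] if versions else None

    print(get_latest(table, "user1", "info", "name"))   # -> 'Johnny'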
MODULE-I DATA ANALYTICS 94
Hadoop Ecosystem cont…
HBase is schema-less; it does not have the concept of a fixed
column schema and defines only column families.
HBase is a database built on top of the HDFS.
HBase provides fast lookups for larger tables (random access).
HBase is horizontally scalable.
MODULE-I DATA ANALYTICS 95
Hadoop Ecosystem cont…
HIVE
With the help of SQL methodology and interface, HIVE performs reading and
writing of large data sets. However, its query language is called HQL (Hive Query
Language).
It is highly scalable, as it allows both real-time processing and batch
processing. Also, all the SQL data types are supported by Hive, making query
processing easier.
Similar to other query-processing frameworks, HIVE comes with two components:
JDBC Drivers and the HIVE Command Line. JDBC, along with ODBC drivers, works on
establishing data storage permissions and connections, whereas the HIVE Command
Line helps in the processing of queries.
Sqoop
It is a tool designed to transfer data between Hadoop and relational databases.
It is used to import data from relational databases such as MySQL and Oracle to
Hadoop HDFS, and to export from the Hadoop file system to relational databases.
96
MapReduce
1. MapReduce is a processing technique and a program model for distributed
computing based on Java. It is built on the divide-and-conquer paradigm.
2. In MapReduce programming, the input dataset is split into independent
chunks.
3. It contains two important tasks, namely Map and Reduce.
4. Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs). The processing
primitive is called the mapper. The processing is done in a parallel manner. The
output produced by the map tasks serves as intermediate data and is stored on
the local disk of that server.
5. Reduce task takes the output from a map as an input and combines those
data tuples into a smaller set of tuples. The processing primitive is called
reducer. The input and output are stored in a file system.
6. Reduce task is always performed after the map job.
7. The major advantage of MapReduce is that it is easy to scale data processing
over multiple computing nodes and takes care of other tasks such as scheduling,
monitoring, re-executing failed tasks etc. 97
MapReduce cont’d
98
MapReduce cont’d
The main advantage is that once we write an application in the MapReduce
form, scaling the application to run over hundreds, thousands, or even tens
of thousands of machines in a cluster requires merely a configuration change.
A MapReduce program executes in three stages: the map stage, the shuffle &
sort stage, and the reduce stage.
Map Stage: The map or mapper’s job is to process the input data. Generally
the input data is in the form of a file or directory and is stored in the Hadoop
file system (HDFS). The input file is passed to the mapper function line by
line. The mapper processes the data and creates several small chunks of
data.
Shuffle & Sort Stage: The shuffle phase in Hadoop transfers the map output
from the Mapper to a Reducer in MapReduce. The sort phase in MapReduce covers
the merging and sorting of map outputs.
Reduce Stage: The Reducer’s job is to process the data that comes from
the mapper. After processing, it produces a new set of output, which will be
stored in HDFS.
99
MapReduce: The Big Picture
MODULE-I DATA ANALYTICS 100
How Does MapReduce Work?
At the crux of MapReduce are two functions: Map and Reduce. They are
sequenced one after the other.
The Map function takes input from the disk as <key,value> pairs, processes
them, and produces another set of intermediate <key,value> pairs as output.
The Reduce function also takes inputs as <key,value> pairs, and produces
<key,value> pairs as output.
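As a concrete illustration, here is a minimal word-count sketch in Python in the spirit of Hadoop Streaming — a simplified local simulation of the map, shuffle & sort, and reduce stages, not the Java MapReduce API:

    from collections import defaultdict

    def mapper(line):
        # Map: emit an intermediate <word, 1> pair for every word of the input line
        for word in line.strip().split():
            yield (word.lower(), 1)

    def reducer(word, counts):
        # Reduce: combine all values for a single key into one <word, total> pair
        return (word, sum(counts))

    lines = ["big data analytics", "data analytics with hadoop", "big data"]

    # Map stage: run the mapper over every input line
    intermediate = [pair for line in lines for pair in mapper(line)]

    # Shuffle & sort stage: group all intermediate values by key
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)

    # Reduce stage: one reducer call per key
    for word in sorted(groups):
        print(reducer(word, groups[word]))
    # ('analytics', 2) ('big', 2) ('data', 3) ('hadoop', 1) ('with', 1)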
101
MapReduce Example
MODULE-I DATA ANALYTICS 102
More examples
Draw the MapReduce process to generate the total sales
MODULE-I DATA ANALYTICS 103
More examples
MODULE-I DATA ANALYTICS 104
Example Contd…
MODULE-I DATA ANALYTICS 105
Example Continued…
MODULE-I DATA ANALYTICS 106
Working of MapReduce
The types of keys and values differ based on the use case. All inputs and outputs
are stored in the HDFS. While the map is a mandatory step to filter and sort the
initial data, the reduce function is optional.
<k1, v1> -> Map() -> list(<k2, v2>)
<k2, list(v2)> -> Reduce() -> list(<k3, v3>)
Mappers and Reducers are the Hadoop servers that run the Map and Reduce
functions respectively. It doesn’t matter if these are the same or different
servers.
Map: The input data is first split into smaller blocks. Each block is then
assigned to a mapper for processing. For example, if a file has 100 records to be
processed, 100 mappers can run together to process one record each. Or maybe
50 mappers can run together to process two records each. The Hadoop
framework decides how many mappers to use, based on the size of the data to
be processed and the memory block available on each mapper server.
107
Working of MapReduce cont’d
Reduce: After all the mappers complete processing, the framework shuffles
and sorts the results before passing them on to the reducers. A reducer
cannot start while a mapper is still in progress. All the map output values
that have the same key are assigned to a single reducer, which then
aggregates the values for that key.
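For instance, a minimal sketch (with hypothetical keys and values) of how every value sharing a key reaches a single reducer, which then aggregates them — here by taking a maximum:

    # Shuffled map output: all values for a given key are routed to the same reducer
    shuffled = {
        "sensor-A": [21, 25, 19],
        "sensor-B": [30, 28],
    }

    # Each reducer aggregates the full list of values for its key
    for key, values in shuffled.items():
        print(key, max(values))   # sensor-A 25, sensor-B 30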
Class Exercise 1: Draw the MapReduce process to count the number of words for the input:
Dog Cat Rat
Car Car Rat
Dog car Rat
Rat Rat Rat
Class Exercise 2: Draw the MapReduce process to find the maximum electrical consumption
for each year (input: a table of years and consumption values).
108
Hadoop Limitations
Not fit for small data: Hadoop is not suited to small data. HDFS lacks the ability
to efficiently support the random reading of small files because of its high-capacity
design. The solution to this drawback is simple: just merge the small files to create
bigger files and then copy the bigger files to HDFS.
Security concerns: Hadoop is challenging to manage as a complex application. If
the user who is managing the platform does not know how to enable its security, data
can be at huge risk. At the storage and network levels, Hadoop is missing encryption,
which is a major point of concern. Hadoop supports Kerberos authentication, which
is hard to manage. Spark provides a security bonus to overcome the limitations of
Hadoop.
Vulnerable by nature: Hadoop is entirely written in Java, one of the most widely
used languages; hence Java has been heavily exploited by cybercriminals and, as a result,
implicated in numerous security breaches.
No caching: Hadoop is not efficient for caching. In Hadoop, MapReduce cannot
cache the intermediate data in memory for further requirements, which
diminishes the performance of Hadoop. Spark can overcome this limitation.
109
NoSQL
A NoSQL database stands for "Not Only SQL" or "Not SQL."
It is a non-relational database that does not require a fixed schema and avoids joins.
It is used for distributed data stores and is specifically targeted at big data, for
example at Google or Facebook, which collect terabytes of data every day for their
users.
A traditional RDBMS uses SQL syntax to store and retrieve data for further insights.
Instead, a NoSQL database system encompasses a wide range of database
technologies that can store structured, semi-structured, and unstructured data.
It adheres to Brewer’s CAP theorem.
The tables are stored as ASCII files, and each field is separated by tabs.
The data scales horizontally.
110
NoSQL cont…
[Diagram: Database → RDBMS and NoSQL; OLAP and OLTP]
111
RDBMS vs. NoSQL
RDBMS | NoSQL
Relational database | Non-relational, distributed database
Relational model | Model-less approach
Pre-defined schema | Dynamic schema for unstructured data
Table-based databases | Document-based, graph-based, wide-column store, or key-value pair databases
Vertically scalable (by increasing system resources) | Horizontally scalable (by creating a cluster of commodity machines)
Uses SQL | Uses UnQL (Unstructured Query Language)
Not preferred for large datasets | Largely preferred for large datasets
Not the best fit for hierarchical data | Best fit for hierarchical storage, as it follows the key-value way of storing data, similar to JSON
Emphasis on ACID properties | Follows Brewer’s CAP theorem
112
RDBMS vs. NoSQL cont’d
RDBMS | NoSQL
Excellent support from vendors | Relies heavily on community support
Supports complex querying and data-keeping needs | Does not have good support for complex querying
Can be configured for strong consistency | A few support strong consistency (e.g., MongoDB); a few others can be configured for eventual consistency (e.g., Cassandra)
Examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. | Examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.
113
MODULE-I DATA ANALYTICS 114