Agenda
Big Data Systems – Characteristics: Failures; Reliability and Availability; Consistency –
Notions of Consistency.
Contents
Chapter 4 The Big Data Technology Landscape
Source: Acharya, Seema and Subhashini Chellappan, Big Data and Analytics, 2nd ed., p. 14.
4.2 Hadoop
4.2.1 Features of Hadoop
4.2.2 Key Advantages of Hadoop
4.2.3 Versions of Hadoop
4.2.4 Overview of Hadoop Ecosystems
4.2.5 Hadoop Distributions
4.2.6 Hadoop versus SQL
4.2.7 Integrated Hadoop Systems Offered by Leading Market Vendors
4.2.8 Cloud-Based Hadoop Solutions
4.2 Hadoop
• Hadoop is an open-source project of the Apache Software Foundation.
• It is a framework written in Java, originally developed by Doug Cutting in 2005, who named it after his son's toy elephant.
• He was working at Yahoo! at the time.
• It was created to support distribution of "Nutch", an open-source text search engine.
• Hadoop uses Google’s MapReduce and Google File System technologies as its
foundation.
• Hadoop is now a core part of the computing infrastructure for companies such as Yahoo!, Facebook, LinkedIn, and Twitter.
Refer Figure 4.8.
4.2.1 Features of Hadoop
Let us cite a few features of Hadoop:
1. It is optimized to handle massive quantities of structured, semi-structured, and
unstructured data, using commodity hardware, that is, relatively inexpensive
computers.
2. Hadoop has a shared-nothing architecture.
3. It replicates its data across multiple computers so that if one goes down, the data
can still be processed from another machine that stores its replica.
4. Hadoop is built for high throughput rather than low latency. It performs batch operations over massive quantities of data; therefore, the response time is not immediate.
5. It complements On-Line Transaction Processing (OLTP) and On-Line Analytical
Processing (OLAP). However, it is not a replacement for a relational database
management system.
6. It is NOT good when work cannot be parallelized or when there are dependencies
within the data.
7. It is NOT good for processing small files. It works best with huge data files and
datasets.
4.2.2 Key Advantages of Hadoop
Refer Figure 4.9 for a quick look at the key advantages of Hadoop. Some of them are as follows:
1. Stores data in its native format:
• Hadoop's data storage framework (HDFS – Hadoop Distributed File System) can store data in its native format.
• There is no structure imposed while keying in data or storing data. HDFS is pretty much schema-less.
• It is only later, when the data needs to be processed, that structure is imposed on the raw data.
2. Scalable: Hadoop can store and distribute very large datasets (involving thousands of
terabytes of data) across hundreds of inexpensive servers that operate in parallel.
3. Cost-effective: Owing to its scale-out architecture, Hadoop has a much reduced
cost/terabyte of storage and processing.
4. Resilient to failure: Hadoop is fault-tolerant. It diligently replicates data, which means that whenever data is sent to any node, the same data also gets replicated to other nodes in the cluster, thereby ensuring that in the event of a node failure there will always be another copy of the data available for use.
5. Flexibility: One of the key advantages of Hadoop is its ability to work with all kinds
of data: structured, semi-structured, and unstructured data. It can help derive
meaningful business insights from email conversations, social media data,
click-stream data, etc. It can be put to several purposes such as log analysis, data
mining, recommendation systems, market campaign analysis, etc.
6. Fast: Processing is extremely fast in Hadoop as compared to other conventional
systems owing to the “move code to data” paradigm. Hadoop has a shared-nothing
architecture.
4.2.3 Versions of Hadoop
There are two versions of Hadoop available:
1. Hadoop 1.0
2. Hadoop 2.0
Let us take a look at the features of both. Refer Figure 4.10.
4.2.3.1 Hadoop 1.0
It has two main parts:
1. Data storage framework: It is a general-purpose file system called the Hadoop Distributed File System (HDFS). HDFS is schema-less; it simply stores data files, and these data files can be in just about any format. The idea is to store files as close to their original form as possible. This in turn provides the business units and the organization the much-needed flexibility and agility without being overly worried about what they can implement.
2. Data processing framework: This is a simple functional programming model initially
popularized by Google as MapReduce.
It essentially uses two functions: the MAP and the REDUCE functions to process data.
The “Mappers” take in a set of key–value pairs and generate intermediate data (which
is another list of key–value pairs).
The “Reducers” then act on this input to produce the output data.
The two functions work seemingly in isolation from one another, thus enabling the processing to be highly distributed in a highly parallel, fault-tolerant, and scalable way.
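To make the Map and Reduce functions concrete, here is a minimal word-count sketch against the classic Hadoop Java API. This is an illustration only; the class names and the simple whitespace tokenization are our own choices, not the book's, and both classes are shown in one file for brevity.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: takes (offset, line) pairs and emits an intermediate
// (word, 1) pair for every word in the line.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // intermediate key-value pair
            }
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) and sums the counts.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}
```

Note how the mapper and reducer never communicate directly; the framework shuffles the intermediate pairs between them, which is what makes the distribution transparent.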
There were, however, a few limitations of Hadoop 1.0. They are as follows:
1. The first limitation was the requirement for MapReduce programming expertise, along with proficiency in other programming languages, notably Java.
2. It supported only batch processing, which, although suitable for tasks such as log analysis and large-scale data mining projects, is pretty much unsuitable for other kinds of projects.
3. One major limitation was that Hadoop 1.0 was computationally tightly coupled with MapReduce, which meant that the established data management vendors were left with two options: either rewrite their functionality in MapReduce so that it could be executed in Hadoop, or extract the data from HDFS and process it outside of Hadoop. Neither option was viable, as both led to process inefficiencies caused by the data being moved in and out of the Hadoop cluster.
Let us look at whether these limitations have been wholly or partly resolved by Hadoop 2.0.
4.2.3.2 Hadoop 2.0
• In Hadoop 2.0, HDFS continues to be the data storage framework.
• However, a new and separate resource management framework called Yet Another
Resource Negotiator (YARN) has been added.
• Any application capable of dividing itself into parallel tasks is supported by YARN.
• YARN coordinates the allocation of subtasks of the submitted application, thereby
further enhancing the flexibility, scalability, and efficiency of the applications.
• It works by having an ApplicationMaster in place of the erstwhile JobTracker, running
applications on resources governed by a new NodeManager (in place of the erstwhile
TaskTracker).
• ApplicationMaster is able to run any application and not just MapReduce.
• This, in other words, means that MapReduce programming expertise is no longer required.
• Furthermore, YARN supports not only batch processing but also real-time processing.
• MapReduce is no longer the only data processing option; alternative data processing functions, such as data standardization and master data management, can now be performed natively in HDFS.
4.2.4 Overview of Hadoop Ecosystems
There are components available in the Hadoop ecosystem for data ingestion,
processing, and analysis.
Data Ingestion → Data Processing → Data Analysis
Components that help with Data Ingestion are:
1. Sqoop
2. Flume
Components that help with Data Processing are:
1. MapReduce
2. Spark
Components that help with Data Analysis are:
1. Pig
2. Hive
3. Impala
HDFS
• It is the distributed storage unit of Hadoop.
• It provides streaming access to file system data as well as file permissions and
authentication. It is based on GFS (Google File System).
• It is used to scale a single cluster to hundreds or even thousands of nodes.
• It handles large datasets running on commodity hardware. HDFS is highly
fault-tolerant. It stores files across multiple machines.
• These files are stored in redundant fashion to allow for data recovery in case of
failure.
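As a hedged sketch of how a client talks to HDFS through Hadoop's Java FileSystem API; the NameNode address and paths below are placeholders, and in practice the address is usually picked up from core-site.xml:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS transparently replicates its blocks
        // across DataNodes (three copies by default).
        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("hello, hdfs\n");
        }

        // Streaming read of the same file.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}
```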
• An e-commerce website stores millions of customers' data in a distributed manner; the data has been collected over 4–5 years.
• It then runs batch analytics on the archived data to analyze customers' behavior, buying patterns, preferences, requirements, etc.
• This helps to understand which products are purchased by customers in which
months, etc.
HBase
• It stores data in HDFS.
• It is the first non-batch component of the Hadoop Ecosystem.
• It is a database on top of HDFS. It provides quick random access to the stored data.
• It has very low latency compared to HDFS.
• It is a NoSQL database: non-relational and column-oriented.
• A table can have thousands of columns.
• A table can have multiple rows.
• Each row can have several column families.
• Each column family can have several columns.
• Each column can have several key values.
• It is based on Google BigTable.
• This is widely used by Facebook, Twitter, Yahoo, etc.
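A minimal sketch of HBase's random read/write access using the standard Java client API; the table, column family, row key, and values are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customers"))) {

            // Write: row key "cust-1001", column family "profile",
            // column "email".
            Put put = new Put(Bytes.toBytes("cust-1001"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                          Bytes.toBytes("alice@example.com"));
            table.put(put);

            // Random read by row key: the low-latency access pattern
            // that HBase adds on top of HDFS.
            Get get = new Get(Bytes.toBytes("cust-1001"));
            Result result = table.get(get);
            byte[] email = result.getValue(Bytes.toBytes("profile"),
                                           Bytes.toBytes("email"));
            System.out.println(Bytes.toString(email));
        }
    }
}
```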
Difference between HBase and Hadoop/HDFS
1. HDFS is a file system, whereas HBase is a Hadoop database. The relationship is analogous to that between NTFS and MySQL.
2. HDFS is WORM (write once, read many times). The latest versions support appending of data, but this feature is rarely used. HBase, however, supports real-time random reads and writes.
3. HDFS is based on Google File System (GFS) whereas HBase is based on Google
Big Table.
4. HDFS supports only full-table or partition scans, whereas HBase supports random small-range scans as well as table scans.
5. The performance of Hive on HDFS is relatively good, but on HBase it becomes 4–5 times slower.
6. In HDFS, data is accessed only via MapReduce jobs, whereas in HBase the access is via Java APIs and REST, Avro, or Thrift APIs.
7. HDFS does not support dynamic storage owing to its rigid structure whereas
HBase supports dynamic storage.
8. HDFS has high latency operations whereas HBase has low latency operations.
9. HDFS is most suitable for batch analytics whereas HBase is for real-time
analytics.
Hadoop Ecosystem Components for Data Ingestion
1. Sqoop: Sqoop stands for SQL to Hadoop. Its main functions are:
a) Importing data from RDBMS such as MySQL, Oracle, DB2, etc. to Hadoop file
system (HDFS, HBase, Hive).
b) Exporting data from Hadoop File system (HDFS, HBase, Hive) to RDBMS (MySQL,
Oracle, DB2).
Uses of Sqoop
a) It has a connector-based architecture to allow plug-ins to connect to external
systems such as MySQL, Oracle, DB2, etc.
b) It can provision data from external systems onto HDFS and populate tables in Hive and HBase.
c) It integrates with Oozie, allowing you to schedule and automate import and export tasks.
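As a hedged illustration of both directions, the command-line invocations below use only standard Sqoop flags; the host, database, user, and table names are placeholders:

```
# Import a MySQL table into HDFS
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table customers \
  --target-dir /user/hadoop/customers

# Export processed results from HDFS back to MySQL
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table customer_summary \
  --export-dir /user/hadoop/summary
```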
2. Flume: Flume is an important log-aggregator component in the Hadoop ecosystem (it aggregates logs from different machines and places them in HDFS).
Flume was developed by Cloudera.
It is designed for high-volume ingestion of event-based data into Hadoop.
The default destination (called a "sink" in Flume parlance) is HDFS.
However, it can also write to HBase or Solr.
• Consider a bank of web servers.
• Flume moves log events from their log files into new aggregated files in HDFS for processing.
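A minimal, illustrative agent configuration for such a scenario might look as follows (the agent, source, sink, and path names are placeholders): an exec source tails the web-server access log, a memory channel buffers events, and the HDFS sink writes the aggregated files.

```
# flume-agent.properties (illustrative)
agent.sources  = weblog
agent.channels = mem
agent.sinks    = tohdfs

agent.sources.weblog.type     = exec
agent.sources.weblog.command  = tail -F /var/log/httpd/access_log
agent.sources.weblog.channels = mem

agent.channels.mem.type = memory

agent.sinks.tohdfs.type      = hdfs
agent.sinks.tohdfs.hdfs.path = hdfs://namenode/flume/weblogs
agent.sinks.tohdfs.channel   = mem
```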
Hadoop Ecosystem Components for Data Processing
1. MapReduce: It is a programming paradigm that allows distributed and parallel processing of huge datasets.
• It is based on Google's MapReduce.
• Google released a paper on the MapReduce programming paradigm in 2004, and that became the genesis of the Hadoop processing model.
• The MapReduce framework gets its input data from HDFS.
• There are two main phases: the Map phase and the Reduce phase.
• The Map phase converts the input data into another set of data (key–value pairs).
• This new intermediate dataset then serves as the input to the Reduce phase.
• The Reduce phase acts on the datasets to combine (aggregate and consolidate) and reduce them to a smaller set of tuples.
• The result is then stored back in HDFS.
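The driver below sketches how such a job is wired together and where the HDFS input and output paths come in. It assumes the hypothetical WordCountMapper and WordCountReducer classes sketched earlier; the paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map-phase and Reduce-phase classes from the earlier sketch.
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input is read from HDFS; the reduced result is written back to HDFS.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```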
2. Spark:
• It is both a programming model as well as a computing model.
• It is an open-source big data processing framework.
• It was originally developed in 2009 at UC Berkeley's AMPLab and became an open-source project in 2010.
• It is written in Scala.
• It provides in-memory computing for Hadoop.
• In Spark, workloads execute in memory rather than on disk, which makes it much faster (10 to 100 times) than when the workload is executed on disk.
• However, if the datasets are too large to fit into the available system memory, it can fall back to conventional disk-based processing.
• It serves as a potentially faster and more flexible alternative to MapReduce.
• It accesses data from HDFS (Spark does not have its own distributed file system) but bypasses the MapReduce processing model.
• Spark can be used with Hadoop, coexisting smoothly with MapReduce (sitting on top of Hadoop YARN), or used independently of Hadoop (standalone).
• As a programming model, it works well with Scala or Python (it has API connectors for using it with Java or Python), or with the R programming language.
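Here is a minimal word-count sketch using Spark's Java API, reading from and writing back to HDFS (the app name and paths are placeholders). The cache() call is what keeps the intermediate dataset in memory, the source of Spark's speed advantage over disk-based MapReduce.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Read from HDFS; Spark has no file system of its own.
            JavaRDD<String> lines =
                sc.textFile("hdfs://namenode:9000/user/demo/input");

            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line ->
                    Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum)
                .cache(); // keep the result in memory for reuse

            counts.saveAsTextFile("hdfs://namenode:9000/user/demo/counts");
        }
    }
}
```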
The following are the Spark libraries:
a) Spark SQL: Spark also has support for SQL. Spark SQL uses SQL to help query data stored in disparate applications (see the sketch after this list).
b) Spark Streaming: It helps to analyze and present data in real time.
c) MLlib: It supports machine learning, such as applying advanced statistical operations on data in a Spark cluster.
d) GraphX: It helps in graph-parallel computation.
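As a sketch of the Spark SQL library mentioned above, the snippet below registers a dataset as a temporary view and queries it with plain SQL. The file path, view name, and column names are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-example")
                .getOrCreate();

        // Load a JSON file (placeholder path) and query it with SQL.
        Dataset<Row> orders = spark.read().json("hdfs:///user/demo/orders.json");
        orders.createOrReplaceTempView("orders");

        Dataset<Row> monthly = spark.sql(
            "SELECT month, SUM(amount) AS total FROM orders GROUP BY month");
        monthly.show();

        spark.stop();
    }
}
```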
Spark and Hadoop are usually used together by several companies.
Hadoop was primarily designed to house unstructured data and run batch
processing operations on it.
• Spark is used extensively for its high-speed in-memory computing and its ability to run advanced real-time analytics.
• The two together have been giving very good results.
Hadoop Ecosystem Components for Data Analysis
1. Pig: It is a high-level scripting language used with Hadoop.
• It serves as an alternative to MapReduce.
• It has two parts:
(a) Pig Latin: It is a SQL-like scripting language. Pig Latin scripts are translated into MapReduce jobs, which can then run on YARN and process data in the HDFS cluster.
It was initially developed by Yahoo. It is immensely popular with developers who
are not comfortable with MapReduce.
However, SQL developers may have a preference for Hive.
• How it works: A "LOAD" command is available to load data from HDFS into Pig.
• One can then perform functions such as grouping, filtering, sorting, joining, etc.
• The processed or computed data can then be either displayed on screen or placed back into HDFS.
• Pig gives you a platform for building data flows for ETL (Extract, Transform, and Load), and for processing and analyzing huge datasets (an illustrative script follows below).
(b) Pig runtime: It is the runtime environment.
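The illustrative Pig Latin script below strings these steps together: load, filter, group, aggregate, sort, and store back into HDFS. All paths and field names are placeholders of our own, not from the book.

```
-- Illustrative Pig Latin script (paths and field names are placeholders)
logs    = LOAD '/user/demo/weblogs' USING PigStorage('\t')
          AS (user:chararray, url:chararray, bytes:int);
big     = FILTER logs BY bytes > 1024;
by_user = GROUP big BY user;
traffic = FOREACH by_user GENERATE group AS user,
          SUM(big.bytes) AS total;
sorted  = ORDER traffic BY total DESC;
STORE sorted INTO '/user/demo/traffic_report';
```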
2. Hive:
• Hive is a data warehouse software project built on top of Hadoop.
• The three main tasks performed by Hive are summarization, querying, and analysis.
• It supports queries written in a language called HQL or HiveQL, which is a declarative SQL-like language.
• It converts the SQL-style queries into MapReduce jobs, which are then executed on the Hadoop platform.
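A short, illustrative HiveQL session (the table, columns, and paths are hypothetical) showing the declarative style that Hive compiles into MapReduce jobs:

```sql
-- Illustrative HiveQL (table, columns, and paths are placeholders)
CREATE TABLE page_views (user_id STRING, url STRING, ts TIMESTAMP)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Loading is essentially a file move into the warehouse directory;
-- the schema is applied only when the data is read (schema on read).
LOAD DATA INPATH '/user/demo/page_views.tsv' INTO TABLE page_views;

-- This declarative query is compiled into MapReduce jobs by Hive.
SELECT url, COUNT(*) AS hits
FROM page_views
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```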
Difference between Hive and RDBMS
Both Hive and traditional databases such as MySQL, MS SQL Server, and PostgreSQL support an SQL interface.
However, Hive is better described as a data warehouse (D/W) than as a database.
Let us look at the difference between Hive and traditional databases as regards the
schema.
1. Hive enforces schema at read time, whereas an RDBMS enforces schema at write time. In an RDBMS, the table's schema is enforced at the time of loading/inserting the data.
• If the data being loaded does not conform to the schema, it is rejected.
• Thus, the schema is enforced on write (loading the data into the database).
• Schema on write takes longer to load the data into the database; however, it makes up for it during data retrieval with good query-time performance.
However, Hive does not enforce the schema when the data is being loaded into the D/W; it is enforced only when the data is being read/retrieved.
This is called schema on read.
It definitely makes for a fast initial load, as the data load or insertion operation is just a file copy or move.
2. Hive is based on the notion of write once, read many times, whereas an RDBMS is designed for reading and writing many times.
3. Hadoop is a batch-oriented system. Hive, therefore, is not suitable for OLTP (Online Transaction Processing); although not ideal, it comes closer to OLAP (Online Analytical Processing).
• The reason is that there is considerable latency between issuing a query and receiving a reply, as the query written in HiveQL is converted into MapReduce jobs that are then executed on the Hadoop cluster.
• An RDBMS, in contrast, is suitable for housing day-to-day transaction data and supports all OLTP operations with frequent insertions, modifications (updates), and deletions of data.
4. Hive handles static, non-real-time data analysis. Hive is the data warehouse of Hadoop; there are no frequent updates to the data, and the query response time is not fast. An RDBMS is suited to handling dynamic, real-time data.
5. Hive can be scaled easily and at very low cost compared to an RDBMS. Hive uses HDFS to store its data, so it cannot be considered the owner of the data; an RDBMS, on the other hand, is the owner of the data, responsible for storing, managing, and manipulating it in the database.
6. Hive uses the concept of parallel computing, whereas RDBMS uses serial computing.
Difference between Hive and HBase
1. Hive is a MapReduce-based SQL engine that runs on top of Hadoop; HBase is a key–value NoSQL database that runs on top of HDFS.
2. Hive is for batch processing of big data; HBase is for real-time data streaming.
Impala
• It is a high-performance SQL engine that runs on a Hadoop cluster.
• It is ideal for interactive analysis.
• It has very low latency, measured in milliseconds.
• It supports a dialect of SQL called Impala SQL.
• ZooKeeper: It is a coordination service for distributed applications.
• Oozie: It is a workflow scheduler system to manage Apache Hadoop jobs.
• Mahout: It is a scalable machine learning and data mining library.
• Chukwa: It is a data collection system for managing large distributed systems.
• Ambari: It is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.
4.2.5 Hadoop Distributions
• Hadoop is an open-source Apache project. Anyone can freely download the
core aspects of Hadoop. The core aspects of Hadoop include the following:
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN (Yet Another Resource Negotiator)
• Hadoop MapReduce
• There are a few companies, such as IBM, Amazon Web Services, Microsoft, Teradata, Hortonworks, and Cloudera, that have packaged Hadoop into more easily consumable distributions or services.
• Although each of these companies has a slightly different strategy, the essence remains the same: the ability to distribute data and workloads across potentially thousands of servers, thus making big data manageable.
• A few Hadoop distributions are given in Figure 4.12.
4.2.6 Hadoop versus SQL
Table 4.6 lists the differences between Hadoop and SQL.
4.2.7 Integrated Hadoop Systems Offered by Leading Market Vendors
Refer Figure 4.13 to get a glimpse of the leading market vendors
offering integrated Hadoop systems.
4.2.8 Cloud-Based Hadoop Solutions
• Amazon Web Services offers a comprehensive, end-to-end portfolio of cloud computing services to help manage big data.
• The aim is to achieve this and more while retaining the emphasis on reducing costs, scaling to meet demand, and accelerating the speed of innovation.
The Google Cloud Storage connector for Hadoop empowers one to perform MapReduce jobs directly on data in Google Cloud Storage, without the need to copy it to local disk or into the Hadoop Distributed File System (HDFS).
The connector simplifies Hadoop deployment while reducing cost; it provides performance comparable to HDFS, all while increasing reliability by eliminating the single point of failure of the NameNode.
Refer Figure 4.14.
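As a hedged sketch, once the connector is on the job's classpath, a gs:// URI can be used wherever an hdfs:// URI would otherwise appear, for example in the hypothetical WordCountDriver sketched earlier. The bucket and path names below are placeholders.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GcsPaths {
    // With the Cloud Storage connector installed, the job reads from and
    // writes to gs:// URIs instead of hdfs:// URIs; nothing else changes.
    static void useCloudStorage(Job job) throws Exception {
        FileInputFormat.addInputPath(job, new Path("gs://my-bucket/input"));
        FileOutputFormat.setOutputPath(job, new Path("gs://my-bucket/output"));
    }
}
```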
Thank you
Q&A