Front cover
Course Guide
Big Data Engineer 2021
Big Data Ecosystem
Course code SABSE ERC 3.0
Contents
Trademarks
Agenda
Trademarks
The reader should recognize that the following terms, which appear in the content of this training
document, are official trademarks of IBM or other companies:
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide.
The following are trademarks of International Business Machines Corporation, registered in many
jurisdictions worldwide:
Cloudant® Db2® IBM Research™
IBM Spectrum® InfoSphere® Insight®
Resilient® Resource® Smarter Planet®
SPSS® Watson™ WebSphere®
Linux® is a registered trademark used pursuant to a sublicense from the Linux Foundation, the
exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis.
Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other
countries, or both.
Java™ and all Java-based trademarks and logos are trademarks or registered trademarks of
Oracle and/or its affiliates.
UNIX is a registered trademark of The Open Group in the United States and other countries.
RStudio® is a registered trademark of RStudio, Inc.
Evolution® is a trademark or registered trademark of Kenexa, an IBM Company.
Veracity® is a trademark or registered trademark of Merge Healthcare, an IBM Company.
Other product and service names might be trademarks of IBM or other companies.
Course description
Big Data Engineer 2021 - Big Data Ecosystem
Duration: 3 days
Purpose
The Big Data Engineer – Big Data Ecosystem course is part of the Big Data Engineer career path.
It prepares students to use the big data platform and methodologies to collect and analyze large
amounts of data from different sources. This course introduces Apache Hadoop and its ecosystem
components, such as HDFS, YARN, and MapReduce. This course covers Hortonworks Data Platform
(HDP), the open source Apache Hadoop distribution based on YARN. Students learn about Apache
Ambari, which is an open framework for provisioning, managing, and monitoring Apache Hadoop
clusters. Ambari is part of HDP.
Other topics that you learn in this course include:
• Apache Spark, the general-purpose distributed computing engine that is used for processing
and analyzing a large amount of data.
• Storing and querying data efficiently.
• Security and data governance.
• Stream computing and how it is used to analyze and process vast amounts of data in real time.
Audience
Undergraduate senior students from IT-related academic programs, for example, computer
science, software engineering, information systems, and others.
Prerequisites
Before attending the Module III Big Data Engineer classroom course, students must meet the following
prerequisites:
• Successful completion of Module I Big Data Overview (self-study).
Objectives
After completing this course, you should be able to:
• Explain the concept of big data.
• List the various characteristics of big data.
• Recognize typical big data use cases.
• List Apache Hadoop core components and their purpose.
• Describe the Hadoop infrastructure and its ecosystem.
• Identify what is a good fit for Hadoop and what is not.
• Describe the functions and features of HDP.
• Explain the purpose and benefits of IBM added value components.
• Explain the purpose of Apache Ambari, describe its architecture, and manage Hadoop clusters
with Apache Ambari.
• Describe the nature of the Hadoop Distributed File System (HDFS) and run HDFS commands.
• Describe the MapReduce programming model and explain the Java code that is required to
handle the Mapper class, the Reducer class, and the program driver that is needed to access
MapReduce.
• Compile MapReduce programs and run them by using Hadoop and YARN commands.
• Describe Apache Hadoop v2 and YARN.
• Explain the nature and purpose of Apache Spark in the Hadoop infrastructure.
• Work with Spark RDD (Resilient Distributed Dataset) with Python.
• Use the HBase shell to create HBase tables, explore the HBase data model, store, and access
data in HBase.
• Use the Hive CLI to create Hive tables, import data into Hive, and query data on Hive.
• Use the Beeline CLI to query data on Hive.
• Explain the need for data governance and the role of data security in this governance.
• List the five pillars of security and how they are implemented with Hortonworks Data Platform
(HDP).
• Explain streaming data concepts and terminology.
Agenda
Note
The following unit and exercise durations are estimates, and might not reflect every class
experience.
Day 1
(00:30) Welcome
(01:00) Unit 1 - Introduction to Big Data
(00:30) Unit 2 - Introduction to Hortonworks Data Platform (HDP)
(01:00) Lunch break
(00:30) Exercise 1 - Exploring the lab environment
(00:30) Unit 3 - Introduction to Apache Ambari
(00:45) Exercise 2 - Managing Hadoop clusters with Apache Ambari
Day 2
(01:00) Unit 4 - Apache Hadoop and HDFS
(00:30) Exercise 3 - File access and basic commands with Hadoop Distributed File System (HDFS)
(02:20) Unit 5 - MapReduce and YARN
(01:00) Lunch break
(00:45) Exercise 4 - Running MapReduce and YARN jobs
(00:30) Exercise 5 - Creating and coding a simple MapReduce job
(02:00) Unit 6 - Introduction to Apache Spark
Day 3
(00:45) Exercise 6 - Running Spark applications in Python
(02:00) Unit 7 - Storing and querying data
(01:00) Lunch break
(01:30) Exercise 7 - Using Apache HBase and Apache Hive to access Hadoop data
(01:15) Unit 8 - Security and governance
(01:00) Unit 9 - Stream computing
Overview
This unit provides an overview of big data, why it is important, and typical use cases. This unit
describes the evolution from traditional data processing to big data processing. It introduces
Apache Hadoop and its ecosystem.
Unit objectives
• Explain the concept of big data.
• Describe the factors that contributed to the emergence of big data
processing.
• List the various characteristics of big data.
• List typical big data use cases.
• Describe the evolution from traditional data processing to big data
processing.
• List Apache Hadoop core components and their purpose.
• Describe the Hadoop infrastructure and the purpose of the main
projects.
• Identify what is a good fit for Hadoop and what is not.
1.1. Big data overview
Topics
• Big data overview
• Big data use cases
• Evolution from traditional data processing to big data processing
• Introduction to Apache Hadoop and the Hadoop infrastructure
Big data is a term that is used to describe large collections of data (also known as data sets). Big
data might be unstructured and grow so large and so quickly that it is difficult to manage with regular
database or statistics tools.
We are witnessing a tsunami of huge volumes of data of different types and formats that makes
managing, processing, storing, safeguarding, securing, and transporting data a real challenge.
“Big data refers to non-conventional strategies and innovative technologies that are used by
businesses and organizations to capture, manage, process, and make sense of a large volume of
data.” (Source: Reed, J., Data Analytics: Applicable Data to Advance Any Business. Seattle, WA,
CreateSpace Independent Publishing Platform, 2017. 1544916507.)
The analogies:
• Elephant (hence the logo of Hadoop)
• Humongous (the underlying word for Mongo Database)
• Streams, data lakes, and oceans of data
There is much data, such as historical and new data that is generated from social media apps,
science, medical research, stream data from web applications, and IoT sensor data. The amount of
data is larger than ever, growing exponentially, and in many different formats.
The business value in the data comes from the meaning that you can harvest from it. Deriving
business value from all that data is a significant problem.
Figure 1-8. The four classic dimensions of big data (the four Vs). The figure labels variety (different forms of data), velocity (analysis of streaming data), veracity (uncertainty of data), and the value that is derived from the data.
• Data variety
More sources of data mean more varieties of data in different formats: from traditional
documents and databases, to semi-structured and unstructured data from click streams, GPS
location data, social media apps, and IoT (to name a few). Different data formats mean that it is
tougher to derive value (meaning) from the data because it must all be extracted for processing
in different ways. Traditional computing methods do not work on all these different varieties of
data.
• Data veracity
There is usually noise, biases, and abnormality in data. It is possible that such a huge amount
of data has some uncertainty that is associated with it. After much data is gathered, it must be
curated, sanitized, and cleansed.
Often, this process is seen as the thankless job of being a data janitor, and it can take more
than 85% of a data analyst’s or data scientist’s time. Veracity in data analysis is considered the
biggest challenge when compared to volume, velocity, and variety. The large volume, wide
variety, and high velocity along with high-end technology has no significance if the data that is
collected or reported is incorrect. Data trustworthiness (in other words, the quality of data) is of
the highest importance in the big data world.
• Data value
The business value in the data comes from the meaning that we can harvest from it. The value
comes from converting a large volume of data into actionable insights that are generated by
analyzing information, which leads to smarter decision making.
References:
• What is big data? More than volume, velocity and variety:
https://developer.ibm.com/blogs/what-is-big-data-more-than-volume-velocity-and-variety/
• The Four Vs of Big Data:
https://www.ibmbigdatahub.com/infographic/four-vs-big-data
• Big Data Analytics:
ftp://ftp.software.ibm.com/software/tw/Defining_Big_Data_through_3V_v.pdf
• The 5 Vs of big data:
https://www.ibm.com/blogs/watson-health/the-5-vs-of-big-data/
• The 4 Vs of Big Data for Yielding Invaluable Gems of Information:
https://www.promptcloud.com/blog/The-4-Vs-of-Big-Data-for-Yielding-Invaluable-Gems-of-Information
Figure: Big data analytics draws on many disciplines, with data science at their intersection: domain knowledge, statistics, visualizations, machine learning, pattern recognition, business analysis, presentation, KDD, AI, and databases and data processing.
Big data analytics is the use of advanced analytic techniques against large, diverse data sets from
different sources and in different sizes from terabytes to zettabytes. There are several specialized
techniques and technologies that are involved. The slide shows some of the big data analytics
techniques and the relationship between them. This list is not exhaustive, but it helps you
understand the complexity of the problem domain.
For more information, see the articles that are listed under References.
References:
• An Insight into 26 Big Data Analytic Techniques: Part 1:
https://blogs.systweak.com/an-insight-into-26-big-data-analytic-techniques-part-1/
• An Insight into 26 Big Data Analytic Techniques: Part 2:
https://blogs.systweak.com/an-insight-into-26-big-data-analytic-techniques-part-2/
• Big data analytics:
https://www.ibm.com/analytics/hadoop/big-data-analytics
• A Beginner’s Guide to Big Data Analytics:
https://blogs.systweak.com/a-beginners-guide-to-big-data-analytics/
1.2. Big data use cases
Topics
• Big data overview
• Big data use cases
• Evolution from traditional data processing to big data processing
• Introduction to Apache Hadoop and the Hadoop infrastructure
Figure: Two of the five high-value big data use cases: operational analysis (analyze various machine data for improved business results) and data warehouse modernization (integrate big data and data warehouse capabilities to gain new business insights and increase operational efficiency).
IBM conducted surveys, studied analysts’ findings, spoke with thousands of customers and
prospects, and implemented hundreds of big data solutions. As a result, IBM identified five
high-value use cases that enable organizations to gain new value from big data:
• Big data exploration: Find, visualize, and understand big data to improve decision making.
• Enhanced 360-degree view of the customer: Enhance existing customer views by incorporating
internal and external information sources.
• Security and intelligence extension: Reduce risk, detect fraud, and monitor cybersecurity in real
time.
• Operations analysis: Analyze various machine data for better business results and operational
efficiency.
• Data warehouse modernization: Integrate big data and traditional data warehouse capabilities
to gain new business insights while optimizing the existing warehouse infrastructure.
These use cases are not intended to be sequential or prioritized. The key is to identify which use
cases make the most sense for the organization given the challenges that it faces.
Figure 1-13. Common use cases that are applied to big data
Businesses today are drowning in data. Big data analytics and AI are helping businesses across a
broad range of industries respond to the needs of their customers, which drives increased revenue
and reduced costs.
Resources:
• How 10 industries are using big data to win large:
https://www.ibm.com/blogs/watson/2016/07/10-industries-using-big-data-win-big/
• Cloudera Blog:
https://blog.cloudera.com/data-360/
• Use Cases:
https://www.ibmbigdatahub.com/use-cases
Healthcare organizations are leveraging big data analytics to capture all the information about a
patient. The organizations can get a better view for insight into care coordination, outcomes-based
reimbursement models, population health management, and patient engagement and
outreach. Successfully harnessing big data unleashes the potential to achieve the three critical
objectives for healthcare transformation: Build sustainable healthcare systems, collaborate to
improve care and outcomes, and increase access to healthcare.
In the big data world, here is a likely scenario:
• Linda is a diabetic person.
• Linda is seeing her physician for her annual physical.
• Linda experiences symptoms such as tiredness, stress, and irritability.
• In a big data world, Linda’s physician has a 360-degree view of her healthcare history: diet,
appointments, exercise, lab tests, vital signs, prescriptions, treatments, and allergies.
• The doctor records Linda’s concerns in her electronic health record. They find that patients like
Linda have success with a wellness program that is covered by her health plan.
• When Linda joins the wellness program, she grants access to the dietician and the trainer to
see her records.
• The trainer sees the previous records.
• A big data analysis of the outcomes from other members like Linda suggests to the trainer a
program that benefits Linda.
• The trainer recommends that Linda downloads an application that feeds her activity and vital
signs to her care team.
• With secure access to her wellness program, Linda monitors her health improvements.
• With the help of big data analytics, Linda’s care team sees how she is progressing.
• With these insights, the health plan adjusts the program to increase the effectiveness and offers
the program to other patients like Linda.
References:
Big Data & Analytics for Healthcare:
https://youtu.be/wOwept5WlWM
The Precision Medicine Initiative (PMI) is a research project that involves the National Institutes of
Health (NIH) and multiple other research centers. This initiative aims to understand how a person's
genetics, environment, and lifestyle can help determine the best approach to prevent or treat
disease.
The long-term goals of the PMI focus on bringing precision medicine to all areas of health and
healthcare on a large scale. The NIH started a study that is known as the All of Us Research
Program.
The All of Us Research Program is a historic effort to collect and study data from one million or
more people living in the United States. The goal of the program is better health for all of us. The
program began national enrollment in 2018 and is expected to last at least 10 years.
The graphic on the right of the slide (from the National Cancer Institute) illustrates the use of
precision medicine in cancer treatment: discovering unique therapies that treat an individual’s
cancer based on the specific genetic abnormalities of that person’s tumor.
References:
• Obama’s Precision Medicine Initiative is the Ultimate big data project: “Curing both rare
diseases and common cancers doesn't just require new research, but also linking all the data
that researchers already have”:
http://www.fastcompany.com/3057177/obamas-precision-medicine-initiative-is-the-ultimate-big-data-project
• The Precision Medicine Initiative - White House:
https://www.whitehouse.gov/precision-medicine
• Obama: Precision Medicine Initiative Is First Step to Revolutionizing Medicine - "We may be
able to accelerate the process of discovering cures in ways we've never seen before," the
president said:
http://www.usnews.com/news/articles/2016-02-25/obama-precision-medicine-initiative-is-first-step-to-revolutionizing-medicine
• All of Us Research Program:
https://allofus.nih.gov/about
• National Institutes of Health - All of Us Research Program:
https://www.nih.gov/precision-medicine-initiative-cohort-program
• National Cancer Institute and the Precision Medicine Initiative:
http://www.cancer.gov/research/key-initiatives/precision-medicine
• Precision Medicine (Wikipedia):
https://en.wikipedia.org/wiki/Precision_medicine
Banks face many challenges as they strive to return to pre-2008 profit margins, including reduced
interest rates, unstable financial markets, tighter regulations, and lower performing assets.
Fortunately, banks taking advantage of big data and analytics can generate new revenue streams.
Watch this real-life example of how big data and analytics can improve the overall customer
experience:
https://youtu.be/1RYKgj-QK4I
References:
• Big data analytics:
https://www.ibm.com/analytics/hadoop/big-data-analytics
• IBM Big Data and Analytics at work in Banking:
https://youtu.be/1RYKgj-QK4I
References:
• Visa Says Big Data Identifies Billions of Dollars in Fraud:
https://blogs.wsj.com/cio/2013/03/11/visa-says-big-data-identifies-billions-of-dollars-in-fraud/
• VISA: Using Big Data to Continue Being Everywhere You Want to Be:
https://www.hbs.edu/openforum/openforum.hbs.org/goto/challenge/understand-digital-transformation-of-business/visa-using-big-data-to-continue-being-everywhere-you-want-to-be/comments-section.html
Financial
References:
• Credit Scoring in the Era of Big Data:
https://yjolt.org/credit-scoring-era-big-data
• Big Data Trends in Financial Services:
https://www.accesswire.com/575714/Big-Data-Trends-in-Financial-Services
About 15 years ago, Clive Humby, the man who built Clubcard, the world’s first supermarket loyalty
scheme, coined the expression "Data is the new oil." (Source:
https://medium.com/@adeolaadesina/data-is-the-new-oil-2947ed8804f6)
The metaphor explains that data, like oil, is a resource that is useless if left “unrefined”. Only when
data is mined and analyzed does it create extraordinary value. This now famous phrase was
embraced by the World Economic Forum in a 2011 report, which considered data to be an
economic asset like oil.
"Information is the oil of the 21st century, and analytics is the combustion engine.“ is a quote by
Peter Sondergaard, senior vice president and global head of Research at Gartner, Inc. The quote
highlights the importance of data and data analytics. The quote came from a speech that was given
by Mr. Sondergaard at the Gartner Symposium/ITxpo in October 2011 in Orlando, Florida.
Reference:
Data is the new oil:
https://medium.com/@adeolaadesina/data-is-the-new-oil-2947ed8804f6
1.3. Evolution from traditional data processing
to big data processing
Figure 1-21. Evolution from traditional data processing to big data processing
Topics
• Big data overview
• Big data use cases
• Evolution from traditional data processing to big data processing
• Introduction to Apache Hadoop and the Hadoop infrastructure
Just as cloud computing enables new ways for businesses to use IT because of many years of
incremental progress in the area of virtualization, big data now enables new ways of doing business
by bringing advances in analytics and management of both structured and unstructured data into
mainstream solutions.
Big data solutions now enable us to change the way we do business in ways that were not possible
a few years ago by taking advantage of previously unused sources of information.
Graphic source: IBM
Reference:
Big Data Processing:
https://www.sciencedirect.com/topics/computer-science/big-data-processing
Even the term “byte” is ambiguous. The generally accepted meaning these days is an octet, which
is 8 bits. The de facto standard of 8 bits is a convenient power of two permitting the values 0 - 255
for 1 byte. The international standard IEC 80000-13 codified this common meaning. Many types of
applications use information representable in eight or fewer bits and processor designers optimize
for this common usage. The popularity of major commercial computing architectures aided in the
ubiquitous acceptance of the 8-bit size. The unit octet was defined to explicitly denote a sequence
of 8 bits because of the ambiguity associated at the time with the byte.
Unicode UTF-8 encoding is variable-length and uses 8-bit code units. It was designed for
compatibility with ASCII and to avoid the complications of endianness and byte order marks in the
alternative UTF-16 and UTF-32 encodings. The name is derived from “Universal Coded Character
Set + Transformation Format - 8-bit”. UTF-8 is the dominant character encoding for the World Wide
Web, accounting for over 95% of all web pages, and up to 100% for some languages, as of 2020.
The Internet Mail Consortium (IMC) recommends that all email programs be able to display and
create mail by using UTF-8, and the W3C recommends UTF-8 as the default encoding in XML and
HTML. UTF-8 encodes each of the 1,112,064 valid code points in the Unicode code space
(1,114,112 code points minus 2,048 surrogate code points) by using one to four 8-bit bytes (a group
of 8 bits is known as an octet in the Unicode Standard).
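The one-to-four byte behavior is easy to demonstrate. The following minimal Python sketch (the sample characters are chosen only for illustration) encodes a few characters and prints how many octets UTF-8 needs for each:

# Show that UTF-8 uses one to four 8-bit code units (octets) per character.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")   # bytes object holding the octets
    print(f"U+{ord(ch):04X} uses {len(encoded)} byte(s): {encoded.hex()}")

Running it prints 1, 2, 3, and 4 bytes for the four characters, which matches the one-to-four octet range that is described above.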
References:
• https://en.wikipedia.org/wiki/Byte
• https://en.wikipedia.org/wiki/UTF-8
Before diving into what is Hadoop, let us talk about the context for why Hadoop technology is so
important.
Moore’s law has been true for a long time, but no matter how many more transistors are added to
CPUs and how powerful they become, the bottleneck is disk latency. Scaling up (more powerful
computers with powerful CPUs) is not the answer to all problems because disk latency is the main
issue. Scaling out (to a cluster of computers) is the better approach, and the only approach at
extreme scale.
Grace Hopper (9 December 1906 - 1 January 1992) was an American computer scientist and
United States Navy Rear Admiral. She was one of the first programmers of the Harvard Mark
I computer in 1944, invented the first compiler for a computer programming language, and
popularized the idea of machine-independent programming languages, which led to the
development of COBOL, one of the first high-level programming languages.
The quotation source is White, T., Hadoop: The Definitive Guide: Storage and Analysis at Internet
Scale 4th Edition. Sebastopol, CA, O'Reilly Media, 2015. 1491901632.
Reference:
https://en.wikipedia.org/wiki/Grace_Hopper
OLTP enables the rapid and accurate data processing behind ATMs and online banking, cash
registers and e-commerce, and many other services that we interact with every day.
Reference:
OLTP:
https://www.ibm.com/cloud/learn/oltp
A core component of data warehousing implementations, OLAP enables fast and flexible
multidimensional data analysis for business intelligence (BI) and decision support applications.
OLAP is software for performing multidimensional analysis at high speeds on large volumes of data
from a data warehouse, data mart, or some other unified, centralized data store.
Most business data has multiple dimensions or categories into which the data is broken down for
presentation, tracking, or analysis. For example, sales figures might have several dimensions that
are related to location (region, country, state/province, and store), time (year, month, week, and
day), product (clothing, men/women/children, brand, and type), and more.
In a data warehouse, data sets are stored in tables, each of which can organize data into just two of
these dimensions at a time. OLAP extracts data from multiple relational data sets and reorganizes it
into a multidimensional format, which enables fast processing and insightful analysis.
Reference:
OLAP: https://www.ibm.com/cloud/learn/olap
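As a small illustration of the multidimensional idea (this is not any particular OLAP product, just a Python sketch that uses the pandas library and made-up sales rows), the snippet below pivots flat records into a region-by-year view and rolls the measure up across those dimensions:

import pandas as pd

# Hypothetical flat sales records: several dimensions plus one measure.
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East"],
    "year":    [2020, 2021, 2020, 2021, 2021],
    "product": ["clothing", "clothing", "footwear", "clothing", "footwear"],
    "amount":  [100.0, 150.0, 80.0, 120.0, 60.0],
})

# Aggregate the measure along two dimensions at a time, OLAP-cube style.
cube = pd.pivot_table(sales, values="amount",
                      index="region", columns="year", aggfunc="sum")
print(cube)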
• Event-driven
If by “real time” you mean the opposite of scheduled, then you mean event-driven. Instead of
happening at a particular time interval, event-driven data processing happens when a certain
action or condition triggers it. The performance requirement is generally to complete the
processing before another event happens.
References:
• Real Time Isn’t As Real As You’ve Been Led to Believe:
https://www.linkedin.com/pulse/real-time-isnt-youve-been-led-believe-james-kobielus
• Four Really Real Meanings of Real-Time:
http://bigdatapage.com/4-really-real-meanings-of-real-time/
1.4. Introduction to Apache Hadoop and the
Hadoop infrastructure
Introduction to Apache Hadoop and the Hadoop infrastructure
Topics
• Big data overview
• Big data use cases
• Evolution from traditional data processing to big data processing
• Introduction to Apache Hadoop and the Hadoop infrastructure
▪ Adding load to the system should result in a graceful decline in performance of individual
jobs, not the failure of the system.
▪ Increasing resources should support a proportional increase in load capacity.
References:
• Cloudera Introduction to Hadoop:
http://people.apache.org/~larsgeorge/SAP-Summit/Slides.pdf
• The Google File System (GFS):
http://research.google.com/archive/gfs.html
• Bigtable: A Distributed Storage System for Structured Data:
https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
• MapReduce: Simplified Data Processing on Large Clusters:
https://research.google/pubs/pub62/
Let’s now dive into the topic of Hadoop and the Hadoop infrastructure.
This unit covers the “big picture”. The following units explore in more detail key components of the
Hadoop architecture and infrastructure.
References:
• Apache Hadoop:
http://hadoop.apache.org/
• The history of Hadoop: From 4 nodes to the future of data:
https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/2/
Hadoop is an open source project that develops software for reliable, scalable, and distributed
computing, such as for big data.
It is designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Instead of relying on hardware for high availability, Hadoop is designed to
detect and handle failures at the application layer. This approach delivers a highly available service
on top of a cluster of computers, each of which might be prone to failures.
Hadoop is a series of related projects with the following modules at its core:
• Hadoop Common: The common utilities that support the other Hadoop modules.
• HDFS: A powerful distributed file system that provides high-throughput access to application
data. The idea is to be able to distribute the processing of large data sets over clusters of
inexpensive computers.
• Hadoop YARN: A framework for job scheduling and the management of cluster resources.
• Hadoop MapReduce: A core component that is a YARN-based system that allows you to
distribute a large data set over a series of computers for parallel processing.
• Hadoop Ozone: An object store for Hadoop.
The Hadoop framework is written in Java and was originally developed by Doug Cutting, who
named it after his son's toy elephant.
Hadoop uses concepts from Google’s MapReduce and GFS technologies as its foundation. It is
optimized to handle massive amounts of data, which might be structured, unstructured, or
semi-structured, by using commodity hardware, that is, relatively inexpensive computers. This
massive parallel processing is done with great performance. In its initial conception, it is a batch
operation handling massive amounts of data, so the response time is not instantaneous.
Hadoop is not used for OLTP or OLAP, but for big data. It complements OLTP and OLAP to manage
data. So Hadoop is not a replacement for a relational database management system (RDBMS).
References:
• Apache Hadoop:
http://hadoop.apache.org/
• What is Hadoop, and how does it relate to cloud?
https://www.ibm.com/blogs/cloud-computing/2014/05/07/hadoop-relate-cloud/
Figure 1-37. Why and where Hadoop is used and not used
References:
• Apache Hadoop:
http://hadoop.apache.org/
• MapReduce Tutorial:
https://hadoop.apache.org/docs/r3.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
• HDFS:
https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
• YARN:
https://hadoop.apache.org/docs/r3.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html
There are two key components or aspects of Hadoop that are important to understand:
• HDFS is where Hadoop stores the data. This file system spans all the nodes in a cluster. HDFS
links together the data that is on many local nodes, which makes the data part of one large file
system. You can use other file systems with Hadoop, for example MapR MapRFS and IBM
Spectrum Scale (formerly known as IBM General Parallel File System (IBM GPFS)). HDFS is
the most popular file system for Hadoop.
HDFS is a distributed file system that is designed to run on commodity hardware. It has
significant differences from other distributed file systems. HDFS is highly fault-tolerant and
designed to be deployed on low-cost hardware. HDFS provides high throughput access to
application data and is suitable for applications that have large data sets.
Typically, the compute nodes and the storage nodes are the same. The MapReduce framework
and the HDFS run on the same set of nodes. This configuration allows the framework to
effectively schedule tasks on the nodes where data is present, resulting in high aggregate
bandwidth across the cluster.
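The following sketch shows what basic interaction with HDFS looks like. It simply wraps the standard hdfs dfs commands in Python's subprocess module; the /user/student path and the file name are examples only, and the hdfs client is assumed to be on the PATH:

import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' file system command and return its output."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Create a directory, copy a local file into HDFS, and list the result.
hdfs("-mkdir", "-p", "/user/student/data")                   # example path
hdfs("-put", "-f", "local_sales.csv", "/user/student/data/")
print(hdfs("-ls", "/user/student/data"))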
• MapReduce is a software framework that was introduced by Google to support distributed
computing on large data sets across clusters of computers. Applications that are written to use the
MapReduce framework process vast amounts of data (multi-terabyte data sets) in parallel on
large clusters (thousands of nodes) of commodity hardware reliably and in a fault-tolerant manner.
A MapReduce job usually splits the input data set into independent chunks, which are
processed by the map tasks in a parallel manner. The framework sorts the outputs of the maps,
which are then input to the reduce tasks. Typically, both the input and the output of the job are
stored in a file system. The framework takes care of scheduling tasks, monitoring them, and
rerunning the failed tasks. Applications specify the input/output locations and supply map and
reduce functions.
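To make the model concrete, here is an illustrative sketch only (the course exercises use Java; this version uses Hadoop Streaming so that the map and reduce functions can be written in Python). The file names and HDFS paths are placeholders:

#!/usr/bin/env python3
# mapper.py - read lines from standard input and emit "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - sum the counts for each word; the input arrives sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    word, count = line.rsplit("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

A job like this is typically submitted with the hadoop-streaming JAR (its exact path depends on the installation), passing -mapper, -reducer, -input, and -output options that point at the two scripts and the HDFS directories.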
References:
• MapReduce Tutorial:
https://hadoop.apache.org/docs/r3.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
• HDFS:
https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Introduction
Reference:
Hadoop vs RDBMS: Comparison between Hadoop & Database?
https://community.cloudera.com/t5/Support-Questions/Hadoop-vs-RDBMS-What-is-the-difference-between-Hadoop/td-p/232165
• Hadoop-related projects:
ƒ HBase
ƒ Apache Hive (query/SQL)
ƒ Apache Pig (data flow)
ƒ Apache Avro (data serialization/RPC)
ƒ Apache Sqoop (RDBMS connector, data integration)
ƒ Apache Oozie (workflow and scheduling)
ƒ Apache Chukwa (monitoring and management)
ƒ Apache ZooKeeper (coordination, cluster management)
ƒ Apache Spark
The slide also shows the core layers that these projects build on: MapReduce (distributed processing) and YARN (cluster and resource management).
Most of the services that are available in the Hadoop infrastructure supplement the core
components of Hadoop, which include HDFS, YARN, MapReduce, and Common. The Hadoop
infrastructure includes both Apache open source projects and other commercial tools and solutions.
The slide shows some examples of Hadoop-related projects at Apache.
Note
Apart from the components that are listed in the slide, there are many other components that are
part of the Hadoop infrastructure. The components in the slide are just an example.
• HBase
A scalable, distributed database that supports structured data storage for large tables. It is used
for random, real-time read/write access to big data. The goal of HBase is to host large tables.
• Apache Hive
A data warehouse infrastructure that provides data summarization and ad hoc querying.
Apache Hive facilitates reading, writing, and managing large data sets that are in distributed
storage by using SQL.
• Apache Pig
A high-level data flow language and execution framework for parallel computation. Apache Pig
is a platform for analyzing large data sets. Apache Pig consists of a high-level language for
expressing data analysis programs that is coupled with an infrastructure for evaluating these
programs.
• Apache Avro
A data serialization system.
• Apache Sqoop
A tool that is designed for efficiently transferring bulk data between Apache Hadoop and
structured data stores, such as relational databases.
• Apache Oozie
A workflow scheduler system to manage Apache Hadoop jobs.
• Apache ZooKeeper
A high-performance coordination service for distributed applications. Apache ZooKeeper is a
centralized service for maintaining configuration information; naming; providing distributed
synchronization; and providing group services. Distributed applications use these kinds of
services.
• Apache Chukwa
A data collection system for managing a large distributed system. It includes a toolkit for
displaying, monitoring, and analyzing results to make the best use of the collected data.
• Apache Ambari
A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which
include support for Hadoop HDFS, Hadoop MapReduce, Apache Hive, HCatalog, HBase,
Apache ZooKeeper, Apache Oozie, Apache Pig, and Apache Sqoop. Apache Ambari also
provides a dashboard for viewing cluster health such as heatmaps. The dashboard can
visualize MapReduce, Apache Pig, and Apache Hive applications along with features to
diagnose their performance characteristics.
• Apache Spark
A fast and general compute engine for Hadoop data. Apache Spark provides a simple
programming model that supports a wide range of applications, including ETL, machine
learning, stream processing, and graph computation.
References:
• https://www.coursera.org/learn/hadoop/lecture/E87sw/hadoop-ecosystem-major-components
• The Hadoop infrastructure Table:
https://hadoopecosystemtable.github.io/
• Apache Hadoop:
https://hadoop.apache.org/
• Apache HBase:
https://hbase.apache.org/
• Apache Hive:
https://hive.apache.org/
• Apache Pig:
https://pig.apache.org/
• Apache Avro:
https://avro.apache.org/docs/current/
• Apache Sqoop:
https://sqoop.apache.org/
• Apache Oozie:
https://oozie.apache.org/
• Apache ZooKeeper:
https://zookeeper.apache.org/
• Apache Chukwa:
https://chukwa.apache.org/
• Apache Ambari:
https://ambari.apache.org/
Think differently
As you start to work with Hadoop, you must think differently:
• There are different processing paradigms.
• There are different approaches to storing data.
• Think ELT rather than ETL.
Unit summary
• Explained the concept of big data.
• Described the factors that contributed to the emergence of big data
processing.
• Listed the various characteristics of big data.
• Listed typical big data use cases.
• Described the evolution from traditional data processing to big data
processing.
• Listed Apache Hadoop core components and their purpose.
• Described the Hadoop infrastructure and the purpose of the main
projects.
• Identified what is a good fit for Hadoop and what is not.
Review questions
1. True or False: The number of Vs of big data is exactly four.
Review answers
1. True or False: The number of Vs of big data is exactly four.
Overview
In this unit, you learn about the Hortonworks Data Platform (HDP), the open source Apache Hadoop
distribution based on a centralized architecture (YARN).
Unit objectives
• Describe the functions and features of HDP.
• List the IBM added value components.
• Describe the purpose and benefits of each added value component.
2.1. Hortonworks Data Platform overview
Topics
• Hortonworks Data Platform overview
• Data flow
• Data access
• Data lifecycle and governance
• Security
• Operations
• Tools
• IBM added value components
Data at rest is data that is stored physically in any digital form (for example, in databases, data
warehouses, spreadsheets, archives, tapes, off-site backups, or mobile devices).
HDP is a powerful platform for managing big data at rest.
HDP is an open-source enterprise Hadoop distribution that has the following attributes:
• 100% open source.
• Centrally designed with YARN at its core.
• Interoperable with existing technology and skills.
• Enterprise-ready, with data services for operations, governance, and security.
Figure: The HDP stack, which groups components into governance integration (Atlas), tools, security (HDFS encryption), operations (ZooKeeper, Oozie), data workflow (Sqoop), and the data access engines (batch, script, SQL, NoSQL, stream, search, in-memory, and others) on top of HDFS.
2.2. Data flow
Data flow
Topics
• Hortonworks Data Platform overview
• Data flow
• Data access
• Data lifecycle and governance
• Security
• Operations
• Tools
• IBM added value components
In this section, you learn about some of the data workflow tools that come with HDP.
Kafka
• Often used in place of traditional message brokers, such as JMS- or AMQP-based systems,
because of its higher throughput, reliability, and replication.
Apache Kafka is a messaging system used for real-time data pipelines. Kafka is used to build
real-time streaming data pipelines that get data between systems or applications. Kafka works with
a number of Hadoop tools for various applications.
Examples of use cases are:
• Website activity tracking: capturing user site activities for real-time tracking and monitoring
• Metrics: monitoring data
• Log aggregation: collecting logs from various sources to a central location for processing
• Stream processing: article recommendations based on user activity
• Event sourcing: state changes in applications are logged as a time-ordered sequence of records
• Commit log: an external commit log system that helps with replicating data between nodes in
case of failed nodes
Reference:
More information can be found here: https://kafka.apache.org/
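A minimal sketch of publishing an event to Kafka from Python. It assumes the third-party kafka-python package and a broker listening on localhost:9092; the topic name and event fields are made up for illustration:

import json
from kafka import KafkaProducer   # pip install kafka-python (assumed)

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                        # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),  # send dicts as JSON bytes
)

# Publish a click-stream style event to a hypothetical "site-activity" topic.
producer.send("site-activity", {"user": "linda", "page": "/wellness", "action": "view"})
producer.flush()   # block until the event is delivered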
Sqoop
• Can also be used to extract data from Hadoop and export it to relational
databases and enterprise data warehouses
• Helps offload some tasks such as ETL from Enterprise Data Warehouse
to Hadoop for lower cost and efficient execution
Sqoop is a tool for moving data between structured or relational databases and the Hadoop system.
It works both ways: you can take data in your RDBMS and move it into HDFS, or move data from
HDFS to some other RDBMS. You can use Sqoop to offload tasks such as ETL from data
warehouses to Hadoop for lower-cost and more efficient execution for analytics.
Reference:
Check out the Sqoop documentation for more info: http://sqoop.apache.org/
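A typical Sqoop import is driven from the command line. The sketch below wraps one such command in Python's subprocess module; the JDBC URL, credentials, table, and target directory are all hypothetical:

import subprocess

# Import the (hypothetical) "orders" table from a MySQL database into HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/sales",   # example JDBC URL
    "--username", "etl_user", "-P",                  # -P prompts for the password
    "--table", "orders",
    "--target-dir", "/user/student/orders",          # HDFS output directory
    "--num-mappers", "4",                            # degree of parallelism
], check=True)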
2.3. Data access
Data access
Topics
• Hortonworks Data Platform overview
• Data flow
• Data access
• Data lifecycle and governance
• Security
• Operations
• Tools
• IBM added value components
In this section, you learn about some of the data access tools that come with HDP. These include
MapReduce, Pig, Hive, HBase, Accumulo, Phoenix, Storm, Solr, Spark, Druid and Slider.
Hive
• Includes HCatalog
ƒ Global metadata management layer that exposes Hive table metadata to
other Hadoop applications.
Hive is a data warehouse system built on top of Hadoop. Hive supports easy data summarization,
ad-hoc queries, and analysis of large data sets in Hadoop. For those who have some SQL
background, Hive is a great tool because it allows you to use a SQL-like syntax to access data that
is stored in HDFS. Hive also works well with other applications in the Hadoop ecosystem. It
includes HCatalog, which is a global metadata management layer that exposes the Hive table
metadata to other Hadoop applications.
Reference:
Hive documentation: https://hive.apache.org/
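A small sketch of querying Hive from Python through HiveServer2. It assumes the third-party PyHive package and a HiveServer2 instance on its default port 10000; the sales table is hypothetical:

from pyhive import hive   # pip install pyhive (assumed)

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# SQL-like HiveQL against data that is stored in HDFS.
cursor.execute("""
    SELECT product, SUM(amount) AS total
    FROM sales               -- hypothetical Hive table
    GROUP BY product
    ORDER BY total DESC
    LIMIT 10
""")
for product, total in cursor.fetchall():
    print(product, total)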
Pig
Another data access tool is Pig, which was written for analyzing large data sets. Pig has its own
language, called Pig Latin, whose purpose is to simplify MapReduce programming. Pig Latin is a
simple scripting language. After it is compiled, it becomes MapReduce jobs that run against Hadoop
data. The Pig system is able to optimize your code, so you as the developer can focus on the
semantics rather than efficiency.
Reference:
Pig documentation: http://pig.apache.org/
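To give a feel for Pig Latin, here is a sketch of a word-count script that is written to a file from Python and handed to the pig client. The HDFS paths and file names are placeholders:

import subprocess

# A small Pig Latin script: load lines, split them into words, group, and count.
script = """
lines  = LOAD '/user/student/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO '/user/student/wordcount_pig';
"""

with open("wordcount.pig", "w") as f:
    f.write(script)

# Pig compiles the script into MapReduce jobs and runs them on the cluster.
subprocess.run(["pig", "-f", "wordcount.pig"], check=True)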
HBase
Accumulo
• Features:
ƒ Server-side programming
ƒ Designed to scale
ƒ Cell-based access control
ƒ Stable
Accumulo is another key/value store, similar to HBase. You can think of Accumulo as a "highly
secure HBase". There are various features that provide a robust, scalable, data storage and
retrieval. It is also based on Google's BigTable, which again, is the same technology for HBase.
However, HBase is getting more features as it aligns closer to what the community needs. It is up to
you to evaluate your requirements and determine the best tool for your needs.
Reference:
Accumulo documentation: https://accumulo.apache.org/
Phoenix
• Apache Phoenix enables OLTP and operational analytics in Hadoop for
low latency applications by combining the best of both worlds:
ƒ The power of standard SQL and JDBC APIs with full ACID transaction
capabilities.
ƒ The flexibility of late-bound, schema-on-read capabilities from the NoSQL
world by leveraging HBase as its backing store.
• Fully integrated with other Hadoop products such as Spark, Hive, Pig,
and MapReduce
Phoenix enables online transaction processing (OLTP) and operational analytics in Hadoop for low latency
applications. Essentially, it is SQL for a NoSQL database. Recall that HBase is not designed for
transactional processing. Phoenix combines the best of the NoSQL data store and the need for
transactional processing. It is fully integrated with other Hadoop products such as Spark, Hive, Pig,
and MapReduce.
Reference:
Phoenix documentation: https://phoenix.apache.org/
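A minimal sketch of using the Phoenix SQL layer from Python. It assumes the third-party phoenixdb package and a Phoenix Query Server on its default port 8765; the users table is hypothetical:

import phoenixdb   # pip install phoenixdb (assumed)

conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cursor = conn.cursor()

# Standard SQL over HBase: create, upsert, and query a small table.
cursor.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name VARCHAR)")
cursor.execute("UPSERT INTO users VALUES (1, 'Linda')")
cursor.execute("SELECT id, name FROM users")
print(cursor.fetchall())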
Storm
• Apache Storm is an open source distributed real-time computation
system.
ƒ Fast
ƒ Scalable
ƒ Fault-tolerant
• Useful when milliseconds of latency matter and Spark isn't fast enough
ƒ Has been benchmarked at over a million tuples processed per second per
node
Storm is designed for real-time computation that is fast, scalable, and fault-tolerant. When you have
a use case to analyze streaming data, consider Storm as an option. There are numerous other
streaming tools available, such as Spark or even IBM Streams, proprietary software with decades
of research behind it for real-time analytics.
Solr
Solr is built by using the Apache Lucene search library. It is designed for full-text indexing and
searching. Solr powers the search of many large sites around the internet. It is highly reliable,
scalable, and fault tolerant, providing distributed indexing, replication, load-balanced querying,
automated failover and recovery, centralized configuration, and more.
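Searches are sent to Solr over a simple HTTP API. The Python sketch below assumes a Solr core named "documents" on the default port 8983 and uses the requests package; the field names are illustrative:

import requests

# Query the (hypothetical) "documents" core for anything that mentions Hadoop.
resp = requests.get(
    "http://localhost:8983/solr/documents/select",
    params={"q": "text:hadoop", "rows": 5, "wt": "json"},
)
for doc in resp.json()["response"]["docs"]:
    print(doc.get("id"), doc.get("title"))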
Spark
• Apache Spark is a fast and general engine for large-scale data
processing.
• Spark has a variety of advantages including:
ƒ Speed
í Run programs faster than MapReduce in memory
ƒ Easy to use
í Write apps quickly with Java, Scala, Python, R
ƒ Generality
í Can combine SQL, streaming, and complex analytics
ƒ Runs on a variety of environments and can access diverse data sources
í Hadoop, Mesos, standalone, cloud…
í HDFS, Cassandra, HBase, S3…
Spark is an in-memory processing engine where speed and scalability are the significant
advantage. A number of built-in libraries sit on top of the Spark core and take advantage of all Spark
capabilities: Spark ML, Spark's GraphX, Spark Streaming, Spark SQL, and DataFrames. The three
main languages that are supported by Spark are Scala, Python, and R. In most cases, Spark can
run programs faster than MapReduce can by using its in-memory architecture.
Reference:
Spark documentation: https://spark.apache.org/
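A short PySpark sketch of this kind of processing: it counts words from a file in HDFS, keeping the intermediate data in memory where possible. The input path is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a text file from HDFS as an RDD of lines and count the words.
lines = spark.sparkContext.textFile("hdfs:///user/student/input.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)

spark.stop()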
Druid
• Apache Druid is a high-performance, column-oriented, distributed data
store.
ƒ Interactive sub-second queries
í Unique architecture enables rapid multi-dimensional filtering, ad-hoc attribute
groupings, and extremely fast aggregations
ƒ Real-time streams
í Lock-free ingestion to allow for simultaneous ingestion and querying of high
dimensional, high volume data sets
í Explore events immediately after they occur
ƒ Horizontally scalable
ƒ Deploy anywhere
Druid is a data store designed for business intelligence (OLAP) queries. Druid provides real-time
data ingestion, query, and fast aggregations. It integrates with Apache Hive to build OLAP cubes
and run sub-second queries.
2.4. Data lifecycle and governance
Topics
• Hortonworks Data Platform overview
• Data flow
• Data access
• Data lifecycle and governance
• Security
• Operations
• Tools
• IBM added value components
In this section, you learn about some of the data lifecycle and governance tools that come with
HDP.
Falcon
• Framework for managing data life cycle in Hadoop clusters
Falcon is used for managing the data lifecycle in Hadoop clusters. An example use case is feed
management services, such as feed retention, replication across clusters for backups, and archival
of data.
Reference:
Falcon documentation: https://falcon.apache.org/
Atlas
• Apache Atlas is a scalable and extensible set of core foundational
governance services
ƒ Enables enterprises to effectively and efficiently meet their compliance
requirements within Hadoop
• Exchange metadata with other tools and processes within and outside
of Hadoop
ƒ Allows integration with the whole enterprise data ecosystem
• Atlas Features:
ƒ Data classification
ƒ Centralized auditing
ƒ Centralized lineage
ƒ Security and policy engine
Atlas enables enterprises to meet their compliance requirements within Hadoop. It provides
features for data classification, centralized auditing, centralized lineage, and security and policy
engine. It integrates with the whole enterprise data ecosystem.
Reference:
Atlas documentation: https://atlas.apache.org/
2.5. Security
Security
Topics
• Hortonworks Data Platform overview
• Data flow
• Data access
• Data lifecycle and governance
• Security
• Operations
• Tools
• IBM added value components
In this section you learn about some of the security tools that come with HDP.
Ranger
• Centralized security framework to enable, monitor, and manage
comprehensive data security across the Hadoop platform
• Using the Ranger console, you can manage policies for access to files, folders,
databases, tables, or columns with ease
Ranger is used to control data security across the entire Hadoop platform. The Ranger console can
manage policies for access to files, folders, databases, tables, and columns. The policies can be
set for individual users or groups.
Reference:
Ranger documentation: https://ranger.apache.org/
Knox
• REST API and Application Gateway for the Apache Hadoop Ecosystem
• Single access point for all REST interactions with Apache Hadoop
clusters
Knox is a gateway for the Hadoop ecosystem. It provides perimeter-level security for Hadoop. You
can think of Knox as the castle walls, with your Hadoop cluster inside the walls. Knox integrates
with SSO and identity management systems to simplify Hadoop security for users who access
cluster data and execute jobs.
Reference:
Knox documentation: https://knox.apache.org/
2.6. Operations
Operations
Topics
• Hortonworks Data Platform overview
• Data flow
• Data access
• Data lifecycle and governance
• Security
• Operations
• Tools
• IBM added value components
In this section you learn about some of the operations tools that come with HDP.
Ambari
You will grow to know your way around Ambari, as this is the central place to manage your entire
Hadoop cluster. Installation, provisioning, management, and monitoring of your Hadoop cluster is
done with Ambari. It also comes with some easy to use RESTful APIs, which allow application
developers to easily integrate Ambari with their own applications.
Reference:
Ambari documentation: https://ambari.apache.org/
Cloudbreak
• A tool for provisioning and managing Apache Hadoop clusters in the
cloud
Cloudbreak is a tool for managing clusters in the cloud. Cloudbreak is a Hortonworks project, and
is currently not a part of Apache. It automates the launch of clusters into various cloud infrastructure
platforms.
ZooKeeper
Oozie
• Oozie is a Java based workflow scheduler system to manage Apache
Hadoop jobs
Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie is integrated with the rest of
the Hadoop stack. Oozie workflow jobs are Directed Acyclical Graphs (DAGs) of actions. At the
heart of this is YARN.
Reference:
Oozie documentation: http://oozie.apache.org/
2.7. Tools
Tools
Topics
• Hortonworks Data Platform overview
• Data flow
• Data access
• Data lifecycle and governance
• Security
• Operations
• Tools
• IBM added value components
In this section you learn about some of the Tools that come with HDP.
Zeppelin
Zeppelin is a web-based notebook that is designed for data scientists to easily and quickly explore
data sets through collaborations. Notebooks can contain Spark SQL, SQL, Scala, Python, JDBC,
and more. Zeppelin allows for interaction and visualization of large data sets.
Reference:
Zeppelin documentation: https://zeppelin.apache.org/
Ambari Views
• Ambari web interface includes a built-in set of Views that are pre-
deployed for you to use with your cluster
• Includes views for Hive, Pig, Tez, Capacity Scheduler, File, HDFS
Ambari Views provides a built-in set of views for Hive, Pig, Tez, Capacity Scheduler, File, and HDFS,
which allows developers to monitor and manage the cluster. It also allows developers to create new
user interface components that plug in to the Ambari Web UI.
2.8. IBM added value components
Topics
• Hortonworks Data Platform overview
• Data flow
• Data access
• Data lifecycle and governance
• Security
• Operations
• Tools
• IBM added value components
• Big Replicate
• BigQuality
• BigIntegrate
• Big Match
The slide shows some of the added value components available from IBM. You learn about these
components next.
IBM Db2 Big SQL is a high performance massively parallel processing (MPP) SQL engine for
Hadoop that makes querying enterprise data from across the organization an easy and secure
experience. A Db2 Big SQL query can quickly access various data sources including HDFS,
RDBMS, NoSQL databases, object stores, and WebHDFS by using a single database connection
or single query for best-in-class analytic capabilities.
Reference:
Overview of Db2 Big SQL
https://www.ibm.com/support/knowledgecenter/SSCRJT_5.0.2/com.ibm.swg.im.bigsql.doc/doc/overview_icnav.html
Big Replicate
• IBM Big Replicate is an enterprise-class data replication software
platform
• Provides active-active data replication for Hadoop across supported
environments, distributions, and hybrid deployments
• Replicates data automatically with guaranteed consistency across
Hadoop clusters running on any distribution, cloud object storage, and
local and NFS mounted file systems
IBM Big Replicate is an enterprise-class data replication software platform that keeps data
consistent in a distributed environment, on premises and in the hybrid cloud, including SQL and
NoSQL databases. This data replication tool is powered by a high-performance coordination engine
that uses consensus to keep unstructured data accessible, accurate, and consistent in different
locations. The real-time data replication technology is noninvasive. It moves big data operations
from lab environments to production environments, across multiple Hadoop distributions, and from
on-premises to cloud environments, with minimal downtime or disruption.
Reference:
https://www.ibm.com/products/big-replicate
Link to video: https://www.youtube.com/watch?v=MXVt-ytm_Ts
• You can profile, validate, cleanse, transform, and integrate your big data
on Hadoop, an open source framework that can manage large volumes
of structured and unstructured data.
Information Server is a platform for data integration, data quality, and governance that is unified by
a common metadata layer and scalable architecture. This means more reuse, better productivity,
and the ability to leverage massively scalable architectures like MPP, GRID, and Hadoop clusters.
Reference:
Overview of InfoSphere Information Server on Hadoop
https://www.ibm.com/support/knowledgecenter/en/SSZJPZ_11.7.0/com.ibm.swg.im.iis.ishadoop.doc/topics/overview.html
Figure 2-51. Information Server - BigIntegrate: Ingest, transform, process, and deliver any data into and within Hadoop
IBM BigIntegrate is a big data integration solution that provides superior connectivity, fast
transformation, and reliable, easy-to-use data delivery features that execute on the data nodes of a
Hadoop cluster. IBM BigIntegrate provides a flexible and scalable platform to transform and
integrate your Hadoop data.
After you have data sources that are understood and cleansed, the data must be transformed into a
usable format for the warehouse and delivered in a timely fashion whether in batch, real-time, or
SOA architectures. All warehouse projects require data integration – how else will the many
enterprise data sources make their way into the warehouse? Hand-coding is not a scalable option.
Increase developer efficiency
• Top down design – Highly visual development environment
• Enhanced collaboration through design asset reuse
High performance delivery with flexible deployments
• Support for multiple delivery styles: ETL, ELT, Change Data Capture, SOA integration, etc.
• High-performance, parallel engine
Rapid integration
• Pre-built connectivity
• Balanced optimization
• Multiple user configuration options
• Job parameter available for all options
• Powerful logging and tracing
BigIntegrate is built for the simple to the most sophisticated data transformations.
Think about simple transformations, such as calculating total values. These are the most basic
transformations across data, like those you do with a spreadsheet or calculator. Then, imagine
more complex transformations, such as providing a lookup to an automated loan system where the
loan qualification date determines the interest rate for that time of day, based on a lookup to an
ever-changing system.
These are the types of transformations organizations are doing every day and they require an easy
to use canvas that allows you to design as you think. This is exactly what BigIntegrate has been
built to do.
Uempty
Figure 2-52. Information Server - BigQuality: Analyze, cleanse, and monitor your big data
IBM BigQuality provides a massively scalable engine to analyze, cleanse, and monitor data.
Analysis discovers patterns, frequencies, and sensitive data that is critical to the business – the
content, quality, and structure of data at rest. While a robust user interface is provided, the process
can be completely automated.
Cleanse uses powerful out-of-the-box (and completely customizable) routines to investigate,
standardize, match, and survive free-format data. For example, it can recognize that William Smith
and Bill Smith are the same person, or that BLK really means Black in some contexts.
Monitor measures the content, quality, and structure of data in flight to make operational
decisions about the data. For example, exceptions can be sent to a full workflow engine called the
Stewardship Center, where people can collaborate on the issues.
Uempty
IBM InfoSphere Big Match for Hadoop helps you analyze massive volumes of structured and
unstructured data to gain deeper customer insights. It can enable fast, efficient linking of data from
multiple sources to provide complete and accurate customer information, without the risks of
moving data from source to source. The solution supports platforms that run Apache Hadoop such
as Cloudera.
Features:
• Matching algorithms: Uses statistical learning algorithms and a probabilistic matching engine
that run natively within Hadoop for fast and more accurate customer data matching.
• Fast processing and deployment: Provides configurable prebuilt algorithms and templates to
help you deploy in hours instead of spending weeks or months developing code. Uses
distributed processing to accelerate matching of big data volumes.
• API support: Provides support for Java and REST-based APIs, which can be used by third-party
applications.
• Searching and export capabilities: Provides search functions, as well as export (with entity ID)
and extract capabilities to allow data to be consumed by downstream systems.
Uempty
• Apache Spark support: Provides Spark-based utilities and visualization to further enable
analysis of results. Spark’s advanced analytics and data science capabilities include near
real-time streaming through micro batch processing and graph computation analysis.
Uempty
Unit summary
• Described the functions and features of HDP.
• Listed the IBM added value components.
• Described the purpose and benefits of each added value component.
Uempty
Review questions
1. Which of these components of HDP provides data access
capabilities?
A. MapReduce
B. Falcon
C. Ranger
D. Ambari
2. Identify the component that is a messaging system used for
real-time data pipelines
A. Nifi
B. Sqoop
C. Kafka
D. None of the above
3. True or False: Big Match is added value from IBM.
Uempty
Review questions
5. IBM BigQuality provides a scalable engine to
A. Manage
B. Design
C. Connect
D. Cleanse
5) D
Uempty
Review answers
1. Which of these components of HDP provides data access
capabilities?
A. MapReduce
B. Falcon
C. Ranger
D. Ambari
2. Identify the component that is a messaging system used for
real-time data pipelines
A. Nifi
B. Sqoop
C. Kafka
D. None of the above
3. True or False: Big Match is added value from IBM.
Uempty
Review answers
5. IBM BigQuality provides a scalable engine to
A. Manage
B. Design
C. Connect
D. Cleanse
Uempty
Exercise objectives
• Access the VM that you use for the exercises in this course
Uempty
Overview
In this unit, you learn about Apache Ambari, which is an open framework for provisioning,
managing, and monitoring Apache Hadoop clusters. Ambari is part of Hortonworks Data Platform
(HDP).
Uempty
Unit objectives
• Explain the purpose of Apache Ambari in the Hortonworks Data
Platform (HDP) stack.
• Describe the overall architecture of Apache Ambari and its relationship
to other services and components of a Hadoop cluster.
• List the functions of the main components of Apache Ambari.
• Explain how to start and stop services with the Apache Ambari Web UI.
Uempty
3.1. Apache Ambari overview
Uempty
Topics
• Apache Ambari overview
• Apache Ambari Web UI
• Apache Ambari command-line interface (CLI)
• Apache Ambari basic terms
Uempty
(The slide shows the HDP stack diagram: governance, integration, and data workflow tools such as Falcon, Atlas, Oozie, and Sqoop; data access engines for batch, script, SQL, NoSQL, stream, search, in-memory, and other workloads on top of HDFS; security components such as Knox and encryption; and operations tools such as Cloudbreak and ZooKeeper.)
In this section, you learn about Apache Ambari, which is one of the operations tools that comes with
HDP.
Uempty
Apache Ambari
The Apache Ambari project is aimed at making Hadoop management simpler by developing
software for provisioning, managing, and monitoring Apache Hadoop clusters. Apache Ambari
provides an intuitive Hadoop management web user interface (UI) that is backed by its RESTful
APIs.
Uempty
Reference:
Apache Ambari documentation Wiki
For more information, see the Apache Ambari wiki at
https://cwiki.apache.org/confluence/display/AMBARI/Ambari
Uempty
3.2. Apache Ambari Web UI
Uempty
Topics
• Apache Ambari overview
• Apache Ambari Web UI
• Apache Ambari command-line interface (CLI)
• Apache Ambari basic terms
Uempty
There are two Apache Ambari interfaces to the outside world (through the firewall around the
Hadoop cluster):
• The Apache Ambari Web UI, which is the interface that you review and use in this unit.
• A RESTful API that enables custom applications and programs to talk to Apache Ambari.
Uempty
This slide shows a view of the dashboard within the Apache Ambari web interface. Notice the
various components on the standard dashboard configuration. The typical items here include:
• HDFS Disk Usage: 1%.
• DataNodes Live: 1/1. There is one data node here, and it is live and running.
The system that is shown here has been up 13.7 hours (see NameNode Uptime).
Uempty
When you hover your cursor over an individual entry, you get the detailed metrics of the
component.
The CPU Usage metric detail is shown on the next slide.
Uempty
For other components, such as CPU Usage, you might not be as interested in the instantaneous
metric value as you are with current disk usage, but you might be interested in the metric over a
recent period.
Uempty
Apache Ambari is intended to be the nexus for monitoring the performance of the Hadoop cluster,
and the nexus for generic and specific alerts and health checks.
The Ambari Metrics System (AMS) collects, aggregates, and serves Hadoop and system metrics in
Apache Ambari-managed clusters.
Uempty
When you add or start a service, the action takes place in background mode so that you can
continue to perform other operations while the requested change runs.
You can view the current background operations (blue), completed successful operations (green),
and terminated failed operations (red) in the Background Service Check window.
Uempty
A currently failed service is shown with a red triangle at the left side of the display. The service is
HBase.
Uempty
3.3. Apache Ambari command-line interface
(CLI)
Uempty
Topics
• Apache Ambari overview
• Apache Ambari Web UI
• Apache Ambari command-line interface (CLI)
• Apache Ambari basic terms
Uempty
• To undeploy services that are not being used, run the following commands.
(Services should be stopped manually before removing them. Sometimes stopping services
from Apache Ambari might not stop some of the subcomponents, so make sure that you stop
them too. A sketch of stopping a service through the REST API follows this list.)
curl -u user:password -H "X-Requested-By: ambari" -X DELETE http://localhost:8080/api/v1/clusters/BI4_QSV/services/FLUME
curl -u user:password -H "X-Requested-By: ambari" -X DELETE http://localhost:8080/api/v1/clusters/BI4_QSV/services/SLIDER
curl -u user:password -H "X-Requested-By: ambari" -X DELETE http://localhost:8080/api/v1/clusters/BI4_QSV/services/SOLR
• An Apache Ambari shell (with prompt) is available. For more information, see the
following website:
https://cwiki.apache.org/confluence/display/AMBARI/Ambari+Shell
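Stopping a service can also be done through the same REST API by setting its desired state to INSTALLED (stopped). The following is a minimal sketch, assuming the same Ambari host, credentials, and cluster name (BI4_QSV) as the DELETE commands above; adjust them to your environment:
# Stop the FLUME service before undeploying it.
curl -u user:password -H "X-Requested-By: ambari" -X PUT \
  -d '{"RequestInfo":{"context":"Stop FLUME before removal"},"Body":{"ServiceInfo":{"state":"INSTALLED"}}}' \
  http://localhost:8080/api/v1/clusters/BI4_QSV/services/FLUME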
Uempty
Here is an example interaction with the CLI. The hosts are in a Hadoop Cluster, and the results are
returned in the JSON format.
curl -i -u username:password http://rvm.svl.ibm.com:8080/api/v1/hosts
HTTP/1.1 200 OK
Set-Cookie: AMBARISESSIONID=1n8t0nb6perytxj09ju3las9z;Path=/
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Content-Type: text/plain
Content-Length: 265
Server: Jetty(7.6.7.v20120910)
{
"href" : "http://rvm.svl.ibm.com:8080/api/v1/hosts",
"items" : [
{
"href" : "http://rvm.svl.ibm.com:8080/api/v1/hosts/rvm.svl.ibm.com",
"Hosts" : {
"cluster_name" : "BI4_QSE",
"host_name" : "rvm.svl.ibm.com"
}
}
]
}
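The API also supports partial responses through the fields query parameter. As a hedged example (assuming the same host, port, credentials, and the BI4_QSE cluster shown above), the current state of the HDFS service can be retrieved as follows:
# Return only the state of the HDFS service (for example, STARTED or INSTALLED).
curl -u username:password \
  "http://rvm.svl.ibm.com:8080/api/v1/clusters/BI4_QSE/services/HDFS?fields=ServiceInfo/state"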
You can start or restart the Ambari Server by running the following command:
[root@rvm ~]# ambari-server restart
Using python /usr/bin/python2.6
Restarting ambari-server
Using python /usr/bin/python2.6
Stopping ambari-server
Ambari Server stopped
Using python /usr/bin/python2.6
Starting ambari-server
Ambari Server running with 'root' privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start....................
Ambari Server 'start' completed successfully.
Apache Ambari has a Python shell. For more information about this shell, see the following website:
https://cwiki.apache.org/confluence/display/AMBARI/Ambari+python+Shell
References:
• Ambari Documentation Suite
https://docs.cloudera.com/HDPDocuments/Ambari-1.7.0.0/Ambari_Doc_Suite/ADS_v170.html#ref-549d17a4-c274-4905-82f4-5ed9cfbfbea8
• Pivotal documentation
https://hdb.docs.pivotal.io/230/hawq/admin/ambari-rest-api.html#ambari-rest-ex-mgmt
Uempty
3.4. Apache Ambari basic terms
Uempty
Topics
• Apache Ambari overview
• Apache Ambari Web UI
• Apache Ambari command-line interface (CLI)
• Apache Ambari basic terms
Uempty
• Service: Service refers to services in the Hadoop stack. HDFS, HBase, and Pig are examples of
services. A service might have multiple components (for example, HDFS has NameNode,
Secondary NameNode, and DataNode). A service can be a client library (for example, Pig does
not have any daemon services, but has a client library).
• Component: A service consists of one or more components. For example, HDFS has three
components: NameNode, DataNode, and Secondary NameNode. Components can be
optional. A component can span multiple nodes (for example, DataNode instances on multiple
nodes).
• Host/Node: Node refers to a machine in the cluster. Node and host are used interchangeably in
this document.
• Node-Component: Node-component refers to an instance of a component on a particular node.
For example, a particular DataNode instance on a particular node is a node-component.
• Operation: An operation refers to a set of changes or actions that are performed on a cluster to
satisfy a user request or achieve a state change in the cluster. For example, starting a service is
an operation, and running a smoke test is an operation.
If a user requests to add a service to the cluster that includes running a smoke test too, then the
entire set of actions to meet the user request constitutes an operation. An operation can consist
of multiple "actions" that are ordered.
Uempty
• Task: A task is the unit of work that is sent to a node to run. A task is the work that a node must
do as part of an action. For example, an "action" can consist of installing a DataNode on Node
n1 and installing a DataNode and a secondary NameNode on Node n2. In this case, the "task"
for n1 is to install a DataNode, and the "tasks" for n2 are to install both a DataNode and a
secondary NameNode.
• Stage: A stage refers to a set of tasks that are required to complete operations that are
independent of each other. All tasks in the same stage can be run across different nodes in
parallel.
• Action: An action consists of a task or tasks on a machine or a group of machines. Each action
is tracked by an action ID, and nodes report the status at least at the granularity of the action.
An action can be considered a stage under execution. In this document, a stage and an action
have one-to-one correspondence unless specified otherwise. An action ID is a bijection of
request-id, stage-id.
• Stage Plan: An operation typically consists of multiple tasks on various machines, and they
usually have dependencies requiring them to run in a particular order. Some tasks are required
to complete before others can be scheduled. Therefore, the tasks that are required for an
operation can be divided into various stages where each stage must be completed before the
next stage, but all the tasks in the same stage can be scheduled in parallel across different
nodes.
• Manifest: A manifest refers to the definition of a task that is sent to a node for running. The
manifest must define the task and be serializable. A manifest can also be persisted on disk for
recovery or record.
• Role: A role maps to either a component (for example, NameNode or DataNode) or an action
(for example, HDFS rebalancing, HBase smoke test, or other admin commands).
Reference:
Apache Ambari: http://ambari.apache.org
Uempty
Unit summary
• Explained the purpose of Apache Ambari in the HDP stack.
• Described the overall architecture of Apache Ambari and its relationship to
other services and components of a Hadoop cluster.
• Listed the functions of the main components of Apache Ambari.
• Explained how to start and stop services with the Apache Ambari Web UI.
Uempty
Review questions
1. True or False: Apache Ambari is backed by RESTful APIs for
developers to easily integrate with their own applications.
2. Which functions does AMS provide?
A. Monitors the health and status of the Hadoop cluster.
B. Starts, stops, and reconfigures Hadoop services across the
cluster.
C. Collects, aggregates, and serves Hadoop and system metrics.
D. Handles the configuration of Hadoop services for the cluster.
3. Which page from the Apache Ambari Web UI enables you to
check the versions of the software that is installed on your
cluster?
A. Cluster Admin > Stack and Versions.
B. admin > Service Accounts.
C. Services.
D. Hosts.
Uempty
Review answers
1. True or False: Apache Ambari is backed by RESTful APIs for
developers to easily integrate with their own applications.
2. Which functions does AMS provide?
A. Monitors the health and status of the Hadoop cluster.
B. Starts, stops, and reconfigures Hadoop services across the cluster.
C. Collects, aggregates, and serves Hadoop and system metrics.
D. Handles the configuration of Hadoop services for the cluster.
3. Which page from the Apache Ambari UI enables you to check the
versions of the software that is installed on your cluster?
A. Cluster Admin > Stack and Versions
B. admin > Service Accounts
C. Services
D. Hosts
Uempty
Exercise objectives
This exercise introduces you to Apache Ambari Web UI. After
completing this exercise, you will be able to do the following tasks:
• Manage Hadoop clusters with Apache Ambari.
• Explore services, hosts, and alerts with the Ambari Web UI.
• Use Ambari REST APIs.
Uempty
Overview
This unit explains the underlying technologies that are important to solving the big data challenges
with focus on Hadoop Distributed File System (HDFS).
Uempty
Unit objectives
• Explain the need for a big data strategy and the importance of parallel
reading of large data files and internode network speed in a cluster.
• Describe the nature of the Hadoop Distributed File System (HDFS).
• Explain the function of NameNode (NN) and DataNode in a Hadoop
cluster.
• Explain how files are stored and blocks (splits) are replicated.
Uempty
4.1. Apache Hadoop: Summary and recap
Uempty
Topics
• Apache Hadoop: Summary and recap
• Introduction to Hadoop Distributed File System
• Managing a Hadoop Distributed File System cluster
Uempty
Hadoop is an open source project for developing software for reliable, scalable, and distributed
computing for projects like big data.
It is designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Instead of relying on hardware for high availability (HA), Hadoop is
designed to detect and handle failures at the application layer. This approach delivers a HA service
on top of a cluster of computers, each of which might be prone to failure.
Hadoop is a series of related projects with the following modules at its core:
• Hadoop Common: The common utilities that support the other Hadoop modules.
• HDFS: A powerful distributed file system that provides high-throughput access to application
data. The idea is to distribute the processing of large data sets over clusters of inexpensive
computers.
• Hadoop YARN: A framework for job scheduling and the management of cluster resources.
• Hadoop MapReduce: This core component is a YARN-based system that you use to distribute
a large data set over a series of computers for parallel processing.
• Hadoop Ozone: An object store for Hadoop.
Uempty
The Hadoop framework is written in Java and originally developed by Doug Cutting, who named it
after his son's toy elephant. Hadoop uses concepts from Google’s MapReduce and Google File
System (GFS) technologies at its foundation.
Hadoop is optimized to handle massive amounts of data, which might be structured, unstructured,
or semi-structured, by using commodity hardware, that is, relatively inexpensive computers. This
massive parallel processing is done with great performance. In its initial conception, it is a batch
operation that handles massive amounts of data, so the response time is not instantaneous.
Hadoop is not used for online transactional processing (OLTP) or online analytical processing
(OLAP), but for big data. It complements OLTP and OLAP to manage data. So, Hadoop is not a
replacement for a relational database management system (RDBMS).
References:
• Apache Hadoop:
http://hadoop.apache.org/
• What is Hadoop, and how does it relate to cloud:
https://www.ibm.com/blogs/cloud-computing/2014/05/07/hadoop-relate-cloud/
Uempty
companies.
• Hadoop-related projects: HBase, Apache Hive, Apache Pig, Apache Avro, Apache Sqoop,
Apache Oozie, Apache ZooKeeper, and Apache Spark.
(The slide diagram maps each project to its role: Apache Oozie for workflow and scheduling, Apache Chukwa for monitoring, Apache ZooKeeper for coordination and cluster management, Apache Hive for query/SQL, Apache Pig for data flow, Apache Avro for data serialization/RPC, Apache Sqoop as the RDBMS connector for data integration, MapReduce for distributed processing, and YARN for cluster and resource management.)
Most of the services that are available in the Hadoop infrastructure supplement the core
components of Hadoop, which include HDFS, YARN, MapReduce, and Hadoop Common. The
Hadoop infrastructure includes both Apache open source projects and other commercial tools and
solutions.
The slide shows some examples of Hadoop-related projects at Apache. Apart from the components
that are listed in the slide, there are many other components that are part of the Hadoop
infrastructure. The components in the slide are just an example.
• HBase
A scalable and distributed database that supports structured data storage for large tables. It is
used for random, real-time read/write access to big data. The goal of HBase is to host large
tables.
• Apache Hive
A data warehouse infrastructure that provides data summarization and ad hoc querying.
Apache Hive facilitates reading, writing, and managing large data sets that are in distributed
storage by using SQL.
Uempty
• Apache Pig
A high-level data flow language and execution framework for parallel computation. Apache Pig
is a platform for analyzing large data sets. Apache Pig consists of a high-level language for
expressing data analysis programs that is coupled with an infrastructure for evaluating these
programs.
• Apache Avro
A data serialization system.
• Apache Sqoop
A tool that is designed for efficiently transferring bulk data between Apache Hadoop and
structured data stores, such as relational databases.
• Apache Oozie
A workflow scheduler system to manage Apache Hadoop jobs.
• Apache ZooKeeper
A high-performance coordination service for distributed applications. Apache ZooKeeper is a
centralized service for maintaining configuration information, naming, providing distributed
synchronization, and providing group services. Distributed applications use these kinds of
services.
• Apache Chukwa
A data collection system for managing a large distributed system. It includes a toolkit for
displaying, monitoring, and analyzing results to make the best use of the collected data.
• Apache Ambari
A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It
includes support for Hadoop HDFS, Hadoop MapReduce, Apache Hive, HCatalog, HBase,
Apache ZooKeeper, Apache Oozie, Apache Pig, and Apache Sqoop. Apache Ambari also
provides a dashboard for viewing cluster health, such as heatmaps. With the dashboard, you
can visualize MapReduce, Apache Pig, and Apache Hive applications along with features to
diagnose their performance characteristics.
• Apache Spark
A fast and general compute engine for Hadoop data. Apache Spark provides a simple
programming model that supports a wide range of applications, including ETL, machine
learning, stream processing, and graph computation.
Uempty
References:
• https://www.coursera.org/learn/hadoop/lecture/E87sw/hadoop-ecosystem-major-components
• The Hadoop Ecosystem Table
https://hadoopecosystemtable.github.io/
• Apache Hadoop
https://hadoop.apache.org/
• Apache HBase
https://hbase.apache.org/
• Apache Hive
https://hive.apache.org/
• Apache Pig
https://pig.apache.org/
• Apache Avro
https://avro.apache.org/docs/current/
• Apache Sqoop
https://sqoop.apache.org/
• Apache Oozie
https://oozie.apache.org/
• Apache ZooKeeper
https://zookeeper.apache.org/
• Apache Chukwa
https://chukwa.apache.org/
• Apache Ambari
https://ambari.apache.org/
Uempty
5. Hadoop is omnipresent: There is no industry that big data has not reached. Big data covers
almost all domains, such as healthcare, retail, government, banking, media, transportation,
and natural resources. People are increasingly becoming data aware, which means that they
are realizing the power of data. Hadoop is a framework that can harness this power of data to
improve the business.
6. A maturing technology: Hadoop is evolving with time.
Reference:
https://data-flair.training/blogs/hadoop-history/
Uempty
Hadoop cannot resolve all data-related problems. It is designed to handle big data. Hadoop works
better when handling one single huge file rather than many small files. Hadoop complements
existing RDBMS technology.
Low latency means that the delay between an input being processed and the corresponding
output appearing is small enough to be unnoticeable, which provides real-time characteristics. Low
latency can be especially important for internet services such as online gaming and VoIP. Source:
https://wiki2.org/en/Low_latency.
Hadoop is not good for low-latency data access. In practice, you can read latency as delay: low
latency means a negligible delay in processing, and low-latency data access means a negligible
delay in accessing data. Hadoop is not designed for low latency.
Hadoop works best with large files. The larger the file, the less time Hadoop spends seeking for the
next data location on disk and the more time Hadoop runs at the limit of the bandwidth of your
disks. Seeks are expensive operations that are useful when you must analyze only a small subset
of your data set. Because Hadoop is designed to run over your entire data set, it is best to minimize
seek time by using large files.
Hadoop is good for applications requiring a high throughput of data. Clustered machines can read
data in parallel for high throughput.
Uempty
4.2. Introduction to Hadoop Distributed File
System
Uempty
Introduction to Hadoop
Distributed File System
Uempty
Topics
• Apache Hadoop: Summary and recap
• Introduction to Hadoop Distributed File System
• Managing a Hadoop Distributed File System cluster
Uempty
Introduction to HDFS
• HDFS is an Apache Software Foundation (ASF) project and a
subproject of the Apache Hadoop project.
• HDFS is a Hadoop file system that is designed for storing large files
running on a cluster of commodity hardware.
• Hadoop HDFS provides a fault-tolerant storage layer for Hadoop and its
other components.
• HDFS rigorously restricts data writing to one writer at a time. Bytes are
always appended to the end of a stream, and byte streams are
guaranteed to be stored in the order that they are written.
HDFS is an Apache Software Foundation (ASF) project and a subproject of the Apache Hadoop
project. Hadoop is ideal for storing large amounts of data, like terabytes and petabytes, and uses
HDFS as its storage system. HDFS lets you connect nodes (commodity personal computers) that
are contained within clusters over which data files are distributed. Then, you can access and store
the data files as one seamless file system. Access to the data files is handled in a streaming
manner, which means that applications or commands are run directly by using the MapReduce
processing model.
HDFS is fault-tolerant and provides high-throughput access to large data sets for Hadoop and its
components.
HDFS has many similarities with other distributed file systems, but is different in several respects.
One noticeable difference is the HDFS write-once-read-many model that relaxes concurrency
control requirements, simplifies data coherency, and enables high-throughput access.
Another unique attribute of HDFS is the viewpoint that it is better to locate processing logic near the
data rather than moving the data to the application space.
HDFS rigorously restricts data writing to one writer at a time. Bytes are always appended to the end
of a stream, and byte streams are guaranteed to be stored in the order that they are written.
Reference:
https://www.ibm.com/developerworks/library/wa-introhdfs/index.html
Uempty
HDFS goals
• Fault tolerance by detecting faults and applying quick and automatic
recovery
• Data access by using MapReduce streaming
• Simple and robust coherency model
• Processing logic close to the data rather than the data close to the
processing logic
• Portability across heterogeneous commodity hardware and operating
systems
• Scalability to reliably store and process large amounts of data
• Economy by distributing data and processing across clusters of
commodity personal computers
• Efficiency by distributing data and the logic to process it in parallel on
the nodes where the data is located
• Reliability by automatically maintaining multiple copies of data and
automatically redeploying processing logic in the event of failures
Uempty
• Driving principles:
ƒ Data is stored across the entire cluster.
ƒ Programs are brought to the data, not the data to the program.
• Data is stored across the entire cluster (the Distributed File System
(DFS)):
ƒ The entire cluster participates in the file system.
ƒ Blocks of a single file are distributed across the cluster.
ƒ A given block is typically replicated for resiliency.
(The slide diagram shows a logical file split into blocks 1 - 4; the blocks are distributed across the cluster, and each block is replicated to three different nodes.)
The driving principle of MapReduce is a simple one: Spread the data out across a huge cluster of
machines and then, rather than bringing the data to your programs as you do in a traditional
programming, write your program in a specific way that allows the program to be moved to the data.
Thus, the entire cluster is made available in both reading and processing the data.
The Distributed File System (DFS) is at the heart of MapReduce. It is responsible for spreading
data across the cluster by making the entire cluster look like one large file system. When a file is
written to the cluster, blocks of the file are spread out and replicated across the whole cluster (in the
slide, notice that every block of the file is replicated to three different machines).
Adding more nodes to the cluster instantly adds capacity to the file system and automatically
increases the available processing power and parallelism.
Uempty
HDFS architecture
• Master/Worker architecture.
• Master: NameNode (NN):
ƒ Manages the file system namespace and metadata (FsImage and Edits Log).
ƒ Regulates client access to files.
• Worker: DataNode:
ƒ Many per cluster.
ƒ Manages storage that is attached to the nodes.
ƒ Periodically reports its status to the NN.
(The slide diagram shows the NameNode tracking a file's blocks a, b, c, and d, which are replicated across the DataNodes.)
The entire file system namespace, including the mapping of blocks to files and file system
properties, is stored in a file that is called the FsImage. The FsImage is stored as a file in the NN's
local file system. It contains the metadata on disk (not an exact copy of what is in RAM, but a
checkpoint copy).
The NN uses a transaction log called the EditLog (or Edits Log) to persistently record every change
that occurs to file system metadata, and it synchronizes with metadata in RAM after each write.
The stand-alone HDFS Cluster needs one NN. The NN can be a potential single point of failure
(this situation is resolved in later releases of HDFS with a Secondary NN, various forms of HA, and
in Hadoop v2 with NN federation and HA as standard options). To avoid a single point of failure, do
the following tasks:
• Use better quality hardware for all management nodes, and do not use inexpensive commodity
hardware for the NN.
• Mitigate by backing up to other storage.
In a power failure on NN, recovery is performed by using the FsImage and the EditLog.
Reference:
https://www.ibm.com/developerworks/library/wa-introhdfs/index.html
Uempty
HDFS blocks
• HDFS is designed to support large files.
• Each file is split into blocks. The Hadoop default is 128 MB.
• Blocks are on different physical DataNodes.
• Behind the scenes, each HDFS block is supported by multiple operating
system blocks.
• All the blocks of a file are of the same size except the last one (if the file
size is not a multiple of 128 MB). For example, a 612 MB file is split into
four 128 MB blocks plus one final 100 MB block.
Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed
throughout the cluster. In this way, the map and reduce functions can be run on smaller subsets of
your larger data sets, which provide the scalability that is needed for big data processing.
The current default setting for Hadoop/HDFS is 128 MB.
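To check the configured block size, or to write a single file with a non-default block size, commands like the following can be used. This is a sketch: the file and directory names are placeholders.
# Show the configured default block size in bytes (134217728 bytes = 128 MB).
hdfs getconf -confKey dfs.blocksize
# Write one file with a 256 MB block size without changing the cluster-wide default.
hdfs dfs -D dfs.blocksize=268435456 -put bigfile.dat /user/virtuser/bigfile.dat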
References:
• https://hortonworks.com/apache/hdfs
• http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
• https://data-flair.training/blogs/data-block/
Uempty
HDFS replicates file blocks for fault tolerance. An application can specify the number of replicas of
a file at the time it is created, and this number can be changed anytime afterward. The NN makes
all decisions concerning block replication.
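For example, the replication factor of existing files can be changed with the setrep command. A short sketch with placeholder paths:
# Set the replication factor of a file to 2 and wait (-w) until re-replication completes.
hdfs dfs -setrep -w 2 /user/virtuser/myfile.txt
# Set the replication factor to 3 for every file under a directory.
hdfs dfs -setrep 3 /user/virtuser/data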
HDFS uses an intelligent replica placement model for reliability and performance. Optimizing
replica placement makes HDFS unique from most other distributed file systems, and it is facilitated
by a rack-aware replica placement policy that uses network bandwidth efficiently.
Large HDFS environments typically operate across multiple installations of computers.
Communication between two data nodes in different installations is typically slower than data nodes
within the same installation. Therefore, the NN attempts to optimize communications between data
nodes. The NN identifies the location of data nodes by their rack IDs.
Rack awareness
Typically, large HDFS clusters are arranged across multiple installations (racks). Network traffic
between different nodes within the same installation is more efficient than network traffic across
installations. An NN tries to place replicas of a block on multiple installations for improved fault
tolerance. However, HDFS allows administrators to decide on which installation a node belongs.
Therefore, each node knows its rack ID, making it rack-aware.
Uempty
References:
An introduction to the Hadoop Distributed File System:
https://www.ibm.com/developerworks/library/wa-introhdfs/index.html
NameNode and DataNodes:
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.7.4/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Replication
Uempty
For small clusters in which all servers are connected by a single switch, there are only two levels of
locality: on-machine and off-machine. When loading data from a DataNode's local drive into HDFS,
the NN schedules one copy to go into the local DataNode and picks two other machines at random
from the cluster.
For larger Hadoop installations that span multiple racks, ensure that replicas of data exist on
multiple racks so that the loss of a switch does not render portions of the data unavailable due to all
the replicas being underneath it.
HDFS can be made rack-aware by using a script that allows the master node to map the network
topology of the cluster. Although alternative configuration strategies can be used, the default
implementation allows you to provide an executable script that returns the rack address of each of
a list of IP addresses. The network topology script receives as arguments one or more IP
addresses of the nodes in the cluster. It returns the standard output of a list of rack names, one for
each input. The input and output order must be consistent.
Uempty
To set the rack-mapping script, specify the key topology.script.file.name in conf/hadoop-site.xml,
which provides a command to run to return a rack ID. (It must be an executable script or program.)
By default, Hadoop attempts to send a set of IP addresses to the script as several separate
command-line arguments. You can control the maximum acceptable number of arguments by using
the topology.script.number.args key.
Rack IDs in Hadoop are hierarchical and look like path names. By default, every node has a rack ID
of /default-rack. You can set rack IDs for nodes to any arbitrary path, such as /foo/bar-rack. Path
elements further to the left are higher up the tree, so a reasonable structure for a large installation
might be /top-switch-name/rack-name.
The following example script performs rack identification based on IP addresses with a hierarchical
IP addressing scheme that is enforced by the network administrator. This script can work directly
for simple installations; more complex network configurations might require a file- or table-based
lookup process. Be careful to keep the table up to date because nodes are physically relocated.
This script requires that the maximum number of arguments be set to 1.
Uempty
#!/bin/bash
# Set rack ID based on IP address.
# Assumes network administrator has complete control
# over IP addresses assigned to nodes and they are
# in the 10.x.y.z address space. Assumes that
# IP addresses are distributed hierarchically. e.g.,
# 10.1.y.z is one data center segment and 10.2.y.z is another;
# 10.1.1.z is one rack, 10.1.2.z is another rack in
# the same segment, etc.)
#
# This is invoked with an IP address as its only argument
# get the IP address from the first (and only) command-line argument
ipaddr=$1
# select "x.y" and convert it to "x/y"
segments=`echo $ipaddr | cut --delimiter=. --fields=2-3 --output-delimiter=/`
echo /${segments}
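For illustration, if the script above were saved as rack-id.sh (a hypothetical file name), it could be tested directly from the command line:
# Make the script executable and map an IP address to a rack ID.
chmod +x rack-id.sh
./rack-id.sh 10.1.2.15
# Expected output: /1/2 (fields 2 and 3 of the address become the rack path)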
A more complex rack-aware script:
File name: rack-topology.sh
#!/bin/bash
# Adjust/Add the property "net.topology.script.file.name"
# to core-site.xml with the "absolute" path this
# file. ENSURE the file is "executable".
# Supply appropriate rack prefix
RACK_PREFIX=default
# To test, supply a hostname as script input:
if [ $# -gt 0 ]; then
CTL_FILE=${CTL_FILE:-"rack_topology.data"}
HADOOP_CONF=${HADOOP_CONF:-"/etc/hadoop/conf"}
if [ ! -f ${HADOOP_CONF}/${CTL_FILE} ]; then
echo -n "/$RACK_PREFIX/rack "
exit 0
fi
while [ $# -gt 0 ] ; do
nodeArg=$1
exec< ${HADOOP_CONF}/${CTL_FILE}
result=""
while read line ; do
ar=( $line )
if [ "${ar[0]}" = "$nodeArg" ] ; then
result="${ar[1]}"
fi
done
shift
if [ -z "$result" ] ; then
echo -n "/$RACK_PREFIX/rack "
else
echo -n "/$RACK_PREFIX/rack_$result "
fi
done
else
echo -n "/$RACK_PREFIX/rack "
fi
Here is a sample topology data file:
File name: rack_topology.data
# This file should be:
# - Placed in the /etc/hadoop/conf directory
# - On the NameNode (and backups IE: HA, Failover, etc)
# - On the Job Tracker OR Resource Manager (and any Failover JT's/RM's)
# This file should be placed in the /etc/hadoop/conf directory.
# Add Hostnames to this file. Format <host ip> <rack_location>
192.168.2.10 01
192.168.2.11 02
192.168.2.12 03
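As a quick sanity check, the more complex script can be run by hand against the sample data file. This sketch assumes that rack-topology.sh and rack_topology.data are in the current directory, so the configuration directory is overridden on the command line:
# Known host: the script prints the rack that is listed in rack_topology.data.
HADOOP_CONF=. bash rack-topology.sh 192.168.2.10
# Expected output: /default/rack_01
# Unknown host: the script falls back to the default rack.
HADOOP_CONF=. bash rack-topology.sh 192.168.2.99
# Expected output: /default/rack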
Uempty
Compression of files
• File compression brings two benefits:
ƒ Reduces the space that is needed to store files.
ƒ Speeds up data transfer across the network or to and from disk.
• But is the data splittable? (necessary for parallel reading)
Compression formats
• gzip
gzip is naturally supported by Hadoop. gzip is based on the DEFLATE algorithm, which is a
combination of LZ77 and Huffman Coding.
• bzip2
bzip2 is a freely available, patent free, and high-quality data compressor. It typically
compresses files to within 10% - 15% of the best available techniques (the PPM family of
statistical compressors) while being about twice as fast at compression and six times faster at
decompression.
Uempty
• LZO
The LZO compression format is composed of many smaller (~256 K) blocks of compressed
data, allowing jobs to be split along block boundaries. Moreover, it was designed with speed in
mind: It decompresses about twice as fast as gzip, meaning that it is fast enough to keep up
with hard disk drive read speeds. It does not compress as well as gzip: Expect files that are on
the order of 50% larger than their gzipped version, but that is still 20 - 50% of the size of the files
without any compression at all, which means that I/O-bound jobs complete the map phase
about four times faster.
LZO is Lempel-Ziv-Oberhumer. It is a free software tool that is implemented by lzop. The
original library was written in ANSI C, and it is made available under the GNU General Public
License. Versions of LZO are available for the Perl, Python, and Java languages. The copyright
for the code is owned by Markus F. X. J. Oberhumer.
• LZ4
LZ4 is a lossless data compression algorithm that is focused on compression and
decompression speed. It belongs to the LZ77 family of byte-oriented compression schemes.
The algorithm has a slightly worse compression ratio than algorithms like gzip. However,
compression speeds are several times faster than gzip, and decompression speeds can be
faster than LZO. The reference implementation in C by Yann Collet is licensed under a BSD
license.
• Snappy
Snappy is a compression/decompression library. It does not aim for maximum compression or
compatibility with any other compression library. Instead, it aims for high speeds and
reasonable compression. For example, compared to the fastest mode of zlib, Snappy is an
order of magnitude faster for most inputs, but the resulting compressed files are 20% - 100%
larger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about
250 MBps or more and decompresses at about 500 MBps or more. Snappy is widely used
inside Google, from BigTable and MapReduce to RPC systems.
All packages that are produced by the ASF, such as Hadoop, are implicitly licensed under the
Apache License Version 2.0, unless otherwise explicitly stated. The licensing of other algorithms,
such as LZO, which are not licensed under ASF might pose some problems for distributions that
rely solely on the Apache License.
References:
• http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Data_Compression
• http://comphadoop.weebly.com
• https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Oberhumer
• http://en.wikipedia.org/wiki/LZ4_(compression_algorithm)
Uempty
You can take advantage of compression, but the compression format that you use depends on the
file size, data format, and tools that are used.
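For example, compression of the intermediate map output and of the final job output can be requested per job through standard Hadoop configuration properties. This is a hedged sketch: the job jar, main class, and paths are placeholders, and it assumes that the driver uses ToolRunner so that the -D generic options are honored.
# Run a (hypothetical) MapReduce job with Snappy-compressed map output
# and gzip-compressed final output.
yarn jar myjob.jar com.example.MyJob \
  -D mapreduce.map.output.compress=true \
  -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
  /input /output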
References:
• Data Compression:
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Data_Compression
• Data Compression in Hadoop: http://comphadoop.weebly.com
• Compression Options in Hadoop:
https://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2
• Choosing a Data Compression Format:
https://www.cloudera.com/documentation/enterprise/5-3-x/topics/admin_data_compression_performance.html
Uempty
4.3. Managing a Hadoop Distributed File
System cluster
Uempty
Managing a Hadoop
Distributed File System
cluster
Uempty
Topics
• Apache Hadoop: Summary and recap
• Introduction to Hadoop Distributed File System
• Managing a Hadoop Distributed File System cluster
Uempty
NameNode startup
1. NN reads fsimage in memory.
2. NN applies edit log changes.
3. NN waits for block data from data nodes:
ƒ NN does not store the physical location information of the blocks.
ƒ NN exits SafeMode when 99.9% of blocks have at least one copy that is
accounted for.
During startup, NN loads the file system state from fsimage and the edits log file. Then, it waits for
DataNodes to report their blocks so that it does not prematurely start replicating the blocks even
though enough replicas are in the cluster.
During this time, NN stays in SafeMode. SafeMode for NN is essentially a read-only mode for the
HDFS cluster, where it does not allow any modifications to file system or blocks. Normally, NN
leaves SafeMode automatically after DataNodes report that most file system blocks are available.
If required, HDFS can be placed in SafeMode explicitly by running the command hdfs dfsadmin
-safemode. The NN front page shows whether SafeMode is on or off.
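A minimal sketch of the dfsadmin subcommands for checking and toggling SafeMode:
# Check whether the NameNode is currently in SafeMode.
hdfs dfsadmin -safemode get
# Enter SafeMode explicitly (HDFS becomes read-only), then leave it again.
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave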
Reference:
NameNode and DataNodes:
https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#NameNode_and_DataNodes
Uempty
Here are the storage files (in HDFS) where NN stores its metadata:
• fsimage
• edits
• VERSION
There is an edits_inprogress file that accumulates edits (adds and deletes) since the last update
of the fsimage. This edits file is closed off, and the changes are incorporated into a new version of
the fsimage based on whichever of two configurable events occurs first:
• The edits file reaches a certain size (here 1 MB, but the default is 64 MB).
• The time limit between updates is reached, and there were updates (the default is 1 hour).
Uempty
(The slide diagram shows the NameNode and three DataNodes. Step 3: the first DataNode daisy-chain-writes to the second DataNode, and the second DataNode writes to the third DataNode, with an acknowledgment sent back to the previous node.)
Replication pipelining
1. When a client is writing data to an HDFS file with a replication factor of three, NN retrieves a list
of DataNodes by using a replication target choosing algorithm. This list contains DataNodes
that host a replica of that block.
2. The client then writes to the first DataNode.
3. The first DataNode starts receiving the data in portions, writes each portion to its local
repository, and transfers that portion to the second DataNode in the list. The second DataNode
starts receiving each portion of the data block, writes that portion to its repository, and then
flushes that portion to the third DataNode. Finally, the third DataNode writes the data to its local
repository. Thus, a DataNode can be receiving data from the previous one in the pipeline and
concurrently forwarding data to the next one in the pipeline. Thus, the data is pipelined from one
DataNode to the next.
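To see where the replicas of a file's blocks actually landed after the pipeline completes, fsck can report block locations. A sketch with a placeholder path:
# List the blocks of a file and the DataNodes that hold each replica.
hdfs fsck /user/virtuser/myfile.txt -files -blocks -locations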
Reference:
Data Organization:
https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Data_Organization
Uempty
Apache Hadoop clusters grow and change with use. The normal method is to use Apache Ambari
to build your initial cluster with a base set of Hadoop services targeting known use cases. You might
want to add other services for new use cases, and even later you might need to expand the storage
and processing capacity of the cluster.
Apache Ambari can help with both the initial configuration and the later expansion/reconfiguration
of your cluster.
When you add more hosts to the cluster, you can assign these hosts to run as DataNodes (and
NodeManagers under YARN, as you see later) so that you expand both your HDFS storage
capacity and your overall processing power.
Similarly, you can remove DataNodes if they are malfunctioning or you want to reorganize your
cluster.
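After expanding or shrinking the cluster, a quick way to confirm which DataNodes are live and how much capacity is available is the dfsadmin report:
# Summarize configured capacity, remaining space, and the state of every DataNode.
hdfs dfsadmin -report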
Uempty
(The slide diagram shows an active NameNode and a standby NameNode.)
Uempty
To deploy an HA cluster, prepare the following items:
• NN machines
The machines on which you run the active and standby NNs should have hardware that is
equivalent to each other, and equivalent to what would be used in a non-HA cluster.
• JournalNode machines
The machines on which you run the JournalNodes. The JournalNode daemon is relatively
lightweight, so these daemons can reasonably be collocated on machines with other Hadoop
daemons, for example, NNs, the JobTracker, or the YARN ResourceManager.
There must be at least three JournalNode daemons because edit log modifications must be
written to a majority of JournalNodes, which allows the system to tolerate the failure of a single
machine. You can also run more than three JournalNodes, but to increase the number of
failures that the system can tolerate, you should run an odd number of JournalNodes (that is,
three, five, seven, and so on). When running with N JournalNodes, the system can tolerate at
most (N - 1) / 2 failures and continue to function normally.
Reference:
HDFS High Availability Using the Quorum Journal Manager:
https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
Uempty
Standby NameNode
• During operation, the primary NN cannot merge fsimage and the edits log.
• This task is done on the secondary NN:
ƒ Every couple of minutes, the secondary NN copies a new edit log from the primary NN.
ƒ Merges the edits log in to fsimage.
ƒ Copies the merged fsimage back to the primary NN.
• Not HA, but provides a faster startup time:
ƒ Standby NN does not have a complete image, so any in-flight transactions are lost.
ƒ Primary NN must merge less during startup.
NN stores the HDFS file system information in a file that is named fsimage. Updates to the file
system (add/remove blocks) do not update the fsimage file, but instead are logged in to a file, so
the I/O is fast-append streaming only as opposed to random file writes. When restarting, NN reads
fsimage and then applies all the changes from the log file to bring the file system state up to date in
memory. This process takes time.
The job of the secondary NN is not to be a secondary to NN, but only to periodically read the file
system changes log and apply the changes into the fsimage file, thus bringing it up to date. This
task allows NN to start faster next time.
Unfortunately, the secondary NN service is not a standby secondary NN despite its name.
Specifically, it does not offer HA for NN.
More recent distributions have NN HA by using NFS (shared storage) or NN HA by using a Quorum
Journal Manager (QJM).
Reference:
HDFS High Availability Using the Quorum Journal Manager:
https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
Uempty
With Federated NNs in Hadoop V2, there are two main layers:
• Namespace:
▪ Consists of directories, files, and blocks.
▪ Supports all the namespace-related file system operations, such as create, delete, modify,
and list files and directories.
• Block storage service, which has two parts:
▪ Block management (performed in the NN):
- Provides DataNode cluster membership by handling registrations and periodic
heartbeats.
- Processes block reports and maintains location of blocks.
- Supports block-related operations, such as create, delete, modify, and get block
location.
- Manages replica placement, block replication for under-replicated blocks, and deletes
blocks that are over-replicated.
▪ Storage is provided by DataNodes by storing blocks on the local file system and allowing
read/write access.
Uempty
Multiple NNs and namespaces
The prior HDFS architecture allows only a single namespace for the entire cluster. In that
configuration, a single NN manages the namespace. HDFS Federation addresses this limitation by
adding support for multiple NNs and namespaces to HDFS.
To scale the name service horizontally, federation uses multiple independent NNs and
namespaces. The NNs are federated, and the NNs are independent and do not require
coordination with each other. The DataNodes are used as common storage for blocks by all the
NNs. Each DataNode registers with all the NNs in the cluster. DataNodes send periodic heartbeats
and block reports. They also handle commands from the NNs.
Users can use ViewFS to create personalized namespace views. ViewFS is analogous to
client-side mount tables in some UNIX and Linux systems.
Reference:
HDFS Federation:
https://hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/Federation.html
Uempty
ƒ The current directory is designated by a dot ("."), the "here" symbol in
Linux/UNIX.
ƒ If you want the root of the HDFS file system, use a slash ("/").
HDFS can be manipulated by using a Java API or a command-line interface (CLI). All commands
for manipulating HDFS by using Hadoop's CLI begin with hdfs dfs, which is the file system shell.
This command is followed by the command name as an argument to hdfs dfs. These commands
start with a dash. For example, the ls command for listing a directory is a common UNIX command
and is preceded with a dash. As on UNIX systems, ls can take a path as an argument. In this
example, the path is the current directory, which is represented by a single dot.
dfs is one of the command options for hdfs. If you type the command hdfs by itself, you see other
options.
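A few hedged examples of the file system shell in use (the directory name is a placeholder):
# List the current user's HDFS home directory and the HDFS root.
hdfs dfs -ls .
hdfs dfs -ls /
# Create a test directory and remove it again (recursively).
hdfs dfs -mkdir -p testdir
hdfs dfs -rm -r testdir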
Uempty
scheme://authority/path
• Scheme:
ƒ For the local file system, the scheme is file.
ƒ For HDFS, the scheme is hdfs.
• Authority is the hostname and port of the NN.
hdfs dfs -copyFromLocal file:///myfile.txt hdfs://localhost:9000/user/virtuser/myfile.txt
• The scheme and authority are often optional. The defaults are taken
from the configuration file core-site.xml.
Just as for the ls command, the file system shell commands can take paths as arguments. These
paths can be expressed in the form of uniform resource identifiers (URIs). The URI format consists
of a scheme, an authority, and path. There are multiple schemes that are supported. The local file
system has a scheme of "file". HDFS has a scheme that is called "hdfs."
For example, if you want to copy a file called "myfile.txt" from your local file system to an HDFS file
system on the localhost, you can do this task by issuing the command that is shown. The
copyFromLocal command takes a URI for the source and a URI for the destination.
"Authority" is the hostname of the NN. For example, if the NN is in localhost and accessed on port
9000, the authority would be localhost:9000.
The scheme and the authority do not always need to be specified. Instead, you might rely on their
default values. These defaults can be overridden by specifying them in a file that is named
core-site.xml in the "conf" directory of your Hadoop installation.
Uempty
HDFS supports many Portable Operating System Interface (POSIX)-like commands. HDFS is not a
fully POSIX (for UNIX) compliant file system, but it supports many of the commands. The HDFS
commands are mostly easily recognized UNIX commands like cat and chmod. There are also a
few commands that are specific to HDFS, such as copyFromLocal.
Note that:
• localsrc and dst are placeholders for your files.
• localsrc can be a directory or a list of files that is separated by spaces.
• dst can be a new file name (in HDFS) for a single-file-copy, or a directory (in HDFS) that is the
destination directory.
For example, hdfs dfs -put *.txt ./Gutenberg copies all the text files in the local Linux directory
with the suffix of .txt to the directory “Gutenberg” in the user’s home directory in HDFS.
The "direction“ that is implied by the names of these commands (copyFromLocal, put) is relative
to the user, who can be thought to be situated outside HDFS.
Also, you should note that there is no cd command that is available for Hadoop.
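A short sketch of the put example above, followed by a listing to verify the copy (the Gutenberg directory and the local .txt files are placeholders):
# Copy all local .txt files into the HDFS directory "Gutenberg" in the user's home directory.
hdfs dfs -mkdir -p Gutenberg
hdfs dfs -put *.txt Gutenberg
hdfs dfs -ls Gutenberg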
Uempty
copyToLocal / get
• Copy files from dfs into the local file system:
hdfs dfs -copyToLocal [-ignorecrc] [-crc] <src> <localdst>
or
The copyToLocal or get command copies files out of the file system that you specify and into the
local file system. Here is an example of the command:
hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>
Copy the files to the local file system. Files that fail the CRC check can be copied by using the
-ignorecrc option. Files and CRCs can be copied by using the -crc option. For example, hdfs dfs
-get hdfs:/mydir/file file:///home/hdpadmin/localfile.
For files in Linux where you use the file:// scheme, two slashes represent files relative to your
current Linux directory (pwd). To reference files absolutely, use three slashes, as in
file:///home/hdpadmin/localfile.
Uempty
Unit summary
• Explained the need for a big data strategy and the importance of
parallel reading of large data files and internode network speed in a
cluster.
• Described the nature of the Hadoop Distributed File System (HDFS).
• Explained the function of NameNode and DataNode in a Hadoop
cluster.
• Explained how files are stored and blocks (splits) are replicated.
Uempty
Review questions
1. True or False: Hadoop systems are designed for using a
single server.
2. What is the default number of replicas in a Hadoop system?
A. 1
B. 2
C. 3
D. 4
3. True or False: One of the Hadoop goals is fault tolerance by
detecting faults and applying quick and automatic recovery.
4. True or False: At least two NameNodes are required for a
stand-alone Hadoop cluster.
5. The default Hadoop block size is:
A. 16
B. 32
C. 64
D. 128
1. False
2. C. 3
3. True
4. False
5. D. 128
Uempty
Review answers
1. True or False: Hadoop systems are designed for using a
single server.
2. What is the default number of replicas in a Hadoop system?
A. 1
B. 2
C. 3
D. 4
3. True or False: One of the Hadoop goals is fault tolerance by
detecting faults and applying quick and automatic recovery.
4. True or False: At least two NameNodes are required for a
stand-alone Hadoop cluster.
5. The default Hadoop block size is:
A. 16
B. 32
C. 64
D. 128
Uempty
Figure 4-35. Exercise: File access and basic commands with HDFS
Uempty
Exercise objectives
• This exercise introduces you to basic HDFS commands.
• After completing this exercise, you should be able to:
ƒ Move data to HDFS.
ƒ Access files.
ƒ Run basic HDFS commands.
Uempty
Overview
In this unit, you learn about the MapReduce and YARN frameworks.
Uempty
Unit objectives
Uempty
5.1. Introduction to MapReduce
Uempty
Introduction to MapReduce
Uempty
Topics
• Introduction to MapReduce
• Hadoop v1 and MapReduce v1 architecture and limitations
• YARN architecture
• Hadoop and MapReduce v1 compared to v2
Uempty
(The slide diagram repeats the picture from the HDFS unit: a logical file is split into blocks 1 - 4, the blocks are distributed across the cluster, and each block is replicated to three different machines.)
The driving principle of MapReduce is a simple one: spread your data out across a huge cluster of
machines and then, rather than bringing the data to your programs as you do in traditional
programming, you write your program in a specific way that allows the program to be moved to the
data. Thus, the entire cluster is made available in both reading the data as well as processing the
data.
A Distributed File System (DFS) is at the heart of MapReduce. It is responsible for spreading data
across the cluster, by making the entire cluster look like one giant file system. When a file is written
to the cluster, blocks of the file are spread out and replicated across the whole cluster (in the
diagram, notice that every block of the file is replicated to three different machines).
Adding more nodes to the cluster instantly adds capacity to the file system and, as we'll see on the
next slide, automatically increases the available processing power and parallelism.
Uempty
MapReduce explained
• Hadoop computational model
ƒ Data stored in a distributed file system spanning many inexpensive
computers
ƒ Bring function to the data
ƒ Distribute application to the compute resources where the data is stored
• Scalable to thousands of nodes and petabytes of data
The slide shows an excerpt of the WordCount code with the phases of a MapReduce job annotated
on it: 1. Map Phase (break the job into small parts) and 2. Shuffle (distribute the map output). The
code excerpt, cleaned up, is:

      context.write(word, one);   // inside the map() method
    }
  }

  public static class IntSumReducer
      extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> val, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : val) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
Uempty
Uempty
Uempty
MapReduce overview
Results can be
written to HDFS or a
database
Uempty
Map Phase
(Slide diagram: blocks 1 through 4 of a logical input file are read by map tasks; the map output is
sorted locally, copied and merged to the reduce tasks, and each reduce task writes one logical
output file to the DFS.)
Earlier, you learned that if you write your programs in a special way, the programs can be brought to
the data. This special way is called MapReduce, and it involves breaking down your program into
two discrete parts: Map and Reduce.
A mapper is typically a relatively small program with a relatively simple task: it is responsible for
reading a portion of the input data, interpreting, filtering, or transforming the data as necessary and
then finally producing a stream of <key, value> pairs. What these keys and values are is not of
importance in the scope of this topic, but keep in mind that these values can be as large and
complex as you need.
Notice in the diagram how the MapReduce environment automatically takes care of taking your
small "map" program (the black boxes) and pushing that program out to every machine that has a
block of the file that you are trying to process. This means that the bigger the file and the bigger the cluster,
the more mappers get involved in processing the data. That's a powerful idea.
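To make the shape of a mapper concrete, the following minimal Scala sketch runs the same idea on
an ordinary local collection instead of on a cluster; the input lines and the mapRecord function are
invented for illustration and are not the course's WordCount.java code.

// A map function: one input record in, zero or more <key, value> pairs out.
// Here a record is a line of text and the pairs are (word, 1).
def mapRecord(record: String): Seq[(String, Int)] =
  record.split(" ").filter(_.nonEmpty).map(word => (word, 1)).toSeq

val lines = Seq("the quick brown fox", "the lazy dog")   // stand-in for one block of the file
val mapOutput = lines.flatMap(mapRecord)
// mapOutput: Seq((the,1), (quick,1), (brown,1), (fox,1), (the,1), (lazy,1), (dog,1))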
Uempty
Shuffle
(Slide diagram: the same data flow, highlighting the shuffle: the sorted map output is copied from the
map tasks and merged at the reduce tasks.)
This next phase is called "Shuffle" and is orchestrated behind the scenes by MapReduce.
The idea here is that all the data that is being emitted from the mappers is first locally grouped by
the <key> that your program chose, and then for each unique key, a node is chosen to process all
the values from all the mappers for that key.
For example, if you used U.S. state (such as MA, AK, NY, CA, etc.) as the key of your data, then
one machine would be chosen to send all the California data to, and another chosen for all the New
York data. Each machine would be responsible for processing the data for its selected state. In the
diagram in the slide, the data only has two keys (shown as white and black boxes), but keep in
mind that there may be many records with the same key coming from a given mapper.
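To illustrate the grouping that the shuffle performs, here is a small Scala sketch on a local collection;
the state abbreviations and amounts are invented, and in a real job the groups move between
machines rather than staying in one process.

// Map output gathered from several mapper nodes: (state, amount) pairs.
val mapOutput = Seq(("CA", 10), ("NY", 4), ("CA", 7), ("MA", 2), ("NY", 9))

// The shuffle guarantees that every value for a given key lands on the same reducer.
val shuffled = mapOutput.groupBy(_._1).map { case (state, amounts) => (state, amounts.map(_._2)) }
// shuffled: Map(CA -> Seq(10, 7), NY -> Seq(4, 9), MA -> Seq(2))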
Uempty
Reduce Phase
(Slide diagram: the same data flow, highlighting the reduce phase: the merged data is reduced and
written as logical output files to the DFS.)
Reducers are the last part of the picture. Again, these are typically small programs that are
responsible for sorting and aggregating all the values for the key that they were assigned to work on.
Just like with mappers, the more unique keys you have, the more parallelism.
After each reducer completes whatever it was assigned to do, such as adding up the total sales for
the state it was assigned, it in turn emits key/value pairs that get written to storage and can then be
used as the input to the next MapReduce job.
This is a simplified overview of MapReduce.
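Continuing the local Scala sketch, the reduce side aggregates the grouped values for each key; the
sum stands in for "total sales for the state" and the data is invented for illustration.

// Values for each key, as grouped by the shuffle (all values for a key arrive at one reducer).
val shuffled = Map("CA" -> Seq(10, 7), "NY" -> Seq(4, 9), "MA" -> Seq(2))

// Reducer: fold all the values for a key into a single result, here a total.
def reduce(key: String, values: Seq[Int]): (String, Int) = (key, values.sum)

val results = shuffled.map { case (state, amounts) => reduce(state, amounts) }
// results: Map(CA -> 17, NY -> 13, MA -> 2); each pair is written out as a key/value record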
Uempty
Combiner
(Slide diagram: the same data flow with an optional combine step ("map & combine") running in
each map task before the sorted output is shuffled to the reduce tasks.)
At the same time as the Sort is done during the Shuffle work on the Mapper node, an optional
Combiner function may be applied.
For each key, all key/values with that key are sent to the same Reducer node (that is the purpose of
the Shuffle phase).
Rather than sending multiple key/value pairs with the same key value to the Reducer node, the
values are combined into one key/value pair. This is only possible where the reduce function is
additive (or does not lose information when combined).
Because only one key/value pair is sent per key, the file that is transferred from the Mapper node to
the Reducer node is smaller and network traffic is minimized.
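The effect of a combiner can be sketched in Scala on a local collection; the word counts are
invented, and a real combiner is applied by the framework inside the Map task rather than by user
code like this.

// Map output on one mapper node before the shuffle.
val localPairs = Seq(("the", 1), ("cat", 1), ("the", 1), ("the", 1))

// Combiner: pre-aggregate locally so that fewer key/value pairs cross the network.
val combined = localPairs.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
// combined: Map(the -> 3, cat -> 1); two records are shuffled instead of four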
Uempty
WordCount example
• In the example of a list of animal names
ƒ MapReduce can automatically split files on line breaks
ƒ The file is split into two blocks on two nodes
• To count how often each big cat is mentioned, in SQL you would use:
Node 1: Tiger, Lion, Lion, Panther, Wolf, …
Node 2: Tiger, Tiger, Wolf, Panther, …
In a file with two blocks (or "splits") of data, animal names are listed. There is one animal name per
line in the files.
Rather than count the number of mentions of each animal, you are interested only in members of
the cat family.
Since the blocks are held on different nodes, software running on the individual nodes processes
the blocks separately.
If you were using SQL (which is not used here), the query would be as shown in the slide.
Uempty
Map task
• There are two requirements for the Map task:
ƒ Filter out the non big-cat rows
ƒ Prepare count by transforming to <Text(name), Integer(1)>
Node 1: Tiger, Lion, Lion, Panther, Wolf, …  →  <Tiger, 1>, <Lion, 1>, <Lion, 1>, <Panther, 1>, …
Node 2: Tiger, Tiger, Wolf, Panther, …  →  <Tiger, 1>, <Tiger, 1>, <Panther, 1>, …
The Map tasks are executed locally on each split.
Uempty
"Reduce" step: The master node then takes the answers to all the sub-problems and combines
them in some way to get the output, the answer to the problem it was originally trying to solve.
The Map step that is shown does the following processing:
• Each Map node reads its own "split" (block) of data
• The information required (in this case, the names of animals) is extracted from each record (in
this case, one line = one record)
• Data is filtered (keeping only the names of cat family animals)
• key-value pairs are created (in this case, key = animal and value = 1)
• key-value pairs are accumulated into locally stored files on the individual nodes where the Map
task is being executed
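The two Map-task requirements from the slide can be sketched in Scala on a local collection; the
bigCats set is an assumption that stands in for "members of the cat family".

// Requirement 1: filter out the non-big-cat rows.  Requirement 2: emit <name, 1>.
val bigCats = Set("Tiger", "Lion", "Panther")                    // assumed filter list
val split1  = Seq("Tiger", "Lion", "Lion", "Panther", "Wolf")    // the block on Node 1

val mapOutput = split1.filter(bigCats.contains).map(name => (name, 1))
// mapOutput: Seq((Tiger,1), (Lion,1), (Lion,1), (Panther,1)); Wolf is dropped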
Uempty
Shuffle
• Shuffle moves all values of one key to the same target node
• Distributed by a Partitioner Class (normally hash distribution)
• Reduce tasks can run on any node - here on Nodes 1 and 3
ƒ The number of Map and Reduce tasks do not need to be identical
ƒ Differences are handled by the hash partitioner
Map output on Node 1: <Tiger, 1>, <Lion, 1>, <Lion, 1>, <Panther, 1>, …
Map output on Node 2: <Tiger, 1>, <Tiger, 1>, <Panther, 1>, …
Reduce input on Node 1: <Panther, <1,1>>, <Tiger, <1,1,1>>, …
Reduce input on Node 3: <Lion, <1,1>>, …
Shuffle distributes keys using a hash partitioner. Results are stored in HDFS blocks on the
machines that run the Reduce jobs.
Shuffle distributes the key-value pairs to the nodes where the Reducer task runs. Each Mapper task
produces one file for each Reducer task. A hash function that is running on the Mapper node
determines which Reducer task receives any key-value pair. All key-value pairs with the same key
are sent to the same Reducer task.
Reduce tasks can run on any node, either different from the set of nodes where the Map task runs
or on the same DataNodes. In the slide example, Node 1 is used for one Reduce task, but a new
node, Node 3, is used for a second Reduce node.
There is no relation between the number of Map tasks (generally one for each block of the files
being read) and the number of Reduce tasks. Commonly, the number of Reduce tasks is smaller
than the number of Map tasks.
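The idea behind the hash partitioner can be sketched in a few lines of Scala; this mirrors the general
approach of Hadoop's default hash partitioning and is not its exact code.

// Decide which Reduce task receives all the pairs for a given key.
def partitionFor(key: String, numReduceTasks: Int): Int =
  (key.hashCode & Integer.MAX_VALUE) % numReduceTasks

val numReduceTasks = 2
Seq("Tiger", "Lion", "Panther").foreach { key =>
  println(s"$key -> Reduce task ${partitionFor(key, numReduceTasks)}")
}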
Uempty
Reduce
• The reduce task computes aggregated values for each key
ƒ Normally the output is written to the DFS
ƒ Default is one output part-file per Reduce task
ƒ Reduce tasks aggregate all values of a specific key, in this example, the
count of the particular animal type
Reduce input on Node 3: <Lion, <1,1>>  →  output: <Lion, 2>
…
Note that these two Reducer tasks are running on Nodes 1 and 3.
The Reduce node then takes the answers to all the sub-problems and combines them in some way
to get the output - the answer to the problem it was originally trying to solve.
In this case, the Reduce step that is shown on this slide does the following processing:
• The data is sent to each Reduce node from the various Map nodes.
• This data is previously sorted (and possibly partially merged).
• The Reduce node aggregates the data; for WordCount, it sums the counts that are received for
each word (each animal in this case).
• One file is produced for each Reduce task and it is written to HDFS where the blocks are
automatically replicated.
Uempty
Combiner (Optional)
• For performance, an aggregate in the Map task can be helpful
• Reduces the amount of data that is sent over the network
ƒ Also reduces Merge effort, since data is premerged in Map
ƒ Done in the Map task before Shuffle
Node 1 Map output: <Tiger, 1>, <Lion, 1>, <Panther, 1>, …  →  after combine: <Lion, 1>, <Panther, 1>, <Tiger, 1>, …
Node 2 Map output: <Tiger, 1>, <Tiger, 1>, <Panther, 1>, …  →  after combine: <Tiger, 2>, <Panther, 1>, …
Reduce input on Node 1: <Panther, <1,1>>, <Tiger, <1, 2>>, …
Reduce input on Node 3: <Lion, 1>, …
The Combiner phase is optional. When it is used, it runs on the Mapper node and preprocesses the
data that is sent to Reduce tasks by pre-merging and pre-aggregating the data in the files that are
transmitted to the Reduce tasks.
The Combiner thus reduces the amount of data that is sent to the Reducer tasks, which speeds up
the processing as smaller files need to be transmitted to the Reducer task nodes.
Uempty
This is a slightly simplified version of WordCount.java for MapReduce 1. The full program is slightly
larger, and there are some recommended differences for compiling for MapReduce 2 with Hadoop
2.
Code from the Hadoop classes is brought in with the import statements. Like an iceberg, most of
the actual code executed at run time is hidden from the programmer; it runs deep down in the
Hadoop code itself.
The class of interest here is the Mapper class, Map.
This class reads the file (passed, as you see on the driver class slide later, as arg[0]) as a string. The
string is tokenized, that is, broken into words that are separated by spaces.
Uempty
Note the following shortcomings of the code:
• No lowercasing is done, thus The and the are treated as separate words that are counted
separately.
• Any adjacent punctuation is appended to the word. Thus, "the (with a leading double quotation
mark) and the (without quotation marks) are counted separately, and any word that is followed by
punctuation, for example cow, (with a trailing comma), is counted separately from cow (the same
word without trailing punctuation).
You see these shortcomings in the output. Note that this is the standard WordCount program and
the interest is not in the actual results but only in the process at this stage.
The WordCount program is to Hadoop Java programs what the "Hello, world!" program is to the C
language. It is generally the first program that people experience when coming to the new
technology.
Uempty
Uempty
The driver routine, which is embedded in main, does the following work:
• Sets the JobName for runtime.
• Sets the Mapper class to Map.
• Sets the Reducer class to Reduce.
• Sets the Combiner class to Reduce.
• Sets the input file to arg[0].
• Sets the output directory to arg[1].
The combiner runs on the Map task and uses the same code as the Reducer task.
The names of the output files will be generated inside the Hadoop code.
Uempty
Classes
• Hadoop provides three main Java classes to read data in
MapReduce:
ƒ InputSplit divides a file into splits
í Splits are normally the block size, but the actual size depends on the number of
requested Map tasks, whether the compression (if any) allows splitting, and other factors
ƒ RecordReader takes a split and reads the files into records
í For example, one record per line (LineRecordReader)
í But note that a record can be split across splits
ƒ InputFormat takes each record and transforms it into a
<key, value> pair that is then passed to the Map task
• Lots of additional helper classes might be required to handle
compression, for example, LZO compression.
The InputSplit, RecordReader, and InputFormat classes are provided inside the Hadoop code.
Other helper classes are needed to support Java MapReduce programs. Some of these classes
are provided from inside the Hadoop code itself, but distribution vendors and programmers can
provide other classes that either override or supplement standard code. Thus, some vendors
provide the LZO compression algorithm to supplement the standard compression codecs (such as
the codec for bzip2).
Uempty
Splits
• Files in MapReduce are stored in blocks (128 MB)
• MapReduce divides data into fragments or splits.
ƒ One Map task is executed on each split
• Most files have records with defined split points
ƒ Most common is the end of line character
• The InputSplit class is responsible for taking an HDFS file and
transforming it into splits.
ƒ The goal is to process as much data as possible locally
Uempty
RecordReader
• Most of the time, a record does not end exactly at a block (split) boundary
• Files are read into Records by the RecordReader class
ƒ Normally the RecordReader starts and stops at the split points.
• LineRecordReader reads over the end of the split until the line end.
ƒ HDFS sends the missing piece of the last record over the network
• Likewise, LineRecordReader for Block2 disregards the first incomplete
line
Block 1 (Node 1): Tiger\n Tiger\n Lion\n Pan
Block 2 (Node 2): ther\n Tiger\n Wolf\n Lion
In this example RecordReader1 will not stop at Pan but will read on until the end of
the line. Likewise RecordReader2 will ignore the first line.
Uempty
InputFormat
• MapReduce tasks read files by defining an InputFormat class
ƒ Map tasks expect <key, value> pairs
• To read line-delimited text files, Hadoop provides the TextInputFormat
class
ƒ It returns one key, value pair per line in the text
ƒ The value is the content of the line
ƒ The key is the byte offset of the start of the line within the file
Node 1
Tiger <0, Tiger>
Lion <6, Lion>
Lion <11, Lion>
Panther <16, Panther>
Wolf <24, Wolf>
… …
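The <key, value> pairs in this table can be reproduced with a short Scala sketch that computes the
offset of each line start; it assumes one-byte characters and a single-byte newline, which matches
the offsets that are shown above.

// Reproduce the <key, value> pairs that TextInputFormat emits for this split.
val lines = Seq("Tiger", "Lion", "Lion", "Panther", "Wolf")

// Byte offset of the start of each line, assuming 1-byte characters and "\n" line endings.
val offsets = lines.scanLeft(0)((offset, line) => offset + line.length + 1).init
val pairs   = offsets.zip(lines)
// pairs: Seq((0,Tiger), (6,Lion), (11,Lion), (16,Panther), (24,Wolf))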
Uempty
5.2. Hadoop v1 and MapReduce v1 architecture
and limitations
Uempty
The original Hadoop (v1) and MapReduce (v1) had limitations, and a number of issues surfaced
over time. You will review these in preparation for looking at the differences and changes
introduced with Hadoop 2 and MapReduce v2.
Uempty
Topics
• Introduction to MapReduce
• Hadoop v1 and MapReduce v1 architecture and limitations
• YARN architecture
• Hadoop and MapReduce v1 compared to v2
Uempty
MapReduce v1 engine
• Master / Worker architecture
ƒ A single controller (JobTracker) controls job execution on multiple workers
(TaskTrackers).
• JobTracker
ƒ Accepts MapReduce jobs that are submitted by clients.
ƒ Pushes Map and Reduce tasks out to TaskTracker nodes.
ƒ Keeps the work as physically close to data as possible.
ƒ Monitors tasks and the TaskTracker status.
• TaskTracker
ƒ Runs Map and Reduce tasks.
ƒ Reports statuses to JobTracker.
ƒ Manages storage and transmission of intermediate output
If one TaskTracker is slow, it can delay the entire MapReduce job, especially towards the end of a
job, where everything can end up waiting for the slowest task. With speculative-execution enabled,
a single task can be run on multiple worker nodes.
For jobs scheduling, by default Hadoop uses first in, first out (FIFO), and optionally, five scheduling
priorities to schedule jobs from a work queue. Other scheduling algorithms are available as
add-ons, such as Fair Scheduler and Capacity Scheduler.
Uempty
Distributed TaskTracker
file system 8. Retrieve
job resources. 9. Launch.
(for
example,
child JVM
HDFS)
Child
• Client: Submits MapReduce jobs.
• JobTracker: Coordinates the job run and 10. Run.
breaks down the job to Map and Reduce MapTask
tasks for each node to work on the cluster. or
ReduceTask
• TaskTracker: Runs the Map and Reduce
task functions. TaskTracker node
The process of running a MapReduce job on Hadoop consists of the following steps:
1. The MapReduce program that you write tells the Job Client to run a MapReduce job.
2. The job sends a message to the JobTracker, which produces a unique ID for the job.
3. The Job Client copies job resources, such as a JAR file that contains Java code that you wrote
to implement the Map or the Reduce task to the shared file system, usually HDFS.
4. After the resources are in HDFS, the Job Client can tell the JobTracker to start the job.
5. The JobTracker does its own initialization for the job. It calculates how to split the data so that it
can send each "split" to a different Mapper process to maximize throughput.
6. It retrieves these "input splits" from the distributed file system, not the data itself.
7. The TaskTrackers are continually sending heartbeat messages to the JobTracker. Now that the
JobTracker has work for them, it returns a Map task or a Reduce task as a response to the
heartbeat.
8. The TaskTrackers must obtain the code to run, so they get it from the shared file system.
9. The TaskTrackers start a Java virtual machine (JVM) with a child process that runs in it, and this
child process runs your Map code or Reduce code. The result of the Map operation remains in
the local disk for the TaskTracker node (not in HDFS).
Uempty
10. The output of the Reduce task is stored in the HDFS file system by using the number of copies
that is specified by the replication factor.
Uempty
Fault tolerance
(Slide diagram: failure points in MRv1: a MapTask or ReduceTask running in the child JVM on a
TaskTracker node can fail (1), and the JobTracker itself can fail (3); the TaskTracker reports its status
to the JobTracker through heartbeat messages.)
Uempty
This fault tolerance underscores the need for program execution to be side-effect free. If Mappers
and Reducers had individual identities and communicated with one another or the outside world,
then restarting a task would require the other nodes to communicate with the new instances of the
Map and Reduce tasks, and the rerun tasks would need to reestablish their intermediate state. This
process is notoriously complicated and error-prone in general. MapReduce simplifies this problem
drastically by eliminating task identities or the ability for task partitions to communicate with one
another. An individual task sees only its own direct inputs and knows only its own outputs to make
this failure and restart process clean and dependable.
Uempty
Uempty
Uempty
(Slide diagram: in MRv1, a single JobTracker manages up to about 4000 TaskTrackers, and each
TaskTracker runs a dozen or so Map and Reduce tasks.)
Uempty
5.3. YARN architecture
Uempty
YARN architecture
Uempty
Topics
• Introduction to MapReduce
• Hadoop v1 and MapReduce v1 architecture and limitations
• YARN architecture
• Hadoop and MapReduce v1 compared to v2
Uempty
YARN
• Acronym for Yet Another Resource Negotiator.
• New resource manager is included in Hadoop 2.x and later.
• De-couples the Hadoop workload and resource management.
• Introduces a general-purpose application container.
• Hadoop 2.2.0 includes the first generally available (GA) version of
YARN.
• Most Hadoop vendors support YARN.
Uempty
(Slide diagram: the Hadoop v2 stack. On top of HDFS and YARN (cluster resource management)
run multiple engines: MapReduce v2 (batch), Apache Tez (interactive), HBase (online), Apache
Spark (in memory), and others (varied).)
Uempty
(Slide diagram: a YARN cluster with the ResourceManager at node132 and NodeManagers at
node134, node135, and node136.)
Uempty
(Slide sequence: running applications on YARN. For Application 1 ("Analyze lineitem table"), the
ResourceManager at node132 launches Application Master 1 in a container on one NodeManager.
Application Master 1 then sends a resource request to the ResourceManager, receives container
IDs in response, and launches App 1 containers on the NodeManagers at node134 and node136.
When Application 2 ("Analyze customer table") is submitted, the ResourceManager launches
Application Master 2 on another NodeManager, and Application Master 2 obtains containers of its
own and launches its App 2 tasks alongside the containers of Application 1.)
Uempty
(Slide diagram: how YARN runs an application. An application client on the client node submits a
YARN application to the resource manager (1); the resource manager asks a node manager to start
a container and launch the application process, that is, the application master (2a, 2b); the
application master requests further resources from the resource manager through heartbeats (3);
and additional node managers start containers and launch application processes for the distributed
computation (4a, 4b).)
To run an application on YARN, a client contacts the resource manager and prompts it to run an
application master process (step 1). The resource manager then finds a node manager that can
launch the application master in a container (steps 2a and 2b). Precisely what the application
master does after it is running depends on the application. It might simply run a computation in the
container it is running in and return the result to the client, or it might request more containers from
the resource managers (step 3) and use them to run a distributed computation (steps 4a and 4b).
For more information, see White, T. (2015). Hadoop: The Definitive Guide (4th ed.). Sebastopol, CA:
O'Reilly Media, p. 80.
Uempty
YARN features
• Scalability
• Multi-tenancy
• Compatibility
• Serviceability
• Higher cluster utilization
• Reliability and availability
Uempty
YARN lifts the scalability ceiling in Hadoop by splitting the roles of the Hadoop JobTracker into two
processes: a ResourceManager controls access to the cluster's resources (memory, CPU, and
other components), and a per-job ApplicationMaster controls task execution.
YARN can run on larger clusters than MapReduce v1. MapReduce v1 reaches scalability
bottlenecks in the region of 4,000 nodes and 40,000 tasks, which stems from the fact that the
JobTracker must manage both jobs and tasks. YARN overcomes these limitations by using its split
ResourceManager / ApplicationMaster architecture: It is designed to scale up to 10,000 nodes and
100,000 tasks.
In contrast to the JobTracker, each instance of an application has a dedicated ApplicationMaster,
which runs for the duration of the application. This model is closer to the original Google
MapReduce paper, which describes how a master process is started to coordinate Map and
Reduce tasks running on a set of workers.
Uempty
Multi-tenancy generally refers to a set of features that enable multiple business users and
processes to share a common set of resources, such as an Apache Hadoop cluster that uses a
policy rather than physical separation, without negatively impacting service-level agreements
(SLA), violating security requirements, or even revealing the existence of each party.
What YARN does is de-couple Hadoop workload management from resource management, which
means that multiple applications can share a common infrastructure pool. Although this idea is not
new, it is new to Hadoop. Earlier versions of Hadoop consolidated both workload and resource
management functions into a single JobTracker. This approach resulted in limitations for customers
hoping to run multiple applications on the same cluster infrastructure.
To borrow from object-oriented programming terminology, multi-tenancy is an overloaded term. It
means different things to different people depending on their orientation and context. To say that a
solution is multi-tenant is not helpful unless you are specific about the meaning.
Uempty
Some interpretations of multi-tenancy in big data environments are:
• Support for multiple concurrent Hadoop jobs
• Support for multiple lines of business on a shared infrastructure
• Support for multiple application workloads of different types (Hadoop and non-Hadoop)
• Provisions for security isolation between tenants
• Contract-oriented service level guarantees for tenants
• Support for multiple versions of applications and application frameworks concurrently
Organizations that are sophisticated in their view of multi-tenancy need all these capabilities and
more. YARN promises to address some of these requirements and does so in large measure.
However, you will find in future releases of Hadoop that there are other approaches that are being
addressed to provide other forms of multi-tenancy.
Although it is an important technology, the world is not suffering from a shortage of resource
managers. Some Hadoop providers are supporting YARN, and others are supporting Apache
Mesos.
Uempty
To ease the transition from Hadoop v1 to YARN, a major goal of YARN and the MapReduce
framework implementation on top of YARN was to ensure that existing MapReduce applications
that were programmed and compiled against previous MapReduce APIs (MRv1 applications) can
continue to run with little or no modification on YARN (MRv2 applications).
For many users who use the org.apache.hadoop.mapred APIs, MapReduce on YARN ensures full
binary compatibility. These existing applications can run on YARN directly without recompilation.
You can use JAR files from your existing application that code against mapred APIs and use
bin/hadoop to submit them directly to YARN.
Unfortunately, it was difficult to ensure full binary compatibility with the existing applications that
compiled against MRv1 org.apache.hadoop.mapreduce APIs. These APIs have gone through
many changes. For example, several classes stopped being abstract classes and changed to
interfaces. Therefore, the YARN community compromised by supporting source compatibility only
for org.apache.hadoop.mapreduce APIs. Existing applications that use MapReduce APIs are
source-compatible and can run on YARN either with no changes, with simple recompilation against
MRv2 .jar files that are included with Hadoop 2, or with minor updates.
Uempty
The NodeManager is a more generic and efficient version of the TaskTracker. Instead of having a
fixed number of Map and Reduce slots, the NodeManager has several dynamically created
resource containers. The size of a container depends upon the amount of resources it contains,
such as memory, CPU, disk, and network I/O.
Currently, only memory and CPU are supported (YARN-3); cgroups might be used to control disk
and network I/O in the future.
The number of containers on a node is a product of configuration parameters and the total amount
of node resources (such as total CPUs and total memory) outside the resources that are dedicated
to the secondary daemons and the OS.
Uempty
Uempty
Uempty
Uempty
5.4. Hadoop and MapReduce v1 compared to
v2
Uempty
The original Hadoop (v1) and MapReduce (v1) had limitations, and several issues surfaced over
time. We review these issues in preparation for looking at the differences and changes that were
introduced with Hadoop v2 and MapReduce v2.
Uempty
Topics
• Introduction to MapReduce
• Hadoop v1 and MapReduce v1 architecture and limitations
• YARN architecture
• Hadoop and MapReduce v1 compared to v2
Uempty
Hadoop v1 to Hadoop v2
The most notable change from Hadoop v1 to Hadoop v2 is the separation of cluster and resource
management from the execution and data processing environment. This change allows for many
new application types to run on Hadoop, including MapReduce v2.
HDFS is common to both versions. MapReduce is the only execution engine in Hadoop v1. The
YARN framework provides work scheduling that is neutral to the nature of the work that is
performed. Hadoop v2 supports many execution engines, including a port of MapReduce that is
now a YARN application.
Uempty
The fundamental idea of YARN and MRv2 is to split the two major functions of the JobTracker,
resource management and job scheduling / monitoring, into separate daemons. The idea is to have
a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is
either a single job in the classical sense of MapReduce jobs or a DAG of jobs.
The ResourceManager and per-node worker, the NodeManager (NM), form the data-computation
framework. The ResourceManager is the ultimate authority that arbitrates resources among all the
applications in the system.
The per-application ApplicationMaster is, in effect, a framework-specific library that is tasked with
negotiating resources from the ResourceManager and working with the NodeManagers to run and
monitor the tasks.
Uempty
The ResourceManager has two main components: Scheduler and ApplicationsManager:
• The Scheduler is responsible for allocating resources to the various running applications. The
Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for
the application. Also, it offers no guarantees about restarting failed tasks, whether they fail
because of an application failure or a hardware failure. The Scheduler performs its scheduling
function based on the resource requirements of the applications; it does so based on the abstract
notion of a resource Container, which incorporates elements such as memory, CPU, disk, network,
and other resources. In the first version, only memory is supported.
The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster
resources among the various queues, applications, and other items. The current MapReduce
schedulers, such as the CapacityScheduler and the FairScheduler, are some examples of the
plug-in.
The CapacityScheduler supports hierarchical queues to allow for more predictable sharing of
cluster resources.
• The ApplicationsManager is responsible for accepting job submissions and negotiating the first
container for running the application-specific ApplicationMaster. It provides the service for
restarting the ApplicationMaster container on failure.
The NodeManager is the per-machine framework agent that is responsible for containers,
monitoring their resource usage (CPU, memory, disk, and network), and reporting the same to
the ResourceManager / Scheduler.
The per-application ApplicationMaster has the task of negotiating appropriate resource
containers from the Scheduler, tracking their status, and monitoring for progress.
MRv2 maintains API compatibility with the previous stable release (hadoop-1.x), which means
that all MapReduce jobs should still run unchanged on top of MRv2 with just a recompile.
Uempty
Architecture of MRv1
Classic version of MapReduce (MRv1)
(Slide diagram: a client submits a job to the JobTracker; the JobTracker assigns Map and Reduce
tasks to TaskTrackers, and each TaskTracker runs the Map and Reduce tasks.)
In MapReduce v1, there is only one JobTracker that is responsible for allocation of resources, task
assignment to data nodes (as TaskTrackers), and ongoing monitoring ("heartbeat") as each job is
run (the TaskTrackers constantly report back to the JobTracker on the status of each running task).
Uempty
YARN architecture
High-level architecture of YARN
(Slide diagram: a ResourceManager (RM) coordinating multiple NodeManagers (NM).)
Uempty
The NodeManager is a more generic and efficient version of the TaskTracker. Instead of having a
fixed number of Map and Reduce slots, the NodeManager has several dynamically created
resource containers. The size of a container depends upon the amount of resources it contains,
such as memory, CPU, disk, and network I/O. Currently, only memory and CPU (YARN-3) are
supported. cgroups might be used to control disk and network I/O in the future. The number of
containers on a node is a product of configuration parameters and the total amount of node
resources (such as total CPU and total memory) outside the resources that are dedicated to the
secondary daemons and the OS.
The ApplicationMaster can run any type of task inside a container. For example, the MapReduce
ApplicationMaster requests a container to start a Map or a Reduce task, and the Giraph
ApplicationMaster requests a container to run a Giraph task. You can also implement a custom
ApplicationMaster that runs specific tasks and invent a new distributed application framework. I
encourage you to read about Apache Twill, which aims to make it easier to write distributed
applications sitting on top of YARN.
In YARN, MapReduce is simply demoted to the role of a distributed application (but still a useful
one) and is now called MRv2. MRv2 is simply the reimplementation of the classical MapReduce
engine, now called MRv1, that runs on top of YARN.
Uempty
YARN concept       MRv1 equivalent
ApplicationMaster  JobTracker (but dedicated and short-lived)
NodeManager        TaskTracker
Container          Slot
Uempty
Unit summary
Uempty
Review questions
1. Which of the following phases in a MapReduce job is
optional?
A. Map
B. Shuffle
C. Reduce
D. Combiner
2. True or False: Interactive, online, and streaming applications
are not allowed to run on Hadoop v2.
3. The JobTracker in MRv1 is replaced by which components
in YARN? (Select all that apply.)
A. ResourceManager
B. NodeManager
C. ApplicationMaster
D. TaskTracker
Uempty
Uempty
Review answers
1. Which of the following phases in a MapReduce job is
optional?
A. Map
B. Shuffle
C. Reduce
D. Combiner
2. True or False: Interactive, online, and streaming
applications are not allowed to run on Hadoop v2
3. The JobTracker in MRv1 is replaced by which components
in YARN? (Select all that apply.)
A. ResourceManager
B. NodeManager
C. ApplicationMaster
D. TaskTracker
Uempty
Uempty
Exercise: Running
MapReduce and YARN jobs
Uempty
Exercise objectives
• This exercise introduces you to a simple MapReduce program
that uses Hadoop v2 and related technologies. You compile
and run the program by using Hadoop and Yarn commands.
You also explore the MapReduce job’s history with Ambari
Web UI.
• After completing this exercise, you will be able to:
ƒ List the sample MapReduce programs provided by the Hadoop
community.
ƒ Compile MapReduce programs and run them by using Hadoop
and YARN commands.
ƒ Explore the MapReduce job’s history by using the Ambari Web UI.
Uempty
Uempty
Exercise objectives
• In this exercise, you compile and run a new and more complex
version of the WordCount program that was introduced in
“Exercise. Running MapReduce and YARN jobs”. This new
version uses many of the features that are provided by the
MapReduce framework.
• After completing this exercise, you will be able to:
ƒ Compile and run more complex MapReduce programs.
Uempty
Overview
In this unit, you learn about Apache Spark, which is an open source, general-purpose distributed
computing engine that is used for processing and analyzing a large amount of data.
Uempty
Unit objectives
• Explain the nature and purpose of Apache Spark in the Hadoop
infrastructure.
• Describe the architecture and list the components of the Apache Spark
unified stack.
• Describe the role of a Resilient Distributed Dataset (RDD).
• Explain the principles of Apache Spark programming.
• List and describe the Apache Spark libraries.
• Start and use Apache Spark Scala and Python shells.
• Describe Apache Spark Streaming, Apache Spark SQL, MLlib, and
GraphX.
Uempty
6.1. Apache Spark overview
Uempty
Uempty
Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring
Uempty
There is an explosion of data, and no matter where you look, data is everywhere. You get data from
social media such as Twitter feeds, Facebook posts, SMS, and many other sources. Processing
this data as quickly as possible becomes more important every day. How can you discover what
your customers want and offer it to them immediately? You do not want to wait hours for a batch job
to complete when you must have the data in minutes or less.
MapReduce is useful, but the amount of time it takes for the jobs to run is no longer acceptable in
many situations. The learning curve to write a MapReduce job is also difficult because it takes
specific programming knowledge and expertise. Also, MapReduce jobs work only for a specific set
of use cases. You need something that works for a wider set of use cases.
Uempty
Apache Spark was designed as a computing platform to be fast, general-purpose, and easy to use.
It extends the MapReduce model and takes it to a whole other level:
• The speed comes from in-memory computation. Running applications in memory allows for
much faster processing and response. Apache Spark is even faster than MapReduce for
complex applications that run on disk.
• The Apache Spark generality covers a wide range of workloads under one system. You can run
batch applications such as MapReduce-type jobs or iterative algorithms that build upon each
other. You can also run interactive queries and process streaming data with your application. In
a later slide, you see that there are several libraries that you can easily use to expand beyond
the basic Apache Spark capabilities.
• The ease of use of Apache Spark enables you to quickly pick it up by using simple APIs for
Scala, Python, Java, and R. There are more libraries that you can use for SQL, machine
learning, streaming, and graph processing. Apache Spark runs on Hadoop clusters such as
Hadoop YARN or Apache Mesos, or even as a stand-alone product with its own scheduler.
Uempty
Ease of use
• To implement the classic WordCount in Java MapReduce, you need
three classes: the main class that sets up the job, a mapper, and a
reducer, each about 10 lines long.
• Here is the same WordCount program that is written in Scala for
Apache Spark:
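The program listing itself is on the slide; as a stand-in, a typical Scala WordCount for Apache Spark
looks like the following sketch (the HDFS paths are placeholders and sc is the SparkContext that
the shell provides), which may differ in detail from the exact code that is shown:

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")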
Apache Spark supports Scala, Python, Java, and R programming languages. The slide shows
programming in Scala. Python is widespread among data scientists and in the scientific community,
bringing those users on par with Java and Scala developers.
An important aspect of Apache Spark is the ways that it can combine the functions of many tools
that are available in the Hadoop infrastructure to provide a single unifying platform. In addition, the
Apache Spark execution model is general enough that a single framework can be used for the
following tasks:
• Batch processing operations (like in MapReduce)
• Stream data processing
• Machine learning
• SQL-like operations
• Graph operations
The result is that many ways of working with data are available on the same platform, which bridges
the gap between the work of the classic big data programmer, data engineers, and data scientists.
However, Apache Spark has its own limitations; no tool is universal. For example, Apache Spark is
not suitable for transaction processing and other atomicity, consistency, isolation, and durability
(ACID) types of operations.
Uempty
You might be asking why you want to use Apache Spark and what you use it for.
Apache Spark is related to MapReduce in a sense that it expands on Hadoop's capabilities. Like
MapReduce, Apache Spark provides parallel distributed processing, fault tolerance on commodity
hardware, scalability, and other processes. Apache Spark adds to the concept with aggressively
cached in-memory distributed computing, low latency, high-level APIs, and a stack of high-level
tools, which are described on the next slide. These features save time and money.
Uempty
There are two groups that want to use Apache Spark, which are data scientists and data engineers,
who have overlapping skill sets:
• Data scientists must analyze and model the data to obtain insight. They must transform the data
into something that they can use for data analysis. They use Apache Spark for its ad hoc
analysis to run interactive queries that give them results immediately. Data scientists might have
experience using SQL, statistics, machine learning, and some programming, usually in Python,
MatLab, or R. After the data scientists obtain insights into the data and determine that there is a
need to develop a production data processing application, a web application, or some system to
act upon the insight, the work is handed over to data engineers.
• Data engineers use the Apache Spark programming API to develop a system that implements
business use cases. Apache Spark parallelizes these applications across the clusters while
hiding the complexities of distributed systems programming and fault tolerance. Data engineers
can employ Apache Spark to monitor, inspect, and tune applications.
For everyone else, Apache Spark is easy to use with a wide range of functions. The product is
mature and reliable.
Uempty
(Slide diagram: the Apache Spark unified stack. On top of Apache Spark Core sit Apache Spark
SQL, Apache Spark Streaming (real-time processing), MLlib (machine learning), and GraphX
(graph processing); Spark Core itself runs on the stand-alone scheduler, YARN, or Apache Mesos.)
Apache Spark Core is at the center of the Apache Spark Unified Stack. Apache Spark Core is a
general-purpose system providing scheduling, distributing, and monitoring of the applications
across a cluster.
Apache Spark Core is designed to scale up from one to thousands of nodes. It can run over various
cluster managers, including Hadoop YARN and Apache Mesos, or it can run stand-alone with its
own built-in scheduler.
Apache Spark Core contains basic Apache Spark functions that are required for running jobs and
needed by other components. The most important of these functions is the Resilient Distributed
Dataset (RDD), which is the main element of the Apache Spark API. RDD is an abstraction of a
distributed collection of items with operations and transformations that is applicable to the data set.
It is resilient because it can rebuild data sets if there are node failures.
Various add-in components can run on top of the core that are designed to interoperate closely so
that the users combine them like they would any libraries in a software project. The benefit of the
Apache Spark Unified Stack is that all the higher layer components inherit the improvements that
are made at the lower layers. For example, optimizing the Apache Spark Core speeds up the SQL,
the streaming, the machine learning, and the graph processing libraries as well.
Uempty
Apache Spark simplifies the picture by providing many of Hadoop functions through several
purpose-built components: Apache Spark Core, Apache Spark SQL, Apache Spark Streaming,
Apache Spark MLlib, and Apache Spark GraphX:
• Apache Spark SQL is designed to work with Apache Spark by using SQL and HiveQL (a Hive
variant of SQL). Apache Spark SQL allows developers to intermix SQL with the Apache Spark
programming language, which is supported by Python, Scala, Java, and R.
• Apache Spark Streaming provides processing of live streams of data. The Apache Spark
Streaming API closely matches the Apache Spark Core's API, making it easy for developers to
move between applications that process data that is stored in memory versus data that arrives
in real time. Apache Spark Streaming also provides the same degree of fault tolerance,
throughput, and scalability that the Apache Spark Core provides.
• MLlib is the machine learning library that provides multiple types of machine learning algorithms.
These algorithms are designed to scale out across the cluster. Supported algorithms include
logistic regression, naive Bayes classification, support vector machine (SVM), decision trees,
random forests, linear regression, k-means clustering, and others.
• GraphX is a graph processing library with APIs that manipulates graphs and performs
graph-parallel computations. Graphs are data structures that are composed of vertices and
edges connecting them. GraphX provides functions for building graphs and implementations of
the most important algorithms of the graph theory, like page rank, connected components,
shortest paths, and others.
"If you compare the functionalities of Apache Spark components with the tools in the Hadoop
ecosystem, you can see that some of the tools are suddenly superfluous. For example, Apache
Storm can be replaced by Apache Spark Streaming, Apache Giraph can be replaced by Apache
Spark GraphX, and Apache Spark MLlib can be used instead of Apache Mahout. Apache Pig and
Apache Sqoop are not really needed anymore, as the same functionalities are covered by Apache
Spark Core and Apache Spark SQL. But even if you have legacy Pig workflows and need to run
Pig, the Spork project enables you to run Pig on Apache Spark." - Bonaći, M. and Zečević, P.,
Spark in Action. Greenwich, CT: Manning Publications, 2016. ISBN 1617292605.
("Spork" is Apache Pig on Apache Spark, as described at
https://github.com/sigmoidanalytics/spork.)
Uempty
Apache Spark jobs can be written in Scala, Python, or Java. Apache Spark shells are available for
Scala (spark-shell) and Python (pyspark). This course does not teach you how to program in each
specific language, but we cover how to use some of them within the context of Apache Spark. It is a
best practice that you have at least some programming background to understand how to code in
any of these languages.
If you are setting up the Apache Spark cluster yourself, you must ensure that you have a compatible
version of it. This information can be found on the Apache Spark website. In the lab environment,
everything is set up for you, so you start the shell, and you are ready to go.
Apache Spark itself is written in the Scala language, so it is natural to use Scala to write Apache
Spark applications. This course covers code examples that are written in Scala, Python, and Java.
Java 8 supports the functional programming style to include lambdas, which concisely capture the
functions that are run by the Apache Spark engine. Lambdas bridge the gap between Java and
Scala for developing applications on Apache Spark.
Uempty
The Apache Spark shell provides a simple way to learn the Apache Spark API. It is also a powerful
tool to analyze data interactively. The shell is available in either Scala, which runs on the JVM, or
Python. To start the Scala shell, run spark-shell from within the Apache Spark bin directory. To
create an RDD from a text file, call the textFile method on the sc object, which is the SparkContext.
To start the shell for Python, run pyspark from the same bin directory. Then, calling the textFile
method also creates an RDD for that text file.
In the lab exercise later, you start either of the shells and run a series of RDD transformations and
actions to get a feel of how to work with Apache Spark. Later, you dive deeper into RDDs.
IPython offers features such as tab completion. For more information, see: http://ipython.org.
IPython Notebook is a web-browser based version of IPython.
Uempty
6.2. Scala overview
Uempty
Scala overview
Uempty
Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring
Uempty
Everything in Scala is an object. The primitive types that are defined by Java, such as int or
Boolean, are objects in Scala. Functions are objects in Scala and play an important role in how
applications are written for Apache Spark.
Numbers are objects. As an example, in the expression that you see here, “1 + 2 * 3 / 4” means that
the individual numbers invoke the various identifiers +, -, *, and / with the other numbers that are
passed in as arguments by using the dot notation.
Functions are objects. You can pass functions as arguments into another function. You can store
them as variables. You can return them from other functions. The function declaration is the
function name followed by the list of parameters and then the return type.
If you want to learn more about Scala, go to its website for tutorials and guide. Throughout this
course, you see examples in Scala that have explanations about what it does. Remember, the
focus of this unit is on the context of Apache Spark, and is not intended to teach Scala, Python, or
Java.
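For illustration, the following small Scala sketch shows functions being stored in a variable, passed
as an argument, and returned from another function; the names are invented for the example.

// A function stored in a variable.
val double: Int => Int = x => x * 2

// A function that takes another function as an argument.
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

// A function that returns another function.
def adder(n: Int): Int => Int = x => x + n

println(applyTwice(double, 3))   // 12
println(adder(10)(5))            // 15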
Uempty
References for learning Scala:
• Horstmann, C. S., Scala for the Impatient. Upper Saddle River, NJ: Addison-Wesley
Professional, 2010. 0321774094.
• Odersky, M., et al., Programming in Scala: A Comprehensive Step-by-Step Guide, 2nd Edition.
Walnut Creek, CA: Artima Press, 2011. 0981531644.
• https://docs.scala-lang.org/tutorials/scala-for-java-programmers.html
Uempty
Anonymous functions are common in Apache Spark applications. Essentially, if the function you
need is going to be required only once, there is no value in naming it. Use it anonymously and
forget about it. For example, you have a timeFlies function and print a statement to the console in
it. In another function, oncePerSecond, you must call this timeFlies function. Without anonymous
functions, you code it like the previous example by defining the timeFlies function.
By using the anonymous function capability, you provide the function only with arguments, the right
arrow, and the body of the function after the right arrow. Because you use this function in only this
place, you do not need to name the function.
The Python syntax is relatively convenient and easy to work with, but aside from the basic structure
of the language Python is also sprinkled with small syntax structures that make certain tasks
especially convenient. The lambda keyword/function construct is one of them: The creators call it
"syntactical candy."
Reference:
https://docs.scala-lang.org/tutorials/scala-for-java-programmers.html
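Here is a sketch of the timeFlies and oncePerSecond example that is described above, based on
the Scala tutorial that is referenced; the printed message is illustrative.

object Timer {
  // Calls the supplied function once per second, forever.
  def oncePerSecond(callback: () => Unit): Unit = {
    while (true) { callback(); Thread.sleep(1000) }
  }

  // Named version: define timeFlies and pass its name to oncePerSecond.
  def timeFlies(): Unit = println("time flies like an arrow...")

  def main(args: Array[String]): Unit = {
    // oncePerSecond(timeFlies)                                   // named function
    oncePerSecond(() => println("time flies like an arrow..."))   // anonymous function
  }
}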
Uempty
text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
• Lambda functions can be used with Scala, Python, and Java 8. This
example is written in Python.
The lambda or => syntax is a shorthand way to define functions inline in Python and Scala,
respectively. With Apache Spark, you can also define a function separately and then pass its name
to Apache Spark.
For example, in Python:
def hasHDP(line):
    return "HDP" in line
HDPLines = lines.filter(hasHDP)
…is functionally equivalent to:
grep HDP inputfile
A common example is MapReduce WordCount. You split up the file by words (tokenization) and
then map each word into a key value pair with the word as the key and a value of 1. Then, you
reduce by the key, which adds all the values of the same key, effectively counting the number of
occurrences of that key. Finally, the counts are written to a file in HDFS.
Uempty
References:
• A tour of Scala: Anonymous function syntax: https://www.scala-lang.org/old/node/133
• Apache Software Foundation:
Apache Spark Examples https://spark.apache.org/examples.html
Uempty
6.3. Resilient Distributed Dataset
Uempty
Uempty
Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring
Uempty
(Slide diagram: an example RDD flow in which an RDD is created, transformed, and reduced to a
final result.)
The Apache Spark primary core abstraction is called RDD. RDD is a distributed collection of
elements that is parallelized across the cluster.
There are two types of RDD operations: transformations and actions.
• Transformations do not return a value to the driver. In fact, nothing is evaluated during the
definition of transformation statements. Apache Spark creates the definition of a transformation
that is evaluated later at run time, which is called lazy evaluation. The transformations are stored
as a directed acyclic graph (DAG).
• Actions trigger the evaluation. When an action is called, the transformations that lead up to it are
performed, along with the work of the action itself, to use or produce RDDs. Actions return values.
For example, you can do a count on an RDD to get the number of elements within it, and that
value is returned.
Uempty
The fault-tolerant aspect of RDDs enables Apache Spark to reconstruct the transformations that are
used to build the lineage to get back any lost data.
In the example RDD flow that is shown in the slide, the first step loads the data set from Hadoop.
Successive steps apply transformations on this data, such as filter, map, or reduce. Nothing
happens until an action is called. The DAG is updated with each transformation until an action is
called, which provides fault tolerance: for example, if a node goes offline, all it must do when it
comes back online is reevaluate the graph from where it left off.
In-memory caching is provided with Apache Spark to enable the processing to happen in memory. If
the RDD does not fit in memory, it spills to disk.
There are three methods for creating an RDD:
• You can parallelize an existing collection, which means that the data is within Apache Spark and
can now be operated on in parallel. For example, if you have an array of data, you can create
an RDD from it by calling the parallelized method. This method returns a pointer to the RDD.
So, this new distributed data set can now be operated on in parallel throughout the cluster.
• You can reference a data set that can come from any storage source that is supported by
Hadoop, such as HDFS and Amazon S3.
• You can transform an existing RDD to create a new RDD. For example, if you have the array of
data that you parallelized earlier and you want to keep only the records that match a filter, a new
RDD is created by using the filter method.
Uempty
Creating an RDD
To create an RDD, complete the following steps:
1. Start the Apache Spark shell (requires a PATH environment variable):
spark-shell
2. Create some data:
val data = 1 to 10000
3. Parallelize that data (creating the RDD):
val distData = sc.parallelize(data)
4. Perform more transformations or invoke an action on the
transformation:
distData.filter(…)
You can also create an RDD from an external data set:
val readmeFile = sc.textFile("Readme.md")
Here is a quick example of how to create an RDD from an existing collection of data. In the
examples throughout the course, unless otherwise indicated, you are using Scala to show how
Apache Spark works. In the lab exercises, you get to work with Python and Java as well.
1. Start the Apache Spark shell. This command is found under the /usr/bin directory.
2. After the shell is up (with the scala> prompt), create some data with values 1 - 10,000.
3. Create an RDD from that data by using the parallelize method from the SparkContext, shown as
sc on the slide, which means that the data can now be operated on in parallel. You learn more
about the SparkContext (the sc object that starts the parallelize function) later, so for now,
know that when you initialize a shell, the SparkContext is initialized for you to use.
The parallelize method returns a pointer to the RDD. Operations such as parallelize return
only a pointer to the RDD; the method does not create that RDD until some action is started on
it. With this new RDD, you can perform more transformations or actions on it, such as the filter
transformation. With large amounts of data (big data), you do not want to duplicate the data or
cache it in memory until it is needed.
Another way to create an RDD is from an external data set. In the example here, you create an
RDD from a text file by using the textFile method of the SparkContext object. You see more
examples about how to create RDDs throughout this course.
Uempty
Now, you are loading a file from HDFS. Loading the file creates an RDD, which is only a pointer to
the file. The data set is not loaded into memory yet. Nothing happens until an action is called. The
transformation updates only the directed acyclic graph (DAG).
So, the transformation here maps each line to the length of that line. Then, the action operation
reduces it to get the total length of all the lines. When the action is called, Apache Spark goes
through the DAG and applies all the transformations up until that point followed by the action, and
then a value is returned to the caller.
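To make this concrete, here is a minimal Scala sketch of the sequence just described, assuming the
Apache Spark shell (so sc already exists) and a hypothetical HDFS path:
val lines = sc.textFile("hdfs:///user/spark/data.txt")  // pointer only; nothing is read yet
val lineLengths = lines.map(line => line.length)        // transformation: only the DAG is updated
val totalLength = lineLengths.reduce((a, b) => a + b)   // action: the DAG runs and a value returns
Only the last line triggers any work on the cluster.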
A DAG is essentially a graph of the business logic that is not run until an action is called (often
called lazy evaluation).
To view the DAG of an RDD after a series of transformations, use the toDebugString method you
see in the slide. The method displays the series of transformations that Apache Spark goes through
after an action is called. You read the series from the bottom up. In the sample DAG that is shown
in the slide, you can see that it starts as a textFile and goes through a series of transformations,
such as map and filter, followed by more map operations. It is this behavior that enables fault
tolerance. If a node goes offline and comes back on, all it must do is grab a copy of the DAG from a
neighboring node and rebuild the graph back to where it was before it went offline.
Uempty
In the next several slides, you see at a high level what happens when an action is run.
First, look at the code. The goal here is to analyze some log files. In the first line, you load the log
from HDFS. In the next two lines, you filter out the messages within the log errors. Before you start
an action on it, you tell it to cache the filtered data set (it does not cache it yet as nothing has been
done up until this point).
Then, you do more filters to get specific error messages relating to MySQL and PHP followed by
the count action to find out how many errors were related to each of those filters.
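A minimal Scala sketch of that flow, assuming the Apache Spark shell and a hypothetical log path and
message format:
val logFile = sc.textFile("hdfs:///var/log/app.log")          // hypothetical path
val errors  = logFile.filter(line => line.contains("ERROR"))  // filter out the error messages
errors.cache()                                                // nothing is cached yet
val mysqlCount = errors.filter(_.contains("MySQL")).count()   // first action: reads HDFS, fills cache
val phpCount   = errors.filter(_.contains("PHP")).count()     // second action: served from the cache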
Uempty
In reviewing the steps, the first thing that happens when you load the text file is the data is
partitioned into different blocks across the cluster.
Uempty
The driver sends the code to be run on each block. In this example, the code is the various
transformations and actions that are sent out to the workers. The executor on each worker
performs the work on each block. You learn more about executors later in this unit.
Uempty
The executors read the HDFS blocks to prepare the data for the operations in parallel.
Uempty
After a series of transformations, you want to cache the results up until that point into memory. A
cache is created.
Uempty
After the first action completes, the results are sent back to the driver. In this case, you are looking
for messages that relate to MySQL that are then returned to the driver.
Uempty
To process the second action, Apache Spark uses the data on the cache (it does not need to go to
the HDFS data again). Apache Spark reads it from the cache and processes the data from there.
Uempty
Finally, the results are sent back to the driver and you complete a full cycle.
Uempty
Transformation and meaning:
• map(func): Returns a new data set that is formed by passing each element of the source through a
function func.
• filter(func): Returns a new data set that is formed by selecting those elements of the source on
which func returns true.
• flatMap(func): Like map, but each input item can be mapped to 0 or more output items, so func
should return a Seq rather than a single item.
• join(otherDataset, [numTasks]): When called on data sets of type (K, V) and (K, W), returns a data
set of (K, (V, W)) pairs with all pairs of elements for each key.
• reduceByKey(func): When called on a data set of (K, V) pairs, returns a data set of (K, V) pairs
where the values for each key are aggregated by using the reduce function func.
• sortByKey([ascending], [numTasks]): When called on a data set of (K, V) pairs where K implements
Ordered, returns a data set of (K, V) pairs that are sorted by keys in ascending or descending order.
Here are some of the transformations that are available. The full set can be found on the Apache
Spark website. The Apache Spark Programming Guide can be found at
https://spark.apache.org/docs/latest/programming-guide.html, and transformations can be found at
https://spark.apache.org/docs/latest/rdd-programming-guide.html.
Transformations are lazy evaluations. Nothing is run until an action is called. Each transformation
function basically updates the graph, and when an action is called, the graph runs. A transformation
returns a pointer to the new RDD.
Uempty
Some of the less obvious transformations:
• The flatMap function is like map, but each input can be mapped to 0 or more output items. The
func method should return a sequence of objects rather than a single item. flatMap then flattens
the resulting list of lists for the operations that follow. Basically, flatMap is used for
MapReduce-style operations where you might have a text file and, each time a line is read in, you
split up that line by spaces to get individual keywords. Each of those lines ultimately is flattened
so that you can perform the map operation on it to map each keyword to the value of one.
• The join function combines two sets of key value pairs and returns a set of keys to a pair of
values from the two sets. For example, you have a K,V pair and a K,W pair. When you join them
together, you get a K,(V,W) set.
• The reduceByKey function aggregates the values for each key by using the reduce function. You
would use it in a WordCount to sum up the values for each word to count its occurrences. A short
sketch of these transformations follows this list.
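The following Scala sketch, run in the Apache Spark shell (sc available), illustrates flatMap and
join; the sample values are illustrative:
val lines = sc.parallelize(Seq("to be or", "not to be"))
lines.map(_.split(" ")).collect()     // Array(Array(to, be, or), Array(not, to, be))
lines.flatMap(_.split(" ")).collect() // Array(to, be, or, not, to, be): the lists are flattened
val ages   = sc.parallelize(Seq(("alice", 30), ("bob", 25)))      // (K, V)
val cities = sc.parallelize(Seq(("alice", "NY"), ("bob", "SF")))  // (K, W)
ages.join(cities).collect()           // Array((alice,(30,NY)), (bob,(25,SF))): (K, (V, W))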
Uempty
Action and meaning:
• collect(): Returns all the elements of the data set as an array to the driver program. This action
is usually useful after a filter or another operation that returns a sufficiently small subset of data.
• count(): Returns the number of elements in a data set.
• take(n): Returns an array with the first n elements of the data set.
Action returns values. Again, you can find more information on the Apache Spark website. The full
set of functions is available at: https://spark.apache.org/docs/latest/rdd-programming-guide.html
The slide shows a subset:
• The collect function returns all the elements of the data set as an array to the driver program.
This function is usually useful after a filter or another operation that returns a sufficiently small
subset of data, to make sure that your filter function works correctly.
• The count function returns the number of elements in a data set and can also be used to check
and test transformations.
• The take(n) function returns an array with the first n elements. This function is not run in
parallel. The driver computes all the elements.
• The foreach(func) function runs a function func on each element of the data set.
Uempty
RDD persistence
• Each node stores partitions of the cache that it computes in memory.
• The node reuses them in other actions on that data set (or derived data sets).
Future actions are much faster (often by more than 10x).
• There are two methods for RDD persistence:
ƒ persist()
ƒ cache()
Here, we describe RDD persistence, which you already know as the cache function. The cache
function is the default form of the persist function: cache() is essentially persist() with
MEMORY_ONLY storage.
One of the key capabilities of Apache Spark is its speed through persisting or caching. Each node
stores in memory any partitions of the data set that it computes. When a subsequent action is called
on the same data set or a derived data set, Apache Spark uses it from memory instead of having to
retrieve it again. Future actions in such cases are often 10 times faster. The first time an RDD is
persisted, it is kept in memory on the node. Caching is fault-tolerant because if any part of the
partition is lost, it automatically is recomputed by using the transformations that originally created it.
Uempty
There are two methods to invoke RDD persistence:
• The persist() method enables you to specify a different storage level of caching. For example,
you can choose to persist the data set on disk, persist it in memory but as serialized objects to
save space, and other ways.
• The cache() method is the default way of using persistence, storing deserialized objects in
memory (see the sketch that follows this list).
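As a minimal sketch, assuming the Apache Spark shell and a hypothetical file path, the two methods
look like this:
import org.apache.spark.storage.StorageLevel
val logs   = sc.textFile("hdfs:///var/log/app.log")   // hypothetical path
val errors = logs.filter(_.contains("ERROR"))
errors.cache()                                        // same as persist(StorageLevel.MEMORY_ONLY)
// errors.persist(StorageLevel.MEMORY_AND_DISK_SER)   // alternative: serialized, spills to disk
errors.count()   // the first action materializes the cache; later actions reuse it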
The table shows the storage levels and what they mean. Basically, you can choose to store in
memory or memory and disk. If a partition does not fit in the specified cache location, then it is
recomputed dynamically. You can also decide to serialize the objects before storing them. This
option is space-efficient, but it requires the RDD to be deserialized before it can be read, so it
adds CPU overhead. There also is the option to replicate each partition on two cluster nodes.
Finally, there is an experimental storage level storing the serialized object in Tachyon. This level
reduces garbage collection impact and allows the executors to be smaller and share a pool of
memory. You can read more about this level on the Apache Spark website.
Uempty
There are many rules on this slide, but use them as a reference when you must decide which type of
storage level to use. There are tradeoffs between the different storage levels. You should analyze your
situation to decide which level works best. You can find this information on the Apache Spark
website: https://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence.
Uempty
Here are the primary rules:
• If the RDDs fit with the default storage level (MEMORY_ONLY), leave them that way because
this level is the most CPU-efficient option, and it allows operations on the RDDs to run as fast
as possible. Basically, if your RDD fits within the default storage level, use that. It is the fastest
option to take advantage of Apache Spark design.
• Otherwise, use MEMORY_ONLY_SER and a fast serialization library to make objects more
space-efficient but still reasonably fast to access.
• Do not spill to disk unless the functions that compute your data sets are expensive or require a
large amount of space.
• If you want fast recovery, use the replicated storage levels. All levels of storage are
fault-tolerant but still require the recomputing of the data. If you have a replicated copy, you can
continue to work while Apache Spark is recomputing a lost partition.
• In environments with high amounts of memory or multiple applications, the experimental
OFF_HEAP mode has several advantages. Use Tachyon if your environment has high amounts
of memory or multiple applications. The OFF_HEAP mode allows you to share the same pool of
memory and reduces garbage collection costs. Also, the cached data is not lost if the individual
executors fail.
Uempty
Scala:  val pair = ('a', 'b')                 pair._1 // returns 'a'     pair._2 // returns 'b'
Python: pair = ('a', 'b')                     pair[0] # returns 'a'      pair[1] # returns 'b'
Java:   Tuple2 pair = new Tuple2('a', 'b');   pair._1 // returns 'a'     pair._2 // returns 'b'
On this slide and the next one, you review Apache Spark shared variables and the type of
operations that you can do on key-value pairs.
Apache Spark provides two limited types of shared variables for common usage patterns:
broadcast variables and accumulators.
Normally, when a function is passed from the driver to a worker, a separate copy of the variables is
used for each worker. Broadcast variables allow each machine to work with a read-only variable
that is cached on each machine. Apache Spark attempts to distribute broadcast variables by using
efficient algorithms. As an example, broadcast variables can be used to give every node a copy of
a large data set efficiently.
Uempty
The other shared variables are accumulators, which are used for counters and sums that work well in
parallel. These variables can be added to only through an associative operation. Only the driver can
read the accumulator's value, not the tasks; the tasks can only add to it. Apache Spark supports
numeric types, but programmers can add support for new types. As an example, you can use
accumulator variables to implement counters or sums, as in MapReduce.
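Here is a minimal Scala sketch of both shared variable types, using the Spark 1.x API and assuming
the Apache Spark shell; the lookup table and the data are illustrative:
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // read-only copy cached on every worker
val badRecords = sc.accumulator(0)                   // tasks can only add; only the driver reads it
val data = sc.parallelize(Seq("a", "b", "c", "a"))
data.foreach { key =>
  if (!lookup.value.contains(key)) badRecords += 1   // count keys missing from the broadcast map
}
println(badRecords.value)                            // read on the driver after the action completes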
Key-value pairs are available in Scala, Python, and Java. In Scala, you create a key-value pair
RDD by typing val pair = ('a', 'b'). To access each element, use the "._" notation. Scala is not
zero-indexed, so "._1" returns the value in the first position and "._2" returns the value in the
second position. Java is like Scala in that it is not zero-indexed; you create a Tuple2 object in
Java to make a key-value pair. Python uses zero-indexed notation, so the value of the first element
is at index 0 and the second element is at index 1.
Uempty
There are special operations that are available to RDDs of key-value pairs. In an application, you
must import the SparkContext package to use PairRDDFunctions such as reduceByKey.
The most common ones are those operations that perform grouping or aggregating by a key. RDDs
containing the Tuple2 object represent the key-value pairs. Tuple2 objects are created simply by
writing “(a, b)” if you import the library to enable Apache Spark implicit conversion.
If you have custom objects as the key inside your key-value pair, you must provide your own
equals() method to do the comparison, and a matching hashCode() method.
In this example, you have a textFile that is a normal RDD. Then, you perform some
transformations on it, and it creates a PairRDD that allows it to invoke the reduceByKey method
that is part of the PairRDDFunctions API.
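A minimal sketch of that pattern, assuming the Apache Spark shell and the Readme.md file used
earlier:
import org.apache.spark.SparkContext._    // needed in an application to use PairRDDFunctions
val textFile = sc.textFile("Readme.md")                            // a normal RDD[String]
val pairs = textFile.flatMap(_.split(" ")).map(word => (word, 1))  // now an RDD of key-value pairs
val counts = pairs.reduceByKey(_ + _)                              // reduceByKey from PairRDDFunctions
counts.take(5).foreach(println)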
Uempty
6.4. Programming with Apache Spark
Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring
Uempty
The compatibility of Apache Spark with various versions of the programming languages is
important.
As new releases of the HDP are released, you should revisit the issue of compatibility of languages
to work with the new versions of Apache Spark.
You can view all versions of Apache Spark and compatible software at:
http://spark.apache.org/documentation.html
Uempty
SparkContext
• The SparkContext is the main entry point for Apache Spark functions: It
represents the connection to an Apache Spark cluster.
• Use the SparkContext to create RDDs, accumulators, and broadcast
variables on that cluster.
• With the Apache Spark shell, the SparkContext (sc) is automatically
initialized for you to use.
• But in an Apache Spark program, you must add code to import some
classes and implicit conversions into your program:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
The SparkContext is the main entry point to everything in Apache Spark. It can be used to create
RDDs and shared variables on the cluster. When you start the Apache Spark Shell, the
SparkContext is automatically initialized for you with the variable sc. For an Apache Spark
application, you must first import some classes and implicit conversions and then create the
SparkContext object.
The three import statements for Scala are shown on the slide.
Uempty
Each Apache Spark application that you create requires certain dependencies. Over the next three
slides, you review how to link to those dependencies depending on which programming language
that you decide to use.
To link with Apache Spark by using Scala, you must have a version of Scala that is compatible with
the Apache Spark version that you use. For example, Apache Spark 1.6.3 uses Scala 2.10, so
make sure that you have Scala 2.10 if you want to write applications for Apache Spark 1.6.3.
To write an Apache Spark application, you must add a Maven dependency to Apache Spark. The
information is shown on this slide. If you want to access a Hadoop cluster, you must add a
dependency to that too.
In the lab environment for this course, the dependency is set up for you. The information on this
page is important if you want to set up an Apache Spark stand-alone environment or your own
Apache Spark cluster. For more information about Apache Spark versions and dependencies, see
the following website:
https://mvnrepository.com/artifact/org.apache.spark/spark-core?repo=hortonworks-releases
Uempty
After you have the dependencies established, the first thing to do in your Apache Spark application
before you can initialize Apache Spark is to build a SparkConf object. This object contains
information about your application. For example:
val conf = new SparkConf().setAppName(appName).setMaster(master)
You set the application name and tell it which node is the master node. The "master" parameter can
be a stand-alone Apache Spark distribution, Apache Mesos, or a YARN cluster URL. You can also
use the local keyword string to run it in local mode. In fact, you can specify local[16] to allocate
16 cores to that job or Apache Spark shell.
For production mode, you do not want to hardcode the “master” path in your program. Instead, use
it as an argument for the spark-submit command.
After you have SparkConf set up, you pass it as a parameter to the SparkContext constructor to
create it.
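Putting those pieces together, a minimal Scala sketch (the application name and master value are
illustrative) looks like this:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("MySparkApp")
  .setMaster("local[16]")      // for production, pass the master to spark-submit instead
val sc = new SparkContext(conf)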
Uempty
Apache Spark 1.6.3 works with Python 2.6 or higher. It uses the standard CPython interpreter, so C
libraries like NumPy can be used.
Check which version of Apache Spark you have when you enter an environment that uses it.
To run Apache Spark applications in Python, use the bin/spark-submit script in the Apache Spark
home directory. This script loads the Apache Spark Java and Scala libraries so that you can submit
applications to a cluster. If you want to use HDFS, you must link to it too. In the lab environment in
this course, you do not need to link to HDFS because Apache Spark is bundled with it. However,
you must import some Apache Spark classes, as shown.
Uempty
This slide shows the information for Python. It is like the information for Scala, but the syntax here is
slightly different. You must set up a SparkConf object to pass as a parameter to the SparkContext
object. As a best practice, pass the “master” parameter as an argument to the spark-submit
operation.
Uempty
If you are using Java 8, Apache Spark supports Lambda expressions for concisely writing functions.
Otherwise, you can use the org.apache.spark.api.java.function package with older Java versions.
As with Scala, you must add a dependency to Apache Spark, which is available through Maven
Central at the following website:
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/.
If you want to access an HDFS cluster, you must add the dependency there too.
Last, you must import some Apache Spark classes.
Uempty
Here is the same information for Java. Following the same idea, you must create the SparkConf
object and pass it to the SparkContext, which in this case is a JavaSparkContext. Among the import
statements in the program, you import the JavaSparkContext libraries.
Uempty
Passing functions to Apache Spark is important to understand as you think about the business logic
of your application.
The design of Apache Spark API heavily relies on passing functions in the driver program to run on
the cluster. When a job is run, the Apache Spark driver must tell its worker how to process the data.
Uempty
There are three methods that you can use to pass functions:
• Using an anonymous function syntax.
This method is useful for short pieces of code. For example, here we define an anonymous
function that takes in a parameter x of type Int and adds one to it. Essentially, anonymous
functions are useful for one-time use; you do not need to define the function explicitly to use it.
You define it as you go. The left side of the => is the parameter list or arguments. The right side
of the => is the body of the function.
• Static methods in a global singleton object.
With this method, you can create a global object. In the example, it is the object MyFunctions.
Inside that object, you basically define the function func1. When the driver requires that
function, it sends out only the object, and the worker can access it. In this case, when the driver
sends out the instructions to the worker, it sends out only the singleton object.
• Passing by reference to a method in a class instance as opposed to a singleton object.
This method requires sending the object that contains the class along with the method. To avoid
sending the entire object along, consider copying it to a local variable within the function instead
of accessing it externally.
For example, you have a field with the string “Hello”. You want to avoid calling the string directly
inside a function, which is shown on the slide as “x => field + x”. Instead, assign it to a local
variable so that only the reference is passed along and not the entire object, as shown here:
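The slide code is not reproduced in these notes, so here is a minimal sketch of the pattern; the
class, field, and method names are illustrative:
import org.apache.spark.rdd.RDD
class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = {
    // rdd.map(x => field + x) would ship the whole MyClass instance to the workers
    val field_ = this.field      // copy the field into a local variable first
    rdd.map(x => field_ + x)     // only the local string is captured by the closure
  }
}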
Uempty
This slide shows you how you can create an application by using a simple but effective example
that demonstrates Apache Spark capabilities.
After you have the beginning of your application ready by creating the SparkContext object, you
can start to program the business logic by using the Apache Spark API that is available in Scala,
Java, R, and Python. You create the RDD from an external data set or from an existing RDD. You use
transformations and actions to compute the business logic. You can take advantage of RDD
persistence, broadcast variables, and accumulators to improve the performance of your jobs.
Here is a sample Scala application. You have your import statement. After the beginning of the
object, you see that the SparkConf is created with the application name. Then, a SparkContext is
created. The several lines of code afterward create the RDD from a text file and then perform the
HdfsTest on it to see how long the iteration through the file takes. Finally, at the end, you stop the
SparkContext by calling the stop() function.
Again, it is a simple example to show how you might create an Apache Spark application. You get
to practice this task in an exercise.
Uempty
You can view the source code of the examples on the Apache Spark website, on GitHub, or within
the Apache Spark distribution itself.
For the full lists of the examples that are available in GitHub, see the following websites:
• Scala:
https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/exam
ples
• Python: https://github.com/apache/spark/tree/master/examples/src/main/python
• Java:
https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/exampl
es
• R: https://github.com/apache/spark/tree/master/examples/src/main/r
• Apache Spark Streaming:
https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/exam
ples/streaming
• Java Streaming:
https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/exampl
es/streaming
Uempty
To run Scala or Java examples, run the run-example script under the Apache Spark “bin” directory.
So, for example, to run the SparkPi application, run run-example SparkPi, where SparkPi is the
name of the application. Substitute that name with a different application name to run that other
application.
To run the sample Python applications, run the spark-submit command and provide the path to the
application.
Uempty
The slide callouts identify the parts of the example: the import statements, the SparkConf and
SparkContext setup, and the transformations and actions.
Here is an example application of using Scala. Similar programs can be written in Python or Java.
The application that is shown here counts the number of lines with 'a' and the number of lines with
‘b’. You must replace YOUR_SPARK_HOME with the directory where Apache Spark is installed.
Unlike the Apache Spark shell, you must initialize the SparkContext in a program. First, you must
create a SparkConf to set up your application's name. Then, you create the SparkContext by
passing in the SparkConf object. Next, you create the RDD by loading in textFile, and then caching
the RDD. Because we apply a couple of transformations to textFile, caching helps speed up the
process, especially if the logData RDD is large. Finally, you get the values of the RDD by running
the count action on it. End the program by printing it out onto the console.
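Because the slide code itself is not reproduced in these notes, here is a minimal Scala sketch of
such an application; the object name and file path are illustrative:
import org.apache.spark.{SparkConf, SparkContext}
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "YOUR_SPARK_HOME/README.md"            // replace with your Spark home directory
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()        // cache because two actions reuse it
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    sc.stop()
  }
}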
Uempty
You should know how to create an Apache Spark application by using any of the supported
programming languages. Now, you get to explore how to run the application:
1. Define the dependencies.
2. Package the application together by using a system build tool, such as Ant, sbt, or Maven.
The examples here show how to do these steps by using various tools. You can use any tool for any of
the programming languages. For Scala, the example uses sbt, so you use a simple.sbt file (a sample
is sketched after the list below). For Java, the example uses Maven, so you use the pom.xml file. For
Python, if your application depends on third-party libraries, you can use the --py-files argument.
This slide shows examples of what a typical directory structure looks like for the tool that you
choose.
After you create and package the JAR file, run spark-submit to run the application:
• Scala: sbt
• Java: mvn
• Python: spark-submit
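As a hedged example for the sbt case, a simple.sbt file might look like the following; the project
name and versions are illustrative and must match your Scala and Apache Spark versions:
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3"
Running sbt package then produces a JAR under target/ that you pass to spark-submit.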
Uempty
6.5. Apache Spark libraries
Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring
Uempty
spark.apache.org
Apache Spark comes with libraries that you can use for specific use cases. These libraries are an
extension of the Apache Spark Core API. Any improvements that are made to the core
automatically take effect with these libraries. One of the significant benefits of Apache Spark is that
there is little overhead to using these libraries with Apache Spark because they are tightly
integrated.
This section is a high-level overview of each of these libraries and their capabilities. The focus is on
Scala, with specific callouts to Java or Python if there are major differences.
The four libraries are Apache Spark SQL, Apache Spark Streaming, MLlib, and GraphX. The
remainder of this section covers these libraries.
Uempty
With Apache Spark SQL, you can write relational queries that are expressed in either SQL, HiveQL,
or Scala to be run by using Apache Spark. Apache Spark SQL has a special RDD called the
SchemaRDD. A SchemaRDD consists of row objects and a schema that describes the type of
data in each column of the row. You can think of this schema as a table in a traditional relational
database.
You create a SchemaRDD from existing RDDs, a Parquet file, a JSON data set, or by using HiveQL
to query against the data that is stored in Hive. Apache Spark SQL is an alpha component, so some
APIs might change in future releases.
Apache Spark SQL supports Scala, Java, R, and Python.
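A minimal Scala sketch, assuming an existing SparkContext sc and a hypothetical people.json file
(in newer releases the SchemaRDD is called a DataFrame):
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("people.json")   // the schema is inferred from the JSON records
people.registerTempTable("people")
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()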
Uempty
The slide diagram shows streaming sources (Apache Kafka, Flume, HDFS/S3, Kinesis, and Twitter)
feeding Apache Spark Streaming, which pushes results to HDFS, databases, and dashboards.
With Apache Spark Streaming, you can process live streaming data in small batches. By using
Apache Spark Core, Apache Spark Streaming is scalable, high-throughput, and fault-tolerant. You
write stream programs with DStreams, which is a sequence of RDDs made from a stream of data.
There are various data sources that Apache Spark Streaming can receive data from, including
Apache Kafka, Flume, HDFS, Kinesis, and Twitter. It pushes data out to HDFS, databases, or dashboards.
Apache Spark Streaming supports Scala, Java, and Python. Python was introduced with Apache
Spark 1.2. Python has all the transformations that Scala and Java have with DStreams, but it can
support only text data types. Support for other sources such as Apache Kafka and Flume are
planned for future releases for Python.
Reference:
https://spark.apache.org/docs/latest/streaming-programming-guide.html
Uempty
Uempty
There are two parameters for a sliding window:
• The window length is the duration of the window.
• The sliding interval is the interval in which the window operation is performed.
Both parameters must be in multiples of the batch interval of the source DStream.
In the second diagram, the window length is 3 and the sliding interval is 2. To put it into perspective,
maybe you want to generate word counts over the last 30 seconds of data every 10 seconds. To do
this task, you apply the reduceByKeyAndWindow operation on the pairs of DStream of (Word,1)
pairs over the last 30 seconds of data.
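A minimal Scala sketch of that windowed word count, assuming a SparkConf named conf and a
hypothetical socket source on port 9999:
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(conf, Seconds(10))              // 10-second batch interval
val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))
val counts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
counts.print()      // word counts over the last 30 seconds, computed every 10 seconds
ssc.start()
ssc.awaitTermination()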
Doing WordCount in this manner is provided as the example program NetworkWordCount, which is
available on GitHub at the following website:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples
/streaming/NetworkWordCount.scala
Reference:
https://spark.apache.org/docs/latest/streaming-programming-guide.html
Uempty
GraphX
• GraphX for graph processing:
ƒ Graphs and graph parallel computation
ƒ Social networks and language modeling
• The goal of GraphX is to optimize the process by making it easier to
view data both as a graph and as collections, such as RDD, without
data movement or duplication.
https://spark.apache.org/docs/latest/graphx-programming-guide.html
GraphX is a library that sits on top of Apache Spark Core. It is basically a graph processing
library that can be used for social networks and language modeling.
Graph data and the requirement for graph-parallel systems are becoming more common, which is
why the GraphX library was developed. Specific scenarios are not efficient if they are processed by
using the data-parallel model, so new graph-parallel systems such as Giraph and GraphLab were
introduced to run graph algorithms much faster than general data-parallel systems.
There are new inherent challenges that come with graph computations, such as constructing the
graph, modifying its structure, or expressing computations that span several graphs. As such, it is
often necessary to move between table and graph views, depending on the objective of the
application and the business requirements.
The goal of GraphX is to optimize the process by making it easier to view data both as a graph and
as collections, such as RDD, without data movement or duplication.
Uempty
6.6. Apache Spark cluster and monitoring
Topics
• Apache Spark overview
• Scala overview
• Resilient Distributed Dataset
• Programming with Apache Spark
• Apache Spark libraries
• Apache Spark cluster and monitoring
Uempty
The slide diagram shows an Apache Spark cluster in which each worker node runs an executor with
its own cache and tasks.
Uempty
There are several aspects of this architecture:
• Each application gets its own executor processes. The executors stay up for the entire duration
that the application is running, so applications are isolated from each other, both on the
scheduling side and because they run in different JVMs. However, you cannot share data across
applications; you must externalize the data if you want to share it between different
applications or instances of SparkContext.
• Apache Spark applications are unaware of the underlying cluster manager. If an application can
acquire executors and the executors can communicate with each other, the application can run on
any cluster manager.
• Because the driver program schedules tasks on the cluster, it should run close to the worker
nodes on the same local network. If you want to send remote requests to the cluster, it is better to
use a remote procedure call (RPC) and have it submit operations from nearby.
• There are three supported cluster managers:
▪ Apache Spark comes with a stand-alone manager.
▪ You can use Apache Mesos, which is a general cluster manager that can run and service
Hadoop jobs.
▪ You can use Hadoop YARN, which is the resource manager in Hadoop 2. In the lab
exercise, you use HDP with YARN to run your Apache Spark applications.
• Metrics can also be used to monitor Apache Spark applications. Metrics are based on the Coda Hale
Metrics Library. You can customize the library so that it reports to various sinks, such as CSV.
You can configure the metrics in the metrics.properties file under the “conf” directory.
• You can use external instruments to monitor Apache Spark. Ganglia is used to view overall
cluster utilization and resource bottlenecks. Various OS profiling tools and JVM utilities can also
be used for monitoring Apache Spark.
Uempty
Unit summary
• Explained the nature and purpose of Apache Spark in the Hadoop
infrastructure.
• Described the architecture and listed the components of the Apache
Spark unified stack.
• Described the role of a Resilient Distributed Dataset (RDD).
• Explained the principles of Apache Spark programming.
• Listed and described the Apache Spark libraries.
• Started and used Apache Spark Scala and Python shells.
• Described Apache Spark Streaming, Apache Spark SQL, MLlib, and
GraphX.
Uempty
Review questions
1. True or False: Ease of use is one of the benefits of Apache
Spark.
2. Which language is supported by Apache Spark?
A. C++
B. C#
C. Java
D. Node.js
3. True or False: Scala is the primary abstraction of Apache
Spark.
4. In RDD actions, which function returns all the elements of
the data set as an array of the driver program?
A. Collect
B. Take
C. Count
D. Reduce
5. True or False: Referencing a data set is one of the methods
to create RDD.
1. True
2. C
3. False
4. A
5. True
Uempty
Review answers
1. True or False: Ease of use is one of the benefits of using
Apache Spark.
2. Which language is supported by Apache Spark?
A. C++
B. C#
C. Java
D. Node.js
3. True or False: Scala is the primary abstraction of Apache
Spark.
4. In RDD actions, which function returns all the elements of
the data set as an array of the driver program?
A. Collect
B. Take
C. Count
D. Reduce
5. True or False: Referencing a data set is one of the methods
to create RDD.
Exercise objectives
• In this exercise, you explore some Spark 2 client program
examples and learn how to run them. You gain experience
with the fundamental aspects of running Spark in the HDP
environment.
• After completing this exercise, you should be able to do the
following tasks:
ƒ Browse files and folders in HDFS.
ƒ Work with Apache Spark RDD with Python.
Uempty
Overview
In this unit, you learn about how to efficiently store and query data.
Uempty
Unit objectives
• List the characteristics of representative data file formats, including flat
text files, CSV, XML, JSON, and YAML.
• List the characteristics of the four types of NoSQL data stores.
• Explain the storage that is used by HBase in some detail.
• Describe Apache Pig.
• Describe Apache Hive.
• List the characteristics of programming languages that are typically
used by data scientists: R and Python.
Uempty
7.1. Introduction to data and file formats
Topics
• Introduction to data and file formats
• Introduction to HBase
• Programming for the Hadoop framework
• Introduction to Apache Pig
• Introduction to Apache Hive
• Languages that are used by data scientists: R and Python
Uempty
Introduction to data
• "Data are values of qualitative or quantitative variables, belonging to a
set of items" - Wikipedia
• Common data representation formats that are used for big data include:
ƒ Row- or record-based encodings:
í Flat files / text files
í CSV and delimited files
í Avro / SequenceFile
í JSON
í Other formats: XML and YAML
ƒ Column-based storage formats:
í RC / ORC file
í Apache Parquet
ƒ NoSQL data stores
• Compression of data
Storing petabytes of data in Hadoop is relatively easy, but it is more important to choose an efficient
storage format for faster querying.
Row-based encodings (Text, Avro, and JSON) with a general-purpose compression library (gzip,
LZO, CMX, or Snappy) are common mainly for interoperability reasons, but column-based storage
formats (Apache Parquet and ORC) provide faster query execution by minimizing IO and great
compression.
Compression is important to big data file storage:
• Compression reduces file sizes, which speeds up transferring data to and from disk.
• It is generally faster to transfer a small compressed file and then decompress it than to transfer a
larger uncompressed file.
Uempty
Often, the data that is gathered (raw data) must be seriously processed and even converted or
transformed either before or while loading into HDFS.
There is no settled terminology for the set of activities between acquiring and modeling data. You
see the phrase “data preparation” that is used to describe these activities. Data preparation seeks
to turn newly acquired raw data into clean data that can be analyzed and modeled in a meaningful
way. This phase of the data science workflow, and subsets of it, are variously labeled munging,
wrangling, reduction, and cleansing. You can use the various terms, although some of them are
often classified as jargon.
Data munging or data wrangling is loosely the process of manually converting or mapping data from
one raw form into another format that allows for more convenient consumption of the data with the
help of semi-automated tools. This process might include further munging, data visualization, data
aggregation, training a statistical model, and many other potential uses.
Uempty
Data munging as a process typically follows a set of general steps that begin with extracting the
data in a raw form from the data source, munging the raw data by using algorithms (like sorting) or
parsing the data into predefined data structures, and depositing the resulting content into a data
sink for storage and future use. With the rapid growth of the internet, such techniques become
increasingly important in the organization of the growing amounts of data available.
In the world of data warehousing, extract, transform, load (ETL) is common, but here the process is
often extract, load, and transform (ELT).
References:
• Exploratory data mining and data cleaning:
https://goo.gl/nIoSvj
• “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”:
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janit
or-work.html
Uempty
The slide diagram shows data sources such as CRM systems and website traffic being loaded into
Hadoop for OLAP analysis and reporting.
• Flat files or text files might need to be parsed into fields and attributes:
ƒ Fields might be positional at a fixed offset from the beginning of the record.
ƒ Text analytics might be required to extract meaning.
Doug Cutting (one of the original developers of Hadoop) answered the question, “What are the
advantages of Avro's object container file format over the SequenceFile container format?”
(Source:
http://www.quora.com/What-are-the-advantages-of-Avros-object-container-file-format-over-the-Seq
uenceFile-container-format)
Two primary reasons:
• Language independence. The SequenceFile container and each writable implementation that is
stored in it are implemented only in Java. There is no format specification independent of the
Java implementation. Avro data files have a language-independent specification and are
implemented in C, Java, Ruby, Python, and PHP. A Python-based application can directly read
and write Avro data files.
Avro's language-independence is not yet a huge advantage in MapReduce programming
because MapReduce programs for complex data structures are generally written in Java. But,
after you implement Avro tethers for other languages (such as Hadoop Pipes for Avro (for more
information, see http://s.apache.org/ZOw)), then it is possible to write efficient mappers and
reducers in C, C++, Python, and Ruby that operate on complex data structures.
Language independence can be an advantage if you want to create or access data outside of
MapReduce programs from non-Java applications. Moreover, as the Hortonworks Data
Platform expands, Hortonworks wants to include more non-Java applications and interchange
data with these applications, so establishing a standard, language-independent data format for
this platform is a priority.
• Versioning. If a writable class changes, fields are added or removed, the type of a field is
changed, or the class is renamed, then data is usually unreadable. A writable implementation
can explicitly manage versioning by writing a version number with each instance and handling
older versions at read-time. This situation is rare, but even then, it does not permit forward
compatibility (old code reading a newer version) or branched versions. Avro automatically
handles field addition and removal, compatibility with later and earlier versions, branched
versioning, and renaming, all largely without any awareness by an application.
The versioning advantages are available today for Avro MapReduce applications.
Uempty
JSON is an open standard format that uses human-readable text to transmit data objects consisting
of attribute-value pairs. It is used primarily to transmit data between a server and web application as
an alternative to XML. Although originally derived from the JavaScript scripting language, JSON is
a language-independent data format. Code for parsing and generating JSON data is readily
available in many programming languages.
The primary reference site (http://www.json.org) describes JSON as built on two structures:
• A collection of name-value pairs. In various languages, this structure is realized as an object,
record, structure, dictionary, hash table, keyed list, or associative array.
• An ordered list of values. In most languages, this structure is realized as an array, vector, list, or
sequence.
These structures are universal data structures with great flexibility in practice. Virtually all modern
programming languages support them in one form or another. It makes sense that a data format
that is interchangeable between programming languages is also based on these structures.
The two basic data structures of JSON also are described as dictionaries (maps) and lists (arrays).
JSON treats an object as a dictionary where attribute names are used as keys into the map.
• Dictionaries are defined in a way that might be familiar to anyone who has initialized a Python
dict with some values (or has printed the contents of a dict): There are pairs of keys and values,
which are separated by a ":", with each key-value pair delimited by a ",", and each entire
object-record surrounded by "{}".
• Lists are also represented by using Python-like syntax, which is a sequence of values that is
separated by ",", and surrounded by "[ ]".
These two data structures can be arbitrarily nested, for example, a dictionary that contains a list of
dictionaries.
Additionally, individual attributes can be text strings that are surrounded by double quotation marks
(" "), numbers, true/false, or null. There is no native support for a “set” data structure. Typically, a
set is transformed into a list when an object is written to JSON, which is the input into a set when it
is consumed, for example, in Python:
some_set = set [a list, here]
Quotation marks in text fields are written like \". When JSON objects are inserted into a file, by
convention they are typically written one per line.
Two examples of JSON:
• First one:
{"id":1, "name":"josh-shop", "listings":[1, 2, 3]}
{"id":2, "name":"provost", "listings":[4, 5, 6]}
• Second one (Source: http://en.wikipedia.org/wiki/JSON):
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "office", "number": "646 555-4567" }
  ],
  "children": [],
  "spouse": null
}
Uempty
The Python standard library includes a JSON package, which is useful for reading a raw JSON
string into a dictionary. However, transforming that map into an object and writing out an arbitrary
object into JSON might require extra programming.
The Optimized Row Columnar (ORC) file format was developed to support Apache Hive. It improves
on the Record Columnar (RC) format by providing optimized, more efficient storage: data is written
in 250 MB stripes of raw column data, each with a stripe footer, and the format supports:
ƒ Skipping of blocks of rows.
ƒ Basic statistics on columns.
ƒ Larger default blocksize (256 MB).
Figure 7-11. Record Columnar File and Optimized Row Columnar file formats
ORC goes beyond RC and uses specific encoders for different column data types to improve
compression further, for example, variable length compression on integers. ORC introduces a
lightweight indexing that enables skipping of blocks of rows that do not match a query. It comes with
basic statistics such as min, max, sum, and count, on columns. A larger block size of 256 MB by
default optimizes large sequential reads on HDFS for more throughput and fewer files to reduce the
load on the NameNode.
Uempty
Apache Parquet
• "Apache Parquet is a columnar storage format available to any project
in the Hadoop infrastructure, regardless of the choice of data
processing framework, data model, or programming language.”
(Source: http://parquet.apache.org)
• Compressed and efficient columnar storage that was developed by
Cloudera and Twitter:
ƒ Efficiently encodes nested structures and sparsely populated data based on
the Google BigTable, BigQuery, and Dremel definition and repetition levels.
ƒ Allows compression schemes to be specified on a per-column level.
ƒ Developed to allow more encoding schemes to be added as they are
invented and implemented.
• Provides some of the best results in various benchmark and
performance tests.
For more information about Apache Parquet, which is an efficient, general-purpose columnar file
format for Apache Hadoop, see the official announcement from Cloudera and Twitter:
http://www.dataarchitect.cloud/introducing-parquet-efficient-columnar-storage-for-apache-hadoop/
Apache Parquet brings efficient columnar storage to Hadoop. Compared to, and learning from, the
initial work that was done toward this goal in Trevni, Apache Parquet includes the following
enhancements:
• Efficiently encode nested structures and sparsely populated data based on the Google Dremel
definition and repetition levels.
• Provide extensible support for per-column encodings (such as delta and run length).
• Provide extensibility of storing multiple types of data in column data (such as indexes, bloom
filters, and statistics).
• Offer better write performance by storing metadata at the end of the file.
• A new columnar storage format was introduced for Hadoop that is called Apache Parquet,
which started as a joint project between Twitter and Cloudera engineers.
• Apache Parquet was created to make the advantages of compressed and efficient columnar
data representation available to any project in the Hadoop infrastructure, regardless of the
choice of data processing framework, data model, or programming language.
• Apache Parquet is built from the ground up with complex nested data structures in mind, which
is an efficient method of encoding data in non-trivial object schemas.
• Apache Parquet is built to support efficient compression and encoding schemes. Apache
Parquet allows compression schemes to be specified on a per-column level and was developed
to allow adding more encoding schemes as they are invented and implemented. The concepts
of encoding and compression are separated, allowing Apache Parquet users to implement
operators that work directly on encoded data without paying a decompression and decoding
penalty when possible.
• Apache Parquet is built to be used by anyone. The Hadoop infrastructure is rich with data
processing frameworks. An efficient, well-implemented columnar storage substrate should be
useful to all frameworks without the cost of extensive and difficult to set up dependencies.
• The initial code defines the file format, provides Java building blocks for processing columnar
data, and implements Hadoop input/output formats, Apache Pig Storers/Loaders, and as an
example of a complex integration, input/output formats that can convert Apache Parquet-stored
data directly to and from Thrift objects.
References:
• An Inside Look at Google BigQuery:
https://cloud.google.com/files/BigQueryTechnicalWP.pdf
• Dremel: Interactive Analysis of Web-Scale Datasets:
http://research.google.com/pubs/pub36632.html
Uempty
NoSQL
• NoSQL, also known as "Not only SQL" or "Non-relational" was
introduced to handle the rise in data types, data access, and data
availability needs that were brought on by the dot.com boom.
• It is generally agreed that there are four types of NoSQL data stores:
ƒ Key-value stores
ƒ Graph stores
ƒ Column stores
ƒ Document stores
• Why consider NoSQL?
ƒ Flexibility
ƒ Scalability (scales horizontally rather than vertically)
ƒ Availability
ƒ Lower operational costs
ƒ Specialized capabilities
Uempty
NoSQL arose from big data before it was called "big data". As shown in the slide, people used big
data ideas in different ways to create many of the NoSQL databases. For example, Apache Hadoop
borrows from the Google MapReduce white paper and the Google File System white paper, and
HBase borrows from Apache Hadoop and the Google BigTable white paper. Other NoSQL databases
were developed independently, such as MongoDB.
The color coding in the slide highlights the NoSQL technologies, which are divided into analytic
solutions, such as the Apache Hadoop framework and Apache Cassandra, and operational
databases, such as CouchDB, MongoDB, and Riak. Analytic solutions are useful for running ad hoc
queries in business intelligence (BI) and data warehousing applications. Operational databases are
useful for handling high numbers of concurrent user transactions.
Reference:
Exploring the NoSQL Family Tree:
https://www.ibmbigdatahub.com/blog/exploring-nosql-family-tree
Uempty
Why NoSQL?
• A cost-effective technology is needed to handle new volumes of data (petabytes to zettabytes).
• Increased data volumes lead to RDBMS sharding.
• Flexible data models are needed to support big data applications.
So why consider NoSQL technology? This slide presents three key reasons:
• Massive data sets exhaust the capacity and scale of existing RDBMSs. Buying more licenses
and adding more CPU, RAM, and disk is expensive and not linear in cost. Many companies and
organizations also want to leverage more cost-effective commodity systems and open source
technologies. NoSQL technology that is deployed on commodity high-volume servers can solve
these problems.
• Distributing the RDBMS is operationally challenging and often technically impossible. The
architecture breaks down when sharding is implemented on a large scale. Denormalization of
the data model, joins, referential integrity, and rebalancing are common issues.
• Unstructured data (such as social media data like Twitter and Facebook, and email) and
semi-structured data (such as application logs and security logs) do not fit the traditional model
of schema-on-ingest. Typically, the schema is developed after ingesting and analysis.
Unstructured and semi-structured data generates a variable number of fields and variable data
content, so they are problematic for the data architect when they design the database. There
might be many NULL fields (sparse data), or the number and type of fields might be variable.
Uempty
More considerations:
• In this new age of big data, most or all these challenges are typical, so as a result the NoSQL
market is growing rapidly. Traditional RDBMS technologies and high-end server platforms often
exceed budgets. Organizations want to leverage commodity high-volume servers. Elastic
scale-out is needed to handle new volumes of data (sensors, log files, social media data, and
other data) and increased retention requirements.
• Sharding is not cost-effective in the age of big data. Sharding creates architectural issues (such
as joins and denormalization, referential integrity, and challenges with rebalancing).
• New applications require a flexible schema. Records can be sparse (for example, social media
data is variable). Schema cannot always be designed up front.
• Increased complexity of SQL.
• Sharding introduces complexity.
• Single points of failure.
• Failover servers are more complex.
• Backups are more complex.
• Operational complexity is added.
Uempty
7.2. Introduction to HBase
Topics
• Introduction to data and file formats
• Introduction to HBase
• Programming for the Hadoop framework
• Introduction to Apache Pig
• Introduction to Apache Hive
• Languages that are used by data scientists: R and Python
Uempty
HBase
• An implementation of Google BigTable. A BigTable is a sparse,
distributed, and persistent multidimensional sorted map.
• An open source Apache top-level project that is embraced and
supported by IBM and all leading Hadoop distributions.
• Powers some of the leading sites on the web, such as Facebook and
Yahoo. For more information, see the following website:
http://hbase.apache.org/poweredbyhbase.html
• It is a NoSQL data store, that is, a column data store.
In 2004, Google began developing a distributed storage system for structured data that is called
BigTable. Google engineers designed a system for storing big data that can scale to petabytes by
leveraging commodity servers. Projects at Google like Google Earth, web indexing, and Google
Finance required a new cost-effective, robust, and scalable system that a traditional RDBMS was
incapable of supporting. In November 2006, Google released a white paper describing BigTable:
Bigtable: A Distributed Storage System for Structured Data
http://research.google.com/archive/bigtable-osdi06.pdf
In 2008, HBase was released as an open source Apache top-level project
(http://hbase.apache.org) that is now the Hadoop database. HBase powers some of the leading
sites on the web, which you can learn about at the following website: Powered By Apache HBase
http://hbase.apache.org/poweredbyhbase.html.
For more information about HBase, see the Apache HBase Reference Guide at
https://hbase.apache.org/book.html.
Uempty
Why HBase?
• Highly scalable:
ƒ Automatic partitioning (sharding)
ƒ Scales linearly and automatically with new nodes
• Low latency:
ƒ Supports random read/write and small range scan
• Highly available
• Strong consistency
• Good for "sparse data" (no fixed columns)
HBase is considered "the Hadoop database“, and it is bundled with supported Apache Hadoop
distributions like Hortonworks Data Platform. If you need high-performance random read/write
access to your big data, you are probably going to use HBase on your Hadoop cluster. HBase
users can leverage the MapReduce model and other powerful features that are included with
Apache Hadoop.
The HDP strategy for Apache Hadoop is to embrace and extend the technology with powerful
advanced analytics, development tools, performance and availability enhancements, and security
and manageability. As a key component of Hadoop, HBase is part of this strategy with strong
support and a solid roadmap going forward.
When the requirements fit, HBase can replace certain costly RDBMSs.
HBase handles sharding seamlessly and automatically and benefits from the non-disruptive
horizontal scaling feature of Hadoop. When more capacity or performance is needed, users add
data nodes to the Hadoop cluster, which provides immediate growth to HBase data stores because
HBase uses the HDFS. Users can easily scale from terabytes to petabytes as their capacity needs
increase.
Uempty
HBase supports a flexible and dynamic data model. The schema does not need to be defined
upfront, which makes HBase a natural fit for many big data applications and some traditional
applications.
HDFS does not naturally support applications requiring random read/write capability. HDFS was
designed for large sequential batch operations (for example, write once with many large sequential
reads during analysis). HBase supports high-performance random read/write applications, which is
why it is often leveraged in Hadoop applications.
Cognitive Class has a free course on HBase at the following website:
Using HBase for Real-time Access to your Big Data
https://cognitiveclass.ai/courses/using-hbase-for-real-time-access-to-your-big-data
Uempty
Figure 7-20. HBase and atomicity, consistency, isolation, and durability (ACID) properties
A frequently asked HBase question is “How does HBase adhere to ACID properties?”
The HBase community has a website (http://hbase.apache.org/acid-semantics.html) about this
matter, which is summarized on this slide.
When strict ACID properties are required:
• HBase provides strict row-level atomicity.
• There is no further guarantee or transactional feature that spans multiple rows or across tables.
For more information, see the Indexed-Transactional HBase project at the following website:
https://github.com/hbase-trx/hbase-transactional-tableindexed
HBase and other NoSQL distributed data stores are subject to the CAP Theorem, which states that
distributed NoSQL data stores can achieve only two out of the three properties: consistency,
availability, and partition tolerance. (Source:
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html).
Regarding concurrency of writes and mutations, HBase automatically gets a lock before a write,
and releases that lock after the write. Also, the user can control the locking manually.
When researching HBase, you find documentation that describes HBase as a column-oriented data
store, which it is. However, this description can lead to confusion and wrong impressions when you
try to picture the spreadsheet or traditional RDBMS table model.
HBase is more accurately defined as a "multidimensional sorted map", as shown in the BigTable
specification by Google.
Reference:
Associative array:
http://en.wikipedia.org/wiki/Associative_array.
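To make the "multidimensional sorted map" idea concrete, here is a small sketch in plain Python (nested dictionaries, not HBase itself; the row keys, timestamps, and values are invented for illustration) that models the HBTABLE example as maps keyed by row key, column family, column qualifier, and timestamp:
# A plain-Python model (not HBase itself) of a table as a multidimensional sorted map:
# row key -> column family -> column qualifier -> timestamp -> value.
hbtable = {
    b"01234": {
        "cf_data": {"cq_name": {1638316800: b"Alice"},
                    "cq_val":  {1638316800: b"42"}},
        "cf_info": {"cq_desc": {1638316800: b"first row"}},
    },
    b"01235": {
        # Two time-stamped versions of the same cell (the "third dimension").
        "cf_data": {"cq_name": {1638316800: b"Bob",
                                1638403200: b"Bobby"}},
    },
}

# A lookup is a walk down the map dimensions; columns that were never written
# simply do not exist (sparse data, no NULL placeholders).
versions = hbtable[b"01235"]["cf_data"]["cq_name"]
print(versions[max(versions)])   # b'Bobby' (the newest version)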
Uempty
This slide covers the logical representation of an HBase table. A table is made of column families,
which are the logical and physical grouping of columns.
Each column value is identified by a key. The row key is the implicit primary key. It is used to sort
rows.
The table that is shown in the slide (HBTABLE) has two column families: cf_data and cf_info.
cf_data has two columns with the qualifiers cq_name and cq_val. A column in HBase is referred to
by using family:qualifier. The cf_info column family has only one column: cq_desc.
The green boxes show how column families also provide physical separation. The columns in the
cf_data family are stored separately from columns in the cf_info family. Remember this information
when designing the layout of an HBase table. If you have data that is not often queried, it is better to
assign it to a separate column family.
Uempty
Column family
• A column family is the basic storage unit. Columns in the same family
should have similar properties and similar size characteristics.
• Configurable by column family:
ƒ Multiple time-stamped versions, which act like a third dimension added to the
table
ƒ Compression (none, gzip, LZO, SNAPPY)
ƒ Version retention policies (Time To Live (TTL))
• A column is named by using the following syntax:
family:qualifier
"Column keys are grouped into sets that are called column families, which form
the basic unit of access control" - Google BigTable paper
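As a hedged illustration of these per-family settings (an assumption for this sketch: the third-party Python client happybase is installed and an HBase Thrift server is reachable on localhost), a table with the two families from the HBTABLE example could be created like this:
import happybase  # third-party HBase Thrift client (assumed installed)

# Assumes an HBase Thrift server is reachable on localhost (default port 9090).
connection = happybase.Connection("localhost")

# Per-column-family settings: retained versions, compression, and TTL in seconds.
connection.create_table(
    "HBTABLE",
    {
        "cf_data": dict(max_versions=3, compression="SNAPPY"),
        "cf_info": dict(max_versions=1, time_to_live=86400),
    },
)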
Uempty
Data layout:
• HBase: A sparse, distributed, and persistent multidimensional sorted map.
• RDBMS: Row- or column-oriented.
Uempty
Using the classic RDBMS table that is shown in the slide as a reference, you see what it looks like
in HBase over the next few slides.
Uempty
HBase is good for sparse data because non-existent columns are ignored and there are no
nulls.
This example table can be implemented logically in HBase as shown in this slide.
The timestamp data that is pointed to by row key 01235 makes the HBase view multidimensional.
Uempty
Although the physical cell layout in HBase looks something like what is shown in the slide, there are
more details to the physical layout, which are described further in this presentation (such as how
data is stored in Apache Hadoop HDFS).
Uempty
A detailed schema in the RDBMS sense does not need to be defined upfront in HBase. You need to
define only column families because they impact physical on-disk storage.
This slide illustrates why HBase is called a key-value pair data store.
Varying the granularity of the key impacts retrieval performance and cardinality when querying
HBase for a value.
Data types are converted to and from the raw byte array format that HBase supports natively.
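A hedged sketch of this key-value access pattern, again assuming the third-party happybase client and the HBTABLE table from the earlier example: columns are addressed as family:qualifier, and every value is written and read as a raw byte array that the application converts itself.
import happybase  # third-party HBase Thrift client (assumed installed)

connection = happybase.Connection("localhost")   # assumes a local HBase Thrift server
table = connection.table("HBTABLE")

# Columns are addressed as family:qualifier; values are raw byte arrays.
table.put(b"01235", {
    b"cf_data:cq_name": b"Bob Smith",
    b"cf_data:cq_val": str(42).encode("utf-8"),
})

row = table.row(b"01235")                               # dict of column -> latest value
print(int(row[b"cf_data:cq_val"].decode("utf-8")))      # 42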
Uempty
Indexes in HBase
• All table accesses are done by using the table row key, which is effectively the
primary key.
• Rows are stored in byte-lexicographical order.
• Within a row, columns are stored in sorted order, with the result that it is
fast, cheap, and easy to scan adjacent rows and columns. Partial key
lookups also are possible.
• HBase does not support indexes natively. Instead, a table can be
created that serves the same purpose.
• HBase supports "Bloom filters", which are used to decide quickly
whether a particular row and column combination exists in the store file,
reducing I/O and access time.
• Secondary indexes are not supported.
This slide explains how data in HBase is sorted and can be searched and indexed.
The sorting within tables makes adjacent queries and scans more efficient.
HBase does not support indexes natively, but tables can be created to serve the same purpose.
Bloom filters can be used to reduce I/Os and lookup time. For more information about Bloom filters,
see http://en.wikipedia.org/wiki/Bloom_filter.
For more information about Bloom filter usage in HBase, see George, L., HBase: The Definitive
Guide. Sebastopol, CA: O'Reilly Media, 2011. 1449396100.
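To illustrate the idea behind a Bloom filter (a simplified educational sketch in Python, not HBase's implementation), the following code answers "definitely absent" or "possibly present" for a row and column key:
import hashlib

class BloomFilter:
    """Simplified Bloom filter: k hash functions set bits in an m-bit vector."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0            # a Python int used as the bit vector

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present"
        # (false positives are allowed, false negatives are not).
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add(b"01235/cf_data:cq_name")
print(bf.might_contain(b"01235/cf_data:cq_name"))   # True
print(bf.might_contain(b"99999/cf_data:cq_name"))   # almost certainly False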
Uempty
7.3. Programming for the Hadoop framework
Uempty
Uempty
Topics
• Introduction to data and file formats
• Introduction to HBase
• Programming for the Hadoop framework
• Introduction to Apache Pig
• Introduction to Apache Hive
• Languages that are used by data scientists: R and Python
Uempty
Figure: the YARN-based Hadoop stack.
• API layer: MapReduce, Apache Pig, Apache Hive, HBase, Giraph (graph processing), MPI (message passing), and Storm (streaming data).
• Processing frameworks: MapReduce v2, Tez, and Hoya.
• Resource management: YARN.
• Distributed storage: HDFS.
Hadoop v1 has only one processing framework, which is MapReduce (batch processing). In the
YARN architecture, the processing layer is separated from the resource management layer so that
the data that is stored in HDFS can be processed and run by various data processing engines,
such as stream processing, interactive processing, graph processing, and MapReduce (batch
processing). Thus, the efficiency of the system is increased.
There are many open source tools in the Hadoop infrastructure that you can use to process data in
Hadoop (the API layer), such as Apache Pig, Apache Hive, HBase, Giraph, MPI, and Apache
Storm.
Uempty
Apache Pig:
• Not suitable for ad hoc queries, but happy to do the grunt work.
• Reads data in many file formats and databases.
• A compiler converts Pig Latin into sequences of MapReduce programs.
• Recommended for people who are familiar with scripting languages like Python.
Apache Hive:
• Good for ad hoc analysis; leverages users' SQL expertise.
• Uses a SerDe (serialization/deserialization) interface to read data from a table and write it back
out in any custom format. Many standard SerDes are available, and you can write your own for
custom formats.
• Recommended for people who are familiar with SQL.
Figure 7-34. Open-source programming languages: Apache Pig and Apache Hive
Uempty
7.4. Introduction to Apache Pig
Uempty
Uempty
Topics
• Introduction to data and file formats
• Introduction to HBase
• Programming for the Hadoop framework
• Introduction to Apache Pig
• Introduction to Apache Hive
• Languages that are used by data scientists: R and Python
Uempty
Apache Pig
• Apache Pig runs in two modes:
ƒ Local mode: On a single machine without requirements for HDFS.
ƒ MapReduce/Hadoop mode: Runs on an HDFS cluster with the Apache Pig script
that is converted to a MapReduce job.
• When Apache Pig runs in an interactive shell, the prompt is grunt>.
• Apache Pig scripts have, by convention, a suffix of .pig.
• Apache Pig scripts are written in the language Pig Latin.
Figure: a client (the interactive grunt> shell in a Linux terminal, or an embedded client) submits Pig Latin scripts to Apache Pig.
Apache Pig is built on top of a general-purpose processing framework that users can use to
process data by using a higher-level abstraction.
Uempty
In SQL, users can specify that data from two tables must be joined, but not what join
implementation to use. But, with some RDBMS systems, extensions ("query hints") are available
outside the official SQL query language to allow the implementation of queries and the type of joins
to be performed on a single statement basis.
With Pig Latin, users can specify an implementation or aspects of an implementation to be used in
running a script in several ways.
Pig Latin programming is like specifying a query execution plan.
Uempty
At its core, Pig Latin is a data flow language where you define a data stream and a series of
transformations that are applied to the data as it flows through the application.
Pig Latin contrasts with a control flow language (such as C or Java) where you write a series of
instructions. In control flow languages, you use constructs such as loops and conditional logic (if
and case statements). There are no loops and no if statements in Pig Latin.
Uempty
-- Extract words from each line and put them into an Apache Pig bag,
-- and then flatten the bag to get one word for each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(LINE)) AS word;
Uempty
7.5. Introduction to Apache Hive
Uempty
Uempty
Topics
• Introduction to data and file formats
• Introduction to HBase
• Programming for the Hadoop framework
• Introduction to Apache Pig
• Introduction to Apache Hive
• Languages that are used by data scientists: R and Python
Uempty
The Apache Hive data warehouse software facilitates querying and managing large data sets that
are in distributed storage. Built on top of Apache Hadoop, it provides tools to enable easy data
extract, transform, and load (ETL):
• A mechanism to impose structure on various data formats.
• Access to files that are stored either directly in Apache HDFS or in other data storage systems,
such as Apache HBase.
• Query execution by using MapReduce.
Apache Hive defines a simple SQL-like query language that is called HiveQL, which enables users
who are familiar with SQL to query the data. Concurrently, this language also enables programmers
who are familiar with the MapReduce framework to plug in their custom mappers and reducers to
perform more sophisticated analysis that might not be supported by the built-in capabilities of the
language. HiveQL can also be extended with custom scalar user-defined functions (UDFs),
user-defined aggregations (UDAFs), and user-defined table functions (UDTFs).
Uempty
Apache Hive does not require that read or written data be in the "Apache Hive format" because
there is no such thing. Apache Hive works equally well on Thrift, control delimited, or specialized
data formats.
Apache Hive is not designed for online transactional processing (OLTP) workloads and does not
offer real-time queries or row-level updates. It is best used for batch jobs over large sets of
append-only data (like web logs). What Apache Hive values most is scalability (scale out with more
machines added dynamically to the Hadoop cluster), extensibility (with MapReduce framework and
UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with its input formats.
Components of Apache Hive include HCatalog and WebHCat:
• HCatalog is a component of Apache Hive. It is a table and storage management layer for
Hadoop that enables users with different data processing tools, including Apache Pig and
MapReduce, to more easily read and write data on the grid.
• WebHCat provides a service that you can use to run Hadoop MapReduce (or YARN), Apache
Pig, or Apache Hive jobs, or perform Apache Hive metadata operations by using an HTTP
(REST style) interface.
Uempty
Uempty
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // ...
}

Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
Uempty
Source: Capriolo, E., et al., Programming Hive: Data Warehouse and Query Language for Hadoop,
1st Edition. Sebastopol, CA: O'Reilly Media, 2012. 1449319335
Uempty
Figure: major Apache Hive components: applications; query execution through the CLI, the Apache Hive Web Interface, and Apache Hive Server; and metadata that is managed by the metastore and metastore driver (with JobConf and configuration).
The slide shows the major components that you might deal with when working with Apache Hive.
Uempty
$ $HIVE_HOME/bin/hive
2013-01-14 23:36:52.153 GMT : Connection obtained for host: master-
Logging initialized using configuration in file:/etc/hive/conf/hive-
Uempty
Creating a table
• Creating a delimited table:
hive> create table users
      (
        id int,
        office_id int,
        name string,
        children array<string>
      )
      row format delimited
      fields terminated by '|'
      collection items terminated by ':'
      stored as textfile;
file: users.dat
1|1|Bob Smith|Mary
2|1|Frank Barney|James:Liz:Karen
3|2|Ellen Lacy|Randy:Martin
4|3|Jake Gray|
5|4|Sally Fields|John:Fred:Sue:Hank:Robert
• Inspecting tables:
hive> show tables;
OK
users
Time taken: 2.542 seconds
Uempty
$ hive \
--auxpath \
$HIVE_SRC/build/dist/lib/hive-hbase-handler-0.9.0.jar,\
$HIVE_SRC/build/dist/lib/hbase-0.92.0.jar,\
$HIVE_SRC/build/dist/lib/zookeeper-3.3.4.jar,\
$HIVE_SRC/build/dist/lib/guava-r09.jar \
-hiveconf hbase.master=hbase.yoyodyne.com:60000
Uempty
Figure: mapping an Apache Hive table to an HBase table. The Hive table hbase_table_1 has the columns key, value1, value2, and value3 (for example, the row 15, "fred", 357, 94837). It maps to the HBase table MY_TABLE, where the row key is "15", column family a holds columns b and c ("fred" and "357"), and column family d holds column e (0x17275).
Uempty
Reference:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Overview
Uempty
Figure: HiveServer2 (HS2) architecture. Remote client interfaces (Thrift, JDBC, and ODBC clients, including the Beeline CLI) connect to the HS2 Thrift service, which contains the CLI, driver, and metastore and runs queries on MapReduce/YARN over HDFS.
Uempty
Beeline CLI
• It is supported by HS2.
• It is a JDBC client that is based on the SQLLine CLI.
• It works in both embedded mode and remote mode:
• Embedded mode:
Runs embedded Apache Hive (like Apache Hive CLI).
• Remote mode:
Connects to a separate HS2 process over thrift (JDBC client).
Recommended for production use because it is more secure and does
not require direct HDFS or metastore access to be granted to users.
Reference:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Beeline
%E2%80%93CommandLineShell
Uempty
% bin/beeline
Hive version 0.11.0-SNAPSHOT by Apache
beeline> !connect jdbc:hive2://localhost:10000 scott tiger
!connect jdbc:hive2://localhost:10000 scott tiger
Connecting to jdbc:hive2://localhost:10000
Connected to: Hive (version 0.10.0)
Driver: Hive (version 0.10.0-SNAPSHOT)
Transaction isolation: TRANSACTION_REPEATABLE_READ
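Beeline is not the only HS2 client: the HiveServer2 Thrift interface can also be reached programmatically. The following is a minimal sketch that assumes the third-party PyHive package is installed and that HS2 is listening on localhost:10000, as in the Beeline example:
from pyhive import hive  # third-party PyHive package (assumed installed)

# Connect to HiveServer2 over Thrift, as in the Beeline example (localhost:10000).
connection = hive.Connection(host="localhost", port=10000, username="scott")
cursor = connection.cursor()

cursor.execute("SHOW TABLES")
for (table_name,) in cursor.fetchall():
    print(table_name)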
Uempty
Uempty
Uempty
CREATE EXTERNAL TABLE pview (userid int, pageid int)
PARTITIONED BY (ds string, ctry string)
STORED AS textfile
LOCATION '/path/to/existing/table';
Uempty
Physical layout
• Apache Hive warehouse directory structure.
.../hive/warehouse/
  db1.db/              <- Databases (schemas) are contained in ".db" subdirectories
    tab1/              <- Table "tab1"
      tab1.dat
    part_tab1/         <- Table partitioned by the "state" column
      state=NJ/        <- One subdirectory per unique value
        part_tab1.dat
      state=CA/        <- Query predicates eliminate partitions (directories) that need to be read
        part_tab1.dat
• Data files are regular HDFS files. The internal format can vary from
table to table (delimited, sequence, and other formats).
• Supports external tables.
Uempty
7.6. Languages that are used by data
scientists: R and Python
Uempty
Figure 7-60. Languages that are used by data scientists: R and Python
Uempty
Topics
• Introduction to data and file formats
• Introduction to HBase
• Programming for the Hadoop framework
• Introduction to Apache Pig
• Introduction to Apache Hive
• Languages that are used by data scientists: R and Python
Uempty
R:
• An interactive environment for doing statistics; it has a programming language, rather than being a programming language.
• A rich set of libraries, graphic and otherwise, that are suitable for data science.
• Better if the need is to perform data analysis.
• Focuses on better data analysis, statistics, and data models.
• More adoption from researchers, data scientists, statisticians, and mathematicians.
• Active user communities.
• The standard R library plus many more libraries, where statistical algorithms often appear first.
Python:
• A real programming language.
• Lacks some of R's richness for data analytics, but said to be closing the gap.
• Better for more generic programming tasks (for example, workflow control of a computer model).
• Emphasizes productivity and code readability.
• More adoption from developers and programmers.
• Active user communities.
• Python, numpy, scipy, scikit, Django, and Pandas.
Figure 7-62. Languages that are used by data scientists: R and Python
Both languages are excellent, but they have their individual strengths. Over time, you probably
need both.
One approach, if you know Python already, is to use it as your first tool. When you find Python
lacking, learn enough R to do what you want, and then either:
• Write scripts in R and run them from Python by using the subprocess module.
• Install the RPy module.
Use R for plotting things and use Python for the heavy lifting.
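As a minimal sketch of the first option (the script name summary.R and its input file are hypothetical, and Rscript is assumed to be installed and on the PATH), R can be driven from Python with the standard subprocess module:
import subprocess

# Run a hypothetical R script (summary.R) on a hypothetical input file and
# capture whatever it prints; Rscript must be installed and on the PATH.
result = subprocess.run(
    ["Rscript", "summary.R", "sales.csv"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)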
Sometimes, the tool that you know or that is easy to learn is far more likely to win than the
powerful-but-complex tool that is out of your reach.
In the 2015 KDNuggets poll of top languages for data analytics, data mining, and data science, R
was the most-used software, and Python was in second place. (Source:
https://www.kdnuggets.com/2015/07/poll-primary-analytics-language-r-python.html)
Uempty
Quick overview of R
• R is a free interpreted language.
• Best suited for statistical analysis and modeling:
ƒ Data exploration and manipulation
ƒ Descriptive statistics
ƒ Predictive analytics and machine learning
ƒ Visualization
• Can produce "publication quality graphics".
• Emerging as a competitor to proprietary platforms:
ƒ Widely used in universities and companies
ƒ Not as easy to use or performant as SAS or SPSS.
• Algorithms from companies and universities tend to first be available as R
packages, such as rpart (classification and regression trees) and
randomForest (random forests).
Uempty
R clients
RStudio is a simple and popular integrated developer environment
(IDE).
Figure: the RStudio window: write code in the File (script) pane or directly in the Console, run it, and see the results in the results pane.
There are different R clients available. RStudio, which you can download at no charge
from https://rstudio.com/ (available for Windows, Linux, and Mac), is the most common one.
Uempty
Simple example
• R supports atomic data types, lists, vectors, matrices, and data frames.
• Data frames are analogous to database tables, and they can be created
or read from CSV files.
• Large set of statistical functions, and more functions can be loaded as
packages.
# Vectors
> kidsNames <- c("Henry", "Peter", "Allison")
> kidsAges <- c(7, 11, 17)
# data.frame
> kids <- data.frame(ages = kidsAges, names = kidsNames)
> print(kids)
ages names
1 7 Henry
2 11 Peter
3 17 Allison
> mean(kids$ages)
[1] 11.66667
This slide shows a simple interactive R program that works on basic data. The program creates two
vectors, and then it merges them together into one data frame as two columns. Then, it shows how
to display the table and compute a basic average.
This type of interactive use of R can be done by using RStudio, which runs on Linux, Windows, and
Mac.
Uempty
Uempty
import re
import sys

# Read the input file that is named as the first command-line argument.
with open(sys.argv[1], "r", encoding="utf-8-sig") as infile:
    text = infile.read()

wordcount = {}
# Split on whitespace and the separators , ; '
for word in re.split(r"[\s,;']+", text):
    if not word:
        continue
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1

for key in wordcount:
    print("%s %s " % (key, wordcount[key]))
Uempty
Unit summary
• Listed the characteristics of representative data file formats, including
flat text files, CSV, XML, JSON, and YAML.
• Listed the characteristics of the four types of NoSQL data stores.
• Explained the storage that is used by HBase in some detail.
• Described Apache Pig.
• Described Apache Hive.
• Listed the characteristics of programming languages that are typically
used by data scientists: R and Python.
Uempty
Review questions
1. What is the data representation format of an RC or ORC
file?
A. Row-based encoding
B. Record-based encoding
C. Column-based storage
D. NoSQL data store
2. True or False: A NoSQL database is designed for those
developers that do not want to use SQL.
3. HBase is an example of which of the following NoSQL data
store types?
A. Key-value store
B. Graph store
C. Column store
D. Document store
Uempty
Uempty
Review answers
1. What is the data representation format of an RC or ORC
file?
A. Row-based encoding
B. Record-based encoding
C. Column-based storage
D. NoSQL data store
2. True or False: A NoSQL database is designed for those
developers that do not want to use SQL.
3. HBase is an example of which of the following NoSQL data
store types?
A. Key-value store
B. Graph store
C. Column store
D. Document store
Uempty
Review answers
4. Which database provides an SQL for Hadoop interface?
A. HBase
B. Apache Hive
C. Cloudant
D. MongoDB
5. True or False: R is a real programming language, and
Python is an interactive environment for doing statistics.
Uempty
Figure 7-73. Exercise: Using Apache HBase and Apache Hive to access Hadoop data
Uempty
Exercise objectives
• This exercise introduces you to Apache HBase and Apache
Hive. You learn the difference between both by gaining
experience with the HBase CLI shell and the Hive CLI to store,
process, and access Hadoop data. You also learn how to get
information about the HBase and Hive configuration by using the
Ambari Web UI.
• After completing this exercise, you will be able to:
ƒ Obtain information about HBase and Hive services by using the
Ambari Web UI.
ƒ Use the HBase shell to create HBase tables, explore the HBase
data model, and store and access data in HBase.
ƒ Use the Hive CLI to create Hive tables, import data into Hive, and
query data on Hive.
ƒ Use the Beeline CLI to query data on Hive.
Uempty
Overview
In this unit, you learn about the need for data governance and the role of data security in data
governance.
Uempty
Unit objectives
• Explain the need for data governance and the role of data security in
this governance.
• List the five pillars of security and how they are implemented with
Hortonworks Data Platform (HDP).
• Describe the history of security with Hadoop.
• Identify the need for and the methods that are used to secure personal
and sensitive information.
• Explain the function of the Hortonworks DataPlane Service (DPS).
Uempty
8.1. Hadoop security and governance
Uempty
Uempty
Topics
• Hadoop security and governance
• Hortonworks DataPlane Service
Uempty
Uempty
References:
• http://www.ibmbigdatahub.com/infographic/9-ways-build-confidence-big-data
• https://www.ibm.com/analytics/unified-governance-integration
• Video (Unified Governance for the Cognitive Computing Era) 2:40:
https://www.youtube.com/watch?v=G1OcWYWVIGw
Uempty
In this unit, we follow the open source approach that is available with the HDP product and related
products.
Hortonworks offers a 3-day training program:
• HDP Operations: Apache Hadoop Security Training:
https://www.cloudera.com/about/training/courses/hdp-administrator-security.html
This course is designed for experienced administrators who will be implementing secure
Hadoop clusters using authentication, authorization, auditing, and data protection strategies
and tools.
• The Cloudera CDH distribution has training for security matters too. It is described in Cloudera
Training: Secure Your Cloudera Cluster:
https://www.slideshare.net/cloudera/cloudera-training-secure-your-cloudera-cluster
Uempty
Sometimes, the security of computer systems is described in terms of three “As”:
• Authentication
• Authorization
• Accountability
Security must be applied to network connectivity, running processes, and the data itself.
When dealing with data, you are concerned primarily with:
• Integrity.
• Confidentiality and privacy.
• Rules and regulations concerning who has valid access to what based on both the role that is
performed and the individual exercising that role.
Uempty
Hadoop was designed for storing and processing large amounts of data efficiently and cheaply
(monetarily) compared to other platforms. The focus early in the project was around the actual
technology to make this process happen. Much of the code covered the logic about how to deal
with the complexities that are inherent in distributed systems, such as handling of failures and
coordination.
Because of this focus, the early Hadoop project established a security stance that the entire cluster
of machines and all the users accessing it were part of a trusted network. This meant that
Hadoop did not initially have strong security measures in place to enforce security.
As the Hadoop infrastructure evolved, it became apparent that at a minimum there should be a
mechanism for users to strongly authenticate to prove their identities. The primary mechanism that
was chosen for Hadoop was Kerberos, a well-established protocol that today is common in
enterprise systems such as Microsoft AD. After strong authentication came strong authorization.
Strong authorization defined what an individual user could do after they had been authenticated.
Uempty
Initially, authorization was implemented on a per-component basis, meaning that administrators
needed to define authorization controls in multiple places. This action led to the need for centralized
administration of security, which now is handled by Apache Ranger.
Another evolving need is the protection of data through encryption and other confidentiality
mechanisms. In the trusted network, it was assumed that data was inherently protected from
unauthorized users because only authorized users were on the network. Since then, Hadoop
added encryption for data that is transmitted between nodes and data that is stored on disk.
Now, we have the Five Pillars of Security:
• Administration
• Authentication
• Authorization
• Audit
• Data Protection
Hadoop 3.0.0 (GA as of early 2018, https://hadoop.apache.org/docs/r3.0.0/), like Hadoop 2, is
intimately concerned with security:
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html
Uempty
Uempty
Security is essential for organizations that store and process sensitive data in the Hadoop
infrastructure. Many organizations must adhere to strict corporate security policies.
Here are the challenges with Hadoop security in general:
• Hadoop is a distributed framework that is used for data storage and large-scale processing on
clusters by using commodity servers. Adding security to Hadoop is challenging because not all
the interactions follow the classic client/server pattern.
• In Hadoop, the file system is partitioned and distributed, requiring authorization checks at
multiple points.
• A submitted job is run later on nodes different than the node on which the client authenticated
and submitted the job.
• Secondary services such as a workflow system (Apache Oozie) access Hadoop on behalf of
users.
• A Hadoop cluster can scale to thousands of servers and tens of thousands of concurrent tasks.
• Hadoop, YARN, and others are evolving technologies, and each component is subject to
versioning and cross-component integration.
Uempty
References:
• HDP Documentation:
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/index.html
• HDP Security (PDF):
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_security/bk_security.pdf
Uempty
Figure: perimeter security for Hadoop REST/HTTP APIs (for example, WebHCat and YARN), with REST authentication options that include LDAP/AD, SPNEGO, WebSSO, and OAuth.
Kerberos was originally developed for MIT’s Project Athena in the 1980s and is the most widely
deployed system for authentication. It is included with all major computer operating systems. MIT
developers maintain implementations for the following operating systems:
• Linux and UNIX
• Mac OS X
• Windows
Apache Knox 1.0.0 (released 7 February 2018) delivers three groups of user-facing services:
• Proxy services
• Authentication services
• Client Domain Specific Language (DSL) and software development kit (SDK) services
You can find the Apache Knox user guide at:
https://knox.apache.org/books/knox-1-4-0/user-guide.html
Java 1.8 (Java Version 8) is required for the Apache Knox Gateway run time.
Apache Knox 1.0.0 supports Hadoop 3.x, but can be used with Hadoop 2.x.
Uempty
Apache Ranger delivers a comprehensive approach to security for a Hadoop cluster. It provides a
centralized platform to define, administer, and manage security policies consistently across Hadoop
components.
Uempty
Apache Ranger provides a centralized security framework to manage fine-grained access control
that uses an Apache Ambari interface to provide an administrative console that can:
• Deliver a “single pane of glass” for the security administrator.
• Centralize administration of a security policy.
• Define policies for accessing a resource (files, directories, databases, and table columns) for
users and groups.
• Enforce authorization policies within Hadoop.
• Enable audit tracking and policy analytics.
• Ensure consistent coverage across the entire Hadoop stack.
Apache Ranger has plug-ins for:
• HDFS
• Apache Hive
• Apache Knox
• Apache Storm
• HBase
The Apache Ranger Key Management Service (Ranger KMS) provides a scalable cryptographic
key management service for HDFS “data at rest” encryption. Ranger KMS is based on the Hadoop
KMS that was originally developed by the Apache community and extends the native Hadoop KMS
functions by allowing system administrators to store keys in a secure database.
Reference:
Apache Ranger:
https://www.cloudera.com/products/open-source/apache-hadoop/apache-ranger.html
Uempty
Implications of security
• Data is now an essential new driver of competitive advantage.
• Hadoop plays critical role in modern data architecture by providing:
ƒ Low costs
ƒ Scale-out data storage
ƒ Added value processing
• Any internal or external breach of this enterprise-wide data can be
catastrophic:
ƒ Privacy violations
ƒ Regulatory infractions
ƒ Damage to corporate image
ƒ Damage to long-term shareholder value and consumer confidence
In every industry, data is now an essential new driver of competitive advantage. Hadoop plays a
critical role in the modern data architecture by providing low costs, scale-out data storage, and
added value processing.
The Hadoop cluster with its HDFS file system and the broader role of a data lake are used to hold
the “crown jewels” of the business organization, that is, vital operational data that is used to drive
the business and make it unique among its peers. Some of this data is also highly sensitive.
Any internal or external breach of this enterprise-wide data can be catastrophic, such as privacy
violations, regulatory infractions, damage to corporate image, and long-term shareholder value. To
prevent damage to the company’s business, customers, finances, and reputation, management
and IT leaders must ensure that this data, whether it is in HDFS, a data lake, or hybrid storage,
including cloud storage, meets the same high standards of security as any data environment.
Uempty
Uempty
8.2. Hortonworks DataPlane Service
Uempty
Hortonworks DataPlane
Service
Uempty
Topics
• Hadoop security and governance
• Hortonworks DataPlane Service
Uempty
The DPS platform is an architectural foundation that helps register multiple data lakes and
manages data services across these data lakes from a “single unified pane of glass”.
The first release of DPS with HDP 2.6.3 includes Data Lifecycle Manager (DLM) as general
availability (GA), and Data Steward Studio (DSS) in technology preview (TP) mode.
These services are the first of a series of next-generation services.
Uempty
Figure: the DPS platform: core capabilities with extensible services on top (Data Lifecycle Manager, Data Steward Studio, and additional partner services), spanning hybrid, multi-cloud, IoT, and on-premises environments.
References:
• DPS website:
https://hortonworks.com/products/data-management/dataplane-service/
• Blogs:
▪ https://blog.cloudera.com/data-360/a-view-of-modern-data-architecture-and-management/
▪ https://blog.cloudera.com/step-step-guide-hdfs-replication/
• Press release:
https://www.cloudera.com/downloads/data-plane.html
• Product documents:
https://docs.cloudera.com/HDPDocuments/DP/DP-1.0.0/
Uempty
Figure 8-18. Managing, securing, and governing data across all assets
Uempty
Further reading
• Spivey, B. and Echeverria, J., Hadoop Security: Protecting Your Big
Data Platform. Sebastopol, CA: O’Reilly, 2015. 1491900989.
Uempty
Unit summary
• Explained the need for data governance and the role of data security in
this governance.
• Listed the five pillars of security and how they are implemented with
Hortonworks Data Platform (HDP).
• Described the history of security with Hadoop.
• Identified the need for and the methods that are used to secure
personal and sensitive information.
• Explained the function of the Hortonworks DataPlane Service (DPS).
Uempty
Review questions
1. Kerberos is used by Hadoop for:
A. Authentication
B. Authorization
C. Auditing
D. Data protection
2. ______ is used by Hadoop for API and perimeter security.
A. Apache Ambari
B. Apache Knox
C. Apache Ranger
D. Data Steward Studio
3. True or False: Kerberos provides automation and management
of Apache Ambari in the Hadoop cluster.
Uempty
Uempty
Review answers
1. Kerberos is used by Hadoop for:
A. Authentication
B. Authorization
C. Auditing
D. Data protection
2. ______ is used by Hadoop for API and perimeter security.
A. Apache Ambari
B. Apache Knox
C. Apache Ranger
D. Data Steward Studio
3. True or False: Kerberos provides automation and
management of Apache Ambari in the Hadoop cluster.
Uempty
Uempty
Overview
In this unit, you learn about big data stream computing and how it is used to analyze and process
vast amounts of data in real time to gain immediate insight and process the data at high speed.
Uempty
Unit objectives
• Define streaming data.
• Describe IBM as a pioneer in streaming analytics with IBM Streams.
• Explain streaming data concepts and terminology.
• Compare and contrast batch data versus streaming data.
• List and explain streaming components and streaming data engines
(SDEs).
Uempty
9.1. Streaming data and streaming analytics
Uempty
Uempty
Topics
• Streaming data and streaming analytics
• Streaming components and streaming data engines
• IBM Streams
Uempty
Hierarchical databases were invented in the 1960s and still serve as the foundation of online
transaction processing (OLTP) systems for all forms of business and government that drive trillions
of transactions today.
Consider a bank as an example. It is likely that even today, in many banks, information is
entered into an OLTP system, possibly by employees or by a web application that captures and
stores that data in hierarchical databases. This information then appears in daily reports and
graphical dashboards to demonstrate the state of the business and enable and support appropriate
actions. Analytical processing here is limited to capturing and understanding what happened.
Relational databases brought with them the concept of data warehousing, which extended the use
of databases from OLTP to online analytic processing (OLAP). By using our example of the bank,
the transactions that are captured by the OLTP system are stored over time and made available to
the various business analysts in the organization. With OLAP, the analysts can now use the stored
data to determine trends in loan defaults, overdrawn accounts, income growth, and so on. By
combining and enriching the data with the results of their analyses, they might do even more
complex analysis to forecast future economic trends or make recommendations about new
investment areas. Additionally, they can mine the data and look for patterns to help them be more
proactive in predicting potential future problems in areas such as foreclosures. Then, the business
can analyze the recommendations to decide whether they must act. The core value of OLAP is
focused on understanding why things happened to make more informed recommendations.
Uempty
A key component of OLTP and OLAP is that the data is stored. Now, some new applications require
faster analytics than is possible when you must wait until the data is retrieved from storage. To meet
the needs of these new dynamic applications, you must take advantage of the increase in the
availability of data before storage, otherwise known as streaming data. This need is driving the next
evolution in analytic processing called real-time analytic processing (RTAP). RTAP focuses on
taking the proven analytics that are established in OLAP to the next level. Data in motion and
unstructured data might be able to provide actual data where OLAP had to settle for assumptions
and hunches. The speed of RTAP allows for the potential of action in place of making
recommendations.
So, what type of analysis makes sense to do in real time? Key types of RTAP include, but are not
limited to, the following analyses:
• Alerting
• Feedback
• Detecting failures
Reference:
http://www.redbooks.ibm.com/redbooks/pdfs/sg248108.pdf
Uempty
Streaming data is the data that is continuously flowing across interconnected communication
channels. To automate and incorporate streaming data into your decision-making process, you
must use a new paradigm in programming called stream computing. Stream computing is the
response to the shift in paradigm to harness the awesome potential of data in motion. In traditional
computing, you access relatively static information to answer your evolving and dynamic analytic
questions. With stream computing, you can deploy a static application that continuously applies
that analysis to an ever-changing stream of data.
Uempty
In October 2010, IBM announced a research collaboration project with the Columbia University
Medical Center that might potentially help doctors spot life-threatening complications in brain injury
patients up to 48 hours earlier than with current methods. In a condition called delayed ischemia, a
common complication in patients recovering from strokes and brain injuries, the blood flow to the
brain is restricted, often causing permanent damage or death. With current methods of diagnosis,
the problem often has already begun by the time medical professionals see the data and spot
symptoms.
References:
• The Invention of Stream Computing:
https://www.ibm.com/ibm/history/ibm100/us/en/icons/streamcomputing
• Research articles (2004-2015):
https://researcher.watson.ibm.com/researcher/view_group_pubs.php?grp=2531
Uempty
IBM System S
ƒ System S provides a programming model
and an execution platform for user-
developed applications that ingest, filter,
analyze, and correlate potentially massive
volumes of continuous data streams.
ƒ A source adapter is an operator that
connects to a specific type of input (for
example, a stock exchange, weather data, or
a file system).
ƒ A sink adapter is an operator that connects
to a specific type of output that is external to
the streams processing system (for example,
an RDBMS).
ƒ An operator is a software processing unit.
ƒ A stream is a flow of tuples from one
operator to the next operator (they do not
traverse operators).
Uempty
References:
• The Invention of Stream Computing:
https://www.ibm.com/ibm/history/ibm100/us/en/icons/streamcomputing
• Research articles (2004-2015):
https://researcher.watson.ibm.com/researcher/view_group_pubs.php?grp=2531
Uempty
Figure: example IBM Streams data flow: sources such as the NOAA weather service, the NYMEX commodity exchange, and Internet, IoT, and alert feeds flow through operators, including a Split operator.
The graphic on which the slide is based comes from Stream Computing Platforms, Applications,
and Analytics (Overview), which can be found at the following website:
https://researcher.watson.ibm.com/researcher/view_group_subpage.php?id=2534&t=1
Uempty
Wikipedia https://en.wikipedia.org/wiki/Directed_acyclic_graph
“In mathematics and computer science, a directed acyclic graph (DAG), is a finite directed graph
with no directed cycles. That is, it consists of finitely many vertices and edges, with each edge
directed from one vertex to another, such that there is no way to start at any vertex v and follow a
consistently directed sequence of edges that eventually loops back to v again. Equivalently, a DAG
is a directed graph that has a topological ordering, a sequence of the vertices such that every edge
is directed from earlier to later in the sequence.”
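To make the definition concrete, the following small Python sketch (the operator names are invented for illustration) represents a streaming topology as a DAG and computes a topological ordering; if the graph contained a directed cycle, no such ordering would exist:
from graphlib import TopologicalSorter  # standard library in Python 3.9 and later

# Each operator lists the operators that it receives tuples from
# (the edges of the directed acyclic graph).
topology = {
    "source": [],
    "filter": ["source"],
    "aggregate": ["filter"],
    "sink": ["aggregate", "filter"],
}

# static_order() produces a topological ordering and raises CycleError
# if the graph contains a directed cycle (that is, if it is not a DAG).
print(list(TopologicalSorter(topology).static_order()))
# ['source', 'filter', 'aggregate', 'sink']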
Uempty
The terminology that is used here is that of IBM Streams, but similar terminology applies to other
Streams Processing Engines (SPEs). Here are examples of other terminology:
Apache Storm: http://storm.apache.org/releases/current/Concepts.html
• A spout is a source of streams in a topology. Generally, spouts read tuples from an external
source and emit them into the topology.
• All processing in topologies is done in bolts. Bolts can do filtering, functions, aggregations,
joins, talking to databases, and more.
Reference:
Glossary of terms for IBM Streams (formerly known as IBM InfoSphere Streams):
https://www.ibm.com/support/knowledgecenter/en/SSCRJU_4.3.0/com.ibm.streams.glossary.doc/d
oc/glossary_streams.html
Uempty
You are familiar with these operations in classic batch SQL. In the case of batch processing, all that
data is present when the SQL statement is processed.
But, with streaming data, the data is constantly flowing, and other techniques must be used. With
streaming data, operations are performed on windows of data.
Uempty
Figure: fixed (tumbling) time windows over a stream, with a separate count kept for each key (Key 1, Key 2, Key 3).
Sometimes, a stream processing job must do something in regular time intervals regardless of how
many incoming messages the job is processing.
For example, say that you want to report the number of page views per minute. To do this task, you
increment a counter every time you see a page view event. Once per minute, you send the current
counter value to an output stream and reset the counter to zero. This window is a fixed time
window, and it is useful for reporting, for example, sales that occurred during a clock-hour.
Other methods of windowing of streaming data use a sliding window or a session window.
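A minimal Python sketch of this fixed (tumbling) window logic follows; the get_next_event() source is a stand-in for a real stream and is invented for illustration:
import random
import time

def get_next_event():
    """Stand-in event source (invented for illustration): real code would read from a stream."""
    time.sleep(0.1)
    return {"type": "page_view"} if random.random() < 0.5 else None

WINDOW_SECONDS = 60
window_start = time.monotonic()
page_views = 0

while True:
    event = get_next_event()
    if event is not None and event["type"] == "page_view":
        page_views += 1

    # When the fixed (tumbling) window expires, emit the count and reset it to zero.
    if time.monotonic() - window_start >= WINDOW_SECONDS:
        print("page views in the last minute:", page_views)
        page_views = 0
        window_start += WINDOW_SECONDS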
Uempty
9.2. Streaming components and streaming data
engines
Uempty
Uempty
Topics
• Streaming data and streaming analytics
• Streaming components and streaming data engines
• IBM Streams
Uempty
Proprietary:
• IBM Streams (full SDE)
• Amazon Kinesis
• Microsoft Azure Stream Analytics
To work with streaming analytics, it is important to understand what the various available
components are and how they relate. The topic itself is complex and deserving of a full workshop,
so here we can provide only an introduction.
A full data pipeline (that is, streaming application) involves the following items:
• Accessing data at the source (“source operator” components). Apache Kafka can be used here.
• Processing data (serializing data, merging and joining individual streams, referencing static
data from in-memory stores and databases, transforming data, and performing aggregation and
analytics). Apache Storm is a component that is sometimes used here.
• Delivering data to long-term persistence and dynamic visualization (“sink operators”).
IBM Streams can handle all these operations by using standard and custom-built operators. It is a
full streaming data engine (SDE), but open source components can be used to build equivalent
systems.
We are looking at only Hortonworks Data Flow (HDF) / NiFi and IBM Streams (data pipeline) in
detail.
Open-source references:
Uempty
• http://storm.apache.org
• https://flink.apache.org
• http://kafka.apache.org
• Apache Samza is a distributed stream processing framework. It uses Apache Kafka for
messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security,
and resource management.
http://samza.apache.org
• Apache Beam is an advanced unified programming model that is used to implement batch and
streaming data processing jobs that run on any execution engine.
https://beam.apache.org
• Apache Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.
https://spark.apache.org/streaming
• Introduction to Apache Spark Streaming (Cloudera Tutorial):
https://hortonworks.com/tutorials/?tab=product-hdf
Proprietary SDE references:
• IBM Streams:
https://www.ibm.com/cloud/streaming-analytics
• Try IBM Streams (basic) for free:
https://console.bluemix.net/catalog/services/streaming-analytics
• Amazon Kinesis:
https://aws.amazon.com/kinesis
• Microsoft Azure Stream Analytics:
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-introduction
▪ Stock-trading analysis and alerts
▪ Fraud detection, data, and identity protections
▪ Embedded sensor and actuator analysis
▪ Web clickstream analytics
Uempty
Cloudera DataFlow
• Cloudera provides an end-to-end platform that collects, curates,
analyzes, and acts on data in real time, on premises, or in the cloud.
• With Version 3.x, HDF has a drag-and-drop visual interface available.
• HDF is an integrated solution that uses Apache NiFi/MiNiFi, Apache
Kafka, Apache Storm, Apache Superset, and Apache Druid components
where appropriate.
• The HDF streaming real-time data analytics platform includes data flow
management systems, stream processing, and enterprise services.
• The newest additions to HDF include a Schema Repository.
Cloudera DataFlow is an enterprise-ready open source streaming data platform with flow
management, stream processing, and management services components. It collects, curates,
analyzes, and acts on data in the data center and cloud. Cloudera DataFlow is powered by key
open source projects, including Apache NiFi and MiNiFi, Apache Kafka, Apache Storm, and
Apache Druid.
Cloudera DataFlow Enterprise Stream Processing includes support services for Apache Kafka and
Apache Storm, and Streaming Analytics Manager. Apache Kafka and Apache Storm enable
immediate and continuous insights by using aggregations over windows, pattern matching, and
predictive and prescriptive analytics.
With the newly introduced integrated Streaming Analytics Manager, users can get the following
benefits:
• Build easily by using a drag-and-drop visual paradigm to create an analytics application.
• Operate efficiently by easily testing, troubleshooting, and monitoring the deployed
application.
• Analyze quickly by using an analytics engine that is powered by Apache Druid and a rich
visual dashboard that is powered by Apache Superset.
Uempty
Reference:
https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=an&subtype=ca&appname=gpatea
m&supplier=897&letternum=ENUS218-351
Uempty
NiFi background:
• Originated at the National Security Agency (NSA), and it has more than eight years of
development as a closed-source product.
• An Apache incubator project in November 2014 as part of an NSA technology transfer program.
• Apache top-level project in July 2015.
• Java based, running on a Java virtual machine (JVM).
Wikipedia “Apache NiFi” https://en.wikipedia.org/wiki/Apache_NiFi:
• Apache NiFi (short for NiagaraFiles) is a software project from the Apache Software Foundation
that is designed to automate the flow of data between software systems. It is based on the
"NiagaraFiles" software that was previously developed by the NSA, and it was offered as open
source as a part of the NSA’s technology transfer program in 2014.
• The software design is based on the flow-based programming model and offers features that
include the ability to operate within clusters, security by using TLS encryption, extensibility
(users can write their own software to extend its abilities), and improved usability features like a
portal that can be used to view and modify behavior visually.
NiFi is written in Java and runs within a JVM running on the server that hosts it. The main
components of NiFi are:
Uempty
• Web server: The HTTP-based component that is used to visually control the software and
monitor the data flows.
• Flow controller: Serves as the "brains" of NiFi and controls the running of NiFi extensions and
the scheduling and allocation of resources.
• Extensions: Various plug-ins that allow NiFi to interact with different kinds of systems.
• FlowFile repository: Used by NiFi to maintain and track the status of the active FlowFile,
including the information that NiFi is helping move between systems.
• Cluster manager: An instance of NiFi that provides the sole management point for the cluster;
the same data flow runs on all the nodes of the cluster.
• Content repository: Where data in transit is maintained.
• Provenance repository: Metadata detailing the provenance of the data flowing through the
software.
Development and commercial support is offered by Hortonworks. The software is fully
open source. This software is also sold and supported by IBM.
MiNiFi is a subproject of NiFi that is designed to solve the difficulties of managing and transmitting
data feeds to and from the source of origin, often the first and last mile of a digital signal, enabling
edge intelligence to adjust flow behavior and bidirectional communication.
Reference:
http://discover.attunity.com/apache-nifi-for-dummies-en-report-go-c-lp8558.html
Uempty
9.3. IBM Streams
Uempty
IBM Streams
Uempty
Topics
• Streaming data and streaming analytics
• Streaming components and streaming data engines
• IBM Streams
Uempty
IBM Streams
• IBM Streams is an advanced computing platform that allows user-
developed applications to quickly ingest, analyze, and correlate
information as it arrives from thousands of real-time sources.
• The solution can handle high data throughput rates up to millions of
events or messages per second.
• IBM Streams provides:
ƒ Development support: A rich, Eclipse-based, visual integrated
development environment (IDE) that solution architects use to build
applications visually or by using familiar programming languages like Java,
Scala, or Python.
ƒ Rich data connections: Connect with virtually any data source whether it is
structured, unstructured, or streaming, and integrate with Hadoop, Apache
Spark, and other data infrastructures.
ƒ Analysis and visualization: Integrate with business solutions. You use built-in
domain analytics like machine learning, natural language, spatial-temporal,
text, acoustic, and more to create adaptive streams applications.
The IBM Streams product is based on nearly two decades of effort by the IBM Research team to
extend computing technology to handle advanced analysis of high volumes of data quickly. How
important is their research? Consider how it would help crime investigation to analyze the output of
any video cameras in the area that surrounds the scene of a crime to identify specific faces of any
persons of interest in the crowd and relay that information to the unit that is responding. Similarly,
what a competitive edge it might provide by analyzing 6 million stock market messages per second
and executing trades with an average trade latency of only 25 microseconds (far faster than a
hummingbird flaps its wings). Think about how much time, money, and resources might be saved
by analyzing test results from chip-manufacturing wafer testers in real time to determine whether
there are defective chips before they leave the line.
System S
While at IBM, Dr. Ted Codd invented the relational database. In this defining IBM Research project,
it was referred to as System R, which stood for Relational. The relational database is the foundation
for data warehousing that started the highly successful client/server and on-demand informational
eras. One of the cornerstones of that success was the capability of OLAP products that are still
used in critical business processes today.
Uempty
When the IBM Research division set its sights on developing something to address the next
evolution of analysis (RTAP) for the Smarter Planet evolution, they set their sights on developing a
platform with the same level of world-changing success, and decided to call their effort System S,
which stood for Streams. Like System R, System S was founded on the promise of a revolutionary
change to the analytic paradigm. The research of the Exploratory Stream Processing Systems
team at T.J. Watson Research Center, which was focused on advanced topics in highly scalable
stream-processing applications for the System S project, is the heart and soul of Streams.
Critical intelligence, informed actions, and operational efficiencies that are all available in real time
is the promise of Streams.
References:
• https://www.ibm.com/cloud/streaming-analytics
• Streaming Analytics: Resources:
https://www.ibm.com/cloud/streaming-analytics/resources
• Addressing Data Volume, Velocity, and Variety with IBM InfoSphere Streams V3.0,
SG24-8108:
http://www.redbooks.ibm.com/redbooks/pdfs/sg248108.pdf
• Toolkits, Sample, and Tutorials for IBM Streams:
https://github.com/IBMStreams
Uempty
The following table shows the comparison between IBM Streams and NiFi.
Uempty
Being able to create stream processing topologies without programming is a worthwhile goal. It is
something that is possible by using IBM Streams Studio.
IBM Streams is a complete SDE that is ready to run immediately after installation. In addition, you
have all the tools to develop custom source and sink operators.
IBM Streams can cross-integrate with IBM SPSS Statistics to provide Predictive Model Markup
Language (PMML) capability and work with R, the open source statistical package that supports
PMML.
What is PMML?
PMML is the de-facto standard language that is used to represent data mining models. A PMML file
can contain a myriad of data transformations (pre- and post-processing) and one or more predictive
models. Predictive analytic models and data mining models are terms that refer to mathematical
models that use statistical techniques to learn patterns that are hidden in large volumes of historical
data. Predictive analytic models use the knowledge that is acquired during training to predict the
existence of known patterns in new data. With PMML, you can share predictive analytic models
between different applications. Therefore, you can train a model in one system, express it in PMML,
and move it to another system where you can use it to predict, for example, the likelihood of
machine failure.
Uempty
Reference:
https://www.ibm.com/developerworks/library/ba-ind-PMML1/
Uempty
Figure: where IBM Streams fits: batch data arrives as delta loads into the Hadoop system or data lake (HDFS) and becomes aggregated data, while real-time data flows through stream computing and becomes analyzed data.
Figure 9-23. Where does IBM Streams fit in the processing cycle
What if you wanted to add fraud detection to your processing cycle before authorizing the
transaction and committing it to the database? Fraud detection must happen in real time for it to be
a benefit. So instead of taking your data and running it directly in the authorization / OLTP process,
process it as streaming data by using IBM Streams. The results of this process can be directed to
your transactional system or data warehouse, or to a data lake or Hadoop system storage (for
example, HDFS).
Uempty
Figure: example IBM Streams applications: analyzing Twitter feeds, detecting life-threatening conditions at hospitals in time to intervene, and predicting weather patterns to plan optimal wind turbine usage and optimize capital expenditure on asset placement.
The slide shows some of the situations in which IBM Streams was applied to perform real-time
analytics on streaming data.
Today, organizations are tapping into only a small fraction of the data that is available to them. The
challenge is figuring out how to analyze all the data and find insights in these new and
unconventional data types. Imagine if you could analyze the 7 TB of tweets created each day to
figure out what people are saying about your products and figure out who the key influencers are
within your target demographics. Can you imagine being able to mine this data to identify new
market opportunities?
What if hospitals could take the thousands of sensor readings that are collected every hour per
patient in ICUs to identify subtle indications that the patient is becoming unwell, days earlier than is
allowed by traditional techniques? Imagine if a green energy company could use petabytes of
weather data along with massive volumes of operational data to optimize asset location and
utilization, making these environmentally friendly energy sources more cost competitive with
traditional sources.
What if you could make risk decisions, such as whether someone qualifies for a mortgage, in
minutes by analyzing many sources of data, including real-time transactional data, while the client
is still on the phone or in the office? What if law enforcement agencies could analyze audio and
video feeds in real-time without human intervention to identify suspicious activity?
Uempty
Reference:
http://www.redbooks.ibm.com/redbooks/pdfs/sg248108.pdf
Uempty
Applications can be developed in IBM Streams Studio by using IBM Streams Processing Language
(SPL), which is a declarative language that is customized for stream computing. Applications with
the latest release are generally developed by using a drag-and-drop graphical approach.
After the applications are developed, they are deployed to a Streams Runtime environment. By
using Streams Live Graph, you can monitor the performance of the runtime cluster from the
perspective of individual machines and the communications among them.
Virtually any device, sensor, or application system can be defined by using the language, but there
are predefined source and output adapters that can further simplify application development. As
examples, IBM delivers the following adapters, among many others:
• TCP/IP, UDP/IP, and files.
• IBM WebSphere Front Office, which delivers stock feeds from major exchanges worldwide.
• IBM solidDB® includes an in-memory, persistent database that uses the Solid Accelerator API.
• Relational databases, which are supported by using industry-standard ODBC.
Applications, such as the one shown in the slide, usually feature multiple steps.
Uempty
For example, some utilities began paying customers who sign up for a particular usage plan to have
their air conditioning units turned off for a short time so that the temperature changed. An
application to implement this plan collects data from meters and might apply a filter to monitor only
for those customers who selected this service. Then, a usage model must be applied that was
selected for that company. Then, up-to-date usage contracts must be applied by retrieving them,
extracting the text, filtering on keywords, and possibly applying a seasonal adjustment.
Current weather information can be collected and parsed from the US National Oceanic &
Atmospheric Administration (NOAA), which has weather stations across the United States. After the
correct location is parsed for, text can be extracted, and the temperature history can be read from a
database and compared to historical information. Optionally, the latest temperature history could be
stored in a warehouse for future use.
Finally, the three streams (meter information, usage contract, and current weather comparison to
historical weather) can be used to act.
Reference:
http://www.redbooks.ibm.com/redbooks/pdfs/sg248108.pdf
Uempty
Unit summary
• Defined streaming data.
• Described IBM as a pioneer in streaming analytics with IBM Streams.
• Explained streaming data concepts and terminology.
• Compared and contrasted batch data versus streaming data.
• Listed and explained streaming components and streaming data
engines (SDEs).
Uempty
Review questions
1. True or False: IBM Streams needs Apache Storm or Apache
Spark to provide the analytics.
2. True or False: Streaming data is limited to sensors,
cameras, and video.
3. What are the differences between NiFi and MiNiFi?
A. NiFi is small and has low resource consumption.
B. NiFi is subproject of MiNiFi.
C. NiFi is a disk-based and microbatch ETL tool.
D. They are the same.
Uempty
Uempty
Review answers
1. True or False: IBM Streams needs Apache Storm or Apache
Spark to provide the analytics
2. True or False: Streaming data is limited to sensors,
cameras, and video.
3. What are the differences between NiFi and MiNiFi?
A. NiFi is small and has low resource consumption.
B. NiFi is subproject of MiNiFi.
C. NiFi is a disk-based and microbatch ETL tool.
D. They are the same.
Uempty
backpg