Data Analytics Unit 5 Notes

Bachelor of Computer Applications (Srinivas University)


UNIT 5: DATA ANALYTICS

MapReduce

MapReduce is a programming model for writing applications that can process Big Data in
parallel on multiple nodes. MapReduce provides analytical capabilities for analyzing huge
volumes of complex data.

Why MapReduce?
Traditional enterprise systems normally have a centralized server to store and
process data. This traditional model is certainly not suitable for processing huge
volumes of data: such volumes cannot be accommodated by standard database
servers. Moreover, the centralized system creates too much of a bottleneck while
processing multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce.


MapReduce divides a task into small parts and assigns them to many computers.
Later, the results are collected at one place and integrated to form the result dataset.

How Does MapReduce Work?


The MapReduce algorithm contains two important tasks, namely Map and Reduce.
 The Map task takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as an input and combines
those data tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Let us now take a close look at each of the phases and try to understand their
significance.

 Input Phase − Here we have a Record Reader that translates each record in
an input file and sends the parsed data to the mapper in the form of key-value
pairs.
 Map − Map is a user-defined function, which takes a series of key-value pairs
and processes each one of them to generate zero or more key-value pairs.
 Intermediate Keys − The key-value pairs generated by the mapper are
known as intermediate keys.
 Combiner − A combiner is a type of local Reducer that groups similar data
from the map phase into identifiable sets. It takes the intermediate keys from
the mapper as input and applies a user-defined code to aggregate the values
in a small scope of one mapper. It is not a part of the main MapReduce
algorithm; it is optional.
 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It
downloads the grouped key-value pairs onto the local machine, where the
Reducer is running. The individual key-value pairs are sorted by key into a
larger data list. The data list groups the equivalent keys together so that their
values can be iterated easily in the Reducer task.
 Reducer − The Reducer takes the grouped key-value paired data as input and
runs a Reducer function on each group. Here, the data can be aggregated,
filtered, and combined in a number of ways, depending on the processing
required. Once the execution is over, it gives zero or more key-value pairs to
the final step.
 Output Phase − In the output phase, we have an output formatter that
translates the final key-value pairs from the Reducer function and writes them
onto a file using a record writer.
Let us try to understand the two tasks, Map and Reduce, with the help of a small
example.
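
As a minimal sketch of these two tasks, here is the classic word-count pair of Mapper and
Reducer classes written against the standard Hadoop Java API; the class and field names
are illustrative choices, not part of the notes above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: break each input line into tokens and emit an intermediate (word, 1) pair per token.
// (In a real project each public class would live in its own .java file.)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken().toLowerCase());
            context.write(word, ONE);               // intermediate key-value pair
        }
    }
}

// Reduce task: after Shuffle and Sort, receive (word, [1, 1, ...]) and sum the values.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final key-value pair for this word
    }
}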

MapReduce Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter
receives around 500 million tweets per day, which is several thousand tweets per
second. Twitter manages its tweets with the help of MapReduce.

The MapReduce algorithm performs the following actions −
 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-
value pairs.
 Filter − Filters unwanted words from the maps of tokens and writes the filtered
maps as key-value pairs.
 Count − Generates a token counter per word.
 Aggregate Counters − Prepares an aggregate of similar counter values into
small manageable units.
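
A hedged sketch of how the Tokenize and Filter steps might be expressed as a single
Hadoop Mapper; the stop-word list, class name, and cleaning rule are illustrative
assumptions, not Twitter's actual implementation. The Count and Aggregate Counters steps
would then be handled by a summing Combiner and Reducer like the ones shown earlier.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tokenize + Filter in one map step: split each tweet into tokens, drop unwanted words,
// and emit (word, 1) so the later Count and Aggregate Counters steps can sum them.
public class TweetTokenFilterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Hypothetical stop-word list; a real job would load a much larger list from a file.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "to", "and", "of"));
    private static final IntWritable ONE = new IntWritable(1);
    private final Text token = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String raw : value.toString().split("\\s+")) {
            // Keep letters, digits, hashtags and mentions; drop everything else.
            String word = raw.toLowerCase().replaceAll("[^a-z0-9#@]", "");
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                token.set(word);
                context.write(token, ONE);
            }
        }
    }
}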

MapReduce Algorithms


The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The map task is done by means of Mapper Class


 The reduce task is done by means of Reducer Class.
The Mapper class takes the input, tokenizes it, and maps and sorts it. The output of the
Mapper class is used as input by the Reducer class, which in turn searches for matching
pairs and reduces them.
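
A minimal driver sketch showing how the Mapper and Reducer classes are wired into one
job; it reuses the WordCountMapper and WordCountReducer classes from the earlier sketch
and takes hypothetical HDFS input and output paths as command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: wires the Mapper and Reducer classes into a single MapReduce job.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);    // map task
        job.setCombinerClass(WordCountReducer.class); // optional local reduce on each mapper node
        job.setReducerClass(WordCountReducer.class);  // reduce task

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}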

MapReduce implements various mathematical algorithms to divide a task into small
parts and assign them to multiple systems. In technical terms, the MapReduce algorithm
helps in sending the Map and Reduce tasks to the appropriate servers in a cluster.
These mathematical algorithms may include the following −

 Sorting
 Searching
 Indexing
 TF-IDF

Sorting
Sorting is one of the basic MapReduce algorithms used to process and analyze data.
MapReduce implements a sorting algorithm to automatically sort the output key-value
pairs from the mapper by their keys.
 Sorting methods are implemented in the mapper class itself.
 In the Shuffle and Sort phase, after tokenizing the values in the mapper class,
the Context class collects the matching-valued keys as a collection.
 To collect similar key-value pairs (intermediate keys), the Mapper class takes
the help of the RawComparator class to sort the key-value pairs.
 The set of intermediate key-value pairs for a given Reducer is automatically
sorted by Hadoop to form key-values (K2, {V2, V2, …}) before they are
presented to the Reducer.
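
As a hedged illustration, the sketch below shows how the default ascending sort order
could be replaced with a custom comparator (an assumed DescendingTextComparator class)
registered on the job; by default, Hadoop simply uses the comparator registered for the
key type.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// A comparator that reverses the default ascending order of Text keys.
public class DescendingTextComparator extends WritableComparator {
    public DescendingTextComparator() {
        super(Text.class, true); // true = instantiate keys so the object compare() below is used
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b); // negate the default comparison to sort in descending order
    }
}

// In the job driver, the custom order is registered like this:
//   job.setSortComparatorClass(DescendingTextComparator.class);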

Searching
Searching plays an important role in the MapReduce algorithm. It helps in the Combiner
phase (optional) and in the Reducer phase. Let us try to understand how searching
works with the help of an example.

Example
The following example shows how MapReduce employs the searching algorithm to find
out the details of the employee who draws the highest salary in a given employee
dataset.
 Let us assume we have employee data in four different files − A, B, C, and D.
Let us also assume there are duplicate employee records in all four files
because the employee data was imported repeatedly from the database tables.

 The Map phase processes each input file and provides the employee data as
key-value pairs (<k, v> : <emp name, salary>).

 The Combiner phase (searching technique) accepts the input from the
Map phase as key-value pairs of employee name and salary. Using the
searching technique, the combiner checks all the employee salaries to find
the highest-salaried employee in each file. The Reducer phase then repeats
the same check across the per-file maxima to find the overall highest-salaried
employee.
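
A minimal sketch of this searching pattern, assuming the employee records arrive as
comma-separated "name,salary" lines (an assumption for illustration); the same class is
registered as both the Combiner (highest salary per mapper/file) and the Reducer
(overall highest salary).

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: forward every "name,salary" record under one constant key,
// so that all records are compared against each other downstream.
public class MaxSalaryMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Text MAX_KEY = new Text("max");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(MAX_KEY, value); // value looks like "satish,26000"
    }
}

// Used as both Combiner and Reducer: the Combiner keeps only the highest salary seen by
// each mapper (one per input file here), and the Reducer picks the overall maximum.
class MaxSalaryReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String bestRecord = null;
        long bestSalary = Long.MIN_VALUE;
        for (Text v : values) {
            String[] parts = v.toString().split(",");
            long salary = Long.parseLong(parts[1].trim());
            if (salary > bestSalary) {
                bestSalary = salary;
                bestRecord = v.toString();
            }
        }
        if (bestRecord != null) {
            context.write(new Text("highest salary"), new Text(bestRecord));
        }
    }
}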

Introduction to Hadoop

Hadoop is an open-source software framework that is used for storing and
processing large amounts of data in a distributed computing environment. It
is designed to handle big data and is based on the MapReduce programming
model, which allows for the parallel processing of large datasets.
Hadoop has two main components:
 HDFS (Hadoop Distributed File System): This is the storage component
of Hadoop, which allows for the storage of large amounts of data across
multiple machines. It is designed to work with commodity hardware, which
makes it cost-effective.
 YARN (Yet Another Resource Negotiator): This is the resource
management component of Hadoop, which manages the allocation of
resources (such as CPU and memory) for processing the data stored in
HDFS.
 Hadoop also includes several other modules that provide additional
functionality, such as Hive (a SQL-like query language), Pig (a high-level
platform for creating MapReduce programs), and HBase (a non-relational,
distributed database).
 Hadoop is commonly used in big data scenarios such as data
warehousing, business intelligence, and machine learning. It’s also used
for data processing, data analysis, and data mining.
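
As a small, hedged illustration of the HDFS storage component described above, the sketch
below writes a file into HDFS and reads it back through Hadoop's Java FileSystem API; the
path is a hypothetical example, and the configuration is assumed to be picked up from the
cluster's configuration files.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write a small file into HDFS and read it back through the FileSystem API.
public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hello.txt");     // hypothetical HDFS path

        // Write: the NameNode records the metadata, DataNodes store the blocks.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}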
What is Hadoop?

Hadoop is an open-source software programming framework for storing a
large amount of data and performing computation on it. The framework is based
on Java, with some native code in C and some shell scripts. It enables the
distributed processing of large data sets across clusters of computers using a
simple programming model.
History of Hadoop
Hadoop was developed under the Apache Software Foundation, and its co-founders
are Doug Cutting and Mike Cafarella. Doug Cutting named it after his son's toy
elephant. In October 2003, Google published its paper on the Google File System.
In January 2006, MapReduce development started within Apache Nutch, with around
6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. In
April 2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and processing big
data. It was created under the Apache Software Foundation in 2006, based on
papers published by Google that described the Google File System (GFS, 2003)
and the MapReduce programming model (2004). The Hadoop framework
allows for the distributed processing of large data sets across clusters of
computers using simple programming models. It is designed to scale up from
single servers to thousands of machines, each offering local computation
and storage. It is used by many organizations, including Yahoo, Facebook,
and IBM, for a variety of purposes such as data warehousing, log
processing, and research. Hadoop has been widely adopted in the industry
and has become a key technology for big data processing.

Features of Hadoop:
1. It is fault-tolerant.
2. It is highly available.
3. Its programming model is easy.
4. It has huge, flexible storage.
5. It is low-cost.

Hadoop has several key features that make it well-suited for big data
processing:

 Distributed Storage: Hadoop stores large data sets across multiple


machines, allowing for the storage and processing of extremely large
amounts of data.
 Scalability: Hadoop can scale from a single server to thousands of
machines, making it easy to add more capacity as needed.
 Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning
it can continue to operate even in the presence of hardware failures.
 Data locality: Hadoop provides a data-locality feature: computation is
scheduled on the node where the data is stored, which helps to reduce
network traffic and improve performance.
 High Availability: Hadoop provides a High Availability feature, which helps
to make sure that the data is always available and is not lost.
 Flexible Data Processing: Hadoop’s MapReduce programming model
allows for the processing of data in a distributed fashion, making it easy to
implement a wide variety of data processing tasks.
 Data Integrity: Hadoop provides a built-in checksum feature, which helps to
ensure that the stored data is consistent and correct.
 Data Replication: Hadoop provides a data-replication feature, which
replicates the data across the cluster for fault tolerance.
 Data Compression: Hadoop provides a built-in data-compression feature,
which helps to reduce storage space and improve performance.
 YARN: A resource management platform that allows multiple data
processing engines like real-time streaming, batch processing, and
interactive SQL, to run and process data stored in HDFS.
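
A hedged configuration sketch touching two of these features, data replication and data
compression, at the level of a single job; the property values shown are illustrative
choices, not recommendations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Replication and compression settings applied to a single job.
public class TuningSketch {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3); // keep 3 copies of each output block (the HDFS default)

        Job job = Job.getInstance(conf, "tuned job");

        // Compress the final job output to save storage space.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        return job;
    }
}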
Hadoop Distributed File System
Hadoop has a distributed file system known as HDFS. HDFS splits files into
blocks and distributes them across the various nodes of a large cluster. In
case of a node failure, the system continues to operate, and data transfer
between the nodes is handled by HDFS.
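
The block layout can be inspected programmatically; the sketch below (with an assumed
class name) asks HDFS, through the FileSystem API, which hosts store the blocks of a
given file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Ask the NameNode where the blocks of a file live; each block is typically
// replicated on several DataNodes, which is what lets HDFS survive node failures.
public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);                 // e.g. an existing HDFS file

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            System.out.printf("block %d -> hosts: %s%n",
                    i, String.join(", ", blocks[i].getHosts()));
        }
        fs.close();
    }
}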

HDFS

Advantages of HDFS: It is inexpensive, immutable in nature, stores data
reliably, tolerates faults, is scalable and block-structured, can process a
large amount of data simultaneously, and more.

Disadvantages of HDFS: Its biggest disadvantage is that it is not a good fit
for small quantities of data. It also has issues related to potential
stability, and it can be restrictive and rough in nature.

Hadoop also supports a wide range of software packages such as Apache Flume,
Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache
Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.
Some common frameworks of Hadoop
1. Hive- It uses HiveQL for data structuring and for writing complicated
MapReduce jobs in HDFS.
2. Drill- It consists of user-defined functions and is used for data exploration.
3. Storm- It allows real-time processing and streaming of data.
4. Spark- It contains a Machine Learning Library (MLlib) for providing
enhanced machine learning and is widely used for data processing. It also
supports Java, Python, and Scala.
5. Pig- It has Pig Latin, a SQL-like language, and performs data
transformation of unstructured data.
6. Tez- It reduces the complexities of Hive and Pig and helps their code run
faster.
Hadoop framework is made up of the following modules:
1. Hadoop MapReduce- a MapReduce programming model for handling and
processing large data.
2. Hadoop Distributed File System- distributed files in clusters among nodes.
3. Hadoop YARN- a platform which manages computing resources.
4. Hadoop Common- it contains packages and libraries which are used by the
other modules.
Advantages and Disadvantages of Hadoop
Advantages:
 Ability to store a large amount of data.
 High flexibility.

 Cost effective.
 High computational power.
 Tasks are independent.
 Linear scaling.

Hadoop has several advantages that make it a popular choice for big data
processing:

 Scalability: Hadoop can easily scale to handle large amounts of data by


adding more nodes to the cluster.
 Cost-effective: Hadoop is designed to work with commodity hardware,
which makes it a cost-effective option for storing and processing large
amounts of data.
 Fault-tolerance: Hadoop’s distributed architecture provides built-in fault-
tolerance, which means that if one node in the cluster goes down, the
data can still be processed by the other nodes.
 Flexibility: Hadoop can process structured, semi-structured, and
unstructured data, which makes it a versatile option for a wide range of
big data scenarios.
 Open-source: Hadoop is open-source software, which means that it is
free to use and modify. This also allows developers to access the source
code and make improvements or add new features.
 Large community: Hadoop has a large and active community of
developers and users who contribute to the development of the software,
provide support, and share best practices.
 Integration: Hadoop is designed to work with other big data technologies
such as Spark, Storm, and Flink, which allows for integration with a wide
range of data processing and analysis tools.
Disadvantages:
 Not very effective for small data.
 Hard cluster management.
 Has stability issues.
 Security concerns.
 Complexity: Hadoop can be complex to set up and maintain, especially
for organizations without a dedicated team of experts.
 Latency: Hadoop is not well-suited for low-latency workloads and may not
be the best choice for real-time data processing.
 Limited Support for Real-time Processing: Hadoop’s batch-oriented
nature makes it less suited for real-time streaming or interactive data
processing use cases.
 Limited Support for Structured Data: Hadoop is designed to work with
unstructured and semi-structured data; it is not well-suited for structured
data processing.
 Data Security: Hadoop's built-in security features (such as data encryption
and user authentication) are limited out of the box, which can make it
difficult to secure sensitive data.

 Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming
model is not well-suited for ad-hoc queries, making it difficult to perform
exploratory data analysis.
 Limited Support for Graph and Machine Learning: Hadoop’s core
components, HDFS and MapReduce, are not well-suited for graph and
machine-learning workloads; specialized components like Apache Giraph
and Mahout are available but have some limitations.
 Cost: Hadoop can be expensive to set up and maintain, especially for
organizations with large amounts of data.
 Data Loss: In the event of a hardware failure, the data stored in a single
node may be lost permanently.
 Data Governance: Data governance is a critical aspect of data
management, and Hadoop does not provide built-in features to manage data
lineage, data quality, data cataloging, and data audit.

Data Visualization
Data visualization is a graphical representation of quantitative information
and data by using visual elements like graphs, charts, and maps.

Data visualization converts large and small data sets into visuals, which
are easy for humans to understand and process.

Data visualization tools provide accessible ways to understand outliers,
patterns, and trends in the data.

In the world of Big Data, the data visualization tools and technologies are
required to analyze vast amounts of information.

Data visualizations are common in your everyday life, and they usually
appear in the form of graphs and charts. The combination of multiple
visualizations and bits of information is still referred to as an infographic.

Data visualizations are used to discover unknown facts and trends. You
can see visualizations in the form of line charts to display change over
time. Bar and column charts are useful for observing relationships and
making comparisons. A pie chart is a great way to show parts-of-a-whole.
And maps are the best way to share geographical data visually.

Today's data visualization tools go beyond the charts and graphs used in
Microsoft Excel spreadsheets, displaying data in more sophisticated ways
such as dials and gauges, geographic maps, heat maps, pie charts, and
fever charts.

What makes Data Visualization Effective?


Effective data visualizations are created where communication, data science,
and design collide. Data visualizations done right offer key insights into
complicated data sets in ways that are meaningful and natural.

American statistician and Yale professor Edward Tufte believes useful
data visualizations consist of "complex ideas communicated with clarity,
precision, and efficiency."

To craft an effective data visualization, you need to start with clean data
that is well-sourced and complete. After the data is ready to visualize, you
need to pick the right chart type.

After you have decided on the chart type, you need to design and customize
your visualization to your liking. Simplicity is essential - you don't want to
add any elements that distract from the data.

History of Data Visualization


The concept of using pictures to understand data from maps and graphs was
launched in the 17th century, and then in the early 1800s the idea was
extended with the invention of the pie chart.

Several decades later, one of the most advanced examples of statistical
graphics occurred when Charles Minard mapped Napoleon's invasion of
Russia. The map represents the size of the army and the path of
Napoleon's retreat from Moscow, and ties that information to temperature
and time scales for a more in-depth understanding of the event.

Computers made it possible to process large amounts of data at
lightning-fast speeds. Nowadays, data visualization has become a fast-evolving
blend of art and science that is certain to change the corporate landscape
over the next few years.

Importance of Data Visualization


Data visualization is important because of the way the human brain processes
information. Using graphs and charts to visualize large amounts of complex
data is easier than poring over spreadsheets and reports.

Data visualization is an easy and quick way to convey concepts universally.
You can experiment with different layouts by making slight adjustments.

Data visualization has some further specific uses:

o Data visualization can identify areas that need improvement or modification.
o Data visualization can clarify which factors influence customer behavior.
o Data visualization helps you to understand which products to place where.
o Data visualization can predict sales volumes.

Data visualization tools have been essential for democratizing data and
analytics and for making data-driven insights available to workers
throughout an organization. They are easy to operate in comparison to
earlier versions of BI software or traditional statistical analysis software.
This has led to a rise in lines of business implementing data visualization
tools on their own, without support from IT.

Why Use Data Visualization?


1. To make data easier to understand and remember.
2. To discover unknown facts, outliers, and trends.
3. To visualize relationships and patterns quickly.
4. To ask better questions and make better decisions.
5. To analyze the competition.
6. To improve insights.
