Unit 1: Introduction to Big Data
Data & its types:
Big Data Architecture:
When you need to ingest, process and analyze data sets that are too sizable and/or complex for
conventional relational databases, the solution is technology organized into a structure called
a Big Data architecture. Use cases include:
Storage and processing of data in very large volumes: generally, anything over 100 GB
in size
Aggregation and transformation of large sets of unstructured data for analysis and
reporting
The capture, processing, and analysis of streaming data in real-time or near-real-time
Big Data architectures have a number of layers or components. These are the most common:
1. Data sources
Data is sourced from multiple inputs in a variety of formats, including both structured and
unstructured. Sources include relational databases allied with applications such as ERP or
CRM, data warehouses, mobile devices, social media, email, and real-time streaming data
inputs such as IoT devices. Data can be ingested in batch mode or in real-time.
2. Data storage
This is the data receiving layer, which ingests data, stores it, and converts unstructured data into
a format analytic tools can work with. Structured data is often stored in a relational database,
while unstructured data can be housed in a NoSQL database such as MongoDB Atlas. A
specialized distributed system like Hadoop Distributed File System (HDFS) is a good option for
high-volume batch processed data in various formats.
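To make the storage layer concrete, here is a minimal Java sketch that lands one ingested record in HDFS using Hadoop's FileSystem API. The NameNode URI (hdfs://namenode:8020) and the /data/raw path are illustrative assumptions, not values taken from this text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngestSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; the NameNode URI below is a placeholder
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);

        // Land a raw, unstructured record in the distributed storage layer
        Path target = new Path("/data/raw/events/event-0001.json");
        try (FSDataOutputStream out = fs.create(target)) {
            out.writeBytes("{\"deviceId\": \"sensor-42\", \"temp\": 21.5}");
        }
        fs.close();
    }
}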
3. Batch processing
With very large data sets, long-running batch jobs are required to filter, combine, and generally
render the data usable for analysis. Source files are typically read and processed, with the
output written to new files. Hadoop is a common solution for this.
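As a small, language-level illustration of this batch pattern (read source files, filter and combine, write the output to new files), here is a plain-Java sketch; the input/ and output/ directory names are assumptions, and a real deployment would hand this work to Hadoop MapReduce or a similar engine.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BatchJobSketch {
    public static void main(String[] args) throws IOException {
        Path inputDir = Paths.get("input");              // raw source files (assumed location)
        Path outputFile = Paths.get("output/cleaned.txt");
        Files.createDirectories(outputFile.getParent());

        try (Stream<Path> sources = Files.list(inputDir)) {
            // Filter and combine the raw records to render them usable for analysis
            List<String> cleaned = sources
                    .filter(Files::isRegularFile)
                    .flatMap(p -> readLines(p).stream())
                    .filter(line -> !line.isBlank())     // drop empty records
                    .map(String::trim)
                    .collect(Collectors.toList());

            // Write the processed output to a new file, leaving the sources untouched
            Files.write(outputFile, cleaned);
        }
    }

    private static List<String> readLines(Path p) {
        try {
            return Files.readAllLines(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}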
4. Real-time message ingestion
This component focuses on categorizing the data for a smooth transition into the deeper layers
of the environment. An architecture designed for real-time sources needs a mechanism to ingest
and store real-time messages for stream processing. Messages can sometimes just be dropped
into a folder, but in other cases, a message capture store is necessary for buffering and to enable
scale-out processing, reliable delivery, and other queuing requirements.
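One widely used example of such a message capture store is Apache Kafka; Kafka is not named above, so treat it here as an illustrative choice. A minimal Java producer sketch that drops real-time messages onto a topic for later stream processing might look like the following; the broker address and topic name are assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IngestProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // The topic buffers messages, enabling scale-out processing and reliable delivery
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("iot-events", "sensor-42",
                    "{\"temp\": 21.5, \"ts\": 1700000000}"));
        }
    }
}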
5. Stream processing
Once captured, the real-time messages have to be filtered, aggregated, and otherwise prepared
for analysis, after which they are written to an output sink. Options for this phase include Azure
Stream Analytics, Apache Storm, and Apache Spark Streaming.
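Since Apache Spark Streaming is listed as one option, here is a minimal Java sketch of this phase: messages are read from a socket source (a stand-in for the real ingestion layer), filtered and aggregated into word counts, and printed in place of writing to an output sink. The host, port, and batch interval are assumptions for the example.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamProcessingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamSketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Stand-in source; in practice this would read from the real-time message store
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Filter and aggregate the captured messages before they reach the output sink
        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .filter(word -> !word.isEmpty())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.print();   // placeholder for writing to the analytical data store
        jssc.start();
        jssc.awaitTermination();
    }
}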
6. Analytical data store
The processed data can now be presented in a structured format – such as a relational data
warehouse – for querying by analytical tools, as is the case with traditional business intelligence
(BI) platforms. Other alternatives for serving the data are low-latency NoSQL technologies or
an interactive Hive database.
7. Analysis and reporting
Most Big Data platforms are geared to extracting business insights from the stored data via
analysis and reporting. This requires multiple tools. Structured data is relatively easy to handle,
while more advanced and specialized techniques are required for unstructured data. Data
scientists may undertake interactive data exploration using various notebooks and tool-sets. A
data modeling layer might also be included in the architecture, which may also enable self-
service BI using popular visualization and modeling techniques.
Analytics results are sent to the reporting component, which replicates them to various output
systems for human viewers, business processes, and applications. After visualization into reports
or dashboards, the analytic results are used for data-driven business decision making.
8. Orchestration
The cadence of Big Data analysis involves multiple data processing operations followed by data
transformation, movement among sources and sinks, and loading of the prepared data into an
analytical data store. These workflows can be automated with orchestration technology such as
Apache Oozie (often paired with Sqoop for data movement) or Azure Data Factory.
Benefits of Big Data Architecture
1. Parallel computing for high performance
To process large data sets quickly, big data architectures use parallel computing, in which
multiprocessor servers perform numerous calculations at the same time. Sizable problems are
broken up into smaller units which can be solved simultaneously.
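As a small single-machine illustration of this principle, the Java sketch below splits one sizable computation into chunks that worker threads solve simultaneously via parallel streams; cluster-level frameworks apply the same idea across many servers.

import java.util.stream.LongStream;

public class ParallelComputeSketch {
    public static void main(String[] args) {
        // One sizable problem: sum the squares of 100 million numbers.
        // .parallel() breaks the range into smaller units that are
        // processed simultaneously on the available CPU cores.
        long sumOfSquares = LongStream.rangeClosed(1, 100_000_000L)
                                      .parallel()
                                      .map(n -> n * n)
                                      .sum();
        System.out.println("Sum of squares = " + sumOfSquares);
    }
}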
2. Elastic scalability
Big Data architectures can be scaled horizontally, enabling the environment to be adjusted to
the size of each workload. Big Data solutions are usually run in the cloud, where you only pay
for the storage and computing resources you actually use.
3. Freedom of choice
The marketplace offers many solutions and platforms for use in Big Data architectures, such as
Azure managed services, MongoDB Atlas, and Apache technologies. You can combine solutions
to get the best fit for your various workloads, existing systems, and IT skill sets.
4. Interoperability with related systems
You can create integrated platforms across different types of workloads, leveraging Big Data
architecture components for IoT processing and BI as well as analytics workflows.
Big Data Architecture Challenges
1. Security
Big data of the static variety is usually stored in a centralized data lake. Robust security is
required to ensure your data stays protected from intrusion and theft. But secure access can be
difficult to set up, as other applications need to consume the data as well.
2. Complexity
A Big Data architecture typically contains many interlocking moving parts. These include
multiple data sources with separate data-ingestion components and numerous cross-component
configuration settings to optimize performance. Building, testing, and troubleshooting Big Data
processes are challenges that take high levels of knowledge and skill.
3. Evolving technologies
It’s important to choose the right solutions and components to meet the business objectives of
your Big Data initiatives. This can be daunting, as many Big Data technologies, practices, and
standards are relatively new and still in a process of evolution. Established Hadoop ecosystem
components such as Hive and Pig have attained a level of stability, but other technologies and services remain
immature and are likely to change over time.
4. Specialized skill sets
Big Data APIs built on mainstream languages are gradually coming into use. Nevertheless, Big
Data architectures and solutions do generally employ atypical, highly specialized languages and
frameworks that impose a considerable learning curve for developers and data analysts alike.
Types of Big Data Architecture
Lambda Architecture
A single Lambda architecture handles both batch (static) data and real-time streaming data. It
is employed to solve the problem of computing arbitrary functions over arbitrary data sets: a
batch layer periodically recomputes accurate, comprehensive views, while a speed layer keeps
latency low by processing only the most recent data, so results stay timely without sacrificing
accuracy.
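A minimal plain-Java sketch of the Lambda idea at query time follows, assuming two hypothetical in-memory views: batchView holds accurate counts precomputed by the batch layer, and realtimeView holds incremental counts from the speed layer; the serving layer merges both.

import java.util.HashMap;
import java.util.Map;

public class LambdaServingSketch {
    // Accurate, periodically recomputed results from the batch layer (hypothetical store)
    private final Map<String, Long> batchView = new HashMap<>();
    // Low-latency results from the speed layer covering data since the last batch run
    private final Map<String, Long> realtimeView = new HashMap<>();

    // Serving layer: merge both views, combining batch accuracy with streaming freshness
    public long getCount(String key) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        LambdaServingSketch serving = new LambdaServingSketch();
        serving.batchView.put("page/home", 1_000_000L);    // from last night's batch job
        serving.realtimeView.put("page/home", 1_250L);     // events streamed in since then
        System.out.println(serving.getCount("page/home")); // 1001250
    }
}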
Kappa Architecture
Like the Lambda architecture, the Kappa architecture is intended to handle both real-time
streaming and batch data. In addition to avoiding the extra cost of maintaining the Lambda
architecture's two separate code paths, it replaces the data sourcing medium with message
queues.
The messaging engine stores the incoming data as an ordered sequence; stream processors read
it, convert it into the appropriate format, and save the results to the analytical databases for
the end user.
The batch layer is eliminated in the Kappa architecture, and the speed layer is enhanced to
provide reprocessing capabilities. The key difference with the Kappa architecture is that all
the data is presented as a series or stream. Data transformation is achieved through the stream
engine, which is the central engine for data processing.
https://www.interviewbit.com/blog/big-data-architecture/
Drivers of Big Data:
Big Data has become a common choice in the last few years. The main reasons driving its
popularity are the following:
1. Traditional solutions fail to satisfy modern market needs.
2. Society increasingly interacts with digital platforms.
3. Technology equipment has become cheaper.
4. Larger data storage and high-performance computing facilities are widely available.
5. Cloud computing services are readily available.
6. Sensor-based devices have become popular.
7. People and organisations engage in data-driven innovation and decision-making for competitive advantage.
Four Vs in Big Data Analytics:
Volume
The prominent feature of any dataset is its size. Volume refers to the size of data generated and
stored in a Big Data system. We’re talking about the size of data in the petabytes and exabytes
range. These massive amounts of data necessitate the use of advanced processing technology—
far more powerful than a typical laptop or desktop CPU. As an example of a massive volume
dataset, think about Instagram or Twitter. People spend a lot of time posting pictures,
commenting, liking posts, playing games, etc. With these ever-exploding data, there is a huge
potential for analysis, finding patterns, and so much more.
Variety
Variety refers to the types of data, which differ in format and in how they are organized and
prepared for processing. Big names such as Facebook, Twitter, Pinterest, Google Ads, and CRM
systems produce data that can be collected, stored, and subsequently analyzed.
Velocity
The rate at which data accumulates also influences whether the data is classified as big data or
regular data. Much of this data must be evaluated in real time; therefore, systems must be able
to handle the pace and amount of data created. Because data keeps arriving faster than before,
there is always more data available than there was previously, which implies that the velocity
of data processing needs to be just as high.
Value
Value is another major aspect worth considering. It is not only the amount of data that we
keep or process that is important; what matters is data that is valuable and reliable, and that
must be saved, processed, and evaluated to gain insights.
Characteristics of Big Data:
There are five V's of Big Data that explain its characteristics.
5 V's of Big Data
o Volume
o Veracity
o Variety
o Value
o Velocity
Volume
The name Big Data itself relates to enormous size. Big Data refers to the vast volumes of
data generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
Facebook, for example, generates approximately a billion messages, records around 4.5 billion
clicks of the "Like" button, and receives more than 350 million new posts each day. Big Data
technologies are designed to handle such large amounts of data.
Variety
Big Data can be structured, semi-structured, or unstructured, and is collected from different
sources. In the past, data was collected only from databases and spreadsheets, but these days it
arrives in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
a. Structured data: Data that follows a fixed schema, with all the required columns, in tabular
form. Structured data is stored in a relational database management system.
b. Semi-structured data: Data whose schema is not strictly defined, e.g., JSON, XML, CSV, TSV,
and email. Unlike structured data, it does not fit neatly into the fixed relational tables that
OLTP (Online Transaction Processing) systems are built around.
c. Unstructured data: Files with no predefined structure, such as log files, audio files, and
image files. Many organizations have large amounts of such data available but do not know how
to derive value from it, since the data is raw.
Veracity
Veracity refers to how reliable and trustworthy the data is, and to the ability to filter,
translate, and manage data of uncertain quality efficiently. Handling veracity well is essential
if Big Data is to support business development.
For example, Facebook posts with hashtags are noisy and inconsistent and must be cleaned before
analysis.
Value
Value is an essential characteristic of big data. What matters is not simply the data that we
process or store, but the valuable and reliable data that we store, process, and analyze to gain
insight.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the
speed at which data is created in real time. It encompasses the rate of incoming data, the rate
of change, and bursts of activity. A primary requirement of Big Data systems is to make the
demanded data available rapidly.
Big Data velocity deals with the speed at which data flows from sources such as application
logs, business processes, networks, social media sites, sensors, and mobile devices.
Enablers:
Reference:
https://www.researchgate.net/publication/315466870_Live_Data_Analytics_With_Collaborative
_Edge_and_Cloud_Processing_in_Wireless_IoT_Networks#pf4
Value chain:
Building a Big Data Strategy:
Every company can and does collect data that can be a valuable business asset. That value is lost
without a strategy that outlines how to access the data to ultimately achieve your business goals.
Therefore, building a big data strategy starts with identifying use cases for your organisation.
An effective big data strategy must link to a company goal, and that’s precisely why the first
step in the Data Use Case Template is “Link to a Strategic Goal.” That ensures that your team’s
big data strategy is focused on the right priority from the start of your planning. Consider how
your company could use data to:
1. Understand your customers
2. Offer smarter products and services to your customers
3. Improve internal processes
4. Add additional revenue by monetising data
Once you have a Data Use Case template completed for your three to five top use cases, it’s time
to build out your big data strategy with the Data Strategy Template.
https://intelliarts.com/blog/building-big-data-strategy/
Distributed File System:
https://www.geeksforgeeks.org/what-is-dfsdistributed-file-system/
Scalable Computing over the Internet:
Cloud computing: https://www.techtarget.com/searchcloudcomputing/definition/cloud-computing
Programming Models for Big Data
Big data analytics refers to advanced and efficient data mining and machine learning techniques
applied to large amounts of data.
A programming model is a set of programming languages and runtime libraries that form a
computing model. It is the fundamental style and interfaces for developers to write applications
and computing programs.
Research work and results in the field of big data analysis are constantly emerging, and more
and more new efficient architectures, programming models, systems, and data mining
algorithms have been proposed.
Programming models are often a core feature of big data frameworks because they implicitly
influence the execution model of big data processing engines and also drive the way users
express and build big data applications and programs.
Here are some programming models and algorithms for big data:
MapReduce: A popular programming model for processing big data on clusters. It is used for
parallelizable problems across large volumes of structured and unstructured data.
Support Vector Machine: A set of machine learning algorithms that are used in data
mining, data science, and predictive analytics. They are flexible and can generate
accurate forecasts.
Clustering algorithms: A major topic in big data analysis. The goal is to separate an
unlabeled dataset into subsets, each with a distinctive characteristic of its data structure.
Supervised learning algorithms: Use labeled data to create models that can classify big
data and make predictions on future outcomes.
Apache Spark framework: An open-source framework that provides a unified interface
for programming clusters. It has built-in modules that support SQL, machine learning,
stream processing, and graph computation.
Naive Bayes: A model used for large data sets with hundreds or thousands of data points
and a few variables. It is faster and easier to implement than other classification algorithms.
Streaming algorithms: Extract only a small amount of information about the dataset,
which preserves its key properties. They are typically allowed to make only one pass
over the data.
KNN algorithm: A supervised classification algorithm that uses labeled data to classify new
data points based on their similarity to the labeled examples (a minimal sketch follows this list).
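To make one of the algorithms above concrete, here is a small self-contained Java sketch of KNN classification: a query point is labeled by majority vote among its k nearest labeled neighbours. The sample points and the choice of k are illustrative only.

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class KnnSketch {
    record Point(double[] features, String label) {}

    // Euclidean distance between two feature vectors
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    // Classify a query by majority vote among its k nearest labeled neighbours
    static String classify(List<Point> labeled, double[] query, int k) {
        return labeled.stream()
                .sorted(Comparator.comparingDouble((Point p) -> distance(p.features(), query)))
                .limit(k)
                .collect(Collectors.groupingBy(Point::label, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .orElseThrow()
                .getKey();
    }

    public static void main(String[] args) {
        List<Point> data = List.of(
                new Point(new double[]{1.0, 1.1}, "A"),
                new Point(new double[]{1.2, 0.9}, "A"),
                new Point(new double[]{5.0, 5.2}, "B"),
                new Point(new double[]{4.8, 5.1}, "B"));
        System.out.println(classify(data, new double[]{1.1, 1.0}, 3)); // prints A
    }
}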
Reference: http://eitc.org/research-opportunities/new-media-and-new-digital-economy/data-
science-and-analytics/programming-models-for-big-data-1
Unit II: INTRODUCTION TO HADOOP AND HADOOP ARCHITECTURE
What is Hadoop
Hadoop is an open-source framework from Apache used to store, process, and analyze data that is
very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing);
it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn,
and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed
on the basis of it. Files are broken into blocks and stored on nodes across the distributed
architecture.
2. YARN: Yet Another Resource Negotiator, used for job scheduling and cluster management.
3. MapReduce: This is a framework that helps Java programs perform parallel computation on data
using key-value pairs. The Map task takes input data and converts it into a data set that can be
computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the
output of the reducer gives the desired result (see the WordCount sketch after this list).
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
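The classic WordCount example below sketches how the Map and Reduce tasks described in item 3 fit together using Hadoop's Java MapReduce API. It mirrors the standard example shipped with Hadoop rather than anything specific to this text; the input and output paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: convert each input line into (word, 1) key-value pairs
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: sum the counts for each word and emit the final result
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input files in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}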
Big Data technology and tools:
Concept of Virtualization in Big Data – Apache Hadoop & Hadoop Eco System:
Overview of HDFS:
Comparison with traditional Databases:
Understanding MapReduce- Map and Reduce algorithms:
YARN: A Resource Manager for Hadoop, Pig, Mahout, etc.:
Cloud Service Models: An Important Big Data Enabler.