Unit 1: Introduction to Big Data
Data & its types:
Big Data Architecture:
When you need to ingest, process and analyze data sets that are too sizable and/or complex for
conventional relational databases, the solution is technology organized into a structure called
a Big Data architecture. Use cases include:
Storage and processing of data in very large volumes: generally, anything over 100 GB
in size
Aggregation and transformation of large sets of unstructured data for analysis and
reporting
The capture, processing, and analysis of streaming data in real-time or near-real-time
Big Data architectures have a number of layers or components. These are the most common:
1. Data sources
Data is sourced from multiple inputs in a variety of formats, including both structured and
unstructured. Sources include relational databases allied with applications such as ERP or
CRM, data warehouses, mobile devices, social media, email, and real-time streaming data
inputs such as IoT devices. Data can be ingested in batch mode or in real-time.
2. Data storage
This is the data receiving layer, which ingests data, stores it, and converts unstructured data into
a format analytic tools can work with. Structured data is often stored in a relational database,
while unstructured data can be housed in a NoSQL database such as MongoDB Atlas. A
specialized distributed system like Hadoop Distributed File System (HDFS) is a good option for
high-volume batch processed data in various formats.
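To make the storage layer concrete, here is a minimal Java sketch that lands one ingested record in HDFS using Hadoop's FileSystem API. The NameNode URI (hdfs://namenode:8020) and the /data/raw path are illustrative assumptions, not values taken from this text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsIngestSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster; the NameNode URI below is a placeholder
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);

        // Land a raw, unstructured record in the distributed storage layer
        Path target = new Path("/data/raw/events/event-0001.json");
        try (FSDataOutputStream out = fs.create(target)) {
            out.writeBytes("{\"deviceId\": \"sensor-42\", \"temp\": 21.5}");
        }
        fs.close();
    }
}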
3. Batch processing
With very large data sets, long-running batch jobs are required to filter, combine, and generally
render the data usable for analysis. Source files are typically read and processed, with the
output written to new files. Hadoop is a common solution for this.
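As a small, language-level illustration of this batch pattern (read source files, filter and combine, write the output to new files), here is a plain-Java sketch; the input/ and output/ directory names are assumptions, and a real deployment would hand this work to Hadoop MapReduce or a similar engine.

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class BatchJobSketch {
    public static void main(String[] args) throws IOException {
        Path inputDir = Paths.get("input");              // raw source files (assumed location)
        Path outputFile = Paths.get("output/cleaned.txt");
        Files.createDirectories(outputFile.getParent());

        try (Stream<Path> sources = Files.list(inputDir)) {
            // Filter and combine the raw records to render them usable for analysis
            List<String> cleaned = sources
                    .filter(Files::isRegularFile)
                    .flatMap(p -> readLines(p).stream())
                    .filter(line -> !line.isBlank())     // drop empty records
                    .map(String::trim)
                    .collect(Collectors.toList());

            // Write the processed output to a new file, leaving the sources untouched
            Files.write(outputFile, cleaned);
        }
    }

    private static List<String> readLines(Path p) {
        try {
            return Files.readAllLines(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}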
4. Real-time message ingestion
This component focuses on categorizing the data for a smooth transition into the deeper layers
of the environment. An architecture designed for real-time sources needs a mechanism to ingest
and store real-time messages for stream processing. Messages can sometimes just be dropped
into a folder, but in other cases, a message capture store is necessary for buffering and to enable
scale-out processing, reliable delivery, and other queuing requirements.
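One widely used example of such a message capture store is Apache Kafka; Kafka is not named above, so treat it here as an illustrative choice. A minimal Java producer sketch that drops real-time messages onto a topic for later stream processing might look like the following; the broker address and topic name are assumptions.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IngestProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // The topic buffers messages, enabling scale-out processing and reliable delivery
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("iot-events", "sensor-42",
                    "{\"temp\": 21.5, \"ts\": 1700000000}"));
        }
    }
}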
5. Stream processing
Once captured, the real-time messages have to be filtered, aggregated, and otherwise prepared
for analysis, after which they are written to an output sink. Options for this phase include Azure
Stream Analytics, Apache Storm, and Apache Spark Streaming.
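Since Apache Spark Streaming is listed as one option, here is a minimal Java sketch of this phase: messages are read from a socket source (a stand-in for the real ingestion layer), filtered and aggregated into word counts, and printed in place of writing to an output sink. The host, port, and batch interval are assumptions for the example.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class StreamProcessingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamSketch");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Stand-in source; in practice this would read from the real-time message store
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Filter and aggregate the captured messages before they reach the output sink
        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .filter(word -> !word.isEmpty())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.print();   // placeholder for writing to the analytical data store
        jssc.start();
        jssc.awaitTermination();
    }
}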
6. Analytical data store
The processed data can now be presented in a structured format – such as a relational data
warehouse – for querying by analytical tools, as is the case with traditional business intelligence
(BI) platforms. Other alternatives for serving the data are low-latency NoSQL technologies or
an interactive Hive database.
7. Analysis and reporting
Most Big Data platforms are geared to extracting business insights from the stored data via
analysis and reporting. This requires multiple tools. Structured data is relatively easy to handle,
while more advanced and specialized techniques are required for unstructured data. Data
scientists may undertake interactive data exploration using various notebooks and tool-sets. A
data modeling layer might also be included in the architecture, which may also enable self-
service BI using popular visualization and modeling techniques.
Analytics results are sent to the reporting component, which replicates them to various output
systems for human viewers, business processes, and applications. After visualization into reports
or dashboards, the analytic results are used for data-driven business decision making.
8. Orchestration
The cadence of Big Data analysis involves multiple data processing operations followed by data
transformation, movement among sources and sinks, and loading of the prepared data into an
analytical data store. These workflows can be automated with orchestration technology such as
Apache Oozie (often paired with Sqoop for data movement) or Azure Data Factory.
Benefits of Big Data Architecture
1. Parallel computing for high performance
To process large data sets quickly, big data architectures use parallel computing, in which
multiprocessor servers perform numerous calculations at the same time. Sizable problems are
broken up into smaller units which can be solved simultaneously.
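As a small single-machine illustration of this principle, the Java sketch below splits one sizable computation into chunks that worker threads solve simultaneously via parallel streams; cluster-level frameworks apply the same idea across many servers.

import java.util.stream.LongStream;

public class ParallelComputeSketch {
    public static void main(String[] args) {
        // One sizable problem: sum the squares of 100 million numbers.
        // .parallel() breaks the range into smaller units that are
        // processed simultaneously on the available CPU cores.
        long sumOfSquares = LongStream.rangeClosed(1, 100_000_000L)
                                      .parallel()
                                      .map(n -> n * n)
                                      .sum();
        System.out.println("Sum of squares = " + sumOfSquares);
    }
}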
2. Elastic scalability
Big Data architectures can be scaled horizontally, enabling the environment to be adjusted to
the size of each workload. Big Data solutions are usually run in the cloud, where you only pay
for the storage and computing resources you actually use.
3. Freedom of choice
The marketplace offers many solutions and platforms for use in Big Data architectures, such as
Azure managed services, MongoDB Atlas, and Apache technologies. You can combine solutions
to get the best fit for your various workloads, existing systems, and IT skill sets.
4. Interoperability with related systems
You can create integrated platforms across different types of workloads, leveraging Big Data
architecture components for IoT processing and BI as well as analytics workflows.
Big Data Architecture Challenges
1. Security
Big data of the static variety is usually stored in a centralized data lake. Robust security is
required to ensure your data stays protected from intrusion and theft. But secure access can be
difficult to set up, as other applications need to consume the data as well.
2. Complexity
A Big Data architecture typically contains many interlocking moving parts. These include
multiple data sources with separate data-ingestion components and numerous cross-component
configuration settings to optimize performance. Building, testing, and troubleshooting Big Data
processes are challenges that take high levels of knowledge and skill.
3. Evolving technologies
It’s important to choose the right solutions and components to meet the business objectives of
your Big Data initiatives. This can be daunting, as many Big Data technologies, practices, and
standards are relatively new and still in a process of evolution. Established Hadoop ecosystem
components such as Hive and Pig have attained a level of stability, but other technologies and services remain
immature and are likely to change over time.
4. Specialized skill sets
Big Data APIs built on mainstream languages are gradually coming into use. Nevertheless, Big
Data architectures and solutions do generally employ atypical, highly specialized languages and
frameworks that impose a considerable learning curve for developers and data analysts alike.
Types of Big Data Architecture
Lambda Architecture
A single Lambda architecture handles both batch (static) data and real-time streaming data. It
is employed to solve the problem of computing arbitrary functions over arbitrary data sets: a
batch layer periodically recomputes accurate, comprehensive views, while a speed layer keeps
latency low by processing only the most recent data, so results stay timely without sacrificing
accuracy.
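A minimal plain-Java sketch of the Lambda idea at query time follows, assuming two hypothetical in-memory views: batchView holds accurate counts precomputed by the batch layer, and realtimeView holds incremental counts from the speed layer; the serving layer merges both.

import java.util.HashMap;
import java.util.Map;

public class LambdaServingSketch {
    // Accurate, periodically recomputed results from the batch layer (hypothetical store)
    private final Map<String, Long> batchView = new HashMap<>();
    // Low-latency results from the speed layer covering data since the last batch run
    private final Map<String, Long> realtimeView = new HashMap<>();

    // Serving layer: merge both views, combining batch accuracy with streaming freshness
    public long getCount(String key) {
        return batchView.getOrDefault(key, 0L) + realtimeView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        LambdaServingSketch serving = new LambdaServingSketch();
        serving.batchView.put("page/home", 1_000_000L);    // from last night's batch job
        serving.realtimeView.put("page/home", 1_250L);     // events streamed in since then
        System.out.println(serving.getCount("page/home")); // 1001250
    }
}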
Kappa Architecture
Like the Lambda architecture, the Kappa architecture is intended to handle both real-time
streaming and batch data. In addition to avoiding the extra cost of maintaining the Lambda
architecture's two separate code paths, it replaces the data sourcing medium with message
queues.
The messaging engine stores the incoming data as an ordered sequence; stream processors read
it, convert it into the appropriate format, and save the results to the analytical databases for
the end user.
The batch layer is eliminated in the Kappa architecture, and the speed layer is enhanced to
provide reprocessing capabilities. The key difference with the Kappa architecture is that all
the data is presented as a series or stream. Data transformation is achieved through the stream
engine, which is the central engine for data processing.
https://www.interviewbit.com/blog/big-data-architecture/
Drivers of Big Data:
Big Data has become a common choice in the last few years. The main reasons driving its
popularity are the following:
1. Traditional solutions fail to satisfy modern market needs.
2. Society increasingly interacts with digital platforms.
3. Technology equipment has become cheaper.
4. Larger data storage and high-performance computing facilities are widely available.
5. Cloud computing services are readily available.
6. Sensor-based devices have become popular.
7. People and organisations engage in data-driven innovation and decision-making for competitive advantage.
Four Vs in Big Data Analytics:
Volume
The prominent feature of any dataset is its size. Volume refers to the size of data generated and
stored in a Big Data system. We’re talking about the size of data in the petabytes and exabytes
range. These massive amounts of data necessitate the use of advanced processing technology—
far more powerful than a typical laptop or desktop CPU. As an example of a massive volume
dataset, think about Instagram or Twitter. People spend a lot of time posting pictures,
commenting, liking posts, playing games, etc. With these ever-exploding data, there is a huge
potential for analysis, finding patterns, and so much more.
Variety
Variety refers to the types of data, which differ in format and in how they are organized and
prepared for processing. Big names such as Facebook, Twitter, Pinterest, Google Ads, and CRM
systems produce data that can be collected, stored, and subsequently analyzed.
Velocity
The rate at which data accumulates also influences whether the data is classified as big data or
regular data. Much of this data must be evaluated in real time; therefore, systems must be able
to handle the pace and amount of data created. Because data keeps arriving faster than before,
there is always more data available than there was previously, which implies that the velocity
of data processing needs to be just as high.
Value
Value is another major aspect worth considering. It is not only the amount of data that we
keep or process that is important; what matters is data that is valuable and reliable, and that
must be saved, processed, and evaluated to gain insights.
Characteristics of Big Data:
There are five V's of Big Data that explain its characteristics.
5 V's of Big Data
o Volume
o Veracity
o Variety
o Value
o Velocity
Volume
The name Big Data itself relates to enormous size. Big Data refers to the vast volumes of
data generated daily from many sources, such as business processes, machines, social media
platforms, networks, human interactions, and many more.
Facebook, for example, generates approximately a billion messages, records around 4.5 billion
clicks of the "Like" button, and receives more than 350 million new posts each day. Big Data
technologies are designed to handle such large amounts of data.
Variety
Big Data can be structured, semi-structured, or unstructured, and is collected from different
sources. In the past, data was collected only from databases and spreadsheets, but these days it
arrives in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.
a. Structured data: Data that follows a fixed schema, with all the required columns, in tabular
form. Structured data is stored in a relational database management system.
b. Semi-structured data: Data whose schema is not strictly defined, e.g., JSON, XML, CSV, TSV,
and email. Unlike structured data, it does not fit neatly into the fixed relational tables that
OLTP (Online Transaction Processing) systems are built around.
c. Unstructured data: Files with no predefined structure, such as log files, audio files, and
image files. Many organizations have large amounts of such data available but do not know how
to derive value from it, since the data is raw.
Veracity
Veracity refers to how reliable and trustworthy the data is, and to the ability to filter,
translate, and manage data of uncertain quality efficiently. Handling veracity well is essential
if Big Data is to support business development.
For example, Facebook posts with hashtags are noisy and inconsistent and must be cleaned before
analysis.
Value
Value is an essential characteristic of big data. What matters is not simply the data that we
process or store, but the valuable and reliable data that we store, process, and analyze to gain
insight.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the
speed at which data is created in real time. It encompasses the rate of incoming data, the rate
of change, and bursts of activity. A primary requirement of Big Data systems is to make the
demanded data available rapidly.
Big Data velocity deals with the speed at which data flows from sources such as application
logs, business processes, networks, social media sites, sensors, and mobile devices.
Enablers:
Reference:
https://www.researchgate.net/publication/315466870_Live_Data_Analytics_With_Collaborative
_Edge_and_Cloud_Processing_in_Wireless_IoT_Networks#pf4
Value chain:
Building a Big Data Strategy:
Every company can and does collect data that can be a valuable business asset. That value is lost
without a strategy that outlines how to access the data to ultimately achieve your business goals.
Therefore, building a big data strategy starts with identifying use cases for your organisation.
An effective big data strategy must link to a company goal, and that’s precisely why the first
step in the Data Use Case Template is “Link to a Strategic Goal.” That ensures that your team’s
big data strategy is focused on the right priority from the start of your planning. Consider how
your company could use data to:
1. Understand your customers
2. Offer smarter products and services to your customers
3. Improve internal processes
4. Add additional revenue by monetising data
Once you have a Data Use Case template completed for your three to five top use cases, it’s time
to build out your big data strategy with the Data Strategy Template.
https://intelliarts.com/blog/building-big-data-strategy/
Distributed File System:
https://www.geeksforgeeks.org/what-is-dfsdistributed-file-system/
Scalable Computing over the Internet:
Cloud computing: https://www.techtarget.com/searchcloudcomputing/definition/cloud-computing
Programming Models for Big Data
Big data analytics refers to advanced and efficient data mining and machine learning techniques
applied to large amounts of data.
A programming model is a set of programming languages and runtime libraries that form a
computing model. It is the fundamental style and interfaces for developers to write applications
and computing programs.
Research work and results in the field of big data analysis are constantly emerging, and more
and more new efficient architectures, programming models, systems, and data mining
algorithms have been proposed.
Programming models are often a core feature of big data frameworks because they implicitly
influence the execution model of big data processing engines and also drive the way users
express and build big data applications and programs.
Here are some programming models and algorithms for big data:
MapReduce: A popular programming model for processing big data on clusters. It is used for
parallelizable problems across large volumes of structured and unstructured data.
Support Vector Machine: A set of machine learning algorithms that are used in data
mining, data science, and predictive analytics. They are flexible and can generate
accurate forecasts.
Clustering algorithms: A major topic in big data analysis. The goal is to separate an
unlabeled dataset into subsets, each with a distinctive characteristic of its data structure.
Supervised learning algorithms: Use labeled data to create models that can classify big
data and make predictions on future outcomes.
Apache Spark framework: An open-source framework that provides a unified interface
for programming clusters. It has built-in modules that support SQL, machine learning,
stream processing, and graph computation.
Naive Bayes: A model used for large data sets with hundreds or thousands of data points
and a few variables. It is faster and easier to implement than other classification algorithms.
Streaming algorithms: Extract only a small amount of information about the dataset,
which preserves its key properties. They are typically allowed to make only one pass
over the data.
KNN algorithm: A supervised classification algorithm that uses labeled data to classify new
data points based on their similarity to the labeled examples (a minimal sketch follows this list).
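To make one of the algorithms above concrete, here is a small self-contained Java sketch of KNN classification: a query point is labeled by majority vote among its k nearest labeled neighbours. The sample points and the choice of k are illustrative only.

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class KnnSketch {
    record Point(double[] features, String label) {}

    // Euclidean distance between two feature vectors
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    // Classify a query by majority vote among its k nearest labeled neighbours
    static String classify(List<Point> labeled, double[] query, int k) {
        return labeled.stream()
                .sorted(Comparator.comparingDouble((Point p) -> distance(p.features(), query)))
                .limit(k)
                .collect(Collectors.groupingBy(Point::label, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .orElseThrow()
                .getKey();
    }

    public static void main(String[] args) {
        List<Point> data = List.of(
                new Point(new double[]{1.0, 1.1}, "A"),
                new Point(new double[]{1.2, 0.9}, "A"),
                new Point(new double[]{5.0, 5.2}, "B"),
                new Point(new double[]{4.8, 5.1}, "B"));
        System.out.println(classify(data, new double[]{1.1, 1.0}, 3)); // prints A
    }
}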
Reference: http://eitc.org/research-opportunities/new-media-and-new-digital-economy/data-
science-and-analytics/programming-models-for-big-data-1
Unit II: INTRODUCTION TO HADOOP AND HADOOP ARCHITECTURE
What is Hadoop
Hadoop is an open-source framework from Apache used to store, process, and analyze data that is
very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing);
it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn,
and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed
on the basis of it. Files are broken into blocks and stored on nodes across the distributed
architecture.
2. YARN: Yet Another Resource Negotiator, used for job scheduling and cluster management.
3. MapReduce: This is a framework that helps Java programs perform parallel computation on data
using key-value pairs. The Map task takes input data and converts it into a data set that can be
computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the
output of the reducer gives the desired result (see the WordCount sketch after this list).
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
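The classic WordCount example below sketches how the Map and Reduce tasks described in item 3 fit together using Hadoop's Java MapReduce API. It mirrors the standard example shipped with Hadoop rather than anything specific to this text; the input and output paths are supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: convert each input line into (word, 1) key-value pairs
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: sum the counts for each word and emit the final result
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input files in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}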
Big Data technology and tools:
Concept of Virtualization in Big Data – Apache Hadoop & Hadoop Eco System:
Overview of HDFS:
Comparison with traditional Databases:
Understanding MapReduce- Map and Reduce algorithms:
YARN: A Resource Manager for Hadoop, Pig, Mahout, etc.:
Cloud Service Models: An Important Big Data Enabler.