UNIT 1
▪ Data continues to be a precious and irreplaceable asset for
enterprises
▪ Data is present both inside and outside the four walls of the
enterprise
▪ Data is present in homogeneous as well as heterogeneous
sources
Approximate percentage distribution of digital data
▪ The need of the hour is to understand, manage and
process the data, and take it up for analysis to draw
valuable insights
Data -> Information
Information -> Insights
Classification of digital data
1. Unstructured data
▪ Data which does not conform to a data model, or is not in a
form which can be used easily by a computer program
▪ 80–90% of data is in this form
▪ PPTs, images, videos, body of an email, etc.
2. Semi-structured data
▪ Does not conform to a data model but has some structure;
still not in a form which can be used as-is by a computer program
▪ Emails, XML, HTML, etc.
3. Structured data
▪ Data in an organised form
▪ Understandable by a computer program
▪ e.g., data stored in databases
Structured data
-> When is data structured? When the data conforms to a
pre-defined schema/structure
-> Think structured data, think data model
-> Context of RDBMS: conforms to the relational model –
rows/columns
-> Cardinality – number of rows (tuples)
-> Degree – number of columns (attributes)
Steps
1) Design a relation/table
-> Fields to store -> Data types
2) Constraints we would like our data to conform to
– Unique
– Not Null
– business constraints
– permissible values a column should accept
3) Define keys (e.g., a primary key)
4) Referential integrity constraints
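The steps above can be sketched with Python's built-in sqlite3 module. The table and column names here are illustrative, not from the source: a `department`/`employee` pair shows fields with data types, UNIQUE/NOT NULL/CHECK constraints, and a referential integrity constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity

# Step 1: design the relation -- fields and data types
# Step 2: constraints -- UNIQUE, NOT NULL
conn.execute("""
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT    NOT NULL UNIQUE
    )""")

# Step 4: referential integrity -- employee.dept_id must exist in department
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        salary  REAL CHECK (salary > 0),          -- permissible values
        dept_id INTEGER REFERENCES department(dept_id)
    )""")

conn.execute("INSERT INTO department VALUES (1, 'Sales')")
conn.execute("INSERT INTO employee VALUES (10, 'Asha', 50000.0, 1)")

# Violating a constraint raises an error instead of storing bad data
try:
    conn.execute("INSERT INTO employee VALUES (11, 'Ravi', -5, 1)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The rejected row never reaches the table, which is exactly what makes structured data easy to keep consistent.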
Sources – when data is structured, leverage available RDBMS
- Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
- Spreadsheets
- OLTP systems: databases that store operational data generated and
collected by day-to-day business activities
Ease with Structured Data
- Input / Update / Delete
- Security
- Indexing / Searching
- Scalability
- Transaction Processing
ACID
Atomicity – a transaction happens in its entirety or not at all
Consistency – if the same information is stored at two or more
places, they are in complete agreement
Isolation – resources are allocated to a transaction such that it
gets the impression it is the only transaction; transactions
happen in isolation
Durability – changes made to the database during a transaction
are permanent
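Atomicity can be demonstrated with a small sketch (not from the source) using Python's built-in sqlite3: either both halves of a transfer commit, or neither does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

try:
    conn.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
    raise RuntimeError("crash before the credit step")   # simulated failure
    conn.execute("UPDATE account SET balance = balance + 30 WHERE id = 2")
    conn.commit()
except RuntimeError:
    conn.rollback()   # atomicity: undo the partial transfer

balances = dict(conn.execute("SELECT id, balance FROM account"))
print(balances)   # {1: 100.0, 2: 50.0} -- the transfer never half-happened
```

The rollback restores both accounts, so no reader ever observes a debit without the matching credit.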
Semi structured data
Self describing structure
Examples: emails, XML, markup languages like HTML, etc.
Features
- Does not conform to a data model
- Uses tags to segregate semantic elements
- Tags are also used to enforce hierarchies of records and fields within
the data
- No separation between data and schema
- Entities belonging to the same class and grouped together need not
have the same attributes, nor have them in the same order
Characteristics of Semi-structured Data
- Inconsistent structure
- Self-describing (label/value pairs)
- Schema information is often blended with data values
- Data objects may have different attributes not known beforehand
Sources of Semi-Structured Data
- XML (Extensible Markup Language) – popularised by web services;
other markup languages
- JSON (JavaScript Object Notation) – used to transmit data between a
server and a web application; MongoDB stores data natively in a
JSON-like format
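A short sketch with Python's json module illustrates the characteristics above. The two (made-up) email-like records are self-describing label/value pairs, carry their schema with the data, and do not share the same attributes:

```python
import json

raw = '''[
  {"from": "alice@example.com", "subject": "Hi", "attachments": 2},
  {"from": "bob@example.com", "cc": ["carol@example.com"], "subject": "Report"}
]'''

records = json.loads(raw)
for rec in records:
    # attributes differ per record and are discovered at read time
    print(sorted(rec.keys()))
```

No table definition exists anywhere; the structure is only what each record declares about itself.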
Unstructured Data
This is the data which does not conform to a data model or is not in
a form which can be used easily by a computer program.
About 80–90% data of an organization is in this format.
Example: memos, chat rooms, PowerPoint presentations,
images, videos, letters, research reports, white papers, body of an
email, etc.
Sources of Unstructured Data
- Web pages
- Images
- Free-form text
- Audio
- Video
- Body of email
- Text messages
- Chats
- Social media data
- Word documents
Issues with terminology – Unstructured Data
- Structure can be implied despite not being formally defined.
- Data with some structure may still be labeled unstructured if the
structure doesn’t help with the processing task at hand.
- Data may have some structure or may even be highly structured in
ways that are unanticipated or unannounced.
Dealing with Unstructured Data
- Data Mining
- Natural Language Processing (NLP)
- Text Analytics
- Noisy Text Analytics
Data Mining
First, we deal with large data sets. Second, we use methods at the
intersection of AI, machine learning, statistics and databases to unearth
consistent patterns in large data sets and systematic relationships
between variables.
- Association rule mining: market basket analysis – what goes with what
(e.g., bread and cheese)
- Regression analysis: predict the relationship between two variables – a
dependent and an independent variable; the value to be predicted is the
dependent variable
- Collaborative filtering: predicting a user's preference based on the
preferences of a group of users
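The market-basket idea can be sketched in a few lines of Python (a toy example with invented baskets, not a full association-rule miner): count how often pairs of items are bought together.

```python
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "cheese", "milk"},
    {"bread", "cheese"},
    {"bread", "butter"},
    {"cheese", "milk"},
]

pair_counts = Counter()
for basket in baskets:
    # every unordered pair of items in the basket co-occurs once
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# ("bread", "cheese") co-occur in 2 of the 4 baskets -> support = 0.5
print(pair_counts.most_common(3))
```

Real association-rule algorithms (such as Apriori) build on exactly these co-occurrence counts, adding support and confidence thresholds.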
What is Big Data?
Big data is defined as collections of datasets whose
volume, velocity or variety is so large that it is difficult to
store, manage, process and analyze the data using
traditional databases and data processing tools.
In recent years, there has been an exponential growth
in both structured and unstructured data generated by
information technology, industrial, healthcare, Internet of
Things, and other systems.
Definition of Big Data
Big data refers to
• datasets whose size is typically beyond the storage capacity
of and also
• complex for traditional database software tools
Big data is anything beyond the human & technical
infrastructure needed to support storage, processing and
analysis.
Big data is
• high volume, high-velocity and high-variety information
assets that
• demand cost-effective, innovative forms of information
processing
• for enhanced insight and decision making.
Part I of the definition:
talks about voluminous data that may have great
variety and will require a good speed/pace for
storage, preparation, processing and analysis.
Part II of the definition:
talks about embracing new techniques and
technologies to capture (ingest), store, process,
persist, integrate and visualize the high-volume,
high-velocity, and high-variety data.
Part III of the definition:
talks about deriving deeper, richer and
meaningful insights and then using these
insights to make faster and better
decisions to gain business value and thus a
competitive edge.
Data —> Information —> Actionable
intelligence —> Better decisions
—>Enhanced business value
Below are some key statistics on the data generated on the Internet every minute:
• Facebook users share nearly 4.16 million pieces of content
• Twitter users send nearly 300,000 tweets
• Instagram users like nearly 1.73 million photos
• YouTube users upload 300 hours of new video content
• Apple users download nearly 51,000 apps
• Skype users make nearly 110,000 new calls
• Amazon receives 4300 new visitors
• Uber passengers take 694 rides
• Netflix subscribers stream nearly 77,000 hours of video
Big Data analytics deals with collection, storage, processing and analysis of this
massive scale data.
Specialized tools and frameworks are required for big data analysis when:
(1) the volume of data involved is so large that it is difficult to store, process
and analyze data on a single machine
(2) the velocity of data is very high and the data needs to be analyzed in
real-time,
(3) there is variety of data involved, which can be structured, unstructured or
semi-structured, and is collected from multiple data sources,
(4) various types of analytics need to be performed to extract value from the
data such as descriptive, diagnostic, predictive and prescriptive analytics
Big data analytics is enabled by several technologies such as cloud computing,
distributed and parallel processing frameworks, non-relational databases
and in-memory computing.
Some examples of big data are listed as follows:
• Data generated by social networks including text, images, audio and video
data
• Click-stream data generated by web applications such as e-Commerce to
analyze user behavior
• Machine sensor data collected from sensors embedded in industrial and
energy systems
for monitoring their health and detecting failures
• Healthcare data collected in electronic health record (EHR) systems
• Logs generated by web applications
• Stock markets data
Characteristics of Big Data
Volume
Big data is a form of data whose volume is so large that it would not fit on a
single machine; therefore specialized tools and frameworks are required to
store, process and analyze such data.
The volumes of data generated by modern IT, industrial, healthcare, Internet
of Things, and other systems are growing exponentially, driven by the lowering
costs of data storage.
There is no fixed threshold for the volume of data to be considered
big data; typically, the term is used for massive-scale
data that is difficult to store, manage and process using traditional
databases and data processing architectures.
Bits-Bytes-Kilobytes-Megabytes-Gigabytes-Terabytes-Petabytes-Exabytes-Zettabytes-Yottabytes
A Mountain of Data
1 Kilobyte (KB) = 1000 bytes
1 Megabyte (MB) = 1,000,000 bytes
1 Gigabyte (GB) = 1,000,000,000 bytes
1 Terabyte (TB) = 1,000,000,000,000 bytes
1 Petabyte (PB) = 1,000,000,000,000,000 bytes
1 Exabyte (EB) = 1,000,000,000,000,000,000 bytes
1 Zettabyte (ZB) = 1,000,000,000,000,000,000,000 bytes
1 Yottabyte (YB) = 1,000,000,000,000,000,000,000,000 bytes
Where does this data get generated?
- A multitude of sources
- XLS, DOC, YouTube videos, chat conversations, customer feedback, CCTV
coverage
1. Typical internal data sources – data present within an organization's firewall
- Data storage: file systems, SQL, NoSQL, etc.
- Archives: archives of scanned documents, paper archives, customer
correspondence records, patients' health records, student admission records and so on
2. External data sources – data residing outside an organization's firewall
- Public web: Wikipedia, weather, regulatory, census data
3. Both internal and external data sources
Velocity
Velocity of data refers to how fast the data is generated.
Data generated by certain sources can arrive at very high velocities, for example,
social media data or sensor data.
Velocity is another important characteristic of big data and the primary reason
for the exponential growth of data.
High velocity of data results in the volume of data accumulated to become very
large, in short span of time.
Some applications can have strict deadlines for data analysis (such as trading or
online fraud detection) and the data needs to be analyzed in real-time.
Velocity
Batch -> Periodic -> Near real-time -> Real-time processing
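The batch-to-real-time spectrum can be illustrated with a minimal sketch (hypothetical sensor values, not from the source): batch processing averages the whole accumulated volume at the end, while a near-real-time version maintains a sliding-window average as each reading arrives.

```python
from collections import deque

readings = [21.0, 21.5, 35.0, 22.0, 21.8]   # hypothetical sensor values

# Batch: process the full accumulated volume at once
batch_avg = sum(readings) / len(readings)

# Near real-time: update a 3-reading sliding window as each value arrives
window = deque(maxlen=3)
window_avgs = []
for value in readings:
    window.append(value)
    window_avgs.append(round(sum(window) / len(window), 2))

print(batch_avg, window_avgs)
```

Note how the streaming averages react to the spike at 35.0 as it happens, while the batch average only reflects it after all data has accumulated.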
Variety
Variety refers to the forms of the data.
Big data comes in different forms such as structured, unstructured or
semi-structured, including text data, image, audio, video and sensor data.
Big data systems need to be flexible enough to handle such variety of data
Veracity
Veracity refers to how accurate the data is.
To extract value from the data, the data needs to be cleaned to remove
noise.
Data-driven applications can reap the benefits of big data only when the data
is meaningful and accurate.
Therefore, cleansing of data is important so that incorrect and faulty data can
be filtered out.
Value
Value of data refers to the usefulness of data for the intended purpose.
The end goal of any big data analytics system is to extract value from
the data.
The value of the data is also related to the veracity or accuracy of the
data.
For some applications value also depends on how fast we are able to
process the data.
Analytics is a broad term that encompasses the processes,
technologies, frameworks and algorithms to extract meaningful
insights from data.
Analytics is this process of extracting and creating information from
raw data by filtering, processing, categorizing, condensing and
contextualizing the data.
This information obtained is then organized and structured to infer
knowledge about the system and/or its users, its environment, and
its operations and progress towards its objectives, thus making the
systems smarter and more efficient.
The choice of the technologies, algorithms, and
frameworks for analytics is driven by the analytics goals.
The goals of the analytics task may be:
(1) to predict something (for example whether a transaction is a
fraud or not, whether it will rain on a particular day, or whether a
tumor is benign or malignant),
(2) to find patterns in the data (for example, finding the top 10
coldest days in the year, finding which pages are visited the most
on a particular website, or finding the most searched celebrity in a
particular year),
(3) finding relationships in the data (for example, finding similar
news articles, or finding similar patients in an electronic health record
system).
The National Research Council [1] has characterized
computational tasks for massive data analysis (called the seven
“giants”).
These computational tasks include:
(1) Basic Statistics, (2) Generalized N-Body Problems, (3) Linear
Algebraic Computations, (4) Graph-Theoretic Computations, (5)
Optimization,
(6) Integration and (7) Alignment Problems.
This characterization of computational tasks aims to provide a taxonomy
of tasks that have proved to be useful in data analysis and grouping
them roughly according to mathematical structure and computational
strategy.
Descriptive Analytics
Descriptive analytics comprises analyzing past data to present it in a
summarized form which can be easily interpreted. Descriptive analytics
aims to answer –
What has happened?
A major portion of analytics done today is descriptive analytics, through
the use of statistical functions such as counts, maximum, minimum, mean,
top-N, and percentages.
These statistics help in describing patterns in the data and presenting
the data in a summarized form.
For example, computing the total number of likes for a particular post,
or computing the average monthly rainfall.
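The statistics above can be sketched directly on made-up monthly rainfall data: counts, minimum, maximum, mean and a top-N query, which is the core of descriptive analytics.

```python
rainfall_mm = {"Jan": 12, "Feb": 8, "Mar": 30, "Apr": 55, "May": 90, "Jun": 210}

count = len(rainfall_mm)
mean = sum(rainfall_mm.values()) / count          # average monthly rainfall
wettest = max(rainfall_mm, key=rainfall_mm.get)   # month with maximum rainfall
top3 = sorted(rainfall_mm, key=rainfall_mm.get, reverse=True)[:3]  # top-N

print(count, mean, wettest, top3)
```

These summaries answer "what has happened?" without attempting to explain or predict anything.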
Diagnostic Analytics
Diagnostic analytics comprises analysis of past data to diagnose the
reasons as to why certain events happened.
Diagnostic analytics aims to answer - Why did it happen?
Let us consider an example of a system that collects and analyzes sensor
data from machines for monitoring their health and predicting failures.
While descriptive analytics can be useful for summarizing the data by
computing various statistics (such as mean, minimum, maximum,
variance, or top-N), diagnostic analytics can provide more insights into
why a certain fault has occurred, based on the patterns in the sensor
data for previous faults.
Predictive Analytics
Predictive analytics comprises predicting the occurrence of an event or
the likely outcome of an event or forecasting the future values using
prediction models.
Predictive analytics aims to answer - What is likely to happen?
For example, predictive analytics can be used for predicting when a fault
will occur in a machine, predicting whether a tumor is benign or
malignant, predicting the occurrence of natural emergency (events
such as forest fires or river floods) or forecasting the pollution levels.
Predictive analytics is done using predictive models which are trained on
existing data. These models learn patterns and trends from the existing
data and predict the occurrence of an event.
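An illustrative sketch of this idea (not any specific method from the source): fit a simple least-squares line to hypothetical machine-temperature readings and forecast the next value.

```python
xs = [1, 2, 3, 4, 5]                   # hours of operation
ys = [40.0, 42.0, 44.1, 45.9, 48.0]    # hypothetical temperatures (deg C)

# Ordinary least-squares fit of y = intercept + slope * x
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

forecast = intercept + slope * 6       # predict the value at hour 6
print(round(slope, 2), round(forecast, 1))
```

The model "learns" the trend from the existing readings and extrapolates it; real predictive analytics uses the same train-then-predict pattern with far richer models.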
Prescriptive Analytics
While predictive analytics uses prediction models to predict the
likely outcome of an event, prescriptive analytics uses multiple
prediction models to predict various outcomes and the best
course of action for each outcome.
Prescriptive analytics aims to answer - What can we do to make
it happen?
Prescriptive Analytics can predict the possible outcomes based
on the current choice of actions.
Prescriptive analytics can be used to prescribe the best
medicine for treatment of a patient based on the outcomes of
various medicines for similar patients.
Another example of prescriptive analytics would be to suggest the
best mobile data plan for a customer based on the customer’s
browsing patterns
Domain Specific Examples of Big Data
The applications of big data span a wide range of domains including (but not
limited to) homes, cities, environment, energy systems, retail, logistics, industry,
agriculture, Internet of Things, and healthcare.
Various applications of big data for each of these domains are:
Web
• Web Analytics: Web analytics deals with collection and analysis of data on
the user visits on websites and cloud applications.
Analysis of this data can give insights about the user engagement and tracking
the performance of online advertisement campaigns.
For collecting data on user visits, two approaches are used.
In the first approach, user visits are logged on the web server which collects data
such as the date and time of visit, resource requested, user’s IP address, HTTP
status code, for instance.
The second approach, called page tagging, uses a JavaScript which is embedded
in the web page.
The benefit of the page tagging approach is that it facilitates real-time data
collection and analysis.
This approach allows third party services, which do not have access to the web
server (serving the website) to collect and process the data.
These specialized analytics service providers (such as Google Analytics) offer
advanced analytics and summarized reports covering user sessions, page visits,
top entry and exit pages, bounce rate, and most visited pages.
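The server-side logging approach can be sketched in a few lines (the log lines below are invented for illustration): parse access-log entries and count visits per page, a basic web-analytics report.

```python
import re
from collections import Counter

log_lines = [
    '10.0.0.1 - - [10/Mar/2024:10:01:02] "GET /home HTTP/1.1" 200',
    '10.0.0.2 - - [10/Mar/2024:10:01:05] "GET /products HTTP/1.1" 200',
    '10.0.0.1 - - [10/Mar/2024:10:02:11] "GET /home HTTP/1.1" 200',
]

# extract the requested resource from each log line
pattern = re.compile(r'"GET (\S+) HTTP')
page_visits = Counter(pattern.search(line).group(1) for line in log_lines)
print(page_visits.most_common(1))   # [('/home', 2)]
```

At web scale the same counting is distributed across many machines, but the per-line parsing logic is essentially this.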
Performance Monitoring: Multi-tier web and cloud applications such as
• e-Commerce,
• Business-to-Business,
• Health care, Banking and Financial,
• Retail and Social Networking applications
can experience rapid changes in their workloads.
Provisioning and capacity planning is a challenging task for complex multi-tier
applications since each class of applications has different deployment
configurations with web servers, application servers and database servers.
For performance monitoring, various types of tests can be performed such as
• load tests (which evaluate the performance of the system with multiple users
and workload levels )
• Stress test etc
Big data systems can be used to analyze the data generated by such
tests, to predict application performance under heavy workloads
and identify bottlenecks in the system so that failures can be
prevented.
• Ad Targeting & Analytics:
Search and display advertisements are the two most widely used approaches
for Internet advertising.
In search advertising, users are displayed advertisements ("ads"), along with
the search results, as they search for specific keywords on a search engine.
Advertisers can create ads using the advertising networks provided by the
search engines or social media networks.
These ads are setup for specific keywords which are related to the product or
service being advertised.
Users searching for these keywords are shown ads along with the search
results.
Display advertising is another form of Internet advertising, in which the ads are
displayed within websites, videos and mobile applications that participate in
the advertising network.
The ad-network matches these ads against the content on the website, video or
mobile application and places the ads.
The most commonly used compensation method for Internet ads is
Pay-per-click (PPC), in which the advertisers pay each time a user clicks on an
advertisement.
Advertising networks use big data systems for matching and placing
advertisements and generating advertisement statistics reports.
• Advertisers can use big data tools for tracking the performance of
advertisements, optimizing the bids for pay-per-click advertising, and tracking
which keywords link the most to the advertising landing page.
• Content Recommendation: Content delivery applications that serve content
(such as music and video streaming applications), collect various types of
data such as user search patterns and browsing history, history of content
consumed, and user ratings.
Such applications can leverage big data systems for recommending new
content to the users based on the user preferences and interests.
Financial
• Credit Risk Modeling: Banking and Financial institutions use credit risk
modeling to score credit applications and predict if a borrower will default or
not in the future.
Credit risk models are created from customer data that includes credit
scores obtained from credit bureaus, credit history, account balance data,
account transactions data and spending patterns of the customer.
Big data systems can help in computing credit risk scores of a large number of
customers on a regular basis.
These frameworks can be used to build credit risk models by analysis of
customer data
•Fraud Detection: Banking and Financial institutions can leverage big data
systems for detecting frauds such as credit card frauds, money laundering
and insurance claim frauds.
Real-time big data analytics frameworks can help in analyzing data from
disparate sources and label transactions in real-time
• Healthcare
The healthcare ecosystem consists of numerous entities including healthcare
providers (primary care physicians, specialists, or hospitals), payers
(government, private health insurance companies, employers), pharmaceutical,
device and medical service companies, IT solutions and services firms, and
patients.
The process of provisioning healthcare involves massive healthcare data that
exists in different forms (structured or unstructured), is stored in disparate
data sources (such as relational databases, or file servers) and in many different
formats.
To promote more coordination of care across the multiple providers involved
with patients, their clinical information is increasingly aggregated from diverse
sources into Electronic Health Record (EHR) systems.
EHRs capture and store information on patient health and provider actions
including individual-level laboratory results, diagnostic, treatment, and
demographic data.
Though the primary use of EHRs is to maintain all medical data for an individual
patient and to provide efficient access to the stored data at the point of care,
EHRs can be the source for valuable aggregated information about overall
patient populations.
Big data systems can be used for data collection from different stakeholders
(patients, doctors, payers, physicians, specialists, etc) and disparate data sources.
Big data analytics systems allow massive scale clinical data analytics and
facilitate development of more efficient healthcare applications, improve the
accuracy of predictions and help in timely decision making
• Internet of Things
Internet of Things (IoT) refers to things that have unique identities and are
connected to the Internet.
The "Things" in IoT are the devices which can perform remote sensing, actuating
and monitoring.
IoT devices can exchange data with other connected devices and applications
(directly or indirectly), or collect data from other devices and process the data.
IoT systems can leverage big data technologies for storage and analysis of data.
IoT applications that can benefit from big data system
• Intrusion Detection
• Smart Parking
• Smart Roads
• Structural Health Monitoring
• Smart Irrigation
Intrusion Detection: Intrusion detection systems use security cameras and
sensors (such as PIR sensors and door sensors) to detect intrusions and raise
alerts.
Smart Parking: Smart parking makes the search for a parking space easier and
more convenient for drivers. Smart parking is powered by IoT systems that detect
the number of empty parking slots and send the information over the
Internet to smart parking application back-ends
Smart Roads: Smart roads equipped with sensors can provide information on
driving conditions, travel time estimates and alerts in case of poor driving
conditions, traffic congestions and accidents
Structural Health Monitoring: Structural Health Monitoring systems use a
network of sensors to monitor the vibration levels in the structures such as
bridges and buildings. The data collected from these sensors is analyzed to
assess the health of the structures
Smart Irrigation: Smart irrigation systems can improve crop yields while
saving water. Smart irrigation systems - use IoT devices with soil moisture
sensors - determine the amount of moisture in the soil and release the flow
of water -when the moisture levels go below a predefined threshold.
Environment
Environment monitoring systems generate high velocity and high volume data.
Accurate and timely analysis of such data can help in understanding the current
status of the environment and also predicting environmental trends.
Weather Monitoring: Weather monitoring systems can collect data from a
number of attached sensors. This data can then be analyzed and visualized for
monitoring the weather and generating weather alerts
Air Pollution Monitoring: Air pollution monitoring systems can monitor emission
of harmful gases (CO2, CO, NO, or NO2) by factories and automobiles using
gaseous and meteorological sensors
Noise Pollution Monitoring: Due to growing urban development, noise levels in
cities have increased and even become alarmingly high in some cities
Urban noise maps can help the policy makers in urban planning and making
policies to control noise levels near residential areas, schools and parks
Forest Fire Detection: There can be different causes of forest fires, including
lightning, human negligence, volcanic eruptions and sparks from rock falls.
Forest fire detection systems use a number of monitoring nodes deployed at
different locations in a forest.
River Flood Detection: River flood monitoring systems use a number of sensor
nodes that monitor the water level (using ultrasonic sensors) and flow rate
(using flow velocity sensors).
Big data systems can be used to collect and analyze data from a number of
such sensor nodes and raise alerts when a rapid increase in water level and
flow rate is detected.
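The alerting logic described above can be sketched as follows; the thresholds and readings are assumptions for illustration only. An alert is raised when both water level and flow rate rise rapidly between consecutive readings.

```python
LEVEL_JUMP = 0.5    # metres per interval, hypothetical threshold
FLOW_JUMP = 2.0     # cubic metres/s per interval, hypothetical threshold

def flood_alerts(levels, flows):
    alerts = []
    for i in range(1, len(levels)):
        # alert only when BOTH level and flow rise sharply together
        if (levels[i] - levels[i-1] >= LEVEL_JUMP
                and flows[i] - flows[i-1] >= FLOW_JUMP):
            alerts.append(i)          # index of the triggering reading
    return alerts

levels = [1.0, 1.1, 1.8, 2.6]         # ultrasonic water-level readings (m)
flows  = [3.0, 3.2, 6.0, 9.5]         # flow-velocity readings
print(flood_alerts(levels, flows))    # [2, 3]
```

Requiring both signals to jump together reduces false alarms from a single noisy sensor.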
Logistics & Transportation
• Real-time Fleet Tracking: Vehicle fleet tracking systems use GPS
technology to track the locations of the vehicles in real-time.
Big data systems can be used to aggregate and analyze vehicle locations
and routes data for detecting bottlenecks in the supply chain such as
traffic congestions on routes, assignment and generation of alternative
routes
Shipment Monitoring: monitoring the conditions inside shipping containers
(e.g., to detect food spoilage)
Remote Vehicle Diagnostics: Remote vehicle diagnostic systems can detect
faults in the vehicles or warn of impending faults.
Route Generation & Scheduling: Modern transportation systems are driven by
data collected from multiple sources, which is processed to provide new
services to the stakeholders, such as advanced route guidance, dynamic vehicle
routing, and anticipating customer demands for pickup and delivery problems
Hyper-local Delivery: Hyper-local delivery platforms are being increasingly used
by businesses such as restaurants and grocery stores to expand their reach.
These platforms allow customers to order products (such as grocery and food
items) using web and mobile applications and the products are sourced from
local stores
Cab/Taxi Aggregators: On-demand transport technology aggregators (or
cab/taxi aggregators) allow customers to book cabs using web or mobile
applications, and the requests are routed to the nearest available cabs
Industry
• Machine Diagnosis & Prognosis: Machine prognosis refers to predicting the
performance of a machine by analyzing the data on the current operating
conditions and the deviations from the normal operating conditions.
Machine diagnosis refers to determining the cause of a machine fault.
Industrial machines have a large number of components that must function
correctly for the machine to perform its operations. Sensors in machines can
monitor the operating conditions
Risk Analysis of Industrial Operations: In many industries, there are strict
requirements on the environment conditions and equipment working conditions
Harmful and toxic gases such as carbon monoxide (CO), nitrogen monoxide (NO)
and nitrogen dioxide (NO2) can cause serious health problems. Gas
monitoring systems can help in monitoring the indoor air quality using various gas
sensors.
Production Planning and Control: Production planning and control systems
measure various parameters of production processes and control the entire
production process in real-time
Retail
Retailers can use big data systems for boosting sales, increasing profitability
and improving customer satisfaction.
Inventory Management: Inventory management has become increasingly important
in recent years with growing competition. While over-stocking of products can
result in additional storage expenses and risk, under-stocking can lead to loss
of revenue.
Tags attached to products allow them to be tracked in real-time so that
inventory levels can be determined accurately and products which are running
low can be restocked in time.
Customer Recommendations: Big data systems can be used to analyze the
customer data (such as demographic data, shopping history, or customer
feedback) and predict the customer preferences
Store Layout Optimization: Big data systems can help in analyzing the data on
customer shopping patterns and customer feedback to optimize the store
layouts
Forecasting Demand: Due to a large number of products, seasonal variations
in demands and changing trends and customer preferences, retailers find it
difficult to forecast demand and sales volumes. Big data systems can be used
to analyze the customer purchase patterns and predict demand and sale
volumes
Case Study: Weather Data Analysis
Using the big data stack for analysis of weather data
To come up with a selection of the tools and frameworks from the
big data stack that can be used for weather data analysis, let us first
come up with the analytics flow for the application
Data Collection
Let us assume, we have multiple weather monitoring stations or
end-nodes equipped with temperature, humidity, wind, and
pressure sensors.
To collect and ingest streaming sensor data generated by the
weather monitoring stations, we can use a publish-subscribe
messaging framework to ingest data for real-time analysis within the
Big Data stack and
Source-Sink connector to ingest data into a distributed filesystem
for batch analysis.
Data Preparation
Since the weather data received from different monitoring stations
can have missing values, use different units and have different
formats, we may need to prepare the data for analysis by cleaning,
wrangling, normalizing and filtering it.
Analysis Types
The choice of the analysis types is driven by the requirements of the
application.
Let us say, we want our weather analysis application
• to aggregate data on various timescales (minute, hourly, daily or
monthly)
• to determine the mean, maximum and minimum readings for
temperature, humidity, wind and pressure.
• to support interactive querying for exploring the data, for example:
finding the day with the lowest temperature in each month of a year, or
finding the top-10 wettest days in the year.
These types of analysis come under the basic statistics category.
Next, we also want the application to make predictions of certain
weather events, for example, predict the occurrence of fog or haze. For
such an analysis, we would require a classification model.
Additionally, if we want to predict values (such as the amount of
rainfall), we would require a regression model
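The basic-statistics part of this analysis can be sketched on made-up station readings: aggregate hourly temperatures to daily min/mean/max, then answer a lowest-temperature query of the kind described above.

```python
from collections import defaultdict

readings = [  # (day, temperature deg C) -- hypothetical station data
    ("2024-03-01", 10.0), ("2024-03-01", 14.0), ("2024-03-01", 12.0),
    ("2024-03-02", 8.0),  ("2024-03-02", 9.0),
    ("2024-03-03", 16.0), ("2024-03-03", 20.0),
]

# group readings by day (the aggregation timescale)
by_day = defaultdict(list)
for day, temp in readings:
    by_day[day].append(temp)

# daily min / mean / max
daily = {d: (min(t), sum(t) / len(t), max(t)) for d, t in by_day.items()}

# interactive-style query: the day with the lowest temperature
coldest_day = min(daily, key=lambda d: daily[d][0])
print(coldest_day, daily[coldest_day])
```

In the big data stack this same group-aggregate-query pattern would run over HDFS data via MapReduce or Spark SQL rather than an in-memory dictionary.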
Analysis Modes
Based on the analysis types determined in the previous step, we
know that the analysis modes required for the application will be
batch, real-time and interactive.
Visualizations
The front end application for visualizing the analysis results would
be dynamic and interactive.
Mapping Analysis Flow to Big Data Stack
Now that we have the analytics flow for the application, let us map
the selections at each step of the flow to the big data stack.
Figure shows a subset of the components of the big data stack based
on the analytics flow.
To collect and ingest streaming sensor data generated by the
weather monitoring stations, we can use a publish-subscribe
messaging framework such as Apache Kafka (for real-time analysis
within the Big Data stack).
Each weather station publishes the sensor data to Kafka
Real-time analysis frameworks such as Storm and Spark Streaming
can receive data from Kafka for processing
For batch analysis, we can use a source-sink connector such as Flume
to move the data to HDFS.
Once the data is in HDFS, we can use batch processing frameworks
such as Hadoop-MapReduce
While the batch and real-time processing frameworks are useful
when the analysis requirements and goals are known upfront,
interactive querying tools can be useful for exploring the data.
We can use interactive querying framework such as Spark SQL,
which can query the data in HDFS for interactive queries.
For presenting the results of batch and real-time analysis, a
NoSQL database such as DynamoDB can be used as a serving
database.
For developing web applications and displaying the analysis
results we can use a web framework such as Django.