Chapter 1. Understanding Big Data
UNDERSTANDING
BIG DATA
OUTLINE
Scientific discoveries
CONCEPTS AND TERMINOLOGY
DATASETS
DATA ANALYSIS
DATA ANALYTICS
Figure 1.4: Value and complexity increase from descriptive to prescriptive
analytics.
DESCRIPTIVE ANALYTICS
Descriptive analytics answers questions about events that have already occurred. It contextualizes
data to generate information.
Sample questions:
What was the sales volume over the past 12 months?
What is the number of support calls received as categorized by severity and geographic location?
What is the monthly commission earned by each sales agent?
The resulting reports are generally static and display historical data in the form of data grids or charts.
Queries are executed on operational data stores from within an enterprise, for
example a Customer Relationship Management (CRM) system or an Enterprise
Resource Planning (ERP) system, via ad-hoc reporting or dashboards (see Figure 1.5).
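The sample questions above boil down to grouping and summing historical records. A minimal sketch in Python, assuming a hypothetical list of sales records and an assumed flat commission rate (neither comes from the slides):

```python
from collections import defaultdict

# Hypothetical operational records: (month, agent, sale_amount).
SALES = [
    ("2023-01", "alice", 1200.0),
    ("2023-01", "bob", 800.0),
    ("2023-02", "alice", 500.0),
    ("2023-02", "bob", 1500.0),
]

COMMISSION_RATE = 0.05  # assumed flat rate, for illustration only


def sales_volume_by_month(records):
    """Total sales per month: reports what happened, not why."""
    totals = defaultdict(float)
    for month, _agent, amount in records:
        totals[month] += amount
    return dict(totals)


def monthly_commission(records, rate=COMMISSION_RATE):
    """Commission earned by each agent in each month."""
    totals = defaultdict(float)
    for month, agent, amount in records:
        totals[(month, agent)] += amount * rate
    return dict(totals)
```

Both functions only summarize data that already exists, which is why the resulting reports are static.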
Figure 1.5: The operational systems, pictured left, are queried via descriptive
analytics tools to generate reports or dashboards, pictured right.
DIAGNOSTIC ANALYTICS
Diagnostic analytics aims to determine the cause of a phenomenon that occurred in the past, using
questions that focus on the reason behind the event.
Sample questions:
Why were Q2 sales less than Q1 sales?
Why have there been more support calls originating from the Eastern region than
from the Western region?
Why was there an increase in patient re-admission rates over the past three
months?
The queries are executed on multidimensional data held in
analytic processing systems, using drill-down and roll-up analysis.
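Roll-up and drill-down can be sketched as the same aggregation run over fewer or more dimensions. The fact table, dimension names and figures below are hypothetical and only illustrate the idea:

```python
from collections import defaultdict

# Hypothetical multidimensional sales facts: (quarter, region, product, amount).
FACTS = [
    ("Q1", "East", "home", 400),
    ("Q1", "East", "auto", 300),
    ("Q1", "West", "home", 300),
    ("Q2", "East", "home", 150),
    ("Q2", "East", "auto", 300),
    ("Q2", "West", "home", 300),
]

DIMENSIONS = ("quarter", "region", "product")


def aggregate(facts, dims):
    """Sum the amount measure, grouped by the named dimensions."""
    idx = [DIMENSIONS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)


# Roll-up: collapse region and product to see totals per quarter.
by_quarter = aggregate(FACTS, ["quarter"])

# Drill-down: add dimensions back to locate where the Q2 drop came from.
by_quarter_region_product = aggregate(FACTS, ["quarter", "region", "product"])
```

In this toy data the roll-up shows Q2 (750) below Q1 (1000), and the drill-down narrows the drop to home policies in the East region.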
PREDICTIVE ANALYTICS
Figure 1.7: Predictive analytics tools can provide user-friendly front-end
interfaces.
PRESCRIPTIVE ANALYTICS
Prescriptive analytics builds upon the results of predictive analytics by
prescribing actions that should be taken and explaining the reason
“why” (because its results embed elements of situational understanding).
Sample questions:
Among three drugs, which one provides the best results?
When is the best time to trade a particular stock?
Various outcomes are calculated, and the best course of action for
each outcome is suggested. This approach shifts from explanatory
to advisory and can include the simulation of various scenarios.
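As a rough illustration of calculating outcomes per scenario and suggesting an action for each, here is a minimal sketch. The scenarios, actions and payoff numbers are entirely hypothetical; a real prescriptive system would derive them from predictive models and business rules:

```python
# Hypothetical payoff of each action under each simulated scenario.
PAYOFFS = {
    "flood": {"raise_premium": 40, "reinsure": 70, "do_nothing": -50},
    "drought": {"raise_premium": 30, "reinsure": 10, "do_nothing": 20},
}


def prescribe(payoffs):
    """For each simulated scenario, suggest the action with the highest payoff."""
    return {
        scenario: max(actions, key=actions.get)
        for scenario, actions in payoffs.items()
    }


recommendations = prescribe(PAYOFFS)
```

The output is advisory: one suggested action per scenario, rather than an explanation of past events.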
BUSINESS INTELLIGENCE (BI)
KEY PERFORMANCE INDICATORS (KPI)
A KPI is a metric used:
to gauge success within a particular business context.
to identify business performance problems and demonstrate regulatory
compliance.
It acts as a quantifiable reference point for measuring a specific aspect of a
business’ overall performance.
KPIs are linked with an enterprise’s overall strategic goals and
objectives.
KPIs are often displayed via a KPI dashboard, and compare the
BIG DATA CHARACTERISTICS
FIVE BIG DATA TRAITS
VOLUME
The anticipated volume of data is high, substantial and ever-growing.
Figure 1.12 provides a visual representation of the large volume of data being
created daily by organizations and users world-wide.
VELOCITY
Figure 1.13: Examples of high-velocity Big Data that can easily be generated in a given minute:
350,000 tweets,
300 hours of video footage uploaded to YouTube,
171 million emails,
330 GB of sensor data
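To get a feel for these rates, the per-minute figures in the caption can be converted to per-second rates. This is plain arithmetic on the quoted numbers; the dictionary keys are illustrative names only:

```python
# Per-minute figures quoted in Figure 1.13.
PER_MINUTE = {
    "tweets": 350_000,
    "video_hours_uploaded": 300,
    "emails": 171_000_000,
    "sensor_data_gb": 330,
}


def per_second(per_minute):
    """Convert per-minute counts into per-second rates."""
    return {name: count / 60 for name, count in per_minute.items()}


rates = per_second(PER_MINUTE)
```

For example, 171 million emails per minute is 2.85 million emails per second, and 330 GB of sensor data per minute is 5.5 GB per second.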
VERACITY
Veracity refers to the quality or fidelity of data.
Noise is data that cannot be converted into information and thus has no value.
Signals are data that have value and lead to meaningful information.
VALUE
(b) the timeliness
Value and time are inversely related. The longer it takes to turn data into meaningful information, the less value it has for a business, because analytics results have a shelf-life. For example, a stock quote delayed by 20 minutes has little to no value for making a trade compared to a quote that is 20 milliseconds old.
DIFFERENT TYPES OF DATA
THE DATA PROCESSED BY BIG DATA SOLUTIONS CAN BE
PRIMARY TYPES OF DATA
UNSTRUCTURED DATA
Data that does not conform to a data model or data schema. It is
estimated to account for about 80% of the data within any given enterprise.
Unstructured data has a faster growth rate than structured data.
Forms of unstructured data: textual or binary, often conveyed via
files that are self-contained and non-relational.
A text file may contain the contents of various tweets or blog postings.
Binary files are often media files that contain image, audio or video data.
SEMI-STRUCTURED DATA
CASE STUDY BACKGROUND
Company introduction
Company history and Company structure
IT environment – Technical Infrastructure and Automation Environment
Business Goals and Obstacles to adopt a data-driven IT solution
Big Data adoption - Case Study Example
COMPANY INTRODUCTION
COMPANY HISTORY
ETI started 50 years ago as an exclusive health insurance provider.
Later, ETI extended its services to property and casualty insurance plans in
the building, marine and aviation sectors.
Each of the four sectors has a core team of specialized and experienced agents,
actuaries, underwriters and claim adjusters.
ETI’s key departments:
Underwriting department
Claims department
Settlement department
Customer care department
Legal department
Marketing department
Human resource department
Accounts department
IT department
Agents
• generating the company’s revenue by selling policies
Actuaries
• managing risk assessment
• designing new insurance plans and revising existing plans
• performing what-if analyses and making use of dashboards and scorecards for scenario
evaluation
Underwriters
• evaluating new insurance applications and deciding on the premium amount
Claim adjusters
• investigating claims made against a policy
• arriving at a settlement amount for the policyholder
Communication channels between Customer care department and prospective
and existing customers:
telephone
email
social media
Core competence:
providing competitive policies and premium customer service that does not end once
a policy has been sold.
helping to achieve increased levels of customer acquisition and retention.
relying heavily on its actuaries to create insurance plans that reflect the needs of its customers.
IT ENVIRONMENT – TECHNICAL INFRASTRUCTURE AND AUTOMATION ENVIRONMENT
A set of client-server, mainframe platforms and systems:
policy quotation
policy administration
customer relationship management (CRM)
enterprise resource planning (ERP)
claims management
risk assessment
billing
document management
IT ENVIRONMENT – FUNCTIONS OF EACH SYSTEM
Policy quotation system
To create new insurance plans
To provide quotes to prospective customers
Is integrated with the website and customer care portal to provide website visitors and customer care
agents the ability to obtain insurance quotes
Policy administration system
To handle policy lifecycle management, including issuance, update, renewal and cancellation of policies
regulatory compliance.
BUSINESS GOALS AND OBSTACLES
Over the past few decades, ETI has suffered a falling share price and a decrease in market share.
A committee comprised of senior managers was formed to investigate and make recommendations.
The insurance plans are generally based on the actuaries’ experience and analysis of the population as a whole - -> only apply to an average set of customers.
Customers whose circumstances deviate from the average set are not interested in such insurance plans.
Main reason: the emergence of tech-savvy competitors that employ telematics to provide personalized policies.
Consequence: loss of customers and declines in revenue.
STRATEGIC GOALS TO IMPROVE PROFITABILITY
1. Decrease losses by:
(a) improving risk evaluation and maximizing risk mitigation, which applies both to the creation of
insurance plans and to the screening of new applications at the time of issuing a policy,
(b) implementing a proactive catastrophe management system that decreases the number of potential
claims resulting from a calamity, and
(c) detecting fraudulent claims.
3. Achieve and maintain full regulatory compliance at all times by employing enhanced risk
= = > a recommendation that ETI should adopt Big Data
BIG DATA ADOPTION - CASE STUDY EXAMPLE
1. IT team and skills for Big Data implementation
Problems
No in-house Big Data skills
Have to choose between hiring a Big Data consultant or sending its IT team on a Big Data training course.
Solutions
Sending only the senior IT team members to the Big Data training course.
As a long-term plan, these trained team members will become a permanent in-house Big Data resource and can also
train junior team members to further increase the in-house Big Data skillset.
2. During Big Data training course
Problems
No common vocabulary of terms
Lack of business exposure and of understanding of BI and the establishment of appropriate KPIs
Solutions
Building a terms glossary for datasets including claims, policies, quotes, customer profile data and census data.
Explaining BI by using the monthly report generation process for evaluating the previous month’s performance as
an example
3. Data Analytics
Deciding to use both descriptive and diagnostic analytics
Descriptive analytics is for:
querying the policy administration system to determine the number of policies sold each day
querying the claims management system to find out how many claims are submitted daily
querying the billing system to find out how many customers are behind on their premium payments.
Diagnostic analytics is for:
various BI activities, such as performing queries to answer questions like why last month’s sales target was not met.
performing drill-down operations to break down sales by type and location so that it can be determined which locations
underperformed for specific types of policies.
In the future, utilizing predictive and prescriptive analytics in a gradual manner by first implementing predictive analytics
and then slowly building up their capabilities to implement prescriptive analytics.
Predictive analytics will enable the detection of fraudulent claims by predicting which claims are fraudulent, and the handling of
customer defection by predicting which customers are likely to defect.
Later, prescriptive analytics will prescribe the correct premium amount considering all risk factors, or the
best course of action to take for mitigating claims when faced with catastrophes, such as floods or storms.
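The descriptive and diagnostic queries described for ETI could look roughly like the following. This is a sketch only: the in-memory SQLite database, the `policies` table and its columns are hypothetical stand-ins for the policy administration system, not ETI's actual schema:

```python
import sqlite3

# Hypothetical stand-in for the policy administration system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE policies ("
    "id INTEGER PRIMARY KEY, sold_on TEXT, type TEXT, location TEXT)"
)
conn.executemany(
    "INSERT INTO policies (sold_on, type, location) VALUES (?, ?, ?)",
    [
        ("2024-03-01", "health", "north"),
        ("2024-03-01", "marine", "south"),
        ("2024-03-02", "health", "north"),
    ],
)

# Descriptive: how many policies were sold each day?
per_day = conn.execute(
    "SELECT sold_on, COUNT(*) FROM policies GROUP BY sold_on ORDER BY sold_on"
).fetchall()

# Diagnostic drill-down: break the same sales down by type and location.
by_type_location = conn.execute(
    "SELECT type, location, COUNT(*) FROM policies "
    "GROUP BY type, location ORDER BY type, location"
).fetchall()
```

The first query reports what happened; the second adds dimensions so that underperforming locations or policy types can be spotted.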
4. Identifying Data Characteristics
Volume
A large amount of transactional data is generated as a result of processing claims, selling
new policies and changes to existing policies.
A large volume of unstructured data, both inside and outside the company, including
health records, documents submitted by the customers at the time of submitting an
insurance application, property schedules, fleet data, social media data and weather data.
Velocity
For data from within the company, some is low velocity (such as the claims submission data and the new
policies issued data) and some is high velocity (such as webserver logs and insurance quotes).
For data from outside the company, social media data and the weather data may arrive at a fast pace.
For catastrophe management and fraudulent claim detection, data needs to be processed
Have to draw maximum value out of the available datasets by ensuring the datasets are
stored in their original form and that they are subjected to the right type of analytics.
5. Identifying Types of Data
Structured data: policy data, claim data, customer profile data and quote data.
Unstructured data: social media data, insurance application documents, call center agent
notes, claim adjuster notes and incident photographs.
Semi-structured data: health records, customer profile data, weather reports, census data,
webserver logs and emails.
Metadata is a new concept, as ETI’s current data management procedures do not create or
append any metadata.
Why? - -> Because all data that ETI stores and processes is structured in nature and originates
from within the company. Hence, the origins and the characteristics of data are implicitly known.
Solution - -> for the structured data, the data dictionary and the last-updated
timestamp and last-updated user-id columns within the different relational database tables can be
used as metadata.
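A minimal sketch of deriving such metadata from a relational table, using an in-memory SQLite database. The `claims` table and the column names `last_updated` and `last_updated_by` are assumptions for illustration; `PRAGMA table_info` stands in for the data dictionary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical claims table with the audit columns mentioned above.
conn.execute(
    "CREATE TABLE claims ("
    "id INTEGER PRIMARY KEY, amount REAL, "
    "last_updated TEXT, last_updated_by TEXT)"
)
conn.execute(
    "INSERT INTO claims (amount, last_updated, last_updated_by) "
    "VALUES (1250.0, '2024-03-02T10:15:00', 'adjuster_17')"
)


def table_metadata(conn, table):
    """Collect column names and types plus the most recent update as metadata."""
    # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
    # The table name is interpolated directly, so it must be a trusted value.
    columns = {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}
    last_updated, last_user = conn.execute(
        f"SELECT last_updated, last_updated_by FROM {table} "
        "ORDER BY last_updated DESC LIMIT 1"
    ).fetchone()
    return {
        "columns": columns,
        "last_updated": last_updated,
        "last_updated_by": last_user,
    }


meta = table_metadata(conn, "claims")
```

The resulting dictionary records what the data looks like and who last changed it, which is the kind of provenance information ETI does not yet capture explicitly.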