
AD8551

BUSINESS ANALYTICS
(Unit 1- Unit 3)
UNIT 1
INTRODUCTION TO BUSINESS ANALYTICS

ANALYTICS AND DATA SCIENCE:


Definition:
Analytics generally refers to the science of manipulating data by applying different models and
statistical formulae to it in order to find insights.
These insights are the key factors that help us solve various problems. Such problems come in
many forms, and when we work with data to find insights and solve business-related
problems, we are doing Business Analytics.
The tools used for analytics may range from spreadsheets to predictive analytics for complex
business problems. The process includes using these tools to draw out patterns and identify
relationships. Next, new questions are asked and the iterative process starts again and continues
until the business goal is achieved.
Business analytics draws on several methodologies, such as data mining, statistical
analysis, and predictive analytics, to analyze and transform data into useful information.
Business analytics is also used to identify and anticipate trends and outcomes. With the help of
these results, it becomes easier to make data-driven business decisions.
The use of business analytics is very popular in some industries such as healthcare, hospitality,
and any other business that has to track or closely monitor its customers. Many high-end
business analytics software solutions and platforms have been developed to ingest and process
large data sets.
Business Analytics Examples
Some of the examples of Business Analytics are:
• A simple example of Business Analytics would be working with data to find the optimal price
point for a product that a company is about to launch. During this research, many factors have
to be taken into consideration before arriving at a solution.
• Another example would be applying Business Analytics techniques to identify how many, and
which, customers are likely to cancel their subscriptions.
• Another widely cited example of Business Analytics is working with available data to assess
how and why the tastes and preferences of customers who visit a particular restaurant regularly
change over time.
Components of Business Analytics
Modern world business strategies are centred around data. Business Analytics, Machine
Learning, Data Science, etc. are used to arrive at solutions for complex and specific business
problems. Even though all of these have various components, the core components still remain
similar. Following are the core components of Business Analytics:
• Data Storage– Data is stored by computers so that it can be used again in the future; keeping
data on storage devices for later processing is known as data storage. Object storage, block
storage, etc. are some of the common storage products and services.
• Data Visualization– It is the process of graphically representing the information or insights
drawn through the analysis of data. Data visualization makes the communication of outputs to
the management easier in simple terms.
• Insights– Insights are the outputs and inferences drawn from the analysis of data by
implementing business analytics techniques and tools.
• Data Security– One of the most important components of Business Analytics is data security.
It involves monitoring networks and identifying malicious activity. Real-time data and
predictive modelling techniques are used to identify vulnerabilities in the system.

TYPES OF BUSINESS ANALYTICS


There are various types of analytics that are performed on a daily basis across many companies.
Let’s understand each one of them in this section.
Descriptive Analytics
Whenever we are trying to answer questions such as “what were the sales figures last year?” or
“what has occurred before?”, we are basically doing descriptive analysis. In descriptive analysis,
we describe or summarize past data and transform it into easily comprehensible forms, such
as charts or graphs.
Example –
Let’s take the example of DMart. We can look at each product’s sales history and find out which
products have sold more or which are in high demand by looking at the trends in products
sold. Based on this analysis, we can then decide to stock those items in larger quantities
for the coming year.
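
As a small illustration of this kind of descriptive summary, the following Python sketch (using the pandas library) ranks products by past sales; the product names and quantities are invented purely for illustration.

import pandas as pd

# Hypothetical sales history for a retail store (illustrative data only)
sales = pd.DataFrame({
    "product":  ["rice", "oil", "rice", "soap", "oil", "rice", "soap"],
    "quantity": [120, 45, 150, 60, 50, 130, 70],
})

# Descriptive analytics: summarize what has already happened
totals = sales.groupby("product")["quantity"].sum().sort_values(ascending=False)
print(totals)          # products ranked by past demand
print(totals.head(3))  # top sellers to stock in larger quantities next year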
Predictive Analytics
Predictive analytics is exactly what it sounds like: it is the side of business analytics where
predictions about future events are made. An example of predictive analytics is calculating the
expected sales figures for the upcoming fiscal year. Predictive analytics is mainly used to set
expectations and to put in place the processes and measures needed to meet them.
Example –
The best examples are the Amazon and Netflix recommender systems. You might have
noticed that whenever you buy a product on Amazon, the checkout page shows you a
recommendation along the lines of “customers who purchased this item also purchased
this product”. That recommendation is based on customers’ past purchase behaviour: by
looking at it, analysts build associations between products, which is why related products
are recommended when you buy something.
Netflix works in a similar way. When you watch a movie or web series on Netflix, it
recommends many other movies or web series. These recommendations are based on past
data and trends: Netflix identifies which movies or series have gained wide public interest
and builds its recommendations on that.
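
As a simple illustration of predicting a future value from past data, the sketch below fits a linear trend to invented yearly sales figures with scikit-learn and projects the next fiscal year. It is only a sketch of the idea, not how Amazon or Netflix actually build their recommenders.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical yearly sales figures (illustrative data only)
years = np.array([[2018], [2019], [2020], [2021], [2022]])
sales = np.array([1.2, 1.5, 1.4, 1.8, 2.1])   # e.g. in crores

# Fit a trend on past data and predict the upcoming fiscal year
model = LinearRegression().fit(years, sales)
forecast = model.predict(np.array([[2023]]))
print(f"Expected sales for 2023: {forecast[0]:.2f}")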

Prescriptive Analytics
In the case of prescriptive analytics, we make use of simulation, data modelling, and
optimization algorithms to answer questions such as “what needs to be done?”. It is used to
propose solutions and identify the potential results of those solutions. This field of
business analytics has emerged relatively recently and is growing rapidly, since it offers
multiple solutions, along with their likely effectiveness, for the problems businesses face.
If Plan A fails or there aren’t enough resources to execute it, there is still Plan B, Plan C, and so on.
Example –
The best example would be Google’s self-driving car: by looking at past trends and
forecasted data, it decides when to turn or when to slow down, working much like a human
driver.
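
As a small illustration of the optimization side of prescriptive analytics, the sketch below uses SciPy's linear programming solver to recommend a production plan; all profit figures and resource constraints are invented for illustration.

from scipy.optimize import linprog

# Hypothetical problem: choose quantities of two products to maximize profit
# Profit per unit: product A = 40, product B = 30 (linprog minimizes, so negate)
c = [-40, -30]

# Resource constraints (machine hours and labour hours, invented numbers)
A_ub = [[2, 1],    # machine hours per unit of A and B
        [1, 3]]    # labour hours per unit of A and B
b_ub = [100, 90]   # hours available

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print("Recommended production plan:", result.x)
print("Expected profit:", -result.fun)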

ANALYTICS LIFE CYCLE


In the early 1990s, as data mining was evolving from toddler to adolescent, the community
spent a lot of time getting data ready for fairly limited tools and computing power.
The CRISP-DM process that emerged as a result is still valid today in the era of Big Data and
stream analytics.

Business Understanding
This phase focuses on understanding the project objectives and requirements from a business
perspective. The analyst formulates this knowledge as a data mining problem and develops a
preliminary plan.
Data Understanding
Starting with initial data collection, the analyst proceeds with activities to get familiar with the
data, identify data quality problems & discover first insights into the data. In this phase, the
analyst might also detect interesting subsets to form hypotheses for hidden information
Data Preparation
The data preparation phase covers all activities needed to construct the final dataset from the
initial raw data.
Modelling
The analyst evaluates, selects & applies the appropriate modelling techniques. Since some
techniques, like neural nets, have specific requirements regarding the form of the data, there
can be a loop back here to data preparation.
Evaluation
The analyst builds & chooses models that appear to have high quality based on the loss functions
that were selected, and then tests them to ensure that they generalise against unseen data.
Subsequently, the analyst also validates that the models sufficiently cover
all key business issues. The end result is the selection of the champion model(s)
Deployment
Generally, this will mean deploying a code representation of the model into an operational
system. This also includes mechanisms to score or categorise new, unseen data as it arises, and
the mechanism should use the new information in the solution of the original business problem.
Importantly, the code representation must also include all the data preparation steps leading up
to modelling. This ensures that the model will treat new raw data in the same manner as during
model development
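
The requirement that the deployed code include the data preparation steps can be illustrated with a scikit-learn Pipeline, which packages preprocessing and the model as one object so new raw data is treated exactly as it was during development. This is only an illustrative sketch, not part of CRISP-DM itself; the data is invented.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Preprocessing (imputation, scaling) and the model are deployed as one object,
# so new raw data goes through the same preparation used during development.
model = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

X_train = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
y_train = np.array([0, 0, 1, 1])
model.fit(X_train, y_train)

# Scoring new, unseen raw data at deployment time
X_new = np.array([[2.5, np.nan]])
print(model.predict(X_new))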

BUSINESS PROBLEM DEFINITION


The Business Understanding phase is about understanding what the business wants to solve.
Important tasks within this phase, according to data science project management practice, include:
1. Determine the business question and objective: Establish what is to be solved from the business
perspective and what the customer wants, and define the business success criteria (Key Performance
Indicators, or KPIs). If you are a fresher, research the kinds of situations a company might face and
try to build your project on top of one of them.
2. Situation assessment: Assess the resource availability, project requirements, risks, and
cost-benefit of the project. While you might not know the situation within a company that has not
yet hired you, you can assess it based on your research and explain what your assessment is based on.
3. Determine the project goals: Define the success criteria from a technical, data mining perspective.
You could base them on model metrics, availability time, or anything else, as long as you can explain
them; what matters is that they are logically sound.
4. Project plan: Create a detailed plan for each project phase, including the tools you intend to use.

Determine the business question and objective:


The first thing you must do in any project is to find out exactly what you’re trying to
accomplish! That’s less obvious than it sounds. Many data miners have invested time on data
analysis, only to find that their management wasn’t particularly interested in the issue they
were investigating. You must start with a clear understanding of
• A problem that your management wants to address
• The business goals
• Constraints (limitations on what you may do, the kinds of solutions that can be used, when the
work must be completed, and so on)
• Impact (how the problem and possible solutions fit in with the business)
Deliverables for this task include three items (usually brief reports focusing on just the main
points):
• Background: Explain the business situation that drives the project. This item, like many that
follow, amounts only to a few paragraphs.
• Business goals: Define what your organization intends to accomplish with the project. This is
usually a broader goal than you, as a data miner, can accomplish independently. For example,
the business goal might be to increase sales from a holiday ad campaign by 10 percent year
over year.
• Business success criteria: Define how the results will be measured. Try to get clearly defined
quantitative success criteria. If you must use subjective criteria (hint: terms like gain
insight or get a handle on imply subjective criteria), at least get agreement on exactly who will
judge whether or not those criteria have been fulfilled.

Assessing your situation


This is where you get into more detail on the issues associated with your business goals. Now
you will go deeper into fact-finding, building out a much fleshier explanation of the issues
outlined in the business goals task.
Deliverables for this task include five in-depth reports:
• Inventory of resources: A list of all resources available for the project. These may include
people (not just data miners, but also those with expert knowledge of the business problem,
data managers, technical support, and others), data, hardware, and software.
• Requirements, assumptions, and constraints: Requirements will include a schedule for
completion, legal and security obligations, and requirements for acceptable finished work. This
is the point to verify that you’ll have access to appropriate data!
• Risks and contingencies: Identify causes that could delay completion of the project, and
prepare a contingency plan for each of them. For example, if an Internet outage in your office
could pose a problem, perhaps your contingency could be to work at another office until the
outage has ended.
• Terminology: Create a list of business terms and data-mining terms that are relevant to your
project and write them down in a glossary with definitions (and perhaps examples), so that
everyone involved in the project can have a common understanding of those terms.
• Costs and benefits: Prepare a cost-benefit analysis for the project. Try to state all costs and
benefits in dollar (euro, pound, yen, and so on) terms. If the benefits don’t significantly exceed
the costs, stop and reconsider this analysis and your project.

Defining your project goals


Reaching the business goal often requires action from many people, not just the data miner. So
now, you must define your little part within the bigger picture. If the business goal is to reduce
customer attrition, for example, your data-mining goals might be to identify attrition rates for
several customer segments, and develop models to predict which customers are at greatest risk.
Deliverables for this task include two reports:
• Project goals: Define project deliverables, such as models, reports, presentations, and
processed datasets.
• Project success criteria: Define the project technical criteria necessary to support the business
success criteria. Try to define these in quantitative terms (such as model accuracy or predictive
improvement compared to an existing method). If the criteria must be qualitative, identify the
person who makes the assessment.

Project plan
Now you specify every step that you, the data miner, intend to take until the project is
completed and the results are presented and reviewed.
Deliverables for this task include two reports:
• Project plan: Outline your step-by-step action plan for the project. Expand the outline with a
schedule for completion of each step, required resources, inputs (such as data or a meeting with
a subject matter expert), and outputs (such as cleaned data, a model, or a report) for each step,
and dependencies (steps that can’t begin until this step is completed). Explicitly state that
certain steps must be repeated (for example, modeling and evaluation usually call for several
back-and-forth repetitions).
• Initial assessment of tools and techniques: Identify the required capabilities for meeting your
data-mining goals and assess the tools and resources that you have. If something is missing,
you have to address that concern very early in the process.
DATA COLLECTION
Data is a collection of facts, figures, objects, symbols, and events gathered from different
sources. Organizations collect data to make better decisions. Without data, it would be difficult
for organizations to make appropriate decisions, and so data is collected at various points in
time from different audiences.
For instance, before launching a new product, an organization needs to collect data on product
demand, customer preferences, competitors, etc. In case data is not collected beforehand, the
organization’s newly launched product may lead to failure for many reasons, such as less
demand and inability to meet customer needs.
Although data is a valuable asset for every organization, it does not serve any purpose until
analyzed or processed to get the desired results.
Information collected as numerical facts through observation, before any processing, is known
as raw data.
There are two types of data: primary data and secondary data. They are described below.
1. Primary Data
When an investigator collects data himself or herself, with a definite plan or design, the
data is known as primary data. Generally, the results derived from primary data are
accurate, as the researcher gathers the information first-hand. One disadvantage of primary
data collection, however, is the expense associated with it: primary data research is very
time-consuming and expensive.
2. Secondary Data
Data that the investigator does not collect originally but instead obtains from published or
unpublished sources is secondary data. Secondary data is collected by an individual or an
institution for some purpose and is then used by someone else in another context. It is worth
noting that although secondary data is cheaper to obtain, it raises concerns about accuracy:
as the data is second-hand, one cannot fully rely on the information being authentic.
Data Collection: Methods
Data collection is defined as gathering and analysing data, using appropriate techniques, in
order to carry out research and validate its results. It is done to diagnose a problem and to
learn about its outcomes and future trends. When a question needs to be answered, data
collection methods help anticipate the likely result.
We must collect reliable data from the correct sources to make the calculations and analysis
easier. There are two types of data collection methods, depending on the kind of data
being collected. They are:
1. Primary Data Collection Methods
2. Secondary Data Collection Methods
Types of Data Collection
Students require primary or secondary data while doing their research. Both primary and
secondary data have their own advantages and disadvantages. Both the methods come into
the picture in different scenarios. One can use secondary data to save time and primary data
to get accurate results.
Primary Data Collection Method
Primary or raw data is obtained directly from the first-hand source through experiments,
surveys, or observations. The primary data collection method is further classified into two
types, and they are given below:
1. Quantitative Data Collection Methods
2. Qualitative Data Collection Methods
Quantitative Data Collection Methods
The term ‘quantitative’ refers to quantities, that is, specific numbers. Quantitative data
collection methods express the data in numbers, using traditional or online data collection
methods. Once this data is collected, the results can be calculated using statistical methods
and mathematical tools. Some of the quantitative data collection methods include the following:
Time Series Analysis
The term time series refers to the values of a variable recorded in sequential order at equal
time intervals; the general movement in such a series is known as its trend. Using these
patterns, an organization can predict the demand for its products and services over the
projected time period.
Smoothing Techniques
In cases where the time series lacks a significant trend, smoothing techniques can be used. They
eliminate random variation from the historical demand, which helps in identifying patterns and
demand levels for estimating future demand. The most common methods used in smoothing for
demand forecasting are the simple moving average method and the weighted
moving average method.
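
The two smoothing methods mentioned above can be sketched in Python with pandas; the monthly demand values and weights below are invented for illustration.

import numpy as np
import pandas as pd

# Hypothetical monthly demand (illustrative data only)
demand = pd.Series([120, 135, 128, 140, 150, 145, 160, 155])

# Simple moving average over a 3-month window
sma = demand.rolling(window=3).mean()

# Weighted moving average: recent months weighted more heavily
weights = np.array([0.2, 0.3, 0.5])
wma = demand.rolling(window=3).apply(lambda x: np.dot(x, weights), raw=True)

print(pd.DataFrame({"demand": demand, "SMA(3)": sma, "WMA(3)": wma}))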
Barometric Method
Also known as the leading indicators approach, this method is used by researchers to speculate
about future trends based on current developments. When past events are used to predict future
events, they act as leading indicators.
Qualitative Data Collection Methods
The qualitative method does not involve any mathematical calculations. This method is
closely connected with elements that are not quantifiable. The qualitative data collection
method includes several ways to collect this type of data, and they are given below:
Interview Method
As the name suggests, data is collected through verbal conversation: interviewing people in
person, over the telephone, or with the help of a computer-aided model. This is one of the
methods most often used by researchers. A brief description of each of these approaches is
given below:
Personal or Face-to-Face Interview: In this type of interview, questions are asked directly
to the respondent in person, and the researcher records the answers.
Telephonic Interview: Here, questions are asked over a telephone call, and data is collected
directly from people by noting their views or opinions.
Computer-Assisted Interview: The computer-assisted interview is the same as a personal
interview, except that the interviewer and the respondent interact through a desktop or
laptop. The data collected is updated directly in a database, which makes the process quicker
and easier and eliminates much of the paperwork involved in recording the data.
Questionnaire Method of Collecting Data
The questionnaire method consists of conducting surveys with a set of quantitative research
questions, often created using online survey-creation software. Such software also helps
ensure that respondents see the survey as legitimate and trustworthy. Some types of
questionnaire methods are given below:
Web-Based Questionnaire: The interviewer sends a survey link to the selected respondents,
who click on the link and are taken to the survey questionnaire. This method is very
cost-efficient and quick, respondents can complete it at a time convenient to them, and the
survey can be taken on any device, so it is reliable and flexible.
Mail-Based Questionnaire: Questionnaires are sent to the selected audience via email.
Sometimes incentives are also offered for completing the survey, which is the main attraction
of the method. The advantages are that the respondent’s identity remains confidential to the
researchers and that there is flexibility of time to complete the survey.
Observation Method
As the word ‘observation’ suggests, in this method data is collected by direct observation,
for example by counting the number of people or the number of events in a particular time
frame. It is generally effective in small-scale scenarios, and the primary skill needed is
observing and recording the numbers correctly. Structured observation is the type of
observation in which the researcher looks for certain specific behaviours.
Document Review Method
The document review method is a data aggregation method used to collect data from existing
documents that record information about the past. There are two types of documents from
which we can collect data. They are given below:
Public Records: Organizational data such as annual reports and sales information from past
months, used for future analysis.
Personal Records: As the name suggests, documents about an individual, such as type of job,
designation, and interests.
Secondary Data Collection Method
Data collected by a person other than the researcher is secondary data. Secondary data is
readily available and does not require any particular collection methods. It exists in the
form of historical archives, government data, organisational records, etc. This data can be
obtained directly from the company or organization where the research is being conducted,
or from outside sources.
Internal sources of secondary data include company documents, financial statements, annual
reports, team member information, and reports obtained from customers or dealers. External
sources include information from books, journals, magazines, government censuses, and
information available on the internet. The main advantage of this data aggregation method
is that the data is easy to collect, since it is readily accessible.

The secondary data collection methods, too, can involve both quantitative and qualitative
techniques. Secondary data is easily available and hence less time-consuming and less expensive
to gather than primary data. However, with secondary data collection methods, the authenticity
of the data gathered cannot always be verified.
Collection of Data in Statistics
There are various ways to represent data after gathering it, but the most popular method is to
tabulate the data using tally marks and then present it in a frequency distribution table.
The frequency distribution table is constructed using tally marks. Tally marks are a form of
numerical notation used for counting: vertical lines are used for the counting, and a cross
line placed over four vertical lines gives a total of 5.

Example:
Consider a jar containing pieces of different colours, as shown in the original figure (not
reproduced here). Construct a frequency distribution table for this data.
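
Since the figure is not available, here is a minimal sketch in Python of how such a frequency distribution table can be built from an assumed set of colour observations.

from collections import Counter

# Assumed observations (a stand-in for the colours in the missing figure)
colours = ["red", "blue", "red", "green", "blue", "red", "green", "blue", "blue"]

# Frequency distribution: each distinct value and how often it occurs
freq = Counter(colours)
for colour, count in freq.items():
    print(f"{colour:<6} | {'|' * count:<6} | {count}")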

DATA PREPARATION
Data preparation is the process of gathering, combining, structuring and organizing data so it
can be used in business intelligence (BI), analytics and data visualization applications. The
components of data preparation include data preprocessing, profiling, cleansing, validation and
transformation; it often also involves pulling together data from different internal systems and
external sources.
Data preparation work is done by information technology (IT), BI and data management teams
as they integrate data sets to load into a data warehouse, NoSQL database or data lake
repository, and then when new analytics applications are developed with those data sets. In
addition, data scientists, data engineers, other data analysts and business users increasingly use
self-service data preparation tools to collect and prepare data themselves.
Data preparation is often referred to informally as data prep. It's also known as data wrangling,
although some practitioners use that term in a narrower sense to refer to cleansing, structuring
and transforming data; that usage distinguishes data wrangling from the data pre-
processing stage.
Purposes of data preparation
One of the primary purposes of data preparation is to ensure that raw data being readied for
processing and analysis is accurate and consistent so the results of BI and analytics
applications will be valid. Data is commonly created with missing values, inaccuracies or other
errors, and separate data sets often have different formats that need to be reconciled when
they're combined. Correcting data errors, validating data quality and consolidating data sets are
big parts of data preparation projects.
Data preparation also involves finding relevant data to ensure that analytics applications deliver
meaningful information and actionable insights for business decision-making. The data often
is enriched and optimized to make it more informative and useful -- for example, by blending
internal and external data sets, creating new data fields, eliminating outlier values and
addressing imbalanced data sets that could skew analytics results.
In addition, BI and data management teams use the data preparation process to curate data sets
for business users to analyse. Doing so helps streamline and guide self-service BI applications
for business analysts, executives and workers.
What are the benefits of data preparation?
Data scientists often complain that they spend most of their time gathering, cleansing and
structuring data instead of analysing it. A big benefit of an effective data preparation process
is that they and other end users can focus more on data mining and data analysis -- the parts of
their job that generate business value. For example, data preparation can be done more quickly,
and prepared data can automatically be fed to users for recurring analytics applications.
Done properly, data preparation also helps an organization do the following:
• ensure the data used in analytics applications produces reliable results;
• identify and fix data issues that otherwise might not be detected;
• enable more informed decision-making by business executives and operational workers;
• reduce data management and analytics costs;
• avoid duplication of effort in preparing data for use in multiple applications; and
• get a higher ROI from BI and analytics initiatives.
Effective data preparation is particularly beneficial in big data environments that store a
combination of structured, semi structured and unstructured data, often in raw form until it's
needed for specific analytics uses. Those uses include predictive analytics, machine learning
(ML) and other forms of advanced analytics that typically involve large amounts of data to
prepare. For example, in an article on preparing data for machine learning, Felix Wick,
corporate vice president of data science at supply chain software vendor Blue Yonder, is quoted
as saying that data preparation "is at the heart of ML."
Steps in the data preparation process
Data preparation is done in a series of steps. There's some variation in the data preparation
steps listed by different data professionals and software vendors, but the process typically
involves the following tasks:
1. Data discovery and profiling. Once data has been collected, the first step is to explore it to
better understand what it contains and what needs to be done to prepare it for the intended uses.
To help with that, data profiling identifies patterns, relationships and other attributes in the data,
as well as inconsistencies, anomalies, missing values and other issues so they can be addressed.
What is data profiling?
Data profiling refers to the process of examining, analyzing, reviewing and summarizing data
sets to gain insight into the quality of data. Data quality is a measure of the condition of data
based on factors such as its accuracy, completeness, consistency, timeliness and accessibility.
Additionally, data profiling involves a review of source data to understand the data's structure,
content and interrelationships.
This review process delivers two high-level values to the organization: first, it provides a
high-level view of the quality of its data sets; and second, it helps the organization identify
potential data projects.
Given those benefits, data profiling is an important component of data preparation programs.
By helping organizations identify quality data, it acts as an important precursor to
data processing and data analytics activities.
Moreover, an organization can use data profiling and the insights it produces to continuously
improve the quality of its data and measure the results of that effort.
Data profiling may also be known as data archaeology, data assessment, data discovery or data
quality analysis.
Organizations use data profiling at the beginning of a project to determine if enough data has
been gathered, if any data can be reused or if the project is worth pursuing. The process of data
profiling itself can be based on specific business rules that will uncover how the data set aligns
with business standards and goals.
Types of data profiling
There are three types of data profiling.
• Structure discovery. This focuses on the formatting of the data, making sure everything is
uniform and consistent. It uses basic statistical analysis to return information about the validity
of the data.
• Content discovery. This process assesses the quality of individual pieces of data. For example,
ambiguous, incomplete and null values are identified.
• Relationship discovery. This detects connections, similarities, differences and associations
among data sources.
What are the steps in the data profiling process?
Data profiling helps organizations identify and fix data quality problems before the data is
analyzed, so data professionals aren't dealing with inconsistencies, null values or incoherent
schema designs as they process data to make decisions.
Data profiling statistically examines and analyzes data at its source and when loaded. It also
analyzes the metadata to check for accuracy and completeness.
It typically involves either writing queries or using data profiling tools.
A high-level breakdown of the process is as follows:
1. The first step of data profiling is gathering one or multiple data sources and the associated
metadata for analysis.
2. The data is then cleaned to unify structure, eliminate duplications, identify interrelationships
and find anomalies.
3. Once the data is cleaned, data profiling tools will return various statistics to describe the data
set. This could include the mean, minimum/maximum value, frequency, recurring patterns,
dependencies or data quality risks.
For example, by examining the frequency distribution of different values for each column in a
table, a data analyst could gain insight into the type and use of each column. Cross-column
analysis can be used to expose embedded value dependencies; inter-table analysis allows the
analyst to discover overlapping value sets that represent foreign key relationships between
entities.
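
As a minimal sketch of basic profiling in Python with pandas (a stand-in for dedicated profiling tools), the following checks structure, content and simple cross-column properties of a small illustrative data set.

import pandas as pd

# Illustrative data set standing in for a real source table
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "city":        ["Chennai", "Mumbai", "Mumbai", None, "Delhi"],
    "spend":       [1200.0, 850.5, 850.5, 430.0, None],
})

# Structure discovery: column names, data types, row count
print(df.dtypes)
print(len(df), "rows")

# Content discovery: summary statistics, missing values, frequency distributions
print(df.describe(include="all"))
print(df.isnull().sum())
print(df["city"].value_counts())

# Simple cross-column / relationship checks: duplicates and candidate keys
print("duplicate rows:", df.duplicated().sum())
print("customer_id unique:", df["customer_id"].is_unique)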
Benefits of data profiling
Data profiling returns a high-level overview of data that can result in the following benefits:
• leads to higher-quality, more credible data;
• helps with more accurate predictive analytics and decision-making;
• makes better sense of the relationships between different data sets and sources;
• keeps company information centralized and organized;
• eliminates errors, such as missing values or outliers, that add costs to data-driven projects;
• highlights areas within a system that experience the most data quality issues, such as data
corruption or user input errors; and
• produces insights surrounding risks, opportunities and trends.
Data profiling challenges
Although the objectives of data profiling are straightforward, the actual work involved is quite
complex, with multiple tasks occurring from the ingestion of data through its warehousing.
That complexity is one of the challenges organizations encounter when trying to implement
and run a successful data profiling program.
The sheer volume of data being collected by a typical organization is another challenge, as is
the range of sources -- from cloud-based systems to endpoint devices deployed as part of an
internet-of-things ecosystem -- that produce data.
The speed at which data enters an organization creates further challenges to having a successful
data profiling program.
These data prep challenges are even more significant in organizations that have not adopted
modern data profiling tools and still rely on manual processes for large parts of this work.
On a similar note, organizations that don't have adequate resources -- including trained data
professionals, tools and the funding for them -- will have a harder time overcoming these
challenges.
However, those same elements make data profiling more critical than ever to ensure that the
organization has the quality data it needs to fuel intelligent systems, customer personalization,
productivity-boosting automation projects and more.
Examples of data profiling
Data profiling can be implemented in a variety of use cases where data quality is important.
For example, projects that involve data warehousing or business intelligence may require
gathering data from multiple disparate systems or databases for one report or analysis.
Applying data profiling to these projects can help identify potential issues and corrections that
need to be made in extract, transform and load (ETL) jobs and other data integration processes
before moving forward.
Additionally, data profiling is crucial in data conversion or data migration initiatives that
involve moving data from one system to another. Data profiling can help identify data quality
issues that may get lost in translation or adaptions that must be made to the new system prior
to migration.
The following four methods, or techniques, are used in data profiling:
• column profiling, which assesses tables and quantifies entries in each column;
• cross-column profiling, which features both key analysis and dependency analysis;
• cross-table profiling, which uses key analysis to identify stray data as well as semantic and
syntactic discrepancies; and
• data rule validation, which assesses data sets against established rules and standards to validate
that they're being followed.
Data profiling tools
Data profiling tools replace much, if not all, of the manual effort of this function by discovering
and investigating issues that affect data quality, such as duplication, inaccuracies,
inconsistencies and lack of completeness.
These technologies work by analyzing data sources and linking sources to their metadata to
allow for further investigation into errors.
Furthermore, they offer data professionals quantitative information and statistics around data
quality, typically in tabular and graph formats.
Data management applications, for example, can manage the profiling process through tools
that eliminate errors and apply consistency to data extracted from multiple sources without the
need for hand coding.
Such tools are essential for many, if not most, organizations today as the volume of data they
use for their business activities significantly outpaces even a large team's ability to perform this
function through mostly manual efforts.
Data profile tools also generally include data wrangling, data gap and metadata discovery
capabilities as well as the ability to detect and merge duplicates, check for data similarities and
customize data assessments.
Commercial vendors that provide data profiling capabilities include Datameer, Informatica,
Oracle and SAS. Open source solutions include Aggregate Profiler, Apache Griffin, Quadient
DataCleaner and Talend.
2. Data cleansing. Next, the identified data errors and issues are corrected to create complete and
accurate data sets. For example, as part of cleansing data sets, faulty data is removed or fixed,
missing values are filled in and inconsistent entries are harmonized.
What is data cleansing?
Data cleansing, also referred to as data cleaning or data scrubbing, is the process of fixing
incorrect, incomplete, duplicate or otherwise erroneous data in a data set. It involves identifying
data errors and then changing, updating or removing data to correct them. Data cleansing
improves data quality and helps provide more accurate, consistent and reliable information for
decision-making in an organization.
Data cleansing is a key part of the overall data management process and one of the core
components of data preparation work that readies data sets for use in business intelligence (BI)
and data science applications. It's typically done by data quality analysts and engineers or other
data management professionals. But data scientists, BI analysts and business users may also
clean data or take part in the data cleansing process for their own applications.
Data cleansing vs. data cleaning vs. data scrubbing
Data cleansing, data cleaning and data scrubbing are often used interchangeably. For the most
part, they're considered to be the same thing. In some cases, though, data scrubbing is viewed
as an element of data cleansing that specifically involves removing duplicate, bad, unneeded
or old data from data sets.
Data scrubbing also has a different meaning in connection with data storage. In that context,
it's an automated function that checks disk drives and storage systems to make sure the data
they contain can be read and to identify any bad sectors or blocks.

Why is clean data important?


Business operations and decision-making are increasingly data-driven, as organizations look
to use data analytics to help improve business performance and gain competitive advantages
over rivals. As a result, clean data is a must for BI and data science teams, business executives,
marketing managers, sales reps and operational workers. That's particularly true in retail,
financial services and other data-intensive industries, but it applies to organizations across the
board, both large and small.
If data isn't properly cleansed, customer records and other business data may not be accurate
and analytics applications may provide faulty information. That can lead to flawed business
decisions, misguided strategies, missed opportunities and operational problems, which
ultimately may increase costs and reduce revenue and profits. IBM estimated that data quality
issues cost organizations in the U.S. a total of $3.1 trillion in 2016, a figure that's still widely
cited.
What kind of data errors does data scrubbing fix?
Data cleansing addresses a range of errors and issues in data sets, including inaccurate, invalid,
incompatible and corrupt data. Some of those problems are caused by human error during the
data entry process, while others result from the use of different data structures, formats and
terminology in separate systems throughout an organization.
The types of issues that are commonly fixed as part of data cleansing projects include the
following:
• Typos and invalid or missing data. Data cleansing corrects various structural errors in data
sets. For example, that includes misspellings and other typographical errors, wrong numerical
entries, syntax errors and missing values, such as blank or null fields that should contain data.
• Inconsistent data. Names, addresses and other attributes are often formatted differently from
system to system. For example, one data set might include a customer's middle initial, while
another doesn't. Data elements such as terms and identifiers may also vary. Data cleansing
helps ensure that data is consistent so it can be analyzed accurately.
• Duplicate data. Data cleansing identifies duplicate records in data sets and either removes or
merges them through the use of deduplication measures. For example, when data from two
systems is combined, duplicate data entries can be reconciled to create single records.
• Irrelevant data. Some data -- outliers or out-of-date entries, for example -- may not be relevant
to analytics applications and could skew their results. Data cleansing removes redundant data
from data sets, which streamlines data preparation and reduces the required amount of data
processing and storage resources.
What are the steps in the data cleansing process?
The scope of data cleansing work varies depending on the data set and analytics requirements.
For example, a data scientist doing fraud detection analysis on credit card transaction data may
want to retain outlier values because they could be a sign of fraudulent purchases. But the data
scrubbing process typically includes the following actions:
1. Inspection and profiling. First, data is inspected and audited to assess its quality level and
identify issues that need to be fixed. This step usually involves data profiling, which documents
relationships between data elements, checks data quality and gathers statistics on data sets to
help find errors, discrepancies and other problems.
2. Cleaning. This is the heart of the cleansing process, when data errors are corrected and
inconsistent, duplicate and redundant data is addressed.
3. Verification. After the cleaning step is completed, the person or team that did the work should
inspect the data again to verify its cleanliness and make sure it conforms to internal data quality
rules and standards.
4. Reporting. The results of the data cleansing work should then be reported to IT and business
executives to highlight data quality trends and progress. The report could include the number
of issues found and corrected, plus updated metrics on the data's quality levels.
The cleansed data can then be moved into the remaining stages of data preparation, starting
with data structuring and data transformation, to continue readying it for analytics uses.
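
A minimal sketch of common cleansing steps in Python with pandas is shown below; the columns and rules are assumptions chosen only to illustrate fixing formatting, missing values, invalid entries and duplicates.

import pandas as pd

# Hypothetical raw customer data with typical quality problems
raw = pd.DataFrame({
    "name":  [" Alice ", "BOB", "Bob", None],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
    "age":   [34, -1, 29, 41],
})

clean = raw.copy()
clean["name"] = clean["name"].str.strip().str.title()       # fix inconsistent formatting
clean["name"] = clean["name"].fillna("Unknown")              # handle missing values
clean = clean[(clean["age"] > 0) & (clean["age"] < 120)]     # drop invalid entries
clean = clean.drop_duplicates(subset=["email", "name"])      # remove duplicate records

print(clean)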
Characteristics of clean data
Various data characteristics and attributes are used to measure the cleanliness and overall
quality of data sets, including the following:
• accuracy
• completeness
• consistency
• integrity
• timeliness
• uniformity
• validity
Data management teams create data quality metrics to track those characteristics, as well as
things like error rates and the overall number of errors in data sets. Many also try to calculate
the business impact of data quality problems and the potential business value of fixing them,
partly through surveys and interviews with business executives.
The benefits of effective data cleansing
Done well, data cleansing provides the following business and data management benefits:
• Improved decision-making. With more accurate data, analytics applications can produce
better results. That enables organizations to make more informed decisions on business
strategies and operations, as well as things like patient care and government programs.
• More effective marketing and sales. Customer data is often wrong, inconsistent or out of
date. Cleaning up the data in customer relationship management and sales systems helps
improve the effectiveness of marketing campaigns and sales efforts.
• Better operational performance. Clean, high-quality data helps organizations avoid
inventory shortages, delivery snafus and other business problems that can result in higher costs,
lower revenues and damaged relationships with customers.
• Increased use of data. Data has become a key corporate asset, but it can't generate business
value if it isn't used. By making data more trustworthy, data cleansing helps convince business
managers and workers to rely on it as part of their jobs.
• Reduced data costs. Data cleansing stops data errors and issues from further propagating in
systems and analytics applications. In the long term, that saves time and money, because IT
and data management teams don't have to continue fixing the same errors in data sets.
Data cleansing and other data quality methods are also a key part of data governance programs,
which aim to ensure that the data in enterprise systems is consistent and gets used properly.
Clean data is one of the hallmarks of a successful data governance initiative.
Data cleansing challenges
Data cleansing doesn't lack for challenges. One of the biggest is that it's often time-consuming,
due to the number of issues that need to be addressed in many data sets and the difficulty of
pinpointing the causes of some errors. Other common challenges include the following:
• deciding how to resolve missing data values so they don't affect analytics applications;
• fixing inconsistent data in systems controlled by different business units;
• cleaning up data quality issues in big data systems that contain a mix of structured, semi
structured and unstructured data;
• getting sufficient resources and organizational support; and
• dealing with data silos that complicate the data cleansing process.
Data cleansing tools and vendors
Numerous tools can be used to automate data cleansing tasks, including both commercial
software and open source technologies. Typically, the tools include a variety of functions for
correcting data errors and issues, such as adding missing values, replacing null ones, fixing
punctuation, standardizing fields and combining duplicate records. Many also do data matching
to find duplicate or related records.
Tools that help cleanse data are available in a variety of products and platforms, including the
following:
• specialized data cleaning tools from vendors such as Data Ladder and WinPure;
• data quality software from vendors such as Datactics, Experian, Innovative Systems, Melissa,
Microsoft and Precisely;
• data preparation tools from vendors such as Altair, DataRobot, Tableau, Tibco Software and
Trifacta;
• data management platforms from vendors such as Alteryx, Ataccama, IBM, Informatica, SAP,
SAS, Syniti and Talend;
• customer and contact data management software from vendors such as Redpoint Global,
RingLead, Synthio and Tye;
• tools for cleansing data in Salesforce systems from vendors such as Cloudingo and Plauti; and
• open-source tools, such as DataCleaner and OpenRefine.
3. Data structuring. At this point, the data needs to be modeled and organized to meet the
analytics requirements. For example, data stored in comma-separated values (CSV) files or
other file formats has to be converted into tables to make it accessible to BI and analytics tools.
4. Data transformation and enrichment. In addition to being structured, the data typically must
be transformed into a unified and usable format. For example, data transformation may involve
creating new fields or columns that aggregate values from existing ones. Data enrichment
further enhances and optimizes data sets as needed, through measures such as augmenting and
adding data.
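
As a small illustration of creating new fields that aggregate values from existing ones, the following pandas sketch derives a line total and an order-level total from invented order data.

import pandas as pd

# Hypothetical order-line data
orders = pd.DataFrame({
    "order_id":   [1, 1, 2, 2, 2],
    "item_price": [10.0, 5.5, 3.0, 7.25, 4.0],
    "quantity":   [2, 1, 4, 1, 3],
})

# Transformation: derive a new field from existing columns
orders["line_total"] = orders["item_price"] * orders["quantity"]

# Enrichment/structuring: aggregate to one row per order for analysis
order_totals = orders.groupby("order_id", as_index=False)["line_total"].sum()
print(order_totals)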
What is data transformation?
Data transformation is the process of converting data from one format, such as a database file,
XML document or Excel spreadsheet, into another.
Transformations typically involve converting a raw data source into a cleansed, validated and
ready-to-use format. Data transformation is crucial to data management processes that include
data integration, data migration, data warehousing and data preparation.
The process of data transformation can also be referred to as extract/transform/load (ETL). The
extraction phase involves identifying and pulling data from the various source systems that
create data and then moving the data to a single repository. Next, the raw data is cleansed, if
needed. It's then transformed into a target format that can be fed into operational systems or
into a data warehouse, a data lake or another repository for use in business intelligence and
analytics applications. The transformation may involve converting data types, removing
duplicate data and enriching the source data.
Data transformation is crucial to processes that include data integration, data management, data
migration, data warehousing and data wrangling.
It is also a critical component for any organization seeking to leverage its data to generate
timely business insights. As the volume of data has proliferated, organizations must have an
efficient way to harness data to effectively put it to business use. Data transformation is one
element of harnessing this data, because -- when done properly -- it ensures data is easy to
access, consistent, secure and ultimately trusted by the intended business users.
What are the key steps in data transformation?
The process of data transformation, as noted, involves identifying data sources and types;
determining the structure of transformations that need to occur; and defining how fields will
be changed or aggregated. It includes extracting data from its original source, transforming it
and sending it to the target destination, such as a database or data warehouse. Extractions can
come from many locations, including structured sources, streaming sources or log files from
web applications.
Data analysts, data engineers and data scientists are typically in charge of data transformation
within an organization. They identify the source data, determine the required data formats and
perform data mapping, as well as execute the actual transformation process before moving the
data into appropriate databases for storage and use.
Their work involves five main steps:
1. data discovery, in which data professionals use data profiling tools or profiling
scripts to understand the structure and characteristics of the data and also to
determine how it should be transformed;
2. data mapping, during which data professionals connect, or match, data fields from
one source to data fields in another;
3. code generation, a part of the process where the software code required to
transform the data is created (either by data transformation tools or the data
professionals themselves writing script);
4. execution of the code, where the data undergoes the transformation; and
5. review, during which data professionals or the business/end users confirm that the
output data meets the established transformation requirements and, if not, address
and correct any anomalies and errors.
These steps fall in the middle of the ETL process for organizations that use on-premises
warehouses. However, scalable cloud-based data warehouses have given rise to a slightly
different process called ELT for extract, load, transform; in this process, organizations can
load raw data into data warehouses and then transform data at the time of use.
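
A minimal sketch of the extract, transform and load flow in Python is shown below; the in-memory CSV, column names and the local SQLite database are assumptions standing in for real source systems and a real data warehouse.

import sqlite3
from io import StringIO
import pandas as pd

# Extract: read raw data from a source (an in-memory CSV stands in for a real source system)
raw_csv = StringIO(
    "customer_id,order_date,amount\n"
    "1,2023-01-05,250\n"
    ",2023-01-06,99\n"
    "2,2023-01-07,180\n"
)
raw = pd.read_csv(raw_csv)

# Transform: cleanse and reshape into the target format
transformed = (
    raw.dropna(subset=["customer_id"])                              # drop rows missing a key
       .assign(order_date=lambda d: pd.to_datetime(d["order_date"]))
)

# Load: write the prepared data into a target repository (a local SQLite database here)
with sqlite3.connect("warehouse.db") as conn:
    transformed.to_sql("sales", conn, if_exists="replace", index=False)
    print(pd.read_sql("SELECT * FROM sales", conn))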
What are the benefits and challenges of data transformation?
Organizations across the board need to analyze their data for a host of business operations,
from customer service to supply chain management. They also need data to feed the increasing
number of automated and intelligent systems within their enterprise.
To gain insight into and improve these operations, organizations need high-quality data in
formats compatible with the systems consuming the data.
Thus, data transformation is a critical component of an enterprise data program because it
delivers the following benefits:
• higher data quality;
• reduced number of mistakes, such as missing values;
• faster queries and retrieval times;
• fewer resources needed to manipulate data;
• better data organization and management; and
• more usable data, especially for advanced business intelligence or analytics.
The data transformation process, however, can be complex and complicated. The challenges
organizations face include the following:
• high cost of transformation tools and professional expertise;
• significant compute resources, with the intensity of some on-premises
transformation processes having the potential to slow down other operations;
• difficulty recruiting and retaining the skilled data professionals required for this
work, with data professionals some of the most in-demand workers today; and
• difficulty of properly aligning data transformation activities to the business's data-
related priorities and requirements.
Reasons to do data transformation
Organizations must be able to mine their data for insights in order to successfully compete in
the digital marketplace, optimize operations, cut costs and boost productivity. They also require
data to feed systems that use artificial intelligence, machine learning, natural language
processing and other advanced technologies.
To gain accurate insights and to ensure accurate operations of intelligent systems, organizations
must collect data and merge it from multiple sources and ensure that integrated data is high
quality.
This is where data transformation plays the star role, by ensuring that data collected from one
system is compatible with data from other systems and that the combined data is ultimately
compatible for use in the systems that require it. For example, databases might need to be
combined following a corporate acquisition, transferred to a cloud data warehouse or merged
for analysis.
Examples of data transformation
There are various data transformation methods, including the following:
• aggregation, in which data is collected from multiple sources and stored in a single
format;
• attribute construction, in which new attributes are added or created from existing
attributes;
• discretization, which involves converting continuous data values into sets of data
intervals with specific values to make the data more manageable for analysis;
• generalization, where low-level data attributes are converted into high-level data
attributes (for example, converting data from multiple brackets broken up by ages
into the more general "young" and "old" attributes) to gain a more comprehensive
view of the data;
• integration, a step that involves combining data from different sources into a single
view;
• manipulation, where the data is changed or altered to make it more readable and
organized;
• normalization, a process that converts source data into another format to limit the
occurrence of duplicated data; and
• smoothing, which uses algorithms to reduce "noise" in data sets, thereby helping
to more efficiently and effectively identify trends in the data.
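As a rough illustration of a few of these methods (aggregation, discretization, normalization and smoothing), the following Python/pandas sketch works on an invented table; the column names, bin edges and window size are assumptions, not part of any particular dataset.

import pandas as pd

# Hypothetical sales records; the schema is assumed purely for illustration.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "age": [23, 67, 45, 31],
    "amount": [120.0, 80.0, 200.0, 150.0],
})

# Aggregation: collapse individual rows into one summary figure per region.
totals = df.groupby("region", as_index=False)["amount"].sum()

# Discretization / generalization: convert continuous ages into "young"/"old" buckets.
df["age_group"] = pd.cut(df["age"], bins=[0, 40, 120], labels=["young", "old"])

# Normalization (min-max scaling): rescale amounts into the 0-1 range.
amin, amax = df["amount"].min(), df["amount"].max()
df["amount_scaled"] = (df["amount"] - amin) / (amax - amin)

# Smoothing: a rolling mean reduces noise in an ordered series.
df["amount_smoothed"] = df["amount"].rolling(window=2, min_periods=1).mean()

print(totals)
print(df)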
Data transformation tools
Data professionals have a number of tools at their disposal to support the ETL process. These
technologies automate many of the steps within data transformation, replacing much, if not all,
of the manual scripting and hand coding that had been a major part of the data transformation
process.
Both commercial and open-source data transformation tools are available, with some options
designed for on-premises transformation processes and others catering to cloud-based
transformation activities.
Moreover, some data transformation tools are focused on the data transformation process itself,
handling the string of actions required to transform data. However, other ETL tools on the
market are part of platforms that offer a broad range of capabilities for managing enterprise
data.
Options include IBM InfoSphere DataStage, Matillion, SAP Data Services and Talend.
5. Data validation and publishing. In this last step, automated routines are run against the data
to validate its consistency, completeness and accuracy. The prepared data is then stored in a
data warehouse, a data lake or another repository and either used directly by whoever prepared
it or made available for other users to access.
What is data validation?
Data validation is the practice of checking the integrity, accuracy and structure of data before
it is used for a business operation. Validated data can then be used for data analytics, business intelligence or training a machine learning model. Validation can also be used to ensure the integrity of data for financial accounting or regulatory compliance.
Data can be examined as part of a validation process in a variety of ways, including data type,
constraint, structured, consistency and code validation. Each type of data validation is designed
to make sure the data meets the requirements to be useful.
Data validation is related to data quality. Data validation can be a component to measure data
quality, which ensures that a given data set is supplied with information sources that are of the
highest quality, authoritative and accurate.
Data validation is also used as part of application workflows, including spell checking and rules
for strong password creation.
Why validate data?
For data scientists, data analysts and others working with data, validating it is very important.
The output of any given system can only be as good as the data the operation is based on. These
operations can include machine learning or artificial intelligence models, data analytics reports
and business intelligence dashboards. Validating the data ensures that the data is accurate, which means the systems that rely on the validated data set will be accurate as well.
Data validation is also important for data to be useful for an organization or for a specific
application operation. For example, if data is not in the right format to be consumed by a
system, then the data can't be used easily, if at all.
As data moves from one location to another, different needs for the data arise based on the
context for how the data is being used. Data validation ensures that the data is correct for
specific contexts. The right type of data validation makes the data useful.
What are the different types of data validation?
Multiple types of data validation are available to ensure that the right data is being used. The
most common types of data validation include the following (a few of these checks are sketched in code after the list):
• Data type validation is common and confirms that the data in each field, column,
list, range or file matches a specified data type and format.
• Constraint validation checks to see if a given data field input fits a specified
requirement within certain ranges. For example, it verifies that a data field has a
minimum or maximum number of characters.
• Structured validation ensures that data is compliant with a specified data format,
structure or schema.
• Consistency validation makes sure data styles are consistent. For example, it
confirms that all values are listed to two decimal points.
• Code validation is similar to a consistency check and confirms that codes used for
different data inputs are correct. For example, it checks a country code or North
American Industry Classification System (NAICS) codes.
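A minimal sketch of a few of these checks in Python/pandas is shown below; the table, the allowed range and the list of valid country codes are invented for illustration only.

import pandas as pd

# Hypothetical order records; the schema and values are assumed for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "country_code": ["US", "IN", "XX"],
    "order_total": [19.99, -5.00, 250.00],
})

# Data type validation: the order_total column must be numeric.
assert pd.api.types.is_numeric_dtype(orders["order_total"]), "order_total must be numeric"

# Constraint validation: totals must fall within an assumed allowed range.
out_of_range = orders[(orders["order_total"] < 0) | (orders["order_total"] > 10_000)]

# Code validation: country codes must come from an approved list.
valid_codes = {"US", "IN", "GB"}
bad_codes = orders[~orders["country_code"].isin(valid_codes)]

# Consistency validation: list all monetary values to two decimal points.
orders["order_total"] = orders["order_total"].round(2)

print("Out-of-range rows:\n", out_of_range)
print("Rows with invalid country codes:\n", bad_codes)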
How to perform data validation
Among the most basic and common ways that data is used is within a spreadsheet program
such as Microsoft Excel or Google Sheets. In both Excel and Sheets, the data validation process
is a straightforward, integrated feature. Excel and Sheets both have a menu item listed as Data
> Data Validation. By selecting the Data Validation menu, a user can choose the specific data
type or constraint validation required for a given file or data range.
ETL (Extract, Transform and Load) and data integration tools typically integrate data
validation policies to be executed as data is extracted from one source and then loaded into
another. Popular open source tools, such as dbt, also include data validation options and are
commonly used for data transformation.
Data validation can also be done programmatically in an application context for an input value.
For example, as an input variable is sent, such as a password, it can be checked by a script to
make sure it meets constraint validation for the right length.
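A minimal sketch of such a programmatic check is given below; the length limits and character rules are assumptions chosen for illustration, not a standard policy.

import re

def validate_password(password: str, min_length: int = 8, max_length: int = 64) -> bool:
    """Constraint validation for an input value: length range plus at least one digit and one letter."""
    if not (min_length <= len(password) <= max_length):
        return False
    if not re.search(r"[0-9]", password):
        return False
    if not re.search(r"[A-Za-z]", password):
        return False
    return True

print(validate_password("abc"))         # False: too short
print(validate_password("s3cretPass"))  # True: meets the assumed constraints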
Data preparation can also incorporate or feed into data curation work that creates and oversees
ready-to-use data sets for BI and analytics. Data curation involves tasks such as indexing,
cataloging and maintaining data sets and their associated metadata to help users find and access
the data. In some organizations, data curator is a formal role that works collaboratively with
data scientists, business analysts, other users and the IT and data management teams. In others,
data may be curated by data stewards, data engineers, database administrators or data scientists
and business users themselves.
What are the challenges of data preparation?
Data preparation is inherently complicated. Data sets pulled together from different source
systems are highly likely to have numerous data quality, accuracy and consistency issues to
resolve. The data also must be manipulated to make it usable, and irrelevant data needs to be
weeded out. As noted above, it's a time-consuming process: The 80/20 rule is often applied to
analytics applications, with about 80% of the work said to be devoted to collecting and
preparing data and only 20% to analyzing it.
In an article on common data preparation challenges, Rick Sherman, managing partner of
consulting firm Athena IT Solutions, detailed the following seven challenges along with advice
on how to overcome each of them:
• Inadequate or non-existent data profiling. If data isn't properly profiled, errors, anomalies
and other problems might not be identified, which can result in flawed analytics.
• Missing or incomplete data. Data sets often have missing values and other forms of
incomplete data; such issues need to be assessed as possible errors and addressed if so.
• Invalid data values. Misspellings, other typos and wrong numbers are examples of invalid
entries that frequently occur in data and must be fixed to ensure analytics accuracy.
• Name and address standardization. Names and addresses may be inconsistent in data from
different systems, with variations that can affect views of customers and other entities.
• Inconsistent data across enterprise systems. Other inconsistencies in data sets drawn from
multiple source systems, such as different terminology and unique identifiers, are also a
pervasive issue in data preparation efforts.
• Data enrichment. Deciding how to enrich a data set -- for example, what to add to it -- is a
complex task that requires a strong understanding of business needs and analytics goals.
• Maintaining and expanding data prep processes. Data preparation work often becomes a
recurring process that needs to be sustained and enhanced on an ongoing basis.
HYPOTHESIS GENERATION
Data scientists work with data sets small and large, and are tellers of stories. These stories have
entities, properties and relationships, all described by data. Their apparatus and methods give data scientists opportunities to identify, consolidate and validate hypotheses with data, and to use these hypotheses as starting points for their data narratives. Hypothesis generation is a
key challenge for data scientists. Hypothesis generation and by extension hypothesis
refinement constitute the very purpose of data analysis and data science.
Hypothesis generation for a data scientist can take numerous forms, such as:
1. They may be interested in the properties of a certain stream of data or a certain
measurement. These properties and their default or exceptional values may form a
certain hypothesis.
2. They may be keen on understanding how a certain measure has evolved over time. In
trying to understand this evolution of a system’s metric, or a person’s behaviour, they
could rely on a mathematical model as a hypothesis.
3. They could consider the impact of some properties on the states of systems, interactions
and people. In trying to understand such relationships between different measures and
properties, they could construct machine learning models of different kinds.
Ultimately, the purpose of such hypothesis generation is to simplify some aspect of system
behaviour and represent such behaviour in a manner that’s tangible and tractable based on
simple, explicable rules. This makes story-telling easier for data scientists when they become
new-age raconteurs, straddling data visualisations, dashboards with data summaries and
machine learning models.
Understanding Hypothesis Generation:
The importance of hypothesis generation in data science teams is manifold:
1. Hypothesis generation allows the team to experiment with theories about
the data.
2. Hypothesis generation can allow the team to take a systems-thinking
approach to the problem to be solved.
3. Hypothesis generation allows us to build more sophisticated models based
on prior hypotheses and understanding.
When data science teams approach complex projects, some of them may be wont to dive right into building complex systems based on available resources, libraries and software. By taking a hypothesis-centred view of the data science problem, they can instead build up complexity and understanding in a very natural way, developing hypotheses and ideas in the process.
What is Hypothesis Generation?
Hypothesis generation is an educated “guess” about the various factors that are impacting the business problem to be solved using machine learning. When framing a hypothesis, the data scientist does not yet know its outcome; the hypothesis is not yet supported by any evidence.
“A hypothesis may be simply defined as a guess. A scientific hypothesis is an
intelligent guess.” – Isaac Asimov
Hypothesis generation is a crucial step in any data science project. If you skip this
or skim through this, the likelihood of the project failing increases exponentially.
Hypothesis Generation vs. Hypothesis Testing
Hypothesis generation is a process beginning with an educated guess
whereas hypothesis testing is a process to conclude that the educated guess is
true/false or the relationship between the variables is statistically significant or
not.
The latter can then support further research with statistical evidence. A hypothesis is accepted or rejected based on the significance level and the test statistic of the test used to evaluate it.
How Does Hypothesis Generation Help?
Here are 5 key reasons why hypothesis generation is so important in data science:
• Hypothesis generation helps in comprehending the business problem as we
dive deep in inferring the various factors affecting our target variable
• You will get a much better idea of what are the major factors that are
responsible to solve the problem
• It helps identify the data that needs to be collected from various sources, which is key in converting your business problem into a data science problem
• Improves your domain knowledge if you are new to the domain as you
spend time understanding the problem
• Helps to approach the problem in a structured manner
When Should you Perform Hypothesis Generation?
The million-dollar question – when in the world should you perform hypothesis
generation?
• The hypothesis generation should be made before looking at the dataset or
collection of the data
• You will notice that if you have done your hypothesis generation
adequately, you would have included all the variables present in the dataset
in your hypothesis generation
• You might also have included variables that are not present in the dataset
Case Study: Hypothesis Generation on “RED Taxi Trip Duration Prediction”
Let us now look at the “RED TAXI TRIP DURATION PREDICTION” problem
statement and generate a few hypotheses that would affect our taxi trip duration to
understand hypothesis generation.
Here’s the problem statement:
To predict the duration of a trip so that the company can assign the cabs that are free
for the next trip. This will help in reducing the wait time for customers and will also
help in earning customer trust.
Let’s begin!
Hypothesis Generation Based On Various Factors
1. Distance/Speed based Features
Let us try to come up with a formula that would have a relation with trip duration
and would help us in generating various hypotheses for the problem:
TIME=DISTANCE/SPEED
Distance and speed play an important role in predicting the trip duration.
We can notice that the trip duration is directly proportional to the distance travelled and inversely proportional to the speed of the taxi. Using this, we can come up with hypotheses based on distance and speed (a small feature sketch follows these hypotheses).
• Distance: More the distance travelled by the taxi, the more will be the trip
duration.
• Interior drop point: Drop points to congested or interior lanes could result
in an increase in trip duration
• Speed: Higher the speed, the lower the trip duration
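As a small, hedged feature sketch for the distance hypothesis, a straight-line (haversine) distance between pickup and drop-off coordinates is one commonly used proxy; the coordinates below are made up and the function is illustrative, not part of the dataset.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Invented pickup and drop-off points for one trip.
trip_distance = haversine_km(40.7580, -73.9855, 40.6413, -73.7781)
print(round(trip_distance, 2), "km")  # the hypothesis: longer distances mean longer trip durations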
2. Features based on Car
Cars are of various types, sizes, brands, and these features of the car could be vital
for commute not only on the basis of the safety of the passengers but also for the trip
duration. Let us now generate a few hypotheses based on the features of the car.
• Condition of the car: Good conditioned cars are unlikely to have breakdown
issues and could have a lower trip duration
• Car Size: Small-sized cars (Hatchback) may have a lower trip duration and
larger-sized cars (XUV) may have higher trip duration based on the size of
the car and congestion in the city
3. Type of the Trip
Trip types can be different based on trip vendors – it could be an outstation trip,
single or pool rides. Let us now define a hypothesis based on the type of trip used.
• Pool Car: Trips with pooling can lead to higher trip duration as the car
reaches multiple places before reaching your assigned destination
4. Features based on Driver Details
A driver is an important person when it comes to commute time. Various factors
about the driver can help in understanding the reasons behind trip duration; here are a few hypotheses based on them.
• Age of driver: Older drivers could be more careful and could contribute to
higher trip duration
• Gender: Female drivers are likely to drive slowly and could contribute to
higher trip duration
• Driver experience: Drivers with very less driving experience can cause
higher trip duration
• Medical condition: Drivers with a medical condition can contribute to higher
trip duration
5. Passenger details
Passengers can influence the trip duration knowingly or unknowingly. We usually
come across passengers requesting drivers to increase the speed as they are getting
late and there could be other factors to hypothesize which we can look at.
• Age of passengers: Senior citizens as passengers may contribute to higher
trip duration as drivers tend to go slow in trips involving senior citizens
• Medical conditions or pregnancy: Passengers with medical conditions
contribute to a longer trip duration
• Emergency: Passengers with an emergency could contribute to a shorter trip
duration
• Passenger count: Higher passenger count leads to shorter duration trips due
to congestion in seating
6. Date-Time Features
The day of the week and the time of day are important as New York is a busy city and could
be highly congested during office hours or weekdays. Let us now generate a few
hypotheses on the date and time-based features.
Pickup Day:
• Weekends could contribute to more outstation trips and could have a higher
trip duration
• Weekdays tend to have higher trip duration due to high traffic
• If the pickup day falls on a holiday then the trip duration may be shorter
• If the pickup day falls on a festive week then the trip duration could be lower
due to lesser traffic
Time:
• Early morning trips have a lesser trip duration due to lesser traffic
• Evening trips have a higher trip duration due to peak hours
7. Road-based Features
Roads are of different types and the condition of the road or obstructions in the road
are factors that can’t be ignored. Let’s form some hypotheses based on these factors.
• Condition of the road: The duration of the trip is more if the condition of the
road is bad
• Road type: Trips in concrete roads tend to have a lower trip duration
• Strike on the road: Strikes carried out on roads in the direction of the trip
causes the trip duration to increase
8. Weather Based Features
Weather can change at any time and could possibly impact the commute if the
weather turns bad. Hence, this is an important feature to consider in our hypothesis.
• Weather at the start of the trip: Rainy weather condition contributes to a
higher trip duration
After writing down our hypotheses and looking at the dataset, you will notice that you have covered most of the features present in the data set. There is also a possibility that you might have to work with fewer features because some features on which you have generated hypotheses are not currently being captured/stored by the business and are not available.
Always go ahead and capture data from external sources if you think that the data
is relevant for your prediction. Ex: Getting weather information
It is also important to note that since hypothesis generation is an estimated guess,
the hypothesis generated could come out to be true or false once exploratory data
analysis and hypothesis testing is performed on the data.
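For example, the weekday-versus-weekend hypothesis from the date-time features above could later be tested with a two-sample t-test; the trip-duration arrays below are randomly generated solely to show the mechanics, not real taxi data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Invented trip durations in minutes for weekday and weekend trips.
weekday = rng.normal(loc=28, scale=6, size=200)
weekend = rng.normal(loc=24, scale=6, size=200)

t_stat, p_value = stats.ttest_ind(weekday, weekend, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value would support the hypothesis that weekday trips take longer.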
MODELING:
After all the cleaning, formatting and feature selection, we will now feed the
data to the chosen model. But how does one select a model to use?
How to choose a model?
IT DEPENDS. It all depends on what the goal of your task or project is, and this should already have been identified in the Business Understanding phase. The steps below, followed by a short illustrative sketch, can guide the choice.
Steps in choosing a model
1. Determine size of training data — if you have a small dataset with a small number of observations and a high number of features, you can choose high bias/low variance algorithms (Linear Regression, Naïve Bayes, Linear SVM). If your dataset is large and has a high number of observations compared to the number of features, you can choose low bias/high variance algorithms (KNN, Decision trees).
2. Accuracy and/or interpretability of the output — if your goal is inference, choose
restrictive models as it is more interpretable (Linear Regression, Least Squares). If your
goal is higher accuracy, then choose flexible models (Bagging, Boosting, SVM).
3. Speed or training time — always remember that higher accuracy as well as large
datasets means higher training time. Examples of easy to run and to implement
algorithms are: Naïve Bayes, Linear and Logistic Regression. Some examples
of algorithms that need more time to train are: SVM, Neural Networks, and Random
Forests.
4. Linearity — first check the linearity of your data by fitting a linear model or by trying to run a logistic regression; you can also check the residual errors. Higher errors mean that the data is not linear and needs complex algorithms to fit. If data is linear,
you can choose: Linear Regression, Logistic Regression, Support Vector Machines. If
Non-linear: Kernel SVM, Random Forest, Neural Nets.
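The short sketch below illustrates the idea of comparing a high-bias linear model against a more flexible one with cross-validation; the data is synthetic and stands in for a real prepared dataset, so the scores themselves mean nothing.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 3))                          # synthetic features
y = 2 * X[:, 0] + 5 * np.sin(X[:, 1]) + rng.normal(0, 1, 500)  # mildly non-linear target

for name, model in [("Linear Regression", LinearRegression()),
                    ("Random Forest", RandomForestRegressor(n_estimators=100, random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")

If the flexible model does not clearly beat the simpler one, the simpler and more interpretable model is usually preferred.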
Parametric vs. Non-Parametric Machine Learning Models
Parametric Machine Learning Algorithms
Parametric ML Algorithms are algorithms that simplify the mapping function to a known form. They are often called the “Linear ML Algorithms”.
Parametric ML Algorithms
• Logistic Regression
• Linear Discriminant Analysis
• Perceptron
• Naïve Bayes
• Simple Neural Networks
Benefits of Parametric ML Algorithms
• Simpler — easy to understand methods and easy to interpret results
• Speed — very fast to learn from the data provided
• Less data — it does not require as much training data
Limitations of Parametric ML Algorithms
• Limited Complexity —suited only to simpler problems
• Poor Fit — the methods are unlikely to match the underlying mapping function
Non-Parametric Machine Learning Algorithms
Non-Parametric ML Algorithms are algorithms that do not make assumptions about the form of the mapping function. They are a good choice when you have a lot of data, no prior knowledge of the underlying relationship, and you don’t want to worry too much about choosing the right features.
Non-Parametric ML Algorithms
• K-Nearest Neighbors (KNN)
• Decision Trees like CART
• Support Vector Machines (SVM)
Benefits of Non-Parametric ML Algorithms
• Flexibility— it is capable of fitting a large number of functional forms
• Power — do not assume about the underlying function
• Performance — able to give a higher performance model for predictions
Limitations of Non-Parametric ML Algorithms
• Needs more data — requires a large training dataset
• Slower processing — they often have more parameters which means that training time
is much longer
• Overfitting — higher risk of overfitting the training data and results are harder to
explain why specific predictions were made
In the process flow above, Data Modeling is broken down into four tasks, each with its projected outcome or output. Simply put, the tasks of the Data Modeling phase are:
1. Selecting modeling techniques
The wonderful world of data mining offers lots of modeling techniques, but
not all of them will suit your needs. Narrow the list based on the kinds of
variables involved, the selection of techniques available in your tools, and
any business considerations that are important to you.
For example, many organizations favour methods with output that’s easy to
interpret, so decision trees or logistic regression might be acceptable, but
neural networks would probably not be accepted.
Deliverables for this task include two reports:
• Modeling technique: Specify the technique(s) that you will use.
• Modeling assumptions: Many modeling techniques are based on
certain assumptions. For example, a model type may be intended for
use with data that has a specific type of distribution. Document these
assumptions in this report.
2. Designing tests
The test in this task is the test that you’ll use to determine how well your model works. It may
be as simple as splitting your data into a group of cases for model training and another group
for model testing.
Training data is used to fit mathematical forms to the data model, and test data is used during
the model-training process to avoid overfitting: making a model that’s perfect for one dataset,
but no other. You may also use holdout data, data that is not used during the model-training
process, for an additional test.
The deliverable for this task is your test design. It need not be elaborate, but you should at least
take care that your training and test data are similar and that you avoid introducing any bias
into the data.
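A minimal sketch of such a test design, using a simple train/test split plus a holdout set, is shown below; the placeholder data and the 60/20/20 proportions are assumptions, not a rule.

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 5))   # placeholder features standing in for prepared data
y = rng.normal(size=1000)        # placeholder target

# First carve out a holdout set that is never touched during model training.
X_temp, X_holdout, y_temp, y_holdout = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the remainder into training and test data (60% / 20% of the original overall).
X_train, X_test, y_train, y_test = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_test), len(X_holdout))  # 600 200 200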
3. Building model(s)
Modeling is what many people imagine to be the whole job of the data miner, but it’s just one
task of dozens! Nonetheless, modeling to address specific business goals is the heart of the
data-mining profession.
Deliverables for this task include three items:
• Parameter settings: When building models, most tools give you the option of
adjusting a variety of settings, and these settings have an impact on the structure of the
final model. Document these settings in a report.
• Model descriptions: Describe your models. State the type of model (such as linear
regression or neural network) and the variables used. Explain how the model is
interpreted. Document any difficulties encountered in the modeling process.
• Models: This deliverable is the models themselves. Some model types can be easily
defined with a simple equation; others are far too complex and must be transmitted in
a more sophisticated format.
4. Assessing model(s)
Now you will review the models that you’ve created, from a technical standpoint and also from
a business standpoint (often with input from business experts on your project team).
Deliverables for this task include two reports:
• Model assessment: Summarizes the information developed in your model review. If
you have created several models, you may rank them based on your assessment of their
value for a specific application.
• Revised parameter settings: You may choose to fine-tune settings that were used to
build the model and conduct another round of modeling and try to improve your results.
VALIDATION:
Why data validation?
Data validation happens immediately after data preparation/wrangling and before modeling. This is because during data preparation there is a high possibility of things going wrong, especially in complex scenarios.
Data validation ensures that modeling happens on the right data: faulty data as input to the model would generate faulty insights!
How is data validation done?
Data validation should be done by involving at least one external person who has a proper understanding of the data and the business. It is usually the client who is technically capable of checking the data. Once data preparation is complete and just before data modeling, the newly prepared data is usually visualized and handed over to the client.
The client, with the help of SQL queries or other tools, then tries to validate that the output contains no errors.
Combining CRISP-DM/ASUM-DM with the agile methodology, steps can be taken in parallel, meaning you do not have to wait for the green light from data validation to start modeling. But once you get feedback from the domain expert that there are faults in the data, you need to correct them by re-doing the data preparation and re-modeling the data.
What are the common causes leading to a faulty output from data preparation?
Common causes are:
1. Lack of proper understanding of the data, therefore, the logic of the data preparation
is not correct.
2. Common bugs in programming/data preparation pipeline that led to a faulty output.
EVALUATION:
The evaluation phase includes three tasks. These are
• Evaluating results
• Reviewing the process
• Determining the next steps
Task: Evaluating results
At this stage, you’ll assess the value of your models for meeting the business goals that started
the data-mining process. You’ll look for any reasons why the model would not be satisfactory
for business use. If possible, you’ll test the model in a practical application, to determine
whether it works as well in the workplace as it did in your tests.
Deliverables for this task include two items:
• Assessment of results (for business goals): Summarize the results with respect to the
business success criteria that you established in the business-understanding phase.
Explicitly state whether you have reached the business goals defined at the start of the
project.
• Approved models: These include any models that meet the business success criteria.
Task: Reviewing the process
Now that you have explored data and developed models, take time to review your process. This
is an opportunity to spot issues that you might have overlooked and that might draw your
attention to flaws in the work that you’ve done while you still have time to correct the problem
before deployment. Also consider ways that you might improve your process for future
projects.
The deliverable for this task is the review of process report. In it, you should outline your
review process and findings and highlight any concerns that require immediate attention, such
as steps that were overlooked or that should be revisited.
Task: Determining the next steps
The evaluation phase concludes with your recommendations for the next move. The model
may be ready to deploy, or you may judge that it would be better to repeat some steps and try
to improve it. Your findings may inspire new data-mining projects.
Deliverables for this task include two items:
• List of possible actions: Describe each alternative action, along with the strongest
reasons for and against it.
• Decision: State the final decision on each possible action, along with the reasoning
behind the decision.
INTERPRETATION
Data interpretation is the process of assigning meaning to the collected information and determining the conclusions, significance, and implications of the findings.
Data Interpretation Examples
Data interpretation is the final step of data analysis. This is where you turn results into
actionable items. To better understand it, consider this instance of interpreting data:
Let's say a company has segmented its user base into four age groups, so it can notice which age group is most engaged with its content or product. Based on bar charts or pie charts, it can either develop a marketing strategy to make the product more appealing to non-involved groups or develop an outreach strategy that expands on its core user base.
Steps Of Data Interpretation
Data interpretation is conducted in 4 steps:
• Assembling the information you need (like bar graphs and pie charts);
• Developing findings or isolating the most relevant inputs;
• Developing conclusions;
• Coming up with recommendations or actionable solutions.
Considering how these findings dictate the course of action, data analysts must be accurate
with their conclusions and examine the raw data from multiple angles. Different variables may
allude to various problems, so having the ability to backtrack data and repeat the analysis
using different templates is an integral part of a successful business strategy.
What Should Users Question During Data Interpretation?
To interpret data accurately, users should be aware of potential pitfalls present within this
process. You need to ask yourself if you are mistaking correlation for causation. If two things
occur together, it does not indicate that one caused the other.
The 2nd thing you need to be aware of is your own confirmation bias. This occurs when you
try to prove a point or a theory and focus only on the patterns or findings that support that
theory while discarding those that do not.
The 3rd problem is irrelevant data. To be specific, you need to make sure that the data you
have collected and analyzed is relevant to the problem you are trying to solve.
Data Interpretation Methods
Data analysts or data analytics tools help people make sense of the numerical data that has been
aggregated, transformed, and displayed. There are two main methods for data interpretation:
quantitative and qualitative.
Qualitative Data Interpretation Method
This is a method for breaking down or analyzing so-called qualitative data, also known as
categorical data. It is important to note that no bar graphs or line charts are used in this method.
Instead, they rely on text. Because qualitative data is collected through person-to-person
techniques, it isn't easy to present using a numerical approach.
Surveys are used to collect data because they allow you to assign numerical values to answers,
making them easier to analyze. If we rely solely on the text, it would be a time-consuming and
error-prone process. This is why it must be transformed.
Quantitative Data Interpretation Method
This data interpretation is applied when we are dealing with quantitative or numerical data.
Since we are dealing with numbers, the values can be displayed in a bar chart or pie chart.
There are two main types: Discrete and Continuous. Moreover, numbers are easier to analyze
since they involve statistical modeling techniques like mean and standard deviation.
Mean is an average value of a particular data set obtained or calculated by dividing the sum of
the values within that data set by the number of values within that same set.
Standard Deviation is a technique used to ascertain how responses align with or deviate from the average value or mean. It relies on the mean to describe the consistency of the replies within a particular data set. You can use this when calculating the average pay for a certain profession and then displaying the upper and lower values in the data set.
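A minimal sketch of both statistics on an invented salary sample (values in thousands; not real survey data):

import statistics

salaries = [42, 48, 51, 55, 58, 61, 67, 72]   # invented annual salaries for one profession

mean_salary = statistics.mean(salaries)
std_salary = statistics.stdev(salaries)       # sample standard deviation

print(f"Mean: {mean_salary:.1f}k")
print(f"Typical range: {mean_salary - std_salary:.1f}k to {mean_salary + std_salary:.1f}k")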
As stated, some tools can do this automatically, especially when it comes to quantitative data.
Whatagraph is one such tool as it can aggregate data from multiple sources using different
system integrations. It will also automatically organize and analyze that data, which can later be displayed in pie charts, line charts, or bar charts, however you wish.
Benefits Of Data Interpretation
Multiple data interpretation benefits explain its significance within the corporate world,
medical industry, and financial industry:
Informed decision-making. The managing board must examine the data to take action and
implement new methods. This emphasizes the significance of well-analyzed data as well as a
well-structured data collection process.
Anticipating needs and identifying trends. Data analysis provides users with relevant
insights that they can use to forecast trends. It would be based on customer concerns and
expectations.
For example, a large number of people are concerned about privacy and the leakage of personal
information. Products that provide greater protection and anonymity are more likely to become
popular.
Clear foresight. Companies that analyze and aggregate data better understand their own
performance and how consumers perceive them. This provides them with a better
understanding of their shortcomings, allowing them to work on solutions that will significantly
improve their performance.
DEPLOYMENT AND ITERATIONS:
The deployment phase includes four tasks. These are
• Planning deployment (your methods for integrating data-mining discoveries into use)
• Planning monitoring and maintenance
• Reporting final results
• Reviewing final results
Task: Planning deployment
When your model is ready to use, you will need a strategy for putting it to work in your
business.
The deliverable for this task is the deployment plan. This is a summary of your strategy for
deployment, the steps required, and the instructions for carrying out those steps.
Task: Planning monitoring and maintenance
Data-mining work is a cycle, so expect to stay actively involved with your models as they are
integrated into everyday use.
The deliverable for this task is the monitoring and maintenance plan. This is a summary of your
strategy for ongoing review of the model’s performance. You’ll need to ensure that it is being
used properly on an ongoing basis, and that any decline in model performance will be detected.
Task: Reporting final results
Deliverables for this task include two items:
• Final report: The final report summarizes the entire project by assembling all the
reports created up to this point, and adding an overview summarizing the entire project
and its results.
• Final presentation: A summary of the final report is presented in a meeting with
management. This is also an opportunity to address any open questions.
Task: Review project
Finally, the data-mining team meets to discuss what worked and what didn’t, what would be
good to do again, and what should be avoided!
This step, too, has a deliverable, although it is only for the use of the data-mining team, not the
manager (or client). It’s the experience documentation report.
This is where you should outline any work methods that worked particularly well, so that they
are documented to use again in the future, and any improvements that might be made to your
process. It’s also the place to document problems and bad experiences, with your
recommendations for avoiding similar problems in the future.
Iterations are done to upgrade the performance of the system.
The outcomes of the decisions and actions and the conclusions drawn from the model are documented and updated in the database. This helps in changing and upgrading the performance of the existing system.
Some queries are updated in the database, such as “Were the decision and action impactful?”, “What was the return on investment?” and “How did the analysis group compare with the control group?”. The performance-based database is continuously updated once new insight or knowledge is extracted.
Decision Support Systems (DSS)
Decision Support Systems (DSS) help executives make better decisions by using
historical and current data from internal Information Systems and external sources. By
combining massive amounts of data with sophisticated analytical models and tools,
and by making the system easy to use, they provide a much better source of
information to use in the decision-making process.
Decision Support Systems (DSS) are a class of computerized information systems that
support decision-making activities. DSS are interactive computer-based systems and
subsystems intended to help decision makers use communications technologies, data,
documents, knowledge and/or models to successfully complete decision process
tasks.
While many people think of decision support systems as a specialized part of a
business, most companies have actually integrated this system into their day to day
operating activities. For instance, many companies constantly download and analyze
sales data, budget sheets and forecasts and they update their strategy once they
analyze and evaluate the current results. Decision support systems have a definite
structure in businesses, but in reality, the data and decisions that are based on it are
fluid and constantly changing.
Types of Decision Support Systems (DSS)
1. Data-Driven DSS take the massive amounts of data available through the company’s
TPS and MIS systems and cull from it useful information which executives can use to
make more informed decisions. They don’t have to have a theory or model but can
“free-flow” the data. The first generic type of Decision Support System is a Data-Driven
DSS. These systems include file drawer and management reporting systems, data
warehousing and analysis systems, Executive Information Systems (EIS) and Spatial
Decision Support Systems. Business Intelligence Systems are also examples of Data-
Driven DSS. Data-Driven DSS emphasize access to and manipulation of large databases
of structured data and especially a time-series of internal company data and
sometimes external data. Simple file systems accessed by query and retrieval tools
provide the most elementary level of functionality. Data warehouse systems that allow
the manipulation of data by computerized tools tailored to a specific task and setting
or by more general tools and operators provide additional functionality. Data-Driven
DSS with Online Analytical Processing (OLAP) provide the highest level of functionality
and decision support that is linked to analysis of large collections of historical data.
2. Model-Driven DSS A second category, Model-Driven DSS, includes systems that use
accounting and financial models, representational models, and optimization models.
Model-Driven DSS emphasize access to and manipulation of a model. Simple statistical
and analytical tools provide the most elementary level of functionality. Some OLAP
systems that allow complex analysis of data may be classified as hybrid DSS systems
providing modeling, data retrieval and data summarization functionality. Model-Driven
DSS use data and parameters provided by decision-makers to aid them in analyzing a
situation, but they are not usually data intensive. Very large databases are usually not
needed for Model-Driven DSS. Model-Driven DSS were isolated from the
main Information Systems of the organization and were primarily used for the typical
“what-if” analysis. That is, “What if we increase production of our products and
decrease the shipment time?” These systems rely heavily on models to help executives
understand the impact of their decisions on the organization, its suppliers, and its
customers.
3. Knowledge-Driven DSS The terminology for this third generic type of DSS is still
evolving. Currently, the best term seems to be Knowledge-Driven DSS. Adding the
modifier “driven” to the word knowledge maintains a parallelism in the framework and
focuses on the dominant knowledge base component. Knowledge-Driven DSS can
suggest or recommend actions to managers. These DSS are personal computer
systems with specialized problem-solving expertise. The “expertise” consists of
knowledge about a particular domain, understanding of problems within that domain,
and “skill” at solving some of these problems. A related concept is Data Mining. It refers
to a class of analytical applications that search for hidden patterns in a database. Data
mining is the process of sifting through large amounts of data to produce data content
relationships.
4. Document-Driven DSS A new type of DSS, a Document-Driven DSS or Knowledge
Management System, is evolving to help managers retrieve and manage unstructured
documents and Web pages. A Document-Driven DSS integrates a variety of storage
and processing technologies to provide complete document retrieval and analysis. The
Web provides access to large document databases including databases of hypertext
documents, images, sounds and video. Examples of documents that would be accessed
by a Document-Based DSS are policies and procedures, product specifications,
catalogs, and corporate historical documents, including minutes of meetings,
corporate records, and important correspondence. A search engine is a powerful
decision aiding tool associated with a Document-Driven DSS.
5. Communications-Driven and Group DSS Group Decision Support Systems (GDSS)
came first, but now a broader category of Communications-Driven DSS or groupware
can be identified. This fifth generic type of Decision Support System includes
communication, collaboration and decision support technologies that do not fit within
those DSS types identified. Therefore, we need to identify these systems as a specific
category of DSS. A Group DSS is a hybrid Decision Support System that emphasizes
both the use of communications and decision models. A Group Decision Support
System is an interactive computer-based system intended to facilitate the solution of
problems by decision-makers working together as a group. Groupware supports
electronic communication, scheduling, document sharing, and other group
productivity and decision support enhancing activities. We have a number of technologies and capabilities in this category in the framework: Group DSS, two-way interactive video, White Boards, Bulletin Boards, and Email.
Components of Decision Support Systems (DSS)
Traditionally, academics and MIS staffs have discussed building Decision Support
Systems in terms of four major components:
• The user interface
• The database
• The models and analytical tools and
• The DSS architecture and network
This traditional list of components remains useful because it identifies similarities and
differences between categories or types of DSS. The DSS framework is primarily based on the
different emphases placed on DSS components when systems are actually constructed.
Data-Driven, Document-Driven and Knowledge-Driven DSS need specialized database
components. A Model-Driven DSS may use a simple flat-file database with fewer than
1,000 records, but the model component is very important. Experience and some
empirical evidence indicate that design and implementation issues vary for Data-
Driven, Document-Driven, Model-Driven and Knowledge-Driven DSS.
Multi-participant systems like Group and Inter-Organizational DSS also create complex
implementation issues. For instance, when implementing a Data-Driven DSS a designer
should be especially concerned about the user’s interest in applying the DSS in
unanticipated or novel situations. Despite the significant differences created by the
specific task and scope of a DSS, all Decision Support Systems have similar technical
components and share a common purpose, supporting decision-making.
A Data-Driven DSS database is a collection of current and historical structured data
from a number of sources that have been organized for easy access and analysis. We
are expanding the data component to include unstructured documents in Document-
Driven DSS and “knowledge” in the form of rules or frames in Knowledge-Driven DSS.
Supporting management decision-making means that computerized tools are used to
make sense of the structured data or documents in a database.
Mathematical and analytical models are the major component of a Model-Driven DSS.
Each Model-Driven DSS has a specific set of purposes and hence different models are
needed and used. Choosing appropriate models is a key design issue. Also, the
software used for creating specific models needs to manage needed data and the user
interface. In Model-Driven DSS the values of key variables or parameters are changed,
often repeatedly, to reflect potential changes in supply, production, the economy,
sales, the marketplace, costs, and/or other environmental and internal factors.
Information from the models is then analyzed and evaluated by the decision-maker.
Knowledge-Driven DSS use special models for processing rules or identifying
relationships in data. The DSS architecture and networking design component refers
to how hardware is organized, how software and data are distributed in the system,
and how components of the system are integrated and connected. A major issue today
is whether DSS should be available using a Web browser on a company intranet and
also available on the Global Internet. Networking is the key driver of Communications-
Driven DSS.
Advantages of Decision Support Systems (DSS)
• Time savings. For all categories of decision support systems, research has
demonstrated and substantiated reduced decision cycle time, increased employee
productivity and more timely information for decision making. The time savings that
have been documented from using computerized decision support are often
substantial. Researchers, however, have not always demonstrated that decision quality
remained the same or actually improved.
• Enhance effectiveness. A second category of advantage that has been widely
discussed and examined is improved decision making effectiveness and better
decisions. Decision quality and decision making effectiveness are however hard to
document and measure. Most research has examined soft measures like perceived decision quality rather than objective measures. Advocates of building data
warehouses identify the possibility of more and better analysis that can improve
decision making.
• Improve interpersonal communication. DSS can improve communication and
collaboration among decision makers. In appropriate circumstances, communications-
driven and group DSS have had this impact. Model-driven DSS provides a means for
sharing facts and assumptions. Data-driven DSS make “one version of the truth” about
company operations available to managers and hence can encourage fact-based
decision making. Improved data accessibility is often a major motivation for building a
data-driven DSS. This advantage has not been adequately demonstrated for most
types of DSS.
• Competitive advantage. Vendors frequently cite this advantage for business
intelligence systems, performance management systems, and web-based DSS.
Although it is possible to gain a competitive advantage from computerized decision
support, this is not a likely outcome. Vendors routinely sell the same product to
competitors and even help with the installation. Organizations are most likely to gain
this advantage from novel, high risk, enterprise-wide, inward facing decision support
systems. Measuring this is and will continue to be difficult.
• Cost reduction. Some research, and especially case studies, has documented DSS cost savings from labor savings in making decisions and from lower infrastructure or
technology costs. This is not always a goal of building DSS.
• Increase decision maker satisfaction. The novelty of using computers has and may
continue to confound analysis of this outcome. DSS may reduce frustrations of decision
makers, create perceptions that better information is being used and/or creates
perceptions that the individual is a “better” decision maker. Satisfaction is a complex
measure and researchers often measure satisfaction with the DSS rather than
satisfaction with using a DSS in decision making. Some studies have compared
satisfaction with and without computerized decision aids. Those studies suggest the
complexity and “love/hate” tension of using computers for decision support.
• Promote learning. Learning can occur as a by-product of initial and ongoing use of a
DSS. Two types of learning seem to occur: learning of new concepts and the
development of a better factual understanding of the business and decision making
environment. Some DSS serve as “de facto” training tools for new employees. This
potential advantage has not been adequately examined.
• Increase organizational control. Data-driven DSS often make business transaction
data available for performance monitoring and ad hoc querying. Such systems can
enhance management understanding of business operations and managers perceive
that this is useful. What is not always evident is the financial benefit from increasingly
detailed data.
Regulations like Sarbanes-Oxley often dictate reporting requirements and hence
heavily influence the control information that is made available to managers. On a
more ominous note, some DSS provide summary data about decisions made, usage
of the systems, and recommendations of the system. Managers need to be very careful
about how decision-related information is collected and then used for organizational
control purposes. If employees feel threatened or spied upon when using a DSS, the
benefits of the DSS can be reduced. More research is needed on these questions.
Disadvantages of Decision Support Systems (DSS)
Decision Support Systems can create advantages for organizations and can have
positive benefits, however building and using DSS can create negative outcomes in
some situations.
• Monetary cost. A decision support system requires investing in an information system to collect data from many sources and analyze it to support decision making. Some analysis for a Decision Support System requires advanced knowledge of data analysis, statistics, econometrics and information systems, so it is costly to hire the specialists needed to set up the system.
• Overemphasize decision making. Clearly the focus of those of us interested in
computerized decision support is on decisions and decision making. Implementing
Decision Support System may reinforce the rational perspective and overemphasize
decision processes and decision making. It is important to educate managers about
the broader context of decision making and the social, political and emotional factors
that impact organizational success. It is especially important to continue examining
when and under what circumstances Decision Support System should be built and
used. We must continue asking if the decision situation is appropriate for using any
type of Decision Support System and if a specific Decision Support System is or remains
appropriate to use for making or informing a specific decision.
• Assumption of relevance. According to Winograd and Flores (1986), “Once a
computer system has been installed it is difficult to avoid the assumption that the
things it can deal with are the most relevant things for the manager’s concern.” The
danger is that once DSS become common in organizations, that managers will use
them inappropriately. There is limited evidence that this occurs. Again training is the
only way to avoid this potential problem.
• Transfer of power. Building Decision Support Systems, especially knowledge-driven
Decision Support System, may be perceived as transferring decision authority to a
software program. This is more a concern with decision automation systems than with
DSS. We advocate building computerized decision support systems because we want
to improve decision making while keeping a human decision maker in the “decision
loop”. In general, we value the “need for human discretion and innovation” in the
decision making process.
• Unanticipated effects. Implementing decision support technologies may have
unanticipated consequences. It is conceivable and it has been demonstrated that some
DSS reduce the skill needed to perform a decision task. Some Decision Support System
overload decision makers with information and actually reduce decision making
effectiveness.
• Obscuring responsibility. The computer does not make a “bad” decision, people do.
Unfortunately some people may deflect personal responsibility to a DSS. Managers
need to be continually reminded that the computerized decision support system is an
intermediary between the people who built the system and the people who use the
system. The entire responsibility associated with making a decision using a DSS resides
with people who built and use the system.
• False belief in objectivity. Managers who use Decision Support Systems may or may
not be more objective in their decision making. Computer software can encourage
more rational action, but managers can also use decision support technologies to
rationalize their actions. It is an overstatement to suggest that people using a DSS are
more objective and rational than managers who are not using computerized decision
support.
• Status reduction. Some managers argue using a Decision Support System will
diminish their status and force them to do clerical work. This perceptual problem can
be a disadvantage of implementing a DSS. Managers and IS staff who advocate
building and using computerized decision support need to deal with any status issues
that may arise. This perception may or should be less common now that computer
usage is common and accepted in organizations.
• Information overload. Too much information is a major problem for people and
many DSS increase the information load. Although this can be a problem, Decision
Support System can help managers organize and use information. Decision Support
System can actually reduce and manage the information load of a user. Decision
Support System developers need to try to measure the information load created by
the system and Decision Support System users need to monitor their perceptions of
how much information they are receiving. The increasing ubiquity of handheld, wireless
computing devices may exacerbate this problem and disadvantage.
In conclusion, before firms invest in Decision Support Systems, they must weigh the advantages and disadvantages of the system to ensure that the investment is worthwhile.
Business Forecasting
3.1 Introduction
The growing competition, rapidity of change in circumstances and the trend towards
automation demand that decisions in business are based on a careful analysis of data
concerning the future course of events and not purely on guesses and hunches. The
future is unknown to us and yet every day we are forced to make decisions involving
the future and therefore, there is uncertainty. Great risk is associated with business
affairs. All businessmen are forced to make forecasts regarding business activities.
Success in business depends upon successful forecasts of business events. In recent
times, considerable research has been conducted in this field. Attempts are being
made to make forecasting as scientific as possible.
Business forecasting is not a new development. Every businessman must forecast;
even if the entire product is sold before production. Forecasting has always been
necessary. What is new in the attempt to put forecasting on a scientific basis is to
forecast by reference to past history and statistics rather than by pure intuition and
guess-work.
One of the most important tasks before businessmen and economists these days is to
make estimates for the future. For example, a businessman is interested in finding
out his likely sales next year or as long term planning in next five or ten years so that
he adjusts his production accordingly and avoid the possibility of either inadequate
production to meet the demand or unsold stocks.
Similarly, an economist is interested in estimating the likely population in the coming
years so that proper planning can be carried out with regard to jobs for the people,
food supply, etc. First step in making estimates for the future consists of gathering
information from the past. In this connection we usually deal with statistical data
which is collected, observed or recorded at successive intervals of time. Such data is
generally referred to as time series. Thus, when we observe numerical data at
different points of time the set of observations is known as time series.
Objectives:
After studying this unit, you should be able to:
• describe the meaning of business forecasting
• distinguish between prediction, projection and forecast
• describe the forecasting methods available
• apply the forecasting theories in taking effective business decisions
3.2 Business Forecasting
Business forecasting refers to the analysis of past and present economic conditions
with the object of drawing inferences about probable future business conditions. The
process of making definite estimates of the future course of events is referred to as
forecasting, and the figures or statements obtained from the process are known as the
'forecast'. The future course of events is rarely known with certainty; an organised
system of forecasting helps in gaining reasonable assurance about it. The following are
two aspects of scientific business forecasting:
1. Analysis of past economic conditions
For this purpose, the components of time series are to be studied. The secular trend
shows how the series has been moving in the past and what its future course is likely
to be over a long period of time. The cyclic fluctuations would reveal whether the
business activity is subjected to a boom or depression. The seasonal fluctuations
would indicate the seasonal changes in the business activity.
2. Analysis of present economic conditions
The object of analysing present economic conditions is to study those factors which
affect the sequential changes expected on the basis of the past conditions. Such
factors are new inventions, changes in fashion, changes in economic and political
spheres, economic and monetary policies of the government, war, etc. These factors
may affect and alter the duration of trade cycle. Therefore, it is essential to keep in
mind the present economic conditions since they have an important bearing on the
probable future tendency.
3.2.1 Objectives of forecasting in business
Forecasting is a part of human nature. Businessmen also need to look to the future.
Success in business depends on correct predictions. In fact when a man enters
business, he automatically takes with it the responsibility for attempting to forecast
the future.
To a very large extent, success or failure would depend upon the ability to
successfully forecast the future course of events. Without some element of continuity
between past, present and future, there would be little possibility of successful
prediction. But history is not likely to repeat itself and we would hardly expect
economic conditions next year or over the next 10 years to follow a clear cut
prediction. Yet, past patterns prevail sufficiently to justify using the past as a basis
for predicting the future.
A businessman cannot afford to base his decisions on guesses. Forecasting helps a
businessman in reducing the areas of uncertainty that surround management decision
making with respect to costs, sales, production, profits, capital investment, pricing,
expansion of production, extension of credit, development of markets, increase of
inventories and curtailment of loans. These decisions are to be based on present
indications of future conditions.
However, we know that it is impossible to forecast the future precisely. There is a
possibility of occurrence of some range of error in the forecast. Statistical forecasts
are the methods in which we can use the mathematical theory of probability to
measure the risks of errors in predictions.
3.2.1.1 Prediction, Projection and Forecasting
A great amount of confusion seems to have grown up in the use of words ‘forecast’,
‘prediction’ and ‘projection’.

Key Statistic
A prediction is an estimate based solely on past data of the series under
investigation. It is purely a statistical extrapolation.
A projection is a prediction, where the extrapolated values are subject to
certain numerical assumptions.
A forecast is an estimate, which relates the series in which we are
interested into external factors.

Forecasts are made by estimating future values of the external factors by means of
prediction, projection or forecast and from these values calculating the estimate of
the dependent variable.
3.2.2 Characteristics of Business Forecasting
• Based on past and present conditions
Business forecasting is based on the past and present economic conditions of the business.
To forecast the future, various data, information and facts concerning the past and
present economic conditions of the business are analysed.
• Based on mathematical and statistical methods
The process of forecasting includes the use of statistical and mathematical methods.
By using these methods, the actual trend which may take place in future can be
forecasted.
• Period
The forecasting can be made for long term, short term, medium term or any specific
period.
• Estimation of future
Business forecasting is to forecast the future regarding probable economic conditions.
• Scope
Forecasting can be physical as well as financial.
3.2.3 Steps in forecasting
Forecasting of business fluctuations consists of the following steps:
1. Understanding why changes in the past have occurred
One of the basic principles of statistical forecasting is that the forecaster should use
past performance data. The current rate and changes in the rate constitute the basis of
forecasting. Once they are known, various mathematical techniques can develop
projections from them. If an attempt is made to forecast business fluctuations without
understanding why past changes have taken place, the forecast will be purely
mechanical, based solely upon the application of mathematical formulae, and will be
subject to serious error.
2. Determining which phases of business activity must be measured
After understanding the reasons for the occurrence of business fluctuations, it is
necessary to measure certain phases of business activity in order to predict what
changes will probably follow the present level of activity.

3. Selecting and compiling data to be used as measuring devices
There is an interdependent relationship between the selection of statistical data and
the determination of why business fluctuations occur. Statistical data cannot be collected
and analysed in an intelligent manner unless there is sufficient understanding of
business fluctuations. It is important that reasons for business fluctuations be stated
in such a manner that it is possible to secure data that is related to the reasons.
4. Analysing the data
Lastly, the data is analysed to understand why changes occur. For example, if it is
reasoned that a certain combination of forces will result in a given change, the
statistical part of the problem is to measure these forces from the data available and
to draw conclusions on the future course of action. The methods of drawing
conclusions may be called forecasting techniques.

3.2.4 Methods of Business Forecasting


Almost all businessmen forecast about the conditions related to their business. In
recent years scientific methods of forecasting have been developed. The base of
scientific forecasting is statistics. To handle the increasing variety of managerial
forecasting problems, several forecasting techniques have been developed in recent
years. Forecasting techniques vary from simple expert guesses to complex analysis
of mass data. Each technique has its special use, and care must be taken to select the
correct technique for a particular situation.
Before applying a method of forecasting, the following questions should be
answered:
• What is the purpose of the forecast and how is it to be used?
• What are the dynamics and components of the system for which the
forecast will be made?
• How important is the past, in estimating the future?
The following are the two main types of business forecasting methods: quantitative
and qualitative. While both have unique approaches, they’re similar in their goals and
the information used to make predictions – company data and market knowledge.

Quantitative forecasting
The quantitative forecasting method relies on historical data to predict future needs and
trends. The data can be from your own company, market activity, or both. It focuses on cold,
hard numbers that can show clear courses of change and action. This method is beneficial
for companies that have an extensive amount of data at their disposal.

There are four quantitative forecasting methods:


1. Trend series method: Also referred to as time series analysis, this is the most common
forecasting method. Trend series collects as much historical data as possible to identify
common shifts over time. This method is useful if your company has a lot of past data that
already shows reliable trends. (A small worked sketch of this method and the average
approach appears after this list.)
2. The average approach: This method is also based on repetitive trends. The average
approach assumes that the average of past metrics will predict future events. Companies
most commonly use the average approach for inventory forecasting.
3. Indicator approach: This approach follows different sets of indicator data that help predict
potential influences on the general economic conditions, specific target markets, and supply
chain. Some examples of indicators include changes in Gross Domestic Product (GDP),
unemployment rate, and Consumer Price Index (CPI). By monitoring the applicable
indicators, companies can easily predict how these changes may affect their own business
needs and profitability by observing how they interact with each other. This approach would
be the most effective for companies whose sales are heavily affected by specific economic
factors.
4. Econometric modeling: This method takes a mathematical approach using regression
analysis to measure the consistency in company data over time. Regression analysis uses
statistical equations to predict how variables of interest interact and affect a company. The
data used in this analysis can be internal datasets or external factors that can affect a business,
such as market trends, weather, GDP growth, political changes, and more. Econometric
modeling observes the consistency in those datasets and factors to identify the potential for
repeat scenarios in the future.
For example, a company that sells hurricane impact windows may use econometric modeling
to measure how hurricane season has affected their sales in the past and create forecasts for
future hurricane seasons.

Qualitative forecasting
The qualitative forecasting method relies on the input of those who influence your company’s
success. This includes your target customer base and even your leadership team. This method
is beneficial for companies that don’t have enough complex data to conduct a quantitative
forecast.
There are two approaches to qualitative forecasting:
1. Market research: The process of collecting data points through direct correspondence with
the market community. This includes conducting surveys, polls, and focus groups to gather
real-time feedback and opinions from the target market. Market research looks at
competitors to see how they adjust to market fluctuations and adapt to changing supply and
demand. Companies commonly utilize market research to forecast expected sales for new
product launches.
2. Delphi method: This method collects forecasting data from company professionals. The
company’s foreseeable needs are presented to a panel of experts, who then work together to
forecast the expectations and business decisions that can be made with the derived insights.
This method is used to create long-term business predictions and can also be applied to sales
forecasts.
3.3 Utility of Business Forecasting
Business forecasting acquires an important place in every field of the economy.
Business forecasting helps the businessmen and industrialists to form the policies and
plans related with their activities. On the basis of the forecasting, businessmen can
forecast the demand of the product, price of the product, condition of the market and
so on. The business decisions can also be reviewed on the basis of business
forecasting.
3.3.1 Advantages of business forecasting
• Helpful in increasing profit and reducing losses
Every business is carried out with the purpose of earning maximum profits. So, by
forecasting the future price of the product and its demand, the businessman can
predetermine the production cost, production and the level of stock to be determined.
Thus, business forecasting is regarded as the key of success of business.
• Helpful in taking management decisions
Business forecasting provides the basis for management decisions, because in
present times the management has to take the decision in the atmosphere of
uncertainties. Also, business forecasting explains the future conditions and enables
the management to select the best alternative.
• Useful to administration
On the basis of forecasting, the government can control the circulation of money. It
can also modify the economic, fiscal and monetary policies to avoid adverse effects
of trade cycles. So, with the help of forecasting, the government can control the
expected fluctuations in future.
• Basis for capital market
Business forecasting helps in estimating the requirement of capital, position of stock
exchange and the nature of investors.
• Useful in controlling the business cycles
The trade cycles cause various depressions in business such as sudden change in price
level, increase in the risk of business, increase in unemployment, etc. By adopting a
systematic business forecasting, businessmen and government can handle and control
the depression of trade cycles.
• Helpful in achieving the goals
Business forecasting helps to achieve the objective of business goals through proper
planning of business improvement activities.
• Facilitates control
By business forecasting, the tendency of black marketing, speculation, uneconomic
activities and corruption can be controlled.
• Utility to society
With the help of business forecasting the entire society is also benefited because the
adverse effects of fluctuations in the conditions of business are kept under control.
3.3.2 Limitations of business forecasting
Business forecasting cannot be accurate due to various limitations which are
mentioned below.
• Forecasting cannot be accurate, because it is largely based on future events and
there is no guarantee that they will happen.
• Business forecasting is generally made by using statistical and mathematical
methods. However, these methods cannot claim to make an uncertain future a
definite one.
• The underlying assumptions of business forecasting cannot always be satisfied
simultaneously. In such cases, the results of forecasting will be misleading.
• The forecasting cannot guarantee the elimination of errors and mistakes. The
managerial decision will be wrong if the forecasting is done in a wrong way.
• Factors responsible for economic changes are often difficult to discover and
measure. Hence, business forecasting tends to be an uncertain exercise.
• Business forecasting does not evaluate risks.
• The forecasting is made on the basis of past information and data and relies on
the assumption that economic events are repeated under the same conditions. But
there may be circumstances where these conditions are not repeated.
• Forecasting is not a one-off exercise. In order to be effective, it requires
continuous attention.
Predictive Analytics
Predictive Analytics is a statistical method that utilizes algorithms and machine learning to
identify trends in data and predict future behaviors.
With increasing pressure to show a return on investment (ROI) for implementing learning
analytics, it is no longer enough for a business to simply show how learners performed or
how they interacted with learning content. It is now desirable to go beyond descriptive
analytics and gain insight into whether training initiatives are working and how they can be
improved.
Predictive Analytics can take both past and current data and offer predictions of what could
happen in the future. This identification of possible risks or opportunities enables businesses
to take actionable intervention in order to improve future learning initiatives.

How does Predictive Analytics work?

The software for predictive analytics has moved beyond the realm of statisticians and is
becoming more affordable and accessible for different markets and industries, including the
field of learning & development.
For online learning specifically, predictive analytics is often found incorporated in the
Learning Management System (LMS), but can also be purchased separately as specialized
software.
For the learner, predictive forecasting could be as simple as a dashboard located on the main
screen after logging in to access a course. Analyzing data from past and current progress,
visual indicators in the dashboard could be provided to signal whether the employee was on
track with training requirements.
At the business level, an LMS system with predictive analytic capability can help improve
decision-making by offering in-depth insight to strategic questions and concerns. This could
range from anything to course enrolment, to course completion rates, to employee
performance.
Predictive analytic models
Because predictive analytics goes beyond sorting and describing data, it relies heavily on
complex models designed to make inferences about the data it encounters. These models
utilize algorithms and machine learning to analyze past and present data in order to provide
future trends.
Each model differs depending on the specific needs of those employing predictive analytics.
Some common basic models that are utilized at a broad level include:
• Decision trees use branching to show possibilities stemming from each
outcome or choice.
• Regression techniques assist with understanding relationships between
variables.

• Neural networks utilize algorithms to figure out possible relationships


within data sets.
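As an illustration of the first of these models, the short Python sketch below fits a small
decision tree to a fabricated training-completion data set using scikit-learn; the feature
names and values are assumptions made only for this example, not data from the text.

# Illustrative sketch (not from the text): a small decision tree that predicts whether
# an employee will complete a course, using scikit-learn and a fabricated data set.
from sklearn.tree import DecisionTreeClassifier

# Features: [hours_logged_per_week, modules_finished, days_since_last_login]
X = [
    [5, 8, 1], [1, 2, 20], [4, 7, 3], [0, 1, 30],
    [6, 9, 2], [2, 3, 15], [3, 6, 5], [1, 1, 25],
]
# Target: 1 = completed the course, 0 = did not complete
y = [1, 0, 1, 0, 1, 0, 1, 0]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Predict for a new learner who logs 3 hours a week, has finished 5 modules,
# and last logged in 4 days ago.
print(model.predict([[3, 5, 4]]))          # predicted class (0 or 1)
print(model.predict_proba([[3, 5, 4]]))    # class probabilities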

What does a business need to know before using predictive analytics?


For businesses who want to incorporate predictive analytics into their learning analytics
strategy, the following steps should be considered:
• Establish a clear direction. Predictive analytics relies on specifically
programmed algorithms and machine learning to track and analyze data, all
of which depend on the unique questions being asked. For example, wanting
to know whether employees will complete a course is a specific question; the
software would need to analyze the relevant data in order to formulate
possible trends on completion rates. It is important that businesses know
what their needs are.

• Be actively involved. Predictive analytics requires active input and
involvement from those utilizing the technique. This means deciding and
understanding what data is being collected and why. The quality of data
should also be monitored. Without human involvement, the data collected
and models used for analysis may provide no beneficial meaning.
What are the benefits of using predictive analytics?
Here are a few key benefits that businesses can expect to find when incorporating predictive
analytics into their overall learning analytics strategy:
• Personalize the training needs of employees by identifying their gaps,
strengths, and weaknesses; specific learning resources and training can be
offered to support individual needs.
• Retain Talent by tracking and understanding employee career progression
and forecasting what skills and learning resources would best benefit their
career paths. Knowing what skills employees need also benefits the design
of future training.
• Support employees who may be falling behind or not reaching their
potential by offering intervention support before their performance puts them
at risk.
• Simplified reporting and visuals that keep everyone updated when
predictive forecasting is required.
Examples of how Predictive Analytics are being used in online learning
Many businesses are beginning to incorporate predictive analytics into their learning
analytics strategy by utilizing the predictive forecasting features offered in Learning
Management Systems and specialized software.
Here are a few examples:
1. Training targets. Some systems monitor and collect data on how employees
interact within the learning environment, such as tracking how often courses
or resources are accessed and whether they are completed. Achievement
level can also be analyzed, including assessment performance, length of time
to complete training, and outstanding training requirements. An analysis of
these aggregated data patterns can reveal how employees may continue to
perform in the future. This makes it easier to identify employees who are not
on track to fulfilling ongoing training requirements.

2. Talent management. Predictive reporting can also forecast how employees
are developing in their role and within the company; this involves tracking
and forecasting on individual employee learning paths, training, and
upskilling activity. This is important for Human Resources (HR) who may
need to manage the talent pool for a large number of employees or training
departments wanting to know what resources will be effective for individual
skill development.

Predictive Modeling
Predictive modeling means developing models that can be used to forecast or predict future events. In
business analytics, models can be developed based on logic or data.
Logic-Driven Models
A logic-driven model is one based on experience, knowledge, and logical relationships of
variables and constants connected to the desired business performance outcome situation.
The question here is how to put variables and constants together to create a model that can
predict the future. Doing this requires business experience. Model building requires an
understanding of business systems and the relationships of variables and constants that seek
to generate a desirable business performance outcome. To help conceptualize the
relationships inherent in a business system, diagramming methods can be helpful. For
example, the cause-and-effect diagram is a visual aid diagram that permits a user to
hypothesize relationships between potential causes of an outcome (see Figure). This diagram
lists potential causes in terms of human, technology, policy, and process resources in an
effort to establish some basic relationships that impact business performance. The diagram
is used by tracing contributing and relational factors from the desired business performance
goal back to possible causes, thus allowing the user to better picture sources of potential
causes that could affect the performance. This diagram is sometimes referred to as a fishbone
diagram because of its appearance.
Fig. Cause-and-effect diagram

Another useful diagram to conceptualize potential relationships with business performance
variables is called the influence diagram. According to Evans influence diagrams can be
useful to conceptualize the relationships of variables in the development of models. An
example of an influence diagram is presented in Figure 6.2. It maps the relationship of
variables and a constant to the desired business performance outcome of profit. From such
a diagram, it is easy to convert the information into a quantitative model with constants and
variables that define profit in this situation:

Profit = Revenue − Cost,

or Profit = (Unit Price × Quantity Sold) − [(Fixed Cost) + (Variable Cost ×Quantity Sold)],

or P = (UP × QS) − [FC + (VC × QS)]


Figure 6.2 An influence diagram

The relationships in this simple example are based on fundamental business knowledge.
Consider, however, how complex cost functions might become without some idea of how
they are mapped together. It is necessary to be knowledgeable about the business systems
being modeled in order to capture the relevant business behavior. Cause-and-effect diagrams
and influence diagrams provide tools to conceptualize relationships, variables, and
constants, but it often takes many other methodologies to explore and develop predictive
models.
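Because the influence diagram translates directly into the profit equation above, the
logic-driven model can also be expressed as a short piece of code. The following Python
sketch simply evaluates P = (UP × QS) − [FC + (VC × QS)]; the unit price, quantity, and
cost figures are hypothetical values chosen only for illustration.

# Profit model derived from the influence diagram:
# P = (UP x QS) - [FC + (VC x QS)]
def profit(unit_price: float, quantity_sold: float,
           fixed_cost: float, variable_cost: float) -> float:
    revenue = unit_price * quantity_sold
    total_cost = fixed_cost + variable_cost * quantity_sold
    return revenue - total_cost

# Hypothetical values purely for illustration.
print(profit(unit_price=25.0, quantity_sold=10_000,
             fixed_cost=40_000, variable_cost=12.0))   # prints 90000.0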

6.2.2 Data-Driven Models

Logic-driven modeling is often used as a first step to establish relationships
through data-driven models (using data collected from many sources to
quantitatively establish model relationships). To avoid duplication of content
and focus on conceptual material in the chapters, we have relegated most of the
computational aspects and some computer usage content to the appendixes. In
addition, some of the methodologies are illustrated in the case problems
presented in this book. Please refer to the Additional Information column in
Table 6.1 to obtain further information on the use and application of the data-
driven models.
Table 6.1 Data-Driven Models

6.3 Data Mining


As mentioned in Chapter 3, data mining is a discovery-driven software
application process that provides insights into business data by finding hidden
patterns and relationships in big or small data and inferring rules from them to
predict future behavior. These observed patterns and rules guide decision-
making. This is not just numbers, but text and social media information from
the Web. For example, Abrahams et al. (2013) developed a set of text-mining
rules that automobile manufacturers could use to distill or mine specific vehicle
component issues that emerge on the Web but take months to show up in
complaints or other damaging media. These rules cut through the mountainous
data that exists on the Web and are reported to provide marketing and
competitive intelligence to manufacturers, distributors, service centers, and
suppliers. Identifying a product's defects and quickly recalling or correcting
the problem before customers experience a failure reduces customer
dissatisfaction when problems occur.

6.3.1 A Simple Illustration of Data Mining

Suppose a grocery store has collected a big data file on what customers put
into their baskets at the market (the collection of grocery items a customer
purchases at one time). The grocery store would like to know if there are any
associated items in a typical market basket. (For example, if a customer
purchases product A, she will most often associate it or purchase it with product
B.) If the customer generally purchases product A and B together, the store
might only need to advertise product A to gain both product A’s and B’s sales.
The value of knowing this association of products can improve the performance
of the store by reducing the need to spend money on advertising both products.
The benefit is real if the association holds true.

Finding the association and proving it to be valid require some analysis.
From the descriptive analytics analysis, some possible associations may have
been uncovered, such as product A’s and B’s association. With any size data
file, the normal procedure in data mining would be to divide the file into two
parts. One is referred to as a training data set, and the other as a validation data
set. The training data set develops the association rules, and the validation data
set tests and proves that the rules work. Starting with the training data set, a
common data mining methodology is what-if analysis using logic-based
software. SAS has a what-if logic-based software application, and so do a
number of other software vendors (see Chapter 3). These software applications
allow logic expressions. (For example, if product A is present, then is product
B present?) The systems can also provide frequency and probability
information to show the strength of the association. These software systems
have differing capabilities, which permit users to deterministically simulate
different scenarios to identify complex combinations of associations between
product purchases in a market basket.

Once a collection of possible associations is identified and their
probabilities are computed, the same logic associations (now considered
association rules) are rerun using the validation data set. A new set of
probabilities can be computed, and those can be statistically compared using
hypothesis testing methods to determine their similarity. Other software
systems compute correlations for testing purposes to judge the strength and the
direction of the relationship. In other words, if the consumer buys product A
first, it could be referred to as the Head and product B as the Body of the
association (Nisbet et al., 2009, p. 128). If the same basic probabilities are
statistically significant, it lends validity to the association rules and their use
for predicting market basket item purchases based on groupings of products.
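The text describes this training/validation procedure using SAS's what-if logic software.
As a rough, tool-neutral illustration only, the Python sketch below computes the support
and confidence of the rule "if product A, then product B" on an invented training set and
re-checks it on an invented validation set.

# Minimal sketch of checking an association rule "if product A, then product B"
# on a training set and a validation set. The baskets are invented for illustration.

def rule_stats(baskets, head, body):
    """Return (support, confidence) of the rule head -> body."""
    with_head = [b for b in baskets if head in b]
    with_both = [b for b in with_head if body in b]
    support = len(with_both) / len(baskets)
    confidence = len(with_both) / len(with_head) if with_head else 0.0
    return support, confidence

training = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"},
    {"A", "B", "D"}, {"A", "B"}, {"C", "D"}, {"A", "B", "C"},
]
validation = [
    {"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"A", "B", "D"},
    {"B", "C"}, {"A", "B"}, {"C"}, {"A", "B", "C"},
]

for name, data in [("training", training), ("validation", validation)]:
    support, confidence = rule_stats(data, "A", "B")
    print(f"{name}: support(A->B) = {support:.2f}, confidence(A->B) = {confidence:.2f}")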

6.3.2 Data Mining Methodologies

Data mining is an ideal predictive analytics tool used in the BA process.


We mentioned in Chapter 3 different types of information that data mining can
glean, and Table 6.2 lists a small sampling of data mining methodologies to
acquire different types of information. Some of the same tools used in the
descriptive analytics step are used in the predictive step but are employed to
establish a model (either based on logical connections or quantitative formulas)
that may be useful in predicting the future.
Table 6.2 Types of Information and Data Mining Methodologies

Several computer-based methodologies listed in Table 6.2 are briefly
introduced here. Neural networks are used to find associations where
connections between words or numbers can be determined. Specifically, neural
networks can take large volumes of data and potential variables and explore
variable associations to express a beginning variable (referred to as an input
layer), through middle layers of interacting variables, and finally to an ending
variable (referred to as an output). More than just identifying simple one-on-
one associations, neural networks link multiple association pathways through
big data like a collection of nodes in a network. These nodal relationships
constitute a form of classifying groupings of variables as related to one another,
but even more, related in complex paths with multiple associations (Nisbet et
al., 2009, pp. 128–138). Differing software have a variety of association
network function capabilities. SAS offers a series of search engines that can
identify associations. SPSS has two versions of neural network software
functions: Multilayer Perception (MLP) and Radial Basis Function (RBF).
Both procedures produce a predictive model for one or more dependent
variables based on the values of the predictive variables. Both allow a decision
maker to develop, train, and use the software to identify particular traits (such
as bad loan risks for a bank) based on characteristics from data collected on
past customers.
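The SAS and SPSS procedures named above are not reproduced here, but the general idea of
training a neural network classifier on past customer data can be sketched in Python with
scikit-learn; the loan-risk features and labels below are fabricated for illustration.

# Rough analogue (not the SPSS MLP procedure itself) of training a multilayer
# perceptron to flag bad loan risks, using scikit-learn and a fabricated data set.
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Features: [annual_income_k, debt_ratio, years_at_job]
X = [
    [95, 0.10, 12], [22, 0.65, 1], [60, 0.30, 6], [18, 0.80, 0],
    [75, 0.20, 9], [30, 0.55, 2], [85, 0.15, 10], [25, 0.70, 1],
]
y = [0, 1, 0, 1, 0, 1, 0, 1]   # 1 = bad loan risk, 0 = acceptable

model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(4,), solver="lbfgs",
                  max_iter=5000, random_state=0),
)
model.fit(X, y)
print(model.predict([[40, 0.50, 3]]))   # predicted risk class for a new applicant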

Discriminant analysis is similar to a multiple regression model except that
it permits continuous independent variables and a categorical dependent
variable. The analysis generates a regression function whereby values of the
independent variables can be incorporated to generate a predicted value for the
dependent variable. Similarly, logistic regression is like multiple regression.
Like discriminant analysis, its dependent variable can be categorical. The
independent variables in logistic regression can be either continuous or
categorical. For example, in predicting potential outsource providers, a firm
might use a logistic regression, in which the dependent variable would be to
classify an outsource provider as either rejected (represented by the value of
the dependent variable being zero) or acceptable (represented by the value of
one for the dependent variable).
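A minimal sketch of that outsource-provider example follows, assuming Python and
scikit-learn rather than the statistical packages discussed in the text; all feature
values and labels are invented.

# Sketch of the outsource-provider example: logistic regression classifying a provider
# as rejected (0) or acceptable (1). Data values are invented for illustration.
from sklearn.linear_model import LogisticRegression

# Features: [cost_index, on_time_delivery_rate, quality_score]
X = [
    [1.2, 0.95, 8.5], [2.5, 0.60, 4.0], [1.5, 0.90, 7.8], [2.8, 0.55, 3.5],
    [1.1, 0.98, 9.0], [2.2, 0.70, 5.0], [1.7, 0.88, 7.0], [2.6, 0.62, 4.2],
]
y = [1, 0, 1, 0, 1, 0, 1, 0]   # 1 = acceptable, 0 = rejected

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[1.8, 0.85, 6.5]]))        # predicted class for a new provider
print(clf.predict_proba([[1.8, 0.85, 6.5]]))  # probability of each class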

Hierarchical clustering is a methodology that establishes a hierarchy of
clusters that can be grouped by the hierarchy. Two strategies are suggested for
this methodology: agglomerative and divisive. The agglomerative strategy is a
bottom-up approach, in which one starts with each item in the data and begins
to group them. The divisive strategy is a top-down approach, in which one starts
with all the items in one group and divides the group into clusters. How the
clustering takes place can involve many different types of algorithms and
differing software applications. One method commonly used is to employ a
Euclidean distance formula that looks at the square root of the sum of distances
between two variables, their differences squared. Basically, the formula seeks
to match up variable candidates that have the least squared error differences.
(In other words, they’re closer together.)
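As a small illustration of the agglomerative (bottom-up) strategy, the following Python
sketch clusters a handful of invented two-variable observations using SciPy's
Euclidean-distance-based linkage.

# Sketch of bottom-up (agglomerative) hierarchical clustering with Euclidean distance,
# using SciPy. The (x, y) observations are invented for illustration.
from scipy.cluster.hierarchy import linkage, fcluster

points = [
    [1.0, 2.0], [1.2, 1.8], [0.9, 2.2],      # one tight group
    [8.0, 9.0], [8.3, 8.7], [7.8, 9.2],      # another tight group
]

# 'ward' linkage merges the pair of clusters with the smallest squared Euclidean error.
tree = linkage(points, method="ward", metric="euclidean")

# Cut the hierarchy into two clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)   # e.g. [1 1 1 2 2 2]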

K-mean clustering is a classification methodology that permits a set of data
to be reclassified into K groups, where K can be set as the number of groups
desired. The algorithmic process identifies initial candidates for the K groups
and then interactively searches other candidates in the data set to be averaged
into a mean value that represents a particular K group. The process of selection
is based on maximizing the distance from the initial K candidates selected in
the initial run through the list. Each run or iteration through the data set allows
the software to select further candidates for each group.

The K-mean clustering process provides a quick way to classify data into
differentiated groups. To illustrate this process, use the sales data in Figure 6.3
and assume these are sales from individual customers. Suppose a company
wants to classify the sales customers into high and low sales groups.

Figure 6.3 Sales data for cluster classification problem

The SAS K-Mean cluster software can be found in Proc Cluster. Any
integer value can designate the K number of clusters desired. In this problem
set, K=2. The SAS printout of this classification process is shown in Table 6.3.
The Initial Cluster Centers table listed the initial high (20167) and a low
(12369) value from the data set as the clustering process begins. As it turns out,
the software divided the customers into 9 high sales customers and 11 low sales
customers.

Table 6.3 SAS K-Mean Cluster Solution

Consider how large big data sets can be. Then realize this kind of
classification capability can be a useful tool for identifying and predicting sales
based on the mean values.
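The SAS Proc Cluster output itself is not reproduced here; as a rough counterpart, the
Python sketch below runs K-mean clustering with K = 2 on a made-up list of customer sales
values (only the two figures 12369 and 20167 are taken from the text).

# Rough counterpart to the SAS K-mean example: split customers into K = 2 sales groups.
# Apart from 12369 and 20167, the sales figures are invented stand-ins for Figure 6.3.
from sklearn.cluster import KMeans

sales = [[12369], [13100], [12800], [12550], [13400], [12900],
         [19800], [20167], [19500], [20300], [19950], [20050]]

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(sales)

print("Cluster labels:", labels)                      # which group each customer falls in
print("Cluster means:", km.cluster_centers_.ravel())  # mean sales of each group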

There are so many BA methodologies that no single section, chapter, or
even book can explain or contain them all. The analytic treatment and computer
usage in this chapter have been focused mainly on conceptual use. For a more
applied use of some of these methodologies, note the case study that follows
and some of the content in the appendixes.

6.4 Continuation of Marketing/Planning Case Study Example:
Predictive Analytics Step in the BA Process

In the last sections, an ongoing marketing/planning case study of the
relevant BA step discussed in those chapters is presented to illustrate some of
the tools and strategies used in a BA problem analysis. This is the second
installment of the case study dealing with the predictive analytics analysis step
in BA.

6.4.1 Case Study Background Review

The case study firm had collected a random sample of monthly sales
information presented in Figure 6.4 listed in thousands of dollars. What the
firm wants to know is, given a fixed budget of $350,000 for promoting this
service product, when it is offered again, how best should the company allocate
budget dollars in hopes of maximizing the future estimated month’s product
sales? Before the firm makes any allocation of budget, there is a need to
understand how to estimate future product sales. This requires understanding
the behavior of product sales relative to sales promotion efforts using radio,
paper, TV, and point-of-sale (POS) ads.
Figure 6.4 Data for marketing/planning case study

The previous descriptive analytics analysis in Chapter 5 revealed a
potentially strong relationship between radio and TV commercials that might
be useful in predicting future product sales. The analysis also revealed little
regarding the relationship of newspaper and POS ads to product sales. So
although radio and TV commercials are most promising, a more in-depth
predictive analytics analysis is called for to accurately measure and document
the degree of relationship that may exist in the variables to determine the best
predictors of product sales.

6.4.2 Predictive Analytics Analysis

An ideal multiple variable modeling approach that can be used in this
situation to explore variable importance in this case study and eventually lead
to the development of a predictive model for product sales is correlation and
multiple regression. We will use SAS’s statistical package to compute the
statistics in this step of the BA process.

First, we must consider the four independent variables—radio, TV,
newspaper, POS—before developing the model. One way to see the statistical
direction of the relationship (which is better than just comparing graphic charts)
is to compute the Pearson correlation coefficients r between each of the
independent variables with the dependent variable (product sales). The SAS
correlation coefficients and their levels of significance are presented in Table
6.4. The larger the Pearson correlation (regardless of the sign) and the smaller
the Significance test values (these are t-tests measuring the significance of the
Pearson r value; see Appendix A), the more significant the relationship. Both
radio and TV are statistically significant correlations, whereas at a 0.05 level
of significance, paper and POS are not statistically significant.
Table 6.4 SAS Pearson Correlation Coefficients: Marketing/Planning Case
Study
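The correlations in Table 6.4 were produced with SAS. For readers working in Python, an
equivalent computation could look like the sketch below; the numbers are invented
stand-ins, since the actual case-study data appears only in Figure 6.4.

# Sketch of computing Pearson r (and its two-sided p-value) between product sales and
# each promotion variable. The values below are invented stand-ins for Figure 6.4.
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({
    "radio":         [10, 15, 20, 25, 30, 35, 40, 45],
    "tv":            [30, 35, 45, 50, 60, 70, 75, 85],
    "paper":         [12, 9, 14, 8, 11, 7, 13, 6],
    "pos":           [5, 7, 4, 8, 6, 5, 9, 4],
    "product_sales": [3200, 4400, 5900, 7100, 8600, 9900, 11200, 12800],
})

for column in ["radio", "tv", "paper", "pos"]:
    r, p_value = pearsonr(df[column], df["product_sales"])
    print(f"{column:>6}: r = {r:+.3f}, p = {p_value:.4f}")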

Although it can be argued that the positive or negative correlation
coefficients should not automatically discount any variable from what will be
a predictive model, the negative correlation of newspapers suggests that as a
firm increases investment in newspaper ads, it will decrease product sales. This
does not make sense in this case study. Given the illogic of such a relationship,
its potential use as an independent variable in a model is questionable. Also,
this negative correlation poses several questions that should be considered.
Was the data set correctly collected? Is the data set accurate? Was the sample
large enough to have included enough data for this variable to show a positive
relationship? Should it be included for further analysis? Although it is possible
that a negative relationship can statistically show up like this, it does not make
sense in this case. Based on this reasoning and the fact that the correlation is
not statistically significant, this variable (newspaper ads) will be removed from
further consideration in this exploratory analysis to develop a predictive model.
Some researchers might also exclude POS based on the insignificance
(p=0.479) of its relationship with product sales. However, for purposes of
illustration, continue to consider it a candidate for model inclusion. Also, the
other two independent variables (radio and TV) were found to be significantly
related to product sales, as reflected in the correlation coefficients in the tables.

At this point, there is a dependent variable (product sales) and three
candidate independent variables (POS, TV, and Radio) in which to establish a
predictive model that can show the relationship between product sales and
those independent variables. Just as a line chart was employed to reveal the
behavior of product sales and the other variables in the descriptive analytic
step, a statistical method can establish a linear model that combines the three
predictive variables. We will use multiple regression, which can incorporate
any of the multiple independent variables, to establish a relational model for
product sales in this case study. Multiple regression also can be used to
continue our exploration of the candidacy of the three independent variables.

The procedure by which multiple regression can be used to evaluate which
independent variables are best to include or exclude in a linear model is called
step-wise multiple regression. It is based on an evaluation of regression models
and their validation statistics—specifically, the multiple correlation
coefficients and the F-ratio from an ANOVA. SAS software and many other
statistical systems build in the step-wise process. Some are called backward
selection or step-wise regression, and some are called forward selection or
step-wise regression. The backward step-wise regression starts with all the
independent variables placed in the model, and the step-wise process removes
them one at a time based on worst predictors first until a statistically significant
model emerges. The forward step-wise regression starts with the best related
variable (using correlation analysis as a guide), and then step-wise adds other
variables until adding more will no longer improve the accuracy of the model.
The forward step-wise regression process will be illustrated here manually. The
first step is to generate individual regression models and statistics for each
independent variable with the dependent variable one at a time. These three
SAS models are presented in Tables 6.5, 6.6, and 6.7 for the POS, radio, and
TV variables, respectively.
Table 6.5 SAS POS Regression Model: Marketing/Planning Case Study
Table 6.6 SAS Radio Regression Model: Marketing/Planning Case Study
Table 6.7 SAS TV Regression Model: Marketing/Planning Case Study
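The three single-variable regressions summarized in Tables 6.5–6.7 can be reproduced in
spirit with Python's statsmodels; the sketch below reuses the same invented stand-in data
as the correlation sketch above.

# Sketch of fitting one simple regression per candidate predictor and comparing
# R-Square and the ANOVA F-ratio, mirroring Tables 6.5-6.7 (invented stand-in data).
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "pos":           [5, 7, 4, 8, 6, 5, 9, 4],
    "radio":         [10, 15, 20, 25, 30, 35, 40, 45],
    "tv":            [30, 35, 45, 50, 60, 70, 75, 85],
    "product_sales": [3200, 4400, 5900, 7100, 8600, 9900, 11200, 12800],
})

for column in ["pos", "radio", "tv"]:
    X = sm.add_constant(df[[column]])                 # adds the intercept term
    model = sm.OLS(df["product_sales"], X).fit()
    print(f"{column:>6}: R-Square = {model.rsquared:.4f}, F-ratio = {model.fvalue:.2f}")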

The computer printouts in the tables provide a variety of statistics for
comparative purposes. Discussion will be limited here to just a few. The R-
Square statistics are a precise proportional measure of the variation that is
explained by the independent variable’s behavior with the dependent variable.
The closer the R-Square is to 1.00, the more of the variation is explained, and
the better the predictive variable. The three variables' R-Squares are 0.0002
(POS), 0.9548 (radio), and 0.9177 (TV). Clearly, radio is the best predictor
variable of the three, followed by TV and, with almost no relationship,
POS. This latter result was expected based on the prior Pearson correlation.
Put another way, only 8.23 percent (1.000 − 0.9177) of the variation in product
sales is left unexplained by TV commercials.

From ANOVA, the F-ratio statistic is useful in actually comparing the
regression model's capability to predict the dependent variable. As R-Square
increases, so does the F-ratio because of the way in which they are computed
and what is measured by both. The larger the F-ratio (like the R-Square
statistic), the greater the statistical significance in explaining the variable’s
relationships. The three variables' F-ratios from the ANOVA tables are 0.00
(POS), 380.22 (radio), and 200.73 (TV). Both radio and TV are statistically
significant, but POS has an insignificant relationship. To give some idea of how
significant the relationships are, assuming a level of significance where α=0.01,
one would only need a cut-off value for the F-ratio of 8.10 to designate it as
being significant. Not exceeding that F-ratio (as in the case of POS at 0.00) is
the same as saying that the coefficient in the regression model for POS is no
different from a value of zero (no contribution to Product Sales). Clearly, the
independent variables radio and TV appear to have strong relationships with
the dependent variable. The question is whether the two combined or even three
variables might provide a more accurate forecasting model than just using the
one best variable like radio.

Continuing with the step-wise multiple regression procedure, we next
determine the possible combinations of variables to see if a particular
combination is better than the single variable models computed previously. To
measure this, we have to determine the possible combinations for the variables
and compute their regression models. The combinations are (1) POS and radio;
(2) POS and TV; (3) POS, radio, and TV; and (4) radio and TV.

The resulting regression model statistics are summarized and presented in
Table 6.8. If one is to base the selection decision solely on the R-Square
statistic, there is a tie between the POS/radio/TV and the radio/TV combination
(0.979 R-Square values). If the decision is based solely on the F-ratio value
from ANOVA, one would select just the radio/TV combination, which one
might expect of the two most significantly correlated variables.
Table 6.8 SAS Variable Combinations and Regression Model Statistics:
Marketing/Planning Case Study

To aid in supporting a final decision and to ensure these analytics are the
best possible estimates, we can consider an additional statistic. That tie breaker
is the R-Squared (Adjusted) statistic, which is commonly used in multiple
regression models.

The R-Square Adjusted statistic does not have the same interpretation as R-
Square (a precise, proportional measure of variation in the relationship). It is
instead a comparative measure of suitability of alternative independent
variables. It is ideal for selection between independent variables in a multiple
regression model. The R-Square adjusted seeks to take into account the
phenomenon of the R-Square automatically increasing when additional
independent variables are added to the model. This phenomenon is like a
painter putting paint on a canvas, where more paint additively increases the
value of the painting. Yet by continually adding paint, there comes a point at
which some paint covers other paint, diminishing the value of the original.
Similarly, statistically adding more variables should increase the ability of the
model to capture what it seeks to model. On the other hand, putting in too many
variables, some of which may be poor predictors, might bring down the total
predictive ability of the model. The R-Square adjusted statistic provides some
information to aid in revealing this behavior.

The value of the R-Square adjusted statistic can be negative, but it will
always be less than or equal to that of the R-Square in which it is related. Unlike
R-Square, the R-Square adjusted increases when a new independent variable is
included only if the new variable improves the R-Square more than would be
expected in the absence of any independent value being added. If a set of
independent variables is introduced into a regression model one at a time in
forward step-wise regression using the highest correlations ordered first, the R-
Square adjusted statistic will end up being equal to or less than the R-Square
value of the original model. By systematic experimentation with the R-Square
adjusted recomputed for each added variable or combination, the value of the
R-Square adjusted will reach a maximum and then decrease. The multiple
regression model with the largest R-Square adjusted statistic will be the most
accurate combination of having the best fit without excessive or unnecessary
independent variables. Again, just putting all the variables into a model may
add unneeded variability, which can decrease its accuracy. Thinning out the
variables is important.
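The comparison behind Table 6.8 can be sketched the same way: fit each candidate
combination of predictors and rank them by R-Square adjusted. Again, the data below is
the same invented stand-in used in the earlier sketches.

# Sketch of the idea behind Table 6.8: fit every combination of two or three candidate
# predictors and compare R-Square adjusted (invented stand-in data).
from itertools import combinations

import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "pos":           [5, 7, 4, 8, 6, 5, 9, 4],
    "radio":         [10, 15, 20, 25, 30, 35, 40, 45],
    "tv":            [30, 35, 45, 50, 60, 70, 75, 85],
    "product_sales": [3200, 4400, 5900, 7100, 8600, 9900, 11200, 12800],
})

results = []
for k in (2, 3):
    for combo in combinations(["pos", "radio", "tv"], k):
        X = sm.add_constant(df[list(combo)])
        fit = sm.OLS(df["product_sales"], X).fit()
        results.append((combo, fit.rsquared, fit.rsquared_adj))

# The combination with the largest R-Square adjusted is the preferred model.
for combo, r2, r2_adj in sorted(results, key=lambda row: row[2], reverse=True):
    print(f"{'/'.join(combo):>12}: R-Square = {r2:.3f}, R-Square adj = {r2_adj:.3f}")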

Finally, in the step-wise multiple regression procedure, a final decision on
the variables to be included in the model is needed. Basing the decision on the
R-Square adjusted, the best combination is radio/TV. The SAS multiple
regression model and support statistics are presented in Table 6.9.

Table 6.9 SAS Best Variable Combination Regression Model and Statistics:
Marketing/Planning Case Study

Although there are many other additional analyses that could be performed
to validate this model, we will use the SAS multiple regression model in Table
6.9 for the firm in this case study. The forecasting model can be expressed as
follows:

Yp = −17150 + 275.69065 X1 + 48.34057 X2

where:

Yp = the estimated number of dollars of product sales

X1 = the number of dollars to invest in radio commercials

X2 = the number of dollars to invest in TV commercials

Because all the data used in the model is expressed as dollars, the
interpretation of the model is made easier than using more complex data. The
interpretation of the multiple regression model suggests that for every dollar
allocated to radio commercials (represented by X1), the firm will receive
$275.69 in product sales (represented by Yp in the model). Likewise, for every
dollar allocated to TV commercials (represented by X2), the firm will receive
$48.34 in product sales.
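Once estimated, the model is simply a forecasting function of the two budget variables.
The short Python sketch below evaluates it for one arbitrary split of the promotion
budget; the split shown is illustrative only, not a recommendation from the text.

# The estimated regression equation used as a forecasting function.
def predicted_sales(radio_dollars: float, tv_dollars: float) -> float:
    """Yp = -17150 + 275.69065*X1 + 48.34057*X2 (coefficients from Table 6.9)."""
    return -17150 + 275.69065 * radio_dollars + 48.34057 * tv_dollars

# Arbitrary illustrative split of the $350,000 budget between radio and TV.
print(f"${predicted_sales(radio_dollars=200_000, tv_dollars=150_000):,.0f}")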

A caution should be mentioned on the results of this case study. Many
factors might challenge a result, particularly those derived from using powerful
and complex methodologies like multiple regression. As such, the results may
not occur as estimated, because the model reflects only past performance, and
past conditions may not repeat in the future.
What is being suggested here is that more analysis can always be performed in
questionable situations. Also, additional analysis to confirm a result should be
undertaken to strengthen the trust that others must have in the results to achieve
the predicted higher levels of business performance.

In summary, for this case study, the predictive analytics analysis has
revealed a more detailed, quantifiable relationship between the generation of
product sales and the sources of promotion that best predict sales. The best way
to allocate the $350,000 budget to maximize product sales might involve
placing the entire budget into radio commercials because they give the best
return per dollar of budget. Unfortunately, there are constraints and limitations
regarding what can be allocated to the different types of promotional methods.
Optimizing the allocation of a resource and maximizing business performance
necessitate the use of special business analytic methods designed to accomplish
this task. This requires the additional step of prescriptive analytics analysis in
the BA process, which will be presented in the last section of Chapter 7.

Summary

This chapter dealt with the predictive analytics step in the BA process.
Specifically, it discussed logic-driven models based on experience and aided
by methodologies like the cause-and-effect and the influence diagrams. This
chapter also defined data-driven models useful in the predictive step of the BA
analysis. A further discussion of data mining was presented. Data mining
methodologies such as neural networks, discriminant analysis, logistic
regression, and hierarchical clustering were described. An illustration of K-
mean clustering using SAS was presented. Finally, this chapter discussed the
second installment of a case study illustrating the predictive analytics step of
the BA process. The remaining installment of the case study will be presented
in Chapter 7.

Once again, several of this book’s appendixes are designed to augment the
chapter material by including technical, mathematical, and statistical tools. For
both a greater understanding of the methodologies discussed in this chapter and
a basic review of statistical and other quantitative methods, a review of the
appendixes is recommended.

As previously stated, the goal of using predictive analytics is to generate a
forecast or path for future improved business performance. Given this predicted
path, the question now is how to exploit it as fully as possible. The purpose of
the prescriptive analytics step in the BA process is to serve as a guide to fully
maximize the outcome in using the information provided by the predictive
analytics step. The subject of Chapter 7 is the prescriptive analytics step in the
BA process.

Discussion Questions

1. Why is predictive analytics analysis the next logical step in any business
analytics (BA) process?

2. Why would one use logic-driven models to aid in developing data-driven models?

3. How are neural networks helpful in determining both associations and
classification tasks required in some BA analyses?

4. Why is establishing clusters important in BA?

5. Why is establishing associations important in BA?

6. How can F-tests from the ANOVA be useful in BA?

Problems

1. Using a similar equation to the one developed in this chapter for predicting
dollar product sales (note below), what is the forecast for dollar product sales
if the firm could invest $70,000 in radio commercials and $250,000 in TV
commercials?

Yp = –17150.455 + 275.691 X1 + 48.341 X2

where:

Yp = the estimated number of dollars of product sales


X1 = the number of dollars to invest in radio commercials

X2 = the number of dollars to invest in TV commercials

2. Using the same formula as in Question 1, but now using an investment of
$100,000 in radio commercials and $300,000 in TV commercials, what is the
prediction on dollar product sales?

3. Assume for this problem the following table would have held true for the
resulting marketing/planning case study problem. Which combination of
variables is estimated here to be the best predictor set? Explain why.

4. Assume for this problem that the following table would have held true for
the resulting marketing/planning case study problem. Which of the variables is
estimated here to be the best predictor? Explain why.
