AD8551-BA Unit 1 To Unit 3
BUSINESS ANALYTICS
(Unit 1- Unit 3)
UNIT 1
INTRODUCTION TO BUSINESS ANALYTICS
Prescriptive Analytics
In prescriptive analytics, we make use of simulation, data modelling, and optimization
algorithms to answer questions such as “what needs to be done?”. It is used to propose
solutions and identify the potential results of those solutions. This field of business analytics
has surfaced recently and is growing rapidly, because it offers multiple solutions to the
problems faced by businesses, along with their likely effectiveness. Let’s say Plan A fails or
there aren’t enough resources to execute it; then there is still Plan B, Plan C, etc., in hand.
Example –
A good example is Google’s self-driving car: by looking at past trends and forecast data, it
identifies when to turn or when to slow down, working much like a human driver.
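To make the optimization side of prescriptive analytics concrete, the following is a minimal sketch (not a definitive implementation) that uses SciPy's linear programming routine to pick a production plan under resource limits; the products, profit figures and resource capacities are invented purely for illustration.

```python
# Prescriptive analytics sketch: choose a production plan under resource limits.
# All numbers here are hypothetical, purely for illustration.
from scipy.optimize import linprog

# Profit per unit for two products (linprog minimizes, so negate to maximize).
profit = [-40, -30]

# Resource usage per unit of product A and product B.
A_ub = [
    [2, 1],   # machine hours per unit
    [1, 2],   # labour hours per unit
]
b_ub = [100, 80]  # available machine hours, labour hours

result = linprog(c=profit, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])

print("Units of product A:", round(result.x[0], 2))
print("Units of product B:", round(result.x[1], 2))
print("Maximum profit:   ", round(-result.fun, 2))
```

The solver returns the production mix with the best outcome, which is exactly the “what needs to be done” question prescriptive analytics tries to answer.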
Business Understanding
Focuses on understanding the project objectives and requirements from a business perspective.
The analyst formulates this knowledge as a data mining problem and develops a preliminary plan.
Data Understanding
Starting with initial data collection, the analyst proceeds with activities to get familiar with the
data, identify data quality problems & discover first insights into the data. In this phase, the
analyst might also detect interesting subsets to form hypotheses for hidden information
Data Preparation
The data preparation phase covers all activities to construct the final dataset from the initial
raw data
Modelling
The analyst evaluates, selects and applies the appropriate modelling techniques. Since some
techniques, such as neural networks, have specific requirements regarding the form of the data,
there can be a loop back here to data preparation.
Evaluation
The analyst builds and chooses models that appear to have high quality based on the loss
functions that were selected. The analyst then tests them to ensure that the models generalise
to unseen data. Subsequently, the analyst also validates that the models sufficiently cover
all key business issues. The end result is the selection of the champion model(s)
Deployment
Generally, this will mean deploying a code representation of the model into an operational
system. This also includes mechanisms to score or categorise new, unseen data as it arises. The
mechanism should use the new information in the solution of the original business problem.
Importantly, the code representation must also include all the data prep steps leading up to
modelling. This ensures that the model will treat new raw data in the same manner as during
model development
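As an illustration of this point, here is a hedged sketch using scikit-learn (assuming that library is the deployment vehicle): bundling the preparation steps and the model into a single Pipeline guarantees that new raw data is scored through exactly the same transformations used during model development.

```python
# Deployment sketch: one pipeline object carries both data prep and the model,
# so scoring new raw data repeats the same preparation used during development.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="mean")),   # fill missing values
    ("scale", StandardScaler()),                  # standardize features
    ("model", LogisticRegression()),              # the deployed model
])

# Toy training data (illustrative only).
X_train = np.array([[1.0, 200], [2.0, 180], [np.nan, 150], [4.0, 120]])
y_train = np.array([0, 0, 1, 1])
pipeline.fit(X_train, y_train)

# New, unseen raw data passes through the identical prep steps before scoring.
X_new = np.array([[3.5, 130]])
print(pipeline.predict(X_new))
```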
Project plan
Now you specify every step that you, the data miner, intend to take until the project is
completed and the results are presented and reviewed.
Deliverables for this task include two reports:
• Project plan: Outline your step-by-step action plan for the project. Expand the outline with a
schedule for completion of each step, required resources, inputs (such as data or a meeting with
a subject matter expert), and outputs (such as cleaned data, a model, or a report) for each step,
and dependencies (steps that can’t begin until this step is completed). Explicitly state that
certain steps must be repeated (for example, modeling and evaluation usually call for several
back-and-forth repetitions).
• Initial assessment of tools and techniques: Identify the required capabilities for meeting your
data-mining goals and assess the tools and resources that you have. If something is missing,
you have to address that concern very early in the process.
DATA COLLECTION
Data is a collection of facts, figures, objects, symbols, and events gathered from different
sources. Organizations collect data to make better decisions. Without data, it would be difficult
for organizations to make appropriate decisions, and so data is collected at various points in
time from different audiences.
For instance, before launching a new product, an organization needs to collect data on product
demand, customer preferences, competitors, etc. In case data is not collected beforehand, the
organization’s newly launched product may lead to failure for many reasons, such as less
demand and inability to meet customer needs.
Although data is a valuable asset for every organization, it does not serve any purpose until
analyzed or processed to get the desired results.
Information gathered as numerical facts through observation, before any processing, is known as raw data.
There are two types of data: primary data and secondary data.
The two types of data are as follows.
1. Primary Data
When an investigator collects data himself or herself with a definite plan or design, the data
is known as primary data. Generally, the results derived from primary data are accurate, as
the researcher gathers the information first-hand. However, one of the disadvantages of primary
data collection is the expense associated with it: primary data research is very time-
consuming and expensive.
2. Secondary Data
Data that the investigator does not collect initially but instead obtains from published or
unpublished sources is secondary data. Secondary data is collected by an individual or an
institution for some purpose and is used by someone else in another context. It is worth
noting that although secondary data is cheaper to obtain, it raises concerns about accuracy.
As the data is second-hand, one cannot fully rely on the information to be authentic.
Data Collection: Methods
Data collection is defined as gathering and analysing data, using suitable techniques, to
validate a research question. It is done to diagnose a problem and to learn about its outcome
and future trends. When a question needs to be answered, data collection methods help in
anticipating the likely result.
We must collect reliable data from the correct sources to make the calculations and analysis
easier. There are two types of data collection methods, depending on the kind of data
being collected. They are:
1. Primary Data Collection Methods
2. Secondary Data Collection Methods
Types of Data Collection
Students require primary or secondary data while doing their research. Both primary and
secondary data have their own advantages and disadvantages. Both the methods come into
the picture in different scenarios. One can use secondary data to save time and primary data
to get accurate results.
Primary Data Collection Method
Primary or raw data is obtained directly from the first-hand source through experiments,
surveys, or observations. The primary data collection method is further classified into two
types, and they are given below:
1. Quantitative Data Collection Methods
2. Qualitative Data Collection Methods
Quantitative Data Collection Methods
The term ‘quantity’ refers to a specific number. Quantitative data collection methods express
the data in numbers, using traditional or online data collection methods. Once this data is
collected, the results can be calculated using statistical methods and mathematical tools.
Some of the quantitative data collection methods include
Time Series Analysis
The term time series refers to a sequence of values of a variable recorded at equal time
intervals; the general movement in these values is known as a trend. Using such patterns, an
organization can predict the demand for its products and services for the projected period.
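As a small, hypothetical illustration of trend-based prediction, a straight-line trend can be fitted to an equally spaced demand series and projected forward; the demand figures below are invented.

```python
# Time series sketch: fit a linear trend to monthly demand and project it forward.
import numpy as np

demand = np.array([120, 125, 130, 128, 135, 142, 140, 148])  # hypothetical monthly demand
months = np.arange(len(demand))

# Fit a straight-line trend: demand ≈ slope * month + intercept.
slope, intercept = np.polyfit(months, demand, deg=1)

# Project demand for the next three months.
future_months = np.arange(len(demand), len(demand) + 3)
forecast = slope * future_months + intercept
print(np.round(forecast, 1))
```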
Smoothing Techniques
In cases where the time series lacks a significant trend, smoothing techniques can be used. They
eliminate random variation from the historical demand. This helps in identifying patterns and
demand levels to estimate future demand. The most common methods used in smoothing
demand forecasting techniques are the simple moving average method and the weighted
moving average method.
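The sketch below illustrates both methods on an invented demand series using pandas: the simple moving average weights the last periods equally, while the weighted moving average (with assumed weights) gives more importance to the most recent period.

```python
# Smoothing sketch: simple vs. weighted moving average on a hypothetical demand series.
import numpy as np
import pandas as pd

demand = pd.Series([100, 110, 95, 120, 130, 115, 140, 135])

# Simple moving average over a 3-period window.
simple_ma = demand.rolling(window=3).mean()

# Weighted moving average: the most recent period gets the highest weight.
weights = np.array([0.2, 0.3, 0.5])
weighted_ma = demand.rolling(window=3).apply(lambda x: np.dot(x, weights), raw=True)

print(pd.DataFrame({"demand": demand, "simple_ma": simple_ma, "weighted_ma": weighted_ma}))
```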
Barometric Method
Also known as the leading indicators approach, researchers use this method to speculate future
trends based on current developments. When the past events are considered to predict future
events, they act as leading indicators.
Qualitative Data Collection Methods
The qualitative method does not involve any mathematical calculations. This method is
closely connected with elements that are not quantifiable. The qualitative data collection
method includes several ways to collect this type of data, and they are given below:
Interview Method
As the name suggests, data is collected through verbal conversation, by interviewing
people in person, over the telephone, or by using a computer-aided model. This is one of
the methods most often used by researchers. A brief description of each of these methods
is given below:
Personal or Face-to-Face Interview: In this type of interview, questions are asked directly
and in person to the respondent. For this, a researcher can use online surveys to take
note of the answers.
Telephonic Interview: This method is done by asking questions on a telephonic call. Data
is collected from the people directly by collecting their views or opinions.
Computer-Assisted Interview: The computer-assisted type of interview is the same as a
personal interview, except that the interviewer and the person being interviewed will be
doing it on a desktop or laptop. Also, the data collected is directly updated in a database to
make the process quicker and easier. In addition, it eliminates a lot of paperwork to be done
in updating the collection of data.
Questionnaire Method of Collecting Data
The questionnaire method is simply conducting surveys with a set of quantitative
research questions. These survey questions are created using online survey-creation
software, which also helps ensure that people regard the surveys as legitimate. Some
types of questionnaire methods are given below:
Web-Based Questionnaire: The interviewer can send a survey link to the selected
respondents. Then the respondents click on the link, which takes them to the survey
questionnaire. This method is very cost-efficient and quick, which people can do at their
own convenient time. Moreover, the survey has the flexibility of being done on any device.
So, it is reliable and flexible.
Mail-Based Questionnaire: Questionnaires are sent to the selected audience via email. At
times, some incentives are also given to complete this survey, which is the main attraction.
The advantage of this method is that the respondents’ names remain confidential to the
researchers, and there is flexibility of time to complete the survey.
Observation Method
As the word ‘observation’ suggests, in this method data is collected directly by observation. This
can be obtained by counting the number of people or the number of events in a particular
time frame. Generally, it’s effective in small-scale scenarios. The primary skill needed here
is observing and arriving at the numbers correctly. Structured observation is the type of
observation method in which a researcher detects certain specific behaviours.
Document Review Method
The document review method is a data aggregation method used to collect data from existing
documents with data about the past. There are two types of documents from which we can
collect data. They are given below:
Public Records: Data held by an organisation, such as annual reports and sales
information from past months, is used for future analysis.
Personal Records: As the name suggests, the documents about an individual such as type
of job, designation, and interests are taken into account.
Secondary Data Collection Method
Data collected by someone other than the researcher is secondary data. Secondary
data is readily available and does not require any particular collection methods. It is
available in the form of historical archives, government data, organisational records etc.
This data can be obtained directly from the company or the organization where the research
is being conducted, or from outside sources.
The internal sources of secondary data include company documents, financial
statements, annual reports, team member information, and reports obtained from customers or
dealers. The external data sources include information from books, journals,
magazines, the census taken by the government, and the information available on the internet
about the research topic. The main advantage of this data aggregation method is that the data
is easy to collect, since it is readily accessible.
The secondary data collection methods, too, can involve both quantitative and qualitative
techniques. Secondary data is easily available and hence less time-consuming and less expensive
to obtain than primary data. However, with the secondary data collection methods, the
authenticity of the data gathered cannot always be verified.
Collection of Data in Statistics
There are various ways to represent data after gathering. But, the most popular method is to
tabulate the data using tally marks and then represent them in a frequency distribution table.
The frequency distribution table is constructed using tally marks. Tally marks are a
form of numerical system used for counting: vertical lines are used for the counting, and a
cross line is placed over every four vertical lines, so each crossed group gives a total of 5.
Example:
Consider a jar containing pieces of different colours. Counting the pieces of each colour with
tally marks and recording the totals gives a frequency distribution table.
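A small sketch of building such a frequency distribution in Python, assuming a hypothetical list of observed colours; every count of five corresponds to one crossed group of tally marks.

```python
# Frequency distribution sketch: tabulate observed colours into a frequency table.
from collections import Counter

observations = ["red", "blue", "red", "green", "blue", "red",
                "green", "red", "blue", "red", "green", "red"]

frequency = Counter(observations)

print(f"{'Colour':<8}{'Tally':<12}{'Frequency'}")
for colour, count in frequency.items():
    # Represent each group of five as four strokes crossed by a fifth.
    tally = "||||/ " * (count // 5) + "|" * (count % 5)
    print(f"{colour:<8}{tally:<12}{count}")
```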
DATA PREPARATION
Data preparation is the process of gathering, combining, structuring and organizing data so it
can be used in business intelligence (BI), analytics and data visualization applications. The
components of data preparation include data preprocessing, profiling, cleansing, validation and
transformation; it often also involves pulling together data from different internal systems and
external sources.
Data preparation work is done by information technology (IT), BI and data management teams
as they integrate data sets to load into a data warehouse, NoSQL database or data lake
repository, and then when new analytics applications are developed with those data sets. In
addition, data scientists, data engineers, other data analysts and business users increasingly use
self-service data preparation tools to collect and prepare data themselves.
Data preparation is often referred to informally as data prep. It's also known as data wrangling,
although some practitioners use that term in a narrower sense to refer to cleansing, structuring
and transforming data; that usage distinguishes data wrangling from the data pre-
processing stage.
Purposes of data preparation
One of the primary purposes of data preparation is to ensure that raw data being readied for
processing and analysis is accurate and consistent so the results of BI and analytics
applications will be valid. Data is commonly created with missing values, inaccuracies or other
errors, and separate data sets often have different formats that need to be reconciled when
they're combined. Correcting data errors, validating data quality and consolidating data sets are
big parts of data preparation projects.
Data preparation also involves finding relevant data to ensure that analytics applications deliver
meaningful information and actionable insights for business decision-making. The data often
is enriched and optimized to make it more informative and useful -- for example, by blending
internal and external data sets, creating new data fields, eliminating outlier values and
addressing imbalanced data sets that could skew analytics results.
In addition, BI and data management teams use the data preparation process to curate data sets
for business users to analyse. Doing so helps streamline and guide self-service BI applications
for business analysts, executives and workers.
What are the benefits of data preparation?
Data scientists often complain that they spend most of their time gathering, cleansing and
structuring data instead of analysing it. A big benefit of an effective data preparation process
is that they and other end users can focus more on data mining and data analysis -- the parts of
their job that generate business value. For example, data preparation can be done more quickly,
and prepared data can automatically be fed to users for recurring analytics applications.
Done properly, data preparation also helps an organization do the following:
• ensure the data used in analytics applications produces reliable results;
• identify and fix data issues that otherwise might not be detected;
• enable more informed decision-making by business executives and operational workers;
• reduce data management and analytics costs;
• avoid duplication of effort in preparing data for use in multiple applications; and
• get a higher ROI from BI and analytics initiatives.
Effective data preparation is particularly beneficial in big data environments that store a
combination of structured, semi structured and unstructured data, often in raw form until it's
needed for specific analytics uses. Those uses include predictive analytics, machine learning
(ML) and other forms of advanced analytics that typically involve large amounts of data to
prepare. For example, in an article on preparing data for machine learning, Felix Wick,
corporate vice president of data science at supply chain software vendor Blue Yonder, is quoted
as saying that data preparation "is at the heart of ML."
Steps in the data preparation process
Data preparation is done in a series of steps. There's some variation in the data preparation
steps listed by different data professionals and software vendors, but the process typically
involves the following tasks:
1. Data discovery and profiling. This step involves exploring the collected data to better
understand what it contains and what needs to be done to prepare it for the intended uses. To
help with that, data profiling identifies patterns, relationships and other attributes in the data,
as well as inconsistencies, anomalies, missing values and other issues so they can be addressed.
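As a hedged sketch of what data discovery and profiling can look like in practice (the table, column names and values below are invented), a few pandas calls already reveal structure, missing values, duplicates and an obvious anomaly.

```python
# Data discovery and profiling sketch on a hypothetical customer table.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],
    "age": [34, 29, np.nan, 120, 41],           # one missing, one implausible value
    "country": ["US", "us", "UK", "UK", "US"],  # inconsistent casing
})

print(df.dtypes)                                   # structure: column types
print(df.isna().sum())                             # missing values per column
print(df.describe())                               # basic statistics expose the outlier age of 120
print(df.duplicated(subset="customer_id").sum())   # duplicate keys
```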
What is data profiling?
Data profiling refers to the process of examining, analyzing, reviewing and summarizing data
sets to gain insight into the quality of data. Data quality is a measure of the condition of data
based on factors such as its accuracy, completeness, consistency, timeliness and accessibility.
Additionally, data profiling involves a review of source data to understand the data's structure,
content and interrelationships.
This review process delivers two high-level values to the organization: first, it provides a
high-level view of the quality of its data sets; and second, it helps the organization identify
potential data projects.
Given those benefits, data profiling is an important component of data preparation programs.
By helping organizations identify quality data, it serves as an important precursor to
data processing and data analytics activities.
Moreover, an organization can use data profiling and the insights it produces to continuously
improve the quality of its data and measure the results of that effort.
Data profiling may also be known as data archaeology, data assessment, data discovery or data
quality analysis.
Organizations use data profiling at the beginning of a project to determine if enough data has
been gathered, if any data can be reused or if the project is worth pursuing. The process of data
profiling itself can be based on specific business rules that will uncover how the data set aligns
with business standards and goals.
Types of data profiling
There are three types of data profiling.
• Structure discovery. This focuses on the formatting of the data, making sure everything is
uniform and consistent. It uses basic statistical analysis to return information about the validity
of the data.
• Content discovery. This process assesses the quality of individual pieces of data. For example,
ambiguous, incomplete and null values are identified.
• Relationship discovery. This detects connections, similarities, differences and associations
among data sources.
What are the steps in the data profiling process?
Data profiling helps organizations identify and fix data quality problems before the data is
analyzed, so data professionals aren't dealing with inconsistencies, null values or incoherent
schema designs as they process data to make decisions.
Data profiling statistically examines and analyzes data at its source and when loaded. It also
analyzes the metadata to check for accuracy and completeness.
It typically involves either writing queries or using data profiling tools.
A high-level breakdown of the process is as follows:
1. The first step of data profiling is gathering one or multiple data sources and the associated
metadata for analysis.
2. The data is then cleaned to unify structure, eliminate duplications, identify interrelationships
and find anomalies.
3. Once the data is cleaned, data profiling tools will return various statistics to describe the data
set. This could include the mean, minimum/maximum value, frequency, recurring patterns,
dependencies or data quality risks.
For example, by examining the frequency distribution of different values for each column in a
table, a data analyst could gain insight into the type and use of each column. Cross-column
analysis can be used to expose embedded value dependencies; inter-table analysis allows the
analyst to discover overlapping value sets that represent foreign key relationships between
entities.
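A brief, illustrative sketch of these profiling queries with pandas, on invented orders and customers tables: value_counts gives each column's frequency distribution, and comparing key columns across tables hints at a foreign key relationship.

```python
# Profiling sketch: column frequency distributions and inter-table key overlap.
import pandas as pd

orders = pd.DataFrame({"order_id": [10, 11, 12, 13],
                       "customer_id": [1, 2, 2, 5],
                       "status": ["paid", "paid", "refund", "paid"]})
customers = pd.DataFrame({"customer_id": [1, 2, 3, 4]})

# Frequency distribution of each column suggests its type and use.
for column in orders.columns:
    print(orders[column].value_counts(), "\n")

# Inter-table analysis: which order customer_ids exist in the customers table?
overlap = set(orders["customer_id"]) & set(customers["customer_id"])
print("customer_ids shared with the customers table:", overlap)
print("orphan customer_ids in orders:", set(orders["customer_id"]) - overlap)
```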
Benefits of data profiling
Data profiling returns a high-level overview of data that can result in the following benefits:
• leads to higher-quality, more credible data;
• helps with more accurate predictive analytics and decision-making;
• makes better sense of the relationships between different data sets and sources;
• keeps company information centralized and organized;
• eliminates errors, such as missing values or outliers, that add costs to data-driven projects;
• highlights areas within a system that experience the most data quality issues, such as data
corruption or user input errors; and
• produces insights surrounding risks, opportunities and trends.
Data profiling challenges
Although the objectives of data profiling are straightforward, the actual work involved is quite
complex, with multiple tasks occurring from the ingestion of data through its warehousing.
That complexity is one of the challenges organizations encounter when trying to implement
and run a successful data profiling program.
The sheer volume of data being collected by a typical organization is another challenge, as is
the range of sources -- from cloud-based systems to endpoint devices deployed as part of an
internet-of-things ecosystem -- that produce data.
The speed at which data enters an organization creates further challenges to having a successful
data profiling program.
These data prep challenges are even more significant in organizations that have not adopted
modern data profiling tools and still rely on manual processes for large parts of this work.
On a similar note, organizations that don't have adequate resources -- including trained data
professionals, tools and the funding for them -- will have a harder time overcoming these
challenges.
However, those same elements make data profiling more critical than ever to ensure that the
organization has the quality data it needs to fuel intelligent systems, customer personalization,
productivity-boosting automation projects and more.
Examples of data profiling
Data profiling can be implemented in a variety of use cases where data quality is important.
For example, projects that involve data warehousing or business intelligence may require
gathering data from multiple disparate systems or databases for one report or analysis.
Applying data profiling to these projects can help identify potential issues and corrections that
need to be made in extract, transform and load (ETL) jobs and other data integration processes
before moving forward.
Additionally, data profiling is crucial in data conversion or data migration initiatives that
involve moving data from one system to another. Data profiling can help identify data quality
issues that may get lost in translation or adaptions that must be made to the new system prior
to migration.
The following four methods, or techniques, are used in data profiling:
• column profiling, which assesses tables and quantifies entries in each column;
• cross-column profiling, which features both key analysis and dependency analysis;
• cross-table profiling, which uses key analysis to identify stray data as well as semantic and
syntactic discrepancies; and
• data rule validation, which assesses data sets against established rules and standards to validate
that they're being followed.
Data profiling tools
Data profiling tools replace much, if not all, of the manual effort of this function by discovering
and investigating issues that affect data quality, such as duplication, inaccuracies,
inconsistencies and lack of completeness.
These technologies work by analyzing data sources and linking sources to their metadata to
allow for further investigation into errors.
Furthermore, they offer data professionals quantitative information and statistics around data
quality, typically in tabular and graph formats.
Data management applications, for example, can manage the profiling process through tools
that eliminate errors and apply consistency to data extracted from multiple sources without the
need for hand coding.
Such tools are essential for many, if not most, organizations today as the volume of data they
use for their business activities significantly outpaces even a large team's ability to perform this
function through mostly manual efforts.
Data profile tools also generally include data wrangling, data gap and metadata discovery
capabilities as well as the ability to detect and merge duplicates, check for data similarities and
customize data assessments.
Commercial vendors that provide data profiling capabilities include Datameer, Informatica,
Oracle and SAS. Open source solutions include Aggregate Profiler, Apache Griffin, Quadient
DataCleaner and Talend.
2. Data cleansing. Next, the identified data errors and issues are corrected to create complete and
accurate data sets. For example, as part of cleansing data sets, faulty data is removed or fixed,
missing values are filled in and inconsistent entries are harmonized.
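A minimal cleansing sketch with pandas on invented data, covering the three operations just mentioned: removing faulty values, filling in missing values and harmonizing inconsistent entries.

```python
# Data cleansing sketch: remove faulty data, fill missing values, harmonize entries.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, -5, np.nan, 41],                        # -5 is faulty, NaN is missing
    "city": ["New York", "new york", "NYC", "Boston"],  # inconsistent spellings
})

# Remove rows with impossible ages (keep missing ages to be filled next).
df = df[df["age"].isna() | df["age"].between(0, 120)]

# Fill missing ages with the median of the remaining values.
df["age"] = df["age"].fillna(df["age"].median())

# Harmonize inconsistent city names to a single canonical form.
df["city"] = df["city"].str.lower().replace({"nyc": "new york"})

print(df)
```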
What is data cleansing?
Data cleansing, also referred to as data cleaning or data scrubbing, is the process of fixing
incorrect, incomplete, duplicate or otherwise erroneous data in a data set. It involves identifying
data errors and then changing, updating or removing data to correct them. Data cleansing
improves data quality and helps provide more accurate, consistent and reliable information for
decision-making in an organization.
Data cleansing is a key part of the overall data management process and one of the core
components of data preparation work that readies data sets for use in business intelligence (BI)
and data science applications. It's typically done by data quality analysts and engineers or other
data management professionals. But data scientists, BI analysts and business users may also
clean data or take part in the data cleansing process for their own applications.
Data cleansing vs. data cleaning vs. data scrubbing
Data cleansing, data cleaning and data scrubbing are often used interchangeably. For the most
part, they're considered to be the same thing. In some cases, though, data scrubbing is viewed
as an element of data cleansing that specifically involves removing duplicate, bad, unneeded
or old data from data sets.
Data scrubbing also has a different meaning in connection with data storage. In that context,
it's an automated function that checks disk drives and storage systems to make sure the data
they contain can be read and to identify any bad sectors or blocks.
HYPOTHESIS GENERATION
Data scientists work with data sets small and large, and are tellers of stories. These stories have
entities, properties and relationships, all described by data. Their apparatus and methods open
up data scientists to opportunities to identify, consolidate and validate hypotheses with data,
and use these hypotheses as starting points for our data narratives. Hypothesis generation is a
key challenge for data scientists. Hypothesis generation and by extension hypothesis
refinement constitute the very purpose of data analysis and data science.
Hypothesis generation for a data scientist can take numerous forms, such as:
1. They may be interested in the properties of a certain stream of data or a certain
measurement. These properties and their default or exceptional values may form a
certain hypothesis.
2. They may be keen on understanding how a certain measure has evolved over time. In
trying to understand this evolution of a system’s metric, or a person’s behaviour, they
could rely on a mathematical model as a hypothesis.
3. They could consider the impact of some properties on the states of systems, interactions
and people. In trying to understand such relationships between different measures and
properties, they could construct machine learning models of different kinds.
Ultimately, the purpose of such hypothesis generation is to simplify some aspect of system
behaviour and represent such behaviour in a manner that’s tangible and tractable based on
simple, explicable rules. This makes story-telling easier for data scientists when they become
new-age raconteurs, straddling data visualisations, dashboards with data summaries and
machine learning models.
5. Passenger details
Passengers can influence the trip duration knowingly or unknowingly. We usually
come across passengers requesting drivers to increase the speed as they are getting
late and there could be other factors to hypothesize which we can look at.
• Age of passengers: Senior citizens as passengers may contribute to higher
trip duration as drivers tend to go slow in trips involving senior citizens
• Medical conditions or pregnancy: Passengers with medical conditions
contribute to a longer trip duration
• Emergency: Passengers with an emergency could contribute to a shorter trip
duration
• Passenger count: Higher passenger count leads to shorter duration trips due
to congestion in seating
6. Date-Time Features
The day of the week and the time are important, as New York is a busy city and could
be highly congested during office hours or on weekdays. Let us now generate a few
hypotheses on the date and time-based features.
Pickup Day:
• Weekends could contribute to more outstation trips and could have a higher
trip duration
• Weekdays tend to have higher trip duration due to high traffic
• If the pickup day falls on a holiday then the trip duration may be shorter
• If the pickup day falls on a festive week then the trip duration could be lower
due to lesser traffic
Time:
• Early morning trips have a lesser trip duration due to lesser traffic
• Evening trips have a higher trip duration due to peak hours
7. Road-based Features
Roads are of different types and the condition of the road or obstructions in the road
are factors that can’t be ignored. Let’s form some hypotheses based on these factors.
• Condition of the road: The duration of the trip is more if the condition of the
road is bad
• Road type: Trips in concrete roads tend to have a lower trip duration
• Strike on the road: Strikes carried out on roads in the direction of the trip
causes the trip duration to increase
8. Weather Based Features
Weather can change at any time and could possibly impact the commute if the
weather turns bad. Hence, this is an important feature to consider in our hypothesis.
• Weather at the start of the trip: Rainy weather condition contributes to a
higher trip duration
After writing down our hypotheses and looking at the dataset, you will notice that
you have covered most of the features present in the data set. There is also a
possibility that you might have to work with fewer features, because some features on
which you have generated hypotheses are not currently being captured/stored by the
business and are not available.
Always go ahead and capture data from external sources if you think that the data
is relevant for your prediction. Ex: Getting weather information
It is also important to note that since hypothesis generation is an estimated guess,
the hypothesis generated could come out to be true or false once exploratory data
analysis and hypothesis testing is performed on the data.
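To test the date and time hypotheses listed above, the raw pickup timestamp would typically be broken into separate features first. The sketch below assumes a hypothetical column named pickup_datetime and illustrative trip durations; the real dataset's column names may differ.

```python
# Hypothesis-support sketch: derive date-time features from a pickup timestamp.
import pandas as pd

trips = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "2016-06-12 00:43:35", "2016-01-19 11:35:24"],
    "trip_duration": [455, 663, 2124],  # seconds (illustrative values)
})
trips["pickup_datetime"] = pd.to_datetime(trips["pickup_datetime"])

# Features needed to check the weekday/weekend and peak-hour hypotheses.
trips["pickup_day"] = trips["pickup_datetime"].dt.day_name()
trips["pickup_hour"] = trips["pickup_datetime"].dt.hour
trips["is_weekend"] = trips["pickup_datetime"].dt.dayofweek >= 5

print(trips[["pickup_day", "pickup_hour", "is_weekend", "trip_duration"]])
```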
MODELING:
After all the cleaning, formatting and feature selection, we will now feed the
data to the chosen model. But how does one select a model to use?
How to choose a model?
IT DEPENDS. It all depends on what the goal of your task or project is and this should already
be identified in the Business Understanding phase
Steps in choosing a model
1. Determine the size of the training data — if you have a small dataset with a low number of
observations and a high number of features, you can choose high-bias/low-variance
algorithms (Linear Regression, Naïve Bayes, Linear SVM). If your dataset is large and
has a high number of observations compared to the number of features, you can choose
low-bias/high-variance algorithms (KNN, decision trees).
2. Accuracy and/or interpretability of the output — if your goal is inference, choose
restrictive models, as they are more interpretable (Linear Regression, Least Squares). If your
goal is higher accuracy, then choose flexible models (Bagging, Boosting, SVM).
3. Speed or training time — always remember that higher accuracy as well as large
datasets means higher training time. Examples of easy to run and to implement
algorithms are: Naïve Bayes, Linear and Logistic Regression. Some examples
of algorithms that need more time to train are: SVM, Neural Networks, and Random
Forests.
4. Linearity — first check the linearity of your data by fitting a linear model or running a
logistic regression; you can also check the residual errors. Higher errors
mean that the data is not linear and needs more complex algorithms to fit it. If the data is linear,
you can choose: Linear Regression, Logistic Regression, Support Vector Machines. If
non-linear: Kernel SVM, Random Forest, Neural Nets (see the sketch below).
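The following sketch illustrates the linearity check in step 4 on synthetic data: compare the cross-validated score of a linear model with that of a flexible, non-linear one, and prefer the simpler model when the scores are close. The models and data are illustrative, not a prescription.

```python
# Model choice sketch: compare a linear model against a non-linear one.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

linear_score = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
nonlinear_score = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

print(f"Logistic Regression: {linear_score:.3f}")
print(f"Random Forest:       {nonlinear_score:.3f}")
# If the linear model is competitive, its interpretability usually makes it the better choice.
```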
Parametric vs. Non-Parametric Machine Learning Models
Parametric Machine Learning Algorithms
Parametric ML algorithms are algorithms that simplify the mapping function to a known form. They
are often called the “linear ML algorithms”.
Parametric ML Algorithms
• Logistic Regression
• Linear Discriminant Analysis
• Perceptron
• Naïve Bayes
• Simple Neural Networks
Benefits of Parametric ML Algorithms
• Simpler — easy to understand methods and easy to interpret results
• Speed — very fast to learn from the data provided
• Less data — it does not require as much training data
Limitations of Parametric ML Algorithms
• Limited Complexity —suited only to simpler problems
• Poor Fit — the methods are unlikely to match the underlying mapping function
Non-Parametric Machine Learning Algorithms
Non-parametric ML algorithms are algorithms that do not make assumptions about the form
of the mapping function. They are a good choice when you have a lot of data, no prior knowledge,
and you do not want to worry too much about choosing the right features.
Non-Parametric ML Algorithms
• K-Nearest Neighbors (KNN)
• Decision Trees like CART
• Support Vector Machines (SVM)
Benefits of Non-Parametric ML Algorithms
• Flexibility— it is capable of fitting a large number of functional forms
• Power — they make no assumptions about the underlying function
• Performance — able to give a higher performance model for predictions
Limitations of Non-Parametric ML Algorithms
• Needs more data — requires a large training dataset
• Slower processing — they often have more parameters which means that training time
is much longer
• Overfitting — higher risk of overfitting the training data and results are harder to
explain why specific predictions were made
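As an illustrative comparison (synthetic data, not a benchmark), a parametric model such as linear regression and a non-parametric one such as k-nearest neighbours can be fitted to the same non-linear relationship to show the flexibility trade-off described above.

```python
# Parametric vs. non-parametric sketch on a synthetic non-linear relationship.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=300)   # non-linear mapping plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

parametric = LinearRegression().fit(X_train, y_train)                     # assumes a fixed linear form
nonparametric = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)  # assumes no form

print("Linear Regression R^2:", round(r2_score(y_test, parametric.predict(X_test)), 3))
print("KNN Regressor R^2:    ", round(r2_score(y_test, nonparametric.predict(X_test)), 3))
# The flexible non-parametric model fits the curved relationship far better,
# at the cost of needing more data and being harder to interpret.
```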
The Data Modeling phase is broken down into four tasks, each with its projected
outcome or output in detail.
Simply put, the tasks of the Data Modeling phase are:
1. Selecting modeling techniques
The wonderful world of data mining offers lots of modeling techniques, but
not all of them will suit your needs. Narrow the list based on the kinds of
variables involved, the selection of techniques available in your tools, and
any business considerations that are important to you.
For example, many organizations favour methods with output that’s easy to
interpret, so decision trees or logistic regression might be acceptable, but
neural networks would probably not be accepted.
Deliverables for this task include two reports:
• Modeling technique: Specify the technique(s) that you will use.
• Modeling assumptions: Many modeling techniques are based on
certain assumptions. For example, a model type may be intended for
use with data that has a specific type of distribution. Document these
assumptions in this report.
2.Designing tests
The test in this task is the test that you’ll use to determine how well your model works. It may
be as simple as splitting your data into a group of cases for model training and another group
for model testing.
Training data is used to fit mathematical forms to the data model, and test data is used during
the model-training process to avoid overfitting: making a model that’s perfect for one dataset,
but no other. You may also use holdout data, data that is not used during the model-training
process, for an additional test.
The deliverable for this task is your test design. It need not be elaborate, but you should at least
take care that your training and test data are similar and that you avoid introducing any bias
into the data.
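A minimal sketch of such a test design with scikit-learn on synthetic data: first set aside holdout data that is never touched during training, then split the remainder into training and test sets.

```python
# Test design sketch: training, test, and holdout splits of the same data set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First set aside a holdout sample that is never used during model training.
X_rest, X_holdout, y_rest, y_holdout = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the remainder into training data (to fit the model) and test data
# (to detect overfitting during model development).
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), "training rows,", len(X_test), "test rows,", len(X_holdout), "holdout rows")
```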
3. Building model(s)
Modeling is what many people imagine to be the whole job of the data miner, but it’s just one
task of dozens! Nonetheless, modeling to address specific business goals is the heart of the
data-mining profession.
Deliverables for this task include three items:
• Parameter settings: When building models, most tools give you the option of
adjusting a variety of settings, and these settings have an impact on the structure of the
final model. Document these settings in a report.
• Model descriptions: Describe your models. State the type of model (such as linear
regression or neural network) and the variables used. Explain how the model is
interpreted. Document any difficulties encountered in the modeling process.
• Models: This deliverable is the models themselves. Some model types can be easily
defined with a simple equation; others are far too complex and must be transmitted in
a more sophisticated format.
4. Assessing model(s)
Now you will review the models that you’ve created, from a technical standpoint and also from
a business standpoint (often with input from business experts on your project team).
Deliverables for this task include two reports:
• Model assessment: Summarizes the information developed in your model review. If
you have created several models, you may rank them based on your assessment of their
value for a specific application.
• Revised parameter settings: You may choose to fine-tune settings that were used to
build the model and conduct another round of modeling and try to improve your results.
VALIDATION:
Why data validation?
Data validation happens immediately after data preparation/wrangling and before
modeling. This is because during data preparation there is a high possibility of things going wrong,
especially in complex scenarios.
Data validation ensures that modeling happens on the right data; faulty data as input to
the model would generate faulty insights!
How is data validation done?
Data validation should be done by involving at least one external person who has a
proper understanding of the data and the business.
It is usually the client who is technically good enough to check the data. Once we go through
data preparation and just before data modeling, we usually create data visualizations and hand
the newly prepared data over to the client.
The client, with the help of SQL queries or any other tools, tries to validate that the output
contains no errors.
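The sketch below shows the kind of rule-based checks a reviewer might run on the prepared data, using simple pandas assertions on an invented table; the actual rules would come from the domain expert.

```python
# Data validation sketch: simple rule-based checks on a prepared data set.
import pandas as pd

prepared = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [120.0, 75.5, 310.0],
    "order_date": pd.to_datetime(["2023-01-05", "2023-02-11", "2023-03-20"]),
})

# Each assertion encodes a business rule agreed with the domain expert.
assert prepared["order_id"].is_unique, "Duplicate order ids found"
assert prepared["amount"].ge(0).all(), "Negative order amounts found"
assert prepared.notna().all().all(), "Missing values remain after preparation"
assert prepared["order_date"].le(pd.Timestamp.today()).all(), "Orders dated in the future"

print("All validation checks passed")
```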
Combining CRISP-DM/ASUM-DM with the agile methodology, steps can be taken in
parallel, meaning you do not have to wait for the green light from data validation to do the
modeling. But once you get feedback from the domain expert that there are faults in the data,
you need to correct the data by re-doing the data preparation and re-modelling the data.
What are the common causes leading to a faulty output from data preparation?
Common causes are:
1. Lack of proper understanding of the data, therefore, the logic of the data preparation
is not correct.
2. Common bugs in programming/data preparation pipeline that led to a faulty output.
EVALUATION:
The evaluation phase includes three tasks. These are
• Evaluating results
• Reviewing the process
• Determining the next steps
INTERPRETATION
Data interpretation is the process of assigning meaning to the collected information and
determining the conclusions, significance, and implications of the findings.
Data Interpretation Examples
Data interpretation is the final step of data analysis. This is where you turn results into
actionable items. To better understand it, here is an instance of interpreting data:
Let's say you have segmented your user base into four age groups. A company can then notice
which age group is most engaged with its content or product. Based on bar charts or pie charts,
it can either develop a marketing strategy to make the product more appealing to the less
involved groups, or develop an outreach strategy that expands on its core user base.
Steps Of Data Interpretation
Data interpretation is conducted in 4 steps:
• Assembling the information you need (like bar graphs and pie charts);
• Developing findings or isolating the most relevant inputs;
• Developing conclusions;
• Coming up with recommendations or actionable solutions.
Considering how these findings dictate the course of action, data analysts must be accurate
with their conclusions and examine the raw data from multiple angles. Different variables may
allude to various problems, so having the ability to backtrack data and repeat the analysis
using different templates is an integral part of a successful business strategy.
What Should Users Question During Data Interpretation?
To interpret data accurately, users should be aware of potential pitfalls present within this
process. You need to ask yourself if you are mistaking correlation for causation. If two things
occur together, it does not indicate that one caused the other.
The 2nd thing you need to be aware of is your own confirmation bias. This occurs when you
try to prove a point or a theory and focus only on the patterns or findings that support that
theory while discarding those that do not.
The 3rd problem is irrelevant data. To be specific, you need to make sure that the data you
have collected and analyzed is relevant to the problem you are trying to solve.
Data Interpretation Methods
Data analysts or data analytics tools help people make sense of the numerical data that has been
aggregated, transformed, and displayed. There are two main methods for data interpretation:
quantitative and qualitative.
Qualitative Data Interpretation Method
This is a method for breaking down or analyzing so-called qualitative data, also known as
categorical data. It is important to note that no bar graphs or line charts are used in this method.
Instead, they rely on text. Because qualitative data is collected through person-to-person
techniques, it isn't easy to present using a numerical approach.
Surveys are used to collect data because they allow you to assign numerical values to answers,
making them easier to analyze. If we rely solely on the text, it would be a time-consuming and
error-prone process. This is why it must be transformed.
Quantitative Data Interpretation Method
This data interpretation is applied when we are dealing with quantitative or numerical data.
Since we are dealing with numbers, the values can be displayed in a bar chart or pie chart.
There are two main types: Discrete and Continuous. Moreover, numbers are easier to analyze
since they involve statistical modeling techniques like mean and standard deviation.
Mean is an average value of a particular data set obtained or calculated by dividing the sum of
the values within that data set by the number of values within that same set.
Standard deviation is a technique used to ascertain how responses align with or deviate
from the average value, or mean. It relies on the mean to describe the consistency of the
replies within a particular data set. You can use this when calculating the average pay for a
certain profession and then displaying the upper and lower values in the data set.
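A tiny sketch of both measures on a hypothetical set of salaries, using Python's statistics module:

```python
# Interpretation sketch: mean and standard deviation of hypothetical salaries.
import statistics

salaries = [42000, 45000, 47000, 52000, 39000, 61000]

mean_salary = statistics.mean(salaries)
std_dev = statistics.stdev(salaries)   # sample standard deviation

# A band of one standard deviation around the mean gives the upper and lower values.
print(f"Mean salary: {mean_salary:.0f}")
print(f"Typical range: {mean_salary - std_dev:.0f} to {mean_salary + std_dev:.0f}")
```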
As stated, some tools can do this automatically, especially when it comes to quantitative data.
Whatagraph is one such tool as it can aggregate data from multiple sources using different
system integrations. It will also automatically organize and analyze that data, which can later be
displayed in pie charts, line charts, or bar charts, however you wish.
Benefits Of Data Interpretation
Multiple data interpretation benefits explain its significance within the corporate world,
medical industry, and financial industry:
Informed decision-making. The managing board must examine the data to take action and
implement new methods. This emphasizes the significance of well-analyzed data as well as a
well-structured data collection process.
Anticipating needs and identifying trends. Data analysis provides users with relevant
insights that they can use to forecast trends. It would be based on customer concerns and
expectations.
For example, a large number of people are concerned about privacy and the leakage of personal
information. Products that provide greater protection and anonymity are more likely to become
popular.
Clear foresight. Companies that analyze and aggregate data better understand their own
performance and how consumers perceive them. This provides them with a better
understanding of their shortcomings, allowing them to work on solutions that will significantly
improve their performance.
Some queries are raised against the database, such as “Were the decision and action impactful?”,
“What was the return on investment?” and “How did the analysis group compare with the
control group?”. The performance-based database is continuously updated once new
insight or knowledge is extracted.
Decision Support Systems (DSS)
Decision Support Systems (DSS) help executives make better decisions by using
historical and current data from internal Information Systems and external sources. By
combining massive amounts of data with sophisticated analytical models and tools,
and by making the system easy to use, they provide a much better source of
information to use in the decision-making process.
Decision Support Systems (DSS) are a class of computerized information systems that
support decision-making activities. DSS are interactive computer-based systems and
subsystems intended to help decision makers use communications technologies, data,
documents, knowledge and/or models to successfully complete decision process
tasks.
This traditional list of components remains useful because it identifies similarities and
differences between categories or types of DSS. The DSS framework is primarily based on the
different emphases placed on DSS components when systems are actually constructed.
Multi-participant systems like Group and Inter-Organizational DSS also create complex
implementation issues. For instance, when implementing a Data-Driven DSS a designer
should be especially concerned about the user’s interest in applying the DSS in
unanticipated or novel situations. Despite the significant differences created by the
specific task and scope of a DSS, all Decision Support Systems have similar technical
components and share a common purpose, supporting decision-making.
Mathematical and analytical models are the major component of a Model-Driven DSS.
Each Model-Driven DSS has a specific set of purposes and hence different models are
needed and used. Choosing appropriate models is a key design issue. Also, the
software used for creating specific models needs to manage needed data and the user
interface. In Model-Driven DSS the values of key variables or parameters are changed,
often repeatedly, to reflect potential changes in supply, production, the economy,
sales, the marketplace, costs, and/or other environmental and internal factors.
Information from the models is then analyzed and evaluated by the decision-maker.
• Monetary cost. The decision support system requires investing in an information system
to collect data from many sources and analyse it to support decision making.
Some analyses for a Decision Support System need advanced data analysis,
statistics, econometrics and information systems expertise, so it is costly to hire the
specialists to set up the system.
• Overemphasize decision making. Clearly the focus of those of us interested in
computerized decision support is on decisions and decision making. Implementing
Decision Support System may reinforce the rational perspective and overemphasize
decision processes and decision making. It is important to educate managers about
the broader context of decision making and the social, political and emotional factors
that impact organizational success. It is especially important to continue examining
when and under what circumstances Decision Support System should be built and
used. We must continue asking if the decision situation is appropriate for using any
type of Decision Support System and if a specific Decision Support System is or remains
appropriate to use for making or informing a specific decision.
• Assumption of relevance. According to Winograd and Flores (1986), “Once a
computer system has been installed it is difficult to avoid the assumption that the
things it can deal with are the most relevant things for the manager’s concern.” The
danger is that once DSS become common in organizations, managers will use
them inappropriately. There is limited evidence that this occurs. Again, training is the
only way to avoid this potential problem.
• Transfer of power. Building Decision Support Systems, especially knowledge-driven
Decision Support System, may be perceived as transferring decision authority to a
software program. This is more a concern with decision automation systems than with
DSS. We advocate building computerized decision support systems because we want
to improve decision making while keeping a human decision maker in the “decision
loop”. In general, we value the “need for human discretion and innovation” in the
decision making process.
• Unanticipated effects. Implementing decision support technologies may have
unanticipated consequences. It is conceivable and it has been demonstrated that some
DSS reduce the skill needed to perform a decision task. Some Decision Support System
overload decision makers with information and actually reduce decision making
effectiveness.
• Obscuring responsibility. The computer does not make a “bad” decision, people do.
Unfortunately some people may deflect personal responsibility to a DSS. Managers
need to be continually reminded that the computerized decision support system is an
intermediary between the people who built the system and the people who use the
system. The entire responsibility associated with making a decision using a DSS resides
with people who built and use the system.
• False belief in objectivity. Managers who use Decision Support Systems may or may
not be more objective in their decision making. Computer software can encourage
more rational action, but managers can also use decision support technologies to
rationalize their actions. It is an overstatement to suggest that people using a DSS are
more objective and rational than managers who are not using computerized decision
support.
• Status reduction. Some managers argue using a Decision Support System will
diminish their status and force them to do clerical work. This perceptual problem can
be a disadvantage of implementing a DSS. Managers and IS staff who advocate
building and using computerized decision support need to deal with any status issues
that may arise. This perception may be, or should be, less common now that computer
usage is common and accepted in organizations.
• Information overload. Too much information is a major problem for people and
many DSS increase the information load. Although this can be a problem, Decision
Support System can help managers organize and use information. Decision Support
System can actually reduce and manage the information load of a user. Decision
Support System developers need to try to measure the information load created by
the system and Decision Support System users need to monitor their perceptions of
how much information they are receiving. The increasing ubiquity of handheld, wireless
computing devices may exacerbate this problem and disadvantage.
In conclusion, before firms invest in Decision Support Systems, they must
compare the advantages and disadvantages of the decision support system to ensure the
investment is worthwhile.
Business Forecasting
3.1 Introduction
The growing competition, rapidity of change in circumstances and the trend towards
automation demand that decisions in business are based on a careful analysis of data
concerning the future course of events and not purely on guesses and hunches. The
future is unknown to us and yet every day we are forced to make decisions involving
the future and therefore, there is uncertainty. Great risk is associated with business
affairs. All businessmen are forced to make forecasts regarding business activities.
Success in business depends upon successful forecasts of business events. In recent
times, considerable research has been conducted in this field. Attempts are being
made to make forecasting as scientific as possible.
Business forecasting is not a new development. Every businessman must forecast,
even if the entire product is sold before production. Forecasting has always been
necessary. What is new in the attempt to put forecasting on a scientific basis is to
forecast by reference to past history and statistics rather than by pure intuition and
guess-work.
One of the most important tasks before businessmen and economists these days is to
make estimates for the future. For example, a businessman is interested in finding
out his likely sales for the next year, or for long-term planning over the next five or ten years,
so that he can adjust his production accordingly and avoid the possibility of either inadequate
production to meet the demand or unsold stocks.
Similarly, an economist is interested in estimating the likely population in the coming
years so that proper planning can be carried out with regard to jobs for the people,
food supply, etc. The first step in making estimates for the future consists of gathering
information from the past. In this connection we usually deal with statistical data
which is collected, observed or recorded at successive intervals of time. Such data is
generally referred to as time series. Thus, when we observe numerical data at
different points of time the set of observations is known as time series.
Objectives:
After studying this unit, you should be able to:
• describe the meaning of business forecasting
• distinguish between prediction, projection and forecast
• describe the forecasting methods available
• apply the forecasting theories in taking effective business decisions
3.2 Business Forecasting
Business forecasting refers to the analysis of past and present economic conditions
with the object of drawing inferences about probable future business conditions. The
process of making definite estimates of future course of events is referred to as
forecasting and the figure or statements obtained from the process is known as
‘forecast’; future course of events is rarely known. In order to be assured of the
coming course of events, an organised system of forecasting helps. The following are
two aspects of scientific business forecasting:
1. Analysis of past economic conditions
For this purpose, the components of time series are to be studied. The secular trend
shows how the series has been moving in the past and what its future course is likely
to be over a long period of time. The cyclic fluctuations would reveal whether the
business activity is subjected to a boom or depression. The seasonal fluctuations
would indicate the seasonal changes in the business activity.
2. Analysis of present economic conditions
The object of analysing present economic conditions is to study those factors which
affect the sequential changes expected on the basis of the past conditions. Such
factors are new inventions, changes in fashion, changes in economic and political
spheres, economic and monetary policies of the government, war, etc. These factors may affect and alter the duration of the trade cycle. Therefore, it is essential to keep in
mind the present economic conditions since they have an important bearing on the
probable future tendency.
3.2.1 Objectives of forecasting in business
Forecasting is a part of human nature. Businessmen also need to look to the future.
Success in business depends on correct predictions. In fact, when a man enters business, he automatically takes on the responsibility of attempting to forecast the future.
To a very large extent, success or failure would depend upon the ability to
successfully forecast the future course of events. Without some element of continuity
between past, present and future, there would be little possibility of successful
prediction. But history is not likely to repeat itself and we would hardly expect
economic conditions next year or over the next 10 years to follow a clear cut
prediction. Yet, past patterns prevail sufficiently to justify using the past as a basis
for predicting the future.
A businessman cannot afford to base his decisions on guesses. Forecasting helps a
businessman in reducing the areas of uncertainty that surround management decision
making with respect to costs, sales, production, profits, capital investment, pricing,
expansion of production, extension of credit, development of markets, increase of
inventories and curtailment of loans. These decisions are to be based on present
indications of future conditions.
However, we know that it is impossible to forecast the future precisely. There is a
possibility of occurrence of some range of error in the forecast. Statistical forecasts
are the methods in which we can use the mathematical theory of probability to
measure the risks of errors in predictions.
3.2.1.1 Prediction, Projection and Forecasting
A great amount of confusion seems to have grown up in the use of the words ‘forecast’,
‘prediction’ and ‘projection’.
Key Statistic
A prediction is an estimate based solely on past data of the series under
investigation. It is purely a statistical extrapolation.
A projection is a prediction, where the extrapolated values are subject to
certain numerical assumptions.
A forecast is an estimate which relates the series in which we are interested to external factors.
Forecasts are made by estimating future values of the external factors by means of
prediction, projection or forecast and from these values calculating the estimate of
the dependent variable.
3.2.2 Characteristics of Business Forecasting
• Based on past and present conditions
Business forecasting is based on the past and present economic conditions of the business. To forecast the future, various data, information and facts concerning the past and present economic condition of the business are analysed.
• Based on mathematical and statistical methods
The process of forecasting includes the use of statistical and mathematical methods.
By using these methods, the actual trend which may take place in the future can be forecasted.
• Period
The forecasting can be made for long term, short term, medium term or any specific
period.
• Estimation of future
Business forecasting is to forecast the future regarding probable economic conditions.
• Scope
Forecasting can be physical as well as financial.
3.2.3 Steps in forecasting
Forecasting of business fluctuations consists of the following steps:
1. Understanding why changes in the past have occurred
One of the basic principles of statistical forecasting is that the forecaster should use
past performance data. The current rate and changes in the rate constitute the basis of
forecasting. Once they are known, various mathematical techniques can develop
projections from them. If an attempt is made to forecast business fluctuations without understanding why past changes have taken place, the forecast will be purely mechanical: it will be based solely upon the application of mathematical formulae and will be subject to serious error.
2. Determining which phases of business activity must be measured
After understanding the reasons why business fluctuations occur, it is necessary to measure certain phases of business activity in order to predict what changes will probably follow the present level of activity.
Quantitative forecasting
The quantitative forecasting method relies on historical data to predict future needs and
trends. The data can be from your own company, market activity, or both. It focuses on cold,
hard numbers that can show clear courses of change and action. This method is beneficial
for companies that have an extensive amount of data at their disposal.
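As a sketch of what a simple quantitative forecast can look like, the following Python fragment fits a straight-line trend to a short, hypothetical sales history and extrapolates one period ahead (the figures and the choice of a linear trend are assumptions for illustration only):

# A minimal sketch of quantitative forecasting: fit a straight-line trend
# y = a + b*t to historical sales and extrapolate one period ahead.
def linear_trend_forecast(history):
    """Fit y = a + b*t by least squares and forecast the next period."""
    n = len(history)
    t = list(range(1, n + 1))
    mean_t = sum(t) / n
    mean_y = sum(history) / n
    numerator = sum((ti - mean_t) * (yi - mean_y) for ti, yi in zip(t, history))
    denominator = sum((ti - mean_t) ** 2 for ti in t)
    b = numerator / denominator
    a = mean_y - b * mean_t
    return a + b * (n + 1)

# Hypothetical monthly sales for the last six months (thousands of dollars).
sales_history = [120, 135, 128, 142, 150, 147]
print(round(linear_trend_forecast(sales_history), 1))  # trend-based forecast for month 7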
Qualitative forecasting
The qualitative forecasting method relies on the input of those who influence your company’s
success. This includes your target customer base and even your leadership team. This method
is beneficial for companies that don’t have enough complex data to conduct a quantitative
forecast.
There are two approaches to qualitative forecasting:
1. Market research: The process of collecting data points through direct correspondence with
the market community. This includes conducting surveys, polls, and focus groups to gather
real-time feedback and opinions from the target market. Market research looks at
competitors to see how they adjust to market fluctuations and adapt to changing supply and
demand. Companies commonly utilize market research to forecast expected sales for new
product launches.
2. Delphi method: This method collects forecasting data from company professionals. The
company’s foreseeable needs are presented to a panel of experts, who then work together to
forecast the expectations and business decisions that can be made with the derived insights.
This method is used to create long-term business predictions and can also be applied to sales
forecasts.
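A minimal sketch of the Delphi idea, assuming a simple median is used to summarise the panel's view between rounds (the experts, figures, and aggregation rule are hypothetical):

# A minimal sketch of Delphi-style aggregation: experts submit forecasts,
# see the group median, and may revise in the next round. Figures are hypothetical.
from statistics import median

round_1 = {"expert_a": 500, "expert_b": 650, "expert_c": 480, "expert_d": 700}
group_view = median(round_1.values())          # feedback shared with the panel

# After seeing the group view, experts revise their estimates (round 2).
round_2 = {"expert_a": 550, "expert_b": 620, "expert_c": 540, "expert_d": 640}
consensus_forecast = median(round_2.values())

print("Round 1 median:", group_view)
print("Round 2 consensus:", consensus_forecast)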
3.3 Utility of Business Forecasting
Business forecasting acquires an important place in every field of the economy.
Business forecasting helps the businessmen and industrialists to form the policies and
plans related with their activities. On the basis of the forecasting, businessmen can
forecast the demand of the product, price of the product, condition of the market and
so on. The business decisions can also be reviewed on the basis of business
forecasting.
3.3.1 Advantages of business forecasting
• Helpful in increasing profit and reducing losses
Every business is carried out with the purpose of earning maximum profits. So, by forecasting the future price of the product and its demand, the businessman can predetermine the cost of production, the level of production and the stock to be held. Thus, business forecasting is regarded as a key to business success.
• Helpful in taking management decisions
Business forecasting provides the basis for management decisions, because in
present times the management has to take the decision in the atmosphere of
uncertainties. Also, business forecasting explains the future conditions and enables
the management to select the best alternative.
• Useful to administration
On the basis of forecasting, the government can control the circulation of money. It
can also modify the economic, fiscal and monetary policies to avoid adverse effects
of trade cycles. So, with the help of forecasting, the government can control the
expected fluctuations in future.
• Basis for capital market
Business forecasting helps in estimating the requirement of capital, position of stock
exchange and the nature of investors.
• Useful in controlling the business cycles
The trade cycles cause various disturbances in business, such as sudden changes in the price level, increase in the risk of business, increase in unemployment, etc. By adopting systematic business forecasting, businessmen and the government can handle and control the depression phase of trade cycles.
• Helpful in achieving the goals
Business forecasting helps to achieve the objective of business goals through proper
planning of business improvement activities.
• Facilitates control
By business forecasting, the tendency of black marketing, speculation, uneconomic
activities and corruption can be controlled.
• Utility to society
With the help of business forecasting the entire society is also benefited because the
adverse effects of fluctuations in the conditions of business are kept under control.
3.3.2 Limitations of business forecasting
Business forecasting cannot be accurate due to various limitations which are
mentioned below.
• Forecasting cannot be accurate, because it is largely based on future events and
there is no guarantee that they will happen.
• Business forecasting is generally made by using statistical and mathematical
methods. However, these methods cannot claim to make an uncertain future a
definite one.
• The underlying assumptions of business forecasting cannot always be satisfied simultaneously. In such a case, the results of forecasting will be misleading.
• The forecasting cannot guarantee the elimination of errors and mistakes. The
managerial decision will be wrong if the forecasting is done in a wrong way.
• Factors responsible for economic changes are often difficult to discover and measure. Hence, business forecasting can become an uncertain exercise.
• Business forecasting does not evaluate risks.
• The forecasting is made on the basis of past information and data and relies on
the assumption that economic events are repeated under the same conditions. But
there may be circumstances where these conditions are not repeated.
• Forecasting is not a one-time exercise; in order to be effective, it requires continuous attention.
Predictive Analytics
Predictive Analytics is a statistical method that utilizes algorithms and machine learning to
identify trends in data and predict future behaviors.
With increasing pressure to show a return on investment (ROI) for implementing learning
analytics, it is no longer enough for a business to simply show how learners performed or
how they interacted with learning content. It is now desirable to go beyond descriptive
analytics and gain insight into whether training initiatives are working and how they can be
improved.
Predictive Analytics can take both past and current data and offer predictions of what could
happen in the future. This identification of possible risks or opportunities enables businesses
to take actionable intervention in order to improve future learning initiatives.
The software for predictive analytics has moved beyond the realm of statisticians and is
becoming more affordable and accessible for different markets and industries, including the
field of learning & development.
For online learning specifically, predictive analytics is often found incorporated in the
Learning Management System (LMS), but can also be purchased separately as specialized
software.
For the learner, predictive forecasting could be as simple as a dashboard located on the main
screen after logging in to access a course. Analyzing data from past and current progress,
visual indicators in the dashboard could be provided to signal whether the employee was on
track with training requirements.
At the business level, an LMS with predictive analytic capability can help improve decision-making by offering in-depth insight into strategic questions and concerns. These could cover anything from course enrolment, to course completion rates, to employee performance.
Predictive analytic models
Because predictive analytics goes beyond sorting and describing data, it relies heavily on
complex models designed to make inferences about the data it encounters. These models
utilize algorithms and machine learning to analyze past and present data in order to provide
future trends.
Each model differs depending on the specific needs of those employing predictive analytics.
Some common basic models that are utilized at a broad level include:
• Decision trees use branching to show possibilities stemming from each
outcome or choice.
• Regression techniques assist with understanding relationships between variables (a short sketch of both model types follows this list).
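A minimal sketch of both model families, using scikit-learn on hypothetical promotion-spend and sales figures (the library choice and data are assumptions, not taken from the text):

# A minimal sketch of the two model families above, on hypothetical data:
# promotion spend (input) versus product sales (output).
# scikit-learn is assumed to be installed; figures are illustrative only.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = [[10], [20], [30], [40], [50]]    # promotion spend (thousands of dollars)
y = [150, 210, 260, 330, 380]         # product sales (thousands of dollars)

regression = LinearRegression().fit(X, y)            # relationship between variables
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)  # branching on spend levels

print(regression.predict([[35]]))     # trend-based estimate for a $35k spend
print(tree.predict([[35]]))           # rule-based estimate for the same spend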
Predictive Modeling
Predictive modeling means developing models that can be used to forecast or predict future events. In
business analytics, models can be developed based on logic or data.
Logic-Driven Models
A logic-driven model is one based on experience, knowledge, and logical relationships of
variables and constants connected to the desired business performance outcome situation.
The question here is how to put variables and constants together to create a model that can
predict the future. Doing this requires business experience. Model building requires an
understanding of business systems and the relationships of variables and constants that seek
to generate a desirable business performance outcome. To help conceptualize the
relationships inherent in a business system, diagramming methods can be helpful. For
example, the cause-and-effect diagram is a visual aid diagram that permits a user to
hypothesize relationships between potential causes of an outcome (see Figure). This diagram
lists potential causes in terms of human, technology, policy, and process resources in an
effort to establish some basic relationships that impact business performance. The diagram
is used by tracing contributing and relational factors from the desired business performance
goal back to possible causes, thus allowing the user to better picture sources of potential
causes that could affect the performance. This diagram is sometimes referred to as a fishbone
diagram because of its appearance.
Fig. Cause-and-effect diagram
A simple example of a logic-driven model is the profit relationship:
Profit = Revenue − Total Cost, or
Profit = (Unit Price × Quantity Sold) − [Fixed Cost + (Variable Cost × Quantity Sold)]
The relationships in this simple example are based on fundamental business knowledge.
Consider, however, how complex cost functions might become without some idea of how
they are mapped together. It is necessary to be knowledgeable about the business systems
being modeled in order to capture the relevant business behavior. Cause-and-effect diagrams
and influence diagrams provide tools to conceptualize relationships, variables, and
constants, but it often takes many other methodologies to explore and develop predictive
models.
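A small worked example of the profit relationship above, with hypothetical figures:

# A small worked example of the logic-driven profit model:
# Profit = (Unit Price * Quantity Sold) - [Fixed Cost + (Variable Cost * Quantity Sold)]
# All figures below are hypothetical.
def profit(unit_price, quantity_sold, fixed_cost, variable_cost):
    revenue = unit_price * quantity_sold
    total_cost = fixed_cost + variable_cost * quantity_sold
    return revenue - total_cost

# e.g. selling 1,000 units at $25 each, with $8,000 fixed cost and $12 variable cost per unit:
print(profit(unit_price=25, quantity_sold=1_000, fixed_cost=8_000, variable_cost=12))
# 25*1000 - (8000 + 12*1000) = 25000 - 20000 = 5000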
Suppose a grocery store has collected a big data file on what customers put
into their baskets at the market (the collection of grocery items a customer
purchases at one time). The grocery store would like to know if there are any
associated items in a typical market basket. (For example, if a customer
purchases product A, she will most often associate it or purchase it with product
B.) If the customer generally purchases product A and B together, the store
might only need to advertise product A to gain both product A’s and B’s sales.
The value of knowing this association of products can improve the performance
of the store by reducing the need to spend money on advertising both products.
The benefit is real if the association holds true.
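A minimal sketch of how such an association could be checked, counting support and confidence over a handful of hypothetical baskets (product names and figures are illustrative):

# A minimal sketch of measuring the association "customers who buy A also buy B"
# using support and confidence over hypothetical market baskets.
baskets = [
    {"A", "B", "milk"},
    {"A", "B"},
    {"A", "bread"},
    {"B", "eggs"},
    {"A", "B", "eggs"},
]

n = len(baskets)
count_a = sum(1 for b in baskets if "A" in b)
count_ab = sum(1 for b in baskets if {"A", "B"} <= b)

support_ab = count_ab / n                # how often A and B appear together
confidence_a_to_b = count_ab / count_a   # given A is bought, how often B is bought too

print(f"support(A,B) = {support_ab:.2f}, confidence(A->B) = {confidence_a_to_b:.2f}")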
The K-mean clustering process provides a quick way to classify data into
differentiated groups. To illustrate this process, use the sales data in Figure 6.3
and assume these are sales from individual customers. Suppose a company
wants to classify the sales customers into high and low sales groups.
The SAS K-Mean cluster software can be found in Proc Cluster. Any
integer value can designate the K number of clusters desired. In this problem
set, K=2. The SAS printout of this classification process is shown in Table 6.3.
The Initial Cluster Centers table listed an initial high (20,167) and low (12,369) value from the data set as the clustering process began. As it turns out,
the software divided the customers into 9 high sales customers and 11 low sales
customers.
Consider how large big data sets can be. Then realize this kind of
classification capability can be a useful tool for identifying and predicting sales
based on the mean values.
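Because Figure 6.3 and the SAS output are not reproduced here, the following Python sketch only illustrates the same K-mean idea with K = 2 on hypothetical customer sales values (scikit-learn is assumed to be available; it is not the SAS procedure used in the text):

# An illustrative sketch of K-mean clustering with K = 2, splitting customers
# into high- and low-sales groups. Sales figures are hypothetical stand-ins
# for the Figure 6.3 data; scikit-learn is assumed to be installed.
from sklearn.cluster import KMeans

sales = [[12369], [12800], [13150], [13900], [14200], [15050],
         [18700], [19300], [19850], [20167]]          # one sales value per customer

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sales)

for amount, label in zip(sales, kmeans.labels_):
    print(amount[0], "-> cluster", label)
print("Cluster means:", kmeans.cluster_centers_.ravel())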
The case study firm had collected a random sample of monthly sales
information presented in Figure 6.4 listed in thousands of dollars. What the
firm wants to know is, given a fixed budget of $350,000 for promoting this
service product, when it is offered again, how best should the company allocate
budget dollars in hopes of maximizing the future estimated month’s product
sales? Before the firm makes any allocation of budget, there is a need to
understand how to estimate future product sales. This requires understanding
the behavior of product sales relative to sales promotion efforts using radio,
paper, TV, and point-of-sale (POS) ads.
Figure 6.4 Data for marketing/planning case study
To aid in supporting a final decision and to ensure these analytics are the
best possible estimates, we can consider an additional statistic. That tie breaker
is the R-Squared (Adjusted) statistic, which is commonly used in multiple
regression models.
The R-Square Adjusted statistic does not have the same interpretation as R-
Square (a precise, proportional measure of variation in the relationship). It is
instead a comparative measure of suitability of alternative independent
variables. It is ideal for selection between independent variables in a multiple
regression model. The R-Square adjusted seeks to take into account the
phenomenon of the R-Square automatically increasing when additional
independent variables are added to the model. This phenomenon is like a
painter putting paint on a canvas, where more paint additively increases the
value of the painting. Yet by continually adding paint, there comes a point at
which some paint covers other paint, diminishing the value of the original.
Similarly, statistically adding more variables should increase the ability of the
model to capture what it seeks to model. On the other hand, putting in too many
variables, some of which may be poor predictors, might bring down the total
predictive ability of the model. The R-Square adjusted statistic provides some
information to aid in revealing this behavior.
The value of the R-Square adjusted statistic can be negative, but it will always be less than or equal to the R-Square to which it is related. Unlike R-Square, the R-Square adjusted increases when a new independent variable is included only if the new variable improves the R-Square more than would be expected by chance. If a set of
independent variables is introduced into a regression model one at a time in
forward step-wise regression using the highest correlations ordered first, the R-
Square adjusted statistic will end up being equal to or less than the R-Square
value of the original model. By systematic experimentation with the R-Square
adjusted recomputed for each added variable or combination, the value of the
R-Square adjusted will reach a maximum and then decrease. The multiple
regression model with the largest R-Square adjusted statistic will be the most
accurate combination of having the best fit without excessive or unnecessary
independent variables. Again, just putting all the variables into a model may
add unneeded variability, which can decrease its accuracy. Thinning out the
variables is important.
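For reference, the standard formula behind the R-Square adjusted statistic, with n observations and k independent variables, can be computed as follows (a textbook definition, not reproduced from the chapter's own tables; the example values are hypothetical):

# Standard adjusted R-Square formula: penalizes R-Square for the number of
# independent variables k relative to the number of observations n.
def adjusted_r_square(r_square, n, k):
    """R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)"""
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

# e.g. with 20 observations:
print(adjusted_r_square(0.90, n=20, k=2))   # two predictors  -> about 0.888
print(adjusted_r_square(0.91, n=20, k=5))   # five predictors -> about 0.878
# A small gain in R-Square can be outweighed by the penalty for extra variables.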
Table 6.9 SAS Best Variable Combination Regression Model and Statistics:
Marketing/Planning Case Study
Although there are many other additional analyses that could be performed
to validate this model, we will use the SAS multiple regression model in Table
6.9 for the firm in this case study. The forecasting model can be expressed as follows:
Yp = b0 + 275.69 X1 + 48.34 X2
where:
Yp = estimated product sales (in dollars)
X1 = dollars allocated to radio commercials
X2 = dollars allocated to TV commercials
b0 = the regression intercept reported in Table 6.9 (not reproduced here)
Because all the data used in the model is expressed as dollars, the
interpretation of the model is made easier than using more complex data. The
interpretation of the multiple regression model suggests that for every dollar
allocated to radio commercials (represented by X1), the firm will receive
$275.69 in product sales (represented by Yp in the model). Likewise, for every
dollar allocated to TV commercials (represented by X2), the firm will receive
$48.34 in product sales.
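A hedged sketch of the model as quoted above; the intercept comes from Table 6.9, which is not reproduced here, so it is left as an input rather than invented:

# A sketch of the chapter's forecasting model using the coefficients quoted in
# the text ($275.69 per radio dollar, $48.34 per TV dollar). The intercept
# appears in Table 6.9, which is not reproduced here, so it stays a parameter.
def estimated_product_sales(radio_dollars, tv_dollars, intercept):
    return intercept + 275.69 * radio_dollars + 48.34 * tv_dollars

# Example call for Problem 1's allocation, once the Table 6.9 intercept is supplied:
# estimated_product_sales(70_000, 250_000, intercept=...)  # intercept from Table 6.9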
In summary, for this case study, the predictive analytics analysis has
revealed a more detailed, quantifiable relationship between the generation of
product sales and the sources of promotion that best predict sales. The best way
to allocate the $350,000 budget to maximize product sales might involve
placing the entire budget into radio commercials because they give the best
return per dollar of budget. Unfortunately, there are constraints and limitations
regarding what can be allocated to the different types of promotional methods.
Optimizing the allocation of a resource and maximizing business performance
necessitate the use of special business analytic methods designed to accomplish
this task. This requires the additional step of prescriptive analytics analysis in
the BA process, which will be presented in the last section of Chapter 7.
Summary
This chapter dealt with the predictive analytics step in the BA process.
Specifically, it discussed logic-driven models based on experience and aided
by methodologies like the cause-and-effect and the influence diagrams. This
chapter also defined data-driven models useful in the predictive step of the BA
analysis. A further discussion of data mining was presented. Data mining methodologies such as neural networks, discriminant analysis, logistic regression, and hierarchical clustering were described. An illustration of K-mean clustering using SAS was presented. Finally, this chapter discussed the
second installment of a case study illustrating the predictive analytics step of
the BA process. The remaining installment of the case study will be presented
in Chapter 7.
Once again, several of this book’s appendixes are designed to augment the
chapter material by including technical, mathematical, and statistical tools. For
both a greater understanding of the methodologies discussed in this chapter and
a basic review of statistical and other quantitative methods, a review of the
appendixes is recommended.
Discussion Questions
1. Why is predictive analytics analysis the next logical step in any business
analytics (BA) process?
Problems
1. Using a similar equation to the one developed in this chapter for predicting dollar product sales (see the forecasting model above, where X1 is the radio budget and X2 is the TV budget in dollars), what is the forecast for dollar product sales if the firm could invest $70,000 in radio commercials and $250,000 in TV commercials?
3. Assume for this problem the following table would have held true for the
resulting marketing/planning case study problem. Which combination of
variables is estimated here to be the best predictor set? Explain why.
4. Assume for this problem that the following table would have held true for
the resulting marketing/planning case study problem. Which of the variables is
estimated here to be the best predictor? Explain why.