BA Unit 1
Data science. It's a discipline that has been constantly evolving. Just when you're sure
you've worked out what a data scientist is, someone goes and pulls the rug out from under
you! With a barrage of new terms and buzzwords flying around, even HR managers in the field
get confused, so how are you supposed to keep up with the fields of business and data analytics,
data science, business intelligence, machine learning, and AI?
It may seem a bit complicated in the beginning, but I assure you that everything will be clear after
we walk through it. Let's start at the very beginning.
Business Analytics
Before talking about data science, it is a good idea to start from a much older concept
- business. Here's a list of branches from the business field.
For a second, have a think about which ones are related to business, to data, or both.
Look at the picture below to check if your ideas matched ours. Note that the blue
rectangle contains activities related to business and the pink one to data. If something sits in an
area that overlaps, then it is related to both fields.
As you can see, all terms are business activities, but only some are data-driven; the rest are experience-driven. The data-driven activities include:
A preliminary report
A visual representation of your company’s performance for last year
A business dashboard
A forecast for the future sales of your company.
Business case studies are real-world experiences of how business people and companies succeed
or fail. Qualitative analytics is about using your intuition and knowledge to assist in future
planning. You don’t need a dataset to learn from either. Thus, both remain in the blue rectangle.
Now would be the perfect time to introduce a timeline. Some of the terms you see refer to activities
that explain past behavior, while others refer to activities used for predicting future behavior.
We’re going to put a line through the middle to represent the present. Therefore, all terms that are on the right of this line will regard future planning and forecasting, while those on the left of the line relate to the analysis of past events or data.
Again, take a moment to decide which aspects refer to which point in time.
Business case studies examine events that have already happened. For instance, one could learn
from them and attempt to prevent making a similar mistake in the future, so this activity refers to
the past.
Contrast it to the other business term, ‘qualitative analytics’ which includes working with tools
that help predict future behavior and you’ll realize this must be placed on the right.
Preparing a report or a dashboard is always a reflection of past data, so these terms will remain on the left. Forecasting, though, is a future-oriented activity; we’ve put it to the right of the black line, but not too far – it must still belong to the sphere of business, so it sits in the area where business analytics and data intersect.
Does this mean that the preliminary data report, reporting with visuals, creating dashboards, and
sales forecasting are of interest to a data scientist? Yes, absolutely. You may notice we added a couple of aspects that weren’t there before – good eye! ‘Optimization of drilling operations’ and ‘digital signal processing’ are a couple of examples that fit into sub-areas outside of business.
Consider the oil and gas industry, and the optimization of drilling operations. This is a perfect
example of an aspect which requires data science and data analytics but NOT business. We use
data science to improve predictions based on data extracted from typical drilling activities in order to improve drilling efficiency. And that certainly isn’t business analytics. We use digital signals to represent data in the form of discrete values. Therefore, we can apply data analytics to a digital signal to produce a higher-quality signal, without going into data science.
ANALYTICS LIFE CYCLE
BUSINESS ANALYTICS LIFE CYCLE
Business Analytics is a very prevalent term in the 21st century across various sectors. It
corresponds to a set of methodologies and tools that change how organizations approach decision-making. Since the impact of Business Analytics is so significant, organizations have defined
a business analytics lifecycle to make sure not to commit mistakes or miss out on any crucial
information. This process is termed the Business Analytics process. The steps of the process may
vary from organization to organization as a lot of factors, viz. the industry, the type of product, the
size of the company, etc., play major roles in determining them. However, broadly, you can
classify the entire Business Analytics process into six steps.
We will discuss the Business Analytics process and its six steps in the following sections.
Business Analytics is a term that took industries by storm in the 21st century. All businesses around
the world were looking to make more and more profits, and the only way they could do that was
by finding out gaps and filling them. The Business Analytics process initially came as a problem-
solving approach to many organizations where data was being captured and accessed. This data
was then used for multiple purposes, ranging from improving customer services to predicting
fraud. Due to its vast success, people quickly realized that Business Analytics can not only solve pre-existing visible problems but also alert them to elusive problems that are not yet apparent.
Once the world started noticing the impact of Business Analytics, organizations soon realized that
its potential is not limited to just problem-solving; they can also use it to predict, plan, improvise, and overcome various obstacles that they may encounter. Business Analytics is a discipline
where you use the pre-existing data to find out key insights that can help you solve a business
problem. To find the said insights, you have to apply a lot of statistical models, as well as
manipulate the data to fit such models. In today’s world, Business Analytics is so important that
almost every organization has a Business Analytics team and well-defined business analytics process steps. Since there are problems and gaps in all forms of businesses, Business Analytics is
a viable approach across all industries. From the food industry to the IT sector, everyone is
employing Business Analytics to find out the optimum ways to do business.
The Business Analytics process involves asking questions, looking at data, and
manipulating it to find the required answers. Now, every organization has different ways
to execute this process as all of these organizations work in different sectors and value
different metrics more than the others based on their specific business model.
Since the approach to business is different for different organizations, their solutions
and their ways of reaching those solutions are also different. Nonetheless, all of the actions they take can be classified and generalized to understand their approach. The image below demonstrates the steps in the Business Analytics process of a firm:
The above image covers just an overview of the Business Analytics process. Now, let’s convert
it into the actual steps that are involved in solving problems.
Step 1: Identifying the Business Problem
The first step of the process is identifying the business problem. The problem could be an actual
crisis; it could be something related to recognizing business needs or optimizing current processes.
This is a crucial stage in Business Analytics as it is important to clearly understand what the
expected outcome should be. When the desired outcome is determined, it is further broken down
into smaller goals. Then, business stakeholders decide the relevant data required to solve the
problem. Some important questions must be answered in this stage, such as: What kind of data is
available? Is there sufficient data? And so on.
Step 2: Data Collection and Cleaning
Once the problem statement is defined, the next step is to gather data (if required) and, more
importantly, cleanse the data—most organizations would have plenty of data, but not all data
points would be accurate or useful. Organizations collect huge amounts of data through different
methods, but at times, junk data or empty data points would be present in the dataset. These faulty
pieces of data can hamper the analysis. Hence, it is very important to clean the data that has to be
analyzed.
To do this, you must impute missing values, remove outliers, and create new variables as combinations of other variables. You may also need to plot time series graphs, as they generally
indicate patterns and outliers. It is very important to remove outliers as they can have a heavy
impact on the accuracy of the model that you create. Moreover, cleaning the data helps you get a
better sense of the dataset.
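To make the cleaning step concrete, here is a minimal Python sketch using pandas. The file name and the column names (sales.csv, date, revenue, units_sold) are assumptions made for the example, not part of any specific dataset; the point is simply to show imputation, outlier removal, a derived variable, and a time series plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the raw data (hypothetical file and column names)
df = pd.read_csv("sales.csv", parse_dates=["date"])

# Impute missing numeric values with the column median
df["revenue"] = df["revenue"].fillna(df["revenue"].median())

# Remove outliers lying more than 3 standard deviations from the mean
mean, std = df["revenue"].mean(), df["revenue"].std()
df = df[(df["revenue"] - mean).abs() <= 3 * std]

# Create a new variable as a combination of existing ones
df["revenue_per_unit"] = df["revenue"] / df["units_sold"]

# Plot a time series to spot remaining patterns or anomalies
df.set_index("date")["revenue"].plot(title="Revenue over time")
plt.show()
```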
Step 3: Analysis
Once the data is ready, the next thing to do is analyze it. Various statistical methods (such as hypothesis testing, correlation, etc.) are used to find the insights you are looking for; you can use any method for which you have the necessary data.
The prime way of analyzing is pivoting around the target variable, so you need to take into account whatever factors affect the target variable. In addition to that, a lot of assumptions are also
considered to find out what the outcomes can be. Generally, at this step, the data is sliced, and the
comparisons are made. Through these methods, you are looking to get actionable insights.
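As a hedged illustration of "slicing" the data around a target variable, the sketch below groups a hypothetical dataset by a categorical factor, compares the target across groups, checks a simple correlation, and runs a basic hypothesis test. The file and columns (sales.csv, promotion, sales, ad_spend) are assumptions for the example.

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("sales.csv")  # hypothetical dataset

# Slice the data: compare the target variable across a factor of interest
summary = df.groupby("promotion")["sales"].agg(["mean", "median", "count"])
print(summary)

# Correlation between a numeric driver and the target variable
print("correlation:", df["ad_spend"].corr(df["sales"]))

# Hypothesis test: do promoted and non-promoted periods differ in mean sales?
promoted = df.loc[df["promotion"] == 1, "sales"]
not_promoted = df.loc[df["promotion"] == 0, "sales"]
t_stat, p_value = stats.ttest_ind(promoted, not_promoted, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```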
Step 4: Prediction
Gone are the days when analytics was used only to react. In today’s era, Business Analytics is all about
being proactive. In this step, you will use prediction techniques, such as neural networks or
decision trees, to model the data. These prediction techniques will help you find out hidden insights
and relationships between variables, which will further help you uncover patterns on the most
important metrics. Typically, several models are built simultaneously, and the models with the most accuracy are chosen. In this stage, a lot of conditions are also checked as parameters, and
answers to a lot of ‘what if…?’ questions are provided.
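A minimal sketch of this "try several models, keep the most accurate" idea, assuming a prepared numeric feature table: it fits a decision tree and a simple neural network on synthetic stand-in data and keeps whichever scores better on held-out data. It is an illustration, not a prescribed implementation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a prepared business dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "neural_network": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
print(scores, "-> best model:", best)
```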
Step 5: Implementation
From the insights that you receive from your model built on target variables, a viable plan of action
will be established in this step to meet the organization’s goals and expectations. The said plan of
action is then put to work, and the waiting period begins. You will have to wait to see the actual
outcomes of your predictions and find out how successful you were in your endeavors. Once you
get the outcomes, you will have to measure and evaluate them.
Step 6: Evaluation and Optimization
After the implementation of the solution, the outcomes are measured as mentioned above. If you
find some methods through which the plan of action can be optimized, then those can be
implemented. If that is not the case, then you can move on with registering the outcomes of the
entire process. This step is crucial for any analytics in the future because you will have an ever-
improving database. Through this database, you can get closer and closer to maximum
optimization. In this step, it is also important to evaluate the ROI (return on investment). Taken together, these six steps form the life cycle of business analytics; a small worked ROI example follows.
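ROI itself is a simple ratio. As a worked example with made-up numbers: if an analytics initiative cost 50,000 and the optimized process returned 65,000 in savings, the ROI is (65,000 - 50,000) / 50,000 = 30%. A tiny helper makes the arithmetic explicit:

```python
def roi(gain: float, cost: float) -> float:
    """Return on investment expressed as a fraction of the cost."""
    return (gain - cost) / cost

# Hypothetical figures for illustration only
print(f"ROI: {roi(gain=65_000, cost=50_000):.0%}")  # ROI: 30%
```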
Why Do We Need Data Science and Analytics?
In earlier days, the size of the data was minimal, and it was effortless to analyze the data by
using some business intelligence tools. But with the advancement of digital technology and more
data getting generated from several different sources such as financial logs, text files, multimedia
forms, sensors, instruments, etc., companies face significant challenges in cleansing and analyzing this unstructured data with traditional business intelligence tools. Industry estimates indicated that the share of unstructured data would rise to around 80% by the end of 2020.
Hence, we need tools that are built on the latest technology, use advanced algorithms, and are capable
of cleansing, preparing, and processing this massive chunk of unstructured data to produce
meaningful insights.
THE LIFECYCLE OF DATA SCIENCE
There are multiple phases in the lifecycle of data science. Let’s understand it better with a real-
life example. Imagine that you run a retail shop and your primary goal is to improve the sales of
the shop. To identify the factors that drive your sales numbers, you must answer a few questions, such as: Which products are the most profitable? Are you gaining any benefit from the in-store promotions? These questions are best answered by following the steps involved in the lifecycle
of data science.
A data science life cycle includes the following steps:
Data Discovery
The data discovery phase involves identifying the multiple sources from which you gather raw and unstructured data, such as videos, images, text files, etc. So, as per the above example, you need a
clear understanding of the factors that affect your sales to procure the data that will be relevant
for your further analysis. You can consider the following factors: store location, staff, working
hours, promotions, product pricing, and so on.
Data Preparation
The next stage of the data science lifecycle is preparing the raw and unstructured data for further
analysis. For this, you need to convert the data into a standard format so that you can work on it
seamlessly. This phase includes steps for exploring, pre-processing, and conditioning of data.
After your data is cleaned and pre-processed, it is much easier to perform exploratory analytics
on it.
Model Planning
The model planning phase includes the methods and techniques that you will use to determine
the relationships between variables. This relationship can act as a base for the algorithms that are
used at the time of model building. You can use several different tools for model planning, such as SQL Analysis Services, R programming, or SAS/ACCESS. Out of all these tools, R programming is the most commonly used tool in model planning.
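Although the text above names SQL, R, and SAS, the same relationship-checking can be sketched in Python as well. The snippet below builds a correlation view against a hypothetical numeric target column ("sales") to shortlist candidate predictors before model building; the columns and random data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical prepared dataset with a numeric target column "sales"
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sales": rng.normal(100, 20, 200),
    "footfall": rng.normal(500, 50, 200),
    "ad_spend": rng.normal(30, 5, 200),
})

# Correlation of every other numeric variable with the target, strongest first
correlations = df.corr()["sales"].drop("sales")
print(correlations.sort_values(key=abs, ascending=False))
```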
Model Building
In the model-building phase, you will create different datasets for training and testing purposes.
For this purpose, you can divide your dataset in a 70:30 ratio: 70% of the data will be used to train the model, and the remaining 30% will be used to test the trained model.
You can use techniques such as classification, association, or clustering to build your model.
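A hedged sketch of the 70/30 split with a simple classification model; the synthetic data stands in for the prepared dataset from the earlier phases, and the choice of logistic regression is just one possible classification technique.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the prepared data
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# 70% of the rows train the model; the remaining 30% test it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```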
Operationalize
In the operationalize phase, you will deliver the final reports, briefings, code, and any other
technical documents.
Communicate Results
In the last phase, you will evaluate if you can achieve the goal that you set in the first phase. In
this phase, you will communicate all your critical findings to the respective stakeholders and
determine whether your project results in a success or failure based on the criteria defined in
phase 1.
Difference Between a Data Science and Analytics Role
As stated above, data science is an umbrella term that includes data analytics, machine learning,
and data mining; hence, data analytics can be considered as a subset of data science. Data science
is the blend of various tools, algorithms, and machine learning principles that are studied to
discover the meaningful pattern and information from the raw and unstructured data. On the
other hand, data analysis explains what is going on by processing the history of the data, and
includes techniques such as descriptive analytics, diagnostic analytics, predictive analytics, and
prescriptive analytics. Each of the methods stated has its applications in the field of business.
For example, descriptive analytics helps answer questions about what happened and summarize
large datasets to describe outcomes to stakeholders. Diagnostic analytics helps answer why
things happened and supplement more basic descriptive analytics. Predictive analytics helps
answer questions such as what will happen in the future, and identifies the trends and determines
if they are likely to recur. Prescriptive analytics finds an answer to what should be done and
helps businesses make informed decisions in the face of uncertainty.
The two job titles, data scientist and data analyst, therefore map to different skill sets and scopes of responsibility.
TYPES OF ANALYTICS
There are 4 different types of analytics: Descriptive, Diagnostic, Predictive, and Prescriptive
analytics, through which you can eradicate flaws and promote informed decisions. By
implementing these methods, decision-making becomes much more efficient. However, the right
combination of analytics is essential.
Analytical methods, or analytics for short, change the game for the better. Each type has its own reasoning and expected outcomes, so you are rarely caught off-guard, and each is backed by a structured process for analyzing data at every stage.
So far, we have come up with four broad categories, viz. descriptive, diagnostic, predictive, and
prescriptive analytical methods.
Each one is used in a particular scenario and helps you comprehend where the company is at.
Accordingly, it leads you to an insightful solution. Before getting into the intricate details of every
analytical method, defining them briefly would be ideal for better understanding.
DESCRIPTION
Firstly, we have descriptive analytics, under which you do the required bare minimum of sorting
and categorizing. It includes summarizing your data through business intelligence tools. The
purpose is to get clarity on a particular event.
Next up, we have diagnostic analytics. As the words suggest, it focuses on diagnosing the event.
You consider the past performances to understand and track why the current event has taken place.
By the end of this process, you will have yourself an analytical dashboard.
Meanwhile, predictive analytics focuses on making predictions of what the possible outcomes
could be. However, these are not baseless predictions; they depend on machine learning techniques. You might have come across statistical models, which will also come in handy. Finally, prescriptive analytics recommends what should be done, combining the insights from the other three types.
Descriptive Analysis
Diagnostic Analysis
Predictive Analysis
Prescriptive Analysis
Below, we will introduce each type and give examples of how they are utilized in business.
Descriptive Analysis
The first type of data analysis is descriptive analysis. It is at the foundation of all data insight. It is
the simplest and most common use of data in business today. Descriptive analysis answers the
“what happened” by summarizing past data, usually in the form of dashboards.
The biggest use of descriptive analysis in business is to track Key Performance Indicators (KPIs). KPIs describe how a business is performing based on chosen benchmarks. Typical examples include:
KPI dashboards
Monthly revenue reports
Sales leads overview
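For instance, a monthly revenue report of the kind listed above can be produced with a simple aggregation. The sketch below is hedged: the file orders.csv and the columns order_date and revenue are assumptions for the example.

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])  # hypothetical export

# Monthly revenue report: total revenue per calendar month
monthly_revenue = orders.set_index("order_date")["revenue"].resample("M").sum()
print(monthly_revenue)
```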
Diagnostic Analysis
After asking the main question of “what happened”, you may then want to dive deeper and ask “why did it happen?” This is where diagnostic analysis comes in.
Diagnostic analysis takes the insight found from descriptive analytics and drills down to find the
cause of that outcome. Organizations make use of this type of analytics as it creates more
connections between data and identifies patterns of behavior.
A critical aspect of diagnostic analysis is creating detailed records. When new problems arise, it is possible you have already collected certain data pertaining to the issue. Having that data at your disposal ends the need to repeat work and makes all problems interconnected.
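As a hedged illustration of "drilling down", the sketch below starts from a descriptive finding (revenue fell last month) and slices the same hypothetical orders table by region to locate where the drop came from. File and column names are assumptions.

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["order_date"])  # hypothetical export
orders["month"] = orders["order_date"].dt.to_period("M")

# Break revenue down by region and month, then compare the last two months
by_region = orders.pivot_table(
    index="region", columns="month", values="revenue", aggfunc="sum"
)
last_two = by_region.iloc[:, -2:].copy()
last_two["change"] = last_two.iloc[:, 1] - last_two.iloc[:, 0]

# Regions with the largest decline point to where to investigate further
print(last_two.sort_values("change"))
```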
Predictive Analysis
Predictive analysis attempts to answer the question “what is likely to happen”. This type of
analytics utilizes previous data to make predictions about future outcomes.
This type of analysis is another step up from the descriptive and diagnostic analyses. Predictive
analysis uses the data we have summarized to make logical predictions of the outcomes of events.
This analysis relies on statistical modeling, which requires added technology and manpower to
forecast. It is also important to understand that forecasting is only an estimate; the accuracy of predictions relies on high-quality, detailed data.
While descriptive and diagnostic analysis are common practices in business, predictive analysis is where many organizations begin to show signs of difficulty. Some companies do not have the manpower to implement predictive analysis everywhere they would like. Others are not yet willing to invest in analysis teams across every department or are not prepared to educate their current teams. Common applications of predictive analysis include:
Risk Assessment
Sales Forecasting
Using customer segmentation to determine which leads have the best chance of converting
Predictive analytics in customer success teams
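A minimal sales-forecasting sketch (one of the applications listed above), assuming a monthly revenue series: it fits a simple linear trend and projects the next three months. The numbers are made up, and real forecasting would typically use richer statistical models, but the shape of the workflow is the same.

```python
import numpy as np

# Hypothetical monthly revenue figures for the past year
revenue = np.array([110, 115, 120, 118, 125, 130, 128, 135, 140, 138, 145, 150])
months = np.arange(len(revenue))

# Fit a linear trend and extrapolate three months ahead
slope, intercept = np.polyfit(months, revenue, deg=1)
future = np.arange(len(revenue), len(revenue) + 3)
forecast = intercept + slope * future
print("Forecast for the next three months:", forecast.round(1))
```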
Prescriptive Analysis
The final type of data analysis is the most sought after, but few organizations are truly equipped
to perform it. Prescriptive analysis is the frontier of data analysis, combining the insight from all
previous analyses to determine the course of action to take in a current problem or decision.
Prescriptive analysis utilizes state-of-the-art technology and data practices. It is a huge organizational commitment, and companies must be sure that they are ready and willing to put forth the effort and resources.
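Prescriptive analysis ultimately turns predictions into a recommended action. A toy, hedged sketch with made-up numbers: given a predicted demand for each candidate price point, it recommends the price that maximizes expected profit.

```python
# Hypothetical predicted demand (units) at each candidate price point
predicted_demand = {9.99: 1200, 11.99: 950, 13.99: 700}
unit_cost = 6.50

def expected_profit(price: float) -> float:
    """Expected profit at a given price, based on the predicted demand."""
    return (price - unit_cost) * predicted_demand[price]

# Prescribe the price with the highest expected profit
best_price = max(predicted_demand, key=expected_profit)
print("Recommended price:", best_price)
```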
Final Thoughts
Data analytics is an important process for breaking down how your organization functions, primarily because of the help it offers to businesses: optimizing your performance becomes swift and efficient. Implementing the required data analytics into your business model reduces costs and the risk of failure. Moreover, it helps you identify more efficient ways of going about your daily business!
BUSINESS PROBLEM DEFINITION
As a business analyst you will have to understand your clients’ needs and constructively provide
valuable solution options. You will have to find the real roots of the needs and approach problems
in a way that will enable change.
Your task is not just to collect requirements. It’s to elicit requirements in order to ensure long-lasting change. It is common for clients to come with a solution already in mind. For example, a client may request the addition of a step to the process. Digging deeper and trying to figure out the actual need behind this request may reveal that there is another way of addressing that need.
The following stages are commonly used by Business Analysts when problem solving is required.
1) Problem Definition
The first step in the approach is the problem definition. Gathering information, ascertaining its validity against other sources of information, and analyzing the available information are key at this stage. The way a problem is first identified and then defined can have a significant impact on the alternatives that may emerge. Identifying the problem will also delineate the goals and objectives that the alternative solutions should cover. The more complete a problem statement is, the easier the identification, selection, and evaluation of alternatives will be.
Too wide or too narrow a definition of the problem can impact the quality of the solution. Analysts are asked to find the balance between a narrow and a broad scope so that several alternatives remain available.
Focusing on the symptoms rather than the causes is a common mistake in defining a problem. Of
course the subjectivity involved in characterizing the symptom often makes this mistake inevitable.
Many techniques such as the “5 Whys” can help in avoiding this pitfall.
Choosing the right problem means that, while there may be parallel problems, we must use a systemic approach to choose the problem that most likely drives, to some extent, the other problems. Systemic thinking is of paramount importance as there is usually an interdependence between seemingly unrelated problems.
2) Alternative Solutions
Once the problem is identified, the analyst should, together with the technical team, search for possible solutions.
Solution options have to be aligned with the project scope, the overall business needs, and technical feasibility. Solution options must be realistic from both a business and a technical standpoint and, of course, valid in the eyes of the stakeholders.
A common mistake in this step is to abandon an alternative too quickly. This often happens under
the pressure of time and other circumstances. However, just because an alternative seems convenient, this does not make it ideal. It may have harmful side effects, or it may be less effective than other alternatives that would have emerged if enough time had been given at this stage.
One way to limit the error of the incomplete “pool of alternatives” is to involve key stakeholders
in discussions of identifying different solutions. It’s a good way for different perspectives to be
presented and contribute to different solution alternatives.
Each solution option shall be assessed against the other solution options. The business analyst, in collaboration with the key stakeholders, identifies the criteria that will be used for this comparison.
A cost-benefit analysis is commonly used for each solution option in order to weigh the benefits against the costs. However, sometimes the full benefits or costs cannot be monetized, and indirect benefits or costs may result from the implementation of a solution. So, it is not a good idea to compare different options based strictly on a cost-benefit analysis, as it is not easy to think of all costs and benefits and give them a value.
An analyst understands the cognitive limitations of human information processing capabilities and the difficulty of making fully optimal decisions. It is worth noting that the best alternative is chosen within an environment of bounded rationality. An environment of bounded rationality is created as the limits of the decision-making process are set by the available information and the context.
Problem solving is vital in all aspects of business from people problems to technical problems and
from short-term to long-term problems. And problem-solving involves two completely different,
possibly conflicting thought processes: creativity and decision making. A business analyst shall continuously try to improve problem-solving skills by putting useful techniques and approaches into practice and continuously following up on the outcomes.
DATA COLLECTION
WHAT IS DATA COLLECTION?
Data collection is the methodological process of gathering information about a specific subject.
It’s crucial to ensure your data is complete during the collection phase and that it’s collected legally
and ethically. If not, your analysis won’t be accurate and could have far-reaching consequences.
In general, there are three types of consumer data:
First-party data, which is collected directly from users by your organization
Second-party data, which is data shared by another organization about its customers (or its first-
party data)
Third-party data, which is data that’s been aggregated and rented or sold by organizations that
don’t have a connection to your company or users
Although there are use cases for second- and third-party data, first-party data (data you’ve
collected yourself) is more valuable because you receive information about how your audience
behaves, thinks, and feels—all from a trusted source.
Data can be qualitative (meaning contextual in nature) or quantitative (meaning numeric in
nature). Many data collection methods apply to either type, but some are better suited to one over
the other.
In the data life cycle, data collection is the second step. After data is generated, it must be collected
to be of use to your team. After that, it can be processed, stored, managed, analyzed, and visualized
to aid in your organization’s decision-making.
Before collecting data, there are several factors you need to define:
The question you aim to answer
The data subject(s) you need to collect data from
The collection timeframe
The data collection method(s) best suited to your needs
The data collection method you select should be based on the question you want to answer, the
type of data you need, your timeframe, and your company’s budget. Explore the options in the
next section to see which data collection method is the best fit.
1. Surveys
Surveys are physical or digital questionnaires that gather both qualitative and quantitative data
from subjects. One situation in which you might conduct a survey is gathering attendee feedback
after an event. This can provide a sense of what attendees enjoyed, what they wish was different,
and areas you can improve or save money on during your next event for a similar audience.
Because they can be sent out physically or digitally, surveys present the opportunity for
distribution at scale. They can also be inexpensive; running a survey can cost nothing if you use a
free tool. If you wish to target a specific group of people, partnering with a market research firm
to get the survey in the hands of that demographic may be worth the money.
Something to watch out for when crafting and running surveys is the effect of bias, including:
Collection bias: It can be easy to accidentally write survey questions with a biased lean. Watch
out for this when creating questions to ensure your subjects answer honestly and aren’t swayed
by your wording.
Subject bias: Because your subjects know their responses will be read by you, their answers
may be biased toward what seems socially acceptable. For this reason, consider pairing survey
data with behavioral data from other collection methods to get the full picture.
2. Transactional Tracking
Each time your customers make a purchase, tracking that data can allow you to make decisions
about targeted marketing efforts and understand your customer base better.
Often, e-commerce and point-of-sale platforms allow you to store data as soon as it’s generated,
making this a seamless data collection method that can pay off in the form of customer insights.
3. Interviews and Focus Groups
Through interviews and focus groups, you can gather feedback from people in your target audience
about new product features. Seeing them interact with your product in real-time and recording
their reactions and responses to questions can provide valuable data about which product features
to pursue.
As is the case with surveys, these collection methods allow you to ask subjects anything you want
about their opinions, motivations, and feelings regarding your product or brand. They also introduce the potential for bias. Aim to craft questions that don’t lead subjects in one particular direction.
One downside of interviewing and conducting focus groups is they can be time-consuming and
expensive. If you plan to conduct them yourself, it can be a lengthy process. To avoid this, you
can hire a market research facilitator to organize and conduct interviews on your behalf.
4. Observation
Observing people interacting with your website or product can be useful for data collection because
of the candor it offers. If your user experience is confusing or difficult, you can witness it in real-
time.
Yet, setting up observation sessions can be difficult. You can use a third-party tool to record users’
journeys through your site or observe a user’s interaction with a beta version of your site or
product.
While less accessible than other data collection methods, observations enable you to see firsthand
how users interact with your product or site. You can leverage the qualitative and quantitative data
gleaned from this to make improvements and double down on points of success.
5. Online Tracking
To gather behavioral data, you can implement pixels and cookies. These are both tools that track
users’ online behavior across websites and provide insight into what content they’re interested in
and typically engage with.
You can also track users’ behavior on your company’s website, including which parts are of the
highest interest, whether users are confused when using it, and how long they spend on product
pages. This can enable you to improve the website’s design and help users navigate to their
destination.
Inserting a pixel is often free and relatively easy to set up. Implementing cookies may come with
a fee but could be worth it for the quality of data you’ll receive. Once pixels and cookies are set,
they gather data on their own and don’t need much maintenance, if any.
It’s important to note: Tracking online behavior can have legal and ethical privacy implications.
Before tracking users’ online behavior, ensure you’re in compliance with local and industry data
privacy standards.
6. Forms
Online forms are beneficial for gathering qualitative data about users, specifically demographic
data or contact information. They’re relatively inexpensive and simple to set up, and you can use
them to gate content or registrations, such as webinars and email newsletters.
You can then use this data to contact people who may be interested in your product, build out
demographic profiles of existing customers, and in remarketing efforts, such as email workflows
and content recommendations.
DATA PREPARATION
Data preparation is a pre-processing step that involves cleansing, transforming, and consolidating
data. In other words, it is a process that involves connecting to one or many different data sources,
cleaning dirty data, reformatting or restructuring data, and finally merging this data to be consumed
for analysis.
The data preparation pipeline consists of the following steps:
1. Access the data.
2. Ingest (or fetch) the data.
3. Cleanse the data.
4. Format the data.
5. Combine the data.
6. And finally, analyze the data.
1. Access
There are many sources of business data within any organization. Examples include endpoint data,
customer data, marketing data, and all their associated repositories. This first essential data
preparation step involves identifying the necessary data and its repositories. This is not simply
identifying all possible data sources and repositories, but identifying all that are applicable to the
desired analysis. This means that there must first be a plan that includes the specific questions to
be answered by the data analysis.
2. Ingest
Once the data is identified, it needs to be brought into the analysis tools. The data will likely be
some combination of structured and semi-structured data in different types of repositories.
Importing it all into a common repository is necessary for the subsequent steps in the pipeline.
Access and ingest tend to be manual processes with significant variations in exactly what needs to
be done. Both data preparation steps require a combination of business and IT expertise and are
therefore best done by a small team. This step is also the first opportunity for data validation.
3. Cleanse
Cleansing the data ensures that the data set can provide valid answers when the data is analyzed.
This step could be done manually for small data sets but requires automation for most realistically
sized data sets. There are software tools available for this processing. If custom processing is
needed, many data engineers rely on applications coded in Python. There are many different
problems possible with the ingested data. There could be missing values, out-of-range values,
nulls, and whitespaces that obfuscate values, as well as outlier values that could skew analysis
results. Outliers are particularly challenging when they are the result of combining two or more
variables in the data set. Data engineers need to plan carefully for how they are going to cleanse
their data.
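Since the text mentions Python for custom cleansing, here is a hedged sketch of the kinds of fixes listed above: whitespace that obscures values, out-of-range entries, nulls, and outliers. The file and column names (ingested.csv, customer_name, order_amount) are assumptions for the example.

```python
import pandas as pd

df = pd.read_csv("ingested.csv")  # hypothetical combined extract

# Strip whitespace that obfuscates otherwise identical values
df["customer_name"] = df["customer_name"].str.strip()

# Drop rows with null amounts and remove out-of-range (negative) values
df = df.dropna(subset=["order_amount"])
df = df[df["order_amount"] >= 0]

# Flag outliers beyond 3 standard deviations for manual review
z = (df["order_amount"] - df["order_amount"].mean()) / df["order_amount"].std()
df["is_outlier"] = z.abs() > 3
```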
4. Format
Once the data set has been cleansed, it needs to be formatted. This step includes resolving issues
like multiple date formats in the data or inconsistent abbreviations. It is also possible that some
data variables are not needed for the analysis and should therefore be deleted from the analysis
data set. This is another data preparation step that will benefit from automation. Cleansing and
formatting steps should be saved into a repeatable recipe that data scientists or engineers can apply to
similar data sets in the future. For example, a monthly analysis of sales and support data would
likely have the same sources that need the same cleansing and formatting steps each month.
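A short sketch of the formatting fixes mentioned above: parsing the date column, harmonizing inconsistent abbreviations, and dropping variables not needed for the analysis. Wrapping the steps in a function makes the "repeatable recipe" idea concrete; the column names and mapping values are assumptions.

```python
import pandas as pd

def format_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Repeatable formatting recipe for a (hypothetical) monthly sales extract."""
    out = df.copy()
    # Parse the date column into a proper datetime (unparseable values become NaT)
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
    # Harmonize inconsistent abbreviations
    out["country"] = out["country"].replace({"USA": "US", "U.S.": "US", "UK": "GB"})
    # Drop variables not needed for this analysis
    return out.drop(columns=["internal_notes"], errors="ignore")

# Apply the same recipe to this month's extract (and reuse it next month)
clean = format_sales(pd.read_csv("sales_raw.csv"))
```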
5. Combine
When the data set has been cleansed and formatted, it may be transformed by merging, splitting,
or joining the input sets. Once the combining step is complete, the data is ready to be moved to
the data warehouse staging area. Once data is loaded into the staging area, there is a second
opportunity for validation.
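And a hedged sketch of the combining step: joining the cleansed sales extract with a customer reference table before loading the result into the staging area, with a simple row-count check as the second validation opportunity. Table and column names are assumptions.

```python
import pandas as pd

sales = pd.read_csv("sales_clean.csv")      # hypothetical cleansed extract
customers = pd.read_csv("customers.csv")    # hypothetical reference data

# Left join keeps every sale, enriching it with customer attributes
combined = sales.merge(customers, on="customer_id", how="left")

# Second validation opportunity: the join should not drop or duplicate rows
assert len(combined) == len(sales), "Unexpected row count after join"

combined.to_csv("sales_enriched.csv", index=False)  # hand off to the staging area
```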
6. Analyze
Once the analysis has begun, changes to the data set should only be made with careful
consideration. During analysis, algorithms are often adjusted and compared to other results.
Changes to the data can skew analysis results and make it impossible to determine whether the
different results are caused by changes to the data or the algorithms.
Beyond the pipeline steps themselves, several best practices apply:
5. Ensure that transforms are reproducible, deterministic, and idempotent. Each transform must produce the same results each time it is executed given the same input data set, without harmful side effects (see the sketch after this list).
6. Future-proof your data pipeline. Version not only the data and the code that performs the analysis, but also the transforms that have been applied to the data.
7. Ensure that there is adequate separation between the online system and the offline analysis so
that the ingest step does not impact user-facing services.
8. Monitor the data pipeline for consistency across data sets.
9. Employ data governance early, and be proactive. IT’s need for security and compliance means that governance capabilities like data masking, retention, lineage, and role-based permissions are all important aspects of the pipeline.
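A minimal sketch of point 5, assuming pandas: the transform is a pure function of its input, uses no randomness or global state, and applying it twice gives the same result as applying it once (idempotence). The column name is hypothetical.

```python
import pandas as pd

def normalize_country_codes(df: pd.DataFrame) -> pd.DataFrame:
    """Deterministic, idempotent transform: same input always yields same output."""
    out = df.copy()  # never mutate the input in place
    out["country"] = out["country"].str.strip().str.upper()
    return out

df = pd.DataFrame({"country": [" us ", "US", "gb"]})
once = normalize_country_codes(df)
twice = normalize_country_codes(once)
assert once.equals(twice)  # applying the transform again changes nothing
```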
Know your data, know your customers’ needs, and set up a reproducible process for constructing
your data preparation pipeline.
HYPOTHESIS GENERATION
Introduction
The first step towards problem-solving in data science projects isn’t about building machine learning models. Yes, you read that right! That distinction belongs to hypothesis generation – the step where we combine our problem-solving skills with our business intuition. It’s a truly crucial step.
Let’s be honest – all of us think of a hypothesis almost every day. Let us consider the example of a famous sport in India – cricket. It is that time of the year when IPL fever is high and we are all busy guessing outcomes. If you have been guessing which team would win based on various factors like the size of the stadium, batsmen in the team with six-hitting capabilities, or batsmen with high T20 averages, then kudos to you. You have been making an educated guess – in other words, generating hypotheses.
Similarly, the first step towards solving any business problem using machine learning is hypothesis generation. Understanding the problem statement with good domain knowledge is important, and so is writing down the factors you believe drive the outcome. So in this section, let’s dive into what hypothesis generation is and figure out why it is important for every data scientist.
Hypothesis generation is an educated “guess” about the various factors that are impacting the business problem that needs to be solved using machine learning. When framing a hypothesis, the data scientist must not presume its outcome: the hypothesis is generated before looking at any evidence.
“A hypothesis may be simply defined as a guess. A scientific hypothesis is an intelligent guess.”
– Isaac Asimov
Hypothesis generation is a crucial step in any data science project. If you skip it or skim through it, you risk overlooking important factors affecting the problem.
Hypothesis generation is a process beginning with an educated guess, whereas hypothesis testing is a process to conclude whether the educated guess is true or false, i.e., whether the assumed relationship between the variables holds. This latter part can be used for further research using statistical proof. A hypothesis is accepted or rejected based on the significance level and test score of the test used for testing the hypothesis.
Here are 5 key reasons why hypothesis generation is so important in data science:
It helps in comprehending the business problem as we dive deep into inferring the various factors affecting our target variable
It gives you a much better idea of the major factors that are responsible for solving the problem
It identifies the data that needs to be collected from various sources, which is key in converting your business problem into a data science problem
It improves your domain knowledge if you are new to the domain, as you spend time understanding the problem
It helps you approach the problem in a structured manner
The million-dollar question – when in the world should you perform hypothesis generation?
Hypothesis generation should be done before looking at the dataset or collecting the data. You will notice that if you have done your hypothesis generation adequately, you would have covered all the variables present in the dataset in your hypotheses. You might also have included variables that are not present in the dataset.
Case Study: Hypothesis Generation on “New York City Taxi Trip Duration Prediction”
Let us now look at the “NEW YORK CITY TAXI TRIP DURATION PREDICTION” problem
statement and generate a few hypotheses that would affect our taxi trip duration to understand
hypothesis generation.
Here’s the problem statement: to predict the duration of a trip so that the company can assign cabs that are free for the next trip. This will help in reducing the wait time for customers and will improve the utilization of the fleet.
Let us try to come up with a formula that has a relation with trip duration and would help us generate hypotheses:
TIME = DISTANCE / SPEED
1. Distance and Speed
Distance and speed play an important role in predicting the trip duration. We can notice that the trip duration is directly proportional to the distance traveled and inversely proportional to the speed of the taxi. Using this, we can come up with hypotheses based on distance and speed.
Distance: The more the distance traveled by the taxi, the longer the trip duration
Interior drop point: Drop points in congested or interior lanes could result in an increase in trip duration
Speed: The higher the speed, the lower the trip duration
2. Car Type
Cars are of various types, sizes, and brands, and these features of the car could be vital for the commute, not only on the basis of the safety of the passengers but also for the trip duration. Let us now define a few hypotheses based on the car.
Condition of the car: Good conditioned cars are unlikely to have breakdown issues and could
have a lower trip duration
Car Size: Small-sized cars (Hatchback) may have a lower trip duration and larger-sized cars
(XUV) may have higher trip duration based on the size of the car and congestion in the city
3. Trip Type
Trip types can differ based on trip vendors – it could be an outstation trip, a single ride, or a pool ride. Let us now define a hypothesis based on the type of trip used.
Pool Car: Trips with pooling can lead to higher trip duration as the car reaches multiple places
before reaching your assigned destination
4. Driver Details
A driver is an important person when it comes to commute time. Various factors about the driver can help in understanding the reason behind the trip duration, and here are a few hypotheses based on them.
Age of driver: Older drivers could be more careful and could contribute to higher trip duration
Gender: Female drivers are likely to drive slowly and could contribute to higher trip duration
Driver experience: Drivers with very little driving experience may cause higher trip duration
Medical condition: Drivers with a medical condition can contribute to higher trip duration
5. Passenger details
Passengers can influence the trip duration knowingly or unknowingly. We usually come across passengers requesting drivers to increase the speed as they are getting late, and there could be other such scenarios. Here are a few hypotheses based on passenger details.
Age of passengers: Senior citizens as passengers may contribute to higher trip duration as drivers
tend to go slow in trips involving senior citizens
Medical conditions or pregnancy: Passengers with medical conditions contribute to a longer trip
duration
Emergency: Passengers with an emergency could contribute to a shorter trip duration
Passenger count: Higher passenger count leads to shorter duration trips due to congestion in
seating
6. Date-Time Features
The day and time of the week are important as New York is a busy city and could be highly
congested during office hours or weekdays. Let us now generate a few hypotheses on the date and
time-based features.
Pickup Day:
Weekends could contribute to more outstation trips and could have a higher trip duration
Weekdays tend to have higher trip duration due to high traffic
If the pickup day falls on a holiday then the trip duration may be shorter
If the pickup day falls on a festive week then the trip duration could be lower due to lesser traffic
Time:
Early morning trips have a lesser trip duration due to lesser traffic
Evening trips have a higher trip duration due to peak hours
7. Road-based Features
Roads are of different types and the condition of the road or obstructions in the road are factors
that can’t be ignored. Let’s form some hypotheses based on these factors.
Condition of the road: The duration of the trip is more if the condition of the road is bad
Road type: Trips in concrete roads tend to have a lower trip duration
Strike on the road: Strikes carried out on roads in the direction of the trip cause the trip duration to increase
8. Weather Based Features
Weather can change at any time and could possibly impact the commute if the weather turns bad.
Weather at the start of the trip: Rainy weather condition contributes to a higher trip duration
After writing down our hypotheses and looking at the dataset, you will notice that you have covered most of the features present in the data set. There could also be a possibility that you have to work with fewer features, because some features on which you have generated hypotheses are not currently being captured or stored by the business and are not available. Always go ahead and capture data from external sources if you think that the data is relevant for your prediction – for example, weather information. It is also important to note that, since hypothesis generation is an educated guess, the hypotheses generated could turn out to be true or false once exploratory data analysis and hypothesis testing are performed on the data.
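Once the data is available, each hypothesis can be checked during exploratory analysis. A hedged sketch for the distance hypothesis, assuming a trips table with hypothetical columns named distance and trip_duration:

```python
import pandas as pd
from scipy import stats

trips = pd.read_csv("nyc_taxi_trips.csv")  # hypothetical extract of the dataset

# Hypothesis: trip duration increases with the distance traveled
r, p_value = stats.pearsonr(trips["distance"], trips["trip_duration"])
print(f"correlation = {r:.2f}, p-value = {p_value:.4f}")

# A small p-value together with a clearly positive r supports the hypothesis;
# otherwise the hypothesis is not supported by this dataset.
```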
MODELING
“The purpose of the institution of business is to create and deliver value in an efficient enough way that it will generate profit after cost.” If only it were as simple as it sounds! It’s bigger than checking boxes to produce reports, closing the books each period, and remaining in compliance.
To create real value, forward-looking finance organizations have moved beyond traditional finance
activities and are establishing robust business modeling & analytics programs that provide detailed visibility into historical performance and expected results.
Companies are increasingly dependent on automation and analytics to deliver clear, actionable, and forward-looking insights. With the explosion of available data and the need to quickly evaluate it, finance teams are centered on two areas – the data and the models.
As the size of the organization grows, so does the number of systems that support it. Companies can
have hundreds of machines that generate endless data points along with groups of data warehouses,
BI systems and other data sources holding information. Without a way to link data across the
enterprise, it’s impossible to deliver meaningful insights or accurate results.
One shortcoming of most business analytics programs is the inability to integrate forecasted or
simulated data as part of the modeling data set. Along with the expected data from ERP, shop floor
and other systems, advanced analytics programs recognize the need to retain and use forecasted or simulated data alongside historical data to predict performance and test assumptions for any data
combination.
The best models and simulations reflect real-world scenarios, not a pre-defined process or methodology, and deliver results in a timely manner. They must include measurable and meaningful KPIs that expose improvement opportunities and encourage behaviors that positively affect performance.
Most organizations have a few “system types” that provide, calculate, or share data as part of the
planning and analytics process. The Business Modeling & Analytics Technology Landscape
compares the different technologies and their attributes. The vertical axis measures the robustness
of the system – or its ability to handle very large data sets and the ability to execute large sets of
calculations quickly. Also, robust systems are scalable and can handle multiple users and
permission sets. On the horizontal axis, the flexibility of the system represents the ability for the
system to be configured to meet the company’s unique needs, both now and in the future.
Business Modeling and Analytics “System Types”
ERP: The ERP system is the system of record for organizations and serves as the transactional system. ERP systems are often limited by the design choices made in the initial implementation, creating a rigid environment with no inherent simulation capabilities.
User Defined Applications: Many companies try to meet their analytical needs with applications
developed internally. From legacy systems dating back to the 70s to AS400 and Access databases,
user-defined applications offer customized solutions, but they often provide only a reduced or partial set of controls. Additionally, simulation capabilities are limited to the programmed options,
and changes to the system require IT involvement in maintaining and upgrading systems.
Spreadsheets: Desktop modeling tools solve a variety of challenges when it comes to creating a flexible modeling environment. A spreadsheet, however, is not a system. That means limited controls and auditability, which will lead to integrity issues. Further, spreadsheets are unable to scale with the requirements of a robust modeling and analytics environment and have difficulty handling larger sets of data.
Integrated Business Modeling & Analytics Platform: ImpactECS is an integrated modeling &
analytics platform that leverages data from existing systems and delivers complete flexibility to design, calculate, manage, and report results that create value. With a centralized
approach, finance and planning organizations can link important data from across the company
and build models to calculate, predict, or simulate performance at any level of detail or business
dimension.
Each of these system types is critical and offers valuable benefits that keep the business running.
However, organizations on a mission to create real value through analytics need to augment their
IT footprint with technology that delivers the best of both worlds – a solid, enterprise-level system
that connects relevant systems and data, and a flexible modeling platform to build, run, and
maintain models that meet the company’s unique business requirements. Companies build models with ImpactECS for everything from detailed manufacturing product costs, cost-to-serve, and distribution costs to profitability analytics for any business dimension.
VALIDATION AND EVALUATION
MODEL VALIDATION
Model validation is defined within regulatory guidance as “the set of processes and activities
intended to verify that models are performing as expected, in line with their design objectives, and
business uses.” It also identifies “potential limitations and assumptions, and assesses their possible impact.” Generally, validation activities are performed by individuals independent of model development or use; models, therefore, should not be validated by their owners. Because models can be highly technical, some institutions may find it difficult to assemble a model risk team that has sufficient functional and technical expertise to carry out independent validation. When faced with this obstacle, institutions often outsource the validation task to third parties. In statistics, model
validation is the task of confirming that the outputs of a statistical model are acceptable with respect
to the real data-generating process. In other words, model validation is the task of confirming that
the outputs of a statistical model have enough fidelity to the outputs of the data-generating process
that the objectives of the investigation can be achieved.
1. Conceptual Design
The foundation of any model validation is its conceptual design, which needs a documented coverage assessment that supports the model’s ability to meet business and regulatory needs and the unique risks it is intended to address.
The design and capabilities of a model can have a profound effect on the overall effectiveness of a
bank’s ability to identify and respond to risks. For example, a poorly designed risk assessment
model may result in a bank establishing relationships with clients that present a risk that is greater
than its risk appetite, thus exposing the bank to regulatory scrutiny and reputation damage.
A validation should independently challenge the underlying conceptual design and ensure that
documentation is appropriate to support the model’s logic and its ability to achieve the regulatory and business outcomes for which it is designed.
2. System Validation
All technology and automated systems implemented to support models have limitations. An
effective validation includes: firstly, evaluating the processes used to integrate the model’s
conceptual design and functionality into the organisation’s business setting; and, secondly,
examining the processes implemented to execute the model’s overall design. Where gaps or
limitations are observed, controls should be evaluated to enable the model to function effectively.
3. Data Validation
Data validation confirms that the data used by the model is accurate and complete enough for it to identify and respond to risks. Best practice indicates that institutions should apply a risk-based data validation, which enables the reviewer to consider risks unique to the organisation and the model.
To establish a robust framework for data validation, guidance indicates that the accuracy of source
data be assessed. This is a vital step because data can be derived from a variety of sources, some of
which might lack controls on data integrity, so the data might be incomplete or inaccurate.
4. Process Validation
To verify that a model is operating effectively, it is important to prove that the established processes
for the model’s ongoing administration, including governance policies and procedures, support the
model’s sustainability. A review of the processes also determines whether the models are producing
output that is accurate, managed effectively, and subject to the appropriate controls.
If done effectively, model validation will enable your bank to have every confidence in its various models’ accuracy, as well as align them with the bank’s business and regulatory expectations.
By failing to validate models, banks increase the risk of regulatory criticism, fines, and penalties.
The complex and resource-intensive nature of validation makes it necessary to dedicate sufficient
resources to it. An independent validation team well versed in data management, technology, and
relevant financial products or services — for example, credit, capital management, insurance, or
financial crime compliance — is vital for success. Where shortfalls in the validation process are identified, they should be documented and addressed promptly.
MODEL EVALUATION
Model Evaluation is an integral part of the model development process. It helps to find the best
model that represents our data and how well the chosen model will work in the future. Evaluating
model performance with the data used for training is not acceptable in data science because it can
easily generate overoptimistic and overfitted models. There are two methods of evaluating models
in data science, Hold-Out and Cross-Validation. To avoid overfitting, both methods use a test set
(not seen by the model) to evaluate model performance.
Hold-Out: In this method, the (usually large) dataset is randomly divided into three subsets:
1. Training set is a subset of the dataset used to build predictive models.
2. Validation set is a subset of the dataset used to assess the performance of the model built in the training phase. It provides a test platform for fine-tuning the model’s parameters and selecting the best-performing model. Not all modelling algorithms need a validation set.
3. Test set, or unseen examples, is a subset of the dataset used to assess the likely future performance of a model. If a model fits the training set much better than it fits the test set, overfitting is probably the cause.
Cross-Validation: When only a limited amount of data is available, we use k-fold cross-validation to achieve an unbiased estimate of model performance. In k-fold cross-validation, we divide the data into k subsets of equal size. We build models k times, each time leaving out one of the subsets from training and using it as the test set. If k equals the sample size, this is called “leave-one-out”.
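Again purely as a sketch, k-fold cross-validation can be run with scikit-learn's cross_val_score; the choice of k = 5, the logistic regression model, and the synthetic data are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold cross-validation: each of the 5 equal subsets is held out once as the
# test fold while the model is trained on the remaining 4 subsets.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())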
INTERPRETATION
1) What Is Data Interpretation?
2) How To Interpret Data?
3) Why Is Data Interpretation Important?
4) Data Analysis & Interpretation Problems
5) Data Interpretation Techniques & Methods
6) The Use of Dashboards For Data Interpretation
Data analysis and interpretation have now taken center stage with the advent of the digital age…
and the sheer amount of data can be frightening. In fact, a Digital Universe study found that the
total data supply in 2012 was 2.8 trillion gigabytes! Based on that amount of data alone, it is clear
the calling card of any successful enterprise in today’s global world will be the ability to analyze
complex data, produce actionable insights and adapt to new market needs… all at the speed of
thought.
Business dashboards are the digital age tools for big data. Capable of displaying key performance
indicators (KPIs) for both quantitative and qualitative data analyses, they are ideal for making the
fast-paced and data-driven market decisions that push today’s industry leaders to sustainable
success. Through the art of streamlined visual communication, data dashboards permit businesses
to engage in real-time and informed decision-making and are key instruments in data
interpretation. First of all, let’s find a definition to understand what data interpretation actually means.
Nominal Scale: non-numeric categories that cannot be ranked or compared quantitatively.
Variables are exclusive and exhaustive.
Ordinal Scale: categories that are exclusive and exhaustive but with a logical order.
Quality ratings and agreement ratings are examples of ordinal scales (i.e., good, very good, fair,
etc., OR agree, strongly agree, disagree, etc.).
Interval: a measurement scale where data is grouped into categories with orderly and equal distances between the categories. The zero point is arbitrary (as with temperature measured in degrees Celsius).
Ratio: contains features of all three scales and, unlike the interval scale, has a true zero point, so ratios between values are meaningful. A small illustration of these scale types in code follows below.
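To make the four scale types concrete, here is a small, purely illustrative sketch in Python using pandas; the variables and values are hypothetical.

import pandas as pd

# Nominal: categories with no inherent order (e.g., sales region).
region = pd.Series(["north", "south", "east"], dtype="category")

# Ordinal: categories with a logical order (e.g., quality ratings).
rating = pd.Categorical(["good", "fair", "very good"],
                        categories=["fair", "good", "very good"], ordered=True)

# Interval: equal distances between values but an arbitrary zero (e.g., temperature in Celsius).
temperature_c = pd.Series([18.5, 21.0, 23.5])

# Ratio: equal distances and a true zero, so ratios are meaningful (e.g., revenue).
revenue = pd.Series([0.0, 1200.0, 2400.0])

print(rating.min(), rating.max())  # ordering is only defined for the ordinal variable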
Observations: detailing behavioral patterns that occur within an observation group. These
patterns could be the amount of time spent in an activity, the type of activity, and the method of
communication employed.
Focus groups: bring a group of people together and ask them relevant questions to generate a collaborative discussion about a research topic.
Secondary Research: much like how patterns of behavior can be observed, different types of
documentation resources can be coded and divided based on the type of material they contain.
Interviews: one of the best collection methods for narrative data. Inquiry responses can be
grouped by theme, topic, or category. The interview approach allows for highly-focused data
segmentation.
A key difference between qualitative and quantitative analysis is clearly noticeable in the
interpretation stage. Qualitative data, as it is widely open to interpretation, must be “coded” so as
to facilitate the grouping and labeling of data into identifiable themes. As person-to-person data
collection techniques can often result in disputes pertaining to proper analysis, qualitative data
analysis is often summarized through three basic principles: notice things, collect things, think
about things.
Quantitative Data Interpretation
If quantitative data interpretation could be summed up in one word (and it really can’t) that word
would be “numerical.” There are few certainties when it comes to data analysis, but you can be
sure that if the research you are engaging in has no numbers involved, it is not quantitative
research. Quantitative analysis refers to a set of processes by which numerical data is analyzed.
More often than not, it involves the use of statistical measures such as the standard deviation, mean, and median. Let’s quickly review the most common statistical terms:
Mean: a mean represents a numerical average for a set of responses. When dealing with a data set
(or multiple data sets), a mean will represent a central value of a specific set of numbers. It is the
sum of the values divided by the number of values within the data set. Other terms that can be used
to describe the concept are arithmetic mean, average and mathematical expectation.
Standard deviation: this is another statistical term commonly appearing in quantitative analysis.
Standard deviation reveals the distribution of the responses around the mean. It describes the
degree of consistency within the responses; together with the mean, it provides insight into data
sets.
Frequency distribution: this is a measurement gauging the rate of a response appearance within
a data set. When using a survey, for example, frequency distribution has the capability of
determining the number of times a specific ordinal scale response appears (i.e., agree, strongly
agree, disagree, etc.). Frequency distribution is extremely useful for determining the degree of consensus among data points. A short example of these three measures appears below.
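A short, purely illustrative Python example of these three measures, using made-up survey responses on a 1–5 agreement scale:

from statistics import mean, stdev
from collections import Counter

# Hypothetical survey responses: 1 (strongly disagree) to 5 (strongly agree).
responses = [4, 5, 3, 4, 4, 2, 5, 4, 3, 4]

print("mean:", mean(responses))                        # central value of the responses
print("standard deviation:", stdev(responses))         # spread of responses around the mean
print("frequency distribution:", Counter(responses))   # how often each response appears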
Typically, quantitative data is measured by visually presenting correlation tests between two or
more variables of significance. Different processes can be used together or separately, and
comparisons can be made to ultimately arrive at a conclusion. Other signature interpretation
processes of quantitative data include:
Regression analysis: Essentially, regression analysis uses historical data to understand the
relationship between a dependent variable and one or more independent variables. Knowing which
variables are related and how they developed in the past allows you to anticipate possible outcomes
and make better decisions going forward. For example, if you want to predict your sales for next
month you can use regression analysis to understand what factors will affect them such as products
on sale, the launch of a new campaign, among many others.
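As a sketch of the idea only (the sales figures, the two explanatory factors, and the use of scikit-learn's LinearRegression are all illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly history: [ad_spend_in_thousands, products_on_sale] -> sales.
X = np.array([[10, 2], [15, 3], [20, 3], [25, 5], [30, 6]])
y = np.array([120, 150, 170, 210, 240])

model = LinearRegression().fit(X, y)

# Anticipate next month's sales for a planned spend of 28k with 4 products on sale.
print("forecast:", model.predict(np.array([[28, 4]]))[0])
print("coefficients:", model.coef_)  # how each factor related to sales historically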
Cohort analysis: This method identifies groups of users who share common characteristics during
a particular time period. In a business scenario, cohort analysis is commonly used to understand
different customer behaviors. For example, a cohort could be all users who have signed up for a
free trial on a given day. An analysis would be carried out to see how these users behave, what
actions they carry out, and how their behavior differs from other user groups.
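A minimal pandas sketch of the same idea, with a hypothetical event log (the column names and dates are invented for illustration):

import pandas as pd

# Hypothetical event log: one row per user action.
events = pd.DataFrame({
    "user_id":     [1, 1, 2, 2, 3, 3, 3],
    "signup_date": pd.to_datetime(["2023-01-05", "2023-01-05", "2023-01-08",
                                   "2023-01-08", "2023-02-10", "2023-02-10", "2023-02-10"]),
    "event_date":  pd.to_datetime(["2023-01-05", "2023-02-03", "2023-01-06",
                                   "2023-01-20", "2023-02-10", "2023-03-01", "2023-04-02"]),
})

# Cohort = signup month; period = whole months elapsed since signup.
events["cohort"] = events["signup_date"].dt.to_period("M")
events["period"] = ((events["event_date"].dt.year - events["signup_date"].dt.year) * 12
                    + (events["event_date"].dt.month - events["signup_date"].dt.month))

# Distinct active users per cohort and period: a simple retention table.
retention = events.groupby(["cohort", "period"])["user_id"].nunique().unstack(fill_value=0)
print(retention)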
Predictive analysis: As its name suggests, the predictive analysis method aims to predict future
developments by analyzing historical and current data. Powered by technologies such as artificial
intelligence and machine learning, predictive analytics practices enable businesses to spot trends
or potential issues and plan informed strategies in advance.
Prescriptive analysis: Also powered by predictions, the prescriptive analysis method uses
techniques such as graph analysis, complex event processing, neural networks, among others, to
try to unravel the effect that future decisions will have in order to adjust them before they are
actually made. This helps businesses to develop responsive, practical business strategies.
Conjoint analysis: Typically applied to survey analysis, the conjoint approach is used to analyze
how individuals value different attributes of a product or service. This helps researchers and
businesses to define pricing, product features, packaging, and many other attributes. A common
use is menu-based conjoint analysis in which individuals are given a “menu” of options from which
they can build their ideal concept or product. This way, analysts can understand which attributes respondents would pick above others and draw conclusions.
Cluster analysis: Last but not least, cluster analysis is a method used to group objects into
categories. Since there is no target variable when using cluster analysis, it is a useful method to
find hidden trends and patterns in the data. In a business context clustering is used for audience
segmentation to create targeted experiences, and in market research, it is often used to identify age
groups, geographical information, earnings, among others.
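For illustration, here is a minimal audience-segmentation sketch using scikit-learn's KMeans; the customer attributes, the number of clusters, and the values are assumptions for the example.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customer attributes: [age, annual_income_in_thousands].
customers = np.array([[23, 28], [26, 32], [45, 80], [48, 85], [33, 50], [35, 52]])

# Scale the features so neither dominates, then group the customers into 3 segments.
scaled = StandardScaler().fit_transform(customers)
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
print(segments)  # cluster label assigned to each customer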
Now that we have seen how to interpret data, let's move on and ask ourselves some questions:
what are some data interpretation benefits? Why do all industries engage in data research and
analysis? These are basic questions, but they often don’t receive adequate attention.
4) Common Data Analysis And Interpretation Problems
The oft-repeated mantra of those who fear data advancements in the digital age is “big data equals
big trouble.” While that statement is not accurate, it is safe to say that certain data interpretation
problems or “pitfalls” exist and can occur when analyzing data, especially at the speed of thought.
Let’s identify some of the most common data misinterpretation risks and shed some light on how
they can be avoided:
a) Correlation mistaken for causation: our first misinterpretation of data refers to the tendency of data analysts to confuse the cause of a phenomenon with correlation. It is the assumption that
because two actions occurred together, one caused the other. This is not accurate as actions can
occur together absent a cause and effect relationship.
Digital age example: assuming that increased revenue is the result of increased social media
followers… there might be a definitive correlation between the two, especially with today’s multi-
channel purchasing experiences. But, that does not mean an increase in followers is the direct cause
of increased revenue. There could be a common cause or an indirect causal link.
Remedy: attempt to eliminate the variable you believe to be causing the phenomenon.
b) Confirmation bias: our second data interpretation problem occurs when you have a theory or
hypothesis in mind but are intent on only discovering data patterns that provide support to it while
rejecting those that do not.
Digital age example: your boss asks you to analyze the success of a recent multi-platform social
media marketing campaign. While analyzing the potential data variables from the campaign (one
that you ran and believe performed well), you see that the share rate for Facebook posts was great,
while the share rate for Twitter Tweets was not. Using only the Facebook posts to prove your
hypothesis that the campaign was successful would be a perfect manifestation of confirmation
bias.
Remedy: as this pitfall is often based on subjective desires, one remedy would be to analyze data
with a team of objective individuals. If this is not possible, another solution is to resist the urge to
make a conclusion before data exploration has been completed. Remember to always try to
disprove a hypothesis, not prove it.
c) Irrelevant data: the third data misinterpretation pitfall is especially important in the digital age.
As large data is no longer centrally stored, and as it continues to be analyzed at the speed of
thought, it is inevitable that analysts will focus on data that is irrelevant to the problem they are
trying to correct.
Digital age example: in attempting to gauge the success of an email lead generation campaign, you
notice that the number of homepage views directly resulting from the campaign increased, but the
number of monthly newsletter subscribers did not. Based on the number of homepage views, you
decide the campaign was a success when really it generated zero leads.
Remedy: proactively and clearly frame any data analysis variables and KPIs prior to engaging in
a data review. If the metric you are using to measure the success of a lead generation campaign is
newsletter subscribers, there is no need to review the number of homepage visits. Be sure to focus
on the data variable that answers your question or solves your problem and not on irrelevant data.
Data analysis and interpretation are critical to developing sound conclusions and making better-
informed decisions. As we have seen with this article, there is an art and science to the
interpretation of data. To help you with this purpose here we will list a few relevant data
interpretation techniques, methods, and tricks you can implement for a successful data
management process.
As mentioned at the beginning of this post, the first step to interpreting data successfully is to identify the type of analysis you will perform and apply the methods accordingly. Clearly differentiate between qualitative analysis (observe, document, and interview; notice, collect, and think about things) and quantitative analysis (research led by a large amount of numerical data to be analyzed through various statistical methods).
The first data interpretation technique is to define a clear baseline for your work. This can be done
by answering some critical questions that will serve as a useful guideline to start. Some of them
include: what are the goals and objectives of my analysis? What type of data interpretation
method will I use? Who will use this data in the future? And most importantly, what general
question am I trying to answer?
Once all this information has been defined, you will be ready to collect your data. As mentioned
at the beginning of the post, your methods for data collection will vary depending on what type of
analysis you use (qualitative or quantitative). With all the needed information in hand, you are
ready to start the interpretation process, but first, you need to visualize your data.
Data visualizations such as business graphs, charts, and tables are fundamental to successfully
interpreting data. This is because the visualization of data via interactive charts and graphs makes
the information more understandable and accessible. As you might be aware, there are different
types of visualizations you can use, but not all of them are suitable for every analysis purpose. Using
the wrong graph can lead to misinterpretation of your data so it’s very important to carefully pick
the right visual for it. Let’s look at some use cases of common data visualizations.
Bar chart: One of the most used chart types, the bar chart uses rectangular bars to show the relationship between two or more variables. There are different types of bar charts for different interpretations; these include the horizontal bar chart, column bar chart, and stacked bar chart.
Line chart: Most commonly used to show trends, accelerations or decelerations, and volatility, the line chart aims to show how data changes over a period of time, for example sales over a year. A few tips to keep this chart ready for interpretation: do not use too many variables, which can overcrowd the graph, and keep your axis scale close to the highest data point so the information does not become hard to read.
Pie chart: Although it doesn’t do a lot in terms of analysis due to its simple nature, pie charts
are widely used to show the proportional composition of a variable. Visually speaking, showing a
percentage in a bar chart is way more complicated than showing it in a pie chart. However, this
also depends on the number of variables you are comparing. If your pie chart would need to be
divided into 10 portions then it is better to use a bar chart instead.
Tables: While they are not a specific type of chart, tables are widely used when interpreting data. Tables are especially useful when you want to portray data in its raw format. They give you the freedom to easily look up or compare individual values while also displaying grand totals. A minimal plotting sketch for the chart types described above follows this list.
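Purely as an illustration of the chart types above, here is a minimal matplotlib sketch; the sales figures and channel shares are invented for the example.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 150, 170]
channel_share = [55, 30, 15]  # hypothetical split across online, retail, partner

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))

ax1.bar(months, sales)                # bar chart: compare values across categories
ax1.set_title("Sales by month (bar)")

ax2.plot(months, sales, marker="o")   # line chart: show the trend over time
ax2.set_title("Sales trend (line)")

ax3.pie(channel_share, labels=["Online", "Retail", "Partner"], autopct="%1.0f%%")
ax3.set_title("Channel share (pie)")  # pie chart: proportional composition

plt.tight_layout()
plt.show()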
With the use of data visualizations becoming more and more critical for businesses’ analytical
success, many tools have emerged to help users visualize their data in a cohesive and interactive
way. One of the most popular ones is the use of BI dashboards. These visual tools provide a
centralized view of various graphs and charts that paint a bigger picture of a topic. We will discuss the power of dashboards for efficient data interpretation in more detail in the next portion of this post. If you want to learn more about different types of data visualizations, take a
look at our complete guide on the topic.
As mentioned above, keeping your interpretation objective is a fundamental part of the process.
Being the person closest to the investigation, it is easy to become subjective when looking for
answers in the data. One good way to stay objective is to show the information to other people related to the study, for example research partners or even the people who will use your findings
once they are done. This can help avoid confirmation bias and any reliability issues with your
interpretation.
Findings are the observations you extracted out of your data. They are the facts that will help you
drive deeper conclusions about your research. For example, findings can be trends and patterns
that you found during your interpretation process. To put your findings into perspective you can
compare them with other resources that used similar methods and use them as benchmarks.
Reflect on your own thinking and reasoning and be aware of the many pitfalls that data analysis and interpretation carry: correlation versus causation, subjective bias, false information, inaccurate data, and so on. Once you are comfortable with your interpretation of the data, you will be ready to develop conclusions, see if your initial questions were answered, and suggest recommendations based on them.
As we have seen, quantitative and qualitative methods are distinct types of data analyses. Both
offer a varying degree of return on investment (ROI) regarding data investigation, testing, and
decision-making. Because of their differences, it is important to understand how dashboards can
be implemented to bridge the quantitative and qualitative information gap. How are digital data
dashboard solutions playing a key role in merging the data disconnect? Here are a few of the ways:
a) Connecting and blending data. With today’s pace of innovation, it is no longer feasible (nor
desirable) to have bulk data centrally located. As businesses continue to globalize and borders
continue to dissolve, it will become increasingly important for businesses to possess the capability
to run diverse data analyses absent the limitations of location. Data dashboards decentralize data
without compromising on the necessary speed of thought while blending both quantitative and
qualitative data. Whether you want to measure customer trends or organizational performance, you
now have the capability to do both without the need for a singular selection.
b) Mobile Data. Related to the notion of “connected and blended data” is that of mobile data. In
today’s digital world, employees are spending less time at their desks and simultaneously
increasing production. This is made possible by the fact that mobile solutions for analytical tools
are no longer standalone. Today, mobile analysis applications seamlessly integrate with everyday
business tools. In turn, both quantitative and qualitative data are now available on-demand where
they’re needed, when they’re needed, and how they’re needed via interactive online dashboards.
c) Visualization. Data dashboards are merging the data gap between qualitative and quantitative
methods of interpretation of data, through the science of visualization. Dashboard solutions come
“out of the box” well-equipped to create easy-to-understand data demonstrations. Modern online
data visualization tools provide a variety of color and filter patterns, encourage user interaction,
and are engineered to help enhance future trend predictability. All of these visual characteristics
make for an easy transition among data methods – you only need to find the right types of data
visualization to tell your data story the best way possible.
To give you an idea of how a market research dashboard fulfills the need of bridging quantitative
and qualitative analysis and helps in understanding how to interpret data in research thanks to
visualization, have a look at the following one. It brings together both qualitative and quantitative
data knowledgeably analyzed and visualizes it in a meaningful way that everyone can understand,
thus empowering any viewer to interpret it:
Conclusion
DEPLOYMENT AND ITERATION
The deployment phase is the final phase of any IT or software development project, including those
that are exploring Advanced Analytics capabilities.
The main objective is to ensure that the final solution is ready to be used within the operational
environment and that end users have all the required tools to act upon the analytical insights
discovered during the development phases of the project.
However, organisations, especially those that are deploying analytical initiatives for the first time,
or that are still analytically “immature”, typically focus too much on building an IT infrastructure,
instead of planning how to deliver actionable insights to their end users and integrate this
intelligence into internal organisational processes.
The lack of vision regarding usability and access to analytical information to drive decision making is a common mistake in Advanced Analytics projects.
At Presidion, we’ve seen organisations perform best when they take the before, during and after
of the deployment of analytics initiatives into consideration. This will help to maximize return on
investment and achieve a high impact on business and operations.
Before proceeding to deployment, Data scientists and Development teams should exhaustively
evaluate analytical outputs to assess quality and accuracy, but most importantly to validate that
business objectives are properly addressed and success criteria, that were set during project
initiation, are fully met.
The goal of the transition period is to make sure that end users have accepted the functionalities
and the outputs of the solution, and that they are now ready to integrate the system outputs in a
way that will improve future outcomes and decision making.
During the pre-deployment phase, end users should be placed at the centre of
attention. Training and formal Knowledge Transfer sessions can significantly help them learn
how to operate the new solution, validate the usability of the new system and ensure a smooth
transition period.
Things to consider DURING deployment
At this stage, technical teams can focus on deploying the technology and integrating the
solution into an operational environment to automate decision making process.
The Technical Infrastructure and the Production environment, which will host the solution,
and any required Integration Interfaces should by now be already developed, tested and ready in
order to initiate deployment and successfully incorporate analytical results into an organisation’s
daily operations.
Typically, deployment of Advanced Analytics insights includes all operations to generate reports
and recommendations for end users, visualisation of key findings, self-service and data discovery
functionalities for business users, and finally, depending on the size and scope of the analytical
application, implementation of a scoring process or workflows that integrate analytical outputs (in
real time or not) with custom, operational and core systems.
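As a rough sketch of what a simple batch scoring step might look like (the file names, the saved-model format, and the assumption that the input columns match the training features are all hypothetical, not prescribed here):

import joblib
import pandas as pd

# Hypothetical artefacts: a previously trained classifier and a fresh data extract.
model = joblib.load("churn_model.joblib")        # assumed path to the saved model
new_records = pd.read_csv("new_customers.csv")   # assumed input; columns must match the training features

# Score the batch and write the results where downstream systems can pick them up.
new_records["score"] = model.predict_proba(new_records)[:, 1]
new_records.to_csv("scored_customers.csv", index=False)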
During deployment, many iterations, enhancements and fine-tuning activities might be necessary
to finalise the deployment of the system. Other activities necessary during deployment
include Administration, Security and Authorisation, as well as finalising Documentation and Transferring Ownership to business and operations.
The goal of a post-deployment monitoring phase is to create the strategy and the foundations to
continually monitor the solution’s performance, review outputs, collect feedback from business
users and address issues detected on an ongoing basis in a timely manner, without creating
operational disruptions.
Finally, do not forget to learn from your previous mistakes and incorporate your end users’ feedback into this monitoring process, so that issues are addressed either in future enhancements of the application or during regular updates, improving the overall accuracy of analytical outputs.