A Human's Guide to Machine Learning Projects

WHITEPAPER

Martin Schmitz, PhD
Many people start advanced analytics programs at their company and search for a
methodology that will allow them to tackle their use cases as quickly and effectively as possible.
In this whitepaper, I discuss the approach that I’ve developed over the last decade to
help understand, outline, and implement artificial intelligence solutions for business
problems. This whitepaper is designed to be used as a guide for the first few hours of internal
discussions around a machine learning project.
rapidminer.com 3
Step 0 — Responding to Common Objections
When embarking on a new machine learning project, objections often crop up from various
parties. Many of these objections are unique to a given organization or work environment, but
there are two common objections that arise often enough that I feel it’s worth addressing them
here so that I can give you some tips on how to respond before we dive deep on the particulars
of setting up a new machine learning project.
This is a great example of why it's so critical to clearly outline the business problem, as well as to define what success looks like (which we'll talk about in detail in Step IV), at the beginning of a project.

As many as 70% of programs fail to achieve their goals, due largely to employee resistance.

If you're having these kinds of conversations about machine learning, there's a good chance that people have already started to see that there's room for improvement. You want to capitalize on that by showing the kinds of impacts you believe the project can have and getting buy-in that hitting those success criteria will mean that the model can be implemented and have impact. If you don't clearly articulate these things to the relevant stakeholders at the beginning, you run the risk of the project getting bogged down in red tape, approvals, and back-and-forths that can stall—and potentially even kill—your project.
There are two steps you can take to help mitigate objections from those doing the work. The first
relates to the above. If you have clear buy-in from management about the goals that you’re working
towards, and what success looks like, it should be easier to convince departments such as IT to
support the project. After all, management is pushing to try and hit these goals.
The second step is to find a champion in the department that you’re working with to act as a
liaison and spokesperson. They’ll probably be the most stretched by learning new things and
being pushed out of their comfort zone, but if you can find someone who will willingly take on this
task, they can provide an excellent point of contact between the different people working on the
project, as well as serve as an advocate for the project among their peers.
Before we get started, we need to talk about CRISP-DM, or the cross-industry standard
process for data mining. It’s been around since 1996 and according to Forbes, it’s the most
widely used and relied upon analytics process in the world. If you’re getting started on a
data science project, it’s absolutely essential that you understand the basics and how they relate
to the work that you're doing. The CRISP-DM process consists of six phases:
CRISP-DM

1. BUSINESS UNDERSTANDING
In the first stage of CRISP-DM, you work through what the project looks like, and what the business expectations for the project are.

2. DATA UNDERSTANDING
Next, the data that are available for analysis are examined in light of the business objectives that were decided on in the first phase.

3. DATA PREPARATION
Once you have an understanding of the data that's available, the next step is to clean, sort, and process said data to make it useable for your purposes. This phase often takes the greatest amount of time and effort.

4. MODELING
Then, you iterate through various versions of a model, using the prepared data from phase 3.

5. EVALUATION
Once you're happy with the model that you've built, you need to evaluate whether or not it effectively addresses the business criteria laid out in the first phase.

6. DEPLOYMENT
Finally, you need to deploy the model that you've developed to ensure that it can have a positive impact on your business. It might seem like a no-brainer to deploy a model once it's created, but fully half of completed models never make it into production, contributing significantly to the Model Impact Epidemic.
Although there’s a lot more involved in each of these phases of CRISP-DM, the summary above
should give us enough to start talking about how RapidMiner approaches these issues. If you’re
interested in reading more, you can check out the CRISP-DM article at Wikipedia.
With that background, let’s take a look at how I approach the issues outlined in this process as
we dive deep on the most critical components of implementing a successful machine learning
project. As mentioned above, the purpose of this guide is to focus on the early stages of such a
project, so the discussion here mostly elaborates and elucidates on phases one through three
of CRISP-DM, although we will touch briefly on the other stages as well.
The business analyst often has the opposite problem. She understands the business and the
problems she faces there but doesn’t understand the methods of machine learning. How does
she even know if a problem is solvable by machine learning? And even if it is solvable, how can
she possibly assess the difficulty of developing and implementing a machine learning
solution?
These understanding gaps can create a bit of a chicken-and-egg problem for the team, which
is why it’s so important to have an agreed-upon process in place to navigate the early stages of
the project.
There’s a two-fold effect that comes from this way of thinking about
machine learning in business. On the one hand, if you wait for a problem
or opportunity to present itself, you can miss a lot of use cases
where machine learning could help. Because there was never a trigger
to create a team, there’s no one tasked with working on this challenge.
On the other hand, a lot of use cases are not solvable by such a team, even
if one is created, because it turns out that the problem they’re looking at
isn’t a machine learning problem.
Although this second scenario can kill the project, that isn’t always a bad
thing. Data scientists often think that they have the philosopher’s stone to
solve business problems, but they don’t. It’s okay if a particular use case is
a better fit for traditional business intelligence methods than for machine
learning. You don’t need to use a high-precision laser scalpel to
open a box. Often “traditional” methods like business intelligence or
Six Sigma should be tried before exploring machine learning.
Your team’s first session should focus on mapping the problem and use case as well
as understanding it in enough detail to provide clarity on what the correct solution
looks like. To that end, you need to come up with answers to the following questions:
> What happens if we solve the problem badly? Who suffers? How can we know if this has
happened?
“When you see a good move, search for a better one.”
— EMANUEL LASKER, CHESS WORLD CHAMPION

The other people will often already have an idea of how to solve the problem at hand, and in this part of the process, we don't want to do data science; we want to do business. I usually try to remind myself of chess world champion Emanuel Lasker's famous saying: “When you see a good move, search for a better one.”
If you are too quick to jump to a solution at this stage, you are in danger of solving the problem
sub-optimally by forcing a machine learning solution where it might not be appropriate. You
could also potentially miss the business problem completely and create something that doesn’t
directly address the challenge you’re focusing on.
Step III – Defining the Label
Once the business problem is clearly defined and understood by all parties, and your team has
decided that machine learning is the right way to tackle the problem, it’s then the data scientist’s task to
map the problem to a data science method. Ideally, you want to transform the problem into a supervised learning problem. My personal credo is: If you can go supervised, go supervised.
Supervised simply means that you know ahead of time the labels that you’re trying to predict—for
example, in a categorization problem, it would mean that you already know the categories you
want your algorithm to sort the data into. With an unsupervised problem, rather than defining
the categories ahead of time, you let the algorithm decide what categories are present in the data.
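To make the distinction concrete, here's a minimal sketch using scikit-learn (the data, labels, and category names are invented for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Toy data: customers described by (monthly_spend, support_tickets).
X = [[10, 0], [12, 1], [11, 0], [80, 5], [85, 6], [90, 4]]

# Supervised: we define the categories up front ("stays"/"churns")
# and provide them as labels for the algorithm to learn from.
y = ["stays", "stays", "stays", "churns", "churns", "churns"]
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[82, 5]]))  # predicts one of OUR predefined labels

# Unsupervised: no labels; the algorithm decides what groups exist.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster ids chosen by the algorithm, not by us
```

In the supervised case we can directly check predictions against known labels; in the unsupervised case there is nothing to check against, which is exactly the optimization problem discussed next.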
Even if it takes a lot of effort to make the problem into a supervised one, I still recommend doing it. Unsupervised use cases are much harder to optimize, as they do not provide a quantitative measure to evaluate and tune your model on.
It’s imperative to define a performance measure that fits your business needs. Ideally, this
performance measure is something that has a direct business impact. We’ll discuss this
in-depth in Step IV.
The power of math
While mapping the problem to a machine learning solution, the data scientist should be careful with
his language. I’m not a fan of using any math to explain machine learning. If you’re used to speaking
and writing in equations, math is a helpful language. If not, it’s confusing. I recommend having the
following quote from Stephen Hawking in mind when using math:
“Someone told me that each equation I included in my book would halve the sales. I did put in one equation, Einstein's famous equation, E=mc². I hope that this will not scare off half of my potential readers.”
Data scientists need to be aware that they have to be able to sell their methods to
stakeholders. And if you want to sell your methods, you won’t help yourself by confusing
people. Similarly, this is not about “fancy” methods, but about the concepts. You can explain all
of the concepts of machine learning by using a decision tree algorithm. There is absolutely no
need to start with something more complicated like neural networks. Always remember the
KISS principle: Keep It Simple, Stupid.
At this point in the conversation, every participant should be aware of two things: the use case
being considered, and the concept of supervised learning. So let’s move on to the real problem:
What are you predicting?
Selling your methods

According to research conducted here at RapidMiner for our Model Impact Epidemic infographic, only 1 in 10 models that are developed by businesses are chosen to be put into production, and one of the biggest reasons is a failure to sell your methodology to key stakeholders. Having a plan in place in the early stages makes sure you're communicating the value of your project to get buy-in.

As resources are committed to deploy a model to production and timelines are put in place, it's critical that you're able to clearly and effectively articulate the impact that you're having. Will it be via an interactive dashboard?
The real problem: Defining our label
Here again, you'll face an issue of different cultures. Data scientists coming from an academic environment are used to having a clearly defined label—that is, what is being predicted. After all,
most university assignments and coding competitions will tell you what to predict. In business, it
isn’t always that simple. You need to remind yourself that defining the label is the equivalent
of formulating the question you want to ask the data. Once you’re clear about that, you’ll
need to be sure that you satisfy three vital requirements with the label you choose.
Requirement 2: The label needs to create value for the business

Before you move ahead with planning, you need to make sure that your label is directly connected to the business needs.
Note that the requirement that the label exists doesn’t exclude use cases where human
judgement is the only way to measure the label. A classic example of this is sentiment analysis,
where humans rate whether a comment is positive or negative as training data. This might
cause difficulties in acquiring training data in the first place, but it's not strictly impossible.
Requirement 3: The label needs to be actionable
The best machine learning algorithm doesn’t help if the insights you derive from it aren’t
actionable. You need to be able to answer the question: If I could predict this, what
business action would I take?
A good example of this problem is churn prevention modelling. Typically, you would directly predict
whether or not a customer is still a customer in x months. But even if you can do this, it doesn’t
mean that you’re able to prevent churn. The business value of the churn model is not generated
from predicting churn but from preventing it. What interaction do you trigger if you predict that
a customer is about to churn? Can you prevent it x months ahead? If so, how? You need a clear idea
about what actions you can take to address the problems that a model identifies.
Sometimes people believe that an incorrect label could present a problem for interpreting
and implementing results. But it’s important to remember that no measurement is without
error. Obviously different labels have differing qualities. Labels based on human judgement,
like the sentiment of a text, are more subjective than the measurement of a voltmeter. But the
voltmeter still has uncertainty in its measurements.
It’s important to be aware that the uncertainty inherent in the label you’re trying to
predict sets an upper bound on the best accuracy of your model. You can’t build a model
which predicts the voltage better than your voltmeter, but that’s the only limitation—don’t be
afraid to predict labels that aren’t of the best imaginable quality, simply be careful when you do.
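A quick simulation illustrates this upper bound (the 0.5-volt noise level is an invented example):

```python
import numpy as np

rng = np.random.default_rng(42)
true_voltage = rng.uniform(0, 10, size=100_000)       # ground truth
noise = rng.normal(0, 0.5, size=true_voltage.size)    # voltmeter error, sd = 0.5
measured = true_voltage + noise                       # the label we can record

# Even a PERFECT model that predicts the true voltage exactly
# still shows an RMSE of roughly 0.5 against the noisy label
# it is scored on: the label noise is the floor.
perfect_rmse = np.sqrt(np.mean((true_voltage - measured) ** 2))
print(round(perfect_rmse, 2))  # ≈ 0.5
```

No amount of modeling effort can push the error below the measurement noise, which is exactly why label quality sets the ceiling on model quality.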
On data science quality measures
If you ask a data scientist about quality measures, they will usually opt for mathematical
measures like RMSE for regression, or AUC or AUPRC for classification. There are good statistical
reasons for this from the data scientist’s perspective, but, as discussed above, you need to
remember that you are ultimately solving a business problem and not a stats assignment.
So how do you measure the quality of what you’re building in a business-oriented way?
You need to ensure that the quality measures that you choose are both appropriate for the
business problem that you're trying to solve, as well as interpretable by others on the team.
Regression tasks
Let’s first have a look at regression problems. Assume the true value of your label is 5. If you
predict 3, you’ll take a different business action than if you predict 7. Thus, underestimating
and overestimating are of potentially different severities in terms of the business
problem at hand.
By way of example, consider predictive packaging. The idea here is that you forecast the
number of purchases of a given item. This would allow you to package items ahead of time.
Overestimating causes too many items to be pre-packaged, which then will just lay around. On
the other hand, underestimating the demand will result in a delay in shipments. As you can
see, although the two predictions (3 and 7) are mathematically the same distance from the true
label, the business impacts, and thus costs, associated with the two predictions are different.
However, typical statistical performance measures for regression tasks—such as RMSE or R²—
assume that errors are symmetrical, and should thus be avoided if possible, given their lack of
alignment with business concerns.
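A simple way to encode this asymmetry is a custom cost function. The sketch below uses invented per-unit costs for the predictive-packaging example:

```python
import numpy as np

def packaging_cost(y_true, y_pred, cost_over=1.0, cost_under=4.0):
    """Business cost for the predictive-packaging example: each
    over-packaged unit ties up $1 of stock, while each under-predicted
    unit delays a shipment, costed here at $4 (illustrative numbers)."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    over = np.clip(err, 0, None)     # predicted too many
    under = np.clip(-err, 0, None)   # predicted too few
    return float(np.sum(cost_over * over + cost_under * under))

# RMSE sees 3 and 7 as equally wrong predictions of a true value of 5;
# the business cost does not.
print(packaging_cost([5], [7]))  # over by 2  -> cost 2.0
print(packaging_cost([5], [3]))  # under by 2 -> cost 8.0
```

Optimizing against a measure like this, rather than a symmetric one, steers the model toward the errors the business can actually afford.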
Classification tasks
In the case of classification tasks, the problem is similar to what we saw with regression above—
namely, false positives and false negatives will have a different impact on your business.
Medical quick tests are a perfect example. A test which falsely predicts that you have a given
disease (false positive) will trigger more extensive and expensive tests. These tests will then
correctly determine that you do not have the disease in question. The other error type is when
the test results indicate that you don’t have the disease, even though you actually do (false
negative). Here, the incorrect result will prevent proper and timely treatment, potentially causing
serious harm and even death.
Common data science measures of classification accuracy like F1 score and AUC assume that
false positives and false negatives are of equal severity. These measures are thus misleading in
a business environment and should be avoided.
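One alternative is to score models with an explicit cost matrix rather than accuracy or F1. The costs below are invented for the quick-test example:

```python
def business_cost(y_true, y_pred, fp_cost=500.0, fn_cost=100_000.0):
    """Total cost of a batch of predictions. Illustrative numbers for the
    medical quick-test example: a false positive triggers $500 of
    follow-up tests; a false negative delays treatment, costed at $100k."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp * fp_cost + fn * fn_cost

y_true  = [1, 0, 0, 1, 0]   # 1 = patient actually has the disease
model_a = [1, 1, 1, 1, 0]   # over-alerts: 2 false positives, 0 false negatives
model_b = [0, 0, 0, 1, 0]   # misses a case: 0 false positives, 1 false negative
print(business_cost(y_true, model_a))  # 1000.0
print(business_cost(y_true, model_b))  # 100000.0
```

Note that model B is *more accurate* (4/5 vs. 3/5) yet far more costly, which is precisely the kind of distinction accuracy-style measures hide.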
The solution: Business-aligned performance
Because of these issues, you need to identify a performance measure that’s more closely aligned with
the business problem you identified in Step I. A proven way to do this is a value-based performance
measure. Consider a predictive maintenance scenario where the cost of replacing a part before it actually fails differs greatly from the cost of failing to replace it in time and having it fail (similar to the medical quick-test issue discussed above). In this case, you want your models to take into account the costs of these different scenarios.
Say that your department can only handle 100 requests per day. Naturally, you want to optimize the number of correct predictions in that batch of 100 so that you can have the most impact. By optimizing your model for these 100 most impactful cases, you'll help ensure that the performance metric you're using is directly tied to business concerns.

RapidMiner has been pioneering the value-based approach to building models by providing ways to take costs and benefits into account during model building. This helps identify the best model for the use case, based not only on the statistical and mathematical accuracy of the model, but also on the impact that the model's predictions will have on your bottom line.

For example, consider a model to predict churn. If you're planning to offer discounts to those customers who are predicted to churn, you not only want to know what effects those discounts might have on churn rates, but also how they'll affect your revenue if customers take you up on the offer. It might be that identifying churn and then offering steep discounts to get retention is less cost effective than simply letting some of the customers churn, but you won't know this unless you're working in a value-based approach.

A reprise on unsupervised problems

Now that you understand some of the issues with performance measures as they relate to business problems, you're in a better position to understand my admonition above that you should opt for supervised problems whenever possible. It is much harder to find performance measures for unsupervised problems like clustering or topic modeling that also align with business concerns. For example, if you want to use clustering for customer segmentation, you will struggle to assign a business-oriented performance measure to the analysis. Usual measures like the Davies–Bouldin index have the same flaw as RMSE or AUC—they do not correlate with business value.
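As a rough sketch of a value-based measure (all monetary figures are invented, and this is not RapidMiner's actual implementation), you can score a churn model by the net value of the batch of customers you can actually contact:

```python
def campaign_value(y_true, scores, capacity=100,
                   benefit_saved=200.0, discount_cost=20.0):
    """Net value of offering a discount to the `capacity` highest-scored
    customers: a reached churner retains `benefit_saved` of revenue minus
    the discount; a contacted non-churner just costs the discount.
    All dollar figures are illustrative."""
    ranked = sorted(zip(scores, y_true), reverse=True)[:capacity]
    return sum(benefit_saved - discount_cost if churned else -discount_cost
               for _, churned in ranked)

y_true = [1, 1, 0, 0, 1, 0]              # 1 = customer actually churned
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]  # model's churn scores
print(campaign_value(y_true, scores, capacity=3))  # 340.0
```

Comparing models on this number, instead of on AUC, directly answers the question the business is asking: how much is this batch of interventions worth?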
In a recent blog post, I demonstrated how to do topic modelling of Amazon reviews. If you search for six topics, you will find a topic that's about hot beverages like coffee and tea. If you increase the number of topics to twelve, the beverages topic will be split into two different topics: one about coffee and the other about tea. There is no way that the algorithm can assess whether it's better for the business to separate these two or not. That being said, this shouldn't prevent you from doing such an analysis with a human in the loop if it's likely to have a bigger impact. But it should be clear from the outset that the difficulty of measuring performance in an unsupervised problem is much greater when compared to an equivalent supervised problem.

When is the model good enough?
I’m a big fan of moving to deployment as soon as the model generates decent value. I’ve
often seen models that could save hundreds of thousands of dollars per year that are not being
deployed because the data science team was confident that, given more time, they could get even
better results. This is another factor driving the Model Impact Epidemic.
But how do you define “decent value”? This highlights why it’s so important to define your
success criteria at the beginning of a project. If you have a clear threshold to make a
decision, you can use this step of your analysis to identify the first performance milestone that
defines your minimum viable product. If you know what your threshold is, and you hit it, you can
pause and deploy, or deploy the first version in parallel while you continue to refine your model.
On baseline models
Now that you have a performance measure to assess the quality of your model, you want to
have some kind of baseline for comparison. What should you use for your baseline? It can vary
depending on your particular use case, but there are basically two kinds of baseline models:
currently deployed solutions and naïve solutions.
Where possible, we want to compare the model that we are building against whatever
currently deployed solution is in place to address the business problem that we are looking
at. For example, are you using linear regression and Excel to predict maintenance? If so, that's
great—because you can then ask what the performance of that current system is. This is critical
because it allows you to put your results into perspective and to justify the money and time spent
developing your new analysis.
However, what do you do if the problem you’re trying to tackle doesn’t have a current solution?
After all, many of the triggers to create a team are new problems that traditional solutions aren’t
able to solve. It’s also possible that some current solution exists, but isn’t quantifiable — for
example, decisions are made on the fly by humans and are not recorded.
In this case, you’ll want to compare yourself to naïve or simple models. The naïve or default
model could be the majority class in a classification problem, the average in a regression
problem, or the demand yesterday in a demand forecasting scenario. Basically, what would
you do to quickly get a baseline of the current state of affairs if you didn’t have access to the
model that you’re building? As above, this will let you compare your solution to a baseline to
demonstrate the value of what you’re building.
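These naive baselines are trivial to compute; a sketch in plain Python:

```python
import statistics

def majority_baseline(train_labels):
    """Classification: always predict the most common class."""
    return statistics.mode(train_labels)

def mean_baseline(train_values):
    """Regression: always predict the training average."""
    return statistics.fmean(train_values)

def naive_forecast(series):
    """Demand forecasting: tomorrow's demand = today's demand."""
    return series[-1]

print(majority_baseline(["stays", "stays", "churns"]))  # "stays"
print(mean_baseline([10.0, 20.0, 30.0]))                # 20.0
print(naive_forecast([95, 102, 98, 110]))               # 110
```

If your model can't clearly beat these, the project isn't adding value yet, and that's worth knowing before deployment resources are committed.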
On validation
Validation is perhaps the most important part of any data science project. The data scientist should be extremely careful here, following best practices like keeping training and test sets strictly separate and using cross-validation. I won't address these issues in depth, as the details tend to be quite in the weeds for those who aren't data scientists.
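For the curious, here's what those best practices look like in a minimal scikit-learn sketch (synthetic data stands in for a real prepared dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in practice this is your prepared profile table.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold cross-validation: each fold is held out once as a test set,
# so every performance estimate comes from data the model never saw
# during training.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(len(scores), scores.mean())  # 5 held-out accuracy estimates
```

The key idea is simply that performance is always measured on data the model was not trained on; everything else is refinement of that principle.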
If only it were so easy! Unfortunately, this is far from accurate. Before data scientists can start
the process of building models to solve problems, they need high-quality data that’s been
prepared for the task at hand. There are good reasons why there are whole schools of thought devoted just to data preparation.
So how do you align your team and set realistic expectations? I call this step profile generation. You start by putting the problem in context: machine learning is, at its core, pattern recognition. That makes it sound a bit old school, but it highlights an important point. You are detecting a pattern in your data in order to predict your label.
In order to do this, you first need to create a one-line-per-customer (or machine, asset, etc.)
etc.) representation. This one-line representation is what I call a profile. The art of data science
is to build a profile which is both complete and dense. Complete means that the data has all of
the information possible that may help the algorithm make its predictions. Dense means that you’ve
reached completeness with a low number of individual attributes. Large numbers of attributes can
not only lead to longer runtimes, but they can also prevent the algorithm from identifying patterns in
the data. The difficulty of getting the balance between completeness and denseness right makes the
data preparation step some of the heaviest lifting in data science.
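For example, with pandas you might collapse a raw event log into such a profile (the column names and data are invented):

```python
import pandas as pd

# Raw event log: one row per transaction, many rows per customer.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount":      [10.0, 25.0, 5.0, 300.0, 200.0],
    "returned":    [0, 1, 0, 0, 0],
})

# Profile: one row per customer, a few dense, informative attributes.
profile = events.groupby("customer_id").agg(
    n_purchases=("amount", "size"),
    total_spend=("amount", "sum"),
    return_rate=("returned", "mean"),
).reset_index()
print(profile)
```

The art lies in choosing which aggregates to keep: enough to be complete, few enough to stay dense.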
As mentioned above, whole schools of thought exist around how best to prep data. In what
follows, I intend to provide you with a high-level overview and some important details, but there
is certainly more depth to these issues than we have space to go into here.
Where does your data come from? This is a very important question, because you need
to make sure that you can access the data you need for your project. Some systems and
infrastructures lock users out, preventing data access. This needs to be checked early on in the
planning of your project, and processes put in place to grant you access to the data you need to
build your models.
I’ve personally run into situations where I learned after finishing a model that a given attribute
would not be available at runtime. It might be because the measurement takes 24 hours to
create, which means you won’t be able to use it on the fly. This issue can also arise as a result
of IT infrastructure, where you might be working with a system that only updates every 6 hours.
In this stage of your project, you want to make sure that you’re aware of what kind of data is
available in both cases.
Data types
The complexity of the modelling process, as well as the predictive power of a model, depends
a lot on the type of data that you're using. In my experience, the raw data often arrives as a log of timestamped events. In sensor analytics, for example, each row might record a timestamp, a sensor ID, and a measured value; with customer analytics, each row might record a timestamp, a customer ID, and a purchase.
These data sets are time series by nature. However, this data often exists in data warehouses
where you can get access to the aggregations of this data, but not the full data itself. On the one
hand, it’s great to have aggregated data to use for quickly building an initial prototype, because
it takes less work than collecting live data, and the aggregations do often contain meaningful
information. On the other hand, all types of aggregations remove information compared to
what is present in the raw data.
Raw data is never aggregated in a way that leaves the result as well suited to machine learning as the raw data itself. Thus, to get the best
results possible in your project, it’s important to get access to the underlying data and use that
for model training and evaluation.
There are often obvious ways to clean up the dirty data once you talk to the subject matter
experts, and data scientists should be willing to engage with those in the know to gather
information and get ideas about data cleaning. The process of cleaning and prepping the data is
obviously not doable in a short period of time, but you want to have a clear idea early on of how
much time you’ll need for this process so that you can plan accordingly.
Wrapping Up
If you follow this guide, you’ll be able to assess the value and feasibility of a machine learning
project correctly, right from the get-go, including getting early buy-in on critical aspects of the
process. This will result in fewer projects being killed in the early stages that could have been
successful, while also helping to prevent work on projects that won’t yield value, regardless of
the reason—whether it's because they're too challenging, because the necessary data doesn't exist, or because there are too many political hurdles to clear to get the model into production.
Based on the research that we did for our Model Impact Epidemic infographic, 4.95 billion models developed don't even end up having the potential to have a real business
impact. This demonstrates the clear need for more planning in order to help prioritize use
cases and identify areas for immediate impact. In fact, when you’re getting started with a
project, I usually recommend spending time up-front to map out not just one use case but as
many as possible, using the process described in this document, rather than going with the
problem > trigger > team model. The result of such a process is an Impact-Feasibility Map
which can be used to prioritize use cases and lobby internally.
There’s a German saying to keep in mind when assessing the viability of machine learning for
your potential use case: Better a painful end than pain without end. Don’t be afraid to not
use machine learning if there’s a better option available!
About the Author
Martin Schmitz, PhD is RapidMiner‘s Head of Data Science Services. Martin studied physics
at TU Dortmund University and joined RapidMiner in 2014. During his career as a researcher,
Martin was part of the IceCube Neutrino Observatory located at the geographic South Pole. Using RapidMiner on IceCube data, he studied the most violent phenomena in the universe, like supermassive black holes and gamma-ray bursts. As part of several interdisciplinary research centers, Martin
dived into computer science, mathematics, and statistics, and taught data science and the use of RapidMiner.
RapidMiner is reinventing enterprise AI so that anyone has the power to positively shape the future. We’re doing
this by enabling ‘data loving’ people of all skill levels, across the enterprise, to rapidly create and operate AI solutions
to drive immediate business impact. We offer an end-to-end platform that unifies data prep, machine learning, and
model operations with a user experience that provides depth for data scientists and simplifies complex tasks for
everyone else.
The RapidMiner Center of Excellence methodology and the RapidMiner Academy ensure customers are successful,
no matter their experience or resource levels. More than 40,000 organizations in over 150 countries rely on
RapidMiner to increase revenue, cut costs, and reduce risk.