A Human's Guide to Machine Learning Projects

WHITEPAPER

Martin Schmitz, PhD
Many people start advanced analytics programs at their company and search for a
methodology that will allow them to tackle their use cases as quickly and effectively as possible.
In this whitepaper, I discuss the approach that I’ve developed over the last decade to
help understand, outline, and implement artificial intelligence solutions for business
problems. This whitepaper is designed to be used as a guide for the first few hours of internal
discussions around a machine learning project.
rapidminer.com 3
Step 0 — Responding to Common Objections
When embarking on a new machine learning project, objections often crop up from various
parties. Many of these objections are unique to a given organization or work environment, but
there are two common objections that arise often enough that I feel it’s worth addressing them
here so that I can give you some tips on how to respond before we dive deep on the particulars
of setting up a new machine learning project.
This is a great example of why it's so critical to clearly outline the business problem, as well as to define what success looks like (which we'll talk about in detail in Step IV), at the beginning of a project.

As many as 70% of programs fail to achieve their goals, due largely to employee resistance.

If you're having these kinds of conversations about machine learning, there's a good chance that people have already started to see that there's room for improvement. You want to capitalize on that by showing the kinds of impacts you believe the project can have and getting buy-in that hitting those success criteria will mean that the model can be implemented and have impact. If you don't clearly articulate these things to the relevant stakeholders at the beginning, you run the risk of the project getting bogged down in red tape, approvals, and back-and-forths that can stall—and potentially even kill—your project.
There are two steps you can take to help mitigate objections from those doing the work. The first
relates to the above. If you have clear buy-in from management about the goals that you’re working
towards, and what success looks like, it should be easier to convince departments such as IT to
support the project. After all, management is pushing to try and hit these goals.
The second step is to find a champion in the department that you’re working with to act as a
liaison and spokesperson. They’ll probably be the most stretched by learning new things and
being pushed out of their comfort zone, but if you can find someone who will willingly take on this
task, they can provide an excellent point of contact between the different people working on the
project, as well as serve as an advocate for the project among their peers.
Before we get started, we need to talk about CRISP-DM, or the cross-industry standard
process for data mining. It’s been around since 1996 and according to Forbes, it’s the most
widely used and relied upon analytics process in the world. If you’re getting started on a
data science project, it’s absolutely essential that you understand the basics and how they relate
to the work that you're doing. The CRISP-DM process consists of six phases:
CRISP-DM

1. BUSINESS UNDERSTANDING
In the first stage of CRISP-DM, you work through what the project looks like, and what the business expectations for the project are.

2. DATA UNDERSTANDING
Next, the data that are available for analysis are examined in light of the business objectives that were decided on in the first phase.

3. DATA PREPARATION
Once you have an understanding of the data that's available, the next step is to clean, sort, and process said data to make it useable for your purposes. This phase often takes the greatest amount of time and effort.

4. MODELING
Then, you iterate through various versions of a model, using the prepared data from phase 3.

5. EVALUATION
Once you're happy with the model that you've built, you need to evaluate whether or not it effectively addresses the business criteria laid out in the first phase.

6. DEPLOYMENT
Finally, you need to deploy the model that you've developed to ensure that it can have a positive impact on your business. It might seem like a no-brainer to deploy a model once it's created, but fully half of completed models never make it into production, contributing significantly to the Model Impact Epidemic.
Although there’s a lot more involved in each of these phases of CRISP-DM, the summary above
should give us enough to start talking about how RapidMiner approaches these issues. If you’re
interested in reading more, you can check out the CRISP-DM article at Wikipedia.
With that background, let’s take a look at how I approach the issues outlined in this process as
we dive deep on the most critical components of implementing a successful machine learning
project. As mentioned above, the purpose of this guide is to focus on the early stages of such a
project, so the discussion here mostly elaborates and elucidates on phases one through three
of CRISP-DM, although we will touch briefly on the other stages as well.
The business analyst often has the opposite problem. She understands the business and the
problems she faces there but doesn’t understand the methods of machine learning. How does
she even know if a problem is solvable by machine learning? And even if it is solvable, how can
she possibly assess the difficulty of developing and implementing a machine learning
solution?
These understanding gaps can create a bit of a chicken-and-egg problem for the team, which
is why it’s so important to have an agreed-upon process in place to navigate the early stages of
the project.
There’s a two-fold effect that comes from this way of thinking about
machine learning in business. On the one hand, if you wait for a problem
or opportunity to present itself, you can miss a lot of use cases
where machine learning could help. Because there was never a trigger
to create a team, there’s no one tasked with working on this challenge.
On the other hand, a lot of use cases are not solvable by such a team, even
if one is created, because it turns out that the problem they’re looking at
isn’t a machine learning problem.
Although this second scenario can kill the project, that isn’t always a bad
thing. Data scientists often think that they have the philosopher’s stone to
solve business problems, but they don’t. It’s okay if a particular use case is
a better fit for traditional business intelligence methods than for machine
learning. You don’t need to use a high-precision laser scalpel to
open a box. Often “traditional” methods like business intelligence or
Six Sigma should be tried before exploring machine learning.
Your team’s first session should focus on mapping the problem and use case as well
as understanding it in enough detail to provide clarity on what the correct solution
looks like. To that end, you need to come up with answers to the following questions:
> What happens if we solve the problem badly? Who suffers? How can we know if this has
happened?
“When you see a good move, search for a better one.”
— EMANUEL LASKER, CHESS WORLD CHAMPION

The other people will often already have an idea of how to solve the problem at hand, and in this part of the process, we don't want to do data science; we want to do business. I usually try to remind myself of chess world champion Emanuel Lasker's famous saying: “When you see a good move, search for a better one.”
If you are too quick to jump to a solution at this stage, you are in danger of solving the problem
sub-optimally by forcing a machine learning solution where it might not be appropriate. You
could also potentially miss the business problem completely and create something that doesn’t
directly address the challenge you’re focusing on.
Step III – Defining the Label
Once the business problem is clearly defined and understood by all parties, and your team has
decided that machine learning is the right way to tackle the problem, it’s then the data scientist’s task to
map the problem to a data science method. Ideally, you want to transform the problem into a supervised learning problem. My personal credo is: If you can go supervised, go supervised.
Supervised simply means that you know ahead of time the labels that you’re trying to predict—for
example, in a categorization problem, it would mean that you already know the categories you
want your algorithm to sort the data into. With an unsupervised problem, rather than defining
the categories ahead of time, you let the algorithm decide what categories are present in the data.
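To make the distinction concrete, here's a minimal sketch using scikit-learn (the data, labels, and category names are invented for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Toy data: customers described by (monthly_spend, support_tickets).
X = [[10, 0], [12, 1], [11, 0], [80, 5], [85, 6], [90, 4]]

# Supervised: we define the categories up front ("stays"/"churns")
# and provide them as labels for the algorithm to learn from.
y = ["stays", "stays", "stays", "churns", "churns", "churns"]
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[82, 5]]))  # predicts one of OUR predefined labels

# Unsupervised: no labels; the algorithm decides what groups exist.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # cluster ids chosen by the algorithm, not by us
```

In the supervised case we can directly check predictions against known labels; in the unsupervised case there is nothing to check against, which is exactly the optimization problem discussed next.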
Even if it takes a lot of effort to make the problem into a supervised one, I still recommend doing it. Unsupervised use cases are much harder to optimize, as they do not provide a quantitative measure to evaluate and tune your model on.
It’s imperative to define a performance measure that fits your business needs. Ideally, this
performance measure is something that has a direct business impact. We’ll discuss this
in-depth in Step IV.
The power of math
While mapping the problem to a machine learning solution, the data scientist should be careful with
his language. I’m not a fan of using any math to explain machine learning. If you’re used to speaking
and writing in equations, math is a helpful language. If not, it’s confusing. I recommend having the
following quote from Stephen Hawking in mind when using math:
“Someone told me that each equation I included in my book would halve the sales. I did put in one equation, Einstein's famous equation, E=mc². I hope that this will not scare off half of my potential readers.”
Data scientists need to be aware that they have to be able to sell their methods to
stakeholders. And if you want to sell your methods, you won’t help yourself by confusing
people. Similarly, this is not about “fancy” methods, but about the concepts. You can explain all
of the concepts of machine learning by using a decision tree algorithm. There is absolutely no
need to start with something more complicated like neural networks. Always remember the
KISS principle: Keep It Simple, Stupid.
At this point in the conversation, every participant should be aware of two things: the use case
being considered, and the concept of supervised learning. So let’s move on to the real problem:
What are you predicting?
Selling your methods

According to research conducted here at RapidMiner for our Model Impact Epidemic infographic, only 1 in 10 models that are developed by businesses are chosen to be put into production, and one of the biggest reasons is a failure to sell your methodology to key stakeholders. Having a plan in place in the early stages makes sure you're communicating the value of your project to get buy-in.

As resources are committed to deploy a model to production and timelines are put in place, it's critical that you're able to clearly and effectively articulate the impact that you're having. Will it be via an interactive dashboard?
The real problem: Defining our label
Here again, you'll face an issue of different cultures. Data scientists coming from an academic environment are used to having a clearly defined label—that is, what is being predicted. After all,
most university assignments and coding competitions will tell you what to predict. In business, it
isn’t always that simple. You need to remind yourself that defining the label is the equivalent
of formulating the question you want to ask the data. Once you’re clear about that, you’ll
need to be sure that you satisfy three vital requirements with the label you choose.
Requirement 2: The label needs to create value for the business

Before you move ahead with planning, you need to make sure that your label is directly connected to the business needs.
Note that the requirement that the label exists doesn’t exclude use cases where human
judgement is the only way to measure the label. A classic example of this is sentiment analysis,
where humans rate whether a comment is positive or negative as training data. This might
cause difficulties in acquiring training data in the first place, but it's not strictly impossible.
Requirement 3: The label needs to be actionable
The best machine learning algorithm doesn’t help if the insights you derive from it aren’t
actionable. You need to be able to answer the question: If I could predict this, what
business action would I take?
A good example of this problem is churn prevention modelling. Typically, you would directly predict
whether or not a customer is still a customer in x months. But even if you can do this, it doesn’t
mean that you’re able to prevent churn. The business value of the churn model is not generated
from predicting churn but from preventing it. What interaction do you trigger if you predict that
a customer is about to churn? Can you prevent it x months ahead? If so, how? You need a clear idea
about what actions you can take to address the problems that a model identifies.
Sometimes people believe that an incorrect label could present a problem for interpreting
and implementing results. But it’s important to remember that no measurement is without
error. Obviously different labels have differing qualities. Labels based on human judgement,
like the sentiment of a text, are more subjective than the measurement of a voltmeter. But the
voltmeter still has uncertainty in its measurements.
It’s important to be aware that the uncertainty inherent in the label you’re trying to
predict sets an upper bound on the best accuracy of your model. You can’t build a model
which predicts the voltage better than your voltmeter, but that’s the only limitation—don’t be
afraid to predict labels that aren’t of the best imaginable quality, simply be careful when you do.
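A quick simulation illustrates this upper bound (the 0.5-volt noise level is an invented example):

```python
import numpy as np

rng = np.random.default_rng(42)
true_voltage = rng.uniform(0, 10, size=100_000)       # ground truth
noise = rng.normal(0, 0.5, size=true_voltage.size)    # voltmeter error, sd = 0.5
measured = true_voltage + noise                       # the label we can record

# Even a PERFECT model that predicts the true voltage exactly
# still shows an RMSE of roughly 0.5 against the noisy label
# it is scored on: the label noise is the floor.
perfect_rmse = np.sqrt(np.mean((true_voltage - measured) ** 2))
print(round(perfect_rmse, 2))  # ≈ 0.5
```

No amount of modeling effort can push the error below the measurement noise, which is exactly why label quality sets the ceiling on model quality.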
On data science quality measures
If you ask a data scientist about quality measures, they will usually opt for mathematical
measures like RMSE for regression, or AUC or AUPRC for classification. There are good statistical
reasons for this from the data scientist’s perspective, but, as discussed above, you need to
remember that you are ultimately solving a business problem and not a stats assignment.
So how do you measure the quality of what you’re building in a business-oriented way?
You need to ensure that the quality measures that you choose are both appropriate for the
business problem that you're trying to solve, as well as interpretable by others on the team.
Regression tasks
Let’s first have a look at regression problems. Assume the true value of your label is 5. If you
predict 3, you’ll take a different business action than if you predict 7. Thus, underestimating
and overestimating are of potentially different severities in terms of the business
problem at hand.
By way of example, consider predictive packaging. The idea here is that you forecast the
number of purchases of a given item. This would allow you to package items ahead of time.
Overestimating causes too many items to be pre-packaged, which then will just lay around. On
the other hand, underestimating the demand will result in a delay in shipments. As you can
see, although the two predictions (3 and 7) are mathematically the same distance from the true
label, the business impacts, and thus costs, associated with the two predictions are different.
However, typical statistical performance measures for regression tasks—such as RMSE or R²—
assume that errors are symmetrical, and should thus be avoided if possible, given their lack of
alignment with business concerns.
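A simple way to encode this asymmetry is a custom cost function. The sketch below uses invented per-unit costs for the predictive-packaging example:

```python
import numpy as np

def packaging_cost(y_true, y_pred, cost_over=1.0, cost_under=4.0):
    """Business cost for the predictive-packaging example: each
    over-packaged unit ties up $1 of stock, while each under-predicted
    unit delays a shipment, costed here at $4 (illustrative numbers)."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    over = np.clip(err, 0, None)     # predicted too many
    under = np.clip(-err, 0, None)   # predicted too few
    return float(np.sum(cost_over * over + cost_under * under))

# RMSE sees 3 and 7 as equally wrong predictions of a true value of 5;
# the business cost does not.
print(packaging_cost([5], [7]))  # over by 2  -> cost 2.0
print(packaging_cost([5], [3]))  # under by 2 -> cost 8.0
```

Optimizing against a measure like this, rather than a symmetric one, steers the model toward the errors the business can actually afford.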
Classification tasks
In the case of classification tasks, the problem is similar to what we saw with regression above—
namely, false positives and false negatives will have a different impact on your business.
Medical quick tests are a perfect example. A test which falsely predicts that you have a given
disease (false positive) will trigger more extensive and expensive tests. These tests will then
correctly determine that you do not have the disease in question. The other error type is when
the test results indicate that you don’t have the disease, even though you actually do (false
negative). Here, the incorrect result will prevent proper and timely treatment, potentially causing
serious harm and even death.
Common data science measures of classification accuracy like F1 score and AUC assume that
false positives and false negatives are of equal severity. These measures are thus misleading in
a business environment and should be avoided.
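One alternative is to score models with an explicit cost matrix rather than accuracy or F1. The costs below are invented for the quick-test example:

```python
def business_cost(y_true, y_pred, fp_cost=500.0, fn_cost=100_000.0):
    """Total cost of a batch of predictions. Illustrative numbers for the
    medical quick-test example: a false positive triggers $500 of
    follow-up tests; a false negative delays treatment, costed at $100k."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp * fp_cost + fn * fn_cost

y_true  = [1, 0, 0, 1, 0]   # 1 = patient actually has the disease
model_a = [1, 1, 1, 1, 0]   # over-alerts: 2 false positives, 0 false negatives
model_b = [0, 0, 0, 1, 0]   # misses a case: 0 false positives, 1 false negative
print(business_cost(y_true, model_a))  # 1000.0
print(business_cost(y_true, model_b))  # 100000.0
```

Note that model B is *more accurate* (4/5 vs. 3/5) yet far more costly, which is precisely the kind of distinction accuracy-style measures hide.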
The solution: Business-aligned performance
Because of these issues, you need to identify a performance measure that’s more closely aligned with
the business problem you identified in Step I. A proven way to do this is a value-based performance
measure. Consider a predictive maintenance scenario where the cost of replacing a part before it actually fails differs greatly from the cost of failing to replace it in time and having it fail (similar to the medical quick-test issue discussed above). In this case, you want your models to take into account the costs of these different scenarios.
Say that your department can only handle 100 requests per day. Naturally, you want to optimize the number of correct predictions in that batch of 100 so that you can have the most impact. By optimizing your model for these 100 most impactful cases, you'll help ensure that the performance metric you're using is directly tied to business concerns.

RapidMiner has been pioneering the value-based approach to building models by providing ways to take costs and benefits into account during model building. This helps identify the best model for the use case, based not only on the statistical and mathematical accuracy of the model, but also on the impact that the model's predictions will have on your bottom line.

For example, consider a model to predict churn. If you're planning to offer discounts to those customers who are predicted to churn, you not only want to know what effects those discounts might have on churn rates, but also how they'll affect your revenue if customers take you up on the offer. It might be that identifying churn and then offering steep discounts to get retention is less cost effective than simply letting some of the customers churn, but you won't know this unless you're working in a value-based approach.

A reprise on unsupervised problems

Now that you understand some of the issues with performance measures as they relate to business problems, you're in a better position to understand my admonition above that you should opt for supervised problems whenever possible. It is much harder to find performance measures for unsupervised problems like clustering or topic modeling that also align with business concerns. For example, if you want to use clustering for customer segmentation, you will struggle to assign a business-oriented performance measure to the analysis. Usual measures like the Davies–Bouldin index have the same flaw as RMSE or AUC—they do not correlate with business value.
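As a rough sketch of a value-based measure (all monetary figures are invented, and this is not RapidMiner's actual implementation), you can score a churn model by the net value of the batch of customers you can actually contact:

```python
def campaign_value(y_true, scores, capacity=100,
                   benefit_saved=200.0, discount_cost=20.0):
    """Net value of offering a discount to the `capacity` highest-scored
    customers: a reached churner retains `benefit_saved` of revenue minus
    the discount; a contacted non-churner just costs the discount.
    All dollar figures are illustrative."""
    ranked = sorted(zip(scores, y_true), reverse=True)[:capacity]
    return sum(benefit_saved - discount_cost if churned else -discount_cost
               for _, churned in ranked)

y_true = [1, 1, 0, 0, 1, 0]              # 1 = customer actually churned
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]  # model's churn scores
print(campaign_value(y_true, scores, capacity=3))  # 340.0
```

Comparing models on this number, instead of on AUC, directly answers the question the business is asking: how much is this batch of interventions worth?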
In a recent blog post, I demonstrated how to do topic modelling of Amazon reviews. If you search for six topics, you will find a topic that's about hot beverages like coffee and tea. If you increase the number of topics to twelve, the beverages topic will be split into two different topics: one about coffee and the other about tea. There is no way that the algorithm can assess whether it's better for the business to separate these two or not. That being said, this shouldn't prevent you from doing such an analysis with a human in the loop if it's likely to have a bigger impact. But it should be clear from the outset that the difficulty of measuring performance in an unsupervised problem is much greater when compared to an equivalent supervised problem.

When is the model good enough?
I’m a big fan of moving to deployment as soon as the model generates decent value. I’ve
often seen models that could save hundreds of thousands of dollars per year that are not being
deployed because the data science team was confident that, given more time, they could get even
better results. This is another factor driving the Model Impact Epidemic.
But how do you define “decent value”? This highlights why it’s so important to define your
success criteria at the beginning of a project. If you have a clear threshold to make a
decision, you can use this step of your analysis to identify the first performance milestone that
defines your minimum viable product. If you know what your threshold is, and you hit it, you can
pause and deploy, or deploy the first version in parallel while you continue to refine your model.
On baseline models
Now that you have a performance measure to assess the quality of your model, you want to
have some kind of baseline for comparison. What should you use for your baseline? It can vary
depending on your particular use case, but there are basically two kinds of baseline models:
currently deployed solutions and naïve solutions.
Where possible, we want to compare the model that we are building against whatever
currently deployed solution is in place to address the business problem that we are looking
at. For example, are you using linear regression and Excel to predict maintenance? If so, that's
great—because you can then ask what the performance of that current system is. This is critical
because it allows you to put your results into perspective and to justify the money and time spent
developing your new analysis.
However, what do you do if the problem you’re trying to tackle doesn’t have a current solution?
After all, many of the triggers to create a team are new problems that traditional solutions aren’t
able to solve. It’s also possible that some current solution exists, but isn’t quantifiable — for
example, decisions are made on the fly by humans and are not recorded.
In this case, you’ll want to compare yourself to naïve or simple models. The naïve or default
model could be the majority class in a classification problem, the average in a regression
problem, or the demand yesterday in a demand forecasting scenario. Basically, what would
you do to quickly get a baseline of the current state of affairs if you didn’t have access to the
model that you’re building? As above, this will let you compare your solution to a baseline to
demonstrate the value of what you’re building.
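These naive baselines are trivial to compute; a sketch in plain Python:

```python
import statistics

def majority_baseline(train_labels):
    """Classification: always predict the most common class."""
    return statistics.mode(train_labels)

def mean_baseline(train_values):
    """Regression: always predict the training average."""
    return statistics.fmean(train_values)

def naive_forecast(series):
    """Demand forecasting: tomorrow's demand = today's demand."""
    return series[-1]

print(majority_baseline(["stays", "stays", "churns"]))  # "stays"
print(mean_baseline([10.0, 20.0, 30.0]))                # 20.0
print(naive_forecast([95, 102, 98, 110]))               # 110
```

If your model can't clearly beat these, the project isn't adding value yet, and that's worth knowing before deployment resources are committed.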
On validation
Validation is perhaps the most important part of any data science project. The data scientist should be extremely careful here, following best practices like keeping training and test sets strictly separate and using cross-validation. I won't address these issues in depth, as the details tend to be quite in the weeds for those who aren't data scientists.
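For the curious, here's what those best practices look like in a minimal scikit-learn sketch (synthetic data stands in for a real prepared dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in practice this is your prepared profile table.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5-fold cross-validation: each fold is held out once as a test set,
# so every performance estimate comes from data the model never saw
# during training.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(len(scores), scores.mean())  # 5 held-out accuracy estimates
```

The key idea is simply that performance is always measured on data the model was not trained on; everything else is refinement of that principle.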
If only it were so easy! Unfortunately, this is far from accurate. Before data scientists can start
the process of building models to solve problems, they need high-quality data that’s been
prepared for the task at hand. There are good reasons why there are whole schools of thought devoted just to data preparation.
So how do you align your team and set realistic expectations? I call this step profile generation. You start by putting the problem in context: machine learning is, at its core, pattern recognition. That makes it sound a bit old school, but it highlights an important point. You are detecting a pattern in your data in order to predict your label.
In order to do this, you first need to create a one-line-per-customer (or machine, asset, etc.)
etc.) representation. This one-line representation is what I call a profile. The art of data science
is to build a profile which is both complete and dense. Complete means that the data has all of
the information possible that may help the algorithm make its predictions. Dense means that you’ve
reached completeness with a low number of individual attributes. Large numbers of attributes can
not only lead to longer runtimes, but they can also prevent the algorithm from identifying patterns in
the data. The difficulty of getting the balance between completeness and denseness right makes the
data preparation step some of the heaviest lifting in data science.
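For example, with pandas you might collapse a raw event log into such a profile (the column names and data are invented):

```python
import pandas as pd

# Raw event log: one row per transaction, many rows per customer.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount":      [10.0, 25.0, 5.0, 300.0, 200.0],
    "returned":    [0, 1, 0, 0, 0],
})

# Profile: one row per customer, a few dense, informative attributes.
profile = events.groupby("customer_id").agg(
    n_purchases=("amount", "size"),
    total_spend=("amount", "sum"),
    return_rate=("returned", "mean"),
).reset_index()
print(profile)
```

The art lies in choosing which aggregates to keep: enough to be complete, few enough to stay dense.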
As mentioned above, whole schools of thought exist around how best to prep data. In what
follows, I intend to provide you with a high-level overview and some important details, but there
is certainly more depth to these issues than we have space to go into here.
Where does your data come from? This is a very important question, because you need
to make sure that you can access the data you need for your project. Some systems and
infrastructures lock users out, preventing data access. This needs to be checked early on in the
planning of your project, and processes put in place to grant you access to the data you need to
build your models.
I’ve personally run into situations where I learned after finishing a model that a given attribute
would not be available at runtime. It might be because the measurement takes 24 hours to
create, which means you won’t be able to use it on the fly. This issue can also arise as a result
of IT infrastructure, where you might be working with a system that only updates every 6 hours.
In this stage of your project, you want to make sure that you’re aware of what kind of data is
available in both cases.
Data types
The complexity of the modelling process, as well as the predictive power of a model, depends
a lot on the type of data that you're using. In my experience, the raw data often arrives as a log of timestamped events. In sensor analytics, for example, each row might record a timestamp, a sensor ID, and a measured value; with customer analytics, each row might record a timestamp, a customer ID, and a purchase.
These data sets are time series by nature. However, this data often exists in data warehouses
where you can get access to the aggregations of this data, but not the full data itself. On the one
hand, it’s great to have aggregated data to use for quickly building an initial prototype, because
it takes less work than collecting live data, and the aggregations do often contain meaningful
information. On the other hand, all types of aggregations remove information compared to
what is present in the raw data.
Raw data is never aggregated in a way that leaves the result as well suited to machine learning as the raw data itself. Thus, to get the best
results possible in your project, it’s important to get access to the underlying data and use that
for model training and evaluation.
There are often obvious ways to clean up the dirty data once you talk to the subject matter
experts, and data scientists should be willing to engage with those in the know to gather
information and get ideas about data cleaning. The process of cleaning and prepping the data is
obviously not doable in a short period of time, but you want to have a clear idea early on of how
much time you’ll need for this process so that you can plan accordingly.
Wrapping Up
If you follow this guide, you’ll be able to assess the value and feasibility of a machine learning
project correctly, right from the get-go, including getting early buy-in on critical aspects of the
process. This will result in fewer projects being killed in the early stages that could have been
successful, while also helping to prevent work on projects that won’t yield value, regardless of
the reason—whether it's because they're too challenging, because the necessary data doesn't exist, or because there are too many political hurdles to clear to get the model into production.
Based on the research that we did for our Model Impact Epidemic infographic, 4.95 billion models developed don't even end up having the potential to have a real business
impact. This demonstrates the clear need for more planning in order to help prioritize use
cases and identify areas for immediate impact. In fact, when you’re getting started with a
project, I usually recommend spending time up-front to map out not just one use case but as
many as possible, using the process described in this document, rather than going with the
problem > trigger > team model. The result of such a process is an Impact-Feasibility Map
which can be used to prioritize use cases and lobby internally.
There’s a German saying to keep in mind when assessing the viability of machine learning for
your potential use case: Better a painful end than pain without end. Don’t be afraid to not
use machine learning if there’s a better option available!
About the Author
Martin Schmitz, PhD is RapidMiner‘s Head of Data Science Services. Martin studied physics
at TU Dortmund University and joined RapidMiner in 2014. During his career as a researcher,
Martin was part of the IceCube Neutrino Observatory located at the geographic South Pole. Using RapidMiner on IceCube data, he studied the most violent phenomena in the universe, like supermassive black holes and gamma-ray bursts. As part of several interdisciplinary research centers, Martin
dived into computer science, mathematics, and statistics, and taught data science and the use of RapidMiner.
RapidMiner is reinventing enterprise AI so that anyone has the power to positively shape the future. We’re doing
this by enabling ‘data loving’ people of all skill levels, across the enterprise, to rapidly create and operate AI solutions
to drive immediate business impact. We offer an end-to-end platform that unifies data prep, machine learning, and
model operations with a user experience that provides depth for data scientists and simplifies complex tasks for
everyone else.
The RapidMiner Center of Excellence methodology and the RapidMiner Academy ensure customers are successful,
no matter their experience or resource levels. More than 40,000 organizations in over 150 countries rely on
RapidMiner to increase revenue, cut costs, and reduce risk.