Machine Learning with R and Python

Unit 1

Basics of Machine Learning

1.1 Introduction

1.2 Basic Concept of Machine Learning

1.3 Classes of Machine Learning Algorithms

1.4 Deep Learning

1.5 Why use R or Python for Machine Learning?

Summary

Keywords

Self-Assessment Questions

Answers to Check Your Progress

Suggested Reading
Objectives:

After going through this unit, you will be able to:

 Understand the concept of machine learning


 Explain the difference between supervised and unsupervised machine learning
 Describe classification, regression, dimension reduction, and clustering

1.1 INTRODUCTION

You interact with machine learning on a daily basis whether you recognize it or not. The
advertisements you see online are of products you’re more likely to buy, based on the things you’ve
previously bought or looked at. Faces in the photos you upload to social media platforms are
automatically identified and tagged. Your car’s GPS predicts which routes will be busiest at certain
times of day and re-plots your route to minimize journey length. Your email client progressively learns
which emails you want and which ones you consider spam to make your inbox less cluttered, and your
home personal assistant recognizes your voice and responds to your requests. From the small
improvements to our daily lives such as this, to the big, society-changing ideas such as self-driving cars,
robotic surgery, and the automated scanning for other Earth-like planets, machine learning has
become an increasingly important part of modern life.

Machine learning isn’t just the domain of large tech companies or computer scientists. Anyone with
basic programming skills can implement machine learning in their work. If you’re a scientist, machine
learning can give you extraordinary insights into the phenomena you’re studying. If you’re a journalist,
it can help you understand patterns in your data that can delineate your story. If you’re a business
person, machine learning can help you target the right customers and predict which products will sell
the best. If you’re someone with a question or problem, and have sufficient data to answer it, machine
learning can help you do just that.

In this unit we’re going to define what actually mean by the term machine learning. You’ll learn the
difference between an algorithm and a model, and discover that machine learning techniques.

1.2 BASIC CONCEPT OF MACHINE LEARNING

Imagine you work as a researcher in a hospital. What if, when a new patient is checked in, you could
calculate the risk of them dying? This would allow the clinicians to treat high risk patients more
aggressively and result in more lives being saved. But, where would you start? What data would you
use? How would you get this information from the data? The answer is to use machine learning.

Machine learning, sometimes referred to as statistical learning, is a subfield of artificial intelligence (AI) whereby algorithms "learn" patterns in data to perform specific tasks. Although algorithms may
sound complicated, they aren’t. In fact, the idea behind an algorithm is not complicated at all. An
algorithm is simply a step-by-step process that we use to achieve something, that has a beginning and
an end. Chefs have a different word for algorithms, they call them "recipes". At each stage in a recipe,
you perform some kind of process, like beating an egg, then follow the next instruction in the recipe,
such as mixing the ingredients.

So having gathered data on your patients, you train a machine learning algorithm to learn patterns in
the data associated with their survival. Now, when you gather data on a new patient, the algorithm
can estimate the risk of that patient dying.
As another example, imagine you work for a power company, and it’s your job to make sure
customers' bills are estimated accurately. You train an algorithm to learn patterns of data associated
with the electricity use of households. Now, when a new household joins the power company, you
can estimate how much money you should bill them each month.

Finally, imagine you’re a political scientist, and you’re looking for types of voters that no one (including
you) knows about. You train an algorithm to identify patterns of voters in survey data, to better
understand what motivates voters for a particular political party. Do you see any similarities between
these problems, and the problems you would like to solve? Then provided the solution is hidden
somewhere in your data, you can train a machine learning algorithm to extract it for you.

1.2.1 Artificial Intelligence and Machine Learning

Arthur Samuel, a scientist at IBM, first used the term machine learning in 1959. He used it to describe
a form of artificial intelligence (AI) that involved training an algorithm to learn to play the game of
checkers. The word learning is what’s important here, as this is what distinguishes machine learning
approaches from traditional AI.

Traditional AI is programmatic. In other words, you give the computer a set of rules so that when it
encounters new data, it knows precisely which output to give. The problem with this approach is that
you need to know all possible outputs the computer should give you in advance, and the system will
never give you an output that you haven’t told it to give. Contrast this to the machine learning
approach, where instead of telling the computer the rules, you give it the data and allow it to learn
the rules for itself. The advantage of this approach is that the "machine" can learn patterns we didn’t
even know existed in the data, and the more data we provide, the better it gets at learning those
patterns.

Fig. 1.1: Traditional AI vs. Machine Learning


1.2.2 The difference between a Model and an Algorithm

In practice, the set of rules a machine learning algorithm learns is called a model. Once the model has
been learned, we can give it new observations and it will output its predictions for the new data. We
refer to these as models because they represent real-world phenomena in a way that is simple enough
for us and the computer to interpret and understand. Just as a model of the Eiffel Tower may be a
good representation of the real thing but isn't exactly the same, so statistical models are attempted
representations of real-world phenomena but won't match them perfectly.

The process by which the model is learned is referred to as the algorithm. As we discovered earlier,
an algorithm is just a sequence of operations that work together to solve a problem. So how does this
work in practice? Let’s take a simple example. Say we have two continuous variables, and we would
like to train an algorithm that can predict one (the outcome or dependent variable), given the other
(the predictor or independent variable). The relationship between these variables can be described
by a straight line, which only needs two parameters to describe it: its slope, and where it crosses the
y axis (y intercept). This is shown in figure 1.2.

Fig. 1.2: Any straight line can be described by its slope (the change in y divided by the change in x),
and its intercept (where it crosses the y axis when x = 0). The equation y = intercept + slope * x can be
used to predict the value of y, given a value of x.

An algorithm to learn this relationship could look something like the example in figure 1.3. We start
by fitting a line with no slope through the mean of all the data. We calculate the distance each data
point is from the line, square it, and sum these squared values. This sum of squares is a measure of
how closely the line fits the data. Next, we rotate the line a little in a clockwise direction, and measure
the sum of squares for this line. If the sum of squares is bigger than it was before, we’ve made the fit
worse, so we rotate the slope in the other direction and try again. If the sum of squares gets smaller,
then we’ve made the fit better. We continue with this process, rotating the slope a little less each time
we get closer, until the improvement on our previous iteration is smaller than some pre-set value
we’ve chosen. The algorithm has iteratively learned the model (the slope and y intercept) needed to
predict future values of the output variable, given only the predictor variable. This example is slightly
crude, but hopefully illustrates how such an algorithm could work.
Fig. 1.3: A hypothetical algorithm for learning the parameters of a straight line. This algorithm takes
two continuous variables as inputs, and fits a straight line through the mean. It iteratively rotates the
line until it finds a solution that minimises the sum of squares. The parameters of the line are output
as the learned model.
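To make this idea concrete, here is a minimal Python sketch of such an iterative, sum-of-squares-driven search. It illustrates the idea rather than any particular library: the step size, the stopping tolerance, and the choice to pivot the line around the centre of the data are all assumptions made for this example.

import numpy as np

def fit_line_iteratively(x, y, step=0.1, tol=1e-6, max_iter=10_000):
    # Start with a flat line through the mean of the data, then repeatedly
    # nudge the slope, keeping any change that reduces the sum of squares.
    slope = 0.0
    x_mean, y_mean = np.mean(x), np.mean(y)

    def sum_of_squares(m):
        b = y_mean - m * x_mean            # keep the line through the data's centre
        return np.sum((y - (b + m * x)) ** 2), b

    best, intercept = sum_of_squares(slope)
    direction = 1.0
    for _ in range(max_iter):
        candidate = slope + direction * step
        err, b = sum_of_squares(candidate)
        if err < best:                     # better fit: keep this rotation
            improvement = best - err
            slope, intercept, best = candidate, b, err
            if improvement < tol:          # improvement below our pre-set value: stop
                break
        else:                              # worse fit: rotate the other way, more gently
            direction *= -1.0
            step *= 0.5
    return slope, intercept

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(0, 1, size=x.size)
print(fit_line_iteratively(x, y))          # roughly (2.5, 1.0)

In practice you would not write this yourself: ordinary least squares (for example lm() in R or LinearRegression in scikit-learn) solves for the same slope and intercept directly.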

While certain algorithms tend to perform better than others with certain types of data, there is no
single algorithm that will always outperform all others on all problems. This concept is called the no
free lunch theorem. In other words, you don’t get something for nothing; you need to put some effort
into working out the best algorithm for your particular problem. Instead, data scientists typically
choose a few algorithms they know tend to work well for the type of data and problem they are
working on, and see which algorithm generates the best performing model.

1.3 CLASSES OF MACHINE LEARNING ALGORITHMS

All machine learning algorithms can be categorized by their learning type, and the task they perform.
There are three learning types:

 Supervised learning
 Unsupervised learning
 Reinforcement learning

This depends on how the algorithms learn. Do they require us to hold their hand through the learning
process? Or do they learn the answers for themselves? Supervised and unsupervised algorithms can
be further split into two classes each:

 supervised
o classification
o regression
 unsupervised
o dimension reduction
o clustering

This depends on what they learn to do.

So we categorize algorithms by how they learn, and what they learn to do. But why do we care about
this? Well, there are a lot of machine learning algorithms available to us. How do we know which one
to pick? What kind of data do they require to function properly? Knowing which categories different
algorithms belong to makes our job of selecting the most appropriate ones much simpler.

1.3.1 Differences between Supervised, Unsupervised, and Semi-Supervised Learning

Imagine you are trying to get a toddler to learn about shapes using blocks of wood. In front of them,
they have a ball, a cube, and a star. You ask them to show you the cube, and if they point to the correct
shape you tell them they are correct, and if they are incorrect you also tell them. You repeat this
procedure until the toddler can identify the correct shape almost all of the time. This is called
supervised learning, because you, the person who already knows which shape is which, are supervising
the learner by telling them the answers.

Now imagine a toddler is given multiple balls, cubes, and stars, but this time is also given three bags.
The toddler has to put all the balls in one bag, the cubes in another bag, and the stars in another, but
you won’t tell them if they’re correct, they have to work it out for themselves from nothing but the
information they have in front of them. This is called unsupervised learning, because the learner has
to identify patterns themselves with no outside help.

So a machine learning algorithm is said to be supervised if it uses a ground truth, or in other words,
labeled data. For example, if we wanted to classify a patient biopsy as healthy or cancerous based on
its gene expression, we would give an algorithm the gene expression data, labeled with whether that
tissue was healthy or cancerous. The algorithm now knows which cases come from each of the two
types, and tries to learn patterns in the data that discriminate them.

Another example would be if we were trying to estimate how much someone spends on their credit
card in a given month, we would give an algorithm information about them such as their income,
family size, whether they own their home etc., labeled with how much they spent on their credit card.
The algorithm now knows how much each of the cases spent, and looks for patterns in the data that
can predict these values in a reproducible way.

A machine learning algorithm is said to be unsupervised if it does not use a ground truth, and instead
looks for patterns in the data on its own, that hint at some underlying structure. For example, let’s say
we take the gene expression data from lots of cancerous biopsies, and ask an algorithm to tell us if
there are clusters of biopsies. A cluster is a group of data which are similar to each other, but different
from data in other clusters. This type of analysis can tell us if we have subgroups of cancer types which
we may need to treat differently.

Alternatively, we may have a dataset with a large number of variables, so many that it is difficult to
interpret and look for relationships manually. We can ask an algorithm to look for a way of
representing this high-dimensional dataset in a lower-dimensional one, while maintaining as much
information from the original data as possible. Take a look at the summary in figure 1.4. If your
algorithm uses labeled data (i.e. a ground truth), then it is supervised, and if it does not use labeled
data then it is unsupervised.

Fig. 1.4: Supervised Vs. Unsupervised Machine Learning

Semi-supervised learning

Most machine learning algorithms will fall into one of these categories, but there is an additional
approach called semi-supervised learning. As its name suggests, semi-supervised machine learning is
not quite supervised and not quite unsupervised.

Semi-supervised learning often describes a machine learning approach that combines supervised and
unsupervised algorithms together, rather than strictly defining a class of algorithms in and of itself. The
premise of semi-supervised learning is that, often, labeling a dataset may require a large amount of
manual work by an expert observer. This process may be very time-consuming, expensive and error-
prone, and may be impossible for an entire dataset. So instead, we expertly label as many of the cases
as is feasibly possible, then build a supervised model using only these labeled data. We pass the rest
of our data (the unlabeled cases) into the model, to get the predicted labels for these, which are called
pseudo-labels, because we don’t know if all of them are actually correct. Now, we combine the data
with the manual labels and pseudo-labels, and use this to train a new model.
This approach allows us to train a model that learns from both labeled and unlabeled data, and can
improve overall predictive performance because we are able to use all of the data at our disposal.
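As a rough illustration of that workflow, the short Python sketch below uses scikit-learn; the synthetic dataset, the choice of logistic regression, and the fraction of labelled cases are assumptions made purely for demonstration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Pretend only a small fraction of our data has expert labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:100] = True                      # the first 100 cases were labelled by hand

# 1. Build a supervised model using only the labelled cases.
model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])

# 2. Predict pseudo-labels for the unlabelled cases.
pseudo_labels = model.predict(X[~labeled])

# 3. Combine manual labels and pseudo-labels, and train a new model on everything.
X_combined = np.concatenate([X[labeled], X[~labeled]])
y_combined = np.concatenate([y[labeled], pseudo_labels])
final_model = LogisticRegression(max_iter=1000).fit(X_combined, y_combined)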

Within the categories of supervised and unsupervised, machine learning algorithms can be further
categorized by the tasks they perform. Just like a mechanical engineer knows which tools to use for
the task at hand, so the data scientist needs to know which algorithms they should use for their task.
There are four main classes to choose from: classification, regression, dimension reduction, and
clustering.

1.3.2 Classification, Regression, Dimension Reduction, and Clustering

Supervised machine learning algorithms can be split into two classes: classification algorithms and
regression algorithms.

 Classification algorithms take labeled data (because they are supervised learning methods)
and learn patterns in the data that can be used to predict a categorical output variable. This
is most often a grouping variable (a variable specifying which group a particular case belongs
to) and can be binomial (two groups) or multinomial (more than two groups). Classification
problems are very common machine learning tasks. Which customers will default on their
payments? Which patients will survive? Which objects in a telescope image are stars, planets
or galaxies? When faced with problems like these, you should use a classification algorithm.
 Regression algorithms take labeled data, and learn patterns in the data that can be used to
predict a continuous output variable. How much carbon dioxide does a household contribute
to the atmosphere? What will the share price of a company be tomorrow? What is the
concentration of insulin in a patient’s blood? When faced with problems like these, you should
use a regression algorithm.

Unsupervised machine learning algorithms can also be split into two classes: dimension reduction and
clustering algorithms.

 Dimension reduction algorithms take unlabeled (because they are unsupervised learning
methods), high-dimensional data (data with many variables) and learn a way of representing
it in a lower number of dimensions. Dimension reduction techniques may be used as an
exploratory technique (because it’s very difficult for humans to visually interpret data in more
than two or three dimensions at once), or as a pre-processing step in our machine learning
pipeline (it can help mitigate problems such as collinearity and the curse of dimensionality,
terms we will define in later chapters). We can also use it to help us visually confirm the
performance of classification and clustering algorithms (by allowing us to plot the data in two
or three dimensions).
 Clustering algorithms take unlabeled data, and learn patterns of clustering in the data. A
cluster is a collection of observations which are more similar to each other, than to data points
in other clusters. We assume that observations in the same cluster share some unifying
features or identity that makes them identifiably different from other clusters. Clustering
algorithms may be used as an exploratory technique to understand the structure of our data,
and may indicate a grouping structure that can be fed into classification algorithms. Are there
subtypes of patient responders in a clinical trial? How many classes of respondent were there
in the survey? Are there different types of customer that use our company? When faced with
problems like these, you should use a clustering algorithm.
By separating machine learning algorithms into these four classes, you will find it easier to select
appropriate ones for the tasks at hand. Deciding which class of algorithm to choose from is usually
straightforward:

 If you need to predict a categorical variable, use a classification algorithm.


 If you need to predict a continuous variable, use a regression algorithm
 If you need to represent the information of many variables with fewer variables, use
dimension reduction
 If you need to identify clusters of cases, use a clustering algorithm
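As a loose illustration of this decision process in Python, each of the four task classes maps onto a family of algorithms; the scikit-learn classes named below are just one representative example of each family, not the only options.

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# One representative algorithm per task class (many alternatives exist).
task_to_algorithm = {
    "predict a categorical variable":       LogisticRegression(),   # classification
    "predict a continuous variable":        LinearRegression(),     # regression
    "represent many variables with fewer":  PCA(n_components=2),    # dimension reduction
    "identify clusters of cases":           KMeans(n_clusters=3),   # clustering
}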

1.4 DEEP LEARNING

Deep learning is a subfield of machine learning (all deep learning is machine learning, but not all
machine learning is deep learning) that has become extremely popular in the last 5 to 10 years for two
main reasons:

 It can produce models with outstanding performance


 We now have the computational power to apply it more broadly

Deep learning uses neural networks to learn patterns in data, a term referring to the way in which the
structure of these models superficially resembles neurons in the brain, with connections allowing
them to pass information between them. The relationship between AI, machine learning, and deep
learning is summarized in figure 1.5.

Fig. 1.5: The relationship between Artificial Intelligence (AI), Machine Learning, and Deep Learning

While it’s true that deep learning methods will typically out-perform "shallow" learning methods (a
term sometimes used to distinguish machine learning methods that are not deep learning) for the
same dataset, they are not always the best choice for a given problem. Deep learning methods often
are not the most appropriate method for a given problem for three reasons:

 They are computationally expensive. By expensive, we don’t mean monetary cost of course,
we mean they require a lot of computing power, which means they can take a long time (hours
or even days!) to train. Arguably this is a less important reason not to use deep learning,
because if a task is important enough to you, you can invest the time and computational
resources required to solve it. But if you can train a model in a few minutes that performs
well, then why waste additional time and resources?
 They tend to require more data. Deep learning models typically require hundreds to
thousands of cases in order to perform extremely well. This largely depends on the complexity
of the problem at hand, but shallow methods tend to perform better on small datasets than
their deep learning counterparts.
 The rules are less interpretable. By their nature, deep learning models favor performance over
model interpretability. Arguably, our focus should be on performance, but often we’re not
only interested in getting the right output, we’re also interested in the rules the algorithm
learned because these help us to interpret things about the real world and may help us further
our research. The rules learned by a neural network are not easy to interpret.

So while deep learning methods can be extraordinarily powerful, shallow learning techniques are still
invaluable tools in the arsenal of data scientists.

Deep learning algorithms are particularly good at tasks involving complex data, such as image
classification and audio transcription.

1.5 WHY USE R OR PYTHON FOR MACHINE LEARNING?

There is something of a rivalry between the two most commonly used data science languages: R and
Python. Anyone who is new to machine learning will choose one or the other to get started, and their
decision will often be guided by the learning resources they have access to, which one is more
commonly used in their field of work, and which one their colleagues use. There are no machine
learning tasks which are only possible to apply in one language or the other, although some of the
more cutting-edge deep learning approaches are easier to apply in Python (they tend to be written in
Python first and implemented in R later). Python, while very good for data science, is a more general
purpose programming language, whereas R is geared specifically for mathematical and statistical
applications. This means that users of R can focus purely on data, but may feel restricted if they ever
need to build applications based on their models. There are modern tools in R designed specifically to
make data science tasks simple and human-readable, such as those from the tidyverse.

Traditionally, machine learning algorithms in R were scattered across multiple packages, written by
different authors. This meant you would need to learn to use new functions with different arguments
and implementations, each time you wanted to apply a new algorithm. Proponents of Python could
use this as an example of why it was better suited for machine learning, as it has the well-known scikit-
learn package which has a plethora of machine learning algorithms built into it. But R has now followed
suit, with the caret and mlr packages.

The mlr package (which stands for machine learning in R) provides an interface for a large number of
machine learning algorithms, and allows you to perform extremely complicated machine learning
tasks with very little coding.
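For comparison, scikit-learn's appeal rests on the same idea of a uniform interface: supervised estimators are trained with fit() and queried with predict(), so switching algorithms typically changes only one line. The sketch below is a minimal example of that pattern; the dataset and the model choice are arbitrary.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Every scikit-learn supervised estimator follows the same fit/predict pattern,
# so swapping in a different algorithm means changing only the estimator line.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on held-out data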

Check your Progress 1

Fill in the blanks

1. A ______ is just a sequence of operations that work together to solve a problem.


2. A machine learning algorithm is said to be ______ if it uses a ground truth, or in other words,
labeled data.
3. A ______ is a collection of observations which are more similar to each other, than to data
points in other clusters.
4. Deep learning uses ______ to learn patterns in data, a term referring to the way in which the
structure of these models superficially resembles neurons in the brain, with connections
allowing them to pass information between them.

Activity 1

1. List the applications of deep learning.


2. Search and list the algorithms used in Artificial Intelligence technique.

Summary

 Artificial intelligence is the appearance of intelligent behaviour by a computer process.


 Machine learning is a subfield of artificial intelligence, where the computer learns
relationships in data to make predictions on future, unseen data, or to identify meaningful
patterns that help us understand our data better.
 A machine learning algorithm is the process by which patterns and rules in the data are
learned, and the model is the collection of those patterns and rules which accepts new data,
applies the rules to it, and outputs an answer.
 Deep learning is a subfield of machine learning, which is, itself, a subfield of artificial
intelligence.
 Machine learning algorithms are categorized/divided as supervised and unsupervised,
depending on whether they learn from ground-truth-labeled data (supervised learning) or
unlabeled data (unsupervised learning).
 Supervised learning algorithms are categorized/divided into classification (if they predict a
categorical variable) or regression (if they predict a continuous variable).
 Unsupervised learning algorithms are categorized/divided into dimension reduction (if they
find a lower dimension representation of the data) or clustering (if they identify clusters of
cases in the data).
 Along with Python, R is a popular data science language and contains many tools and built-in
data that simplify the process of learning data science and machine learning.

Keywords

 Machine Learning: It is an application of artificial intelligence (AI) that provides systems the
ability to automatically learn and improve from experience without being explicitly
programmed.
 Artificial Intelligence: It is the simulation of human intelligence processes by machines,
especially computer systems.
 Algorithm: It is a finite sequence of well-defined, computer-implementable instructions,
typically to solve a class of problems or to perform a computation.
 Deep Learning: A class of machine learning algorithms that uses multiple layers to
progressively extract higher level features from the raw input.

Self-Assessment Questions

1. State the difference between Supervised and Unsupervised learning.


2. Explain the concept of Deep Learning with example.
3. Write a short note on Artificial Intelligence.
Answers to Check Your Progress

Check your Progress 1

Fill in the blanks

1. An algorithm is just a sequence of operations that work together to solve a problem.


2. A machine learning algorithm is said to be supervised if it uses a ground truth, or in other
words, labeled data.
3. A cluster is a collection of observations which are more similar to each other, than to data
points in other clusters.
4. Deep learning uses neural networks to learn patterns in data, a term referring to the way in
which the structure of these models superficially resembles neurons in the brain, with
connections allowing them to pass information between them.

Suggested Reading

1. Machine Learning with R, Tidyverse, and Mlr by Hefin Ioan Rhys


2. Deep Learning by Ian Goodfellow, Yoshua Bengio, Aaron Courville
3. Elements of Machine Learning by Pat Langley
4. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools and
Techniques to Build Intelligent Systems by Aurélien Géron
5. Introduction to Machine Learning by Ethem Alpaydin

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
ShareAlike 4.0 International (CC BY-SA 4.0) as requested by the work’s creator or licensees. This license
is available at https://creativecommons.org/licenses/by-sa/4.0/.
Unit 2

Supervised Machine Learning

2.1 Introduction

2.2 Supervised Learning

2.3 Algorithm Types

2.3.1 K-Nearest-Neighbours (KNN) Algorithm

2.3.2 Naïve Bayes Classifier

2.3.3 Decision Tree

2.3.4 Support Vector Machine

Summary

Keywords

Self-Assessment Questions

Answers to Check Your Progress

Suggested Reading
Objectives:

After going through this unit, you will be able to:

• Understand the concept of supervised machine learning algorithms


• Explain the different supervised machine learning algorithms

2.1 INTRODUCTION

Machine learning algorithms are organized into a taxonomy based on the desired outcome of the
algorithm. Common algorithm types include:

• Supervised learning --- where the algorithm generates a function that maps inputs to desired
outputs. One standard formulation of the supervised learning task is the classification
problem: the learner is required to learn (to approximate the behaviour of) a function which
maps a vector into one of several classes by looking at several input-output examples of the
function.
• Unsupervised learning --- which models a set of inputs: labeled examples are not available.
• Semi-supervised learning --- which combines both labeled and unlabeled examples to
generate an appropriate function or classifier.
• Reinforcement learning --- where the algorithm learns a policy of how to act given an
observation of the world. Every action has some impact in the environment, and the
environment provides feedback that guides the learning algorithm.
• Transduction --- similar to supervised learning, but does not explicitly construct a function:
instead, tries to predict new outputs based on training inputs, training outputs, and new
inputs.
• Learning to learn --- where the algorithm learns its own inductive bias based on previous
experience.

The performance and computational analysis of machine learning algorithms is a branch of statistics
known as computational learning theory. Machine learning is about designing algorithms that allow a
computer to learn. Learning does not necessarily involve consciousness; rather, it is a matter of finding
statistical regularities or other patterns in the data. Thus, many machine learning algorithms will
barely resemble how a human might approach a learning task. However, learning algorithms can give
insight into the relative difficulty of learning in different environments.

2.2 SUPERVISED LEARNING

Supervised learning is fairly common in classification problems because the goal is often to get the
computer to learn a classification system that we have created. Digit recognition is a common example
of classification learning. More generally, classification learning is appropriate for
any problem where deducing a classification is useful and the classification is easy to determine. In
some cases, it might not even be necessary to give predetermined classifications to every instance of
a problem if the agent can work out the classifications for itself. This would be an example of
unsupervised learning in a classification context.

Supervised learning often leaves the probability distribution of the inputs undefined. This model is not needed as
long as the inputs are available, but if some of the input values are missing, it is not possible to infer
anything about the outputs. In unsupervised learning, all the observations are assumed to be caused by
latent variables; that is, the observations are assumed to lie at the end of the causal chain. Examples of
supervised learning and unsupervised learning are shown in the figure 2.1 below:
Fig. 2.1: Supervised and Unsupervised Learning

Supervised learning is the most common technique for training neural networks and decision trees.
Both of these techniques are highly dependent on the information given by the pre-determined
classifications. In the case of neural networks, the classification is used to determine the error of the
network and then adjust the network to minimize it, and in decision trees, the classifications are used
to determine what attributes provide the most information that can be used to solve the classification
puzzle.

We'll look at both of these in more detail, but for now, it should be sufficient to know that both of
these examples thrive on having some "supervision" in the form of pre-determined classifications.
Inductive machine learning is the process of learning a set of rules from instances (examples in a
training set), or more generally speaking, creating a classifier that can be used to generalize from new
instances. The process of applying supervised ML to a real world problem is described in Figure 2.2.

The first step is collecting the dataset. If a requisite expert is available, then s/he could suggest which
fields (attributes, features) are the most informative. If not, then the simplest method is that of “brute-
force,” which means measuring everything available in the hope that the right (informative, relevant)
features can be isolated. However, a dataset collected by the "brute-force" method is not directly
suitable for induction: in most cases it contains noise and missing feature values, and therefore
requires significant pre-processing, according to Zhang et al (Zhang, 2002).

The second step is data preparation and data pre-processing. Depending on the circumstances,
researchers have a number of methods to choose from for handling missing data, and recent surveys
of contemporary techniques for outlier (noise) detection have identified the advantages and
disadvantages of those techniques. Instance selection is not only used to handle
noise but to cope with the infeasibility of learning from very large datasets. Instance selection in these
datasets is an optimization problem that attempts to maintain the mining quality while minimizing the
sample size. It reduces data and enables a data mining algorithm to function and work effectively with
very large datasets. There is a variety of procedures for sampling instances from a large dataset. See
figure 2.2 below.

Feature subset selection is the process of identifying and removing as many irrelevant and redundant
features as possible. This reduces the dimensionality of the data and enables data mining algorithms
to operate faster and more effectively. The fact that many features depend on one another often
unduly influences the accuracy of supervised ML classification models. This problem can be addressed
by constructing new features from the basic feature set. This technique is called feature
construction/transformation. These newly generated features may lead to the creation of more
concise and accurate classifiers.

In addition, the discovery of meaningful features contributes to better comprehensibility of the produced classifier, and a better understanding of the learned concept. Speech recognition using
hidden Markov models and Bayesian networks relies on some elements of supervision as well in order
to adjust parameters to, as usual, minimize the error on the given inputs. Notice something important
here: in the classification problem, the goal of the learning algorithm is to minimize the error with
respect to the given inputs. These inputs, often called the "training set", are the examples from which
the agent tries to learn. But learning the training set well is not necessarily the best thing to do.

For instance, if I tried to teach you exclusive-or, but only showed you combinations consisting of one
true and one false, but never both false or both true, you might learn the rule that the answer is always
true. Similarly, with machine learning algorithms, a common problem is over-fitting the data and
essentially memorizing the training set rather than learning a more general classification technique.
As you might imagine, not all training sets have their inputs classified correctly. This can lead to
problems if the algorithm used is powerful enough to memorize even the apparently "special cases"
that don't fit the more general principles. This, too, can lead to over-fitting, and it is a challenge to find
algorithms that are both powerful enough to learn complex functions and robust enough to produce
generalisable results.
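One way to see over-fitting directly is to compare training and test accuracy for a model that is allowed to memorise a noisy training set. The Python sketch below does this with scikit-learn decision trees; the synthetic dataset and the depth limit of 3 are illustrative choices only.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy data: only some features are informative, and some labels are flipped.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# An unconstrained tree can memorise the training set, including its noise...
deep = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
# ...while a depth-limited tree is forced to learn more general rules.
shallow = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)

print("deep tree:    train", deep.score(X_train, y_train), "test", deep.score(X_test, y_test))
print("shallow tree: train", shallow.score(X_train, y_train), "test", shallow.score(X_test, y_test))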

Fig. 2.2: Machine Learning Supervise Process

2.3 ALGORITHM TYPES

In the area of supervised/unsupervised learning, the following are the algorithm types:

• Linear Classifiers
 Logistic Regression
 Naïve Bayes Classifier
 Perceptron
 Support Vector Machine
• Quadratic Classifiers
• K-Means Clustering
• Boosting
• Decision Tree
 Random Forest
• Neural networks
• Bayesian Networks

In this unit, we shall explain four machine learning techniques with their examples and how they
perform in reality. These are:

• k-nearest neighbours algorithm


• Naïve Bayes Classifier
• Decision Tree
• Support Vector Machine

Linear Classifiers

In machine learning, the goal of classification is to group items that have similar feature values, into
groups. Timothy et al (Timothy Jason Shepard, 1998) stated that a linear classifier achieves this by
making a classification decision based on the value of the linear combination of the features. If the
input feature vector to the classifier is a real vector , then the output score is

where is a real vector of weights and f is a function that converts the dot product of the two vectors
into the desired output. The weight vector is learned from a set of labelled training samples. Often
f is a simple function that maps all values above a certain threshold to the first class and all other
values to the second class. A more complex f might give the probability that an item belongs to a
certain class. For a two-class classification problem, one can visualize the operation of a linear classifier
as splitting a high-dimensional input space with a hyperplane: all points on one side of the hyper plane
are classified as "yes", while the others are classified as "no". A linear classifier is often used in
situations where the speed of classification is an issue, since it is often the fastest classifier, especially
when is sparse. However, decision trees can be faster. Also, linear classifiers often work very well
when the number of dimensions in is large, as in document classification, where each element in
is typically the number of counts of a word in a document. In such cases, the classifier should be
well regularized.
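A bare-bones Python sketch of this decision rule is shown below. The weight values are invented purely for illustration; in practice w would be learned from the labelled training samples as described above.

import numpy as np

def linear_classify(x, w, threshold=0.0):
    # Score the input by the linear combination w . x, then threshold it.
    score = np.dot(w, x)
    return "yes" if score > threshold else "no"

w = np.array([0.4, -1.2, 0.7])      # learned weight vector (illustrative values)
x = np.array([1.0, 0.5, 2.0])       # input feature vector
print(linear_classify(x, w))        # 'yes': 0.4 - 0.6 + 1.4 = 1.2 > 0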

2.3.1 K-Nearest-Neighbours (KNN) Algorithm

The KNN algorithm is a robust and versatile classifier that is often used as a benchmark for more
complex classifiers such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM).
Despite its simplicity, KNN can outperform more powerful classifiers and is used in a variety of
applications such as economic forecasting, data compression and genetics.
Let’s first start by establishing some definitions and notations. We will use x to denote a feature (aka.
predictor, attribute) and y to denote the target (aka. label, class) we are trying to predict.

KNN falls in the supervised learning family of algorithms. Informally, this means that we are given a
labelled dataset consisting of training observations (x, y) and would like to capture the relationship
between x and y. More formally, our goal is to learn a function h: X→Y so that given an unseen
observation x, h(x) can confidently predict the corresponding output y.

The KNN classifier is also a non-parametric and instance-based learning algorithm.

 Non-parametric means it makes no explicit assumptions about the functional form of h, avoiding the dangers of mis-modeling the underlying distribution of the data. For example,
suppose our data is highly non-Gaussian but the learning model we choose assumes a
Gaussian form. In that case, our algorithm would make extremely poor predictions.
 Instance-based learning means that our algorithm doesn’t explicitly learn a model. Instead, it
chooses to memorize the training instances which are subsequently used as “knowledge” for
the prediction phase. Concretely, this means that only when a query to our database is made
(i.e. when we ask it to predict a label given an input), will the algorithm use the training
instances to spit out an answer.

In the classification setting, the K-nearest neighbour algorithm essentially boils down to forming a
majority vote between the K most similar instances to a given “unseen” observation. Similarity is
defined according to a distance metric between two data points. A popular choice is the Euclidean
distance, given by

d(x, x') = √( (x1 − x'1)² + (x2 − x'2)² + … + (xn − x'n)² )

but other measures can be more suitable for a given setting and include the Manhattan, Chebyshev
and Hamming distance.

More formally, given a positive integer K, an unseen observation x and a similarity metric d, KNN
classifier performs the following two steps:

 It runs through the whole dataset computing d between x and each training observation. We’ll
call the K points in the training data that are closest to x the set A. Note that K is usually odd
to prevent tie situations.
 It then estimates the conditional probability for each class, that is, the fraction of points in A
with that given class label:

P(y = j | X = x) = (1/K) Σ_{i ∈ A} I(y_i = j)

(Note that I(·) is the indicator function, which evaluates to 1 when its argument is true and 0 otherwise.)

Finally, our input x gets assigned to the class with the largest probability.

At this point, you’re probably wondering how to pick the variable K and what its effects are on your
classifier. Well, like most machine learning algorithms, the K in KNN is a hyper-parameter that you, as
a designer, must pick in order to get the best possible fit for the data set. Intuitively, you can think of
K as controlling the shape of the decision boundary we talked about earlier.
When K is small, we are restraining the region of a given prediction and forcing our classifier to be
“more blind” to the overall distribution. A small value for K provides the most flexible fit, which will
have low bias but high variance. Graphically, our decision boundary will be more jagged. On the other
hand, a higher K averages more voters in each prediction and hence is more resilient to outliers. Larger
values of K will have smoother decision boundaries which means lower variance but increased bias.

For example, suppose a k-NN algorithm was given an input of data points of specific men and women's
weight and height, as plotted below. To determine the gender of an unknown input (green point), k-
NN can look at the nearest k neighbours (suppose k=3) and will determine that the input's gender is
male. This method is a very simple and logical way of marking unknown inputs, with a high rate of
success.

K-nearest neighbours can be used in classification or regression machine learning tasks. Classification
involves placing input points into appropriate categories whereas regression involves establishing a
relationship between input points and the rest of the data. In either of these cases, determining a
neighbour can be performed using many different notions of distance, with the most common being
Euclidean and Hamming distance. Euclidean distance is the most popular notion of distance--the
length of a straight line between two points. Hamming distance is the same concept, but for strings
distance is calculated as the number of positions where two strings differ. Furthermore, for certain
multivariable tasks, distances must be normalized (or weighted) to accurately represent the
correlation between variables and their strength of correlation.
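Both distance measures are easy to write down explicitly. In the Python sketch below, the example inputs and the zero-mean, unit-variance scaling are illustrative assumptions; other normalisation or weighting schemes may suit a given task better.

import numpy as np

def euclidean(a, b):
    # Length of the straight line between two numeric points.
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def hamming(a, b):
    # Number of positions at which two equal-length sequences differ.
    return sum(x != y for x, y in zip(a, b))

print(euclidean([1.80, 77.0], [1.65, 68.0]))   # height (m), weight (kg)
print(hamming("karolin", "kathrin"))           # 3

# For multivariable tasks, features on different scales can be normalised
# (for example to zero mean and unit variance) before computing distances.
X = np.array([[1.80, 77.0], [1.65, 68.0], [1.75, 90.0]])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)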
The KNN Algorithm

1. Load the data


2. Initialize K to your chosen number of neighbours
3. For each example in the data
3.1 Calculate the distance between the query example and the current example from the data.
3.2 Add the distance and the index of the example to an ordered collection
4. Sort the ordered collection of distances and indices from smallest to largest (in ascending
order) by the distances
5. Pick the first K entries from the sorted collection
6. Get the labels of the selected K entries
7. If regression, return the mean of the K labels
8. If classification, return the mode of the K labels
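A direct, unoptimised translation of these steps into Python might look like the following. The Euclidean metric and the tiny height/weight dataset are assumptions chosen for illustration only.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3, mode="classification"):
    # Steps 3.1-3.2: distance from the query to every training example, with its index.
    distances = [(np.linalg.norm(np.asarray(x) - np.asarray(query)), i)
                 for i, x in enumerate(X_train)]
    # Steps 4-6: sort ascending, keep the first K entries, look up their labels.
    neighbours = sorted(distances)[:k]
    labels = [y_train[i] for _, i in neighbours]
    # Steps 7-8: mean for regression, mode (majority vote) for classification.
    if mode == "regression":
        return float(np.mean(labels))
    return Counter(labels).most_common(1)[0][0]

# Tiny illustrative dataset: (height in metres, weight in kg) -> gender label.
X_train = [(1.80, 80), (1.78, 77), (1.60, 55), (1.65, 58)]
y_train = ["male", "male", "female", "female"]
print(knn_predict(X_train, y_train, query=(1.77, 75), k=3))   # 'male'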

For k-NN classification, an input is classified by a majority vote of its neighbours. That is, the algorithm
obtains the class membership of its k neighbours and outputs the class that represents a majority of
the k neighbours.

An example of k-NN classification

Suppose we are trying to classify the green circle. Let us begin with k = 3 (the solid line). In this case,
the algorithm would return a red triangle, since it constitutes a majority of the 3 neighbours. Likewise,
with k = 5 (the dotted line), the algorithm would return a blue square.

If no majority is reached with the k neighbours, many courses of action can be taken. For example,
one could use a plurality system or even use a different algorithm to determine the membership of
that data point.

K-NN regression works in a similar manner. The value returned is the average value of the input's k
neighbours.
Suppose we have data points from a sine wave above (with some variance, of course) and our task is
to produce a y value for a given x value. When given an input data point to classify, k-NN would return
the average y value of the input's k neighbours. For example, if k-NN were asked to return the
corresponding y value for x=0, the algorithm would find the k nearest points to x=0 and return the
average y value corresponding to these k points. This algorithm would be simple, but very successful
for most x values.
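A short scikit-learn sketch of this idea is given below; the amount of noise added to the sine wave and the choice of k = 5 are arbitrary.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)       # x values
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)    # noisy sine wave

# Predict y for a new x by averaging the y values of its k nearest neighbours.
model = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print(model.predict([[0.0]]))   # close to sin(0) = 0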

Pros and Cons

k-NN is one of many algorithms used in machine learning tasks, in fields such as computer vision and
gene expression analysis. So why use k-NN over other algorithms? The following is a list of pros and
cons k-NN has over alternatives.

Pros:

 Very easy to understand and implement. A k-NN implementation does not require much code
and can be a quick and simple way to begin machine learning datasets.
 Does not assume any probability distributions on the input data. This can come in handy for
inputs where the probability distribution is unknown and is therefore robust.
 Can quickly respond to changes in input. k-NN employs lazy learning, which generalizes during
testing--this allows it to change during real-time use.

Cons:

 Sensitive to localized data. Since k-NN gets all of its information from the input's neighbours,
localized anomalies affect outcomes significantly, rather than for an algorithm that uses a
generalized view of the data.
 Computation time. Lazy learning requires that most of k-NN's computation be done during
testing, rather than during training. This can be an issue for large datasets.
 Normalization. If one type of category occurs much more than another, classifying an input
will be more biased towards that one category (since it is more likely to be neighbours with
the input). This can be mitigated by applying a lower weight to more common categories and
a higher weight to less common categories; however, this can still cause errors near decision
boundaries.

Dimensions. In the case of many dimensions, inputs can commonly be "close" to many data points.
This reduces the effectiveness of k-NN, since the algorithm relies on a correlation between closeness
and similarity. One workaround for this issue is dimension reduction, which reduces the number of
working variable dimensions (but can lose variable trends in the process).

2.3.2 Naïve Bayes Classifier


A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from
Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the
underlying probability model would be "independent feature model".

In simple terms, a naive Bayes classifier assumes that the presence (or absence) of a particular feature
of a class is unrelated to the presence (or absence) of any other feature. For example, a fruit may be
considered to be an apple if it is red, round, and about 4" in diameter. Even if these features depend
on each other or upon the existence of the other features, a naive Bayes classifier considers all of
these properties to independently contribute to the probability that this fruit is an apple.
Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very
efficiently in a supervised learning setting. In many practical applications, parameter estimation for
naive Bayes models uses the method of maximum likelihood; in other words, one can work with the
naive Bayes model without believing in Bayesian probability or using any Bayesian methods. In spite
of their naive design and apparently over-simplified assumptions, naive Bayes classifiers have worked
quite well in many complex real-world situations.

An advantage of the naive Bayes classifier is that it only requires a small amount of training data to
estimate the parameters (means and variances of the variables) necessary for classification. Because
independent variables are assumed, only the variances of the variables for each class need to be
determined and not the entire covariance matrix.

The naive Bayes probabilistic model

Abstractly, the probability model for a classifier is a conditional model

p(C | F1, …, Fn)

over a dependent class variable C with a small number of outcomes or classes, conditional on several
feature variables F1 through Fn. The problem is that if the number of features n is large or when a
feature can take on a large number of values, then basing such a model on probability tables is
infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, we
write

p(C | F1, …, Fn) = p(C) p(F1, …, Fn | C) / p(F1, …, Fn)

In plain English the above equation can be written as

posterior = (prior × likelihood) / evidence

In practice we are only interested in the numerator of that fraction, since the denominator does not
depend on C and the values of the features Fi are given, so that the denominator is effectively
constant. The numerator is equivalent to the joint probability model

p(C, F1, …, Fn)

which can be rewritten as follows, using repeated applications of the definition of conditional
probability:

p(C, F1, …, Fn) = p(C) p(F1 | C) p(F2 | C, F1) p(F3 | C, F1, F2) … p(Fn | C, F1, …, Fn−1)

Now the "naive" conditional independence assumptions come into play: assume that each feature Fi
is conditionally independent of every other feature Fj for j not equal to i. This means that

p(Fi | C, Fj) = p(Fi | C)

for i ≠ j, and so the joint model can be expressed as

p(C, F1, …, Fn) = p(C) p(F1 | C) p(F2 | C) … p(Fn | C) = p(C) Π_{i=1}^{n} p(Fi | C)

This means that under the above independence assumptions, the conditional distribution over the
class variable C can be expressed like this:

p(C | F1, …, Fn) = (1/Z) p(C) Π_{i=1}^{n} p(Fi | C)

where Z (the evidence) is a scaling factor dependent only on F1, …, Fn, i.e., a constant if the values of
the feature variables are known.

Models of this form are much more manageable, since they factor into a so-called class prior p(C)
and independent probability distributions p(Fi | C). If there are k classes and if a model for each
p(Fi | C) can be expressed in terms of r parameters, then the corresponding naive Bayes model
has (k − 1) + n r k parameters. In practice, often k = 2 (binary classification) and r = 1 (Bernoulli variables
as features) are common, and so the total number of parameters of the naive Bayes model is 2n + 1,
where n is the number of binary features used for classification and prediction.

Parameter estimation

All model parameters (i.e., class priors and feature probability distributions) can be approximated with
relative frequencies from the training set. These are maximum likelihood estimates of the
probabilities. A class' prior may be calculated by assuming equi-probable classes (i.e., priors = 1 /
(number of classes)), or by calculating an estimate for the class probability from the training set (i.e.,
(prior for a given class) = (number of samples in the class) / (total number of samples)). To estimate
the parameters for a feature's distribution, one must assume a distribution or generate nonparametric
models for the features from the training set. If one is dealing with continuous data, a typical
assumption is that the continuous values associated with each class are distributed according to a
Gaussian distribution.

For example, suppose the training data contains a continuous attribute, x. We first segment the data
by the class, and then compute the mean and variance of x in each class. Let μ_c be the mean of the
values in x associated with class c, and let σ_c² be the variance of the values in x associated with class c.
Then, the probability of some value v given a class c, P(x = v | c), can be computed by plugging v into the
equation for a Normal distribution parameterized by μ_c and σ_c². That is,

P(x = v | c) = (1 / √(2π σ_c²)) exp( −(v − μ_c)² / (2σ_c²) )
Another common technique for handling continuous values is to use binning to discretize the values.
In general, the distribution method is a better choice if there is a small amount of training data, or if
the precise distribution of the data is known. The discretization method tends to do better if there is
a large amount of training data because it will learn to fit the distribution of the data. Since naive
Bayes is typically used when a large amount of data is available (as more computationally expensive
models can generally achieve better accuracy), the discretization method is generally preferred over
the distribution method.

Sample correction

If a given class and feature value never occur together in the training set then the frequency-based
probability estimate will be zero. This is problematic since it will wipe out all information in the other
probabilities when they are multiplied. It is therefore often desirable to incorporate a small-sample
correction in all probability estimates such that no probability is ever set to be exactly zero.

Constructing a classifier from the probability model

The discussion so far has derived the independent feature model, that is, the naive Bayes probability
model. The naive Bayes classifier combines this model with a decision rule. One common rule is to
pick the hypothesis that is most probable; this is known as the maximum a posteriori or MAP decision
rule. The corresponding classifier is the function classify defined as follows:

classify(f1, …, fn) = argmax_c p(C = c) Π_{i=1}^{n} p(Fi = fi | C = c)

Despite the fact that the far-reaching independence assumptions are often inaccurate, the naive Bayes
classifier has several properties that make it surprisingly useful in practice. In particular, the
decoupling of the class conditional feature distributions means that each distribution can be
independently estimated as a one dimensional distribution. This in turn helps to alleviate problems
stemming from the curse of dimensionality, such as the need for data sets that scale exponentially
with the number of features. Like all probabilistic classifiers under the MAP decision rule, it arrives at
the correct classification as long as the correct class is more probable than any other class; hence class
probabilities do not have to be estimated very well. In other words, the overall classifier is robust
enough to ignore serious deficiencies in its underlying naive probability model.

Examples

Sex classification problem: classify whether a given person is male or female based on the
measured features. The features include height, weight, and foot size. The training set is shown
below.

sex      height (feet)   weight (lbs)   foot size (inches)
male     6.00            180            12
male     5.92            190            11
male     5.58            170            12
male     5.92            165            10
female   5.00            100            6
female   5.50            150            8
female   5.42            130            7
female   5.75            150            9

The classifier created from the training set using a Gaussian distribution assumption would be
(per-class means and variances of each feature):

sex      mean (height)   var (height)   mean (weight)   var (weight)   mean (foot size)   var (foot size)
male     5.855           3.5033e-02     176.25          1.2292e+02     11.25              9.1667e-01
female   5.4175          9.7225e-02     132.5           5.5833e+02     7.5                1.6667e+00

Let's say we have equi-probable classes so P(male)= P(female) = 0.5. There was no identified reason
for making this assumption so it may have been a bad idea. If we determine P(C) based on frequency
in the training set, we happen to get the same answer.

Testing

Below is a sample to be classified as a male or female.

sex       height (feet)   weight (lbs)   foot size (inches)
sample    6               130            8

We wish to determine which posterior is greater, male or female.

posterior (male) = P(male)*P(height|male)*P(weight|male)*P(foot size|male) / evidence

posterior (female) = P(female)*P(height|female)*P(weight|female)*P(foot size|female) / evidence

The evidence (also termed normalizing constant) may be calculated since the sum of the posteriors
equals one.

evidence = P(male)*P(height|male)*P(weight|male)*P(foot size|male) +

P(female)*P(height|female)*P(weight|female)*P(foot size|female)

The evidence may be ignored since it is a positive constant. (Normal distributions are always positive.)
We now determine the sex of the sample.

P(male) = 0.5

P(height|male) = 1.5789 (A probability density greater than 1 is OK; it is the area under the bell curve, not the density value itself, that must equal 1.)
P(weight|male) = 5.9881e-06

P(foot size|male) = 1.3112e-3

posterior numerator (male) = their product = 6.1984e-09

P(female) = 0.5

P(height|female) = 2.2346e-1

P(weight|female) = 1.6789e-2

P(foot size|female) = 2.8669e-1

posterior numerator (female) = their product = 5.3778e-04

Since posterior numerator (female) > posterior numerator (male), the sample is female.
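
The whole calculation can be scripted. The sketch below uses placeholder per-class means and variances (the training-set statistics are not reproduced here), so only the structure of the computation, not the numbers, should be taken from it:

import math

def pdf(v, mean, var):
    # Gaussian probability density for one feature value.
    return math.exp(-(v - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Per-class (mean, variance) for height, weight, foot size.
# Placeholder values for illustration; in practice they come from the training set.
male_stats   = [(5.85, 0.035), (176.25, 122.9), (11.25, 0.92)]
female_stats = [(5.42, 0.097), (132.50, 558.3), (7.50, 1.67)]

sample = [6.0, 130.0, 8.0]   # height (feet), weight (lbs), foot size (inches)
p_male = p_female = 0.5      # equi-probable class priors

num_male, num_female = p_male, p_female
for value, (m_stat, f_stat) in zip(sample, zip(male_stats, female_stats)):
    num_male *= pdf(value, *m_stat)
    num_female *= pdf(value, *f_stat)

# The larger posterior numerator decides the class (the evidence cancels out).
print("female" if num_female > num_male else "male")
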

2.3.3 Decision Tree

The decision tree learning algorithm generates decision trees from the training data to solve classification and regression problems. Suppose you would like to go out for a game of tennis. The question is how you would decide whether conditions are ideal for a game. This depends upon various factors like time, weather, temperature, etc. We call these factors features, and they influence our decision. If you recorded all the factors and the decision you took each day, you would get a table something like this.

With this table, other people would be able to use your intuition to decide whether they should play tennis by looking up what you did given a certain weather pattern. But after just 14 days, it's a little unwieldy to match your current weather situation with one of the rows in the table. We could instead represent this tabular data in the form of a tree.

Here all the information is represented in the form of a tree. Each rectangular box represents a node of the tree. The data is split by asking a question at a node, and the branches represent the possible outcomes of that question. The end nodes are the leaves; they represent the classes into which the data can be classified. The two classes in this example are Yes and No. Thus, to obtain the final output, ask the question at a node and, using the answer, travel down a branch until you reach a leaf node.

Algorithm

1. Start with a training data set, which we'll call S. It should have attributes and a classification.
2. Determine the best attribute in the dataset. (We will go over the definition of "best attribute" below.)
3. Split S into subsets, one for each possible value of the best attribute.
4. Make a decision tree node that contains the best attribute.
5. Recursively generate new decision trees by using the subsets of data created in step 3, until a stage is reached where you cannot classify the data further. Represent the class as a leaf node.

Deciding the “BEST ATTRIBUTE”

Now the most important part of the decision tree algorithm is deciding the best attribute. But what does 'best' actually mean?

In the decision tree algorithm, the best attribute is the one which gives the most information gain.

The left split has less information gain because the data is split into two subsets which have almost equal numbers of '+' and '-' examples, while the split on the right has mostly '+' examples in one subset and mostly '-' examples in the other. In order to find the best attribute we will use entropy.
Entropy

In the machine learning sense, and especially in this case, entropy is a measure of the homogeneity of the data. Its value ranges from 0 to 1 (for two classes). Its value is close to 0 if all the examples belong to the same class and close to 1 if the data is split almost equally between the classes. The formula to calculate entropy is:

Entropy(S) = − Σ (i = 1 to c) pi · log2(pi)

Here pi represents the proportion of the data with the ith classification and c represents the number of different classes.

Information gain measures the reduction in entropy obtained by splitting the data on a particular attribute. The formula to calculate the gain of splitting dataset 'S' on attribute 'A' is:

Gain(S, A) = Entropy(S) − Σ (v ∈ Values(A)) ( |Sv| / |S| ) · Entropy(Sv)

Here Entropy(S) represents the entropy of the dataset and the second term on the right is the weighted entropy of the subsets Sv obtained after the split. The goal is to maximize this information gain: the attribute which has the maximum information gain is selected as the parent node, and the data is successively split on that node.

Computing the entropy of the full dataset (Entropy(S)) and then the information gain of each attribute in this way, we can see that the Outlook attribute has the maximum information gain, and hence it is placed at the top of the tree.
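
To make these calculations concrete, here is a small Python sketch of entropy and information gain for a dataset stored as rows of attribute values with a parallel list of class labels (the data shown is made up for illustration and is not the tennis table):

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a list of class labels.
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute_index, labels):
    # Entropy(S) minus the weighted entropy of the subsets created
    # by splitting on the attribute at attribute_index.
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute_index], []).append(label)
    weighted = sum(len(subset) / total * entropy(subset) for subset in subsets.values())
    return entropy(labels) - weighted

# Made-up example: one attribute (outlook) and a yes/no decision.
rows   = [("sunny",), ("sunny",), ("rain",), ("overcast",), ("rain",)]
labels = ["no", "no", "yes", "yes", "yes"]
print(information_gain(rows, 0, labels))
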

Problem with Continuous Data

One question that may arise is how the data is split in the case of continuous attributes. Suppose there is an attribute temperature which has values from 10 to 45 degrees Celsius. We cannot make a split on every individual value. What we can do in this case is group the continuous values into discrete classes (bins), such as 10–20 as class 1, 20–30 as class 2, and so on, with each value assigned to its class.

Avoiding overfitting the data

The main problem with decision trees is that they are prone to overfitting. We could keep growing a tree until it classifies the training data perfectly, or until no attribute is left to split on. Such a tree would work well on the training dataset but give bad results on the testing dataset. There are two popular approaches to avoid this in decision trees: stop growing the tree before it becomes too large, or prune the tree after it becomes too large. Typically, a limit to a decision tree's growth will be specified in terms of the maximum number of layers, or depth, it's allowed to have. The data available to train the decision tree will be split into a training set and test set, and trees with various maximum depths will be created based on the training set and tested against the test set. Cross-validation can be used as part of this approach as well.

Tree pruning is a technique that leverages this splitting redundancy to remove, i.e. prune, the unnecessary splits in our tree. At a high level, pruning compresses part of the tree from strict and rigid decision boundaries into ones that are smoother and generalise better, effectively reducing the tree's complexity. The complexity of a decision tree is defined as the number of splits in the tree. Pruning involves testing the original tree against pruned versions of it: leaf nodes are taken away from the tree as long as the pruned tree performs better against test data than the larger tree.
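
Both approaches are exposed by common libraries. As a hedged sketch, scikit-learn's DecisionTreeClassifier has a max_depth parameter for stopping growth early and a ccp_alpha parameter for cost-complexity post-pruning; the dataset below is synthetic:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing the tree beyond a fixed depth.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Post-pruning: grow fully, then prune via cost-complexity (larger alpha = more pruning).
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("depth-limited test accuracy:", shallow.score(X_test, y_test))
print("cost-complexity pruned test accuracy:", pruned.score(X_test, y_test))
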

Here are a few of the pros and cons of decision trees that can help you decide whether or not they are the right model for your problem, as well as some tips on how you can apply them effectively:

Pros

 Easy to understand and interpret. At each node, we are able to see exactly what decision our
model is making. In practice we’ll be able to fully understand where our accuracies and errors
are coming from, what type of data the model would do well with, and how the output is
influenced by the values of the features.
 Require very little data preparation. Many ML models may require heavy data pre-processing
such as normalization and may require complex regularisation schemes. Decision trees on the
other hand work quite well out of the box after tweaking a few of the parameters.
 The cost of using the tree for inference is logarithmic in the number of data points used to
train the tree. That’s a huge plus since it means that having more data won’t necessarily make
a huge dent in our inference speed.

Cons

 Overfitting is quite common with decision trees simply due to the nature of their training. It’s
often recommended to perform some type of dimensionality reduction such as PCA so that
the tree doesn’t have to learn splits on so many features
 For similar reasons as the case of overfitting, decision trees are also vulnerable to becoming
biased to the classes that have a majority in the dataset. It’s always a good idea to do some
kind of class balancing such as class weights, sampling, or a specialised loss function.

2.3.4 Support Vector Machine

A Support Vector Machine (SVM), as stated by Luis et al. (Luis Gonz, 2005), performs classification by constructing an N-dimensional hyper plane that optimally separates the data into two categories. SVM models are closely related to neural networks; in fact, an SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network. Using a kernel function, SVMs are an alternative training method for polynomial, radial basis function and multi-layer perceptron classifiers, in which the weights of the network are found by solving a quadratic programming problem with linear constraints, rather than by solving a non-convex, unconstrained minimization problem as in standard neural network training. In the parlance of the SVM literature, a predictor variable is called an attribute,
and a transformed attribute that is used to define the hyper plane is called a feature. The task of
choosing the most suitable representation is known as feature selection. A set of features that
describes one case (i.e., a row of predictor values) is called a vector. So the goal of SVM modelling is to find the optimal hyper plane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors near the hyper plane are the support vectors. The figure below presents an overview of the SVM process.

A Two-Dimensional Example

Before considering N-dimensional hyper planes, let’s look at a simple 2-dimensional example. Assume
we wish to perform a classification, and our data has a categorical target variable with two categories.
Also assume that there are two predictor variables with continuous values. If we plot the data points
using the value of one predictor on the X axis and the other on the Y axis we might end up with an
image such as shown below. One category of the target variable is represented by rectangles while
the other category is represented by ovals.
In this idealized example, the cases with one category are in the lower left corner and the cases with
the other category are in the upper right corner; the cases are completely separated. The SVM analysis
attempts to find a 1-dimensional hyper plane (i.e. a line) that separates the cases based on their target
categories. There are an infinite number of possible lines; two candidate lines are shown above. The
question is which line is better, and how do we define the optimal line. The dashed lines drawn parallel
to the separating line mark the distance between the dividing line and the closest vectors to the line.
The distance between the dashed lines is called the margin. The vectors (points) that constrain the
width of the margin are the support vectors. The following figure illustrates this.

An SVM analysis (Luis Gonz, 2005) finds the line (or, in general, hyper plane) that is oriented so that
the margin between the support vectors is maximized. In the figure above, the line in the right panel
is superior to the line in the left panel. If all analyses consisted of two-category target variables with
two predictor variables, and the cluster of points could be divided by a straight line, life would be easy.
Unfortunately, this is not generally the case, so SVM must deal with (a) more than two predictor
variables, (b) separating the points with non-linear curves, (c) handling the cases where clusters
cannot be completely separated, and (d) handling classifications with more than two categories.
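
As a small, self-contained illustration of a maximum-margin separator (not part of the original text), scikit-learn's SVC with a linear kernel can be fitted to two-dimensional, two-class data:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters standing in for the rectangles and ovals.
X, y = make_blobs(n_samples=60, centers=2, random_state=0)

model = SVC(kernel="linear", C=1.0).fit(X, y)

# The support vectors are the points that constrain the width of the margin.
print("number of support vectors per class:", model.n_support_)
print("prediction for a new point:", model.predict([[0.0, 2.0]]))
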

Check your Progress 1

Fill in the blanks.

1. In the case of _____, the classification is used to determine the error of the network and
then adjust the network to minimize it.
2. The ____ is also a non-parametric and instance-based learning algorithm.
3. ___ is the measure of homogeneity in the data.
4. _______ is a technique that leverages splitting redundancy to remove the unnecessary splits
in our tree.

Activity 1

Find and list the real examples where supervised machine algorithms are used. Also state the purpose
behind using supervised machine algorithms.
Summary

 In this unit we have discussed Supervised and Unsupervised learning. We have also described
the supervised learning algorithms such as K-Nearest-Neighbours (KNN), Naïve Bayes
Classifier, Decision Tree and Support Vector Machine with example.

Keywords

 Supervised Learning: Supervised learning is the machine learning task of learning a function
that maps an input to an output based on example input-output pairs
 Unsupervised Learning: Unsupervised learning is a machine learning technique, where you
do not need to supervise the model. Instead, you need to allow the model to work on its own
to discover information.
 Artificial Neural Networks (ANN): It is computing systems that are inspired by, but not
identical to, biological neural networks that constitute animal brains. Such systems "learn" to
perform tasks by considering examples, generally without being programmed with task-
specific rules.
 Classification: Classification is a process related to categorization, the process in which ideas
and objects are recognized, differentiated and understood.
 Regression: It is a statistical measurement used in finance, investing, and other disciplines
that attempts to determine the strength of the relationship between one dependent variable
(usually denoted by Y) and a series of other changing variables (known as independent
variables).

Self-Assessment Questions

1. State the difference between Supervised and Unsupervised machine algorithms.


2. Explain KNN algorithm with example.
3. What are the advantages and disadvantages of decision tree?

Answers to Check Your Progress

Check your Progress 1

Fill in the blanks.

1. In the case of neural networks, the classification is used to determine the error of the
network and then adjust the network to minimize it.
2. The KNN classifier is also a non-parametric and instance-based learning algorithm.
3. Entropy is the measure of homogeneity in the data.
4. Tree pruning is a technique that leverages splitting redundancy to remove the unnecessary
splits in our tree.

Suggested Reading

1. Introduction to Machine Learning by Ethem Alpaydin


2. Machine Learning For Beginners: Machine Learning Basics for Absolute Beginners. Learn What
ML Is and Why It Matters. Notes on Artificial Intelligence and Deep Learning are also Included,
by Scott Chesterton
3. Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz, Shai
Ben-David
4. Machine Learning For Dummies by John Paul Mueller, Luca Massaron

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
ShareAlike 4.0 International (CC BY-SA 4.0) as requested by the work’s creator or licensees. This license
is available at https://creativecommons.org/licenses/by-sa/4.0/.
Unit 3

Unsupervised Learning

3.1 Introduction

3.2 Concept of Unsupervised learning

3.3 Unsupervised Learning Algorithms

Summary

Keywords

Self-Assessment Questions

Answers to Check Your Progress

Suggested Reading
Objectives:

After going through this unit, you will be able to:

• Understand the concept of unsupervised machine learning algorithms


• Explain the different unsupervised machine learning algorithms

3.1 INTRODUCTION

In the last unit we have discussed in brief about supervised learning algorithms. In this unit, we are
going to discuss unsupervised learning algorithms in detail with example.

3.2 CONCEPT OF UNSUPERVISED LEARNING

With unsupervised learning, the goal is to have the computer learn how to do something that we don't
tell it how to do! There are actually two approaches to unsupervised learning. The first approach is to
teach the agent not by giving explicit categorizations, but by using some sort of reward system to
indicate success. Note that this type of training will generally fit into the decision problem framework
because the goal is not to produce a classification but to make decisions that maximize rewards. This
approach nicely generalizes to the real world, where agents might be rewarded for doing certain
actions and punished for doing others. Often, a form of reinforcement learning can be used for
unsupervised learning, where the agent bases its actions on the previous rewards and punishments
without necessarily even learning any information about the exact ways that its actions affect the
world.

In a way, all of this information is unnecessary because by learning a reward function, the agent simply
knows what to do without any processing because it knows the exact reward it expects to achieve for
each action it could take. This can be extremely beneficial in cases where calculating every possibility
is very time consuming (even if all of the transition probabilities between world states were known).
On the other hand, it can be very time consuming to learn by, essentially, trial and error. But this kind
of learning can be powerful because it assumes no pre-discovered classification of examples. In some
cases, for example, our classifications may not be the best possible.

One striking example is that the conventional wisdom about the game of backgammon was turned on its head when a series of computer programs (neuro-gammon and TD-gammon) that learned through unsupervised learning became stronger than the best human backgammon players merely by playing themselves over and over. These programs discovered some principles that surprised the backgammon experts and performed better than backgammon programs trained on pre-classified examples.

A second type of unsupervised learning is called clustering. In this type of learning, the goal is not to
maximize a utility function, but simply to find similarities in the training data. The assumption is often
that the clusters discovered will match reasonably well with an intuitive classification. For instance,
clustering individuals based on demographics might result in a clustering of the wealthy in one group
and the poor in another. Although the algorithm won't have names to assign to these clusters, it can
produce them and then use those clusters to assign new examples into one or the other of the clusters.
This is a data-driven approach that can work well when there is sufficient data; for instance, social
information filtering algorithms, such as those that Amazon.com use to recommend books, are based
on the principle of finding similar groups of people and then assigning new users to groups.
In some cases, such as with social information filtering, the information about other members of a
cluster (such as what books they read) can be sufficient for the algorithm to produce meaningful
results. In other cases, it may be the case that the clusters are merely a useful tool for a human analyst.
Unfortunately, even unsupervised learning suffers from the problem of overfitting the training data.
There's no silver bullet to avoiding the problem because any algorithm that can learn from its inputs
needs to be quite powerful. This lack of robustness is known as overfitting in the statistics and machine learning literature.

Unsupervised learning has produced many successes, such as world-champion calibre backgammon
programs and even machines capable of driving cars! It can be a powerful technique when there is an
easy way to assign values to actions. Clustering can be useful when there is enough data to form
clusters (though this turns out to be difficult at times) and especially when additional data about
members of a cluster can be used to produce further results due to dependencies in the data.

Classification learning is powerful when the classifications are known to be correct (for instance, when dealing with diseases, it's generally straightforward to determine the diagnosis after the fact by an autopsy), or when the classifications are simply arbitrary things that we would like the computer to
be able to recognize for us. Classification learning is often necessary when the decisions made by the
algorithm will be required as input somewhere else. Otherwise, it wouldn't be easy for whoever
requires that input to figure out what it means. Both techniques can be valuable and which one you
choose should depend on the circumstances--what kind of problem is being solved, how much time is
allotted to solving it (supervised learning or clustering is often faster than reinforcement learning
techniques), and whether supervised learning is even possible.

3.3 UNSUPERVISED LEARNING ALGORITHMS

In the above section we have discussed unsupervised learning in detail. Now let's discuss the following three unsupervised algorithms in detail:

1. K-Means Clustering
2. Apriori Algorithms
3. Self-Organized Map

3.3.1 K-Means Clustering:

The basic steps of k-means clustering are uncomplicated. In the beginning we determine the number of clusters K and assume a centre for each of these clusters. We can take any random objects as the initial centres, or the first K objects in sequence can also serve as the initial centres. Then the k-means algorithm will repeat the three steps below until convergence, i.e. until no object moves to a different group:

1. Determine the centre coordinate


2. Determine the distance of each object to the centre
3. Group the object based on minimum distance

Figure 3.1 shows a K-means flow diagram.


Fig. 3.1: K-means iteration

K-means (Bishop C. M., 1995) and (Tapas Kanungo, 2002) is one of the simplest unsupervised learning
algorithms that solve the well-known clustering problem. The procedure follows a simple and easy
way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori.
The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results. So, the better choice is to place them as far away from each other as possible.

The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point we need to re-calculate k new centroids as barycentres of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated. As a result of this loop we may notice that the k centroids change their location step by step until no more changes are done. In other words, the centroids do not move any more. Finally, this algorithm aims at minimizing an objective function, in this case a squared error function. The objective function is

J = Σ (j = 1 to k) Σ (i = 1 to n) || xi(j) − cj ||²

where || xi(j) − cj ||² is a chosen distance measure between a data point xi(j) and the cluster centre cj, and J is an indicator of the distance of the n data points from their respective cluster centres. The algorithm
is composed of the following steps:

1. Place K points into the space represented by the objects that are being clustered. These points
represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the
objects into groups from which the metric to be minimized can be calculated.
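
These four steps are what library implementations carry out internally. A brief sketch with scikit-learn on synthetic data (the n_init argument controls how many random initialisations are tried, which relates to the sensitivity to initial centroids discussed below):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_init restarts the algorithm from several random centroid placements
# and keeps the best run, which softens the sensitivity to initialisation.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("cluster centres:\n", kmeans.cluster_centers_)
print("label of the first point:", kmeans.labels_[0])
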

Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the optimal configuration, corresponding to the global minimum of the objective function. The algorithm is also significantly sensitive to the initial randomly selected cluster centres; it can be run multiple times to reduce this effect. K-means is a simple algorithm that has been adapted to many problem domains. As we are going to see, it is a good candidate for extension to work with fuzzy feature vectors.

An example: suppose that we have n sample feature vectors x1, x2, ..., xn all from the same class, and we know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster i. If the clusters are well separated, we can use a minimum-distance classifier to separate them. That is, we can say that x is in cluster i if || x - mi || is the minimum of all the k distances. This suggests the following procedure for finding the k means:

• Make initial guesses for the means m1, m2, ..., mk
• Until there are no changes in any mean:
    • Use the estimated means to classify the samples into clusters
    • For i from 1 to k:
        • Replace mi with the mean of all of the samples for cluster i
    • end_for
• end_until

Here is an example showing how the means m1 and m2 move into the centers of two clusters.

This is a simple version of the k-means procedure. It can be viewed as a greedy algorithm for
partitioning the n samples into k clusters so as to minimize the sum of the squared distances to the
cluster centers. It does have some weaknesses:

• The way to initialize the means was not specified. One popular way to start is to
randomly choose k of the samples.
• The results produced depend on the initial values for the means, and it frequently
happens that suboptimal partitions are found. The standard solution is to try a
number of different starting points.
• It can happen that the set of samples closest to mi is empty, so that mi cannot be
updated. This is an annoyance that must be handled in an implementation, but that
we shall ignore.
• The results depend on the metric used to measure || x - mi ||. A popular solution is
to normalize each variable by its standard deviation, though this is not always
desirable.
• The results depend on the value of k.

This last problem is particularly troublesome, since we often have no way of knowing how many
clusters exist. In the example shown above, the same algorithm applied to the same data produces
the following 3-means clustering. Is it better or worse than the 2-means clustering?
Unfortunately, there is no general theoretical solution to finding the optimal number of clusters for any given data set. A simple approach is to compare the results of multiple runs with different values of k and choose the best one according to a given criterion.
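
One simple criterion is the within-cluster sum of squared distances, exposed as inertia_ in scikit-learn; a sketch of comparing several candidate values of k on the same (synthetic) data:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Compare the objective function value for several candidate k values;
# a pronounced "elbow" in these numbers suggests a reasonable k.
for k in range(2, 6):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    print(k, round(inertia, 1))
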

3.3.2 Apriori Algorithms:

Apriori is an algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent item sets for boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent item set properties. Apriori employs an iterative approach known as level-wise search, where frequent k-item sets are used to explore (k+1)-item sets. There are two steps in each iteration. The first step generates a set of candidate item sets. Then, in the second step, the occurrence of each candidate set in the database is counted and all disqualified candidates (i.e. all infrequent item sets) are pruned. Apriori uses two pruning techniques: first, on the basis of support count (which should be greater than a user-specified support threshold), and second, for an item set to be frequent, all of its subsets should appear in the previous level's frequent item sets. The iterations begin with size-2 item sets and the size is incremented after each iteration. The algorithm is based on the closure property of frequent item sets: if a set of items is frequent, then all its proper subsets are also frequent. This algorithm is easy to implement and parallelize, but it has the major disadvantage that it requires multiple scans of the database and is memory resident.

The frequent item sets determined by Apriori can be used to determine association rules which
highlight general trends in the database. This has applications in domains such as market basket
analysis.

Key Concepts:

• Frequent Itemsets: The sets of items which have minimum support (denoted by Li for the ith itemset).
• Apriori Property: Any subset of frequent itemset must be frequent.
• Join Operation: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1
with itself.

The pseudo code for the algorithm is given below for a transaction database T, and a support threshold
of ε. Usual set theoretic notation is employed, though note that T is a multiset. Ck is the candidate set
for level k. At each step, the algorithm is assumed to generate the candidate sets from the large item
sets of the preceding level, heeding the downward closure lemma. count[c] accesses a field of the data
structure that represents candidate set c, which is initially assumed to be zero. Usually the most
important part of the implementation is the data structure used for storing the candidate sets, and
counting their frequencies.
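
Since the pseudo code itself is not reproduced here, the following Python sketch conveys the same level-wise join, prune and count structure (transactions is a list of item sets, eps is the minimum support count, and the helper names are my own):

from itertools import combinations

def apriori(transactions, eps):
    # Level-wise search: frequent k-itemsets are used to build candidate
    # (k+1)-itemsets, which are then counted and pruned.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c for c in items
                if sum(c <= t for t in transactions) >= eps}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count support and keep only the candidates that meet the threshold.
        frequent = {c for c in candidates
                    if sum(c <= t for t in transactions) >= eps}
        all_frequent |= frequent
        k += 1
    return all_frequent

# Tiny illustrative database with minimum support count 2.
T = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}, {"I1", "I3"}]
print(apriori(T, eps=2))
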
Example:

Consider a database, D, consisting of 9 transactions. Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 = 22%), and let the minimum confidence required be 70%. We first have to find the frequent itemsets using the Apriori algorithm. Then, association rules will be generated using the minimum support and minimum confidence.

Step 1: Generating 1-itemset Frequent Pattern

• The set of frequent 1-itemsets, L1, consists of the candidate 1- itemsets satisfying
minimum support.
• In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
Step 2: Generating 2-itemset Frequent Pattern

 To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a
candidate set of 2-itemsets, C2.
 Next, the transactions in D are scanned and the support count for each candidate itemset in
C2 is accumulated (as shown in the middle table).
 The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-
itemsets in C2 having minimum support. ( Note: We haven’t used Apriori Property yet)

Step 3: Generating 3-itemset Frequent Pattern

 The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori Property.
 In order to find C3, we compute L2 Join L2.
 C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
 Now, Join step is complete and Prune step will be used to reduce the size of C3. Prune step
helps to avoid heavy computation due to large Ck.

Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can
determine that four latter candidates cannot possibly be frequent. How?

For example, let’s take {I1, I2, I3}. The 2-item subsets of it are {I1, I2}, {I1, I3} & {I2, I3}. Since all 2-item
subsets of {I1, I2, I3} are members of L2, We will keep {I1, I2, I3} in C3.

Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}. But {I3, I5} is not a member of L2, and hence {I2, I3, I5} is not frequent, violating the Apriori property. Thus we have to remove {I2, I3, I5} from C3. Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the Join operation during pruning.

Now, the transactions in D are scanned in order to determine L3, consisting of those candidates 3-
itemsets in C3 having minimum support.

Step 4: Generating 4-itemset Frequent Pattern


The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.

Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori algorithm.

What’s Next? These frequent itemsets will be used to generate strong association rules (where strong
association rules satisfy both minimum support & minimum confidence).

Step 5: Generating Association Rules from Frequent Itemsets

Procedure:

• For each frequent itemset “l”, generate all nonempty subsets of l.


• For every nonempty subset s of l, output the rule “s -> (l-s)” if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold (a sketch of this procedure in code is given after this list).
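
A compact Python sketch of this rule-generation step (support_count is assumed to be a dictionary mapping itemsets to their support counts; the counts used below are the ones quoted in the example that follows):

from itertools import combinations

def rules_from_itemset(itemset, support_count, min_conf):
    # Emit rules s -> (l - s) whose confidence meets min_conf.
    l = frozenset(itemset)
    rules = []
    for size in range(1, len(l)):
        for subset in combinations(l, size):
            s = frozenset(subset)
            confidence = support_count[l] / support_count[s]
            if confidence >= min_conf:
                rules.append((set(s), set(l - s), confidence))
    return rules

# Support counts consistent with the worked example below.
support_count = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2, frozenset({"I2", "I5"}): 2,
    frozenset({"I1", "I2", "I5"}): 2,
}
for lhs, rhs, conf in rules_from_itemset({"I1", "I2", "I5"}, support_count, min_conf=0.7):
    print(lhs, "->", rhs, f"(confidence {conf:.0%})")
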

Back To Example:

We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.

– Let’s take l = {I1, I2, I5}.

– Its all nonempty subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.

Let the minimum confidence threshold be, say, 70%. The resulting association rules are shown below, each listed with its confidence.

– R1: I1 ^ I2 -> I5

• Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%

• R1 is Rejected.

– R2: I1 ^ I5 -> I2

• Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%

• R2 is Selected.

– R3: I2 ^ I5 -> I1

• Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%

• R3 is Selected.

– R4: I1 -> I2 ^ I5

• Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%

• R4 is Rejected.

– R5: I2 -> I1 ^ I5

• Confidence = sc{I1,I2,I5}/sc{I2} = 2/7 = 29%

• R5 is Rejected.

– R6: I5 -> I1 ^ I2
• Confidence = sc{I1,I2,I5}/sc{I5} = 2/2 = 100%

• R6 is Selected.

In this way, we have found three strong association rules.

Methods to Improve Apriori’s Efficiency

• Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
• Transaction reduction: A transaction that does not contain any frequent k-itemset is
useless in subsequent scans.
• Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least
one of the partitions of DB.
• Sampling: mining on a subset of given data, lower support threshold + a method to
determine the completeness.
• Dynamic itemset counting: add new candidate itemsets only when all of their subsets
are estimated to be frequent.

Association Rule Learning is among the most popular applications of machine learning in business. It has been widely used by organizations including supermarket chains and online marketplaces to understand and test various business and marketing strategies to increase sales and productivity. Association Rule Learning is rule-based learning for identifying the association between different variables in a database.

3.3.3 Self-Organized Map

A self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of artificial neural
network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically
two-dimensional), discretized representation of the input space of the training samples, called a map,
and is therefore a method to do dimensionality reduction. Self-organizing maps differ from other
artificial neural networks as they apply competitive learning as opposed to error-correction learning
(such as backpropagation with gradient descent), and in the sense that they use a neighbourhood
function to preserve the topological properties of the input space.

Self-Organizing Feature Map networks are used quite differently from the other networks. Whereas all the other networks are designed for supervised learning tasks, SOFM networks are designed primarily for unsupervised learning. Whereas in supervised learning the training data set contains cases featuring input variables together with the associated outputs (and the network must infer a mapping from the inputs to the outputs), in unsupervised learning the training data set contains only input variables. At first glance this may seem strange. Without outputs, what can the network learn? The answer is that the SOFM network attempts to learn the structure of the data.

The SOFM network can learn to recognize clusters of data, and can also relate similar classes to each
other. The user can build up an understanding of the data, which is used to refine the network. As
classes of data are recognized, they can be labelled, so that the network becomes capable of
classification tasks. SOFM networks can also be used for classification when output classes are
immediately available - the advantage in this case is their ability to highlight similarities between
classes. A second possible use is in novelty detection.

SOFM networks can learn to recognize clusters in the training data, and respond to it. If new data,
unlike previous cases, is encountered, the network fails to recognize it and this indicates novelty. A
SOFM network has only two layers: the input layer, and an output layer of radial units (also known as
the topological map layer). The units in the topological map layer are laid out in space - typically in
two dimensions (although ST Neural Networks also supports one dimensional Kohonen networks).
SOFM networks are trained using an iterative algorithm. Starting with an initially-random set of radial
centres, the algorithm gradually adjusts them to reflect the clustering of the training data. At one level,
this compares with the sub-sampling and K-Means algorithms used to assign centres in SOM network
and indeed the SOFM algorithm can be used to assign centres for these types of networks. However,
the algorithm also acts on a different level. The iterative training procedure also arranges the network
so that units representing centres close together in the input space are also situated close together
on the topological map.

You can think of the network's topological layer as a crude two-dimensional grid, which must be folded
and distorted into the N-dimensional input space, so as to preserve as far as possible the original
structure. Clearly any attempt to represent an N-dimensional space in two dimensions will result in
loss of detail; however, the technique can be worthwhile in allowing the user to visualize data which
might otherwise be impossible to understand. The basic iterative Kohonen algorithm simply runs
through a number of epochs, on each epoch executing each training case and applying the following
algorithm:

• Select the winning neuron (the one whose centre is nearest to the input case);
• Adjust the winning neuron to be more like the input case (a weighted sum of the old
neuron centre and the training case).

The algorithm uses a time-decaying learning rate, which is used to perform the weighted sum and
ensures that the alterations become more subtle as the epochs pass. This ensures that the centres
settle down to a compromise representation of the cases which cause that neuron to win. The
topological ordering property is achieved by adding the concept of a neighbourhood to the algorithm.

The neighbourhood is a set of neurons surrounding the winning neuron. The neighbourhood, like the
learning rate, decays over time, so that initially quite a large number of neurons belong to the
neighbourhood (perhaps almost the entire topological map); in the latter stages the neighbourhood
will be zero (i.e., consists solely of the winning neuron itself).

In the Kohonen algorithm, the adjustment of neurons is actually applied not just to the winning
neuron, but to all the members of the current neighbourhood. The effect of this neighbourhood
update is that initially quite large areas of the network are "dragged towards" training cases - and
dragged quite substantially. The network develops a crude topological ordering, with similar cases
activating clumps of neurons in the topological map. As epochs pass the learning rate and
neighbourhood both decrease, so that finer distinctions within areas of the map can be drawn,
ultimately resulting in fine tuning of individual neurons.
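
A minimal numpy sketch of this winner-plus-neighbourhood update (the grid size, decay schedules and random data are illustrative choices, not values prescribed by the text):

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 3))                 # 500 training cases with 3 input features
grid_h, grid_w = 10, 10
weights = rng.random((grid_h, grid_w, 3))   # one weight vector (centre) per map unit

for epoch in range(50):
    lr = 0.5 * np.exp(-epoch / 25)          # time-decaying learning rate
    radius = 5 * np.exp(-epoch / 25)        # shrinking neighbourhood radius
    for x in data:
        # Winning neuron: the unit whose centre is nearest to the input case.
        dists = np.linalg.norm(weights - x, axis=2)
        wi, wj = np.unravel_index(np.argmin(dists), dists.shape)
        # Move the winner and its neighbourhood towards the input case,
        # weighting the update by distance on the grid.
        ii, jj = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
        grid_dist = np.sqrt((ii - wi) ** 2 + (jj - wj) ** 2)
        influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))[..., None]
        weights += lr * influence * (x - weights)

print("trained map shape:", weights.shape)
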

Often, training is deliberately conducted in two distinct phases: a relatively short phase with high
learning rates and neighbourhood, and a long phase with low learning rate and zero or near-zero
neighbourhoods. Once the network has been trained to recognize structure in the data, it can be used
as a visualization tool to examine the data. The Win Frequencies Datasheet (counts of the number of
times each neuron wins when training cases are executed) can be examined to see if distinct clusters
have formed on the map. Individual cases are executed and the topological map observed, to see if
some meaning can be assigned to the clusters (this usually involves referring back to the original
application area, so that the relationship between clustered cases can be established). Once clusters
are identified, neurons in the topological map are labelled to indicate their meaning (sometimes
individual cases may be labelled, too). Once the topological map has been built up in this way, new
cases can be submitted to the network. If the winning neuron has been labelled with a class name,
the network can perform classification. If not, the network is regarded as undecided. SOFM networks
also make use of the accept threshold, when performing classification. Since the activation level of a
neuron in a SOFM network is the distance of the neuron from the input case, the accept threshold
acts as a maximum recognized distance. If the activation of the winning neuron is greater than this
distance, the SOFM network is regarded as undecided.

Thus, by labelling all neurons and setting the accept threshold appropriately, a SOFM network can act
as a novelty detector (it reports undecided only if the input case is sufficiently dissimilar to all radial
units). SOFM networks as expressed by Kohonen (Kohonen, 1997) are inspired by some known
properties of the brain. The cerebral cortex is actually a large flat sheet (about 0.5m squared; it is
folded up into the familiar convoluted shape only for convenience in fitting into the skull!) with known
topological properties (for example, the area corresponding to the hand is next to the arm, and a
distorted human frame can be topologically mapped out in two dimensions on its surface).

Advantages and Disadvantages of SOM

The self-organising map has the following advantages:

• Probably the best thing about SOMs is that they are very easy to understand. It's very simple: if two units are close together and there is grey connecting them, then they are similar; if there is a black ravine between them, then they are different. Unlike Multidimensional Scaling or N-land, people can quickly pick up on how to use them in an effective manner.
• Another great thing is that they work very well. They classify data well and can then easily be evaluated for their own quality, so you can actually calculate how good a map is and how strong the similarities between objects are.

These are the disadvantages:

• One major problem with SOMs is getting the right data. Unfortunately, you need a value for each dimension of each sample in order to generate a map. Sometimes this simply is not possible, and often it is very difficult to acquire all of this data, so this is a limiting feature of SOMs, often referred to as the missing data problem.
• Another problem is that every SOM is different and finds different similarities among the sample vectors. SOMs organize sample data so that, in the final product, the samples are usually surrounded by similar samples; however, similar samples are not always near each other. If you have a lot of shades of purple, you will not always get one big group with all the purples in one cluster; sometimes the clusters will get split and there will be two groups of purple. Using colour we could tell that those two groups are in reality similar and just got split, but with most data those two clusters will look totally unrelated. So a lot of maps need to be constructed in order to get one final good map.
• The final major problem with SOMs is that they are very computationally expensive, which is a major drawback: as the dimensionality of the data increases, dimension reduction and visualization techniques become more important, but unfortunately the time to compute them also increases. For calculating the black and white similarity map, the more neighbours you use to calculate the distance, the better the similarity map you will get, but the number of distances the algorithm needs to compute increases exponentially.
Check your Progress 1

Fill in the blanks

1. The goal of ____ is not to maximize a utility function, but simply to find similarities in the
training data.
2. ____ is rule-based learning for identifying the association between different variables in a
database.
3. In the ______ algorithm, the adjustment of neurons is actually applied not just to the
winning neuron, but to all the members of the current neighbourhood.

Activity 1

Find and list the applications of unsupervised learning algorithms in different domains.

Summary

 In this unit we have discussed unsupervised learning. We have also described the
unsupervised learning algorithms such as K-Means Clustering, Apriori Algorithms and Self-
Organized Map with example.

Keywords

 Pruning: It is a technique in machine learning and search algorithms that reduces the size of
decision trees by removing sections of the tree that provide little power to classify instances.
 Frequent Pattern: Frequent pattern discovery as part of knowledge discovery in databases /
Massive Online Analysis, and data mining describes the task of finding the most frequent and
relevant patterns in large datasets.
 Join Operation: This operation pairs two tuples from different relations, if and only if a given
join condition is satisfied.

Self-Assessment Questions

1. Write a short note on unsupervised learning.


2. Explain Apriori algorithm with example.
3. Describe the application of K-means clustering.

Answers to Check Your Progress

Check your Progress 1

Fill in the blanks

1. The goal of clustering is not to maximize a utility function, but simply to find similarities in
the training data.
2. Association Rule Learning is rule-based learning for identifying the association between
different variables in a database.
3. In the Kohonen algorithm, the adjustment of neurons is actually applied not just to the
winning neuron, but to all the members of the current neighbourhood.
Suggested Reading

1. Introduction to Machine Learning by Ethem Alpaydin


2. Machine Learning For Beginners: Machine Learning Basics for Absolute Beginners. Learn What
ML Is and Why It Matters. Notes on Artificial Intelligence and Deep Learning are also Included,
by Scott Chesterton
3. Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz, Shai
Ben-David
4. Machine Learning For Dummies by John Paul Mueller, Luca Massaron

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
ShareAlike 4.0 International (CC BY-SA 4.0) as requested by the work’s creator or licensees. This license
is available at https://creativecommons.org/licenses/by-sa/4.0/.
Unit 4

Regression Algorithms

4.1 Introduction

4.2 Linear Regression

4.3 Lasso Regression

4.4 Logistic Regression

4.5 Other Regression Algorithms

Summary

Keywords

Self-Assessment Questions

Answers to Check Your Progress

Suggested Reading
Objectives:

After going through this unit, you will be able to:

 Understand the concept of regression algorithm


 Explain Linear, Lasso and Logistic regression algorithms

4.1 INTRODUCTION

Regression algorithms fall under the family of Supervised Machine Learning algorithms which is a
subset of machine learning algorithms. One of the main features of supervised learning algorithms is
that they model dependencies and relationships between the target output and input features to
predict the value for new data. Regression algorithms predict the output values based on input
features from the data fed into the system. The go-to methodology is that the algorithm builds a model on the features of the training data and uses the model to predict values for new data. Regression models have many applications, particularly in financial forecasting, trend analysis, marketing, time series prediction and even drug response modeling. Some of the popular types of regression algorithms are linear regression, regression trees, lasso regression and multivariate regression. In this unit we are going to discuss regression algorithms in detail.

4.2 LINEAR REGRESSION

A linear regression model predicts the target as a weighted sum of the feature inputs. The linearity of
the learned relationship makes the interpretation easy. Linear regression models have long been used
by statisticians, computer scientists and other people who tackle quantitative problems.

Linear models can be used to model the dependence of a regression target y on some features x. The
learned relationships are linear and can be written for a single instance i as follows:

y = β0 + β1·x1 + β2·x2 + … + βp·xp + ϵ

The predicted outcome of an instance is a weighted sum of its p features. The betas (βj) represent the
learned feature weights or coefficients. The first weight in the sum (β0) is called the intercept and is
not multiplied with a feature. The epsilon (ϵ) is the error we still make, i.e. the difference between the
prediction and the actual outcome. These errors are assumed to follow a Gaussian distribution, which
means that we make errors in both negative and positive directions and make many small errors and
few large errors.
Various methods can be used to estimate the optimal weight. The ordinary least squares method is usually used to find the weights that minimize the squared differences between the actual and the estimated outcomes:

min over β0, …, βp of Σ (i = 1 to n) ( y(i) − ( β0 + Σ (j = 1 to p) βj · xj(i) ) )²

We will not discuss in detail how the optimal weights can be found, but if you are interested, you can
read chapter 3.2 of the book “The Elements of Statistical Learning” (Friedman, Hastie and Tibshirani
2009) or one of the other online resources on linear regression models.
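
As a brief, hedged illustration with synthetic data (not the bike-rental example discussed later), ordinary least squares weights can be estimated with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                       # two features
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
print("intercept (beta_0):", model.intercept_)
print("feature weights (beta_1, beta_2):", model.coef_)
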

The biggest advantage of linear regression models is linearity: It makes the estimation procedure
simple and, most importantly, these linear equations have an easy to understand interpretation on a
modular level (i.e. the weights). This is one of the main reasons why the linear model and all similar
models are so widespread in academic fields such as medicine, sociology, psychology, and many other
quantitative research fields. For example, in the medical field, it is not only important to predict the
clinical outcome of a patient, but also to quantify the influence of the drug and at the same time take
sex, age, and other features into account in an interpretable way.

Estimated weights come with confidence intervals. A confidence interval is a range for the weight
estimate that covers the “true” weight with a certain confidence. For example, a 95% confidence
interval for a weight of 2 could range from 1 to 3. The interpretation of this interval would be: If we
repeated the estimation 100 times with newly sampled data, the confidence interval would include
the true weight in 95 out of 100 cases, given that the linear regression model is the correct model for
the data.

Whether the model is the “correct” model depends on whether the relationships in the data meet
certain assumptions, which are linearity, normality, homoscedasticity, independence, fixed features,
and absence of multicollinearity.

Linearity

The linear regression model forces the prediction to be a linear combination of features, which is both
its greatest strength and its greatest limitation. Linearity leads to interpretable models. Linear effects
are easy to quantify and describe. They are additive, so it is easy to separate the effects. If you suspect
feature interactions or a nonlinear association of a feature with the target value, you can add
interaction terms or use regression splines.

Normality

It is assumed that the target outcome given the features follows a normal distribution. If this
assumption is violated, the estimated confidence intervals of the feature weights are invalid.

Homoscedasticity (constant variance)

The variance of the error terms is assumed to be constant over the entire feature space. Suppose you
want to predict the value of a house given the living area in square meters. You estimate a linear
model that assumes that, regardless of the size of the house, the error around the predicted response
has the same variance. This assumption is often violated in reality. In the house example, it is plausible
that the variance of error terms around the predicted price is higher for larger houses, since prices are
higher and there is more room for price fluctuations. Suppose the average error (difference between
predicted and actual price) in your linear regression model is 50,000 Euros. If you assume
homoscedasticity, you assume that the average error of 50,000 is the same for houses that cost 1
million and for houses that cost only 40,000. This is unreasonable because it would mean that we can
expect negative house prices.

Independence

It is assumed that each instance is independent of any other instance. If you perform repeated
measurements, such as multiple blood tests per patient, the data points are not independent. For
dependent data you need special linear regression models, such as mixed effect models or GEEs. If
you use the “normal” linear regression model, you might draw wrong conclusions from the model.

Fixed features

The input features are considered “fixed”. Fixed means that they are treated as “given constants” and
not as statistical variables. This implies that they are free of measurement errors. This is a rather
unrealistic assumption. Without that assumption, however, you would have to fit very complex
measurement error models that account for the measurement errors of your input features. And
usually you do not want to do that.

Absence of multicollinearity

You do not want strongly correlated features, because this messes up the estimation of the weights.
In a situation where two features are strongly correlated, it becomes problematic to estimate the
weights because the feature effects are additive and it becomes indeterminable to which of the
correlated features to attribute the effects.

4.2.1 Interpretation

The interpretation of a weight in the linear regression model depends on the type of the
corresponding feature.

 Numerical feature: Increasing the numerical feature by one unit changes the estimated outcome
by its weight. An example of a numerical feature is the size of a house.
 Binary feature: A feature that takes one of two possible values for each instance. An example is
the feature “House comes with a garden”. One of the values counts as the reference category (in
some programming languages encoded with 0), such as “No garden”. Changing the feature from
the reference category to the other category changes the estimated outcome by the feature’s
weight.
 Categorical feature with multiple categories: A feature with a fixed number of possible values. An
example is the feature “floor type”, with possible categories “carpet”, “laminate” and “parquet”.
A solution to deal with many categories is the one-hot-encoding, meaning that each category has
its own binary column. For a categorical feature with L categories, you only need L-1 columns,
because the L-th column would have redundant information (e.g. when columns 1 to L-1 all have
value 0 for one instance, we know that the categorical feature of this instance takes on category
L). The interpretation for each category is then the same as the interpretation for binary features.
 Intercept β0: The intercept is the feature weight for the “constant feature”, which is always 1 for
all instances. Most software packages automatically add this “1”-feature to estimate the intercept.
The interpretation is: For an instance with all numerical feature values at zero and the categorical
feature values at the reference categories, the model prediction is the intercept weight. The
interpretation of the intercept is usually not relevant because instances with all features values at
zero often make no sense. The interpretation is only meaningful when the features have been
standardised (mean of zero, standard deviation of one). Then the intercept reflects the predicted
outcome of an instance where all features are at their mean value.

The interpretation of the features in the linear regression model can be automated by using following
text templates.

Interpretation of a Numerical Feature

An increase of feature xk by one unit increases the prediction for y by βk units when all other feature
values remain fixed.

Interpretation of a Categorical Feature

Changing feature xk from the reference category to the other category increases the prediction for y
by βk when all other features remain fixed.
Another important measurement for interpreting linear models is the R-squared measurement. R-squared tells you how much of the total variance of your target outcome is explained by the model. The higher R-squared, the better your model explains the data. The formula for calculating R-squared is:

R² = 1 − SSE / SST

SSE is the squared sum of the error terms:

SSE = Σ (yi − ŷi)²

SST is the squared sum of the data variance:

SST = Σ (yi − ȳ)²
The SSE tells you how much variance remains after fitting the linear model, which is measured by the
squared differences between the predicted and actual target values. SST is the total variance of the
target outcome. R-squared tells you how much of your variance can be explained by the linear model.
R-squared ranges between 0 for models where the model does not explain the data at all and 1 for
models that explain all of the variance in your data.

There is a catch, because R-squared increases with the number of features in the model, even if they
do not contain any information about the target value at all. Therefore, it is better to use the adjusted
R-squared, which accounts for the number of features used in the model. Its calculation is:

R²adj = 1 − (1 − R²) · (n − 1) / (n − p − 1)

where p is the number of features and n the number of instances.
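As a sketch of how these quantities can be computed in R (assuming a fitted lm object fit and a target vector y):

sse <- sum((y - predict(fit))^2)   # squared sum of the error terms
sst <- sum((y - mean(y))^2)        # squared sum of the data variance
r2  <- 1 - sse / sst
n <- length(y)
p <- length(coef(fit)) - 1         # number of features (without the intercept)
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
c(r2, r2_adj)                      # summary(fit) reports the same two values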

It is not meaningful to interpret a model with very low (adjusted) R-squared, because such a model
basically does not explain much of the variance. Any interpretation of the weights would not be
meaningful.

Feature Importance

The importance of a feature in a linear regression model can be measured by the absolute value of its
t-statistic. The t-statistic is the estimated weight scaled with its standard error:

t(β̂j) = β̂j / SE(β̂j)
Let us examine what this formula tells us: The importance of a feature increases with increasing
weight. This makes sense. The more variance the estimated weight has (= the less certain we are about
the correct value), the less important the feature is. This also makes sense.
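In R, the t-statistics are part of the standard model summary; a minimal sketch (assuming a fitted lm object fit):

coefs <- summary(fit)$coefficients          # columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs[, "t value"]                          # t-statistic = estimated weight / standard error
sort(abs(coefs[-1, "t value"]), decreasing = TRUE)  # rank the features by importance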

4.2.2 Example

In this example, we use the linear regression model to predict the number of rented bikes on a
particular day, given weather and calendar information. For the interpretation, we examine the
estimated regression weights. The features consist of numerical and categorical features. For each
feature, the table shows the estimated weight, the standard error of the estimate (SE), and the
absolute value of the t-statistic (|t|).

Interpretation of a numerical feature (temperature): An increase of the temperature by 1 degree Celsius increases the predicted number of bicycles by 110.7, when all other features remain fixed.

Interpretation of a categorical feature (“weathersit”): The estimated number of bicycles is -1901.5 lower when it is raining, snowing or stormy, compared to good weather – again assuming that all other features do not change. When the weather is misty, the predicted number of bicycles is -379.4 lower compared to good weather, given all other features remain the same.

All the interpretations always come with the footnote that “all other features remain the same”. This
is because of the nature of linear regression models. The predicted target is a linear combination of
the weighted features. The estimated linear equation is a hyperplane in the feature/target space (a
simple line in the case of a single feature). The weights specify the slope (gradient) of the hyperplane
in each direction. The good side is that the additivity isolates the interpretation of an individual feature
effect from all other features. That is possible because all the feature effects (= weight times feature
value) in the equation are combined with a plus. On the bad side of things, the interpretation ignores
the joint distribution of the features. Increasing one feature, but not changing another, can lead to
unrealistic or at least unlikely data points. For example increasing the number of rooms might be
unrealistic without also increasing the size of a house.
4.2.3 Visual Interpretation

Various visualizations make the linear regression model easy and quick to grasp for humans.

4.2.3.1 Weight Plot

The information of the weight table (weight and variance estimates) can be visualized in a weight plot.
The following plot shows the results from the previous linear regression model.

Fig. 4.1: Weights are displayed as points and the 95% confidence intervals as lines.

The weight plot shows that rainy/snowy/stormy weather has a strong negative effect on the predicted
number of bikes. The weight of the working day feature is close to zero and zero is included in the 95%
interval, which means that the effect is not statistically significant. Some confidence intervals are very
short and the estimates are close to zero, yet the feature effects were statistically significant.
Temperature is one such candidate. The problem with the weight plot is that the features are
measured on different scales. While for the weather the estimated weight reflects the difference
between good and rainy/stormy/snowy weather, for temperature it only reflects an increase of 1
degree Celsius. You can make the estimated weights more comparable by scaling the features (zero
mean and standard deviation of one) before fitting the linear model.
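A weight plot like Fig. 4.1 can be sketched with base R as follows (assuming a fitted lm object fit; the plotting details are only one of many possible choices):

est <- coef(fit)[-1]                       # weight estimates without the intercept
ci  <- confint(fit)[-1, ]                  # 95% confidence intervals
idx <- seq_along(est)
plot(est, idx, xlim = range(ci), pch = 19, yaxt = "n",
     xlab = "Weight estimate", ylab = "")
axis(2, at = idx, labels = names(est), las = 1)
segments(ci[, 1], idx, ci[, 2], idx)       # confidence intervals as horizontal lines
abline(v = 0, lty = 2)                     # reference line at zero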

4.2.3.2 Effect Plot

The weights of the linear regression model can be more meaningfully analysed when they are
multiplied by the actual feature values. The weights depend on the scale of the features and will be
different if you have a feature that measures e.g. a person’s height and you switch from meter to
centimetre. The weight will change, but the actual effects in your data will not. It is also important to
know the distribution of your feature in the data, because if you have a very low variance, it means
that almost all instances have similar contribution from this feature. The effect plot can help you
understand how much the combination of weight and feature contributes to the predictions in your
data. Start by calculating the effects, which is the weight per feature times the feature value of an
instance:

The effects can be visualized with boxplots. A box in a boxplot contains the effect range for half of
your data (25% to 75% effect quantiles). The vertical line in the box is the median effect, i.e. 50% of
the instances have a lower and the other half a higher effect on the prediction. The horizontal lines
extend to ±1.5IQR/√n, with IQR being the inter quartile range (75% quantile minus 25% quantile). The
dots are outliers. The categorical feature effects can be summarized in a single boxplot, compared to
the weight plot, where each category has its own row.

Fig. 4.2: The feature effect plot shows the distribution of effects (= feature value times feature
weight) across the data per feature.

The largest contributions to the expected number of rented bicycles come from the temperature
feature and the day’s feature, which captures the trend of bike rentals over time. The temperature
has a broad range of how much it contributes to the prediction. The day trend feature goes from zero
to large positive contributions, because the first day in the dataset (01.01.2011) has a very small trend
effect and the estimated weight for this feature is positive (4.93). This means that the effect increases
with each day and is highest for the last day in the dataset (31.12.2012). Note that for effects with a
negative weight, the instances with a positive effect are those that have a negative feature value. For
example, days with a high negative effect of wind speed are the ones with high wind speeds.
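A minimal sketch of how the effects behind such a plot can be computed in R (assuming a fitted lm object fit):

X <- model.matrix(fit)[, -1]                    # design matrix without the intercept column
effects <- sweep(X, 2, coef(fit)[-1], `*`)      # effect = feature value times feature weight
boxplot(effects, horizontal = TRUE, las = 1,
        xlab = "Feature effect on the prediction")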
4.2.4 Explain Individual Predictions

How much has each feature of an instance contributed to the prediction? This can be answered by
computing the effects for this instance. An interpretation of instance-specific effects only makes sense
in comparison to the distribution of the effect for each feature. We want to explain the prediction of
the linear model for the 6-th instance from the bicycle dataset. The instance has the following feature
values.

To obtain the feature effects of this instance, we have to multiply its feature values by the
corresponding weights from the linear regression model. For the value “WORKING DAY” of feature “workingday”, the effect is 124.9. For a temperature of 1.6 degrees Celsius, the effect is 177.6. We
add these individual effects as crosses to the effect plot, which shows us the distribution of the effects
in the data. This allows us to compare the individual effects with the distribution of effects in the data.
Fig. 4.3: The effect plot for one instance shows the effect distribution and highlights the effects of
the instance of interest.

If we average the predictions for the training data instances, we get an average of 4504. In comparison,
the prediction of the 6-th instance is small, since only 1571 bicycle rentals are predicted. The effect plot
reveals the reason why. The boxplots show the distributions of the effects for all instances of the
dataset, the crosses show the effects for the 6-th instance. The 6-th instance has a low temperature
effect because on this day the temperature was 2 degrees, which is low compared to most other days
(and remember that the weight of the temperature feature is positive). Also, the effect of the trend
feature “days_since_2011” is small compared to the other data instances because this instance is from
early 2011 (5 days) and the trend feature also has a positive weight.

4.2.5 Encoding of Categorical Features

There are several ways to encode a categorical feature, and the choice influences the interpretation
of the weights.

The standard in linear regression models is treatment coding, which is sufficient in most cases. Using
different encodings boils down to creating different (design) matrices from a single column with the
categorical feature. This section presents three different encodings, but there are many more. The
example used has six instances and a categorical feature with three categories. For the first two
instances, the feature takes category A; for instances three and four, category B; and for the last two
instances, category C.

Treatment coding

In treatment coding, the weight per category is the estimated difference in the prediction between
the corresponding category and the reference category. The intercept of the linear model is the mean
of the reference category (when all other features remain the same). The first column of the design
matrix is the intercept, which is always 1. Column two indicates whether instance i is in category B,
column three indicates whether it is in category C. There is no need for a column for category A,
because then the linear equation would be over-specified and no unique solution for the weights can be found. It is sufficient to know that an instance is neither in category B nor C.
Feature matrix:

1  0  0
1  0  0
1  1  0
1  1  0
1  0  1
1  0  1

Effect coding

The weight per category is the estimated y-difference from the corresponding category to the overall
mean (given all other features are zero or the reference category). The first column is used to estimate
the intercept. The weight β0 associated with the intercept represents the overall mean and β1, the
weight for column two, is the difference between the overall mean and category B. The total effect of
category B is β0+β1. The interpretation for category C is equivalent. For the reference category A, −
(β1+β2) is the difference to the overall mean and β0− (β1+β2) the overall effect.
Feature matrix:

1  -1  -1
1  -1  -1
1   1   0
1   1   0
1   0   1
1   0   1

Dummy coding

The β per category is the estimated mean value of y for each category (given all other feature values
are zero or the reference category). Note that the intercept has been omitted here so that a unique
solution can be found for the linear model weights.
Feature matrix:

1  0  0
1  0  0
0  1  0
0  1  0
0  0  1
0  0  1
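In R, these design matrices can be produced with model.matrix(); a minimal sketch for the six-instance example above (note that contr.sum drops the last level rather than the first, so the omitted category may differ from the text):

x <- factor(c("A", "A", "B", "B", "C", "C"))
model.matrix(~ x)                                          # treatment coding (default)
model.matrix(~ x, contrasts.arg = list(x = "contr.sum"))   # effect coding
model.matrix(~ x - 1)                                      # dummy coding (no intercept)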

4.3 LASSO REGRESSION

The examples of the linear models that we have discussed all look nice and neat. But in reality you
might not have just a handful of features, but hundreds or thousands. Interpretability goes downhill.
You might even find yourself in a situation where there are more features than instances, and you
cannot fit a standard linear model at all. The good news is that there are ways to introduce sparsity (=
few features) into linear models.
Lasso is an automatic and convenient way to introduce sparsity into the linear regression model. Lasso
stands for “least absolute shrinkage and selection operator” and, when applied in a linear regression
model, performs feature selection and regularization of the selected feature weights. Let us consider
the minimization problem that the weights optimize:

minβ Σi ( y(i) − x(i)ᵀ β )²

Lasso adds a term to this optimization problem:

minβ ( Σi ( y(i) − x(i)ᵀ β )² + λ·||β||1 )

The term ||β||1, the L1-norm of the weight vector, leads to a penalization of large weights. Since the
L1-norm is used, many of the weights receive an estimate of 0 and the others are shrunk. The
parameter lambda (λ) controls the strength of the regularizing effect and is usually tuned by cross-
validation. Especially when lambda is large, many weights become 0. The feature weights can be
visualized as a function of the penalty term lambda. Each feature weight is represented by a curve in
the following figure.

Fig. 4.4: With increasing penalty of the weights, fewer and fewer features receive a non-zero
weight estimate. These curves are also called regularization paths. The number above the plot is
the number of non-zero weights.

What value should we choose for lambda? If you see the penalization term as a tuning parameter,
then you can find the lambda that minimizes the model error with cross-validation. You can also
consider lambda as a parameter to control the interpretability of the model. The larger the
penalization, the fewer features are present in the model (because their weights are zero) and the
better the model can be interpreted.

Example with Lasso

We will predict bicycle rentals using Lasso. We set the number of features we want to have in the
model beforehand. Let us first set the number to 2 features:
Feature                        Weight
seasonSPRING                     0.00
seasonSUMMER                     0.00
seasonFALL                       0.00
seasonWINTER                     0.00
holidayHOLIDAY                   0.00
workingdayWORKING DAY            0.00
weathersitMISTY                  0.00
weathersitRAIN/SNOW/STORM        0.00
temp                            52.33
hum                              0.00
windspeed                        0.00
days_since_2011                  2.15

The first two features with non-zero weights in the Lasso path are temperature (“temp”) and the time
trend (“days_since_2011”).

Now, let us select 5 features:

Feature                        Weight
seasonSPRING                  -389.99
seasonSUMMER                     0.00
seasonFALL                       0.00
seasonWINTER                     0.00
holidayHOLIDAY                   0.00
workingdayWORKING DAY            0.00
weathersitMISTY                  0.00
weathersitRAIN/SNOW/STORM     -862.27
temp                            85.58
hum                             -3.04
windspeed                        0.00
days_since_2011                  3.82

Note that the weights for “temp” and “days_since_2011” differ from the model with two features.
The reason for this is that by decreasing lambda even features that are already “in” the model are
penalized less and may get a larger absolute weight. The interpretation of the Lasso weights
corresponds to the interpretation of the weights in the linear regression model. You only need to pay
attention to whether the features are standardized or not, because this affects the weights. In this
example, the features were standardized by the software, but the weights were automatically
transformed back for us to match the original feature scales.

Other methods for sparsity in linear models

A wide spectrum of methods can be used to reduce the number of features in a linear model.

 Pre-processing methods:

o Manually selected features: You can always use expert knowledge to select or discard
some features. The big drawback is that it cannot be automated and you need to have
access to someone who understands the data.
o Univariate selection: An example is the correlation coefficient. You only consider
features that exceed a certain threshold of correlation between the feature and the
target. The disadvantage is that it only considers the features individually. Some
features might not show a correlation until the linear model has accounted for some
other features. Those ones you will miss with univariate selection methods.
 Step-wise methods:

o Forward selection: Fit the linear model with one feature. Do this with each feature.
Select the model that works best (e.g. highest R-squared). Now again, for the
remaining features, fit different versions of your model by adding each feature to your
current best model. Select the one that performs best. Continue until some criterion
is reached, such as the maximum number of features in the model.
o Backward selection: Similar to forward selection. But instead of adding features, start
with the model that contains all features and try out which feature you have to
remove to get the highest performance increase. Repeat this until some stopping
criterion is reached.

It is recommended to use Lasso, because it can be automated, considers all features simultaneously,
and can be controlled via lambda. It also works for the logistic regression model for classification.

4.3.1 Advantages and Disadvantages

The modeling of the predictions as a weighted sum makes it transparent how predictions are
produced. And with Lasso we can ensure that the number of features used remains small. Many
people use linear regression models. This means that in many places it is accepted for predictive
modeling and doing inference. There is a high level of collective experience and expertise, including
teaching materials on linear regression models and software implementations. Linear regression can
be found in R, Python, Java, Julia, Scala, JavaScript, etc.

Mathematically, it is straightforward to estimate the weights and you have a guarantee to find
optimal weights (given all assumptions of the linear regression model are met by the data).

Together with the weights you get confidence intervals, tests, and solid statistical theory. There are
also many extensions of the linear regression model.

Disadvantages

Linear regression models can only represent linear relationships, i.e. a weighted sum of the input
features. Each nonlinearity or interaction has to be hand-crafted and explicitly given to the model as
an input feature.

Linear models are also often not that good regarding predictive performance, because the
relationships that can be learned are so restricted and usually oversimplify how complex reality is.

The interpretation of a weight can be unintuitive because it depends on all other features. A feature
with high positive correlation with the outcome y and another feature might get a negative weight in
the linear model, because, given the other correlated feature, it is negatively correlated with y in the
high-dimensional space. Completely correlated features make it even impossible to find a unique
solution for the linear equation. An example: You have a model to predict the value of a house and
have features like number of rooms and size of the house. House size and number of rooms are highly
correlated: the bigger a house is, the more rooms it has. If you take both features into a linear model,
it might happen that the size of the house is the better predictor and gets a large positive weight. The number of rooms might end up getting a negative weight because, given that a house has the same size, increasing the number of rooms could make it less valuable, or because the linear equation becomes less stable when the correlation is too strong.
4.4 LOGISTIC REGRESSION

Logistic regression models the probabilities for classification problems with two possible
outcomes. It’s an extension of the linear regression model for classification problems.

The linear regression model can work well for regression, but fails for classification. Why is that? In
case of two classes, you could label one of the classes with 0 and the other with 1 and use linear
regression. Technically it works and most linear model programs will spit out weights for you. But
there are a few problems with this approach:

A linear model does not output probabilities, but it treats the classes as numbers (0 and 1) and fits the
best hyperplane (for a single feature, it is a line) that minimizes the distances between the points and
the hyperplane. So it simply interpolates between the points, and you cannot interpret it as
probabilities.

A linear model also extrapolates and gives you values below zero and above one. This is a good sign
that there might be a smarter approach to classification.

Since the predicted outcome is not a probability, but a linear interpolation between points, there is
no meaningful threshold at which you can distinguish one class from the other.

Linear models do not extend to classification problems with multiple classes. You would have to start
labeling the next class with 2, then 3, and so on. The classes might not have any meaningful order, but
the linear model would force a weird structure on the relationship between the features and your
class predictions. The higher the value of a feature with a positive weight, the more it contributes to
the prediction of a class with a higher number, even if classes that happen to get a similar number are
not closer than other classes.

Fig. 4.5: A linear model classifies tumors as malignant (1) or benign (0) given their size. The lines
show the prediction of the linear model. For the data on the left, we can use 0.5 as classification
threshold. After introducing a few more malignant tumor cases, the regression line shifts and a
threshold of 0.5 no longer separates the classes. Points are slightly jittered to reduce over-plotting
4.4.1 Theory

A solution for classification is logistic regression. Instead of fitting a straight line or hyperplane, the
logistic regression model uses the logistic function to squeeze the output of a linear equation between
0 and 1. The logistic function is defined as:

logistic(η) = 1 / (1 + exp(−η))

And it looks like this:

Fig. 4.6: The logistic function. It outputs numbers between 0 and 1. At input 0, it outputs 0.5.

The step from linear regression to logistic regression is kind of straightforward. In the linear regression
model, we have modelled the relationship between outcome and features with a linear equation:

ŷ = β0 + β1·x1 + … + βp·xp

For classification, we prefer probabilities between 0 and 1, so we wrap the right side of the equation into the logistic function. This forces the output to assume only values between 0 and 1:

P(y=1) = 1 / (1 + exp(−(β0 + β1·x1 + … + βp·xp)))

Let us revisit the tumor size example again. But instead of the linear regression model, we use the
logistic regression model:
Fig. 4.7: The logistic regression model finds the correct decision boundary between malignant and
benign depending on tumor size. The line is the logistic function shifted and squeezed to fit the
data.

Classification works better with logistic regression and we can use 0.5 as a threshold in both cases.
The inclusion of additional points does not really affect the estimated curve.
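A minimal sketch of fitting such a model in R (the tumor data below is invented for illustration):

tumor <- data.frame(
  size      = c(1.0, 1.8, 2.3, 2.9, 3.1, 2.5, 3.8, 4.6),
  malignant = c(0,   0,   0,   0,   1,   1,   1,   1)
)
fit <- glm(malignant ~ size, data = tumor, family = binomial)
predict(fit, newdata = data.frame(size = 2.8), type = "response")  # predicted probability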

4.4.2 Interpretation

The interpretation of the weights in logistic regression differs from the interpretation of the weights
in linear regression, since the outcome in logistic regression is a probability between 0 and 1. The
weights do not influence the probability linearly any longer. The weighted sum is transformed by the
logistic function to a probability. Therefore we need to reformulate the equation for the interpretation
so that only the linear term is on the right side of the formula:

log( P(y=1) / (1 − P(y=1)) ) = log( P(y=1) / P(y=0) ) = β0 + β1·x1 + … + βp·xp

We call the term in the log() function “odds” (probability of event divided by probability of no event) and wrapped in the logarithm it is called log odds.

This formula shows that the logistic regression model is a linear model for the log odds. Great! That does not sound helpful! With a little shuffling of the terms, you can figure out how the prediction changes when one of the features xj is changed by 1 unit. To do this, we can first apply the exp() function to both sides of the equation:

P(y=1) / (1 − P(y=1)) = odds = exp( β0 + β1·x1 + … + βp·xp )

Then we compare what happens when we increase one of the feature values by 1. But instead of looking at the difference, we look at the ratio of the two predictions:

odds(xj+1) / odds(xj) = exp( β0 + β1·x1 + … + βj·(xj+1) + … + βp·xp ) / exp( β0 + β1·x1 + … + βj·xj + … + βp·xp )

We apply the following rule:

exp(a) / exp(b) = exp(a − b)

And we remove many terms:

odds(xj+1) / odds(xj) = exp( βj·(xj+1) − βj·xj ) = exp( βj )
In the end, we have something as simple as exp() of a feature weight. A change in a feature by one
unit changes the odds ratio (multiplicative) by a factor of exp(βj). We could also interpret it this way:
A change in xj by one unit increases the log odds ratio by the value of the corresponding weight. Most
people interpret the odds ratio because thinking about the log() of something is known to be hard on
the brain. Interpreting the odds ratio already requires some getting used to. For example, if you have
odds of 2, it means that the probability for y=1 is twice as high as y=0. If you have a weight (= log odds
ratio) of 0.7, then increasing the respective feature by one unit multiplies the odds by exp(0.7)
(approximately 2) and the odds change to 4. But usually you do not deal with the odds and interpret
the weights only as the odds ratios. Because for actually calculating the odds you would need to set a
value for each feature, which only makes sense if you want to look at one specific instance of your
dataset.

These are the interpretations for the logistic regression model with different feature types:

 Numerical feature: If you increase the value of feature xj by one unit, the estimated odds
change by a factor of exp(βj)
 Binary categorical feature: One of the two values of the feature is the reference category (in
some languages, the one encoded in 0). Changing the feature xj from the reference category
to the other category changes the estimated odds by a factor of exp(βj).
 Categorical feature with more than two categories: One solution to deal with multiple
categories is one-hot-encoding, meaning that each category has its own column. You only
need L-1 columns for a categorical feature with L categories, otherwise it is over-
parameterized. The L-th category is then the reference category. You can use any other
encoding that can be used in linear regression. The interpretation for each category then is
equivalent to the interpretation of binary features.
 Intercept β0: When all numerical features are zero and the categorical features are at the
reference category, the estimated odds are exp(β0 ). The interpretation of the intercept weight
is usually not relevant.

4.4.3 Example

We use the logistic regression model to predict cervical cancer based on some risk factors. The
following table shows the estimated weights, the associated odds ratios, and the standard error of the
estimates.
Table 4.1: The results of fitting a logistic regression model on the cervical cancer dataset. Shown are
the features used in the model, their estimated weights and corresponding odds ratios, and the
standard errors of the estimated weights.

Interpretation of a numerical feature (“Num. of diagnosed STDs”): An increase in the number of diagnosed STDs (sexually transmitted diseases) changes (increases) the odds of cancer vs. no cancer
by a factor of 2.26, when all other features remain the same. Keep in mind that correlation does not
imply causation.

Interpretation of a categorical feature (“Hormonal contraceptives y/n”): For women using hormonal
contraceptives, the odds for cancer vs. no cancer are by a factor of 0.89 lower, compared to women
without hormonal contraceptives, given all other features stay the same.

Like in the linear model, the interpretations always come with the clause that ‘all other features stay
the same’.

4.4.4 Advantages and Disadvantages

Many of the pros and cons of the linear regression model also apply to the logistic regression model.
Logistic regression has been widely used by many different people, but it struggles with its restrictive
expressiveness (e.g. interactions must be added manually) and other models may have better
predictive performance.

Another disadvantage of the logistic regression model is that the interpretation is more difficult
because the interpretation of the weights is multiplicative and not additive.

Logistic regression can suffer from complete separation. If there is a feature that would perfectly
separate the two classes, the logistic regression model can no longer be trained. This is because the
weight for that feature would not converge, because the optimal weight would be infinite. This is
a bit unfortunate, because such a feature is really useful. But you do not need machine learning
if you have a simple rule that separates both classes. The problem of complete separation can be
solved by introducing penalization of the weights or defining a prior probability distribution of weights.

On the good side, the logistic regression model is not only a classification model, but also gives you
probabilities. This is a big advantage over models that can only provide the final classification. Knowing
that an instance has a 99% probability for a class compared to 51% makes a big difference.

Logistic regression can also be extended from binary classification to multi-class classification. Then it
is called Multinomial Regression.
4.5 Other Regression Algorithms

 Multivariate Regression Algorithm:

This technique is used when there is more than one predictor variable in a multivariate regression model; such a model is called a multivariate multiple regression. Considered one of the simpler supervised machine learning algorithms, this regression technique is used to predict the response variable for a set of explanatory variables. It can be implemented efficiently with the help of matrix operations; in Python, for example, it can be implemented via the “numpy” library, which provides definitions and operations for matrix objects.

Industry application of Multivariate Regression algorithm is seen heavily in the retail sector where
customers make a choice on a number of variables such as brand, price and product. The multivariate
analysis helps decision makers to find the best combination of factors to increase footfalls in the store.

 Multiple Regression Algorithm:

This regression algorithm has several applications across the industry, for example in product pricing, real estate pricing, and in marketing departments to find out the impact of campaigns. Unlike the linear regression technique, multiple regression is a broader class of regressions that encompasses linear and nonlinear
regressions with multiple explanatory variables.

Some of the business applications of multiple regression algorithm in the industry are in social science
research, behavioural analysis and even in the insurance industry to determine claim worthiness.

Check your Progress 1

State True or False

1. The interpretation of a weight in the linear regression model depends on the type of the
corresponding feature.
2. In effect coding, the weight per category is the estimated difference in the prediction
between the corresponding category and the reference category.
3. Coding is an automatic and convenient way to introduce sparsity into the linear regression
model.

Fill in the blanks

1. _____ models the probabilities for classification problems with two possible outcomes.
2. Multiple regression is a broader class of regressions that encompasses linear and nonlinear
regressions with _____ variables.
Activity 1

Collect or use any existing data and apply linear regression on it. Also, list the findings.

Summary

In this unit, we have discussed regression algorithms in detail. We have also described various regression algorithms such as Linear regression, Logistic regression, Lasso regression, and Multivariate and Multiple regression algorithms, with examples and applications.

Keywords

 Regression: It is a statistical measurement used in finance, investing, and other disciplines


that attempts to determine the strength of the relationship between one dependent variable
(usually denoted by Y) and a series of other changing variables (known as independent
variables).
 Prediction: It refers to the output of an algorithm after it has been trained on a historical
dataset and applied to new data when forecasting the likelihood of a particular outcome.
 Explanatory variable: It is a type of independent variable. The two terms are often used
interchangeably. But there is a subtle difference between the two. When a variable is
independent, it is not affected at all by any other variables. When a variable isn't independent
for certain, it's an explanatory variable.

Self-Assessment Questions

1. Compare linear and logistics regression algorithms.


2. Write a short note on lasso regression algorithm.
3. State the application of Multivariate regression algorithm.

Answers to Check Your Progress

Check your Progress 1

State True or False

1. True
2. False
3. False

Fill in the blanks

1. Logistic regression models the probabilities for classification problems with two possible
outcomes.
2. Multiple regression is a broader class of regressions that encompasses linear and nonlinear
regressions with multiple explanatory variables.

Suggested Reading

1. Introduction to Machine Learning by Ethem Alpaydin


2. Machine Learning For Beginners: Machine Learning Basics for Absolute Beginners. Learn What
ML Is and Why It Matters. Notes on Artificial Intelligence and Deep Learning are also Included,
by Scott Chesterton
3. Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz, Shai
Ben-David
4. Machine Learning For Dummies by John Paul Mueller, Luca Massaron
5. Molnar, Christoph. "Interpretable machine learning. A Guide for Making Black Box Models
Explainable", 2019. https://christophm.github.io/interpretable-ml-book/.

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
ShareAlike 4.0 International (CC BY-SA 4.0) as requested by the work’s creator or licensees. This license
is available at https://creativecommons.org/licenses/by-sa/4.0/.
Unit 5

Clustering Models

5.1 Introduction

5.2 Basic Concept of Clustering Algorithm

5.3 Types of Clustering Algorithms

5.4 Clustering Example (with R)

Summary

Keywords

Self-Assessment Questions

Answers to Check Your Progress

Suggested Reading
Objectives:

After going through this unit, you will be able to:

 Understand about Clustering model


 Describe various clustering algorithms
 Explain the difference between K-means and hierarchical clustering

5.1 INTRODUCTION

When you're trying to learn about something, say music, one approach might be to look for
meaningful groups or collections. You might organize music by genre, while your friend might organize
music by decade. How you choose to group items helps you to understand more about them as
individual pieces of music. You might find that you have a deep affinity for punk rock and further break
down the genre into different approaches or music from different locations. On the other hand, your
friend might look at music from the 1980's and be able to understand how the music across genres at
that time was influenced by the socio-political climate. In both cases, you and your friend have learned
something interesting about music, even though you took different approaches.

In machine learning too, we often group examples as a first step to understand a subject (data set) in
a machine learning system. Grouping unlabeled examples is called clustering. As the examples are
unlabeled, clustering relies on unsupervised machine learning. If the examples are labeled, then
clustering becomes classification. In this unit, we are going to discuss clustering algorithms in detailed.

5.2 BASIC CONCEPT OF CLUSTERING ALGORITHM

Clustering is a Machine Learning technique that involves the grouping of data points. Given a set of
data points, we can use a clustering algorithm to classify each data point into a specific group. In
theory, data points that are in the same group should have similar properties and/or features, while
data points in different groups should have highly dissimilar properties and/or features. Clustering is
a method of unsupervised learning and is a common technique for statistical data analysis used in
many fields.

Let’s understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. And this is what we call clustering.

To cluster your data, you'll follow these steps:

 Prepare data.
 Create similarity metric.
 Run clustering algorithm.
 Interpret results and adjust your clustering.

In Data Science, we can use clustering analysis to gain some valuable insights from our data by seeing
what groups the data points fall into when we apply a clustering algorithm.
Clustering has a myriad of uses in a variety of industries. Some common applications for clustering
include the following:

 market segmentation
 social network analysis
 search result grouping
 medical imaging
 image segmentation
 anomaly detection

After clustering, each cluster is assigned a number called a cluster ID. Now, you can condense the
entire feature set for an example into its cluster ID. Representing a complex example by a simple
cluster ID makes clustering powerful. Extending the idea, clustering data can simplify large datasets.

Machine learning systems can then use cluster IDs to simplify the processing of large datasets. Thus,
clustering’s output serves as feature data for downstream ML systems. At Google, clustering is used
for generalization, data compression, and privacy preservation in products such as YouTube videos,
Play apps, and Music tracks.

5.3 TYPES OF CLUSTERING ALGORITHMS

Since the task of clustering is subjective, there are many means that can be used for achieving this goal. Every methodology follows a different set of rules for defining the ‘similarity’ among data points. In fact, more than 100 clustering algorithms are known, but only a few are used widely. Let’s look at them in detail:

 Connectivity models: As the name suggests, these models are based on the notion that the
data points closer in data space exhibit more similarity to each other than the data points lying
farther away. These models can follow two approaches. In the first approach, they start with
classifying all data points into separate clusters & then aggregating them as the distance
decreases. In the second approach, all data points are classified as a single cluster and then
partitioned as the distance increases. Also, the choice of distance function is subjective. These
models are very easy to interpret but lack scalability for handling big datasets. Examples of
these models are hierarchical clustering algorithm and its variants.
 Centroid models: These are iterative clustering algorithms in which the notion of similarity is
derived by the closeness of a data point to the centroid of the clusters. K-Means clustering
algorithm is a popular algorithm that falls into this category (discussed in Unit 3). In these
models, the no. of clusters required at the end has to be specified beforehand, which
makes it important to have prior knowledge of the dataset. These models run iteratively to
find the local optima.
 Distribution models: These clustering models are based on the notion of how probable is it
that all data points in the cluster belong to the same distribution (For example: Normal,
Gaussian). These models often suffer from overfitting. A popular example of these models is
Expectation-maximization algorithm which uses multivariate normal distributions.
 Density Models: These models search the data space for areas of varied density of data points. They isolate the different density regions and assign the data points within these regions to the same cluster. Popular examples of density models are DBSCAN and
OPTICS.
Fig. 5.1 (a): Example of centroid-based clustering
Fig. 5.1 (b): Example of density-based clustering
Fig. 5.1 (c): Example of distribution-based clustering
Fig. 5.1 (d): Example of a hierarchical tree clustering animals

5.3.1 Mean-Shift Clustering

Mean shift clustering is a sliding-window-based algorithm that attempts to find dense areas of data
points. It is a centroid-based algorithm meaning that the goal is to locate the center points of each
group/class, which works by updating candidates for center points to be the mean of the points within
the sliding-window. These candidate windows are then filtered in a post-processing stage to eliminate
near-duplicates, forming the final set of center points and their corresponding groups.
1. To explain mean-shift we will consider a set of points in two-dimensional space like the above
illustration. We begin with a circular sliding window centered at a point C (randomly selected)
and having radius r as the kernel. Mean shift is a hill-climbing algorithm that involves shifting
this kernel iteratively to a higher density region on each step until convergence.
2. At every iteration, the sliding window is shifted towards regions of higher density by shifting
the center point to the mean of the points within the window (hence the name). The density
within the sliding window is proportional to the number of points inside it. Naturally, by
shifting to the mean of the points in the window it will gradually move towards areas of higher
point density.
3. We continue shifting the sliding window according to the mean until there is no direction at
which a shift can accommodate more points inside the kernel. Check out the graphic above;
we keep moving the circle until we are no longer increasing the density (i.e. number of points
in the window).
4. This process of steps 1 to 3 is done with many sliding windows until all points lie within a
window. When multiple sliding windows overlap the window containing the most points is
preserved. The data points are then clustered according to the sliding window in which they
reside.

In contrast to K-means clustering, there is no need to select the number of clusters as mean-shift
automatically discovers this. That’s a massive advantage. The fact that the cluster centers converge
towards the points of maximum density is also quite desirable as it is quite intuitive to understand and
fits well in a naturally data-driven sense. The drawback is that the selection of the window size/radius
“r” can be non-trivial.

5.3.2 Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN is a density-based clustered algorithm similar to mean-shift, but with a couple of notable
advantages.

1. DBSCAN begins with an arbitrary starting data point that has not been visited. The
neighborhood of this point is extracted using a distance epsilon ε (All points which are within
the ε distance are neighborhood points).
2. If there are a sufficient number of points (according to minPoints) within this neighborhood
then the clustering process starts and the current data point becomes the first point in the
new cluster. Otherwise, the point will be labeled as noise (later this noisy point might become
the part of the cluster). In both cases that point is marked as “visited”.
3. For this first point in the new cluster, the points within its ε distance neighborhood also
become part of the same cluster. This procedure of making all points in the ε neighborhood
belong to the same cluster is then repeated for all of the new points that have been just added
to the cluster group.
4. This process of steps 2 and 3 is repeated until all points in the cluster are determined i.e. all
points within the ε neighborhood of the cluster have been visited and labeled.
5. Once we’re done with the current cluster, a new unvisited point is retrieved and processed,
leading to the discovery of a further cluster or noise. This process repeats until all points are
marked as visited. Since at the end of this all points have been visited, each point will have
been marked as either belonging to a cluster or being noise.

DBSCAN poses some great advantages over other clustering algorithms. Firstly, it does not require a
preset number of clusters at all. It also identifies outliers as noises, unlike mean-shift which simply
throws them into a cluster even if the data point is very different. Additionally, it can find arbitrarily
sized and arbitrarily shaped clusters quite well.

The main drawback of DBSCAN is that it doesn’t perform as well as others when the clusters are of
varying density. This is because the setting of the distance threshold ε and minPoints for identifying
the neighborhood points will vary from cluster to cluster when the density varies. This drawback also
occurs with very high-dimensional data since again the distance threshold ε becomes challenging to
estimate.
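A minimal sketch of DBSCAN in R, assuming the dbscan package is installed (eps and minPts are tuning choices, not recommendations):

library(dbscan)
X  <- as.matrix(iris[, 1:4])
db <- dbscan(X, eps = 0.5, minPts = 5)   # eps = neighborhood radius, minPts = density threshold
table(db$cluster)                        # cluster 0 collects the points labeled as noise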

5.3.3 Expectation–Maximization (EM) Clustering using Gaussian Mixture Models (GMM)

One of the major drawbacks of K-Means is its naive use of the mean value for the cluster center. We
can see why this isn’t the best way of doing things by looking at the image below. On the left-hand
side, it looks quite obvious to the human eye that there are two circular clusters with different radius’
centered at the same mean. K-Means can’t handle this because the mean values of the clusters are
very close together. K-Means also fails in cases where the clusters are not circular, again as a result of
using the mean as cluster center.

Gaussian Mixture Models (GMMs) give us more flexibility than K-Means. With GMMs we assume that
the data points are Gaussian distributed; this is a less restrictive assumption than saying they are
circular by using the mean. That way, we have two parameters to describe the shape of the clusters:
the mean and the standard deviation! Taking an example in two dimensions, this means that the
clusters can take any kind of elliptical shape (since we have a standard deviation in both the x and y
directions). Thus, each Gaussian distribution is assigned to a single cluster.

To find the parameters of the Gaussian for each cluster (e.g. the mean and standard deviation), we
will use an optimization algorithm called Expectation–Maximization (EM). Take a look at the graphic
below as an illustration of the Gaussians being fitted to the clusters. Then we can proceed with the
process of Expectation–Maximization clustering using GMMs.
1. We begin by selecting the number of clusters (like K-Means does) and randomly initializing
the Gaussian distribution parameters for each cluster. One can try to provide a good
guesstimate for the initial parameters by taking a quick look at the data too. This isn’t 100%
necessary as the Gaussians start out as very poor but are quickly optimized.
2. Given these Gaussian distributions for each cluster, compute the probability that each data
point belongs to a particular cluster. The closer a point is to the Gaussian’s center, the more
likely it belongs to that cluster. This should make intuitive sense since with a Gaussian
distribution we are assuming that most of the data lies closer to the center of the cluster.
3. Based on these probabilities, we compute a new set of parameters for the Gaussian
distributions such that we maximize the probabilities of data points within the clusters. We
compute these new parameters using a weighted sum of the data point positions, where the
weights are the probabilities of the data point belonging in that particular cluster.
4. Steps 2 and 3 are repeated iteratively until convergence, where the distributions don’t change
much from iteration to iteration.

There are 2 key advantages to using GMMs. Firstly GMMs are a lot more flexible in terms of cluster
covariance than K-Means; due to the standard deviation parameter, the clusters can take on any
ellipse shape, rather than being restricted to circles. K-Means is actually a special case of GMM in
which each cluster’s covariance along all dimensions approaches 0. Secondly, since GMMs use
probabilities, they can have multiple clusters per data point. So if a data point is in the middle of two
overlapping clusters, we can simply define its class by saying it belongs X-percent to class 1 and Y-
percent to class 2. i.e. GMMs support mixed membership.
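A minimal sketch of EM clustering with Gaussian mixtures in R, assuming the mclust package is installed:

library(mclust)
gmm <- Mclust(iris[, 1:4], G = 3)   # fit a 3-component Gaussian mixture with EM
head(gmm$z)                         # soft assignments: probability of each cluster per point
table(gmm$classification)           # hard assignment to the most probable cluster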

5.3.4 Hierarchical Clustering

Hierarchical clustering, as the name suggests is an algorithm that builds hierarchy of clusters. This
algorithm starts with all the data points assigned to a cluster of their own. Then two nearest clusters
are merged into the same cluster. In the end, this algorithm terminates when there is only a single
cluster left.

The results of hierarchical clustering can be shown using a dendrogram. The dendrogram can be
interpreted as:
At the bottom, we start with 25 data points, each assigned to separate clusters. Two closest clusters
are then merged till we have just one cluster at the top. The height in the dendrogram at which two
clusters are merged represents the distance between two clusters in the data space.

The no. of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice of the no. of clusters is the no. of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a
cluster.

In the above example, the best choice of no. of clusters will be 4 as the red horizontal line in the
dendrogram below covers maximum vertical distance AB.

Two important things that you should know about hierarchical clustering are:

 This algorithm has been implemented above using bottom up approach. It is also possible
to follow top-down approach starting with all data points assigned in the same cluster and
recursively performing splits till each data point is assigned a separate cluster.

 The decision of merging two clusters is taken on the basis of closeness of these clusters.
There are multiple metrics for deciding the closeness of two clusters :
o Euclidean distance: ||a-b||2 = √(Σ(ai-bi)²)
o Squared Euclidean distance: ||a-b||2² = Σ(ai-bi)²
o Manhattan distance: ||a-b||1 = Σ|ai-bi|
o Maximum distance: ||a-b||∞ = maxi|ai-bi|
o Mahalanobis distance: √((a-b)ᵀ S⁻¹ (a-b)) {where S is the covariance matrix}

Difference between K Means and Hierarchical clustering

 Hierarchical clustering can’t handle big data well but K Means clustering can. This is because
the time complexity of K Means is linear i.e. O(n) while that of hierarchical clustering is
quadratic i.e. O(n²).
 In K Means clustering, since we start with random choice of clusters, the results produced by
running the algorithm multiple times might differ. While results are reproducible in
Hierarchical clustering.
 K Means is found to work well when the shape of the clusters is hyper spherical (like circle in
2D, sphere in 3D).
 K Means clustering requires prior knowledge of K i.e. no. of clusters you want to divide your
data into. But, you can stop at whatever number of clusters you find appropriate in
hierarchical clustering by interpreting the dendrogram.

5.3.5 Agglomerative Hierarchical Clustering

Hierarchical clustering algorithms fall into 2 categories: top-down or bottom-up. Bottom-up algorithms treat each data point as a single cluster at the outset and then successively merge (or
agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all
data points. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering
or HAC. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the
unique cluster that gathers all the samples, the leaves being the clusters with only one sample.

1. We begin by treating each data point as a single cluster i.e. if there are X data points in our
dataset then we have X clusters. We then select a distance metric that measures the distance
between two clusters. As an example, we will use average linkage which defines the distance
between two clusters to be the average distance between data points in the first cluster and
data points in the second cluster.
2. On each iteration, we combine two clusters into one. The two clusters to be combined are
selected as those with the smallest average linkage. i.e. according to our selected distance
metric, these two clusters have the smallest distance between each other and therefore are
the most similar and should be combined.
3. Step 2 is repeated until we reach the root of the tree i.e. we only have one cluster which
contains all data points. In this way we can select how many clusters we want in the end,
simply by choosing when to stop combining the clusters i.e. when we stop building the tree!

Hierarchical clustering does not require us to specify the number of clusters and we can even select
which number of clusters looks best since we are building a tree. Additionally, the algorithm is not
sensitive to the choice of distance metric; all of them tend to work equally well whereas with other
clustering algorithms, the choice of distance metric is critical. A particularly good use case of
hierarchical clustering methods is when the underlying data has a hierarchical structure and you want
to recover the hierarchy; other clustering algorithms can’t do this. These advantages of hierarchical
clustering come at the cost of lower efficiency, as it has a time complexity of O(n³), unlike the linear
complexity of K-Means and GMM.

5.4 CLUSTERING EXAMPLE (WITH R)

Let’s check out the impact of clustering on the accuracy of our model for the classification problem
using 3000 observations of stock data with 100 predictors to predict whether the stock will go up or down, using R. This dataset contains 100 independent variables from X1 to X100 representing the profile
of a stock and one outcome variable Y with two levels: 1 for rise in stock price and -1 for drop in stock
price. The dataset is available here for download:

https://drive.google.com/file/d/0ByPBn4rtMQ5HaVFITnBObXdtVUU/view

Let’s first try applying randomforest without clustering.


#loading required libraries

library('randomForest')

library('Metrics')

#set random seed

set.seed(101)

#loading dataset

data<-read.csv("train.csv",stringsAsFactors= T)

#checking dimensions of data

dim(data)

## [1] 3000 101

#specifying outcome variable as factor

data$Y<-as.factor(data$Y)

#dividing the dataset into train and test

train<-data[1:2000,]

test<-data[2001:3000,]

#applying randomForest

model_rf<-randomForest(Y~.,data=train)

preds<-predict(object=model_rf,test[,-101])

table(preds)

## preds

## -1 1

## 453 547

#checking accuracy

auc(preds,test$Y)

## [1] 0.4522703

So, the accuracy we get is 0.45. Now let’s create five clusters based on values of independent variables
using k-means clustering and reapply randomforest.

#combining train and test

all<-rbind(train,test)

#creating 5 clusters using K- means clustering

Cluster <- kmeans(all[,-101], 5)


#adding clusters as independent variable to the dataset.

all$cluster<-as.factor(Cluster$cluster)

#dividing the dataset into train and test

train<-all[1:2000,]

test<-all[2001:3000,]

#applying randomforest

model_rf<-randomForest(Y~.,data=train)

preds2<-predict(object=model_rf,test[,-101])

table(preds2)

## preds2

## -1 1

## 548 452

auc(preds2,test$Y)

## [1] 0.5345908

In the above example, even though the final accuracy is still poor, clustering has given our model a significant boost from an accuracy of 0.45 to slightly above 0.53. This shows that clustering can indeed
be helpful for supervised machine learning tasks.

Check your Progress 1

State True or False

1. Mean shift clustering is a sliding-window-based algorithm that attempts to find dense areas
of data points.
2. The main advantage of DBSCAN is that it performs well when the clusters are of varying
density.
3. GMMs are a lot more flexible in terms of cluster covariance than K-Means.

Activity 1

Collect or use any existing data and apply different clustering algorithms (discussed in this unit) on it.
Also, list the findings.

Summary

In this unit, we have discussed the various ways of performing clustering. Clustering finds applications for unsupervised learning in a large no. of domains. You also saw how you can improve the accuracy
of your supervised machine learning algorithm using clustering. Although clustering is easy to
implement, you need to take care of some important aspects like treating outliers in your data and
making sure each cluster has sufficient population.
Keywords

 Centroid: It is the average position of all the points of an object. In machine learning, it is
defined as the center of a cluster as determined by a k-means or k-median algorithm. For
instance, if k is 3, then the k-means or k-median algorithm finds 3 centroids.
 Clustering: It is a machine learning technique that involves the grouping of data points.
 Normal distribution: Also known as the Gaussian distribution, it is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean.
 Classification model: A type of machine learning model for distinguishing among two or more
discrete classes.

Self-Assessment Questions

1. State the difference between K-means and hierarchical clustering algorithm.


2. List the applications of clustering algorithms.
3. Explain any 2 types of clustering algorithms.

Answers to Check Your Progress

Check your Progress 1

State True or False

1. True
2. False
3. True

Suggested Reading

1. Introduction to Machine Learning by Ethem Alpaydin


2. Machine Learning For Beginners: Machine Learning Basics for Absolute Beginners. Learn What ML Is and Why It Matters. Notes on Artificial Intelligence and Deep Learning are also Included, by Scott Chesterton
3. Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz, Shai
Ben-David
4. Machine Learning For Dummies by John Paul Mueller, Luca Massaron
5. Molnar, Christoph. "Interpretable machine learning. A Guide for Making Black Box Models
Explainable", 2019. https://christophm.github.io/interpretable-ml-book/.

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
ShareAlike 4.0 International (CC BY-SA 4.0) as requested by the work’s creator or licensees. This license
is available at https://creativecommons.org/licenses/by-sa/4.0/
Unit 6

R Markdown, Knitr, RPubs

6.1 Introduction

6.2 R Markdown

6.3 Knitr

6.4 RPubs

Summary

Keywords

Self-Assessment Questions

Answers to Check Your Progress

Suggested Reading
Objectives:

After going through this unit, you will be able to:

 Understand the benefits of R Markdown and Knitr


 Understand how to publish documents on RPubs

6.1 INTRODUCTION

Literate programming is the basic idea behind dynamic documents and was proposed by Donald Knuth in 1984. Originally, it was a way of mixing the source code and documentation of a piece of software together. Today, we will create dynamic documents in which program or analysis code is run to produce output (e.g. tables, plots, models) that is then explained through narrative writing.

The 3 steps of Literate Programming:

1. Parse the source document and separate code from narratives.


2. Execute source code and return results.
3. Mix results from the source code with the original narratives.

If we use literate programming, we could also:

1. Tangle: Extract the source code out of the document.


2. Weave: Execute the code to get the compiled results.
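In R, tangling and weaving map naturally onto functions from the knitr and rmarkdown packages. A minimal sketch (the file name report.Rmd is only a placeholder):

#Tangle: extract just the R code from the document into an R script

knitr::purl("report.Rmd")

#Weave: execute the code and compile the finished document

rmarkdown::render("report.Rmd")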

6.2 R Markdown

To fully understand RMarkdown, we first need to cover Markdown, which is a system for writing
simple, readable text that is easily converted to HTML. Markdown essentially is two things:

 A plain text formatting syntax


 A software tool written in Perl.
o Converts the plain text formatting into HTML.

HTML code:

<body>
<section>
<h1>Rock Climbing Packing List</h1>
<ul>
<li>Climbing Shoes</li>
<li>Harness</li>
<li>Backpack</li>
<li>Rope</li>
<li>Belayer</li>
</ul>
</section>
</body>

Its equivalent code in Markdown:

# Rock Climbing Packing List

* Climbing Shoes
* Harness
* Backpack
* Rope
* Belayer

RMarkdown is a variant of Markdown that makes it easy to create dynamic documents, presentations and reports within RStudio. It has embedded R code chunks to be used with knitr, making it easy to create reproducible (web-based) reports that can be automatically regenerated when the underlying code is modified.

 RMarkdown lets you combine Markdown with images, links, tables, LaTeX, and actual code.
 RStudio makes creating documents from RMarkdown easy
 RStudio (like R) is free and runs on any operating system.

RMarkdown renders many different types of files including:

 HTML
 PDF
 Markdown
 Microsoft Word
 Presentations:
o Fancy HTML5 presentations:
 ioslides
 Slidy
 Slidify
o PDF Presentations:
 Beamer
o Handouts:
 Tufte Handouts
 HTML R Package Vignettes
 Even Entire Websites

R Markdown is a convenient tool for reproducible and dynamic reports. While it was created for R, it
now accepts many programming languages. For simplicity, we will only work with R today.

 Execute code in a few ways:


o Inline Code: Brief code that takes place during the written part of the document.
o Code Chunks: Parts of the document that includes several lines of program or analysis
code. It may render a plot or table, calculate summary statistics, load packages, etc.
 It is easy to:
o Embed images.
o Learn Markdown syntax.
o Include LaTeX equations.
o Include interactive tables.
o Use version control with Git.
 Even easier to share and collaborate on analyses, projects and publications!
o Add external links - Rmarkdown even understands some html code!
o Make beautifully formatted documents.
 Do not need to worry about page breaks or figure placement.
 Consolidate your code and write up into a single file:
o Slideshows, PDFs, html documents, word files

Simple Workflow:

Briefly, to make a report:

1. Open a .Rmd file.


a. Create a YAML header
2. Write the content with RMarkdown syntax.
3. Embed the R code in code chunks or inline code.
4. Render the document output.

Workflow for creating a report

Overview of the steps RMarkdown takes to get to the rendered document:

1. Create .Rmd report that includes R code chunks and markdown narratives (as indicated in
steps above.).
2. Give the .Rmd file to knitr to execute the R code chunks and create a new .md file.
o Knitr is a package within R that allows the integration of R code into rendered
RMarkdown documents such as HTML, latex, pdf, word, among other document
types.
3. Give the .md file to pandoc, which will create the final rendered document (e.g. html,
Microsoft word, pdf, etc.).
o Pandoc is a universal document converter and enables the conversion of one
document type (in this case: .Rmd) to another (in this case: HTML)
While this may seem complicated, we can simply hit the “Knit” button at the top of the source editor,

or we can run the following code:

rmarkdown::render("RMarkdown_Lesson.Rmd", "html_document")

6.2.1 Creating a .Rmd File

Let’s start working with RMarkdown!

1. In the menu bar, click File -> New File -> RMarkdown
a. Or simply click on the green plus sign in the top left corner of RStudio.

2. The window below will pop up.


a. Inside of this window, choose the type of output by selecting the radio buttons. Note:
this output can be easily changed later!
3. Click OK

6.2.2 YAML Headers

YAML stands for “YAML Ain’t Markup Language” and is basically a nested list structure that includes
the metadata of the document. It is enclosed between two lines of three dashes --- and as we saw
above is automatically written by RStudio. A simple example:

---

title: "Analysis Report"

author: "Marian L. Schmidt"

date: "May 11th, 2016"

output: html_document

---

The above example will create an HTML document. However, the following options are also available.

 html_document
 pdf_document
 word_document
 beamer_presentation (pdf slideshow)
 ioslides_presentation (HTML slideshow)
 and more…

In this unit, we will be focused on HTML files. However, please feel free to play around with creating Word and pdf documents. Presentation slides take on a slightly different syntax (e.g. to specify when one slide ends and the next one starts), so there is a bit of markdown syntax specific to presentations.
Markdown Basics

Check out the RMarkdown Reference Guide (https://rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf).

To list a few from the RMarkdown Cheatsheet:
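For instance (standard Markdown syntax, shown here as plain source examples):

*italics* and **bold** for emphasis

`code` for words formatted like code

# Header 1, ## Header 2, ### Header 3 for headers

[link text](https://www.rstudio.com) for hyperlinks

* item for an unordered list, 1. item for an ordered list

> quoted text for a block quote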


Helpful Hints:

 End a line with two spaces to start a new paragraph.


 Words formatted like code should be surrounded by back ticks on both sides: `
 To make something superscript surround it with ^ on each side. Superscript was created by
typing Super^script^.
 Equations can be written inline using $ and centered as a blocked equation using $$. For example, $E=mc^2$ is inline, while the following is a blocked equation:

$$E=mc^2$$

o Note: To make it superscript with $ and $$ a ^ is needed before each alphanumeric


that is superscript.
o Other fun math stuff:
 Square root: $\sqrt{b}$ renders as √b
 Fractions: $\frac{1}{2}$ renders as 1/2
 Fractional equations: $f(x)=\frac{P(x)}{Q(x)}$ renders as f(x)=P(x)/Q(x)
 Binomial coefficients: $\binom{k}{n}$
 Integrals: $$\int_{a}^{b} x^2 dx$$

o ShareLaTeX is an awesome source for LaTeX code.
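A few more LaTeX constructs render the same way inside RMarkdown (illustrative examples only):

 Greek letters: $\alpha$, $\beta$, $\sigma^2$
 Summation: $\sum_{i=1}^{n} x_i$
 Subscripts: $x_{ij}$
 Bars and hats: $\bar{x}$, $\hat{\beta}$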


6.2.3 Embed Code

There are 2 ways to embed code within an RMarkdown document.

 Inline Code: Brief code that takes place during the written part of the document.
 Code Chunks: Parts of the document that includes several lines of program or analysis code.
It may render a plot or table, calculate summary statistics, load packages, etc.
6.2.3.1 Inline R Code

Inline code is created by using a back tick (`), the letter r, the R code to evaluate, and then a closing back tick.

For example: 2^11^ is 2048, where the value 2048 was produced by the inline code `r 2^11`.

Imagine that you’re reporting a p-value and you do not want to go back and edit it by hand every time the statistical test is re-run. With inline code, the reported value (here, a p-value of 0.0045) is filled in automatically.

This is really helpful when writing up the results section of a paper. For example, you may have run a number of statistical tests for your scientific questions, and this would be a way to have R save each value in a variable name.

For example: Is the gas mileage of automatic versus manual transmissions significantly different within the mtcars data set? (mtcars is one of R’s built-in datasets; you can view it interactively with DT::datatable(mtcars).)

mpg_auto <- mtcars[mtcars$am == 0,]$mpg # automatic transmission mileage

mpg_manual <- mtcars[mtcars$am == 1,]$mpg # manual transmission mileage

transmission_ttest <- t.test(mpg_auto, mpg_manual)

To extract the p-value we can type transmission_ttest$p.value within inline code.

The p-value is 0.0013736.
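In the .Rmd source, that sentence would be written with inline code, for example (the rounding is optional and just keeps the output tidy):

The p-value is `r round(transmission_ttest$p.value, 7)`.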

6.2.3.2 R Code Chunks

R code chunks can be used to render R output into documents or to display code for illustration.

The Anatomy of a code chunk:

To insert an R code chunk, you can type it manually: ```{r} on one line, your code on the lines below, and ``` on a closing line. You can also press the Insert a new code chunk button in RStudio or use the keyboard shortcut. The code chunk input and output is then displayed as follows:

n = 10
seq(n)
## [1] 1 2 3 4 5 6 7 8 9 10

Chunk options

The initial line in a code chunk may include various options. For example, echo=FALSE indicates that
the code will not be shown in the final document (though any results/output would still be displayed).

```{r chunk_name, echo=FALSE}

x <- rnorm(100)

y <- 2*x + rnorm(100)

cor(x, y)

```

You use results="hide" to hide the results/output (but here the code would still be displayed).

```{r chunk_name, results="hide"}

x <- rnorm(100)

y <- 2*x + rnorm(100)

cor(x, y)

```

You use include=FALSE to have the chunk evaluated, but neither the code nor its output displayed.

```{r chunk_name, include=FALSE}

x <- rnorm(100)

y <- 2*x + rnorm(100)

cor(x, y)

```

If we are writing a report for a collaborator, we will often use include=FALSE to suppress all of the
code and largely just include figures.

For figures, you’ll want to use options like fig.width and fig.height. For example:

```{r scatterplot, fig.width=8, fig.height=6}

plot(x, y)

```

Note that if include=FALSE, all of the code, results, and figures will be suppressed. If include=TRUE and
results="hide", the results will be hidden but figures will still be shown. To hide the figures, use
fig.show="hide".
There are lots of different possible “chunk options”. Each must be real R code, as R will be used to
evaluate them. So results=hide is wrong; you need results="hide".
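If you find yourself repeating the same options in many chunks, they can also be set once for the whole document in a setup chunk near the top, using knitr::opts_chunk$set(). A minimal sketch (the option values are illustrative):

```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = FALSE, fig.width = 8, fig.height = 6)

```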

6.3 Knitr

The cut-and-paste approach to report production is tedious, slow, and error-prone. It also harms reproducible research, because it makes results inconvenient to reproduce.

Knitr is an R package that integrates computing and reporting. By incorporating code into text
documents, the analysis, results and discussion are all in one place. Files can then be processed into a
diverse array of document formats, including the important ones for collaborative science: pdfs, Word
documents, slide presentations, and web pages.

This is important for reproducible research because you can create reports, papers, presentations in
which every table, figure, and inline result is generated by code that is tied to the document itself. It
makes life easier when it comes time to make small updates to an analysis, and more importantly, the
code becomes more understandable by virtue of being directly related to the text description.

The importance of text

There are many advantages to creating scientific content using simple text documents. For one, they
are future-proof. Secondly, content tracking tools like git and GitHub work wonderfully with text files.
It is dead-easy to view line-by-line changes in a text file.

Tools like knitr, rmarkdown, and pandoc do the hard work of translating your text files into
“production” documents, like beautifully typeset pdfs, smooth presentations, and Word documents
that your collaborators can’t live without. Creating the base in a text file allows you to focus on the
content and not obsess over the details like formatting and figure placement.

How to use knitr

The basic idea is that your text documents are interrupted by chunks of code that are identified in a
special way. These chunks are evaluated by an R process in the order that they appear. R objects that
are created by chunks remain in the environment for subsequent chunks. The code in the chunks is
displayed and formatted in the document. The results printed by the code are also incorporated into
the document as figures, tables, or other objects. Chunks can stand alone, or be included inline with
text. Many options exist to format the code and the results, and the options can be extended in
countless ways.
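For example, an object created in one chunk remains available to later chunks, so a model fitted early in the document can be summarised further down (a minimal sketch):

```{r fit-model}

fit <- lm(mpg ~ wt, data = mtcars)

```

Some narrative discussion can go here, and a later chunk can then reuse fit:

```{r show-coefficients}

coef(fit)

```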

How exactly to use knitr depends on the input document format. First we will describe how to use it
with markdown.

Knitr with markdown

When we say markdown we are referring to the plain text formatting syntax, as opposed to the
software. Markdown is a syntax closely related to html (discussed in previous section). It is semantic,
like html, but it is much less verbose. Markdown documents can in fact stand on their own as readable
text documents without being rendered. Semantic means that elements of the document are
described by what they represent, as opposed to how they should look. Thus, for a title, you indicate that this text is the title, as opposed to saying that this text should be in a large, bold font.

Importantly, the markdown can stand on its own and continue to be readable even though it is a simple text file, whereas the generated HTML or the equivalent LaTeX it produces is far more verbose.


Thus markdown has the dual advantages of being readable on its own, and having associated tools to
create other document formats from it. Math in markdown is just like math in Latex.

Incorporating code chunks

In markdown, the start of a code chunk is indicated by three backticks and the end of a code chunk is
indicated by three backticks. At the start of the chunk, you tell knitr what type of code it is, give the
chunk a name, and other options:

```{r my-first-chunk, results='asis'}


## code goes in here

```

Inline code is similar, but uses single backticks instead. Inline code does not have names or options. For example: `r rnorm(10)`.

Here’s an example of raw output using the mtcars dataset:

```{r mtcars-example}

lm(mpg ~ hp + wt, data = mtcars)

```

##

## Call:

## lm(formula = mpg ~ hp + wt, data = mtcars)

##

## Coefficients:

## (Intercept) hp wt

## 37.2273 -0.0318 -3.8778

And here’s a plot

```{r mt-plot}

library(ggplot2)

ggplot(mtcars, aes(y = mpg, x = wt, size = hp)) + geom_point() + stat_smooth(method = "lm", se = FALSE)

```
The concept is very simple. Anything you want to do in R is incorporated into your document, the
results alongside the code. The important details to learn are methods of controlling the output. This
means making nice looking tables, decent figures, and formatting inline results. We will cover these
topics next.

Controlling R output

1. Tables

When outputting tables in knitr, it is important to use the option results = 'asis'. There are several
options for formatting tables in R. The knitr package includes a function called kable that makes
basic knitr tables. There are options to control the number of digits, whether row names are included
or not, column alignment, and other options that depend on the output type.

```{r kable, results = 'asis'}

kable(head(mtcars), digits = 2, align = c(rep("l", 4), rep("c", 4), rep("r", 4)))

```

| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|-------------------|------|-----|------|-----|------|------|-------|----|----|------|------|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.88 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.21 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |

For finer control, use the xtable package. There are tons of options (see the help file), and the interface
is a bit clunky. For instance some options are passed to xtable, while others are passed to print.xtable.
The key one for markdown is to use print(xtable(x), type = "html").

```{r xtable, results = 'asis'}


library(xtable)

print(xtable(head(mtcars)), type = "html")

```

| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|-------------------|-------|------|--------|--------|------|------|-------|------|------|------|------|
| Mazda RX4 | 21.00 | 6.00 | 160.00 | 110.00 | 3.90 | 2.62 | 16.46 | 0.00 | 1.00 | 4.00 | 4.00 |
| Mazda RX4 Wag | 21.00 | 6.00 | 160.00 | 110.00 | 3.90 | 2.88 | 17.02 | 0.00 | 1.00 | 4.00 | 4.00 |
| Datsun 710 | 22.80 | 4.00 | 108.00 | 93.00 | 3.85 | 2.32 | 18.61 | 1.00 | 1.00 | 4.00 | 1.00 |
| Hornet 4 Drive | 21.40 | 6.00 | 258.00 | 110.00 | 3.08 | 3.21 | 19.44 | 1.00 | 0.00 | 3.00 | 1.00 |
| Hornet Sportabout | 18.70 | 8.00 | 360.00 | 175.00 | 3.15 | 3.44 | 17.02 | 0.00 | 0.00 | 3.00 | 2.00 |
| Valiant | 18.10 | 6.00 | 225.00 | 105.00 | 2.76 | 3.46 | 20.22 | 1.00 | 0.00 | 3.00 | 1.00 |

The stargazer package creates good-looking tables with minimal effort. It is especially useful for
summarizing a series of regression models. See the help files for all the available options.

```{r star, results = 'asis', warning=FALSE, message=FALSE}

library(stargazer, quietly = TRUE)

fit1 <- lm(mpg ~ wt, mtcars)

fit2 <- lm(mpg ~ wt + hp, mtcars)

fit3 <- lm(mpg ~ wt + hp + disp, mtcars)

stargazer(fit1, fit2, fit3, type = 'html')

```

Dependent variable:

mpg

(1) (2) (3)

wt -5.344*** -3.878*** -3.801***

(0.559) (0.633) (1.066)

hp -0.032*** -0.031**

(0.009) (0.011)

disp -0.001

(0.010)
Constant 37.280*** 37.230*** 37.110***

(1.878) (1.599) (2.111)

Observations 32 32 32

R2 0.753 0.827 0.827

Adjusted R2 0.745 0.815 0.808

Residual Std. Error 3.046 (df = 30) 2.593 (df = 29) 2.639 (df = 28)

F Statistic 91.380*** (df = 1; 30) 69.210*** (df = 2; 29) 44.570*** (df = 3; 28)

Note: *p<0.1; **p<0.05; ***p<0.01

For long-running computations, a useful chunk option is cache = TRUE. This creates a folder called
cache in your working directory that will store the results of the computation after the first time you
run it. After that, re-knitting the document is fast.
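For instance, a slow chunk might be cached like this (the chunk name and contents are illustrative):

```{r slow-simulation, cache = TRUE}

boot_means <- replicate(10000, mean(sample(mtcars$mpg, replace = TRUE)))

```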

2. Figures

In general, figures will appear in a knit document exactly as they would appear in the R session. There
are several important figure options to be aware of.

 dev, controls the graphics device used to create the figures, for example pdf, png, or jpeg. Check out tikzDevice if you are creating pdf output. The tikzDevice generates LaTeX code from R plots for use in LaTeX documents. That way, all fonts match the main text, and the TeX syntax for mathematics can be used directly in plots. Here are two examples of the power of tikzDevice: http://bit.ly/114GNdP
 fig.path, the directory in which to save the figures.
 fig.width and fig.height, the figure dimensions in inches. These can also be set globally.
 fig.align, the figure alignment: left, right or center.
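For example, a figure chunk combining several of these options might look like the following (the values are illustrative, and ggplot2 is assumed to be loaded as in the earlier plotting chunk):

```{r mpg-scatter, dev = 'png', fig.path = 'figures/', fig.width = 6, fig.height = 4, fig.align = 'center'}

ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

```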

3. Inline results

In papers or reports, we strive to generate every in-text number from code. Anytime we reference a
sample size, report summary statistics, or p-values in the text, paste and sprintf are our friends.
Reporting in this way means no more missed corrections when you have to rerun an analysis, no more
rounding errors, and the comfort that you didn’t mis-type a digit. Here are a couple examples
using paste and sprintf.

paste_meansd <- function(x, digits = 2, na.rm = TRUE){

paste0(round(mean(x, na.rm = na.rm), digits), " (", round(sd(x, na.rm = na.rm), digits), ")")

}
## The mean (sd) of a random sample of normals is `r paste_meansd(rnorm(100))`

The mean (sd) of a random sample of normals is 0.04 (1.01)

sprint_CI95 <- function (mu, se, trans = function(x) x) {

lim <- trans(mu + c(-1.96, 1.96)*se)

sprintf("%.2f (95%% CI: %.2f to %.2f)", mu, lim[1], lim[2])

}

bfit <- lm(hp ~ disp, mtcars)

## The coefficient estimate is `r sprint_CI95(bfit$coeff[2], sqrt(diag(vcov(bfit)))[2])`

The coefficient estimate is 0.44 (95% CI: 0.32 to 0.56).

Extending knitr

Output and rendering can be customized endlessly. knitr is written in R to process chunks, so write
your own functions. These types of functions are called “hooks”. For example, in this document we
used a custom hook to display the code chunks as they appear:

knit_hooks$set(source = function(x, options){

if (!is.null(options$verbatim) && options$verbatim){

opts = gsub(",\\s*verbatim\\s*=\\s*TRUE\\s*", "", options$params.src)

bef = sprintf('\n\n ```{r %s}\n', opts, "\n")

stringr::str_c(

bef,

knitr:::indent_block(paste(x, collapse = '\n'), " "),

"\n ```\n"

)

} else {

stringr::str_c("\n\n```", tolower(options$engine), "\n",

paste(x, collapse = '\n'), "\n```\n\n")

}

})
Output formats

1. Pandoc

Pandoc allows you to take markdown documents and convert them to almost any file format, docx,
pdf, html, and much more. Furthermore, you can even convert markdown to Tex, which comes in
handy for journal submission. Previously, if you wanted html output, you wrote a report in rmarkdown
or rhtml and if you wanted a pdf, you had to start over with a .Rnw file. Now we can seamlessly switch
between formats. The key is in defining what you want in the front matter of your document.

At the top of your markdown file, all of the pandoc parameters are put between --- delimiters. For
example, in this document, I have

---

title: "knitr"

author: Michael Sachs

output:

md_document:

variant: markdown_github

---

This is the front-matter, and it is in YAML format. RStudio passes these options to pandoc to convert the document to the appropriate format after knitting. See all of the possible options here: http://rmarkdown.rstudio.com.

For example, if we want pdf output, keeping the intermediate tex file we would use

---

title: "knitr"

author: Michael Sachs

output:

pdf_document:

keep_tex: true

---

Other common document formats are word_document, html_document, and beamer_presentation.


You can include a bibliography in a variety of formats, including .bib, with the bibliography:
myfile.bib option. Include citations with the @bibkey syntax.
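As a sketch (the file name refs.bib and the citation key are placeholders), the front matter and an in-text citation might look like this:

---

output: html_document

bibliography: refs.bib

---

Smith and colleagues reported a similar effect [@smith2015].

Pandoc then appends a formatted reference list to the end of the rendered document.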

To recap, this diagram illustrates typical workflow. We start with an .Rmd, an r-markdown document
with code, background, and descriptions. knitr does the work to evaluate the code, putting the results
in a markdown document, which can then be converted to a variety of formats with pandoc. To extract
the R code from the original document we use the knitr::purl function. We can then easily incorporate
the code chunks into other document types, like presentations.
These tools are powerful and useful for all of the beautiful reports and variety of documents you can
create. More important, however, are the plain-text files that generate the reports. Writing code along
with prose describing the analysis gives others and our future selves a better and clearer
understanding of the analysis itself. knitr makes it easy to write analysis code for people, not
computers.

6.4 RPubs

Once you make a beautiful dynamic document, you may wish to share it with others. One option to share it with the world is to host it on RPubs. RStudio makes this very easy. Do the following:

1. Create your awesome .Rmd document.


2. Click the Knit button to render your HTML document to be published.
3. In the top right corner of the preview window, click the publish button and follow the
directions. Note: You will need to create an RPubs profile.
4. Once you have a profile you can choose the following:
a. The title of the document.
b. A description of the document.
c. The URL in which the website will be hosted. (Note: The beginning of the URL will be:
www.rpubs.com/your_username/name_of_your_choice)

Updating RPubs

If you make some changes to your document, it is very easy to update the webpage. Once you have rendered your edited document, click the publish button in the top right corner of the preview window again. The edited document will be available at the same URL as the original document.

Check your Progress 1

Match the following

1. Inline code a. helpful in sharing dynamic document with the world


2. Code Chunks b. allows to convert markdown documents into almost any file format
3. Pandoc c. parts of the document that includes several lines of program or analysis
code
4. RPubs d. brief code that takes place during the written part of the document

Activity 1

Install R Markdown, Knitr and RPubs and work on it as per the steps given in this unit.

Summary

RMarkdown is a variant of Markdown that makes it easy to create dynamic documents, presentations and reports within RStudio. It has embedded R code chunks to be used with knitr, making it easy to create reproducible (web-based) reports that can be automatically regenerated when the underlying code is modified.

YAML stands for “YAML Ain’t Markup Language” and is basically a nested list structure that includes
the metadata of the document. There are 2 ways to embed code within an RMarkdown document:
Inline Code and Code Chunks. knitr is an R package that integrates computing and reporting. By
incorporating code into text documents, the analysis, results and discussion are all in one place.
Pandoc allows you to take markdown documents and convert them to almost any file format, docx,
pdf, html, and much more. Furthermore, you can even convert markdown to Tex, which comes in
handy for journal submission.

Keywords

 YAML: It is a human-readable data-serialization language. It is commonly used for


configuration files and in applications where data is being stored or transmitted.
 Knitr: knitr is an engine for dynamic report generation with R. It is a package in the statistical
programming language R that enables integration of R code into LaTeX, LyX, HTML,
Markdown, AsciiDoc, and reStructuredText documents.
 RPubs: It is a great platform for easy and free publishing of HTML documents generated
from RMarkdown and written in RStudio.
 Dynamic Document: It is a document that is continually edited and updated.

Self-Assessment Questions

1. Explain the purpose of R Markdown.


2. State the advantages of knitr.
3. Write a short note on Pandoc.

Answers to Check Your Progress


Check your Progress 1
Match the following
1. - d
2. - c
3. - b
4. - a
Suggested Reading

1. Xie Y. (2015). Dynamic documents with R and knitr. 2nd ed. Chapman; Hall/CRC: Boca Raton,
Florida.
2. R Markdown: The Definitive Guide, by Yihui Xie, J.J. Allaire, Garrett Grolemund
3. Dynamic Documents with R and knitr, by Yihui Xie
4. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, by Hadley Wickham,
Garrett Grolemund

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
ShareAlike 4.0 International (CC BY-SA 4.0) as requested by the work’s creator or licensees. This license
is available at https://creativecommons.org/licenses/by-sa/4.0/
Unit 7

ggplot2

7.1 Introduction

7.2 Introduction to ggplot2

7.3 Customizing the look and feel

7.4 ggplot2 Visualizations - The Master List

Summary

Keywords

Self-Assessment Questions

Answers to Check Your Progress

Suggested Reading
Objectives

After going through this unit, you will be able to:

 Understand ggplot syntax and its use


 Implement graphs using ggplot

7.1 Introduction

The most elegant and aesthetically pleasing graphics framework available in R is ggplot2. It has a nicely planned structure to it. This unit focuses on exposing the underlying structure you can use to make any ggplot. It covers the basics of constructing simple ggplots, modifying their components and aesthetics, and customizing the look and feel.

7.2 Introduction to ggplot2

7.2.1. Understanding the ggplot Syntax

The syntax for constructing ggplots can be puzzling if you have worked primarily with base graphics. The main difference is that, unlike base graphics, ggplot works with dataframes and not individual vectors. All the data needed to make the plot is typically contained within the dataframe supplied to ggplot() itself, or can be supplied to the respective geoms.
The second noticeable feature is that you can keep enhancing the plot by adding more layers (and
themes) to an existing plot created using the ggplot() function.
Let’s initialize a basic ggplot based on the midwest dataset.
# Setup
options(scipen=999) # turn off scientific notation like 1e+06
library(ggplot2)
data("midwest", package = "ggplot2") # load the data
# midwest <- read.csv("http://goo.gl/G1K41K") # alt source
# Init ggplot
ggplot(midwest, aes(x=area, y=poptotal)) # area and poptotal are columns in 'midwest'
A blank ggplot is drawn. Even though the x and y are specified, there are no points or lines in it. This
is because, ggplot doesn’t assume that you meant a scatterplot or a line chart to be drawn. We have
only told ggplot what dataset to use and what columns should be used for X and Y axis. We haven’t
explicitly asked it to draw any points.
Also note that aes() function is used to specify the X and Y axes. That’s because, any information that
is part of the source dataframe has to be specified inside the aes() function.
7.2.2. How to Make a Simple Scatterplot
Let’s make a scatterplot on top of the blank ggplot by adding points using a geom layer
called geom_point.
library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) + geom_point()

We got a basic scatterplot, where each point represents a county. However, it lacks some basic
components such as the plot title, meaningful axis labels etc. Moreover most of the points are
concentrated on the bottom portion of the plot, which is not so nice. You will see how to rectify these
in upcoming steps.
Like geom_point(), there are many such geom layers. For now, let’s just add a smoothing layer
using geom_smooth(method='lm'). Since the method is set as lm (short for linear model), it draws the
line of best fit.
library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method="lm") # set s
e=FALSE to turnoff confidence bands
plot(g)
The line of best fit is in blue. Can you find out what other method options are available for geom_smooth? You might have noticed that the majority of points lie at the bottom of the chart, which doesn’t really look nice. So, let’s change the Y-axis limits to focus on the lower half.
7.2.3. Adjusting the X and Y axis limits
The X and Y axis limits can be controlled in 2 ways.
Method 1: By deleting the points outside the range
This will change the lines of best fit or smoothing lines as compared to the original data.
This can be done by xlim() and ylim(). You can pass a numeric vector of length 2 (with max and min
values) or just the max and min values itself.
library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method="lm") # set s
e=FALSE to turnoff confidence bands
# Delete the points outside the limits
g + xlim(c(0, 0.1)) + ylim(c(0, 1000000)) # deletes points
# g + xlim(0, 0.1) + ylim(0, 1000000) # deletes points
In this case, the chart was not built from scratch but rather was built on top of g. This is because, the
previous plot was stored as g, a ggplot object, which when called will reproduce the original plot. Using
ggplot, you can add more layers, themes and other settings on top of this plot.
Did you notice that the line of best fit became more horizontal compared to the original plot? This is
because, when using xlim() and ylim(), the points outside the specified range are deleted and will not
be considered while drawing the line of best fit (using geom_smooth(method='lm')). This feature
might come in handy when you wish to know how the line of best fit would change when some
extreme values (or outliers) are removed.
Method 2: Zooming In
The other method is to change the X and Y axis limits by zooming in to the region of
interest without deleting the points. This is done using coord_cartesian().
Let’s store this plot as g1.
library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method="lm") # set s
e=FALSE to turnoff confidence bands
# Zoom in without deleting the points outside the limits.
# As a result, the line of best fit is the same as the original plot.
g1 <- g + coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000)) # zooms in
plot(g1)
Since all points were considered, the line of best fit did not change.

7.2.4 How to Change the Title and Axis Labels

We have stored this as g1. Let’s add the plot title and labels for X and Y axis. This can be done in one
go using the labs() function with title, x and y arguments. Another option is to use
the ggtitle(), xlab() and ylab().
library(ggplot2)
g <- ggplot(midwest, aes(x=area, y=poptotal)) + geom_point() + geom_smooth(method="lm") # set s
e=FALSE to turnoff confidence bands
g1 <- g + coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000)) # zooms in
# Add Title and Labels
g1 + labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", cap
tion="Midwest Demographics")
# or
g1 + ggtitle("Area Vs Population", subtitle="From midwest dataset") + xlab("Area") + ylab("Populatio
n")
Excellent! So here is the full function call.
# Full Plot call
library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point() +
geom_smooth(method="lm") +
coord_cartesian(xlim=c(0,0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", captio
n="Midwest Demographics")
7.2.5 How to Change the Color and Size of Points
How to Change the Color and Size to Static?
We can change the aesthetics of a geom layer by modifying the respective geoms. Let’s change the
color of the points and the line to a static value.
library(ggplot2)
ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(col="steelblue", size=3) + # Set static color and size for points
geom_smooth(method="lm", col="firebrick") + # change the color of line
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", captio
n="Midwest Demographics")

How to Change the Color to Reflect Categories in another Column?


Suppose we want the color to change based on another column in the source dataset (midwest); it must be specified inside the aes() function.
library(ggplot2)
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state), size=3) + # Set color to vary based on state categories.
geom_smooth(method="lm", col="firebrick", size=2) +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", captio
n="Midwest Demographics")
plot(gg)

Now each point is colored based on the state it belongs to, because of aes(col=state). Not just color,
but size, shape, stroke (thickness of boundary) and fill (fill color) can be used to discriminate
groupings. As an added benefit, the legend is added automatically. If needed, it can be removed by
setting the legend.position to None from within a theme() function.
gg + theme(legend.position="None") # remove legend
Also, you can change the color palette entirely.
gg + scale_colour_brewer(palette = "Set1") # change color palette
More of such palettes can be found in the RColorBrewer package
library(RColorBrewer)
head(brewer.pal.info, 10) # show 10 palettes
#> maxcolors category colorblind
#> BrBG 11 div TRUE
#> PiYG 11 div TRUE
#> PRGn 11 div TRUE
#> PuOr 11 div TRUE
#> RdBu 11 div TRUE
#> RdGy 11 div FALSE
#> RdYlBu 11 div TRUE
#> RdYlGn 11 div FALSE
#> Spectral 11 div FALSE
#> Accent 8 qual FALSE

7.2.6 How to Change the X Axis Texts and Ticks Location


How to Change the X and Y Axis Text and its Location?
Now let’s see how to change the X and Y axis text and its location. This involves two
aspects: breaks and labels.

1. Step 1: Set the breaks

The breaks should be of the same scale as the X axis variable. Note that we are
using scale_x_continuous because, the X axis variable is a continuous variable. Had it been a
date variable, scale_x_date could be used. Like scale_x_continuous() an
equivalent scale_y_continuous() is available for Y axis.

library(ggplot2)
# Base plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state), size=3) + # Set color to vary based on state categories.
geom_smooth(method="lm", col="firebrick", size=2) +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", captio
n="Midwest Demographics")
# Change breaks
gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01))
2. Step 2: Change the labels

You can optionally change the labels at the axis ticks. labels take a vector of the same length
as breaks.

Let us demonstrate by setting the labels to the letters a through k (though there is no meaning to it in this context).
library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state), size=3) + # Set color to vary based on state categories.
geom_smooth(method="lm", col="firebrick", size=2) +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", captio
n="Midwest Demographics")
# Change breaks + label
gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01), labels = letters[1:11])
If you need to reverse the scale, use scale_x_reverse().
library(ggplot2)
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state), size=3) + # Set color to vary based on state categories.
geom_smooth(method="lm", col="firebrick", size=2) +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", captio
n="Midwest Demographics")
# Reverse X Axis Scale
gg + scale_x_reverse()

How to Write Customized Texts for Axis Labels, by Formatting the Original Values?
Let’s set the breaks for Y axis text as well and format the X and Y axis labels. We have used 2 methods
for formatting labels:
Method 1: Using sprintf() (formatted as a percentage in the example below).
Method 2: Using a custom user-defined function (formatting 1000s to a 1K scale).
Use whichever method feels convenient.
library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state), size=3) + # Set color to vary based on state categories.
geom_smooth(method="lm", col="firebrick", size=2) +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", captio
n="Midwest Demographics")
# Change Axis Texts
gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01), labels = sprintf("%1.2f%%", seq(0, 0.1, 0.01))) +
scale_y_continuous(breaks=seq(0, 1000000, 200000), labels = function(x){paste0(x/1000, 'K')})

How to Customize the Entire Theme in One Shot using Pre-Built Themes?
Finally, instead of changing the theme components individually, we can change the entire theme itself
using pre-built themes. The help page?theme_bw shows all the available built-in themes.
This again is commonly done in a couple of ways:
* Use the theme_set() to set the theme before drawing the ggplot. Note that this setting will affect all
future plots.
* Draw the ggplot and then add the overall theme setting (eg. theme_bw())
library(ggplot2)
# Base plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state), size=3) + # Set color to vary based on state categories.
geom_smooth(method="lm", col="firebrick", size=2) +
coord_cartesian(xlim=c(0, 0.1), ylim=c(0, 1000000)) +
labs(title="Area Vs Population", subtitle="From midwest dataset", y="Population", x="Area", captio
n="Midwest Demographics")
gg <- gg + scale_x_continuous(breaks=seq(0, 0.1, 0.01))
# method 1: Using theme_set()
theme_set(theme_classic()) # not run
gg
# method 2: Adding theme Layer itself.
gg + theme_bw() + labs(subtitle="BW Theme")
gg + theme_classic() + labs(subtitle="Classic Theme")

7.3 Customizing the look and feel

Let’s begin with a scatterplot of Population against Area from midwest dataset. The point’s color and
size vary based on state (categorical) and popdensity (continuous) columns respectively. We have
done something similar in the previous section already.
The below plot has the essential components such as the title, axis labels and legend setup nicely. But
how to modify the looks?
Most of the requirements related to look and feel can be achieved using the theme() function. It
accepts a large number of arguments. Type ?theme in the R console and see for yourself.
# Setups
options(scipen=999)
library(ggplot2)
data("midwest", package = "ggplot2")
theme_set(theme_bw())
# midwest <- read.csv("http://goo.gl/G1K41K") # bkup data source
# Add plot components --------------------------------
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")
# Call plot ------------------------------------------
plot(gg)

The arguments passed to theme() components need to be set using special element_type() functions. They are of 4 major types.

1. element_text(): Since the title, subtitle and captions are textual


items, element_text() function is used to set it.
2. element_line(): Likewise, element_line() is used to modify line-based components such as the axis lines, major and minor grid lines, etc.
3. element_rect(): Modifies rectangle components such as plot and panel background.
4. element_blank(): Turns off displaying the theme item.

Let’s discuss a number of tasks related to changing the plot output, starting with modifying the title
and axis texts.
7.3.1 Adding Plot and Axis Titles
Plot and axis titles and the axis text are part of the plot’s theme. Therefore, it can be modified using
the theme() function. The theme() function accepts one of the four element_type() functions
mentioned above as arguments. Since the plot and axis titles are textual
components, element_text() is used to modify them.
Below, we have changed the size, color, face and line-height. The axis text can be rotated by changing
the angle.
library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")
# Modify theme components -------------------------------------------
gg + theme(plot.title=element_text(size=20,
face="bold",
family="American Typewriter",
color="tomato",
hjust=0.5,
lineheight=1.2), # title
plot.subtitle=element_text(size=15,
family="American Typewriter",
face="bold",
hjust=0.5), # subtitle
plot.caption=element_text(size=15), # caption
axis.title.x=element_text(vjust=10,
size=15), # X axis title
axis.title.y=element_text(size=15), # Y axis title
axis.text.x=element_text(size=10,
angle = 30,
vjust=.5), # X axis text
axis.text.y=element_text(size=10)) # Y axis text
 vjust, controls the vertical spacing between title (or label) and plot.
 hjust, controls the horizontal spacing. Setting it to 0.5 centers the title.
 family, is used to set a new font
 face, sets the font face (“plain”, “italic”, “bold”, “bold.italic”)

Above example covers some of the frequently used theme modifications and the actual list is too long.
So ?theme is the first place you want to look at if you want to change the look and feel of any
component.
7.3.2. Modifying Legend
Whenever your plot’s geom (like points, lines, bars, etc) is set to change the aesthetics
(fill, size, col, shape or stroke) based on another column, as in geom_point(aes(col=state,
size=popdensity)), a legend is automatically drawn.
If you are creating a geom where the aesthetics are static, a legend is not drawn by default. In such cases you might want to create your own legend manually (see the sketch below). The examples that follow are for cases where you have the legend created automatically.
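As a sketch of the manual case: one common trick is to map a constant label inside aes() and then assign the actual colour with scale_color_manual(), which forces ggplot to draw a legend even though the colour itself is static.

library(ggplot2)
#map a constant label so a legend key is drawn, then set its colour manually
ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col="County"), size=2) +
scale_color_manual(name="", values=c("County"="steelblue"))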
How to Change the Legend Title
Let’s now change the legend title. We have two legends, one each for color and size. The size is based
on a continuous variable while the color is based on a categorical (discrete) variable.
There are 3 ways to change the legend title.
Method 1: Using labs()
library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")
gg + labs(color="State", size="Density") # modify legend title
Method 2: Using guides()
library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")
gg <- gg + guides(color=guide_legend("State"), size=guide_legend("Density")) # modify legend title
plot(gg)
Method 3: Using scale_aesthetic_vartype() format

The format of scale_aesthetic_vartype() allows you to turn off the legend for one particular aesthetic,
leaving the rest in place. This can be done just by setting guide=FALSE. For example, if the legend is
for size of points based on a continuous variable, then scale_size_continuous() would be the right
function to use.
Can you guess what function to use if you have a legend for shape and is based on a categorical
variable?
library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")
# Modify Legend
gg + scale_color_discrete(name="State") + scale_size_continuous(name = "Density", guide = FALSE) #
turn off legend for size
How to Change Legend Labels and Point Colors for Categories
This can be done using the respective scale_aesthetic_manual() function. The new legend labels are
supplied as a character vector to the labels argument. If you want to change the color of the
categories, it can be assigned to the values argument as shown in below example.

library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")
gg + scale_color_manual(name="State",
labels = c("Illinois",
"Indiana",
"Michigan",
"Ohio",
"Wisconsin"),
values = c("IL"="blue",
"IN"="red",
"MI"="green",
"OH"="brown",
"WI"="orange"))

Change the Order of Legend


In case you want to show the legend for color (State) before size (Density), it can be done with
the guides() function. The order of the legend has to be set as desired.
If you want to change the position of the labels inside the legend, set it in the required order as seen
in previous example.
library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")
gg + guides(colour = guide_legend(order = 1),
size = guide_legend(order = 2))

How to Style the Legend Title, Text and Key


The styling of the legend title, text, key and the guide can also be adjusted. The legend’s key is a figure-like element, so it has to be set using the element_rect() function.
library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")
gg + theme(legend.title = element_text(size=12, color = "firebrick"),
legend.text = element_text(size=10),
legend.key=element_rect(fill='springgreen')) +
guides(colour = guide_legend(override.aes = list(size=2, stroke=1.5)))
How to Remove the Legend and Change Legend Positions
The legend’s position inside the plot is an aspect of the theme. So it can be modified using
the theme() function. If you want to place the legend inside the plot, you can additionally control the
hinge point of the legend using legend.justification.
The legend.position is the x and y axis position in chart area, where (0, 0) is bottom left of the chart
and (1, 1) is top right. Likewise, legend.justification refers to the hinge point inside the legend.
library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")
# No legend --------------------------------------------------
gg + theme(legend.position="None") + labs(subtitle="No Legend")
# Legend to the left -----------------------------------------
gg + theme(legend.position="left") + labs(subtitle="Legend on the Left")
# legend at the bottom and horizontal ------------------------
gg + theme(legend.position="bottom", legend.box = "horizontal") + labs(subtitle="Legend at Bottom"
)
# legend at bottom-right, inside the plot --------------------
gg + theme(legend.title = element_text(size=12, color = "salmon", face="bold"),
legend.justification=c(1,0),
legend.position=c(0.95, 0.05),
legend.background = element_blank(),
legend.key = element_blank()) +
labs(subtitle="Legend: Bottom-Right Inside the Plot")
# legend at top-left, inside the plot -------------------------
gg + theme(legend.title = element_text(size=12, color = "salmon", face="bold"),
legend.justification=c(0,1),
legend.position=c(0.05, 0.95),
legend.background = element_blank(),
legend.key = element_blank()) +
labs(subtitle="Legend: Top-Left Inside the Plot")
7.3.3. Adding Text, Label and Annotation
How to Add Text and Label around the Points
Let’s try adding some text. We will add text to only those counties that have a population greater than 300K. In order to achieve this, we create another subsetted dataframe (midwest_sub) that contains only the counties that qualify.
Then, draw the geom_text and geom_label with this new dataframe as the data source. This will
ensure that labels (geom_label) are added only for the points contained in the new dataframe.
library(ggplot2)
# Filter required rows.
midwest_sub <- midwest[midwest$poptotal > 300000, ]
midwest_sub$large_county <- ifelse(midwest_sub$poptotal > 300000, midwest_sub$county, "")
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")
# Plot text and label ------------------------------------------------------
gg + geom_text(aes(label=large_county), size=2, data=midwest_sub) + labs(subtitle="With ggplot2::g
eom_text") + theme(legend.position = "None") # text
gg + geom_label(aes(label=large_county), size=2, data=midwest_sub, alpha=0.25) + labs(subtitle="W
ith ggplot2::geom_label") + theme(legend.position = "None") # label
# Plot text and label that REPELS eachother (using ggrepel pkg) ------------
library(ggrepel)
gg + geom_text_repel(aes(label=large_county), size=2, data=midwest_sub) + labs(subtitle="With ggr
epel::geom_text_repel") + theme(legend.position = "None") # text
gg + geom_label_repel(aes(label=large_county), size=2, data=midwest_sub) + labs(subtitle="With gg
repel::geom_label_repel") + theme(legend.position = "None") # label
Since the label is looked up from a different dataframe, we need to set the data argument.
How to Add Annotations Anywhere inside Plot
Let’s see how to add annotation to any specific point of the chart. It can be done with
the annotation_custom() function which takes in a grob as the argument. So, let’s create a grob that
holds the text you want to display using the grid package.
library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest")
# Define and add annotation -------------------------------------
library(grid)
my_text <- "This text is at x=0.7 and y=0.8!"
my_grob = grid.text(my_text, x=0.7, y=0.8, gp=gpar(col="firebrick", fontsize=14, fontface="bold"))
gg + annotation_custom(my_grob)
7.3.4. Flipping and Reversing X and Y Axis
How to flip the X and Y axis?
Just add coord_flip().
library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest", subtitle="X a
nd Y axis Flipped") + theme(legend.position = "None")
# Flip the X and Y axis -------------------------------------------------
gg + coord_flip()
How to reverse the scale of an axis?
This is quite simple. Use scale_x_reverse() for X axis and scale_y_reverse() for Y axis.
library(ggplot2)
# Base Plot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point(aes(col=state, size=popdensity)) +
geom_smooth(method="loess", se=F) + xlim(c(0, 0.1)) + ylim(c(0, 500000)) +
labs(title="Area Vs Population", y="Population", x="Area", caption="Source: midwest", subtitle="Axi
s Scales Reversed") + theme(legend.position = "None")
# Reverse the X and Y Axis ---------------------------
gg + scale_x_reverse() + scale_y_reverse()
7.3.5. Faceting: Draw multiple plots within one figure
Let’s use the mpg dataset for this one. It is available in the ggplot2 package, or you can import it from
this link.
library(ggplot2)
data(mpg, package="ggplot2") # load data
# mpg <- read.csv("http://goo.gl/uEeRGu") # alt data source
g <- ggplot(mpg, aes(x=displ, y=hwy)) +
geom_point() +
labs(title="hwy vs displ", caption = "Source: mpg") +
geom_smooth(method="lm", se=FALSE) +
theme_bw() # apply bw theme
plot(g)

We have a simple chart of highway mileage (hwy) against the engine displacement (displ) for the
whole dataset. But what if you want to study how this relationship varies for different classes of
vehicles?
Facet Wrap
The facet_wrap() is used to break down a large plot into multiple small plots for individual categories.
It takes a formula as the main argument. The items to the left of ~ form the rows, while those to the right form the columns.
By default, all the plots share the same scale in both X and Y axis. You can set them free by
setting scales='free' but this way it could be harder to compare between groups.
library(ggplot2)
# Base Plot
g <- ggplot(mpg, aes(x=displ, y=hwy)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
theme_bw() # apply bw theme
# Facet wrap with common scales
g + facet_wrap( ~ class, nrow=3) + labs(title="hwy vs displ", caption = "Source: mpg", subtitle="Ggplo
t2 - Faceting - Multiple plots in one figure") # Shared scales
# Facet wrap with free scales
g + facet_wrap( ~ class, scales = "free") + labs(title="hwy vs displ", caption = "Source: mpg", subtitle=
"Ggplot2 - Faceting - Multiple plots in one figure with free scales") # Scales free
So, what do you infer from this? For one, most 2 seater cars have higher engine displacement while
the minivan and compact vehicles are on the lower side. This is evident from where the points are
placed along the X-axis.
Also, the highway mileage drops across all segments as the engine displacement increases. This drop
seems more pronounced in compact and subcompact vehicles.
Facet Grid
The headings of the middle and bottom rows take up significant space. The facet_grid() would get rid
of it and give more area to the charts. The main difference with facet_grid is that it is not possible to
choose the number of rows and columns in the grid.
Let’s create a grid to see how it varies with manufacturer.
library(ggplot2)
# Base Plot
g <- ggplot(mpg, aes(x=displ, y=hwy)) +
geom_point() +
labs(title="hwy vs displ", caption = "Source: mpg", subtitle="Ggplot2 - Faceting - Multiple plots in
one figure") +
geom_smooth(method="lm", se=FALSE) +
theme_bw() # apply bw theme
# Add Facet Grid
g1 <- g + facet_grid(manufacturer ~ class) # manufacturer in rows and class in columns
plot(g1)
Let’s make one more to vary by cylinder.
library(ggplot2)
# Base Plot
g <- ggplot(mpg, aes(x=displ, y=hwy)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
labs(title="hwy vs displ", caption = "Source: mpg", subtitle="Ggplot2 - Facet Grid - Multiple plots i
n one figure") +
theme_bw() # apply bw theme
# Add Facet Grid
g2 <- g + facet_grid(cyl ~ class) # cyl in rows and class in columns.
plot(g2)
It is possible to lay out both these charts in the same panel. Let’s use the gridExtra package for this.
# Draw multiple plots in same figure.
library(gridExtra)
gridExtra::grid.arrange(g1, g2, ncol=2)
7.3.6 Modifying Plot Background, Major and Minor Axis


How to Change Plot background
library(ggplot2)
# Base Plot
g <- ggplot(mpg, aes(x=displ, y=hwy)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
theme_bw() # apply bw theme
# Change Plot Background elements -----------------------------------
g + theme(panel.background = element_rect(fill = 'khaki'),
panel.grid.major = element_line(colour = "burlywood", size=1.5),
panel.grid.minor = element_line(colour = "tomato",
size=.25,
linetype = "dashed"),
panel.border = element_blank(),
axis.line.x = element_line(colour = "darkorange",
size=1.5,
lineend = "butt"),
axis.line.y = element_line(colour = "darkorange",
size=1.5)) +
labs(title="Modified Background",
subtitle="How to Change Major and Minor grid, Axis Lines, No Border")
# Change Plot Margins -----------------------------------------------
g + theme(plot.background=element_rect(fill="salmon"),
plot.margin = unit(c(2, 2, 1, 1), "cm")) + # top, right, bottom, left
labs(title="Modified Background", subtitle="How to Change Plot Margin")

How to Remove Major and Minor Grid, Change Border, Axis Title, Text and Ticks
library(ggplot2)
# Base Plot
g <- ggplot(mpg, aes(x=displ, y=hwy)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
theme_bw() # apply bw theme
g + theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank()) +
labs(title="Modified Background", subtitle="How to remove major and minor axis grid, border, axis t
itle, text and ticks")

Add an Image in Background


library(ggplot2)
library(grid)
library(png)
img <- png::readPNG("screenshots/Rlogo.png") # source: https://www.r-project.org/
g_pic <- rasterGrob(img, interpolate=TRUE)
# Base Plot
g <- ggplot(mpg, aes(x=displ, y=hwy)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
theme_bw() # apply bw theme
g + theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(size = rel(1.5), face = "bold"),
axis.ticks = element_blank()) +
annotation_custom(g_pic, xmin=5, xmax=7, ymin=30, ymax=45)

Inheritance Structure of Theme Components

source: http://docs.ggplot2.org/dev/vignettes/themes.html
7.4 ggplot2 Visualizations - The Master List

An effective chart is one that:

1. Conveys the right information without distorting facts.


2. Is simple but elegant. It should not force you to think much in order to get it.
3. Aesthetics support the information rather than overshadow it.
4. Is not overloaded with information.

The list below sorts the visualizations by their primary purpose. Broadly, there are 8 types of objectives for which you may construct plots. So, before you actually make the plot, try to figure out what findings and relationships you would like to convey or examine through the visualization. Chances are it will fall under one (or sometimes more) of these 8 categories.

 Correlation
o Scatterplot
o Scatterplot With Encircling
o Jitter Plot
o Counts Chart
o Bubble Plot
o Animated Bubble Plot
o Marginal Histogram / Boxplot
o Correlogram
 Deviation
o Diverging Bars
o Diverging Lollipop Chart
o Diverging Dot Plot
o Area Chart
 Ranking
o Ordered Bar Chart
o Lollipop Chart
o Dot Plot
o Slope Chart
o Dumbbell Plot
 Distribution
o Histogram
o Density Plot
o Box Plot
o Dot + Box Plot
o Tufte Boxplot
o Violin Plot
o Population Pyramid
 Composition
o Waffle Chart
o Pie Chart
o Treemap
o Bar Chart
 Change
o Time Series Plots
 From a Data Frame
 Format to Monthly X Axis
 Format to Yearly X Axis
 From Long Data Format
 From Wide Data Format
o Stacked Area Chart
o Calendar Heat Map
o Slope Chart
o Seasonal Plot
 Groups
o Dendrogram
o Clusters
 Spatial
o Open Street Map
o Google Road Map
o Google Hybrid Map

Check your Progress 1

State True or False

1. ggplot works with dataframes and not individual vectors.


2. Plot and axis titles and the axis text can be modified using the theme() function.
3. vjust, controls the horizontal spacing between title (or label) and plot.
4. The facet_wrap() is used to break down a large plot into multiple small plots for individual
categories.

Activity 1

Install ggplot2 package and use to draw charts as per the steps given in this unit.

Summary

An effective chart is one that conveys the right information without distorting facts. In this unit, we have discussed ggplot2, which is the most elegant and aesthetically pleasing graphics framework available in R. It has a nicely planned structure to it. We have covered the basics of constructing simple ggplots, modifying their components and aesthetics, and customizing the look and feel of charts. We have also listed 8 categories of charts.

Keywords

 Data frames: It is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in
rows and columns.
 Continuous variable: A continuous variable is one which can take on an uncountable set of
values. For example, a variable over a non-empty range of the real numbers is continuous, if
it can take on any value in that range.
Self-Assessment Questions

1. Explain the properties of effective chart.


2. What are the functions used to change the color and size of a graph?
3. Explain the methods used for changing the X and Y limits.
4. Write a short note on geom_text and geom_label functions.

Answers to Check Your Progress


Check your Progress 1

State True or False

1. True
2. True
3. False
4. True

Suggested Reading

1. ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham


2. ggplot2 Essentials by Donato Teutonico
3. R Graphics Cookbook by Winston Chang
4. ggplot2: The Elements for Elegant Data Visualization in R by Alboukadel Kassambara
5. Applied Data Visualization with R and ggplot2: Create useful, elaborate, and visually appealing
plots by Dr. Tania Moulik
6. r-statistics.co by Selva Prabhakaran: https://r-statistics.co/

The content of this unit of the Learning Material is a derivative copy of materials from r-statistics.co
by Selva Prabhakaran licensed under Creative Commons Attribution-NonCommercial 3.0 CC-BY-NC
3.0). Download this for free at, https://r-statistics.co/
Unit 8
Computation with Python – NumPy, SciPy

8.1 Introduction
8.2 The NumPy Array Object
8.3 Numerical Operations on Arrays
8.4 More Elaborate Arrays
8.5 Scipy
Summary
Keywords
Self-Assessment Questions
Answers to Check Your Progress
Suggested Reading
Objectives:

After going through this unit, you will be able to:

 Understand the Numpy and Scipy


 Perform the computation using Numpy and Scipy

8.1 INTRODUCTION
Numpy is a general-purpose array-processing package. It provides a high-performance
multidimensional array object, and tools for working with these arrays. It is the fundamental package
for scientific computing with Python. Besides its obvious scientific uses, Numpy can also be used as an
efficient multi-dimensional container of generic data. Scipy is a free and open-source python library
used for scientific computing and technical computing. It contains modules for optimization, linear
algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers
and other tasks common in science and engineering. Scipy builds on the Numpy array object and is
part of the Numpy stack which includes tools like Matplotlib, Pandas and SymPy, and an expanding
set of scientific computing libraries. This unit gives an overview of Numpy and Scipy, the core tools for performant numerical computing with Python.

8.2 THE NUMPY ARRAY OBJECT


8.2.1 What are NumPy and NumPy arrays?
NumPy arrays
 Python objects
o high-level number objects: integers, floating point
o containers: lists (costless insertion and append), dictionaries (fast lookup)
 NumPy provides
o extension package to Python for multi-dimensional arrays
o closer to hardware (efficiency)
o designed for scientific computation (convenience)
o Also known as array oriented computing
>>> import numpy as np
>>> a = np.array([0, 1, 2, 3])
>>> a
array([0, 1, 2, 3])

For example, An array containing:


• values of an experiment/simulation at discrete time steps
• signal recorded by a measurement device, e.g. sound wave
• pixels of an image, grey-level or colour
• 3-D data measured at different X-Y-Z positions, e.g. MRI scan
• ...

Why it is useful: Memory-efficient container that provides fast numerical operations.


In [1]: L = range(1000)
In [2]: %timeit [i**2 for i in L]
1000 loops, best of 3: 403 us per loop
In [3]: a = np.arange(1000)
In [4]: %timeit a**2
100000 loops, best of 3: 12.7 us per loop

NumPy Reference documentation


• Interactive help:
In [5]: np.array?
String Form:<built-in function array>
Docstring:
array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0, ...
• Looking for something:
>>> np.lookfor('create array')
Search results for 'create array'
---------------------------------
numpy.array
Create an array.
numpy.memmap
Create a memory-map to an array stored in a *binary* file on disk.
In [6]: np.con*?
np.concatenate
np.conj
np.conjugate
np.convolve

Import conventions
The recommended convention to import numpy is:
>>> import numpy as np

8.2.2 Creating arrays


Manual construction of arrays
• 1-D:
>>> a = np.array([0, 1, 2, 3])
>>> a
array([0, 1, 2, 3])
>>> a.ndim
1
>>> a.shape
(4,)
>>> len(a)
4
• 2-D, 3-D, ...:
>>> b = np.array([[0, 1, 2], [3, 4, 5]]) # 2 x 3 array
>>> b
array([[0, 1, 2],
[3, 4, 5]])
>>> b.ndim
2
>>> b.shape
(2, 3)
>>> len(b) # returns the size of the first dimension
2
>>> c = np.array([[[1], [2]], [[3], [4]]])
>>> c
array([[[1],
[2]],
[[3],
[4]]])
>>> c.shape
(2, 2, 1)

Functions for creating arrays


In practice, we rarely enter items one by one. . .
• Evenly spaced:
>>> a = np.arange(10) # 0 .. n-1 (!)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = np.arange(1, 9, 2) # start, end (exclusive), step
>>> b
array([1, 3, 5, 7])
• or by number of points:
>>> c = np.linspace(0, 1, 6) # start, end, num-points
>>> c
array([ 0. , 0.2, 0.4, 0.6, 0.8, 1. ])
>>> d = np.linspace(0, 1, 5, endpoint=False)
>>> d
array([ 0. , 0.2, 0.4, 0.6, 0.8])
• Common arrays:
>>> a = np.ones((3, 3)) # reminder: (3, 3) is a tuple
>>> a
array([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])
>>> b = np.zeros((2, 2))
>>> b
array([[ 0., 0.],
[ 0., 0.]])
>>> c = np.eye(3)
>>> c
array([[ 1., 0., 0.],
[ 0., 1., 0.],
[ 0., 0., 1.]])
>>> d = np.diag(np.array([1, 2, 3, 4]))
>>> d
array([[1, 0, 0, 0],
[0, 2, 0, 0],
[0, 0, 3, 0],
[0, 0, 0, 4]])
• np.random: random numbers (Mersenne Twister PRNG):
>>> a = np.random.rand(4) # uniform in [0, 1]
>>> a
array([ 0.95799151, 0.14222247, 0.08777354, 0.51887998])
>>> b = np.random.randn(4) # Gaussian
>>> b
array([ 0.37544699, -0.11425369, -0.47616538, 1.79664113])
>>> np.random.seed(1234) # Setting the random seed

8.2.3 Basic data types


You may have noticed that, in some instances, array elements are displayed with a trailing dot (e.g. 2.
vs 2). This is due to a difference in the data-type used:
>>> a = np.array([1, 2, 3])
>>> a.dtype
dtype('int64')
>>> b = np.array([1., 2., 3.])
>>> b.dtype
dtype('float64')

Different data-types allow us to store data more compactly in memory, but most of the time we simply
work with floating point numbers. Note that, in the example above, NumPy auto-detects the data-
type from the input.
You can explicitly specify which data-type you want:
>>> c = np.array([1, 2, 3], dtype=float)
>>> c.dtype
dtype('float64')
The default data type is floating point:
>>> a = np.ones((3, 3))
>>> a.dtype
dtype('float64')
There are also other types:
Complex:
>>> d = np.array([1+2j, 3+4j, 5+6*1j])
>>> d.dtype
dtype('complex128')
Bool:
>>> e = np.array([True, False, False, True])
>>> e.dtype
dtype('bool')
Strings:
>>> f = np.array(['Bonjour', 'Hello', 'Hallo'])
>>> f.dtype # <--- strings containing max. 7 letters
dtype('S7')
Much more:
• int32
• int64
• uint32
• uint64

8.2.4 Basic visualization


Now that we have our first data arrays, we are going to visualize them.
Start by launching IPython:
$ ipython # or ipython3 depending on your install
Or the notebook:
$ jupyter notebook
Once IPython has started, enable interactive plots:
>>> %matplotlib
Or, from the notebook, enable plots in the notebook:
>>> %matplotlib inline
The inline is important for the notebook, so that plots are displayed in the notebook and not in a new
window. Matplotlib is a 2D plotting package. We can import its functions as below:
>>> import matplotlib.pyplot as plt # the tidy way
And then use (note that you have to use show explicitly if you have not enabled interactive plots with
%matplotlib):
>>> plt.plot(x, y) # line plot
>>> plt.show() # <-- shows the plot (not needed with interactive plots)
Or, if you have enabled interactive plots with %matplotlib:
>>> plt.plot(x, y) # line plot
• 1D plotting:
>>> x = np.linspace(0, 3, 20)
>>> y = np.linspace(0, 9, 20)
>>> plt.plot(x, y) # line plot
[<matplotlib.lines.Line2D object at ...>]
>>> plt.plot(x, y, 'o') # dot plot
[<matplotlib.lines.Line2D object at ...>]
• 2D arrays (such as images):
>>> image = np.random.rand(30, 30)
>>> plt.imshow(image, cmap=plt.cm.hot)
<matplotlib.image.AxesImage object at ...>
>>> plt.colorbar()
<matplotlib.colorbar.Colorbar object at ...>

8.2.5 Indexing and slicing


The items of an array can be accessed and assigned to the same way as other Python sequences (e.g.
lists):
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a[0], a[2], a[-1]
(0, 2, 9)
Indices begin at 0, like other Python sequences (and C/C++). In contrast, in Fortran or Matlab,
indices begin at 1.
The usual python idiom for reversing a sequence is supported:
>>> a[::-1]
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
For multidimensional arrays, indexes are tuples of integers:
>>> a = np.diag(np.arange(3))
>>> a
array([[0, 0, 0],
[0, 1, 0],
[0, 0, 2]])
>>> a[1, 1]
1
>>> a[2, 1] = 10 # third line, second column
>>> a
array([[ 0, 0, 0],
[ 0, 1, 0],
[ 0, 10, 2]])
>>> a[1]
array([0, 1, 0])
Note:
• In 2D, the first dimension corresponds to rows, the second to columns.
• for multidimensional a, a[0] is interpreted by taking all elements in the unspecified dimensions.

Slicing: Arrays, like other Python sequences can also be sliced:


>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a[2:9:3] # [start:end:step]
array([2, 5, 8])
Note that the last index is not included! :
>>> a[:4]
array([0, 1, 2, 3])
All three slice components are not required: by default, start is 0, end is the last and step is 1:
>>> a[1:3]
array([1, 2])
>>> a[::2]
array([0, 2, 4, 6, 8])
>>> a[3:]
array([3, 4, 5, 6, 7, 8, 9])

You can also combine assignment and slicing:


>>> a = np.arange(10)
>>> a[5:] = 10
>>> a
array([ 0, 1, 2, 3, 4, 10, 10, 10, 10, 10])
>>> b = np.arange(5)
>>> a[5:] = b[::-1]
>>> a
array([0, 1, 2, 3, 4, 4, 3, 2, 1, 0])

8.2.6 Copies and views


A slicing operation creates a view on the original array, which is just a way of accessing array data.
Thus the original array is not copied in memory. You can use np.may_share_memory() to check if two
arrays share the same memory block. Note however, that this uses heuristics and may give you false
positives.
When modifying the view, the original array is modified as well:
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = a[::2]
>>> b
array([0, 2, 4, 6, 8])
>>> np.may_share_memory(a, b)
True
>>> b[0] = 12
>>> b
array([12, 2, 4, 6, 8])
>>> a # (!)
array([12, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a = np.arange(10)
>>> c = a[::2].copy() # force a copy
>>> c[0] = 12
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.may_share_memory(a, c)
False
This behavior can be surprising at first sight... but it allows NumPy to save both memory and time.
Worked example: Prime number sieve

Compute prime numbers in 0–99, with a sieve


• Construct a shape (100,) boolean array is_prime, filled with True in the beginning:
>>> is_prime = np.ones((100,), dtype=bool)
• Cross out 0 and 1 which are not primes:
>>> is_prime[:2] = 0
• For each integer j starting from 2, cross out its higher multiples:
>>> N_max = int(np.sqrt(len(is_prime) - 1))
>>> for j in range(2, N_max + 1):
... is_prime[2*j::j] = False
• Skim through help(np.nonzero), and print the prime numbers
• Follow-up:
– Move the above code into a script file named prime_sieve.py
– Run it to check it works
– Use the optimization suggested in the sieve of Eratosthenes:
 Skip j which are already known to not be primes
 The first number to cross out is j*j (j squared)
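Putting the bullet points above together (including the two suggested optimizations), a minimal sketch of what prime_sieve.py could look like; the printing step uses np.nonzero as suggested:
import numpy as np

is_prime = np.ones((100,), dtype=bool)   # index i is True while i is still a prime candidate
is_prime[:2] = False                     # 0 and 1 are not primes

N_max = int(np.sqrt(len(is_prime) - 1))
for j in range(2, N_max + 1):
    if is_prime[j]:                      # skip j already known not to be prime
        is_prime[j*j::j] = False         # the first multiple to cross out is j*j

print(np.nonzero(is_prime)[0])           # the indices still True are the primes below 100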

8.2.7 Fancy indexing


NumPy arrays can be indexed with slices, but also with boolean or integer arrays (masks). This method
is called fancy indexing. It creates copies not views.
Using boolean masks
>>> np.random.seed(3)
>>> a = np.random.randint(0, 21, 15)
>>> a
array([10, 3, 8, 0, 19, 10, 11, 9, 10, 6, 0, 20, 12, 7, 14])
>>> (a % 3 == 0)
array([False, True, False, True, False, False, False, True, False, True, True, False, True, False,
False], dtype=bool)
>>> mask = (a % 3 == 0)
>>> extract_from_a = a[mask] # or, a[a%3==0]
>>> extract_from_a # extract a sub-array with the mask
array([ 3, 0, 9, 6, 0, 12])

Indexing with a mask can be very useful to assign a new value to a sub-array:
>>> a[a % 3 == 0] = -1
>>> a
array([10, -1, 8, -1, 19, 10, 11, -1, 10, -1, -1, 20, -1, 7, 14])

Indexing with an array of integers


>>> a = np.arange(0, 100, 10)
>>> a
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

Indexing can be done with an array of integers, where the same index is repeated several times:
>>> a[[2, 3, 2, 4, 2]] # note: [2, 3, 2, 4, 2] is a Python list
array([20, 30, 20, 40, 20])
New values can be assigned with this kind of indexing:
>>> a[[9, 7]] = -100
>>> a
array([ 0, 10, 20, 30, 40, 50, 60, -100, 80, -100])

When a new array is created by indexing with an array of integers, the new array has the same shape
as the array of integers:
>>> a = np.arange(10)
>>> idx = np.array([[3, 4], [9, 7]])
>>> idx.shape
(2, 2)
>>> a[idx]
array([[3, 4],
[9, 7]])

8.3 NUMERICAL OPERATIONS ON ARRAYS


8.3.1 Elementwise operations
Basic operations
With scalars:
>>> a = np.array([1, 2, 3, 4])
>>> a + 1
array([2, 3, 4, 5])
>>> 2**a
array([ 2, 4, 8, 16])
All arithmetic operates elementwise:
>>> b = np.ones(4) + 1
>>> a - b
array([-1., 0., 1., 2.])
>>> a * b
array([ 2., 4., 6., 8.])
>>> j = np.arange(5)
>>> 2**(j + 1) - j
array([ 2, 3, 6, 13, 28])
These operations are of course much faster than if you did them in pure python:
>>> a = np.arange(10000)
>>> %timeit a + 1
10000 loops, best of 3: 24.3 us per loop
>>> l = range(10000)
>>> %timeit [i+1 for i in l]
1000 loops, best of 3: 861 us per loop

Other operations
Comparisons:
>>> a = np.array([1, 2, 3, 4])
>>> b = np.array([4, 2, 2, 4])
>>> a == b
array([False, True, False, True], dtype=bool)
>>> a > b
array([False, False, True, False], dtype=bool)
Array-wise comparisons:
>>> a = np.array([1, 2, 3, 4])
>>> b = np.array([4, 2, 2, 4])
>>> c = np.array([1, 2, 3, 4])
>>> np.array_equal(a, b)
False
>>> np.array_equal(a, c)
True
Logical operations:
>>> a = np.array([1, 1, 0, 0], dtype=bool)
>>> b = np.array([1, 0, 1, 0], dtype=bool)
>>> np.logical_or(a, b)
array([ True, True, True, False], dtype=bool)
>>> np.logical_and(a, b)
array([ True, False, False, False], dtype=bool)
Transcendental functions:
>>> a = np.arange(5)
>>> np.sin(a)
array([ 0. , 0.84147098, 0.90929743, 0.14112001, -0.7568025 ])
>>> np.log(a)
array([ -inf, 0. , 0.69314718, 1.09861229, 1.38629436])
>>> np.exp(a)
array([ 1. , 2.71828183, 7.3890561 , 20.08553692, 54.59815003])
Shape mismatches
>>> a = np.arange(4)
>>> a + np.array([1, 2])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: operands could not be broadcast together with shapes (4) (2)

Transposition:
>>> a = np.triu(np.ones((3, 3)), 1) # see help(np.triu)
>>> a
array([[ 0., 1., 1.],
[ 0., 0., 1.],
[ 0., 0., 0.]])
>>> a.T
array([[ 0., 0., 0.],
[ 1., 0., 0.],
[ 1., 1., 0.]])
Warning: The transposition is a view
As a result, the following code is wrong and will not make a matrix symmetric:
>>> a += a.T
It will work for small arrays (because of buffering) but fail for large ones, in unpredictable ways.
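A safe alternative is an out-of-place addition, which allocates a new array instead of writing into the one being read (a minimal sketch):
>>> a = np.triu(np.ones((3, 3)), 1)
>>> a_sym = a + a.T        # out-of-place: reads a and a.T, writes into a new array
>>> np.allclose(a_sym, a_sym.T)
True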
Linear algebra:
The sub-module numpy.linalg implements basic linear algebra, such as solving linear systems, singular
value decomposition, etc. However, it is not guaranteed to be compiled using efficient routines, and
thus we recommend the use of scipy.linalg, as detailed in section Linear algebra operations:
scipy.linalg

8.3.2 Basic reductions


Computing sums
>>> x = np.array([1, 2, 3, 4])
>>> np.sum(x)
10
>>> x.sum()
10
Sum by rows and by columns:
>>> x = np.array([[1, 1], [2, 2]])
>>> x
array([[1, 1],
[2, 2]])
>>> x.sum(axis=0) # columns (first dimension)
array([3, 3])
>>> x[:, 0].sum(), x[:, 1].sum()
(3, 3)
>>> x.sum(axis=1) # rows (second dimension)
array([2, 4])
>>> x[0, :].sum(), x[1, :].sum()
(2, 4)
Same idea in higher dimensions:
>>> x = np.random.rand(2, 2, 2)
>>> x.sum(axis=2)[0, 1]
1.14764...
>>> x[0, 1, :].sum()
1.14764...
Other reductions
These work the same way and also accept the axis= argument.
Extrema:
>>> x = np.array([1, 3, 2])
>>> x.min()
1
>>> x.max()
3
>>> x.argmin() # index of minimum
0
>>> x.argmax() # index of maximum
1
Logical operations:
>>> np.all([True, True, False])
False
>>> np.any([True, True, False])
True
Note: Can be used for array comparisons:
>>> a = np.zeros((100, 100))
>>> np.any(a != 0)
False
>>> np.all(a == a)
True
>>> a = np.array([1, 2, 3, 2])
>>> b = np.array([2, 2, 3, 2])
>>> c = np.array([6, 4, 4, 5])
>>> ((a <= b) & (b <= c)).all()
True
Statistics:
>>> x = np.array([1, 2, 3, 1])
>>> y = np.array([[1, 2, 3], [5, 6, 1]])
>>> x.mean()
1.75
>>> np.median(x)
1.5
>>> np.median(y, axis=-1) # last axis
array([ 2., 5.])
>>> x.std() # full population standard dev.
0.82915619758884995
. . . and many more (best to learn as you go).

Worked Example: Data statistics


Data in populations.txt describes the populations of hares and lynxes (and carrots) in northern Canada
during 20 years. You can view the data in an editor, or alternatively in IPython (both shell and
notebook):
In [1]: !cat data/populations.txt
First, load the data into a NumPy array:
>>> data = np.loadtxt('data/populations.txt')
>>> year, hares, lynxes, carrots = data.T # trick: columns to variables
Then plot it:
>>> from matplotlib import pyplot as plt
>>> plt.axes([0.2, 0.1, 0.5, 0.8])
>>> plt.plot(year, hares, year, lynxes, year, carrots)
>>> plt.legend(('Hare', 'Lynx', 'Carrot'), loc=(1.05, 0.5))

The mean populations over time:


>>> populations = data[:, 1:]
>>> populations.mean(axis=0)
array([ 34080.95238095, 20166.66666667, 42400. ])
The sample standard deviations:
>>> populations.std(axis=0)
array([ 20897.90645809, 16254.59153691, 3322.50622558])
Which species has the highest population each year?:
>>> np.argmax(populations, axis=1)
array([2, 2, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 0, 0, 0, 1, 2, 2, 2, 2, 2])

8.3.3 Broadcasting
Basic operations on numpy arrays (addition, etc.) are elementwise. This works on arrays of the same
size.
Nevertheless, it is also possible to do operations on arrays of different sizes if NumPy can transform these arrays so that they all have the same size; this conversion is called broadcasting.
Let’s verify this with an example:
>>> a = np.tile(np.arange(0, 40, 10), (3, 1)).T
>>> a
array([[ 0, 0, 0],
[10, 10, 10],
[20, 20, 20],
[30, 30, 30]])
>>> b = np.array([0, 1, 2])
>>> a + b
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])

We have already used broadcasting without knowing it:


>>> a = np.ones((4, 5))
>>> a[0] = 2 # we assign an array of dimension 0 to an array of dimension 1
>>> a
array([[ 2., 2., 2., 2., 2.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1.]])
A useful trick:
>>> a = np.arange(0, 40, 10)
>>> a.shape
(4,)
>>> a = a[:, np.newaxis] # adds a new axis -> 2D array
>>> a.shape
(4, 1)
>>> a
array([[ 0],
[10],
[20],
[30]])
>>> a + b
array([[ 0, 1, 2],
[10, 11, 12],
[20, 21, 22],
[30, 31, 32]])
Broadcasting seems a bit magical, but it is actually quite natural to use it when we want to solve a
problem whose output data is an array with more dimensions than input data.

Worked Example: Broadcasting


Let’s construct an array of distances (in miles) between cities of Route 66: Chicago, Springfield, Saint-
Louis, Tulsa, Oklahoma City, Amarillo, Santa Fe, Albuquerque, Flagstaff and Los Angeles.
>>> mileposts = np.array([0, 198, 303, 736, 871, 1175, 1475, 1544,
... 1913, 2448])
>>> distance_array = np.abs(mileposts - mileposts[:, np.newaxis])
>>> distance_array
array([[ 0, 198, 303, 736, 871, 1175, 1475, 1544, 1913, 2448],
[ 198, 0, 105, 538, 673, 977, 1277, 1346, 1715, 2250],
[ 303, 105, 0, 433, 568, 872, 1172, 1241, 1610, 2145],
[ 736, 538, 433, 0, 135, 439, 739, 808, 1177, 1712],
[ 871, 673, 568, 135, 0, 304, 604, 673, 1042, 1577],
[1175, 977, 872, 439, 304, 0, 300, 369, 738, 1273],
[1475, 1277, 1172, 739, 604, 300, 0, 69, 438, 973],
[1544, 1346, 1241, 808, 673, 369, 69, 0, 369, 904],
[1913, 1715, 1610, 1177, 1042, 738, 438, 369, 0, 535],
[2448, 2250, 2145, 1712, 1577, 1273, 973, 904, 535, 0]])

A lot of grid-based or network-based problems can also use broadcasting. For instance, if we want to
compute the distance from the origin of points on a 5x5 grid, we can do
>>> x, y = np.arange(5), np.arange(5)[:, np.newaxis]
>>> distance = np.sqrt(x ** 2 + y ** 2)
>>> distance
array([[ 0. , 1. , 2. , 3. , 4. ],
[ 1. , 1.41421356, 2.23606798, 3.16227766, 4.12310563],
[ 2. , 2.23606798, 2.82842712, 3.60555128, 4.47213595],
[ 3. , 3.16227766, 3.60555128, 4.24264069, 5. ],
[ 4. , 4.12310563, 4.47213595, 5. , 5.65685425]])
Or in color:
>>> plt.pcolor(distance)
>>> plt.colorbar()
Remark: the numpy.ogrid() function allows us to directly create the vectors x and y of the previous example, with two “significant dimensions”:
>>> x, y = np.ogrid[0:5, 0:5]
>>> x, y
(array([[0],
[1],
[2],
[3],
[4]]), array([[0, 1, 2, 3, 4]]))
>>> x.shape, y.shape
((5, 1), (1, 5))
>>> distance = np.sqrt(x ** 2 + y ** 2)

Tip: So, np.ogrid is very useful as soon as we have to handle computations on a grid. On the other
hand, np.mgrid directly provides matrices full of indices for cases where we can’t (or don’t want to)
benefit from broadcasting:
>>> x, y = np.mgrid[0:4, 0:4]
>>> x
array([[0, 0, 0, 0],
[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3]])
>>> y
array([[0, 1, 2, 3],
[0, 1, 2, 3],
[0, 1, 2, 3],
[0, 1, 2, 3]])

8.3.4 Array shape manipulation


Flattening
>>> a = np.array([[1, 2, 3], [4, 5, 6]])
>>> a.ravel()
array([1, 2, 3, 4, 5, 6])
>>> a.T
array([[1, 4],
[2, 5],
[3, 6]])
>>> a.T.ravel()
array([1, 4, 2, 5, 3, 6])
Higher dimensions: last dimensions ravel out “first”.
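For instance, with a 3-D array the position of an element in the flattened array can be recovered by giving the last index a stride of 1 (a small illustration):
>>> c = np.arange(2*3*4).reshape(2, 3, 4)
>>> c[1, 0, 2]
14
>>> c.ravel()[1*(3*4) + 0*4 + 2]   # the last index moves fastest in the flattened array
14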

Reshaping
The inverse operation to flattening:
>>> a.shape
(2, 3)
>>> b = a.ravel()
>>> b = b.reshape((2, 3))
>>> b
array([[1, 2, 3],
[4, 5, 6]])
Or,
>>> a.reshape((2, -1)) # unspecified (-1) value is inferred
array([[1, 2, 3],
[4, 5, 6]])

Adding a dimension
Indexing with the np.newaxis object allows us to add an axis to an array (you have seen this already
above in the broadcasting section):
>>> z = np.array([1, 2, 3])
>>> z
array([1, 2, 3])
>>> z[:, np.newaxis]
array([[1],
[2],
[3]])
>>> z[np.newaxis, :]
array([[1, 2, 3]])
Dimension shuffling
>>> a = np.arange(4*3*2).reshape(4, 3, 2)
>>> a.shape
(4, 3, 2)
>>> a[0, 2, 1]
5
>>> b = a.transpose(1, 2, 0)
>>> b.shape
(3, 2, 4)
>>> b[2, 1, 0]
5
Also creates a view:
>>> b[2, 1, 0] = -1
>>> a[0, 2, 1]
-1
Resizing
Size of an array can be changed with ndarray.resize:
>>> a = np.arange(4)
>>> a.resize((8,))
>>> a
array([0, 1, 2, 3, 0, 0, 0, 0])
However, the array must not be referenced anywhere else:
>>> b = a
>>> a.resize((4,))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: cannot resize an array that has been referenced or is referencing another array in
this way. Use the resize function
8.3.5 Sorting data
Sorting along an axis:
>>> a = np.array([[4, 3, 5], [1, 2, 1]])
>>> b = np.sort(a, axis=1)
>>> b
array([[3, 4, 5],
[1, 1, 2]])
In-place sort:
>>> a.sort(axis=1)
>>> a
array([[3, 4, 5],
[1, 1, 2]])
Sorting with fancy indexing:
>>> a = np.array([4, 3, 1, 2])
>>> j = np.argsort(a)
>>> j
array([2, 3, 1, 0])
>>> a[j]
array([1, 2, 3, 4])
Finding minima and maxima:
>>> a = np.array([4, 3, 1, 2])
>>> j_max = np.argmax(a)
>>> j_min = np.argmin(a)
>>> j_max, j_min
(0, 2)

8.4 MORE ELABORATE ARRAYS


8.4.1 More data types
Casting
“Bigger” type wins in mixed-type operations:
>>> np.array([1, 2, 3]) + 1.5
array([ 2.5, 3.5, 4.5])
Assignment never changes the type!
>>> a = np.array([1, 2, 3])
>>> a.dtype
dtype('int64')
>>> a[0] = 1.9 # <-- float is truncated to integer
>>> a
array([1, 2, 3])
Forced casts:
>>> a = np.array([1.7, 1.2, 1.6])
>>> b = a.astype(int) # <-- truncates to integer
>>> b
array([1, 1, 1])
Rounding:
>>> a = np.array([1.2, 1.5, 1.6, 2.5, 3.5, 4.5])
>>> b = np.around(a)
>>> b # still floating-point
array([ 1., 2., 2., 2., 4., 4.])
>>> c = np.around(a).astype(int)
>>> c
array([1, 2, 2, 2, 4, 4])

8.4.2 Structured data types


A structured array lets each element hold several named fields, for example:
sensor_code (4-character string)
position (float)
value (float)
>>> samples = np.zeros((6,), dtype=[('sensor_code', 'S4'),
... ('position', float), ('value', float)])
>>> samples.ndim
1
>>> samples.shape
(6,)
>>> samples.dtype.names
('sensor_code', 'position', 'value')
>>> samples[:] = [('ALFA', 1, 0.37), ('BETA', 1, 0.11), ('TAU', 1, 0.13),
... ('ALFA', 1.5, 0.37), ('ALFA', 3, 0.11), ('TAU', 1.2, 0.13)]
>>> samples
array([('ALFA', 1.0, 0.37), ('BETA', 1.0, 0.11), ('TAU', 1.0, 0.13),
('ALFA', 1.5, 0.37), ('ALFA', 3.0, 0.11), ('TAU', 1.2, 0.13)],
dtype=[('sensor_code', 'S4'), ('position', '<f8'), ('value', '<f8')])
Field access works by indexing with field names:
>>> samples['sensor_code']
array(['ALFA', 'BETA', 'TAU', 'ALFA', 'ALFA', 'TAU'],dtype='|S4')
>>> samples['value']
array([ 0.37, 0.11, 0.13, 0.37, 0.11, 0.13])
>>> samples[0]
('ALFA', 1.0, 0.37)
>>> samples[0]['sensor_code'] = 'TAU'
>>> samples[0]
('TAU', 1.0, 0.37)
Multiple fields at once:
>>> samples[['position', 'value']]
array([( 1. , 0.37), ( 1. , 0.11), ( 1. , 0.13), ( 1.5, 0.37), ( 3. , 0.11), ( 1.2, 0.13)],
dtype=[('position', '<f8'), ('value', '<f8')])
Fancy indexing works, as usual:
>>> samples[samples['sensor_code'] == b'ALFA']
array([(b'ALFA', 1.5, 0.37), (b'ALFA', 3. , 0.11)],
dtype=[('sensor_code', 'S4'), ('position', '<f8'), ('value', '<f8')])

8.4.3 maskedarray: dealing with (propagation of) missing data


• For floats one could use NaN’s, but masks work for all types:
>>> x = np.ma.array([1, 2, 3, 4], mask=[0, 1, 0, 1])
>>> x
masked_array(data = [1 -- 3 --], mask = [False True False True], fill_value = 999999)
>>> y = np.ma.array([1, 2, 3, 4], mask=[0, 1, 1, 1])
>>> x + y
masked_array(data = [2 -- -- --], mask = [False True True True], fill_value = 999999)
• Masking versions of common functions:
>>> np.ma.sqrt([1, -1, 2, -2])
masked_array(data = [1.0 -- 1.41421356237... --], mask = [False True False True], fill_value =
1e+20)

8.5 SCIPY
The scipy package contains various toolboxes dedicated to common issues in scientific computing. Its
different submodules correspond to different applications, such as interpolation, integration,
optimization, image processing, statistics, special functions, etc.
scipy is composed of task-specific sub-modules (for example scipy.io, scipy.linalg, scipy.interpolate, scipy.optimize, scipy.stats, scipy.integrate, scipy.signal, scipy.ndimage, scipy.special and scipy.fftpack).
They all depend on numpy, but are mostly independent of each other. The standard way of importing
Numpy and these Scipy modules is:
>>> import numpy as np
>>> from scipy import stats # same for other sub-modules
The main scipy namespace mostly contains functions that are really numpy functions (try scipy.cos is
np.cos). Those are exposed for historical reasons; there’s no reason to use import scipy in your code.

8.5.1 File input/output: scipy.io


Matlab files: Loading and saving:
>>> from scipy import io as spio
>>> a = np.ones((3, 3))
>>> spio.savemat('file.mat', {'a': a}) # savemat expects a dictionary
>>> data = spio.loadmat('file.mat')
>>> data['a']
array([[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]])
Image files: Reading images:
>>> from scipy import misc
>>> misc.imread('fname.png')
array(...)
>>> # Matplotlib also has a similar function
>>> import matplotlib.pyplot as plt
>>> plt.imread('fname.png')
array(...)
See also:
• Load text files: numpy.loadtxt()/numpy.savetxt()
• Clever loading of text/csv files: numpy.genfromtxt()/numpy.recfromcsv()
• Fast and efficient, but numpy-specific, binary format: numpy.save()/numpy.load() (a short sketch follows this list)
• More advanced input/output of images in scikit-image: skimage.io
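For instance, the numpy-specific binary format round-trips an array exactly (a minimal sketch; the file name is arbitrary):
>>> a = np.random.rand(3, 3)
>>> np.save('my_array.npy', a)     # write one array to a .npy file
>>> b = np.load('my_array.npy')    # read it back
>>> np.allclose(a, b)
True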

8.5.2 Special functions: scipy.special


Special functions are transcendental functions. The docstring of the scipy.special module is well-
written, so we won’t list all functions here. Frequently used ones are:
• Bessel function, such as scipy.special.jn() (nth integer order Bessel function)
• Elliptic function (scipy.special.ellipj() for the Jacobian elliptic function, . . . )
• Gamma function: scipy.special.gamma(), also note scipy.special.gammaln() which will give
the log of Gamma to a higher numerical precision.
• Erf, the area under a Gaussian curve: scipy.special.erf()
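For instance, a couple of the functions listed above can be evaluated directly (a small sketch; the input values are arbitrary):
>>> from scipy import special
>>> special.gamma(5)                        # gamma(n) equals (n-1)! for positive integers
24.0
>>> special.erf(0.), special.erf(np.inf)    # erf grows from 0 at the origin to 1 at infinity
(0.0, 1.0)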

8.5.3 Linear algebra operations: scipy.linalg


The scipy.linalg module provides standard linear algebra operations, relying on an underlying efficient
implementation (BLAS, LAPACK).
• The scipy.linalg.det() function computes the determinant of a square matrix:
>>> from scipy import linalg
>>> arr = np.array([[1, 2],
... [3, 4]])
>>> linalg.det(arr)
-2.0
>>> arr = np.array([[3, 2],
... [6, 4]])
>>> linalg.det(arr)
0.0
>>> linalg.det(np.ones((3, 4)))
Traceback (most recent call last): ...
ValueError: expected square matrix

• The scipy.linalg.inv() function computes the inverse of a square matrix:


>>> arr = np.array([[1, 2],
... [3, 4]])
>>> iarr = linalg.inv(arr)
>>> iarr
array([[-2. , 1. ],
[ 1.5, -0.5]])
>>> np.allclose(np.dot(arr, iarr), np.eye(2))
True
Finally computing the inverse of a singular matrix (its determinant is zero) will raise
LinAlgError:
>>> arr = np.array([[3, 2],
... [6, 4]])
>>> linalg.inv(arr)
Traceback (most recent call last): ...
...LinAlgError: singular matrix
• More advanced operations are available, for example singular-value decomposition
(SVD):
>>> arr = np.arange(9).reshape((3, 3)) + np.diag([1, 0, 1])
>>> uarr, spec, vharr = linalg.svd(arr)
The resulting array spectrum is:
>>> spec
array([ 14.88982544, 0.45294236, 0.29654967])
The original matrix can be re-composed by matrix multiplication of the outputs of svd with
np.dot:
>>> sarr = np.diag(spec)
>>> svd_mat = uarr.dot(sarr).dot(vharr)
>>> np.allclose(svd_mat, arr)
True
SVD is commonly used in statistics and signal processing. Many other standard
decompositions (QR, LU, Cholesky, Schur), as well as solvers for linear systems, are available
in scipy.linalg.
8.5.4 Interpolation: scipy.interpolate
scipy.interpolate is useful for fitting a function from experimental data and thus evaluating points
where no measure exists. The module is based on the FITPACK Fortran subroutines.
Imagine experimental data close to a sine function:
>>> measured_time = np.linspace(0, 1, 10)
>>> noise = (np.random.random(10)*2 - 1) * 1e-1
>>> measures = np.sin(2 * np.pi * measured_time) + noise
scipy.interpolate.interp1d can build a linear interpolation function:
>>> from scipy.interpolate import interp1d
>>> linear_interp = interp1d(measured_time, measures)
Then the result can be evaluated at the time of interest:
>>> interpolation_time = np.linspace(0, 1, 50)
>>> linear_results = linear_interp(interpolation_time)
A cubic interpolation can also be selected by providing the kind optional keyword argument:
>>> cubic_interp = interp1d(measured_time, measures, kind='cubic')
>>> cubic_results = cubic_interp(interpolation_time)

scipy.interpolate.interp2d is similar to scipy.interpolate.interp1d, but for 2-D arrays. Note that for the
interp family, the interpolation points must stay within the range of given data points.

8.5.5 Optimization and fit: scipy.optimize


Optimization is the problem of finding a numerical solution to a minimization or equality. The
scipy.optimize module provides algorithms for function minimization (scalar or multidimensional),
curve fitting and root finding.
>>> from scipy import optimize
8.5.5.1 Curve fitting
Suppose we have data on a sine wave, with some noise:
>>> x_data = np.linspace(-5, 5, num=50)
>>> y_data = 2.9 * np.sin(1.5 * x_data) + np.random.normal(size=50)
If we know that the data lies on a sine wave, but not the amplitudes or the period, we can find those
by least squares curve fitting. First we have to define the test function to fit, here a sine with unknown
amplitude and period:
>>> def test_func(x, a, b):
... return a * np.sin(b * x)
We then use scipy.optimize.curve_fit() to find a and b:
>>> params, params_covariance = optimize.curve_fit(test_func, x_data, y_data, p0=[2, 2])
>>> print(params)
[ 3.05931973 1.45754553]

8.5.5.2 Finding the minimum of a scalar function


Let’s define the following function:
>>> def f(x):
... return x**2 + 10*np.sin(x)
and plot it:
>>> x = np.arange(-10, 10, 0.1)
>>> plt.plot(x, f(x))
>>> plt.show()
This function has a global minimum around -1.3 and a local minimum around 3.8.
Searching for a minimum can be done with scipy.optimize.minimize(): given a starting point x0, it returns the location of the minimum that it has found:
>>> result = optimize.minimize(f, x0=0)
>>> result
fun: -7.9458233756...
hess_inv: array([[ 0.0858...]])
jac: array([ -1.19209...e-06])
message: 'Optimization terminated successfully.'
nfev: 18
nit: 5
njev: 6
status: 0
success: True
x: array([-1.30644...])
>>> result.x # The coordinate of the minimum
array([-1.30644...])

As the function is smooth, gradient-based methods are good options. The L-BFGS algorithm is a good choice in general:
>>> optimize.minimize(f, x0=0, method="L-BFGS-B")
fun: array([-7.94582338])
hess_inv: <1x1 LbfgsInvHessProduct with dtype=float64>
jac: array([ -1.42108547e-06])
message: ...'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
nfev: 12
nit: 5
status: 0
success: True
x: array([-1.30644013])
Note how it took only 12 function evaluations above to find a good value for the minimum.

Global minimum: A possible issue with this approach is that, if the function has local minima, the
algorithm may find these local minima instead of the global minimum depending on the initial point
x0:
>>> res = optimize.minimize(f, x0=3, method="L-BFGS-B")
>>> res.x
array([ 3.83746709])
If we don’t know the neighborhood of the global minimum to choose the initial point, we need to
resort to costlier global optimization. To find the global minimum, we use
scipy.optimize.basinhopping() (added in version 0.12.0 of Scipy). It combines a local optimizer with
sampling of starting points:
>>> optimize.basinhopping(f, 0)
nfev: 1725
minimization_failures: 0
fun: -7.9458233756152845
x: array([-1.30644001])
message: ['requested number of basinhopping iterations completed successfully']
njev: 575
nit: 100
Constraints: We can constrain the variable to the interval (0, 10) using the “bounds” argument:
>>> res = optimize.minimize(f, x0=1, bounds=((0, 10), ))
>>> res.x
array([ 0.])

Minimizing functions of several variables


To minimize over several variables, the trick is to turn them into a function of a multi-dimensional
variable (a vector).
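For example, a simple function of two variables can be minimized by packing both variables into a single vector (a minimal sketch; the quadratic and the starting point below are arbitrary choices):
>>> def g(v):                     # v is a vector: v[0] plays the role of x, v[1] of y
...     x, y = v
...     return (x - 1)**2 + (y + 2)**2
>>> res = optimize.minimize(g, x0=[0, 0])
>>> res.x.round(3)                # the true minimum of g is at (1, -2)
array([ 1., -2.])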
Note: scipy.optimize.minimize_scalar() is a function with dedicated methods to minimize functions of
only one variable.

8.5.5.3 Finding the roots of a scalar function


To find a root, i.e. a point where f (x) = 0, of the function we can use scipy.optimize.root():
>>> root = optimize.root(f, x0=1) # our initial guess is 1
>>> root # The full result
fjac: array([[-1.]])
fun: array([ 0.])
message: 'The solution converged.'
nfev: 10
qtf: array([ 1.33310463e-32])
r: array([-10.])
status: 1
success: True
x: array([ 0.])
>>> root.x # Only the root found
array([ 0.])
Note that only one root is found. Inspecting the plot of f reveals that there is a second root around -
2.5. We find the exact value of it by adjusting our initial guess:
>>> root2 = optimize.root(f, x0=-2.5)
>>> root2.x
array([-2.47948183])
Now that we have found the minima and roots of f and used curve fitting on it, all those results could be put together in a single plot.

8.5.6 Statistics and random numbers: scipy.stats


The module scipy.stats contains statistical tools and probabilistic descriptions of random processes. Random number generators for various random processes can be found in numpy.random.
8.5.6.1 Distributions: histogram and probability density function
Given observations of a random process, their histogram is an estimator of the random process’s PDF
(probability density function):
>>> samples = np.random.normal(size=1000)
>>> bins = np.arange(-4, 5)
>>> bins
array([-4, -3, -2, -1, 0, 1, 2, 3, 4])
>>> histogram = np.histogram(samples, bins=bins, density=True)[0]
>>> bins = 0.5*(bins[1:] + bins[:-1])
>>> bins
array([-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5])
>>> from scipy import stats
>>> pdf = stats.norm.pdf(bins) # norm is a distribution object
>>> plt.plot(bins, histogram)
[<matplotlib.lines.Line2D object at ...>]
>>> plt.plot(bins, pdf)
[<matplotlib.lines.Line2D object at ...>]
If we know that the random process belongs to a given family of random processes, such as normal
processes, we can do a maximum-likelihood fit of the observations to estimate the parameters of the
underlying distribution. Here we fit a normal process to the observed data:
>>> loc, std = stats.norm.fit(samples)
>>> loc
-0.045256707...
>>> std
0.9870331586...
8.5.6.2 Mean, median and percentiles
The mean is an estimator of the center of the distribution:
>>> np.mean(samples)
-0.0452567074...
The median is another estimator of the center. It is the value with half of the observations below, and
half above:
>>> np.median(samples)
-0.0580280347...
Unlike the mean, the median is not sensitive to the tails of the distribution. It is “robust”.

The median is also the percentile 50, because 50% of the observations are below it:
>>> stats.scoreatpercentile(samples, 50)
-0.0580280347...
Similarly, we can calculate the percentile 90:
>>> stats.scoreatpercentile(samples, 90)
1.2315935511...
The percentile is an estimator of the CDF: cumulative distribution function.

8.5.6.3 Statistical tests


A statistical test is a decision indicator. For instance, if we have two sets of observations that we assume are generated from Gaussian processes, we can use a T-test to decide whether the means of the two sets of observations are significantly different:
>>> a = np.random.normal(0, 1, size=100)
>>> b = np.random.normal(1, 1, size=10)
>>> stats.ttest_ind(a, b)
(array(-3.177574054...), 0.0019370639...)
The resulting output is composed of:
• The T statistic value: a number whose sign indicates which of the two samples has the larger mean, and whose magnitude is related to how significant the difference is.
• The p value: the probability of observing a difference at least as large as the one measured if the two processes actually had the same mean. The closer it is to zero, the stronger the evidence that the two processes have different means.
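In practice, the two returned values are often unpacked and the p value is compared against a significance threshold (a small sketch; the 0.05 threshold is a common but arbitrary convention):
>>> t_stat, p_value = stats.ttest_ind(a, b)
>>> p_value < 0.05        # True here, so the difference in means is statistically significant
True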

8.5.7 Numerical integration: scipy.integrate


8.5.7.1 Function integrals
The most generic integration routine is scipy.integrate.quad(). To compute the integral of sin(x) over the interval from 0 to π/2:
>>> from scipy.integrate import quad
>>> res, err = quad(np.sin, 0, np.pi/2)
>>> np.allclose(res, 1) # res is the result; it should be close to 1
True
>>> np.allclose(err, 1 - res) # err is an estimate of the error
True
Other integration schemes are available: scipy.integrate.fixed_quad(), scipy.integrate.quadrature(),
scipy.integrate.romberg(). . .
8.5.7.2 Integrating differential equations
scipy.integrate also features routines for integrating Ordinary Differential Equations (ODE). In
particular, scipy.integrate.odeint() solves ODE of the form:
dy/dt = rhs(y1, y2, .., t0,...)
As an introduction, let us solve the ODE dy/dt = -2y between t = 0 and t = 4, with the initial condition y(t = 0) = 1. First the function computing the derivative needs to be defined:
>>> def calc_derivative(ypos, time):
... return -2 * ypos
Then, to compute y as a function of time:
>>> from scipy.integrate import odeint
>>> time_vec = np.linspace(0, 4, 40)
>>> y = odeint(calc_derivative, y0=1, t=time_vec)
Let us integrate a more complex ODE: a damped spring-mass oscillator. The position of a mass attached to a spring obeys the 2nd order ODE y'' + 2 ε ω0 y' + ω0² y = 0 with ω0² = k/m, where k is the spring constant, m the mass and ε = c/(2 m ω0) with c the damping coefficient. We set:
>>> mass = 0.5 # kg
>>> kspring = 4 # N/m
>>> cviscous = 0.4 # N s/m
Hence:
>>> eps = cviscous / (2 * mass * np.sqrt(kspring/mass))
>>> omega = np.sqrt(kspring / mass)
The system is underdamped, as:
>>> eps < 1
True
For odeint(), the 2nd order equation needs to be transformed into a system of two first-order equations for the vector Y = (y, y'): the function computes the velocity and acceleration:
>>> def calc_deri(yvec, time, eps, omega):
... return (yvec[1], -2 * eps * omega * yvec[1] - omega ** 2 * yvec[0])
Integration of the system follows:
>>> time_vec = np.linspace(0, 10, 100)
>>> yinit = (1, 0)
>>> yarr = odeint(calc_deri, yinit, time_vec, args=(eps, omega))

8.5.8 Other Modules:


Following are some more modules available with Scipy:

• The scipy.fftpack module computes fast Fourier transforms (FFTs) and offers utilities to handle them (a short usage sketch follows this list). The main functions are:
o scipy.fftpack.fft() to compute the FFT
o scipy.fftpack.fftfreq() to generate the sampling frequencies
o scipy.fftpack.ifft() computes the inverse FFT, from frequency space to signal space
• scipy.signal is for typical signal processing: 1D, regularly-sampled signals.
• scipy.ndimage provides manipulation of n-dimensional arrays as images.
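As a small illustration of the scipy.fftpack helpers listed above, the dominant frequency of a noiseless sine wave can be recovered from its spectrum (a minimal sketch; the 5 Hz signal and 100 Hz sampling rate are arbitrary choices):
>>> from scipy import fftpack
>>> time_step = 0.01                          # sampled at 100 Hz
>>> t = np.arange(0, 1, time_step)
>>> sig = np.sin(2 * np.pi * 5 * t)           # a 5 Hz sine wave
>>> freqs = fftpack.fftfreq(sig.size, d=time_step)
>>> power = np.abs(fftpack.fft(sig))
>>> pos = freqs > 0                           # keep only the positive frequencies
>>> freqs[pos][np.argmax(power[pos])]         # the dominant frequency is recovered
5.0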

Check your Progress 1


Fill in the blanks
1. ______ is a general-purpose array-processing package.
2. A _______ operation creates a view on the original array, which is just a way of accessing
array data.
3. The scipy package contains various toolboxes dedicated to common issues in ____
computing.

Activity 1
1. Create a simple two dimensional array. Use the functions len(), numpy.shape() on these
arrays. How do they relate to each other? And to the ndim attribute of the arrays?
2. Experiment with arange, linspace, ones, zeros, eye and diag.
3. Plot some simple arrays: a cosine as a function of time and a 2D matrix. Try using the gray
colormap on the 2D matrix.
4. Try the different flavours of slicing, using start, end and step: starting from a linspace, try to
obtain odd numbers counting backwards, and even numbers counting forwards.
5. Try simple arithmetic elementwise operations: add even elements with odd elements.
6. Try both in-place and out-of-place sorting. Try creating arrays with different dtypes and
sorting them.

Summary

In this unit we have discussed about Numpy and Scipy, the core tool for performant numerical
computing with Python. Numpy is a general-purpose array-processing package. It provides a high-
performance multidimensional array object, and tools for working with these arrays. Scipy contains
modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and
image processing, ODE solvers and other tasks common in science and engineering. Scipy builds on
the Numpy array object and is part of the Numpy stack which includes various tools.

Keywords

 Array: A kind of data structure that can store a fixed-size sequential collection of elements of
the same type.
 Random Variable: It is a set of possible values from a random experiment.
 Indexing: Accessing individual elements or subsets of an array or other sequence by their position.
 Slicing: Extracting a range of elements from a sequence or array using start:end:step notation; in NumPy, slicing returns a view on the original array.

Self-Assessment Questions
1. What is the difference between sum and cumsum?
2. Explain applications of Numpy and Scipy.
3. Create different kinds of arrays with random numbers. Try setting the seed before
creating an array with random values.
4. Look at the docstring for reshape, especially the notes section which has some more
information about copies and views. Use flatten as an alternative to ravel. What is the
difference?
5. Generate 1000 random variates from a gamma distribution with a shape parameter of 1,
then plot a histogram from those samples. Can you plot the pdf on top (it should match)?

Answers to Check Your Progress


Check your Progress 1
Fill in the blanks
1. Numpy is a general-purpose array-processing package.
2. A slicing operation creates a view on the original array, which is just a way of accessing
array data.
3. The scipy package contains various toolboxes dedicated to common issues in scientific
computing.

Suggested Reading
1. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, by Hadley
Wickham, Garrett Grolemund
2. Scipy Lecture Notes, https://scipy-lectures.org
3. SciPy and NumPy, by Eli Bressert
4. Numerical Python: Scientific Computing and Data Science Applications with NumPy, SciPy
and Matplotlib, by Robert Johansson

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
4.0 International (CC BY 4.0) as requested by the work’s creator or licensees. This license is available
at https://creativecommons.org/licenses/by/4.0/
Unit 9

Pandas

9.1 Introduction

9.2 Introduction to Main Pandas Methods

Summary

Keywords

Self-Assessment Questions

Answers to Check Your Progress

Suggested Reading
Objectives:

After going through this unit, you will be able to:

 Understand and implement main methods of Pandas used for visual analysis.

9.1 INTRODUCTION

Pandas is a Python library that provides extensive means for data analysis. Data scientists often work
with data stored in table formats like .csv, .tsv, or .xlsx. Pandas makes it very convenient to load,
process, and analyze such tabular data using SQL-like queries. In conjunction
with Matplotlib and Seaborn, Pandas provides a wide range of opportunities for visual analysis of
tabular data. In this unit we are going to discuss Pandas in detail.

9.2 INTRODUCTION TO MAIN PANDAS METHODS

The main data structures in Pandas are implemented with Series and DataFrame classes. The former
is a one-dimensional indexed array of some fixed data type. The latter is a two-dimensional data
structure - a table - where each column contains data of the same type. You can see it as a dictionary
of Series instances. DataFrames are great for representing real data: rows correspond to instances
(examples, observations, etc.), and columns correspond to features of these instances.
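For instance, a small DataFrame can be built directly from a dictionary of columns, which makes the "dictionary of Series instances" analogy concrete (a minimal sketch with made-up data; the column names and index labels are arbitrary):
import pandas as pd

# A Series: a one-dimensional indexed array of a fixed data type
heights = pd.Series([1.62, 1.75, 1.80], index=['Ann', 'Bob', 'Cid'])

# A DataFrame: a table that behaves like a dictionary of Series sharing one index
people = pd.DataFrame({'height': [1.62, 1.75, 1.80],
                       'weight': [58, 72, 80]},
                      index=['Ann', 'Bob', 'Cid'])
print(people['weight'])   # selecting one column returns a Series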

In [1]:
import numpy as np
import pandas as pd
pd.set_option("display.precision", 2)
We'll demonstrate the main methods in action by analyzing a dataset
(https://bigml.com/user/francisco/gallery/dataset/5163ad540c0b5e5b22000383) on the churn rate
of telecom operator clients. Let's read the data (using read_csv), and take a look at the first 5 lines
using the head method:

In [2]:
df = pd.read_csv('../input/telecom_churn.csv')
df.head()
Out[2]:



Recall that each row corresponds to one client, an instance, and columns are features of this instance.
Let’s have a look at data dimensionality, feature names, and feature types.

In [3]:
print(df.shape)
(3333, 20)
From the output, we can see that the table contains 3333 rows and 20 columns. Now let's try printing
out column names using columns:

In [4]:
print(df.columns)
Index(['State', 'Account length', 'Area code', 'International plan',
'Voice mail plan', 'Number vmail messages', 'Total day minutes',
'Total day calls', 'Total day charge', 'Total eve minutes',
'Total eve calls', 'Total eve charge', 'Total night minutes',
'Total night calls', 'Total night charge', 'Total intl minutes',
'Total intl calls', 'Total intl charge', 'Customer service calls',
'Churn'],
dtype='object')
We can use the info() method to output some general information about the dataframe:

In [5]:
print(df.info())

bool, int64, float64 and object are the data types of our features. We see that one feature is logical
(bool), 3 features are of type object, and 16 features are numeric. With this same method, we can
easily see if there are any missing values. Here, there are none because each column contains 3333
observations, the same number of rows we saw before with shape.
We can change the column type with the astype method. Let's apply this method to the Churn feature
to convert it into int64:

In [6]:
df['Churn'] = df['Churn'].astype('int64')
The describe method shows basic statistical characteristics of each numerical feature
(int64 and float64 types): number of non-missing values, mean, standard deviation, range, median,
0.25 and 0.75 quartiles.

In [7]:
df.describe()
Out[7]:

In order to see statistics on non-numerical features, one has to explicitly indicate data types of interest
in the include parameter.

In [8]:
df.describe(include=['object', 'bool'])
Out[8]:

For categorical (type object) and boolean (type bool) features we can use the value_counts method.
Let's have a look at the distribution of Churn:

In [9]:
df['Churn'].value_counts()
Out[9]:
0 2850
1 483
Name: Churn, dtype: int64
2850 users out of 3333 are loyal; their Churn value is 0. To calculate fractions, pass normalize=True to
the value_counts function.

In [10]:
df['Churn'].value_counts(normalize=True)
Out[10]:
0 0.86
1 0.14
Name: Churn, dtype: float64

Sorting
A DataFrame can be sorted by the value of one of the variables (i.e columns). For example, we can
sort by Total day charge (use ascending=False to sort in descending order):

In [11]:
df.sort_values(by='Total day charge', ascending=False).head()
Out[11]:

We can also sort by multiple columns:

In [12]:
df.sort_values(by=['Churn', 'Total day charge'], ascending=[True, False]).head()
Out[12]:
Indexing and retrieving data
A DataFrame can be indexed in a few different ways. To get a single column, you can use
a DataFrame['Name'] construction. Let's use this to answer a question about that column alone: what
is the proportion of churned users in our dataframe?

In [13]:
df['Churn'].mean()
Out[13]:
0.14491449144914492
14.5% is actually quite bad for a company; such a churn rate can make the company go bankrupt.

Boolean indexing with one column is also very convenient. The syntax is df[P(df['Name'])], where P is
some logical condition that is checked for each element of the Name column. The result of such
indexing is the DataFrame consisting only of rows that satisfy the P condition on the Name column.

Let's use it to answer the question:

What are average values of numerical features for churned users?

In [14]:
df[df['Churn'] == 1].mean()
Out[14]:

How much time (on average) do churned users spend on the phone during daytime?

In [15]:
df[df['Churn'] == 1]['Total day minutes'].mean()
Out[15]:
206.91407867494814
What is the maximum length of international calls among loyal users (Churn == 0) who do not have an
international plan?
In [16]:
df[(df['Churn'] == 0) & (df['International plan'] == 'No')]['Total intl minutes'].max()
Out[16]:
18.9
DataFrames can be indexed by column name (label), by row name (index), or by the serial number of a
row. The loc method is used for indexing by name, while the iloc method is used for indexing by number.

In the first case below, we say "give us the values of the rows with index from 0 to 5 (inclusive) and
columns labeled from State to Area code (inclusive)". In the second case, we say "give us the values of
the first five rows in the first three columns" (as in a typical Python slice: the maximal value is not
included).

In [17]:
df.loc[0:5, 'State':'Area code']
Out[17]:

In [18]:
df.iloc[0:5, 0:3]
Out[18]:

If we need the first or the last line of the data frame, we can use the df[:1] or df[-1:] construct:

In [19]:
df[-1:]
Out[19]:
Applying Functions to Cells, Columns and Rows
To apply functions to each column, use apply():

In [20]:
df.apply(np.max)
Out[20]:

The apply method can also be used to apply a function to each row. To do this, specify axis=1. Lambda
functions are very convenient in such scenarios. For example, if we need to select all states starting
with W, we can do it like this:

In [21]:
df[df['State'].apply(lambda state: state[0] == 'W')].head()
Out[21]:

The map method can be used to replace values in a column by passing a dictionary of the
form {old_value: new_value} as its argument:

In [22]:
d = {'No' : False, 'Yes' : True}
df['International plan'] = df['International plan'].map(d)
df.head()
Out[22]:

The same thing can be done with the replace method:

In [23]:
df = df.replace({'Voice mail plan': d})
df.head()
Out[23]:

Grouping
In general, grouping data in Pandas works as follows:

df.groupby(by=grouping_columns)[columns_to_show].function()

1. First, the groupby method splits the data according to the values of the grouping_columns. These
values become a new index in the resulting dataframe.
2. Then, the columns of interest are selected (columns_to_show). If columns_to_show is not
given, all columns except the grouping columns are selected.
3. Finally, one or several functions are applied to the obtained groups, per selected column.

Here is an example where we group the data according to the values of the Churn variable and display
statistics of three columns in each group:
In [24]:
columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes']
df.groupby(['Churn'])[columns_to_show].describe(percentiles=[])
Out[24]:

Let’s do the same thing, but slightly differently by passing a list of functions to agg():

In [25]:
columns_to_show = ['Total day minutes', 'Total eve minutes', 'Total night minutes']
df.groupby(['Churn'])[columns_to_show].agg([np.mean, np.std, np.min, np.max])
Out[25]:

Summary tables
Suppose we want to see how the observations in our sample are distributed in the context of two
variables - Churn and International plan. To do so, we can build a contingency table using
the crosstab method:

In [26]:
pd.crosstab(df['Churn'], df['International plan'])
Out[26]:

In [27]:
pd.crosstab(df['Churn'], df['Voice mail plan'], normalize=True)
Out[27]:
We can see that most of the users are loyal and do not use additional services (International
Plan/Voice mail). This will resemble pivot tables to those familiar with Excel. And, of course, pivot
tables are implemented in Pandas: the pivot_table method takes the following parameters:

 values – a list of variables to calculate statistics for,
 index – a list of variables to group the data by,
 aggfunc – which statistics to calculate for each group, e.g. sum, mean, maximum, minimum, or something else.
Let's take a look at the average number of day, evening, and night calls by area code:

In [28]:
df.pivot_table(['Total day calls', 'Total eve calls', 'Total night calls'], ['Area code'], aggfunc='mean')
Out[28]:

DataFrame transformations
Like many other things in Pandas, adding columns to a DataFrame is doable in many ways. For
example, if we want to calculate the total number of calls for all users, let's create the total_calls Series
and paste it into the DataFrame:

In [29]:
total_calls = df['Total day calls'] + df['Total eve calls'] + \
df['Total night calls'] + df['Total intl calls']
df.insert(loc=len(df.columns), column='Total calls', value=total_calls)
# loc parameter is the number of columns after which to insert the Series object
# we set it to len(df.columns) to paste it at the very end of the dataframe
df.head()
Out[29]:
It is possible to add a column more easily without creating an intermediate Series instance:

In [30]:
df['Total charge'] = df['Total day charge'] + df['Total eve charge'] + \
df['Total night charge'] + df['Total intl charge']
df.head()
Out[30]:

5 rows × 22 columns

To delete columns or rows, use the drop method, passing the required indexes and the axis parameter
(1 if you delete columns, and nothing or 0 if you delete rows). The inplace argument tells whether to
change the original DataFrame. With inplace=False, the drop method doesn't change the existing
DataFrame and returns a new one with dropped rows or columns. With inplace=True, it alters the
DataFrame.

In [31]:
# get rid of just created columns
df.drop(['Total charge', 'Total calls'], axis=1, inplace=True)
# and here’s how you can delete rows
df.drop([1, 2]).head()
Out[31]:
First attempt at predicting telecom churn
Let's see how churn rate is related to the International plan feature. We'll do this using
a crosstab contingency table and also through visual analysis with Seaborn.

In [32]:

pd.crosstab(df['Churn'], df['International plan'], margins=True)


Out[32]:

In [33]:
# some imports to set up plotting
import matplotlib.pyplot as plt
# pip install seaborn
import seaborn as sns
# Graphics in retina format are more sharp and legible
%config InlineBackend.figure_format = 'retina'
In [34]:
sns.countplot(x='International plan', hue='Churn', data=df);

We see that the churn rate is much higher with the International Plan, which is an interesting
observation! Perhaps large and poorly controlled expenses on international calls cause conflict
and lead to dissatisfaction among the telecom operator's customers.

Next, let's look at another important feature – Customer service calls. Let's also make a summary table
and a picture.

In [35]:
pd.crosstab(df['Churn'], df['Customer service calls'], margins=True)
Out[35]:

In [36]:
sns.countplot(x='Customer service calls', hue='Churn', data=df);

Although it's not so obvious from the summary table, it's easy to see from the above plot that the
churn rate increases sharply from 4 customer service calls and above.

Now let's add a binary feature to our DataFrame – Customer service calls > 3. And once again, let's see
how it relates to churn.

In [37]:
df['Many_service_calls'] = (df['Customer service calls'] > 3).astype('int')
pd.crosstab(df['Many_service_calls'], df['Churn'], margins=True)
Out[37]:

In [38]:
sns.countplot(x='Many_service_calls', hue='Churn', data=df);
Let's construct another contingency table that relates Churn with both International plan and freshly
created Many_service_calls.

In [39]:
pd.crosstab(df['Many_service_calls'] & df['International plan'] , df['Churn'])
Out[39]:

Therefore, if we predict that a customer is not loyal (Churn=1) whenever the number of calls to
the service center is greater than 3 and the International Plan is enabled (and predict Churn=0
otherwise), we can expect an accuracy of about 85.8% (we are mistaken on only 464 + 9 of the 3333
customers). This figure of 85.8%, obtained through very simple reasoning, serves as a good starting
point (baseline) for the machine learning models that we will build later.
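
This baseline rule can be checked directly with Pandas. Below is a minimal sketch, assuming df is still the churn dataframe prepared above (with International plan already mapped to booleans and Churn converted to int64):

baseline_pred = ((df['Customer service calls'] > 3) & df['International plan']).astype('int')
print((baseline_pred == df['Churn']).mean())  # approximately 0.858 on this dataset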

Check your Progress 1

Fill in the blanks

1. The ____ method shows basic statistical characteristics of each numerical feature.
2. The map method can be used to replace values in a column by passing a dictionary of the
form _______ as its argument.

Activity 1

Install Pandas library and implement all functions discussed in this unit.

Summary

Pandas is a Python library that provides extensive means for data analysis. In this unit we have
discussed the various functions and examples of Pandas used for visual analysis of tabular data. Data
scientists often work with data stored in table formats like .csv, .tsv, or .xlsx. Pandas makes it very
convenient to load, process, and analyse such tabular data using SQL-like queries.
Keywords

 DataFrame: It is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
 Crosstab: A table showing the relationship between two or more variables. Where the table only shows the relationship between two categorical variables, a crosstab is also known as a contingency table.

Self-Assessment Questions

1. Explain the apply( ) function with example.


2. Write a short note on grouping data in Pandas.
3. Explain the use of pivot_table method with example.

Answers to Check Your Progress

Check your Progress 1

Fill in the blanks

1. The describe method shows basic statistical characteristics of each numerical feature.
2. The map method can be used to replace values in a column by passing a dictionary of the
form {old_value: new_value} as its argument.

Suggested Reading

 "Merging DataFrames with pandas" - a tutorial by Max Plako within mlcourse.ai (full list of
tutorials is here)
 "Handle different dataset with dask and trying a little dask ML" - a tutorial by Irina Knyazeva
within mlcourse.ai
 GitHub repos: Pandas exercises and "Effective Pandas"
 scipy-lectures.org — tutorials on pandas, numpy, matplotlib and scikit-learn

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
4.0 International (CC BY 4.0) as requested by the work’s creator or licensees. This license is available
at https://creativecommons.org/licenses/by/4.0/
Unit 10

Aggregating and Analysing Data with dplyr

10.1 Introduction

10.2 What is dplyr?

10.3 Selecting columns and filtering rows

10.4 Pipes

10.5 Mutate

10.6 Split-apply-combine data analysis and the summarize () function

10.7 Reshaping data frames

Summary

Keywords

Self-Assessment Questions

Answers to Check Your Progress

Suggested Reading
Objectives:

After going through this unit, you will be able to:

 Describe what the dplyr package in R is used for.
 Apply common dplyr functions to manipulate data in R.
 Employ the ‘pipe’ operator to link together a sequence of functions.
 Employ the ‘mutate’ function to apply other chosen functions to existing columns and create
new columns of data.
 Employ the ‘split-apply-combine’ concept to split the data into groups, apply analysis to each
group, and combine the results.

10.1 INTRODUCTION

Bracket subsetting is handy, but it can be cumbersome and difficult to read, especially for complicated
operations. Luckily, the dplyr (https://cran.r-project.org/package=dplyr) package provides a number
of very useful functions for manipulating dataframes in a way that will reduce repetition, reduce the
probability of making errors, and probably even save you some typing. As an added bonus, you might
even find the dplyr grammar easier to read. In this unit we’re going to cover five of the most commonly
used functions, as well as pipes (%>%) to combine them.

 select()
 filter()
 group_by()
 summarize()
 mutate()

Packages in R are sets of additional functions that let you do more stuff in R. The functions we’ve been
using, like str(), come built into R; packages give you access to more functions. You need to install a
package and then load it to be able to use it.

install.packages("dplyr") ## install

You might get asked to choose a CRAN mirror – this is asking you to choose a site to download the
package from. The choice doesn’t matter too much; we would recommend choosing the RStudio
mirror.

library("dplyr") ## load

You only need to install a package once per computer, but you need to load it every time you open a
new R session and want to use that package.

10.2 WHAT IS DPLYR?

The dplyr is a package that tries to provide easy tools for the most common data manipulation tasks.
It is built to work directly with data frames. The thinking behind it was largely inspired by the package
plyr which has been in use for some time but suffered from being slow in some cases. dplyr addresses
this by porting much of the computation to C++. An additional feature is the ability to work with data
stored directly in an external database. The benefits of doing this are that the data can be managed
natively in a relational database, queries can be conducted on that database, and only the results of
the query returned.
This addresses a common problem with R in that all operations are conducted in memory and thus
the amount of data you can work with is limited by available memory. The database connections
essentially remove that limitation in that you can have a database of many 100s GB, conduct queries
on it directly and pull back just what you need for analysis in R.

10.3 SELECTING COLUMNS AND FILTERING ROWS

To select columns of a data frame, use select(). The first argument to this function is the data frame
(variants), and the subsequent arguments are the columns to keep.

select(variants, sample_id, REF, ALT, DP)

To select all columns except certain ones, put a “-“ in front of the variable to exclude it.

select(variants, -CHROM)

dplyr also provides useful functions to select columns based on their names. For instance, ends_with()
allows you to select columns that end with specific letters. For example, if you wanted to select
columns that end with the letter “B”:

select(variants, ends_with("B"))
To choose rows, we can use filter(). For instance, to keep the rows for the sample SRR2584863:

filter(variants, sample_id == "SRR2584863")

Note that this is equivalent to the base R code below, but is easier to read!

variants[variants$sample_id == "SRR2584863",]

filter() will keep all the rows that match the conditions that are provided. Here are a few examples:

## rows for which the reference genome has T or G

filter(variants, REF %in% c("T", "G"))

Output:

# A tibble: 340 x 29

sample_id CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF

<chr> <chr> <dbl> <lgl> <chr> <chr> <dbl> <lgl> <lgl> <dbl> <dbl>

1 SRR25848… CP00… 9.97e3 NA T G 91 NA FALSE NA NA

2 SRR25848… CP00… 2.63e5 NA G T 85 NA FALSE NA NA

3 SRR25848… CP00… 2.82e5 NA G T 217 NA FALSE NA NA

4 SRR25848… CP00… 1.73e6 NA G A 225 NA FALSE NA NA


5 SRR25848… CP00… 2.62e6 NA G T 31.9 NA FALSE NA NA

6 SRR25848… CP00… 3.00e6 NA G A 225 NA FALSE NA NA

7 SRR25848… CP00… 3.91e6 NA G T 225 NA FALSE NA NA

8 SRR25848… CP00… 9.97e3 NA T G 214 NA FALSE NA NA

9 SRR25848… CP00… 1.06e4 NA G A 225 NA FALSE NA NA

10 SRR25848… CP00… 6.40e4 NA G A 225 NA FALSE NA NA

# … with 330 more rows, and 18 more variables: DP <dbl>, VDB <dbl>,

# RPB <dbl>, MQB <dbl>, BQB <dbl>, MQSB <dbl>, SGB <dbl>, MQ0F <dbl>,

# ICB <lgl>, HOB <lgl>, AC <dbl>, AN <dbl>, DP4 <chr>, MQ <dbl>,

# Indiv <chr>, gt_PL <dbl>, gt_GT <dbl>, gt_GT_alleles <chr>

## rows with QUAL values greater than or equal to 100

filter(variants, QUAL >= 100)

Output:

# A tibble: 666 x 29

sample_id CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF

<chr> <chr> <dbl> <lgl> <chr> <chr> <dbl> <lgl> <lgl> <dbl> <dbl>

1 SRR25848… CP00… 2.82e5 NA G T 217 NA FALSE NA NA

2 SRR25848… CP00… 4.74e5 NA CCGC CCGC… 228 NA TRUE 9 0.9

3 SRR25848… CP00… 6.49e5 NA C T 210 NA FALSE NA NA

4 SRR25848… CP00… 1.33e6 NA C A 178 NA FALSE NA NA

5 SRR25848… CP00… 1.73e6 NA G A 225 NA FALSE NA NA

6 SRR25848… CP00… 2.33e6 NA AT ATT 167 NA TRUE 7 1

7 SRR25848… CP00… 2.41e6 NA A C 104 NA FALSE NA NA

8 SRR25848… CP00… 2.45e6 NA A C 225 NA FALSE NA NA

9 SRR25848… CP00… 2.67e6 NA A T 225 NA FALSE NA NA

10 SRR25848… CP00… 3.00e6 NA G A 225 NA FALSE NA NA

# … with 656 more rows, and 18 more variables: DP <dbl>, VDB <dbl>,

# RPB <dbl>, MQB <dbl>, BQB <dbl>, MQSB <dbl>, SGB <dbl>, MQ0F <dbl>,

# ICB <lgl>, HOB <lgl>, AC <dbl>, AN <dbl>, DP4 <chr>, MQ <dbl>,

# Indiv <chr>, gt_PL <dbl>, gt_GT <dbl>, gt_GT_alleles <chr>


## rows that have TRUE in the column INDEL

filter(variants, INDEL)

Output:

# A tibble: 101 x 29

sample_id CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF

<chr> <chr> <dbl> <lgl> <chr> <chr> <dbl> <lgl> <lgl> <dbl> <dbl>

1 SRR25848… CP00… 4.33e5 NA CTTT… CTTT… 64 NA TRUE 12 1

2 SRR25848… CP00… 4.74e5 NA CCGC CCGC… 228 NA TRUE 9 0.9

3 SRR25848… CP00… 2.10e6 NA ACAG… ACAG… 56 NA TRUE 2 0.667

4 SRR25848… CP00… 2.33e6 NA AT ATT 167 NA TRUE 71

5 SRR25848… CP00… 3.90e6 NA A AC 43.4 NA TRUE 21

6 SRR25848… CP00… 4.43e6 NA TGG T 228 NA TRUE 10 1

7 SRR25848… CP00… 1.48e5 NA AGGGG AGGG… 122 NA TRUE 81

8 SRR25848… CP00… 1.58e5 NA GTTT… GTTT… 19.5 NA TRUE 61

9 SRR25848… CP00… 1.73e5 NA CAA CA 180 NA TRUE 11 1

10 SRR25848… CP00… 1.75e5 NA GAA GA 194 NA TRUE 10 1

# … with 91 more rows, and 18 more variables: DP <dbl>, VDB <dbl>,

# RPB <dbl>, MQB <dbl>, BQB <dbl>, MQSB <dbl>, SGB <dbl>, MQ0F <dbl>,

# ICB <lgl>, HOB <lgl>, AC <dbl>, AN <dbl>, DP4 <chr>, MQ <dbl>,

# Indiv <chr>, gt_PL <dbl>, gt_GT <dbl>, gt_GT_alleles <chr>

## rows that don't have missing data in the IDV column

filter(variants, !is.na(IDV))

Output:

# A tibble: 101 x 29

sample_id CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF

<chr> <chr> <dbl> <lgl> <chr> <chr> <dbl> <lgl> <lgl> <dbl> <dbl>

1 SRR25848… CP00… 4.33e5 NA CTTT… CTTT… 64 NA TRUE 12 1

2 SRR25848… CP00… 4.74e5 NA CCGC CCGC… 228 NA TRUE 9 0.9

3 SRR25848… CP00… 2.10e6 NA ACAG… ACAG… 56 NA TRUE 2 0.667

4 SRR25848… CP00… 2.33e6 NA AT ATT 167 NA TRUE 71


5 SRR25848… CP00… 3.90e6 NA A AC 43.4 NA TRUE 21

6 SRR25848… CP00… 4.43e6 NA TGG T 228 NA TRUE 10 1

7 SRR25848… CP00… 1.48e5 NA AGGGG AGGG… 122 NA TRUE 81

8 SRR25848… CP00… 1.58e5 NA GTTT… GTTT… 19.5 NA TRUE 61

9 SRR25848… CP00… 1.73e5 NA CAA CA 180 NA TRUE 11 1

10 SRR25848… CP00… 1.75e5 NA GAA GA 194 NA TRUE 10 1

# … with 91 more rows, and 18 more variables: DP <dbl>, VDB <dbl>,

# RPB <dbl>, MQB <dbl>, BQB <dbl>, MQSB <dbl>, SGB <dbl>, MQ0F <dbl>,

# ICB <lgl>, HOB <lgl>, AC <dbl>, AN <dbl>, DP4 <chr>, MQ <dbl>,

# Indiv <chr>, gt_PL <dbl>, gt_GT <dbl>, gt_GT_alleles <chr>

filter() allows you to combine multiple conditions. If you separate them with a comma, as separate
arguments to the function, they will be combined using the & (AND) logical operator. If you need to use the | (OR) logical
operator, you have to specify it explicitly:

## this is equivalent to:

## filter(variants, sample_id == "SRR2584863" & QUAL >= 100)

filter(variants, sample_id == "SRR2584863", QUAL >= 100)

Output:

# A tibble: 19 x 29

sample_id CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF

<chr> <chr> <dbl> <lgl> <chr> <chr> <dbl> <lgl> <lgl> <dbl> <dbl>

1 SRR25848… CP00… 2.82e5 NA G T 217 NA FALSE NA NA

2 SRR25848… CP00… 4.74e5 NA CCGC CCGC… 228 NA TRUE 9 0.9

3 SRR25848… CP00… 6.49e5 NA C T 210 NA FALSE NA NA

4 SRR25848… CP00… 1.33e6 NA C A 178 NA FALSE NA NA

5 SRR25848… CP00… 1.73e6 NA G A 225 NA FALSE NA NA

6 SRR25848… CP00… 2.33e6 NA AT ATT 167 NA TRUE 7 1

7 SRR25848… CP00… 2.41e6 NA A C 104 NA FALSE NA NA

8 SRR25848… CP00… 2.45e6 NA A C 225 NA FALSE NA NA

9 SRR25848… CP00… 2.67e6 NA A T 225 NA FALSE NA NA

10 SRR25848… CP00… 3.00e6 NA G A 225 NA FALSE NA NA

11 SRR25848… CP00… 3.34e6 NA A C 211 NA FALSE NA NA

12 SRR25848… CP00… 3.40e6 NA C A 225 NA FALSE NA NA


13 SRR25848… CP00… 3.48e6 NA A G 200 NA FALSE NA NA

14 SRR25848… CP00… 3.49e6 NA A C 225 NA FALSE NA NA

15 SRR25848… CP00… 3.91e6 NA G T 225 NA FALSE NA NA

16 SRR25848… CP00… 4.10e6 NA A G 225 NA FALSE NA NA

17 SRR25848… CP00… 4.20e6 NA A C 225 NA FALSE NA NA

18 SRR25848… CP00… 4.43e6 NA TGG T 228 NA TRUE 10 1

19 SRR25848… CP00… 4.62e6 NA A C 185 NA FALSE NA NA

# … with 18 more variables: DP <dbl>, VDB <dbl>, RPB <dbl>, MQB <dbl>,

# BQB <dbl>, MQSB <dbl>, SGB <dbl>, MQ0F <dbl>, ICB <lgl>, HOB <lgl>,

# AC <dbl>, AN <dbl>, DP4 <chr>, MQ <dbl>, Indiv <chr>, gt_PL <dbl>,

# gt_GT <dbl>, gt_GT_alleles <chr>

## using `|` logical operator

filter(variants, sample_id == "SRR2584863", (INDEL | QUAL >= 100))

Output:

# A tibble: 22 x 29

sample_id CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF

<chr> <chr> <dbl> <lgl> <chr> <chr> <dbl> <lgl> <lgl> <dbl> <dbl>

1 SRR25848… CP00… 2.82e5 NA G T 217 NA FALSE NA NA

2 SRR25848… CP00… 4.33e5 NA CTTT… CTTT… 64 NA TRUE 12 1

3 SRR25848… CP00… 4.74e5 NA CCGC CCGC… 228 NA TRUE 9 0.9

4 SRR25848… CP00… 6.49e5 NA C T 210 NA FALSE NA NA

5 SRR25848… CP00… 1.33e6 NA C A 178 NA FALSE NA NA

6 SRR25848… CP00… 1.73e6 NA G A 225 NA FALSE NA NA

7 SRR25848… CP00… 2.10e6 NA ACAG… ACAG… 56 NA TRUE 2 0.667

8 SRR25848… CP00… 2.33e6 NA AT ATT 167 NA TRUE 7 1

9 SRR25848… CP00… 2.41e6 NA A C 104 NA FALSE NA NA

10 SRR25848… CP00… 2.45e6 NA A C 225 NA FALSE NA NA

# … with 12 more rows, and 18 more variables: DP <dbl>, VDB <dbl>,

# RPB <dbl>, MQB <dbl>, BQB <dbl>, MQSB <dbl>, SGB <dbl>, MQ0F <dbl>,

# ICB <lgl>, HOB <lgl>, AC <dbl>, AN <dbl>, DP4 <chr>, MQ <dbl>,


# Indiv <chr>, gt_PL <dbl>, gt_GT <dbl>, gt_GT_alleles <chr>

10.4 PIPES

But what if you wanted to select and filter at the same time? We can do this with pipes. Pipes are a fairly recent addition
to R. Pipes let you take the output of one function and send it directly to the next, which is useful
when you need to do many things to the same data set. It was possible to do this before pipes were
added to R, but it was much messier and more difficult. Pipes in R look like %>% and are made available
via the magrittr package, which is installed as part of dplyr. If you use RStudio, you can type the pipe
with Ctrl + Shift + M if you’re using a PC, or Cmd + Shift + M if you’re using a Mac.

variants %>%

filter(sample_id == "SRR2584863") %>%

select(REF, ALT, DP)

Output:

# A tibble: 25 x 3

REF ALT DP

<chr> <chr> <dbl>

1T G 4

2G T 6

3G T 10

4 CTTTTTTT CTTTTTTTT 12

5 CCGC CCGCGC 10

6C T 10

7C A 8

8G A 11

9 ACAGCCAGCCAGCCAGCCAGCCAG… ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCA… 3

10 AT ATT 7

# … with 15 more rows

In the above code, we use the pipe to send the variants dataset first through filter(), to keep rows
where sample_id matches a particular sample, and then through select() to keep only the REF, ALT,
and DP columns. Since %>% takes the object on its left and passes it as the first argument to the
function on its right, we don’t need to explicitly include the data frame as an argument to the filter()
and select() functions any more. If we only wanted to see the first few rows, we could additionally
pipe the result to the head() function.

Some may find it helpful to read the pipe like the word “then”. For instance, in the above example, we
took the data frame variants, then we filtered for rows where sample_id was SRR2584863, and then we
selected the REF, ALT, and DP columns. The dplyr functions by
themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can
accomplish more complex manipulations of data frames.

If we want to create a new object with this smaller version of the data we can do so by assigning it a
new name:

SRR2584863_variants <- variants %>%

filter(sample_id == "SRR2584863") %>%

select(REF, ALT, DP)

This new object includes all of the data from this sample. Let’s look at it to confirm it’s what we want:

SRR2584863_variants

Output:

# A tibble: 25 x 3

REF ALT DP

<chr> <chr> <dbl>

1T G 4

2G T 6

3G T 10

4 CTTTTTTT CTTTTTTTT 12

5 CCGC CCGCGC 10

6C T 10

7C A 8

8G A 11

9 ACAGCCAGCCAGCCAGCCAGCCAG… ACAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCAGCCA… 3

10 AT ATT 7

# … with 15 more rows

10.5 MUTATE

Frequently you’ll want to create new columns based on the values in existing columns, for example to
do unit conversions or find the ratio of values in two columns. For this we’ll use the dplyr function
mutate().

We have a column titled “QUAL”. This is a Phred-scaled confidence score that a polymorphism exists
at this position given the sequencing data. Lower QUAL scores indicate low probability of a
polymorphism existing at that site. We can convert the confidence value QUAL to a probability value
according to the formula:

Probability = 1 - 10^(-QUAL/10)

Let’s add a column (POLPROB) to our variants dataframe that shows the probability of a polymorphism
at that site given the data.

variants %>%

mutate(POLPROB = 1 - (10 ^ -(QUAL/10)))

Output:

# A tibble: 801 x 30

sample_id CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF

<chr> <chr> <dbl> <lgl> <chr> <chr> <dbl> <lgl> <lgl> <dbl> <dbl>

1 SRR25848… CP00… 9.97e3 NA T G 91 NA FALSE NA NA

2 SRR25848… CP00… 2.63e5 NA G T 85 NA FALSE NA NA

3 SRR25848… CP00… 2.82e5 NA G T 217 NA FALSE NA NA

4 SRR25848… CP00… 4.33e5 NA CTTT… CTTT… 64 NA TRUE 12 1

5 SRR25848… CP00… 4.74e5 NA CCGC CCGC… 228 NA TRUE 9 0.9

6 SRR25848… CP00… 6.49e5 NA C T 210 NA FALSE NA NA

7 SRR25848… CP00… 1.33e6 NA C A 178 NA FALSE NA NA

8 SRR25848… CP00… 1.73e6 NA G A 225 NA FALSE NA NA

9 SRR25848… CP00… 2.10e6 NA ACAG… ACAG… 56 NA TRUE 2 0.667

10 SRR25848… CP00… 2.33e6 NA AT ATT 167 NA TRUE 7 1

# … with 791 more rows, and 19 more variables: DP <dbl>, VDB <dbl>,

# RPB <dbl>, MQB <dbl>, BQB <dbl>, MQSB <dbl>, SGB <dbl>, MQ0F <dbl>,

# ICB <lgl>, HOB <lgl>, AC <dbl>, AN <dbl>, DP4 <chr>, MQ <dbl>,

# Indiv <chr>, gt_PL <dbl>, gt_GT <dbl>, gt_GT_alleles <chr>,

# POLPROB <dbl>

We are interested in knowing the most common size for the indels. Let’s create a new column, called
“indel_size” that contains the size difference between the sequences and the reference genome. The
function, nchar() returns the number of letters in a string.

variants %>%

mutate(indel_size = nchar(ALT) - nchar(REF))

Output:

# A tibble: 801 x 30

sample_id CHROM POS ID REF ALT QUAL FILTER INDEL IDV IMF

<chr> <chr> <dbl> <lgl> <chr> <chr> <dbl> <lgl> <lgl> <dbl> <dbl>
1 SRR25848… CP00… 9.97e3 NA T G 91 NA FALSE NA NA

2 SRR25848… CP00… 2.63e5 NA G T 85 NA FALSE NA NA

3 SRR25848… CP00… 2.82e5 NA G T 217 NA FALSE NA NA

4 SRR25848… CP00… 4.33e5 NA CTTT… CTTT… 64 NA TRUE 12 1

5 SRR25848… CP00… 4.74e5 NA CCGC CCGC… 228 NA TRUE 9 0.9

6 SRR25848… CP00… 6.49e5 NA C T 210 NA FALSE NA NA

7 SRR25848… CP00… 1.33e6 NA C A 178 NA FALSE NA NA

8 SRR25848… CP00… 1.73e6 NA G A 225 NA FALSE NA NA

9 SRR25848… CP00… 2.10e6 NA ACAG… ACAG… 56 NA TRUE 2 0.667

10 SRR25848… CP00… 2.33e6 NA AT ATT 167 NA TRUE 7 1

# … with 791 more rows, and 19 more variables: DP <dbl>, VDB <dbl>,

# RPB <dbl>, MQB <dbl>, BQB <dbl>, MQSB <dbl>, SGB <dbl>, MQ0F <dbl>,

# ICB <lgl>, HOB <lgl>, AC <dbl>, AN <dbl>, DP4 <chr>, MQ <dbl>,

# Indiv <chr>, gt_PL <dbl>, gt_GT <dbl>, gt_GT_alleles <chr>,

# indel_size <int>

When you want to create a new variable that depends on multiple conditions, the function
case_when() in combination with mutate() is very useful. Our current dataset has a column that tells
use whether each mutation is an indel, but we don’t know if it’s an insertion or a deletion. Let’s create
a new variable, called mutation_type that will take the values: insertion, deletion, or point depending
on the value found in the indel_size column. We will save this data frame in a new variable, called
variants_indel.

variants_indel <- variants %>%

mutate(

indel_size = nchar(ALT) - nchar(REF),

mutation_type = case_when(

indel_size > 0 ~ "insertion",

indel_size < 0 ~ "deletion",

indel_size == 0 ~ "point"

))

When case_when() is used within mutate(), each row is evaluated against the conditions in the order they
are listed. For the first condition that returns TRUE, the value on the right side of the ~ is used to fill
the content of the new column (here mutation_type). If none of the conditions are
met, the function returns NA (missing data).
We can check that we captured all possibilities by looking for missing data in the new mutation_type
column, and confirm that no row matches this condition:

variants_indel %>%

filter(is.na(mutation_type))

Output:

# A tibble: 0 x 31

# … with 31 variables: sample_id <chr>, CHROM <chr>, POS <dbl>, ID <lgl>,

# REF <chr>, ALT <chr>, QUAL <dbl>, FILTER <lgl>, INDEL <lgl>,

# IDV <dbl>, IMF <dbl>, DP <dbl>, VDB <dbl>, RPB <dbl>, MQB <dbl>,

# BQB <dbl>, MQSB <dbl>, SGB <dbl>, MQ0F <dbl>, ICB <lgl>, HOB <lgl>,

# AC <dbl>, AN <dbl>, DP4 <chr>, MQ <dbl>, Indiv <chr>, gt_PL <dbl>,

# gt_GT <dbl>, gt_GT_alleles <chr>, indel_size <int>,

# mutation_type <chr>

10.6 SPLIT-APPLY-COMBINE DATA ANALYSIS AND THE SUMMARIZE() FUNCTION

Many data analysis tasks can be approached using the “split-apply-combine” paradigm: split the data
into groups, apply some analysis to each group, and then combine the results. dplyr makes this very
easy through the use of the group_by() function, which splits the data into groups. When the data is
grouped in this way summarize() can be used to collapse each group into a single-row summary.
summarize() does this by applying an aggregating or summary function to each group.

For example, if we wanted to group by mutation_type and find the average size for the insertions and
deletions:

variants_indel %>%

group_by(mutation_type) %>%

summarize(

mean_size = mean(indel_size))

Output:

# A tibble: 3 x 2

mutation_type mean_size

<chr> <dbl>

1 deletion -1.38

2 insertion 1.61

3 point 0
We can have additional columns by adding arguments to the summarize() function. For instance, if we
also wanted to know the median indel size:

variants_indel %>%

group_by(mutation_type) %>%

summarize(

mean_size = mean(indel_size),

median_size = median(indel_size))

Output:

# A tibble: 3 x 3

mutation_type mean_size median_size

<chr> <dbl> <dbl>

1 deletion -1.38 -1

2 insertion 1.61 1

3 point 0 0

Similarly, to view the highest filtered depth (DP) for each sample:

variants_indel %>%

group_by(sample_id) %>%

summarize(max(DP))

Output:

# A tibble: 3 x 2

sample_id `max(DP)`

<chr> <dbl>

1 SRR2584863 20

2 SRR2584866 79

3 SRR2589044 16

Callout: missing data and built-in functions

R has many built-in functions like mean(), median(), min(), and max() that are useful to compute
summary statistics. These are called “built-in functions” because they come with R and don’t require
that you install any additional packages. By default, all R functions operating on vectors that contains
missing data will return NA. It’s a way to make sure that users know they have missing data, and make
a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the
easiest way to ignore NA (the missing data) is to use na.rm = TRUE (rm stands for remove).
It is often useful to calculate how many observations are present in each group. The function n() helps
you do that:

variants_indel %>%

group_by(mutation_type) %>%

summarize(

n = n())

Output:

# A tibble: 3 x 2

mutation_type n

<chr> <int>

1 deletion 39

2 insertion 62

3 point 700

Because this is a common operation, the dplyr verb count() is a “shortcut” that combines these two
commands:

variants_indel %>%

count(mutation_type)

Output:

# A tibble: 3 x 2

mutation_type n

<chr> <int>

1 deletion 39

2 insertion 62

3 point 700

group_by() (and therefore count()) can also take multiple column names.

10.7 RESHAPING DATA FRAMES

While the tidy format is useful to analyze and plot data in R, it can sometimes be useful to transform
the “long” tidy format, into the wide format. This transformation can be done with the spread()
function provided by the tidyr package (also part of the tidyverse).

spread() takes a data frame as its first argument, followed by two further arguments: the column whose
values will become the new column names and the column whose values will fill the cells in the wide data.
variants_wide <- variants_indel %>%

count(sample_id, mutation_type) %>%

spread(mutation_type, n)

variants_wide

Output:

# A tibble: 3 x 4

sample_id deletion insertion point

<chr> <int> <int> <int>

1 SRR2584863 1 5 19

2 SRR2584866 37 54 675

3 SRR2589044 1 3 6

The opposite operation of spread() is handled by gather(). We specify the names of the new
columns, and here add -sample_id as this column shouldn’t be affected by the reshaping:

variants_wide %>%

gather(mutation_type, n, -sample_id)

Output:

# A tibble: 9 x 3

sample_id mutation_type n

<chr> <chr> <int>

1 SRR2584863 deletion 1

2 SRR2584866 deletion 37

3 SRR2589044 deletion 1

4 SRR2584863 insertion 5

5 SRR2584866 insertion 54

6 SRR2589044 insertion 3

7 SRR2584863 point 19

8 SRR2584866 point 675

9 SRR2589044 point 6

Exporting

We can export this new dataset using write_csv():

write_csv(variants_indel, "variants_indel.csv")
Check your Progress 1

Match the following

1. ends_with() a. keep all the rows that match the conditions that are provided
2. filter() b. splits the data into groups
3. group_by() c. collapse each group into a single-row summary
4. summarize() d. allows to select columns that ends with specific letters

Activity 1

1. Create a table that contains all the columns with the letter “i” except for the columns “Indiv”
and “FILTER”, and the column “POS”. Hint: look at the help for the function ends_with() we’ve
just covered.
2. Select all the mutations that occurred between the positions 1e6 (one million) and 2e6
(included) that are not indels and have QUAL greater than 200.

Summary

The dplyr package provides a number of very useful functions for manipulating data frames in a way
that will reduce repetition, reduce the probability of making errors, and probably even save you some
typing. In this unit we have discussed this in detail. We have also discussed the most commonly used
functions such as select(), filter(), group_by(), summarize() and mutate().

Keywords

 Probability: Probability is a numerical description of how likely an event is to occur or how likely it is that a proposition is true.
 Vectors: A vector is a basic data structure in R, which contains elements of the same type. The data types can be logical, integer, double, character, complex or raw.
 Tidy data: Tidy data is a specific way of organizing data into a consistent format which plugs
into the tidyverse set of packages for R.

Self-Assessment Questions

1. Based on the probability scores we calculated when we first introduced mutate(), classify
each mutation in 3 categories: high (> 0.95), medium (between 0.7 and 0.95), and low (< 0.7),
and create a table with sample_id as rows, the 3 levels of quality as columns, and the number
of mutations in the cells.
2. Starting with the variants dataframe, use pipes to subset the data to include only observations
from SRR2584863 sample, where the filtered depth (DP) is at least 10. Retain only the columns
REF, ALT, and POS.
3. Write a short note on pipe.
4. Explain the use of mutate() function with example.

Answers to Check Your Progress


Check your Progress 1
Match the following
1. - d
2. - a
3. - b
4. - c

Suggested Reading

1. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, by Hadley Wickham,
Garrett Grolemund
2. Efficient R Programming: A Practical Guide to Smarter Programming by Colin Gillespie, Robin
Lovelace
3. Advanced R by Hadley Wickham
4. R Programming Fundamentals: Deal with data using various modeling techniques by Kaelen
Medeiros
5. Learning R Programming by Kun Ren

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
4.0 International (CC BY 4.0) as requested by the work’s creator or licensees. This license is available
at https://creativecommons.org/licenses/by/4.0/
Unit 11

Data Visualisation in Python – Matplotlib

11.1 Introduction

11.2 Data Visualisation

11.3 Simple Plot

11.4 Figures, Subplots, Axes and Ticks

11.5 Other Types of Plots

11.6 Quick References

Summary

Keywords

Self-Assessment Questions

Answers to Check Your Progress

Suggested Reading
Objectives:

After going through this unit, you will be able to:

 Understand and implement different types of plots using Matplotlib library functions

11.1 INTRODUCTION

Matplotlib is probably the most used Python package for 2D-graphics. It provides both a quick way to
visualize data from Python and publication-quality figures in many formats. We are going to explore
matplotlib in interactive mode covering most common cases in this unit.

11.2 DATA VISUALISATION

11.2.1 IPython, Jupyter, and Matplotlib modes

The Jupyter notebook and IPython (an enhanced interactive Python shell) are tuned for the scientific-
computing workflow in Python, in combination with Matplotlib:

For interactive matplotlib sessions, turn on the matplotlib mode. When using the IPython console,
use:

In [1]:

%matplotlib

Jupyter notebook:

In the notebook, insert, at the beginning of the notebook the following magic:

%matplotlib inline

11.2.2 pyplot

pyplot provides a procedural interface to the matplotlib object-oriented plotting library. It is modeled
closely after Matlab™. Therefore, the majority of plotting commands in pyplot have Matlab™ analogs
with similar arguments. Important commands are explained with interactive examples.

from matplotlib import pyplot as plt

11.3 SIMPLE PLOT

In this section, we want to draw the cosine and sine functions on the same plot. Starting from the
default settings, we’ll enrich the figure step by step to make it nicer. First step is to get the data for
the sine and cosine functions:

import numpy as np

X = np.linspace(-np.pi, np.pi, 256, endpoint=True)

C, S = np.cos(X), np.sin(X)

X is now a numpy array with 256 values ranging from -π to +π (included). C is the cosine (256 values)
and S is the sine (256 values).

To run the example, you can type them in an IPython interactive session:
$ ipython --matplotlib

This brings us to the IPython prompt:

IPython 0.13 -- An enhanced Interactive Python.

? -> Introduction to IPython's features.

%magic -> Information about IPython's 'magic' % functions.

help -> Python's own help system.

object? -> Details about 'object'. ?object also works, ?? prints more.

You can also download each of the examples and run it using regular python, but you will lose
interactive data manipulation:

$ python plot_exercise_1.py

11.3.1. Plotting with default settings

Matplotlib comes with a set of default settings that allow customizing all kinds of properties. You can
control the defaults of almost every property in matplotlib: figure size and dpi, line width, color and
style, axes, axis and grid properties, text and font properties and so on.

import numpy as np

import matplotlib.pyplot as plt

X = np.linspace(-np.pi, np.pi, 256, endpoint=True)

C, S = np.cos(X), np.sin(X)

plt.plot(X, C)

plt.plot(X, S)

plt.show()

11.3.2. Instantiating defaults

In the script below, we’ve instantiated (and commented) all the figure settings that influence the
appearance of the plot. The settings have been explicitly set to their default values, but now you can
interactively play with the values to explore their effect (see Line properties and Line styles below).
import numpy as np

import matplotlib.pyplot as plt

# Create a figure of size 8x6 inches, 80 dots per inch

plt.figure(figsize=(8, 6), dpi=80)

# Create a new subplot from a grid of 1x1

plt.subplot(1, 1, 1)

X = np.linspace(-np.pi, np.pi, 256, endpoint=True)

C, S = np.cos(X), np.sin(X)

# Plot cosine with a blue continuous line of width 1 (pixels)

plt.plot(X, C, color="blue", linewidth=1.0, linestyle="-")

# Plot sine with a green continuous line of width 1 (pixels)

plt.plot(X, S, color="green", linewidth=1.0, linestyle="-")

# Set x limits

plt.xlim(-4.0, 4.0)

# Set x ticks

plt.xticks(np.linspace(-4, 4, 9, endpoint=True))

# Set y limits

plt.ylim(-1.0, 1.0)

# Set y ticks

plt.yticks(np.linspace(-1, 1, 5, endpoint=True))

# Save figure using 72 dots per inch

# plt.savefig("exercise_2.png", dpi=72)

# Show result on screen

plt.show()
11.3.3. Changing colors and line widths

First step, we want to have the cosine in blue and the sine in red and a slightly thicker line for both of
them. We’ll also slightly alter the figure size to make it more horizontal.

...

plt.figure(figsize=(10, 6), dpi=80)

plt.plot(X, C, color="blue", linewidth=2.5, linestyle="-")

plt.plot(X, S, color="red", linewidth=2.5, linestyle="-")

...

11.3.4. Setting limits

Current limits of the figure are a bit too tight and we want to make some space in order to clearly see
all data points.

...

plt.xlim(X.min() * 1.1, X.max() * 1.1)

plt.ylim(C.min() * 1.1, C.max() * 1.1)

...

11.3.5. Setting ticks

Current ticks are not ideal because they do not show the interesting values (+/-π,+/-π/2) for sine and
cosine. We’ll change them such that they show only these values.
...

plt.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi])

plt.yticks([-1, 0, +1])

...

11.3.6. Setting tick labels

Ticks are now properly placed but their label is not very explicit. We could guess that 3.142 is π but it
would be better to make it explicit. When we set tick values, we can also provide a corresponding label
in the second argument list. Note that we’ll use LaTeX to allow for nice rendering of the label.

...

plt.xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi], [r'$-\pi$', r'$-\pi/2$', r'$0$', r'$+\pi/2$', r'$+\pi$'])

plt.yticks([-1, 0, +1], [r'$-1$', r'$0$', r'$+1$'])

...

11.3.7. Moving spines

Spines are the lines connecting the axis tick marks and noting the boundaries of the data area. They
can be placed at arbitrary positions and until now, they were on the border of the axis. We’ll change
that since we want to have them in the middle. Since there are four of them (top/bottom/left/right),
we’ll discard the top and right by setting their color to none and we’ll move the bottom and left ones
to coordinate 0 in data space coordinates.

...

ax = plt.gca() # gca stands for 'get current axis'

ax.spines['right'].set_color('none')

ax.spines['top'].set_color('none')

ax.xaxis.set_ticks_position('bottom')

ax.spines['bottom'].set_position(('data',0))

ax.yaxis.set_ticks_position('left')

ax.spines['left'].set_position(('data',0))

...

11.3.8. Adding a legend

Let’s add a legend in the upper left corner. This only requires adding the keyword argument label (that
will be used in the legend box) to the plot commands.

...

plt.plot(X, C, color="blue", linewidth=2.5, linestyle="-", label="cosine")

plt.plot(X, S, color="red", linewidth=2.5, linestyle="-", label="sine")

plt.legend(loc='upper left')

...
11.3.9. Annotate some points

Let’s annotate some interesting points using the annotate command. We chose the 2π/3 value and
we want to annotate both the sine and the cosine. We’ll first draw a marker on the curve as well as a
straight dotted line. Then, we’ll use the annotate command to display some text with an arrow.

...

t = 2 * np.pi / 3

plt.plot([t, t], [0, np.cos(t)], color='blue', linewidth=2.5, linestyle="--")

plt.scatter([t, ], [np.cos(t), ], 50, color='blue')

plt.annotate(r'$cos(\frac{2\pi}{3})=-\frac{1}{2}$',

xy=(t, np.cos(t)), xycoords='data',

xytext=(-90, -50), textcoords='offset points', fontsize=16,

arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))

plt.plot([t, t],[0, np.sin(t)], color='red', linewidth=2.5, linestyle="--")

plt.scatter([t, ],[np.sin(t), ], 50, color='red')

plt.annotate(r'$sin(\frac{2\pi}{3})=\frac{\sqrt{3}}{2}$',

xy=(t, np.sin(t)), xycoords='data',

xytext=(+10, +30), textcoords='offset points', fontsize=16,

arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2"))

...
11.3.10. Devil is in the details

The tick labels are now hardly visible because of the blue and red lines. We can make them bigger and
we can also adjust their properties such that they’ll be rendered on a semi-transparent white
background. This will allow us to see both the data and the labels.

...

for label in ax.get_xticklabels() + ax.get_yticklabels():

label.set_fontsize(16)

label.set_bbox(dict(facecolor='white', edgecolor='None', alpha=0.65))

...

11.4 FIGURES, SUBPLOTS, AXES AND TICKS

A “figure” in matplotlib means the whole window in the user interface. Within this figure there can be
“subplots”.

So far we have used implicit figure and axes creation. This is handy for fast plots. We can have more
control over the display using figure, subplot, and axes explicitly. While subplot positions the plots in
a regular grid, axes allows free placement within the figure. Both can be useful depending on your
intention. We’ve already worked with figures and subplots without explicitly calling them. When we
call plot, matplotlib calls gca() to get the current axes and gca in turn calls gcf() to get the current
figure. If there is none it calls figure() to make one, strictly speaking, to make a subplot(111). Let’s look
at the details.

11.4.1. Figures

A figure is the window in the GUI that has “Figure #” as its title. Figures are numbered starting from 1, as
opposed to the normal Python way of starting from 0. This is clearly MATLAB-style. There are several
parameters that determine what the figure looks like, such as num, figsize, dpi, facecolor, edgecolor and frameon.

The defaults can be specified in the resource file and will be used most of the time. Only the number
of the figure is frequently changed. As with other objects, you can set figure properties with setp or
with the set_something methods.
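
For instance, a minimal sketch (assuming matplotlib.pyplot has been imported as plt, as earlier in this unit; the parameter values here are arbitrary, not defaults):

# Create figure number 2 with an explicit size and resolution,
# then change one of its properties afterwards.
fig = plt.figure(num=2, figsize=(8, 6), dpi=100)
fig.set_facecolor('0.95')   # light grey background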

When you work with the GUI you can close a figure by clicking on the x in the upper right corner. But
you can close a figure programmatically by calling close. Depending on the argument it closes (1) the
current figure (no argument), (2) a specific figure (figure number or figure instance as argument), or
(3) all figures ("all" as argument).

plt.close(1) # Closes figure 1

11.4.2. Subplots

With subplot you can arrange plots in a regular grid. You need to specify the number of rows and
columns and the number of the plot. Note that the gridspec command is a more powerful alternative.
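
For example, a minimal sketch that places the cosine and sine curves side by side in a 1x2 grid (assuming numpy and pyplot are imported as np and plt, as earlier in this unit):

X = np.linspace(-np.pi, np.pi, 256)
plt.subplot(1, 2, 1)         # 1 row, 2 columns, first plot
plt.plot(X, np.cos(X))
plt.subplot(1, 2, 2)         # 1 row, 2 columns, second plot
plt.plot(X, np.sin(X))
plt.show()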

11.4.3. Axes

Axes are very similar to subplots but allow placement of plots at any location in the figure. So if we
want to put a smaller plot inside a bigger one we do so with axes.
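
A minimal sketch of this idea, again assuming np and plt are imported: a small sine plot is placed inside a larger cosine plot.

X = np.linspace(-np.pi, np.pi, 256)
plt.axes([0.1, 0.1, 0.8, 0.8])       # [left, bottom, width, height] as fractions of the figure
plt.plot(X, np.cos(X))
plt.axes([0.6, 0.65, 0.25, 0.2])     # smaller axes placed in the upper-right area
plt.plot(X, np.sin(X))
plt.show()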
11.4.4. Ticks

Well formatted ticks are an important part of publishing-ready figures. Matplotlib provides a totally
configurable system for ticks. There are tick locators to specify where ticks should appear and tick
formatters to give ticks the appearance you want. Major and minor ticks can be located and formatted
independently from each other. By default, minor ticks are not shown, i.e. there is only an empty list
for them because their locator is a NullLocator (see below).

Tick Locators

Tick locators control the positions of the ticks. They are set as follows:

ax = plt.gca()

ax.xaxis.set_major_locator(eval(locator))

There are several locators for different kinds of requirements, for example NullLocator, IndexLocator, FixedLocator, LinearLocator, MultipleLocator, AutoLocator and MaxNLocator.

All of these locators derive from the base class matplotlib.ticker.Locator. You can make your own
locator deriving from it. Handling dates as ticks can be especially tricky. Therefore, matplotlib provides
special locators in matplotlib.dates.
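
For example, a minimal sketch (assuming plt is imported) that puts a major tick every 1.0 units and a minor tick every 0.25 units on the x axis:

from matplotlib.ticker import MultipleLocator

ax = plt.gca()
ax.xaxis.set_major_locator(MultipleLocator(1.0))
ax.xaxis.set_minor_locator(MultipleLocator(0.25))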

11.5 Other Types of Plots

11.5.1. Regular Plots

Starting from the code below, try to reproduce the graphic taking care of filled areas:

n = 256

X = np.linspace(-np.pi, np.pi, n, endpoint=True)

Y = np.sin(2 * X)

plt.plot(X, Y + 1, color='blue', alpha=1.00)

plt.plot(X, Y - 1, color='blue', alpha=1.00)


11.5.2. Scatter Plots

Starting from the code below, try to reproduce the graphic taking care of marker size, color and
transparency.

n = 1024

X = np.random.normal(0,1,n)

Y = np.random.normal(0,1,n)

plt.scatter(X,Y)

11.5.3. Bar Plots

Starting from the code below, try to reproduce the graphic by adding labels for red bars.

n = 12

X = np.arange(n)

Y1 = (1 - X / float(n)) * np.random.uniform(0.5, 1.0, n)

Y2 = (1 - X / float(n)) * np.random.uniform(0.5, 1.0, n)

plt.bar(X, +Y1, facecolor='#9999ff', edgecolor='white')

plt.bar(X, -Y2, facecolor='#ff9999', edgecolor='white')

for x, y in zip(X, Y1):

plt.text(x + 0.4, y + 0.05, '%.2f' % y, ha='center', va='bottom')

plt.ylim(-1.25, +1.25)
11.5.4. Contour Plots

Starting from the code below, try to reproduce the graphic taking care of the colormap (see Colormaps
below).

def f(x, y):

return (1 - x / 2 + x ** 5 + y ** 3) * np.exp(-x ** 2 -y ** 2)

n = 256

x = np.linspace(-3, 3, n)

y = np.linspace(-3, 3, n)

X, Y = np.meshgrid(x, y)

plt.contourf(X, Y, f(X, Y), 8, alpha=.75, cmap='jet')

C = plt.contour(X, Y, f(X, Y), 8, colors='black', linewidths=.5)

11.5.5. Imshow

Starting from the code below, try to reproduce the graphic taking care of colormap, image
interpolation and origin.

def f(x, y):

return (1 - x / 2 + x ** 5 + y ** 3) * np.exp(-x ** 2 - y ** 2)
n = 10

x = np.linspace(-3, 3, 4 * n)

y = np.linspace(-3, 3, 3 * n)

X, Y = np.meshgrid(x, y)

plt.imshow(f(X, Y))

11.5.6. Pie Charts

Starting from the code below, try to reproduce the graphic taking care of colors and slices size.

Z = np.random.uniform(0, 1, 20)

plt.pie(Z)

11.5.7. Quiver Plots

Starting from the code below, try to reproduce the graphic taking care of colors and orientations.

n=8

X, Y = np.mgrid[0:n, 0:n]

plt.quiver(X, Y)
11.5.8. Grids

Starting from the code below, try to reproduce the graphic taking care of line styles.

axes = plt.gca()

axes.set_xlim(0, 4)

axes.set_ylim(0, 3)

axes.set_xticklabels([])

axes.set_yticklabels([])

11.5.9. Multi Plots

Starting from the code below, try to reproduce the graphic.

plt.subplot(2, 2, 1)

plt.subplot(2, 2, 3)

plt.subplot(2, 2, 4)
11.5.10. Polar Axis

Starting from the code below, try to reproduce the graphic.

plt.axes([0, 0, 1, 1])

N = 20

theta = np.arange(0., 2 * np.pi, 2 * np.pi / N)

radii = 10 * np.random.rand(N)

width = np.pi / 4 * np.random.rand(N)

bars = plt.bar(theta, radii, width=width, bottom=0.0)

for r, bar in zip(radii, bars):

bar.set_facecolor(plt.cm.jet(r / 10.))

bar.set_alpha(0.5)

11.5.11. 3D Plots

Starting from the code below, try to reproduce the graphic.

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = Axes3D(fig)

X = np.arange(-4, 4, 0.25)

Y = np.arange(-4, 4, 0.25)

X, Y = np.meshgrid(X, Y)

R = np.sqrt(X**2 + Y**2)

Z = np.sin(R)

ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='hot')

11.6 QUICK REFERENCES

Here is a set of tables that show main properties and styles.

11.6.1 Line Properties


11.6.2 Line Styles

11.6.3 Markers

11.6.4 Colormaps

All colormaps can be reversed by appending _r. For instance, gray_r is the reverse of gray.
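
For instance (a minimal illustration; img stands for any 2D array you want to display):

plt.imshow(img, cmap='gray')      # dark-to-light greyscale
plt.imshow(img, cmap='gray_r')    # the same colormap, reversed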
Check your Progress 1
State True or False
1. Matplotlib comes with a set of default settings and we are not allowed to customize them.
2. A “figure” in matplotlib means the whole window in the user interface.

Activity 1

Install Matplotlib and implement all the functions as discussed in the unit.

Summary

Matplotlib provides both a quick way to visualize data from Python and publication-quality figures in
many formats. In this unit we have explored matplotlib in interactive mode, covering the most commonly
used functions with examples.

Keywords

 Jupyter notebook: It is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations and narrative text.
 IPython: IPython (Interactive Python) is a command shell for interactive computing in multiple
programming languages.
 DPI: It stands for Dots per Inch which technically means printer dots per inch.
 Annotation: An extra information associated with a particular point in a document or other
piece of information.

Self-Assessment Questions

1. Explain the functions used for simple plot.


2. What function is used to annotate some points of plot? Give example.
3. State different types of plot. Explain any 3 in detail with functions.

Answers to Check Your Progress


Check your Progress 1
State True or False
1. False
2. True
Suggested Reading

1. Interactive Applications Using Matplotlib by Benjamin V. Root


2. Matplotlib for Python Developers by Sandro Tosi
3. Python and Matplotlib Essentials for Scientists and Engineers by Matt A Wood
4. matplotlib Plotting Cookbook by Alexandre Devert

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
4.0 International (CC BY 4.0) as requested by the work’s creator or licensees. This license is available
at https://creativecommons.org/licenses/by/4.0/
Unit 12

Introduction to scikit-learn

12.1 Introduction

12.2 Data in scikit-learn

12.3 Basic principles of machine learning with scikit-learn

12.4 Supervised Learning: Classification of Handwritten Digits

12.5 Supervised Learning: Regression of Housing Data

12.6 Measuring prediction performance

12.7 Unsupervised Learning: Dimensionality Reduction and Visualization

12.8 Parameter selection, Validation, and Testing

Summary

Keywords

Self-Assessment Questions

Answers to Check Your Progress

Suggested Reading
Objectives

After going through this unit, you will be able to:

 Understand the features of scikit-learn
 Demonstrate supervised and unsupervised learning with scikit-learn

12.1 INTRODUCTION

Machine Learning is about building programs with tunable parameters that are adjusted automatically
so as to improve their behavior by adapting to previously seen data.
Machine Learning can be considered a subfield of Artificial Intelligence since those algorithms can be
seen as building blocks to make computers learn to behave more intelligently by
somehow generalizing rather than just storing and retrieving data items like a database system would
do.
Classification Problem:

We’ll take a look at two very simple machine learning tasks here. The first is a classification task:
imagine a collection of two-dimensional data points, colored according to two different class labels. A
classification algorithm may be used to draw a dividing boundary between the two clusters of points:
By drawing this separating line, we have learned a model which can generalize to new data: if you
were to drop another point onto the plane which is unlabeled, this algorithm could
now predict whether it’s a blue or a red point.
Regression Problem:
The next simple task we’ll look at is a regression task: a simple best-fit line to a set of data. Again, this
is an example of fitting a model to data, but our focus here is that the model can make generalizations
about new data. The model has been learned from the training data, and can be used to predict the
result of test data: here, we might be given an x-value, and the model would allow us to predict the y
value.
12.2 DATA IN SCIKIT-LEARN
The data matrix
Machine learning algorithms implemented in scikit-learn expect data to be stored in a two-
dimensional array or matrix. The arrays can be either numpy arrays, or in some
cases scipy.sparse matrices. The size of the array is expected to be [n_samples, n_features]

 n_samples: The number of samples: each sample is an item to process (e.g. classify). A sample
can be a document, a picture, a sound, a video, an astronomical object, a row in a database or
CSV file, or whatever you can describe with a fixed set of quantitative traits.
 n_features: The number of features or distinct traits that can be used to describe each item
in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-
valued in some cases.

The number of features must be fixed in advance. However it can be very high dimensional (e.g.
millions of features) with most of them being zeros for a given sample. This is a case
where scipy.sparse matrices can be useful, in that they are much more memory-efficient than numpy
arrays.

A Simple Example: the Iris Dataset

The application problem: As an example of a simple dataset, let us take a look at the iris data stored by
scikit-learn. Suppose we want to recognize species of irises. The data consists of measurements of
three different species of irises: Iris setosa, Iris versicolor and Iris virginica.

Remember that there must be a fixed number of features for each sample, and feature number i must
be a similar kind of quantity for each sample.
Loading the Iris Data with Scikit-learn
Scikit-learn has a very straightforward set of data on these iris species. The data consist of the
following:

 Features in the Iris dataset:
 sepal length (cm)
 sepal width (cm)
 petal length (cm)
 petal width (cm)
 Target classes to predict:
 Setosa
 Versicolour
 Virginica

scikit-learn embeds a copy of the iris CSV file along with a function to load it into numpy arrays:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
Note: scikit-learn is imported in Python under the name sklearn.
The features of each sample flower are stored in the data attribute of the dataset:
>>> print(iris.data.shape)
(150, 4)
>>> n_samples, n_features = iris.data.shape
>>> print(n_samples)
150
>>> print(n_features)
4
>>> print(iris.data[0])
[ 5.1 3.5 1.4 0.2]
The information about the class of each sample is stored in the target attribute of the dataset:
>>> print(iris.target.shape)
(150,)
>>> print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
The names of the classes are stored in the last attribute, namely target_names:
>>> print(iris.target_names)
['setosa' 'versicolor' 'virginica']
This data is four-dimensional, but we can visualize two of the dimensions at a time using a scatter plot:
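A minimal sketch of such a scatter plot (the choice of the two feature columns is illustrative):
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
x_index, y_index = 0, 1   # sepal length vs. sepal width
plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index])
plt.colorbar(ticks=[0, 1, 2])   # one tick per target class
plt.show()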
12.3 BASIC PRINCIPLES OF MACHINE LEARNING WITH SCIKIT-LEARN

12.3.1. Introducing the scikit-learn estimator object


Every algorithm is exposed in scikit-learn via an ‘’Estimator’’ object. For instance a linear regression
is: sklearn.linear_model.LinearRegression
from sklearn.linear_model import LinearRegression
Estimator parameters: All the parameters of an estimator can be set when it is instantiated:
>>> model = LinearRegression(normalize=True)
>>> print(model.normalize)
True
>>> print(model)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
Fitting on data
Let’s create some simple data with numpy:
>>> import numpy as np
>>> x = np.array([0, 1, 2])
>>> y = np.array([0, 1, 2])
>>> X = x[:, np.newaxis] # The input data for sklearn is 2D: (samples == 3 x features == 1)
>>> X
array([[0],
[1],
[2]])
>>> model.fit(X, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)
Estimated parameters: When data is fitted with an estimator, parameters are estimated from the data
at hand. All the estimated parameters are attributes of the estimator object ending with an underscore:
>>> model.coef_
array([ 1.])
12.3.2. Supervised Learning: Classification and regression
In Supervised Learning, we have a dataset consisting of both features and labels. The task is to
construct an estimator which is able to predict the label of an object given the set of features. A
relatively simple example is predicting the species of iris given a set of measurements of its flower.
This is a relatively simple task. Some more complicated examples are:

 given a multicolor image of an object through a telescope, determine whether that object is
a star, a quasar, or a galaxy.
 given a photograph of a person, identify the person in the photo.
 given a list of movies a person has watched and their personal rating of the movie, recommend
a list of movies they would like (So-called recommender systems: a famous example is
the Netflix Prize).

What these tasks have in common is that there is one or more unknown quantities associated with
the object which needs to be determined from other observed quantities.
Supervised learning is further broken down into two categories, classification and regression. In
classification, the label is discrete, while in regression, the label is continuous. For example, in
astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a classification
problem: the label is from three distinct categories. On the other hand, we might wish to estimate the
age of an object based on such observations: this would be a regression problem, because the label
(age) is a continuous quantity.
Classification: K nearest neighbors (kNN) is one of the simplest learning strategies: given a new,
unknown observation, look up in your reference database which ones have the closest features and
assign the predominant class. Let’s try it out on our iris classification problem:
from sklearn import neighbors, datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
knn = neighbors.KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?
print(iris.target_names[knn.predict([[3, 5, 4, 2]])])
A plot of the sepal space and the prediction of the KNN
Regression: The simplest possible regression setting is the linear regression one:

A plot of a simple linear regression
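As a hedged sketch, assuming some synthetic one-dimensional data (not the exact data behind the figure), such a fit and plot could be produced as follows:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = 10 * rng.rand(30)
y = 2 * x + 1 + rng.normal(scale=2, size=x.shape)   # noisy straight-line data

model = LinearRegression()
model.fit(x[:, np.newaxis], y)            # sklearn expects 2-D input data

x_fit = np.linspace(0, 10, 100)
plt.plot(x, y, 'o')                                     # the training data
plt.plot(x_fit, model.predict(x_fit[:, np.newaxis]))    # the learned best-fit line
plt.show()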


12.3.3. A recap on Scikit-learn’s estimator interface
Scikit-learn strives to have a uniform interface across all methods, and we’ll see examples of these
below. Given a scikit-learn estimator object named model, the following methods are available:
In all Estimators:

 model.fit() : fit training data. For supervised learning applications, this accepts two arguments:
the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this
accepts only a single argument, the data X (e.g. model.fit(X)).
In supervised estimators:
 model.predict() : given a trained model, predict the label of a new set of data. This method
accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the
learned label for each object in the array.
 model.predict_proba() : For classification problems, some estimators also provide this
method, which returns the probability that a new observation has each categorical label. In
this case, the label with the highest probability is returned by model.predict().
 model.score() : for classification or regression problems, most estimators implement a
score method. Scores are between 0 and 1, with a larger score indicating a better fit.
In unsupervised estimators:

 model.transform() : given an unsupervised model, transform new data into the new basis. This
also accepts one argument X_new, and returns the new representation of the data based on
the unsupervised model.
 model.fit_transform() : some estimators implement this method, which more efficiently
performs a fit and a transform on the same input data.
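As a short, illustrative run-through of this interface (reusing the iris data and the KNeighborsClassifier and PCA estimators that appear elsewhere in this unit; the particular parameter values are arbitrary):
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA

iris = load_iris()
X, y = iris.data, iris.target

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)                      # supervised: fit(X, y)
print(clf.predict(X[:5]))          # predicted labels for the first 5 samples
print(clf.predict_proba(X[:5]))    # per-class probabilities
print(clf.score(X, y))             # a score between 0 and 1

pca = PCA(n_components=2)
X_new = pca.fit_transform(X)       # unsupervised: fit(X) and transform(X) in one step
print(X_new.shape)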
12.3.4. Regularization: what it is and why it is necessary
Preferring simpler models
Train errors: Suppose you are using a 1-nearest neighbor estimator. How many errors do you expect
on your train set? None: each training point is its own nearest neighbor, so a 1-nearest neighbor
estimator makes zero errors on the data it was trained on.

 Train set error is therefore not a good measurement of prediction performance. You need to leave out
a test set.
 In general, we should accept some errors on the train set.

An example of regularization: The core idea behind regularization is that we are going to prefer models
that are simpler, for a certain definition of "simpler", even if they lead to more errors on the train set.
As an example, let's generate data from a 9th order polynomial, with noise:

And now, let’s fit a 4th order and a 9th order polynomial to the data.
With your naked eyes, which model do you prefer, the 4th order one, or the 9th order one?
Let’s look at the ground truth:
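A minimal sketch of this experiment (the random 9th-order coefficients, noise level, and the use of numpy's polyfit are assumptions for illustration, not the exact code behind the figures):
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
x = np.linspace(0, 1, 20)                     # sample positions
coefs = rng.normal(size=10)                   # a random 9th-order polynomial (10 coefficients)
y_true = np.polyval(coefs, x)
y = y_true + 0.1 * rng.normal(size=x.shape)   # add noise

x_plot = np.linspace(0, 1, 200)
plt.plot(x, y, 'o', label='noisy samples')
plt.plot(x_plot, np.polyval(coefs, x_plot), 'k--', label='ground truth')
for degree in (4, 9):
    fit = np.polyfit(x, y, degree)            # least-squares polynomial fit of the given degree
    plt.plot(x_plot, np.polyval(fit, x_plot), label='degree %d fit' % degree)
plt.legend()
plt.show()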

Regularization is ubiquitous in machine learning. Most scikit-learn estimators have a parameter to


tune the amount of regularization. For instance, with k-NN, it is ‘k’, the number of nearest neighbors
used to make the decision. k=1 amounts to no regularization: 0 error on the training set, whereas large
k will push toward smoother decision boundaries in the feature space.
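A hedged sketch of this effect on the iris data (the train/test split and the values k=1 and k=15 are illustrative choices; the split itself is introduced more formally later in this unit):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

for k in (1, 15):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print("k=%d  train score: %.3f  test score: %.3f"
          % (k, clf.score(X_train, y_train), clf.score(X_test, y_test)))
# k=1 typically scores 1.0 on the training data but not on the held-out data.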

Simple versus complex models for classification


For classification models, the decision boundary that separates the classes expresses the complexity of
the model. For instance, a linear model, which makes its decision based on a linear combination of the
features, is simpler than a non-linear one.

12.4 SUPERVISED LEARNING: CLASSIFICATION OF HANDWRITTEN DIGITS

12.4.1. The nature of the data


In this section we’ll apply scikit-learn to the classification of handwritten digits. This will go a bit
beyond the iris classification we saw before: we’ll discuss some of the metrics which can be used in
evaluating the effectiveness of a classification model.
>>> from sklearn.datasets import load_digits
>>> digits = load_digits()

Let us visualize the data to remind ourselves what we're looking at:
# plot the digits: each image is 8x8 pixels
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(6, 6))
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
plt.show()

12.4.2. Visualizing the Data on its principal components


A good first-step for many problems is to visualize the data using a Dimensionality
Reduction technique. We’ll start with the most straightforward one, Principal Component Analysis
(PCA).
PCA seeks orthogonal linear combinations of the features which show the greatest variance, and as
such, can help give you a good idea of the structure of the data set.
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2)
>>> proj = pca.fit_transform(digits.data)
>>> plt.scatter(proj[:, 0], proj[:, 1], c=digits.target)
<matplotlib.collections.PathCollection object at ...>
>>> plt.colorbar()
<matplotlib.colorbar.Colorbar object at ...>

12.4.3. Gaussian Naive Bayes Classification


For most classification problems, it’s nice to have a simple, fast method to provide a quick baseline
classification. If the simple and fast method is sufficient, then we don’t have to waste CPU cycles on
more complex models. If not, we can use the results of the simple method to give us clues about our
data.
One good method to keep in mind is Gaussian Naive Bayes (sklearn.naive_bayes.GaussianNB).
Gaussian Naive Bayes fits a Gaussian distribution to each training label independently on each feature,
and uses this to quickly give a rough classification. It is generally not sufficiently accurate for real-world
data, but can perform surprisingly well, for instance on text data.
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.model_selection import train_test_split
>>> # split the data into training and validation sets
>>> X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)
>>> # train the model
>>> clf = GaussianNB()
>>> clf.fit(X_train, y_train)
GaussianNB(priors=None)
>>> # use the model to predict the labels of the test data
>>> predicted = clf.predict(X_test)
>>> expected = y_test
>>> print(predicted)
[1 7 7 7 8 2 8 0 4 8 7 7 0 8 2 3 5 8 5 3 7 9 6 2 8 2 2 7 3 5...]
>>> print(expected)
[1 0 4 7 8 2 2 0 4 3 7 7 0 8 2 3 4 8 5 3 7 9 6 3 8 2 2 9 3 5...]
As above, we plot the digits with the predicted labels to get an idea of how well the classification is
working.

12.4.4. Quantitative Measurement of Performance


We’d like to measure the performance of our estimator without having to resort to plotting examples.
A simple method might be to simply compare the number of matches:
>>> matches = (predicted == expected)
>>> print(matches.sum())
367
>>> print(len(matches))
450
>>> matches.sum() / float(len(matches))
0.81555555555555559
We see that more than 80% of the 450 predictions match the expected labels. But there are other, more
sophisticated metrics that can be used to judge the performance of a classifier: several are available
in the sklearn.metrics submodule.
One of the most useful metrics is the classification_report, which combines several measures and
prints a table with the results:
>>> from sklearn import metrics
>>> print(metrics.classification_report(expected, predicted))
precision recall f1-score support
0 1.00 0.91 0.95 46
1 0.76 0.64 0.69 44
2 0.85 0.62 0.72 47
3 0.98 0.82 0.89 49
4 0.89 0.86 0.88 37
5 0.97 0.93 0.95 41
6 1.00 0.98 0.99 44
7 0.73 1.00 0.84 45
8 0.50 0.90 0.64 49
9 0.93 0.54 0.68 48
avg / total 0.86 0.82 0.82 450
Another enlightening metric for this sort of multi-label classification is a confusion matrix: it helps us
visualize which labels are being interchanged in the classification errors:
>>> print(metrics.confusion_matrix(expected, predicted))
[[42 0 0 0 3 0 0 1 0 0]
[ 0 28 0 0 0 0 0 1 13 2]
[ 0 3 29 0 0 0 0 0 15 0]
[ 0 0 2 40 0 0 0 2 5 0]
[ 0 0 1 0 32 1 0 3 0 0]
[ 0 0 0 0 0 38 0 2 1 0]
[ 0 0 1 0 0 0 43 0 0 0]
[ 0 0 0 0 0 0 0 45 0 0]
[ 0 3 1 0 0 0 0 1 44 0]
[ 0 3 0 1 1 0 0 7 10 26]]
We see here that in particular, the numbers 1, 2, 3, and 9 are often being labeled 8.
12.5 SUPERVISED LEARNING: REGRESSION OF HOUSING DATA
Here we’ll do a short example of a regression problem: learning a continuous value from a set of
features.
12.5.1. A quick look at the data
We’ll use the simple Boston house prices set, available in scikit-learn. This records measurements of
13 attributes of housing markets around Boston, as well as the median price. The question is: can you
predict the price of a new market given its attributes?
>>> from sklearn.datasets import load_boston
>>> data = load_boston()
>>> print(data.data.shape)
(506, 13)
>>> print(data.target.shape)
(506,)
We can see that there are just over 500 data points.
The DESCR variable has a long description of the dataset:
>>> print(data.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
...
It often helps to quickly visualize pieces of the data using histograms, scatter plots, or other plot types.
With matplotlib, let us show a histogram of the target values: the median price in each neighborhood:
>>> plt.hist(data.target)
(array([...

Let’s have a quick look to see if some features are more relevant than others for our problem:
>>> for index, feature_name in enumerate(data.feature_names):
... plt.figure()
... plt.scatter(data.data[:, index], data.target)
<Figure size...
This is a manual version of a technique called feature selection.
Sometimes, in Machine Learning it is useful to use feature selection to decide which features are the
most useful for a particular problem. Automated methods exist which quantify this sort of exercise of
choosing the most informative features.
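As a hedged sketch of one such automated method, univariate feature selection with SelectKBest (the scoring function and the choice of k=5 are illustrative):
from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectKBest, f_regression

data = load_boston()
selector = SelectKBest(score_func=f_regression, k=5)
X_reduced = selector.fit_transform(data.data, data.target)
print(X_reduced.shape)                              # (506, 5): only 5 features kept
print(data.feature_names[selector.get_support()])   # names of the selected features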

12.5.2. Predicting Home Prices: a Simple Linear Regression


Now we’ll use scikit-learn to perform a simple linear regression on the housing data. There are many
possibilities of regressors to use. A particularly simple one is LinearRegression: this is basically a
wrapper around an ordinary least squares calculation.
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)
>>> from sklearn.linear_model import LinearRegression
>>> clf = LinearRegression()
>>> clf.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> predicted = clf.predict(X_test)
>>> expected = y_test
>>> print("RMS: %s" % np.sqrt(np.mean((predicted - expected) ** 2)))
RMS: 5.0059...

We can plot the error: expected as a function of predicted:


>>> plt.scatter(expected, predicted)
<matplotlib.collections.PathCollection object at ...>

The prediction at least correlates with the true price, though there are clearly some biases. We could
imagine evaluating the performance of the regressor by, say, computing the RMS residuals between
the true and predicted price. There are some subtleties in this, however, which we discuss in the next
section on measuring prediction performance.

12.6 MEASURING PREDICTION PERFORMANCE

12.6.1. A quick test on the K-neighbors classifier


Here we’ll continue to look at the digits data, but we’ll switch to the K-Neighbors classifier. The K
neighbors classifier is an instance-based classifier. The K-neighbors classifier predicts the label of an
unknown point based on the labels of the K nearest points in the parameter space.
>>> # Get the data
>>> from sklearn.datasets import load_digits
>>> digits = load_digits()
>>> X = digits.data
>>> y = digits.target
>>> # Instantiate and train the classifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> clf = KNeighborsClassifier(n_neighbors=1)
>>> clf.fit(X, y)
KNeighborsClassifier(...)
>>> # Check the results using metrics
>>> from sklearn import metrics
>>> y_pred = clf.predict(X)
>>> print(metrics.confusion_matrix(y_pred, y))
[[178 0 0 0 0 0 0 0 0 0]
[ 0 182 0 0 0 0 0 0 0 0]
[ 0 0 177 0 0 0 0 0 0 0]
[ 0 0 0 183 0 0 0 0 0 0]
[ 0 0 0 0 181 0 0 0 0 0]
[ 0 0 0 0 0 182 0 0 0 0]
[ 0 0 0 0 0 0 181 0 0 0]
[ 0 0 0 0 0 0 0 179 0 0]
[ 0 0 0 0 0 0 0 0 174 0]
[ 0 0 0 0 0 0 0 0 0 180]]
Apparently, we’ve found a perfect classifier! But this is misleading for the reasons we saw before: the
classifier essentially “memorizes” all the samples it has already seen. To really test how well this
algorithm does, we need to try some samples it hasn’t yet seen.
This problem also occurs with regression models. In the following we fit another instance-based model
named “decision tree” to the Boston Housing price dataset we introduced previously:
>>> from sklearn.datasets import load_boston
>>> from sklearn.tree import DecisionTreeRegressor
>>> data = load_boston()
>>> clf = DecisionTreeRegressor().fit(data.data, data.target)
>>> predicted = clf.predict(data.data)
>>> expected = data.target
>>> plt.scatter(expected, predicted)
<matplotlib.collections.PathCollection object at ...>
>>> plt.plot([0, 50], [0, 50], '--k')
[<matplotlib.lines.Line2D object at ...]
Here again the predictions are seemingly perfect as the model was able to perfectly memorize the
training set.
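A minimal sketch of how holding out data exposes this memorization (foreshadowing the validation-set approach of the next subsection; the split and random_state are arbitrary choices):
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

data = load_boston()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0)

clf = DecisionTreeRegressor().fit(X_train, y_train)
print(clf.score(X_train, y_train))   # ~1.0: the training data has been memorized
print(clf.score(X_test, y_test))     # noticeably lower on unseen data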
12.6.2. A correct approach: Using a validation set
Learning the parameters of a prediction function and testing it on the same data is a methodological
mistake: a model that would just repeat the labels of the samples that it has just seen would have a
perfect score but would fail to predict anything useful on yet-unseen data.
To avoid over-fitting, we have to define two different sets:

 a training set X_train, y_train which is used for learning the parameters of a predictive model
 a testing set X_test, y_test which is used for evaluating the fitted predictive model

In scikit-learn such a random split can be quickly computed with the train_test_split() function:
>>> from sklearn import model_selection
>>> X = digits.data
>>> y = digits.target
>>> X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y,
...     test_size=0.25, random_state=0)
>>> print("%r, %r, %r" % (X.shape, X_train.shape, X_test.shape))
(1797, 64), (1347, 64), (450, 64)
Now we train on the training data, and test on the testing data:
>>> clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
>>> y_pred = clf.predict(X_test)
>>> print(metrics.confusion_matrix(y_test, y_pred))
[[37 0 0 0 0 0 0 0 0 0]
[ 0 43 0 0 0 0 0 0 0 0]
[ 0 0 43 1 0 0 0 0 0 0]
[ 0 0 0 45 0 0 0 0 0 0]
[ 0 0 0 0 38 0 0 0 0 0]
[ 0 0 0 0 0 47 0 0 0 1]
[ 0 0 0 0 0 0 52 0 0 0]
[ 0 0 0 0 0 0 0 48 0 0]
[ 0 0 0 0 0 0 0 0 48 0]
[ 0 0 0 1 0 1 0 0 0 45]]
>>> print(metrics.classification_report(y_test, y_pred))
precision recall f1-score support
0 1.00 1.00 1.00 37
1 1.00 1.00 1.00 43
2 1.00 0.98 0.99 44
3 0.96 1.00 0.98 45
4 1.00 1.00 1.00 38
5 0.98 0.98 0.98 48
6 1.00 1.00 1.00 52
7 1.00 1.00 1.00 48
8 1.00 1.00 1.00 48
9 0.98 0.96 0.97 47
avg / total 0.99 0.99 0.99 450
The averaged f1-score is often used as a convenient measure of the overall performance of an
algorithm. It appears in the bottom row of the classification report; it can also be accessed directly:
>>> metrics.f1_score(y_test, y_pred, average="macro")
0.991367...
The over-fitting we saw previously can be quantified by computing the f1-score on the training data
itself:
>>> metrics.f1_score(y_train, clf.predict(X_train), average="macro")
1.0
12.6.3. Model Selection via Validation

We have applied Gaussian Naive Bayes, support vector machine, and K-nearest neighbors classifiers to
the digits dataset. Now that we have these validation tools in place, we can ask quantitatively which
of the three estimators works best for this dataset.
 With the default hyper-parameters for each estimator, which gives the best f1 score on
the validation set? Recall that hyperparameters are the parameters set when you instantiate
the classifier: for example, the n_neighbors in clf = KNeighborsClassifier(n_neighbors=1)
>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.svm import LinearSVC
>>> X = digits.data
>>> y = digits.target
>>> X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y,
... test_size=0.25, random_state=0)
>>> for Model in [GaussianNB, KNeighborsClassifier, LinearSVC]:
... clf = Model().fit(X_train, y_train)
... y_pred = clf.predict(X_test)
... print('%s: %s' %
... (Model.__name__, metrics.f1_score(y_test, y_pred, average="macro")))
GaussianNB: 0.8332741681...
KNeighborsClassifier: 0.9804562804...
LinearSVC: 0.93...
 For each classifier, which value for the hyperparameters gives the best results for the digits
data? For LinearSVC, use loss='l2' and loss='l1'. For KNeighborsClassifier we
use n_neighbors between 1 and 10. Note that GaussianNB does not have any adjustable
hyperparameters. A sketch of one way to run such a search is shown after the results below.
LinearSVC(loss='l1'): 0.930570687535
LinearSVC(loss='l2'): 0.933068826918
-------------------
KNeighbors(n_neighbors=1): 0.991367521884
KNeighbors(n_neighbors=2): 0.984844206884
KNeighbors(n_neighbors=3): 0.986775344954
KNeighbors(n_neighbors=4): 0.980371905382
KNeighbors(n_neighbors=5): 0.980456280495
KNeighbors(n_neighbors=6): 0.975792419414
KNeighbors(n_neighbors=7): 0.978064579214
KNeighbors(n_neighbors=8): 0.978064579214
KNeighbors(n_neighbors=9): 0.978064579214
KNeighbors(n_neighbors=10): 0.975555089773
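A sketch of one way such a search loop could be written (not necessarily the exact code that produced the numbers above; it recreates the same train/test split and follows the older loss='l1'/'l2' names for LinearSVC used in this unit):
from sklearn import metrics, model_selection
from sklearn.datasets import load_digits
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

for loss in ['l1', 'l2']:
    clf = LinearSVC(loss=loss).fit(X_train, y_train)
    print("LinearSVC(loss='%s'): %s"
          % (loss, metrics.f1_score(y_test, clf.predict(X_test), average="macro")))
print('-------------------')
for n in range(1, 11):
    clf = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    print("KNeighbors(n_neighbors=%d): %s"
          % (n, metrics.f1_score(y_test, clf.predict(X_test), average="macro")))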
12.6.4. Cross-validation
Cross-validation consists of repeatedly splitting the data into pairs of train and test sets, called 'folds'.
Scikit-learn comes with a function to automatically compute the score on all these folds. Here we
do KFold with k=5.
>>> clf = KNeighborsClassifier()
>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(clf, X, y, cv=5)
array([ 0.9478022 , 0.9558011 , 0.96657382, 0.98039216, 0.96338028])
We can use different splitting strategies, such as random splitting:
>>> from sklearn.model_selection import ShuffleSplit
>>> cv = ShuffleSplit(n_splits=5)
>>> cross_val_score(clf, X, y, cv=cv)
array([...])

There exist many different cross-validation strategies in scikit-learn.

12.6.5. Hyperparameter optimization with cross-validation


Consider regularized linear models, such as Ridge Regression, which uses l2 regularization, and Lasso
Regression, which uses l1 regularization. Choosing their regularization parameter is important.
Let us set these parameters on the Diabetes dataset, a simple regression problem. The diabetes data
consists of 10 physiological variables (age, sex, weight, blood pressure) measured on 442 patients, and
an indication of disease progression after one year:
>>> from sklearn.datasets import load_diabetes
>>> data = load_diabetes()
>>> X, y = data.data, data.target
>>> print(X.shape)
(442, 10)
With the default hyper-parameters: we compute the cross-validation score:
>>> from sklearn.linear_model import Ridge, Lasso
>>> for Model in [Ridge, Lasso]:
... model = Model()
... print('%s: %s' % (Model.__name__, cross_val_score(model, X, y).mean()))
Ridge: 0.409427438303
Lasso: 0.353800083299
Basic Hyperparameter Optimization
We compute the cross-validation score as a function of alpha, the strength of the regularization
for Lasso and Ridge. We choose 30 values of alpha between 0.001 and 0.1:
>>> alphas = np.logspace(-3, -1, 30)
>>> for Model in [Lasso, Ridge]:
... scores = [cross_val_score(Model(alpha), X, y, cv=3).mean()
... for alpha in alphas]
... plt.plot(alphas, scores, label=Model.__name__)
[<matplotlib.lines.Line2D object at ...

Automatically Performing Grid Search


sklearn.model_selection.GridSearchCV is constructed with an estimator, as well as a dictionary of
parameter values to be searched. We can find the optimal parameters this way:
>>> from sklearn.model_selection import GridSearchCV
>>> for Model in [Ridge, Lasso]:
... gscv = GridSearchCV(Model(), dict(alpha=alphas), cv=3).fit(X, y)
... print('%s: %s' % (Model.__name__, gscv.best_params_))
Ridge: {'alpha': 0.062101694189156162}
Lasso: {'alpha': 0.01268961003167922}
Built-in Hyperparameter Search
For some models within scikit-learn, cross-validation can be performed more efficiently on large
datasets. In this case, a cross-validated version of the particular model is included. The cross-validated
versions of Ridge and Lasso are RidgeCV and LassoCV, respectively. Parameter search on these
estimators can be performed as follows:
>>> from sklearn.linear_model import RidgeCV, LassoCV
>>> for Model in [RidgeCV, LassoCV]:
... model = Model(alphas=alphas, cv=3).fit(X, y)
... print('%s: %s' % (Model.__name__, model.alpha_))
RidgeCV: 0.0621016941892
LassoCV: 0.0126896100317
We see that the results match those returned by GridSearchCV
Nested cross-validation
How do we measure the performance of these estimators? We have used data to set the
hyperparameters, so we need to test on actually new data. We can do this by
running cross_val_score() on our CV objects. Here there are two cross-validation loops going on; this is
called 'nested cross-validation':
for Model in [RidgeCV, LassoCV]:
    scores = cross_val_score(Model(alphas=alphas, cv=3), X, y, cv=3)
    print(Model.__name__, np.mean(scores))

12.7 UNSUPERVISED LEARNING: DIMENSIONALITY REDUCTION AND VISUALIZATION


Unsupervised learning is applied on X without y: data without labels. A typical use case is to find hidden
structure in the data.
12.7.1. Dimensionality Reduction: PCA
Dimensionality reduction derives a set of new artificial features smaller than the original feature set.
Here we’ll use Principal Component Analysis (PCA), a dimensionality reduction that strives to retain
most of the variance of the original data. We’ll use sklearn.decomposition.PCA on the iris dataset:
>>> X = iris.data
>>> y = iris.target
PCA computes linear combinations of the original features using a truncated Singular Value
Decomposition of the matrix X, to project the data onto a base of the top singular vectors.
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2, whiten=True)
>>> pca.fit(X)
PCA(..., n_components=2, ...)
Once fitted, PCA exposes the singular vectors in the components_ attribute:
>>> pca.components_
array([[ 0.36158..., -0.08226..., 0.85657..., 0.35884...],
[ 0.65653..., 0.72971..., -0.17576..., -0.07470...]])
Other attributes are available as well:
>>> pca.explained_variance_ratio_
array([ 0.92461..., 0.05301...])
Let us project the iris dataset along those first two dimensions:
>>> X_pca = pca.transform(X)
>>> X_pca.shape
(150, 2)
PCA normalizes and whitens the data, which means that the data is now centered on both
components with unit variance:
>>> X_pca.mean(axis=0)
array([ ...e-15, ...e-15])
>>> X_pca.std(axis=0, ddof=1)
array([ 1., 1.])
Furthermore, the components no longer carry any linear correlation:
>>> np.corrcoef(X_pca.T)
array([[ 1.00000000e+00, ...],
[ ..., 1.00000000e+00]])
With 2 or 3 retained components, PCA is useful for visualizing the dataset:
>>> target_ids = range(len(iris.target_names))
>>> for i, c, label in zip(target_ids, 'rgbcmykw', iris.target_names):
... plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1],
... c=c, label=label)
<matplotlib.collections.PathCollection ...

Note that this projection was determined without any information about the labels (represented by
the colors): this is the sense in which the learning is unsupervised. Nevertheless, we see that the
projection gives us insight into the distribution of the different flowers in parameter space:
notably, iris setosa is much more distinct than the other two species.
12.7.2. Visualization with a non-linear embedding: tSNE
For visualization, more complex embeddings can be useful (for statistical analysis, they are harder to
control). sklearn.manifold.TSNE is such a powerful manifold learning method. We apply it to
the digits dataset, as the digits are vectors of dimension 8*8 = 64. Embedding them in 2D enables
visualization:
>>> # Take the first 500 data points: it's hard to see all 1797 points
>>> X = digits.data[:500]
>>> y = digits.target[:500]
>>> # Fit and transform with a TSNE
>>> from sklearn.manifold import TSNE
>>> tsne = TSNE(n_components=2, random_state=0)
>>> X_2d = tsne.fit_transform(X)
>>> # Visualize the data
>>> plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
<matplotlib.collections.PathCollection object at ...>

fit_transform: As TSNE cannot be applied to new data, we need to use its fit_transform method.
sklearn.manifold.TSNE separates quite well the different classes of digits even though it had no access
to the class information.
12.8 PARAMETER SELECTION, VALIDATION, AND TESTING
12.8.1. Hyperparameters, Over-fitting, and Under-fitting
The issues associated with validation and cross-validation are some of the most important aspects of
the practice of machine learning. Selecting the optimal model for your data is vital, and is a piece of
the problem that is not often appreciated by machine learning practitioners.
The central question is: If our estimator is underperforming, how should we move forward?
 Use a simpler or a more complicated model?
 Add more features to each observed data point?
 Add more training samples?

The answer is often counter-intuitive. In particular, sometimes using a more complicated model will
give worse results. Also, sometimes adding training data will not improve your results. The ability to
determine what steps will improve your model is what separates the successful machine learning
practitioners from the unsuccessful.
Bias-variance trade-off: illustration on a simple regression problem
Let us start with a simple 1D regression problem. This will help us to easily visualize the data and the
model, and the results generalize easily to higher-dimensional datasets. We’ll explore a simple linear
regression problem, with sklearn.linear_model.
import numpy as np
import matplotlib.pyplot as plt

X = np.c_[.5, 1].T
y = [.5, 1]
X_test = np.c_[0, 2].T
Without noise, a linear regression fits the data perfectly:
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, y)
plt.plot(X, y, 'o')
plt.plot(X_test, regr.predict(X_test))

In real-life situations, we have noise (e.g. measurement noise) in our data:
np.random.seed(0)
for _ in range(6):
    noisy_X = X + np.random.normal(loc=0, scale=.1, size=X.shape)
    plt.plot(noisy_X, y, 'o')
    regr.fit(noisy_X, y)
    plt.plot(X_test, regr.predict(X_test))
As we can see, our linear model captures and amplifies the noise in the data. It displays a lot of
variance.
We can use another linear estimator that uses regularization, the Ridge estimator. This estimator
regularizes the coefficients by shrinking them to zero, under the assumption that very high
correlations are often spurious. The alpha parameter controls the amount of shrinkage used.
regr = linear_model.Ridge(alpha=.1)
np.random.seed(0)
for _ in range(6):
    noisy_X = X + np.random.normal(loc=0, scale=.1, size=X.shape)
    plt.plot(noisy_X, y, 'o')
    regr.fit(noisy_X, y)
    plt.plot(X_test, regr.predict(X_test))
plt.show()

As we can see, the estimator displays much less variance. However it systematically under-estimates
the coefficient. It displays a biased behavior.
This is a typical example of the bias/variance tradeoff: non-regularized estimators are not biased, but they
can display a lot of variance. Highly-regularized models have little variance, but high bias. This bias is
not necessarily a bad thing: what matters is choosing the tradeoff between bias and variance that
leads to the best prediction performance. For a specific dataset there is a sweet spot corresponding
to the highest complexity that the data can support, depending on the amount of noise and of
observations available.
12.8.2. Visualizing the Bias/Variance Tradeoff
Given a particular dataset and a model (e.g. a polynomial), we’d like to understand whether bias
(underfit) or variance limits prediction, and how to tune the hyperparameter (here d, the degree of
the polynomial) to give the best fit.
On a given data, let us fit a simple polynomial regression model with varying degrees:

In the above figure, we see fits for three different values of d. For d = 1, the data is under-fit. This
means that the model is too simplistic: no straight line will ever be a good fit to this data. In this case,
we say that the model suffers from high bias. The model itself is biased, and this will be reflected in
the fact that the data is poorly fit. At the other extreme, for d = 6 the data is over-fit. This means that
the model has too many free parameters (6 in this case) which can be adjusted to perfectly fit the
training data. If we add a new point to this plot, though, chances are it will be very far from the curve
representing the degree-6 fit. In this case, we say that the model suffers from high variance. The
reason for the term “high variance” is that if any of the input points are varied slightly, it could result
in a very different model.
In the middle, for d = 2, we have found a good mid-point. It fits the data fairly well, and does not suffer
from the bias and variance problems seen in the figures on either side. What we would like is a way
to quantitatively identify bias and variance, and optimize the meta-parameters (in this case, the
polynomial degree d) in order to determine the best algorithm.

Polynomial regression with scikit-learn: A polynomial regression is built by


pipelining PolynomialFeatures and a LinearRegression:
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.linear_model import LinearRegression
>>> model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

Validation Curves
Let us create a dataset like in the example above:
>>> def generating_func(x, err=0.5):
... return np.random.normal(10 - 1. / (x + 0.1), err)
>>> # randomly sample more data
>>> np.random.seed(1)
>>> x = np.random.random(size=200)
>>> y = generating_func(x, err=1.)

Central to quantifying the bias and variance of a model is applying it to test data, sampled from the same
distribution as the training data, but which captures independent noise:
>>> xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.4)
Validation curve: A validation curve consists of varying a model parameter that controls the model's
complexity (here the degree of the polynomial) and measuring both the error of the model on training
data and on test data (e.g. with cross-validation). The model parameter is then adjusted so that the
test error is minimized:
We use sklearn.model_selection.validation_curve() to compute train and test error, and plot it:
>>> from sklearn.model_selection import validation_curve
>>> degrees = np.arange(1, 21)
>>> model = make_pipeline(PolynomialFeatures(), LinearRegression())
>>> # Vary the "degrees" on the pipeline step "polynomialfeatures"
>>> train_scores, validation_scores = validation_curve(
... model, x[:, np.newaxis], y,
... param_name='polynomialfeatures__degree',
... param_range=degrees)
>>> # Plot the mean train score and validation score across folds
>>> plt.plot(degrees, validation_scores.mean(axis=1), label='cross-validation')
[<matplotlib.lines.Line2D object at ...>]
>>> plt.plot(degrees, train_scores.mean(axis=1), label='training')
[<matplotlib.lines.Line2D object at ...>]
>>> plt.legend(loc='best')
<matplotlib.legend.Legend object at ...>
This figure shows why validation is important. On the left side of the plot, we have a very low-degree
polynomial, which under-fits the data. This leads to a low explained variance for both the training set
and the validation set. On the far right side of the plot, we have a very high-degree polynomial, which
over-fits the data. This can be seen in the fact that the training explained variance is very high, while
on the validation set, it is low. Choosing d around 4 or 5 gets us the best tradeoff.
The astute reader will realize that something is amiss here: in the above plot, d = 4 gives the best
results. But in the previous plot, we found that d = 6 vastly over-fits the data. What’s going on here?
The difference is the number of training points used. In the previous example, there were only eight
training points. In this example, we have 120 (60% of the 200 points generated above). As a general
rule of thumb, the more training points used, the more complicated a model can be supported. But how
can you determine for a given model whether more training points will be helpful? A useful diagnostic
for this is the learning curve.

Learning Curves:
A learning curve shows the training and validation score as a function of the number of training points.
Note that when we train on a subset of the training data, the training score is computed using this
subset, not the full training set. This curve gives a quantitative view into how beneficial it will be to
add training samples.
scikit-learn provides sklearn.model_selection.learning_curve():
>>> from sklearn.model_selection import learning_curve
>>> train_sizes, train_scores, validation_scores = learning_curve(
... model, x[:, np.newaxis], y, train_sizes=np.logspace(-1, 0, 20))
>>> # Plot the mean train score and validation score across folds
>>> plt.plot(train_sizes, validation_scores.mean(axis=1), label='cross-validation')
[<matplotlib.lines.Line2D object at ...>]
>>> plt.plot(train_sizes, train_scores.mean(axis=1), label='training')
[<matplotlib.lines.Line2D object at ...>]
Note that the validation score generally increases with a growing training set, while the training
score generally decreases with a growing training set. As the training size increases, they will converge
to a single value.
From the above discussion, we know that d = 1 is a high-bias estimator which under-fits the data. This
is indicated by the fact that both the training and validation scores are low. When confronted with
this type of learning curve, we can expect that adding more training data will not help: both lines
converge to a relatively low score.
When the learning curves have converged to a low score, we have a high bias model.
A high-bias model can be improved by:

 Using a more sophisticated model (i.e. in this case, increase d)


 Gather more features for each sample.
 Decrease regularization in a regularized model.

Increasing the number of samples, however, does not improve a high-bias model.
Now let’s look at a high-variance (i.e. over-fit) model:

Here we show the learning curve for d = 15. From the above discussion, we know that d = 15 is a high-
variance estimator which over-fits the data. This is indicated by the fact that the training score is much
higher than the validation score. As we add more samples to this training set, the training score will
continue to decrease, while the cross-validation score will continue to increase, until they meet in the
middle.
Learning curves that have not yet converged with the full training set indicate a high-variance, over-
fit model.
A high-variance model can be improved by:

 Gathering more training samples.


 Using a less-sophisticated model (i.e. in this case, make d smaller)
 Increasing regularization.

In particular, gathering more features for each sample will not help the results.

Check your Progress 1

Fill in the blanks.

1. In _______ the label is continuous.


2. ______ given an unsupervised model, transform new data into the new basis.
3. ____ metrics combines several measures and prints a table with the results.
4. The K-neighbors classifier predicts the label of an unknown point based on the labels of
the _______ in the parameter space.
5. ________ derives a set of new artificial features smaller than the original feature set.

Activity 1

1. sklearn.manifold has many other non-linear embeddings. Try them out on the digits dataset.
Could you judge their quality without knowing the labels y?
2. There are many other types of regressors available in scikit-learn: Use the
GradientBoostingRegressor class to fit the housing data. (Hint:You can copy and paste some
of the code discussed in the unit, replacing LinearRegression with GradientBoostingRegressor)

Summary

Scikit-learn is an open source machine learning library that supports supervised and unsupervised
learning. In this unit we have built machine learning models using scikit-learn, including classification,
regression, dimensionality reduction, and model validation examples.

Keywords

 Supervised learning: It is the machine learning task of learning a function that maps an input
to an output based on example input-output pairs.
 Unsupervised learning: It is a machine learning technique, where you do not need to
supervise the model. Instead, you need to allow the model to work on its own to discover
information.
 Classification: Classification is a technique to categorize our data into a desired and distinct
number of classes where we can assign a label to each class.
 Regression algorithms: Regression algorithms predict the output values based on input
features from the data fed in the system.
 Hyperparameter: In machine learning, a hyperparameter is a parameter whose value is set
before the learning process begins.
 Learning Curve: It shows the validation and training score of an estimator for varying numbers
of training samples. It is a tool to find out how much a machine learning model benefits from
adding more training data and whether the estimator suffers more from a variance error or a
bias error.

Self-Assessment Questions

1. Why did we split the data into training and validation sets?
2. As the number of training samples are increased, what do you expect to see for the training
score? For the validation score? Would you expect the training score to be higher or lower
than the validation score?
3. Explain regression with example.
4. What is learning curve? How the same can be done using scikit-learn?

Answers to Check Your Progress


Check your Progress 1

Fill in the blanks.

1. In regression the label is continuous.


2. model.transform() given an unsupervised model, transform new data into the new basis.
3. classification_report metrics combines several measures and prints a table with the results.
4. The K-neighbors classifier predicts the label of an unknown point based on the labels of
the K nearest points in the parameter space.
5. Dimensionality reduction derives a set of new artificial features smaller than the original
feature set.

Suggested Reading

1. Learning scikit-learn: Machine Learning in Python by Raul Garreta, Guillermo Moncecchi


2. Mastering Machine Learning with scikit-learn by Gavin Hackeling
3. Hands-on Scikit-Learn for Machine Learning Applications: Data Science Fundamentals with
Python by David Paper
4. Applied Supervised Learning with Python by Benjamin Johnston, Ishita Mathur
5. Machine Learning with scikit-learn Quick Start Guide by Kevin Jolly
6. scikit-learn Cookbook by Trent Hauck

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
ShareAlike 4.0 International (CC BY-SA 4.0) as requested by the work’s creator or licensees. This license
is available at https://creativecommons.org/licenses/by-sa/4.0/
Unit 13

Web Scraping in Python – Beautiful Soup

13.1 Introduction
13.2 Beautiful Soup
Summary
Keywords
Self-Assessment Questions
Answers to Check Your Progress
Suggested Reading
Objectives:
After going through this unit, you will be able to:

 Understand the concept of Web Scraping


 Use Beautiful Soup library for Web Scraping

13.1 INTRODUCTION
To source data for data science projects, you’ll often rely on SQL and NoSQL databases, APIs, or ready-
made CSV data sets. The problem is that you can’t always find a data set on your topic, databases are
not kept current and APIs are either expensive or have usage limits. If the data you’re looking for is on
a web page, however, then the solution to all these problems is web scraping. In this unit, we will
collect and parse a web page in order to grab textual data and write the information we have gathered
to a CSV file.

13.2 BEAUTIFUL SOUP

Beautiful Soup is a Python library that allows for quick turnaround on web scraping projects. Currently
available as Beautiful Soup 4 and compatible with both Python 2.7 and Python 3, Beautiful Soup
creates a parse tree from parsed HTML and XML documents (including documents with non-closed
tags, or "tag soup", and other malformed markup). Install the Beautiful Soup module on your system.
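If it is not already installed, it can typically be installed with pip (the package names below are the standard PyPI names):

$ pip install beautifulsoup4 requests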

In this unit, we’ll be working with data from the official website of the National Gallery of Art in the
United States. The National Gallery is an art museum located on the National Mall in Washington, D.C.
It holds over 120,000 pieces dated from the Renaissance to the present day done by more than 13,000
artists. We would like to search the Index of Artists, which, at the time of updating this unit, is available
via the Internet Archive’s Wayback Machine at the following URL:

https://web.archive.org/web/20170131230332/https://www.nga.gov/collection/an.shtm

Beneath the Internet Archive’s header, you’ll see a page that looks like this:

Since we’ll be doing this project in order to learn about web scraping with Beautiful Soup, we don’t
need to pull too much data from the site, so let’s limit the scope of the artist data we are looking to
scrape. Let’s therefore choose one letter — in our example we’ll choose the letter Z — and we’ll see
a page that looks like this:
In the page above, we see that the first artist listed at the time of writing is Zabaglia, Niccola, which is
a good thing to note for when we start pulling data. We’ll start by working with this first page, with
the following URL for the letter Z:

https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm

It is important to note for later how many pages total there are for the letter you are choosing to list,
which you can discover by clicking through to the last page of artists. In this case, there are 4 pages
total, and the last artist listed at the time of writing is Zykmund, Václav. The last page of Z artists has
the following URL:

https://web.archive.org/web/20121010201041/http://www.nga.gov/collection/anZ4.htm

However, you can also access the above page by using the same Internet Archive numeric string of
the first page:

https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ4.htm

This is important to note because we’ll be iterating through these pages later in this unit.

To begin to familiarize yourself with how this web page is set up, you can take a look at its DOM, which
will help you understand how the HTML is structured. In order to inspect the DOM, you can open your
browser’s Developer Tools.

Importing the Libraries

To begin our coding project, let’s activate our Python 3 programming environment. Make sure you’re
in the directory where your environment is located, and run the following command:

$ . my_env/bin/activate
With our programming environment activated, we’ll create a new file, with nano for instance. You can
name your file whatever you would like, we’ll call it nga_z_artists.py in this unit.

$ nano nga_z_artists.py

Within this file, we can begin to import the libraries we’ll be using — Requests and Beautiful Soup.

The Requests library allows you to make use of HTTP within your Python programs in a human
readable way, and the Beautiful Soup module is designed to get web scraping done quickly.

We will import both Requests and Beautiful Soup with the import statement. For Beautiful Soup, we’ll
be importing it from bs4, the package in which Beautiful Soup 4 is found.

nga_z_artists.py

# Import libraries

import requests

from bs4 import BeautifulSoup

With both the Requests and Beautiful Soup modules imported, we can move on to working to first
collect a page and then parse it.

Collecting and Parsing a Web Page

The next step we will need to do is collect the URL of the first web page with Requests. We’ll assign
the URL for the first page to the variable page by using the method requests.get().

nga_z_artists.py

import requests

from bs4 import BeautifulSoup

# Collect first page of artists’ list

page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

We’ll now create a BeautifulSoup object, or a parse tree. This object takes as its arguments the
page.text document from Requests (the content of the server’s response) and then parses it from
Python’s built-in html.parser.

nga_z_artists.py

import requests

from bs4 import BeautifulSoup

page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

# Create a BeautifulSoup object

soup = BeautifulSoup(page.text, 'html.parser')


With our page collected, parsed, and set up as a BeautifulSoup object, we can move on to collecting
the data that we would like.

Pulling Text from a Web Page

For this project, we’ll collect artists’ names and the relevant links available on the website. You may
want to collect different data, such as the artists’ nationality and dates. Whatever data you would like
to collect, you need to find out how it is described by the DOM of the web page.

To do this, in your web browser, right-click — or CTRL + click on macOS — on the first artist’s name,
Zabaglia, Niccola. Within the context menu that pops up, you should see a menu item similar to Inspect
Element (Firefox) or Inspect (Chrome).

Once you click on the relevant Inspect menu item, the tools for web developers should appear within
your browser. We want to look for the class and tags associated with the artists’ names in this list.
We’ll see first that the table of names is within <div> tags where class="BodyText". This is important
to note so that we only search for text within this section of the web page. We also notice that the
name Zabaglia, Niccola is in a link tag, since the name references a web page that describes the artist.
So we will want to reference the <a> tag for links. Each artist’s name is a reference to a link.

To do this, we’ll use Beautiful Soup’s find() and find_all() methods in order to pull the text of the artists’
names from the BodyText <div>.

nga_z_artists.py

import requests

from bs4 import BeautifulSoup

# Collect and parse first page

page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

soup = BeautifulSoup(page.text, 'html.parser')

# Pull all text from the BodyText div

artist_name_list = soup.find(class_='BodyText')

# Pull text from all instances of <a> tag within BodyText div

artist_name_list_items = artist_name_list.find_all('a')

Next, at the bottom of our program file, we will want to create a for loop in order to iterate over all
the artist names that we just put into the artist_name_list_items variable.

We’ll print these names out with the prettify() method in order to turn the Beautiful Soup parse tree
into a nicely formatted Unicode string.

nga_z_artists.py

...

artist_name_list = soup.find(class_='BodyText')

artist_name_list_items = artist_name_list.find_all('a')

# Create for loop to print out all artists' names

for artist_name in artist_name_list_items:
    print(artist_name.prettify())

Let’s run the program as we have it so far:

python nga_z_artists.py

Once we do so, we’ll receive the following output:


Output

<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630">

Zabaglia, Niccola

</a>

...

<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=3427">

Zao Wou-Ki

</a>

<a href="/web/20121007172955/https://www.nga.gov/collection/anZ2.htm">

Zas-Zie

</a>

<a href="/web/20121007172955/https://www.nga.gov/collection/anZ3.htm">

Zie-Zor

</a>

<a href="/web/20121007172955/https://www.nga.gov/collection/anZ4.htm">

<strong>

next

<br/>

page

</strong>

</a>

What we see in the output at this point is the full text and tags related to all of the artists’ names
within the <a> tags found in the <div class="BodyText"> tag on the first page, as well as some
additional link text at the bottom. Since we don’t want this extra information, let’s work on removing
this in the next section.

Removing Superfluous Data

So far, we have been able to collect all the link text data within one <div> section of our web page.
However, we don’t want to have the bottom links that don’t reference artists’ names, so let’s work to
remove that part.
In order to remove the bottom links of the page, let’s again right-click and Inspect the DOM. We’ll see
that the links on the bottom of the <div class="BodyText"> section are contained in an HTML table:
<table class="AlphaNav">:

We can therefore use Beautiful Soup to find the AlphaNav class and use the decompose() method to
remove a tag from the parse tree and then destroy it along with its contents.

We’ll use the variable last_links to reference these bottom links and add them to the program file:

nga_z_artists.py

import requests

from bs4 import BeautifulSoup

page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

soup = BeautifulSoup(page.text, 'html.parser')

# Remove bottom links

last_links = soup.find(class_='AlphaNav')

last_links.decompose()

artist_name_list = soup.find(class_='BodyText')

artist_name_list_items = artist_name_list.find_all('a')

for artist_name in artist_name_list_items:
    print(artist_name.prettify())

Now, when we run the program with the python nga_z_artist.py command, we’ll receive the following
output:

Output

<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=11630">
Zabaglia, Niccola

</a>

<a href="/web/20121007172955/https://www.nga.gov/cgi-bin/tsearch?artistid=34202">

Zaccone, Fabian

</a>

...

<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=11631">

Zanotti, Giampietro

</a>

<a href="/web/20121007172955/http://www.nga.gov/cgi-bin/tsearch?artistid=3427">

Zao Wou-Ki

</a>

At this point, we see that the output no longer includes the links at the bottom of the web page, and
now only displays the links associated with artists’ names.

Until now, we have targeted the links with the artists’ names specifically, but we have the extra tag
data that we don’t really want. Let’s remove that in the next section.

Pulling the Contents from a Tag

In order to access only the actual artists’ names, we’ll want to target the contents of the <a> tags
rather than print out the entire link tag.

We can do this with Beautiful Soup’s .contents, which will return the tag’s children as a Python list
data type.

Let’s revise the for loop so that instead of printing the entire link and its tag, we’ll print the list of
children (i.e. the artists’ full names):

nga_z_artists.py

import requests

from bs4 import BeautifulSoup

page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')

soup = BeautifulSoup(page.text, 'html.parser')

last_links = soup.find(class_='AlphaNav')

last_links.decompose()

artist_name_list = soup.find(class_='BodyText')
artist_name_list_items = artist_name_list.find_all('a')

# Use .contents to pull out the <a> tag’s children

for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    print(names)

Note that for each link tag we take the first item of its .contents list (index 0), which is the artist's name.

We can run the program with the python command to view the following output:

Output

Zabaglia, Niccola

Zaccone, Fabian

Zadkine, Ossip

...

Zanini-Viola, Giuseppe

Zanotti, Giampietro

Zao Wou-Ki

We have received back a list of all the artists’ names available on the first page of the letter Z.

However, what if we want to also capture the URLs associated with those artists? We can extract URLs
found within a page’s <a> tags by using Beautiful Soup’s get('href') method.

From the output of the links above, we know that the entire URL is not being captured, so we will
concatenate the link string with the front of the URL string (in this case https://web.archive.org/).

These lines we’ll also add to the for loop:

nga_z_artists.py

...

for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')
    print(names)
    print(links)

When we run the program above, we’ll receive both the artists’ names and the URLs to the links that
tell us more about the artists:
Output

Zabaglia, Niccola

https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-
bin/tsearch?artistid=11630

Zaccone, Fabian

https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-
bin/tsearch?artistid=34202

...

Zanotti, Giampietro

https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-
bin/tsearch?artistid=11631

Zao Wou-Ki

https://web.archive.org/web/20121007172955/https://www.nga.gov/cgi-
bin/tsearch?artistid=3427

Although we are now getting information from the website, it is currently just printing to our terminal
window. Let’s instead capture this data so that we can use it elsewhere by writing it to a file.

Writing the Data to a CSV File

Collecting data that only lives in a terminal window is not very useful. Comma-separated values (CSV)
files allow us to store tabular data in plain text, and are a common format for spreadsheets and
databases.

First, we need to import Python’s built-in csv module along with the other modules at the top of the
Python programming file:

import csv

Next, we’ll create and open a file called z-artist-names.csv for us to write to (we’ll use the variable f
for file here) by using the 'w' mode. We’ll also write the top row headings: Name and Link which we’ll
pass to the writerow() method as a list:

f = csv.writer(open('z-artist-names.csv', 'w'))

f.writerow(['Name', 'Link'])

Finally, within our for loop, we’ll write each row with the artists’ names and their associated links:

f.writerow([names, links])

You can see the lines for each of these tasks in the file below:

nga_z_artists.py

import requests
import csv

from bs4 import BeautifulSoup

page = requests.get('https://web.archive.org/web/20121007172955/http://www.nga.gov/collection/anZ1.htm')

soup = BeautifulSoup(page.text, 'html.parser')

last_links = soup.find(class_='AlphaNav')

last_links.decompose()

# Create a file to write to, add headers row

f = csv.writer(open('z-artist-names.csv', 'w'))

f.writerow(['Name', 'Link'])

artist_name_list = soup.find(class_='BodyText')

artist_name_list_items = artist_name_list.find_all('a')

for artist_name in artist_name_list_items:
    names = artist_name.contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')

    # Add each artist’s name and associated link to a row
    f.writerow([names, links])

When you run the program now with the python command, no output will be returned to your
terminal window. Instead, a file will be created in the directory you are working in called z-artist-
names.csv.

Depending on what you use to open it, it may look something like this:

z-artist-names.csv

Name,Link

"Zabaglia,
Niccola",https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-
bin/tsearch?artistid=11630

"Zaccone, Fabian",https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-
bin/tsearch?artistid=34202

"Zadkine, Ossip",https://web.archive.org/web/20121007172955/http://www.nga.gov/cgi-
bin/tsearch?artistid=3475w

...

Or, it may look more like a spreadsheet:


In either case, you can now use this file to work with the data in more meaningful ways, since the
information you have collected is now stored in a file on your computer.
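As an aside, and not part of the original tutorial, a slightly more robust way to write the CSV file is to open it with a context manager and newline='', which guarantees the file is closed and avoids blank rows on some platforms. A minimal sketch of that variant (the rows list here is a hypothetical stand-in for the scraped name/link pairs):

import csv

# Hypothetical stand-in for the (name, link) pairs gathered by the scraper
rows = [('Zabaglia, Niccola', 'https://web.archive.org/...'),
        ('Zaccone, Fabian', 'https://web.archive.org/...')]

# newline='' prevents extra blank lines in the CSV on some platforms,
# and the with block closes the file automatically when writing is done
with open('z-artist-names.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['Name', 'Link'])
    for name, link in rows:
        writer.writerow([name, link])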

Retrieving Related Pages

We have created a program that will pull data from the first page of the list of artists whose last names
start with the letter Z. However, there are 4 pages in total of these artists available on the website.

In order to collect all of these pages, we can perform more iterations with for loops. This will revise
most of the code we have written so far, but will employ similar concepts.

To start, we’ll want to initialize a list to hold the pages:

pages = []

We will populate this initialized list with the following for loop:

for i in range(1, 5):
    url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
    pages.append(url)

Earlier in this unit, we noted that we should pay attention to the total number of pages there are that
contain artists’ names starting with the letter Z (or whatever letter we’re using). Since there are 4
pages for the letter Z, we constructed the for loop above with a range of 1 to 5 so that it will iterate
through each of the 4 pages.

For this specific web site, the URLs begin with the string
https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ and are then
followed by a number for the page (which will be the integer i from the for loop that we convert to
a string) and end with .htm. We will concatenate these strings together and then append the result to
the pages list.

In addition to this loop, we’ll have a second loop that will go through each of the pages above. The
code in this for loop will look similar to the code we have created so far, as it is doing the task we
completed for the first page of the letter Z artists for each of the 4 pages total. Note that because we
have put the original program into the second for loop, we now have the original loop as a nested for
loop contained in it.

The two for loops will look like this:

pages = []

for i in range(1, 5):
    url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')

    last_links = soup.find(class_='AlphaNav')
    last_links.decompose()

    artist_name_list = soup.find(class_='BodyText')
    artist_name_list_items = artist_name_list.find_all('a')

    for artist_name in artist_name_list_items:
        names = artist_name.contents[0]
        links = 'https://web.archive.org' + artist_name.get('href')
        f.writerow([names, links])

In the code above, the first for loop builds the list of page URLs, and the second for loop scrapes each
of those pages, writing the artists’ names and links row by row for every page.

These two for loops come below the import statements, the CSV file creation and writer (with the line
for writing the headers of the file), and the initialization of the pages variable (assigned to a list).

Within the greater context of the programming file, the complete code looks like this:

nga_z_artists.py

import requests

import csv

from bs4 import BeautifulSoup

f = csv.writer(open('z-artist-names.csv', 'w'))

f.writerow(['Name', 'Link'])
pages = []

for i in range(1, 5):
    url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ' + str(i) + '.htm'
    pages.append(url)

for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.text, 'html.parser')

    last_links = soup.find(class_='AlphaNav')
    last_links.decompose()

    artist_name_list = soup.find(class_='BodyText')
    artist_name_list_items = artist_name_list.find_all('a')

    for artist_name in artist_name_list_items:
        names = artist_name.contents[0]
        links = 'https://web.archive.org' + artist_name.get('href')
        f.writerow([names, links])

Since this program is doing a bit of work, it will take a little while to create the CSV file. Once it is done,
the output will be complete, showing the artists’ names and their associated links from Zabaglia,
Niccola to Zykmund, Václav.

Being Considerate

When scraping web pages, it is important to remain considerate of the servers you are grabbing
information from.

Check to see if a site has terms of service or terms of use that pertain to web scraping. Also, check to
see if a site has an API that allows you to grab data before scraping it yourself.

Be sure to not continuously hit servers to gather data. Once you have collected what you need from a
site, run scripts that will go over the data locally rather than burden someone else’s servers.
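One simple way to avoid hammering a server is to pause between requests. A minimal sketch (the two-second delay is an arbitrary value chosen for illustration, and pages is the list of page URLs built in the program above):

import time
import requests

for url in pages:
    page = requests.get(url)
    # ... parse and store the data from this page here ...
    time.sleep(2)  # wait two seconds before requesting the next page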

Additionally, it is a good idea to scrape with a header that has your name and email so that a website
can identify you and follow up if they have any questions. An example of a header you can use with
the Python Requests library is as follows:

import requests

headers = {
    'User-Agent': 'Your Name, example.com',
    'From': 'email@example.com'
}

url = 'https://example.com'

page = requests.get(url, headers=headers)

Using headers with identifiable information ensures that the people who go over a server’s logs can
reach out to you.

Check your Progress 1

State True or False

1. When scraping web pages, it is important to remain considerate of the servers you are
grabbing information from.
2. The decompose() method is used to remove a tag from the parse tree and then destroy it
along with its contents.

Activity 1

Visit the following URL and understand the example given:

https://www.dataquest.io/blog/web-scraping-beautifulsoup/

Summary

In this unit, we discussed the use of Python and Beautiful Soup to scrape data from a website. We
stored the text that we gathered within a CSV file. You can continue working on this project by
collecting more data and making your CSV file more robust. For example, you may want to include the
nationalities and years of each artist. You can also use what you have learned to scrape data from
other websites.

Keywords

 Web Scraping: Also termed Screen Scraping, Web Data Extraction, Web Harvesting etc., is a
technique employed to extract large amounts of data from websites
 HTML: Hypertext Markup Language is the standard markup language for documents designed
to be displayed in a web browser.

Self-Assessment Questions

1. Explain the concept of Web Scraping.


2. State the important features of Beautiful Soup and its functions.

Answers to Check Your Progress

Check your Progress 1

State True or False

1. True
2. True
Suggested Reading

1. Getting Started with Beautiful Soup by Vineeth G. Nair


2. Website Scraping with Python: Using BeautifulSoup and Scrapy by Gábor László Hajba

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
ShareAlike 4.0 International (CC BY-SA 4.0) as requested by the work’s creator or licensees. This license
is available at https://creativecommons.org/licenses/by-sa/4.0/
Unit 14

Introduction to (Py)Spark

14.1 Introduction
14.2 Introduction to Spark / PySpark
14.3 MLlib Machine Learning Library
Summary

Keywords

Self-Assessment Questions

Answers to Check Your Progress

Suggested Reading
Objectives:

After going through this unit, you will be able to:

• Discuss Spark and its related aspects


• Understand Lambda function in python
• Implement python code using map, filter and reduce

14.1 INTRODUCTION
Apache Spark is a general engine for big data analysis, processing and computation. It provides
several advantages over MapReduce: it is faster, easier to use, and runs virtually everywhere. Its
built-in tools for SQL, machine learning and streaming make it one of the most in-demand tools in
the IT industry. Spark is written in the Scala programming language. Apache Spark has APIs for
Python, Scala, Java and R, though the most widely used languages with Spark are the first two.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms,
interactive queries and streaming. Apart from supporting all of these workloads in a single system, it
reduces the management burden of maintaining separate tools. In this unit, we will learn how to use
the Python API with Spark.
14.2 INTRODUCTION TO SPARK / PYSPARK
PySpark is the Python API for Spark, released by the Apache Spark community to support Python with Spark.
Using PySpark, one can easily integrate and work with RDDs in the Python programming language as well.
There are numerous features that make PySpark an effective framework when it comes to working
with huge datasets. Whether it is to perform computations on large datasets or simply to analyse
them, data engineers are turning to this tool. Some of these features are listed below.
Key features of PySpark

 Real time computations: Because of the in-memory processing in PySpark framework, it shows
low latency
 Polyglot: PySpark framework is compatible with various languages like Scala, Java, Python and
R, which makes it one of the most preferable frameworks for processing huge datasets
 Caching and disk persistence: PySpark framework provides powerful caching and very good
disk persistence
 Fast processing: PySpark framework is way faster than other traditional frameworks for big
data processing
 Works well with RDD: Python programming language is dynamically typed which helps when
working with RDD.
Spark is a general purpose cluster computing framework:

 It provides efficient in-memory computations for large data sets


 It distributes computation and data across multiple computers.

Data Distribution

To distribute data, Spark uses a data structure called Resilient Distributed Datasets (RDDs).
For instance, if you read a file with Spark, it will automatically create an RDD. An RDD is immutable, so
to modify it a new RDD needs to be created. That is very helpful for reproducibility!
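As a small illustration (assuming a SparkContext named sc, created as shown later in this unit), a transformation such as map returns a new RDD and leaves the original untouched:

rdd = sc.parallelize([1, 2, 3])
doubled = rdd.map(lambda x: x * 2)  # a new RDD; rdd itself is unchanged

print(rdd.collect())      # [1, 2, 3]
print(doubled.collect())  # [2, 4, 6]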
Computation Distribution
To distribute computation, the Spark API (Application Programming Interface), available in multiple
programming languages (Scala, Java, Python and R), provides several operators such as map and
reduce.
The figure below shows the Spark Software Layers.

Spark Core contains the basic functionality of Spark; in particular the APIs that define RDDs and the
operations and actions that can be undertaken upon them (map, filter, reduce, etc.).
The rest of Spark’s libraries are built on top of the RDD and Spark Core:

 Spark SQL for SQL and structured data processing. Every database table is represented as an
RDD and Spark SQL queries are transformed into Spark operations.
 MLlib is a library of common machine learning algorithms implemented as Spark operations
on RDDs. This library contains scalable learning algorithms such as classification, regression, etc.
that require iterative operations across large data sets. MLlib is gradually being superseded by ML,
the newer Spark machine learning toolset. ML provides a higher-level API built on top
of DataFrames for constructing ML pipelines. Currently, not all machine learning algorithms
implemented in MLlib are available in Spark ML. The pipeline concept is mostly inspired
by the scikit-learn project.
 GraphX is a collection of algorithms and tools for manipulating graphs and performing parallel
graph operations and computations.
 Spark Streaming for scalable, high-throughput, fault-tolerant stream processing of real-time
data.

Spark Initialization: Spark Context

Spark applications are run as independent sets of processes, coordinated by a Spark Context in a driver
program.

It may be created automatically, for instance if you call pyspark from the shell (the Spark context is
then called sc).
But we haven’t set it up automatically in the Galaxy eduPortal, so you need to define it:
from pyspark import SparkContext
sc = SparkContext('local', 'pyspark tutorial')

 the master (first argument) can be local, local[*], spark://..., yarn, etc. What is available for you
depends on how Spark has been deployed on the machine you use.
 the second argument is the application name and is a human-readable string you choose.

Because we do not specify any number of tasks for local, we will be using only one. To use a
maximum of 2 tasks in parallel:
from pyspark import SparkContext
sc = SparkContext('local[2]', 'pyspark tutorial')
If you wish to use all the available resource, you can simply use ‘*’ i.e.
from pyspark import SparkContext
sc = SparkContext('local[*]', 'pyspark tutorial')
Please note that within one session, you cannot define several Spark contexts! So if you have tried the
3 previous SparkContext examples, don’t be surprised to get an error!
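If you do need to change the configuration within the same session, one option (a minimal sketch, assuming an existing context named sc) is to stop the current context before creating a new one:

# Stop the existing SparkContext so a new one can be created in this session
sc.stop()

from pyspark import SparkContext
sc = SparkContext('local[*]', 'pyspark tutorial')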
Deployment of Spark:
It can be deployed on:

 a single machine such as your laptop (local)


 a set of pre-defined machines (stand-alone)
 a dedicated Hadoop-aware scheduler (YARN/Mesos)
 “cloud”, e.g. Amazon EC2

The development workflow is that you start small (local) and scale up to one of the other solutions,
depending on your needs and resources. At UIO, we have the Abel cluster where Spark is available.
Often, you don’t need to change any code to go between these methods of deployment!
map/reduce
Let’s start from our previous map example where the goal was to convert temperatures from Celsius
to Kelvin. Here is how it translates to PySpark.
temp_c = [10, 3, -5, 25, 1, 9, 29, -10, 5]
rdd_temp_c = sc.parallelize(temp_c)
rdd_temp_K = rdd_temp_c.map(lambda x: x + 273.15).collect()
print(rdd_temp_K)
You will recognize the map function (please note it is not the pure Python map function but the
PySpark map function). It acts here as the transformation, while collect is the action:
it pulls all the elements of the RDD to the driver.
Remark: It is often a very bad idea to pull all the elements of an RDD to the driver, because we
potentially handle very large amounts of data. Instead, we prefer to use take, as it lets you specify how
many elements you wish to pull from the RDD.
For instance to pull the first 3 elements only:
temp_c = [10, 3, -5, 25, 1, 9, 29, -10, 5]
rdd_temp_c = sc.parallelize(temp_c)
rdd_temp_K = rdd_temp_c.map(lambda x: x + 273.15).take(3)
print(rdd_temp_K)
Now let’s take another example where we use map as the transformation and reduce for the action.
# we define a list of integers
numbers = [1, 4, 6, 2, 9, 10]
rdd_numbers=sc.parallelize(numbers)
# Use reduce to combine numbers
rdd_reduce = rdd_numbers.reduce(lambda x,y: "(" + str(x) + ", " + str(y) + ")")
print(rdd_reduce)
Create an RDD from a file

Most of the time, we need to process data we have stored as “standard” files. Here we learn how to
create an RDD from a file. Import a file from the Data Library into a new history (call it for instance
“Gutenberg”):

 Tab “Shared Data” –> “Data Libraries” –> “Project Gutenberg”


 Select a file (or the three of them if you wish)
 Click to “To History” and import to a new History that you call “Gutenberg”.
 Go to your newly created History (click on the green box that appears on your screen)
 Open a new jupyter notebook from your file and change the kernel from python 2 to python
3.

from pyspark import SparkContext

sc = SparkContext('local[2]', 'pyspark tutorial')

lines_rdd = sc.textFile(get(1))

The method textFile loads the file passed as an argument and returns an RDD. Please note that you
may add a second argument to specify the minimum number of partitions for your RDD. If it is not
specified, you let Spark decide.
In the following example, we load a text file as an RDD and count how many times each word
appears.
from pyspark import SparkContext
import string

sc = SparkContext('local[2]', 'pyspark tutorial')

def noPunctuations(text):
    """Removes punctuation and converts to lower case
    Args:
        text (str): A string.
    Returns:
        str: The cleaned up string.
    """
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

lines_rdd = sc.textFile(get(1), 1)

counts = lines_rdd.map(noPunctuations).flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(lambda x, y: x + y)

for (word, count) in counts.collect():
    print(word, count)
map vs. flatMap and reduce vs. reduceByKey:
In the previous example, we used flatMap as a transformation and reduceByKey to combine the
counts per word (collect is the action that triggers the computation).
map: returns a new RDD by applying a function to each element of the RDD. The function in map can
return only one item. flatMap: returns a new RDD by applying a function to each element of the
RDD, but the output is flattened. The function in flatMap can return a list of elements (0 or more):
sc.parallelize([3,4,5]).map(lambda x: list(range(1,x))).collect()
[[1, 2], [1, 2, 3], [1, 2, 3, 4]]
sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()
[1, 2, 1, 2, 3, 1, 2, 3, 4]
reduceByKey is used very often, as it combines the values that share the same key. In our example, we wanted
to count the number of occurrences of each word. A simple reduce would not differentiate between the
different words and would count the total number of words.
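As a small illustration that is not part of the original text, reduceByKey combines the values for each key separately, whereas reduce collapses the whole RDD into a single value:

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# Sum the values per key
print(pairs.reduceByKey(lambda x, y: x + y).collect())
# [('a', 2), ('b', 1)]  (the order of the keys may vary)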
Spark SQL
Spark SQL is a component on top of Spark Core that facilitates processing of structured and semi-
structured data and the integration of several data formats as source (Hive, Parquet, JSON).
It allows you to transform RDDs using SQL (Structured Query Language).
To start Spark SQL within your notebook, you need to create a SQL context.
For this exercise, import a JSON file in a new history “World Cup”. You can find the historical World
cup player dataset in JSON format in our Data Library named “Historical world cup player data “.
Then create a new python 3 (change kernel if set by default to python 2) jupyter notebook from this
file:

from pyspark import SparkContext

from pyspark.sql import SQLContext

sc = SparkContext('local', 'Spark SQL')

sqlc = SQLContext(sc)
We can read the JSON file we have in our history and create a DataFrame (Spark SQL has a json reader
available):

players = sqlc.read.json(get(1))

# Print the schema in a tree format

players.printSchema()

# Select only the "FullName" column

players.select("FullName").show(20)
Then we can create a view of our DataFrame. The lifetime of this temporary table is tied to the
SparkSession that was used to create this DataFrame.
players.registerTempTable("players")
We can then query our view; for instance to get the names of all the Teams:
sqlc.sql("select distinct Team from players").show(10)

And have the full SQL possibilities to create SQL query:

# Select the teams names from 2014 only

team2014 = sqlc.sql("select distinct Team from players where Year == 2014")

# The results of SQL queries are Dataframe objects.

# rdd method returns the content as an :class:`pyspark.RDD` of :class:`Row`.


teamNames = team2014.rdd.map(lambda p: "Name: " + p.Team).collect()

for name in teamNames:
    print(name)
Pandas
When working in PySpark, there is also an easy option to convert a Spark DataFrame to a pandas DataFrame,
and pandas DataFrames can likewise be converted to Spark DataFrames.
players.toPandas().head()
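Conversely, an existing pandas DataFrame can be turned into a Spark DataFrame through the SQL context. A minimal sketch (the values below are made up for illustration):

import pandas as pd

# A small pandas DataFrame with hand-made values
pdf = pd.DataFrame({"Team": ["Brazil", "Italy"], "Year": [2014, 2006]})

# Convert it into a Spark DataFrame using the SQLContext created earlier
sdf = sqlc.createDataFrame(pdf)
sdf.show()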
14.3 MLLIB MACHINE LEARNING LIBRARY

Spark MLlib is a distributed machine-learning framework on top of Spark Core. Thanks in large part to
the distributed, memory-based Spark architecture, it is as much as nine times as fast as the disk-based
implementation used by Apache Mahout (according to benchmarks done by the MLlib developers
against the alternating least squares (ALS) implementations, and before Mahout itself gained a Spark
interface), and it scales better than Vowpal Wabbit. Many common machine learning and statistical
algorithms have been implemented and are shipped with MLlib, which simplifies large-scale machine
learning pipelines. These include:

 summary statistics, correlations, stratified sampling, hypothesis testing, random data generation
 classification and regression: support vector machines, logistic regression, linear regression,
naive Bayes, decision trees, random forests, gradient-boosted trees
 collaborative filtering techniques including alternating least squares (ALS)
 cluster analysis methods including k-means and latent Dirichlet allocation (LDA)
 dimensionality reduction techniques such as singular value decomposition (SVD) and principal
component analysis (PCA)
 feature extraction and transformation functions
 optimization algorithms such as stochastic gradient descent and limited-memory BFGS (L-BFGS)
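To give a concrete taste of the RDD-based MLlib API (this example is not part of the original text), here is a minimal k-means run on a handful of hand-made two-dimensional points, assuming a SparkContext named sc:

from pyspark.mllib.clustering import KMeans

# A few hand-made 2-D points; in practice these would be loaded from a file
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

# Train a k-means model with two clusters
model = KMeans.train(points, k=2, maxIterations=10)

print(model.clusterCenters)        # the two cluster centres found
print(model.predict([0.5, 0.5]))   # cluster index assigned to a new point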

Lambda function in python

Python supports the creation of anonymous functions (i.e. functions defined without a name), using
a construct called “lambda”.
The general structure of a lambda function is:
lambda <args>: <expr>
Let’s take a Python function that squares a scalar value:

def f(x):
    return x**2
For instance to use this function:

print(f(2))
4
The same function can be written as lambda function:
g = lambda x: x**2
And you call it:

print(g(2))

4
As you can see, both functions do exactly the same thing and can be used in the same way.

 Note that the lambda definition does not include a “return” statement – it always contains a
single expression which is returned.
 Also note that you can put a lambda definition anywhere a function is expected, and you don’t
have to assign it to a variable at all.
 Lambda functions come from functional programming languages and the Lambda Calculus.
Since they are so small they may be written on a single line.
 This is not exactly the same as lambda in functional programming languages, but it is a very
powerful concept that’s well integrated into Python.

Conditional expression in Lambda functions


You can use a conditional expression in a lambda function and/or have more than one input argument.
For example:

f = lambda x,y: ["PASS",x,y] if x>3 and y<100 else ["FAIL",x,y]

print(f(4,200))

['FAIL', 4, 200]
To summarize: Lambda functions = Anonymous functions
Map, filter and reduce in python
Map
Map takes a function f and an array as input parameters and outputs an array where f is applied to
every element. In this respect, using map is equivalent to for loops.
For instance, to convert a list of temperatures in Celsius to a list of temperature in Kelvin:

temp_c = [10, 3, -5, 25, 1, 9, 29, -10, 5]

temp_K = list(map(lambda x: x + 273.15, temp_c))

print(temp_K)

map() is a function with two arguments:

r = map(func, seq)
The first argument func is the name of a function and the second a sequence (e.g. a list) seq. map()
applies the function func to all the elements of the sequence seq. In Python 3 it returns a map object
(an iterator); wrapping it in list() gives a new list with the elements changed by func.
Filter
As the name suggests, filter can be used to filter your data. It tests each element of your input data
and returns a subset of it for which a condition given by a function is TRUE. It does not modify your
input data.

numbers = range(-15, 15)

less_than_zero = list(filter(lambda x: x < 0, numbers))

print(less_than_zero)
[-15, -14, -13, -12, -11, -10, -9, -8, -7, -6, -5, -4, -3, -2, -1]
Reduce
Reduce takes a function f and an array as input. The function f gets two input parameters that work
on individual elements of the array. Reduce combines every two elements of the array using the
function f. Let’s take an example:

# we define a list of integers

numbers = [1, 4, 6, 2, 9, 10]

# Define a new function combine

# Convert x and y to strings and create a tuple from x,y

def combine(x,y):
    return "(" + str(x) + ", " + str(y) + ")"

# Use reduce to apply combine to numbers

from functools import reduce

print(numbers)

reduce(combine,numbers)

 To use the python reduce function, you need to import it from functools.
 To define combine, we haven’t used a lambda function. With a Lambda function, we would
have:

# we define a list of integers

numbers = [1, 4, 6, 2, 9, 10]

# Use reduce to combine numbers


from functools import reduce

print(numbers)

reduce(lambda x,y: "(" + str(x) + ", " + str(y) + ")",numbers)

Check your Progress 1

Fill in the blanks.

1. _____ for scalable, high-throughput, fault-tolerant stream processing of real-time data.


2. Spark applications are run as independent sets of processes, coordinated by a _____ in a
driver program.
3. ____ returns a new RDD by applying a function to each element of the RDD, but output is
flattened.
4. Spark MLlib is a ______ framework on top of Spark Core.

Activity 1

Try to implement map, reduce and filter, as discussed in the unit.

Summary

Apache Spark is a general engine for big data analysis, processing and computation. It provides
several advantages over MapReduce: it is faster, easier to use, and runs virtually everywhere. In this
unit we discussed Spark and how to use the Python API with it. We also covered map, reduce and
filter.

Keywords

• Resilient Distributed Datasets: It is a fundamental data structure of Spark. It is an


immutable distributed collection of objects.
• Distributed machine-learning: It refers to multi-node machine learning algorithms and
systems that are designed to improve performance, increase accuracy, and scale to larger
input data sizes.

Self-Assessment Questions

1. Explain the features of PySpark.


2. Write a short note on map, reduce and filter.
3. What is Spark MLlib?

Answers to Check Your Progress


Check your Progress 1
Fill in the blanks.

1. Spark Streaming for scalable, high-throughput, fault-tolerant stream processing of real-time


data.
2. Spark applications are run as independent sets of processes, coordinated by a Spark
Context in a driver program.
3. flatMap returns a new RDD by applying a function to each element of the RDD, but output is
flattened.
4. Spark MLlib is a distributed machine-learning framework on top of Spark Core.

Suggested Reading

1. Learning PySpark by Tomasz Drabas, Denny Lee


2. Spark for Python Developers by Amit Nandi
3. Data Analytics with Spark Using Python by Jeffrey Aven
4. Machine Learning with Spark and Python by Michael Bowles
5. Learning Spark: Lightning-Fast Big Data Analysis by Holden Karau, Andy Konwinski, Patrick
Wendell, Matei Zaharia
6. https://annefou.github.io/pyspark/03-pyspark_context/

The text is adapted by Symbiosis Centre for Distance Learning under a Creative Commons Attribution-
ShareAlike 4.0 International (CC BY-SA 4.0) as requested by the work’s creator or licensees. This license
is available at https://creativecommons.org/licenses/by-sa/4.0/
