0% found this document useful (0 votes)

38 views47 pages

Example 1

The document discusses a project analyzing a dataset from UNICEF to understand donor behavior. It describes the dataset, data pre-processing steps, different analysis methods considered, key findings which identified clusters of donors with different pledge to cash ratios. It provides recommendations to UNICEF on measures to improve fundraising based on the insights.

Uploaded by

Giorgio Aduso

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

38 views47 pages

Example 1

Uploaded by

Giorgio Aduso

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 47

Group 4

TDT4259 - UNICEF Dataset

Olav Håberg Dimmen

Lars Martin Hodne
Eirik Leikvoll
Andine Luick
Eric Young

November 20, 2020

Table of Contents

Table of Contents i

List of Acronyms iii

List of Tables iii

List of Figures iv

1 Introduction 1
1.1 UNICEF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Project task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 The team . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.1 Roles and responsibilities . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background 4
2.1 Project objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 How we made our choice . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Data-strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Purpose and approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 What is design thinking . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Deviations from design thinking . . . . . . . . . . . . . . . . . . . . . . 7
2.2.4 Empathize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.5 Define . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.6 Ideate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.7 Prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.8 Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.9 Data integrity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Method 13
3.1 The dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.1 The Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.2 Data quality and integrity . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3 Choosing the method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.5 Data pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

i
4 Analysis 22
4.1 Data subsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1.1 Choosing k . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.2 Cluster characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1.3 Data set linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.1 Cash-to-pledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.2.2 Nothing-to-pledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3 Briefly on gender . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5 Interpretation and recommendations 35

5.1 Recommended measures & implementation plan . . . . . . . . . . . . . . . . . 35
5.1.1 Implementation schedule . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Answering the hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3 Limitations and future analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.3.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3.3 Future analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

References 42

ii
List of Acronyms

BAM Business Analytics Model.

CRISP-DM Cross-Industry Standard Process for Data Mining.

NGO Non-Governmental Organization.

SSB Statistisk Sentralbyrå.

UN United Nations.

UNICEF United Nations Children’s Fund.

List of Tables

1.1 A brief description of the group’s members . . . . . . . . . . . . . . . . . . . . 3

2.1 Roles relating to our data-strategy . . . . . . . . . . . . . . . . . . . . . . . . 12

3.1 Columns with over 50% empty fields . . . . . . . . . . . . . . . . . . . . . . . 16

3.2 The feature vectors used to cluster, and our associated hypotheses . . . . . . 21

4.1 Average daily use of Internet and TV in minutes for various age groups. . . . 29

5.1 The measures from our analysis, with required actions, and underlying analyt-
ical insights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2 Measures, ordered by priority, using a risk-averse, risk-neutral, and risk-seeking
strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

iii
List of Figures

2.1 Our business analytics leverage matrix of UNICEF’s hypotheses . . . . . . . . 6

2.2 The Business Model Canvas for UNICEF Norway . . . . . . . . . . . . . . . . 8
2.3 Rich Picture Diagram of UNICEF’s brand framework . . . . . . . . . . . . . . 9

3.1 The different data tables and their relations . . . . . . . . . . . . . . . . . . . 15

3.2 The formula of the Silhouette Coefficient . . . . . . . . . . . . . . . . . . . . . 18
3.3 The Davies-Bouldin score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.4 The Calinski-Harabasz index’s formula . . . . . . . . . . . . . . . . . . . . . . 18

4.1 Visualization of giver segment sizes . . . . . . . . . . . . . . . . . . . . . . . . 23

4.2 Plot of the clustering scores for different values of k. . . . . . . . . . . . . . . 24
4.3 Visualization of the defining features of the clusters along the three feature axis. 25
4.4 Visualization of the relative size of clusters within the subsets. . . . . . . . . . 26
4.5 The market channel distribution for the cash-to-pledge and nothing-to-pledge
subsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.6 Violin plot of the cash-to-pledge subset . . . . . . . . . . . . . . . . . . . . . . 28
4.7 The marketing channels’ distribution in the clusters returned within the cash-
to-pledge category . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.8 Violin plot of the nothing-to-pledge subset . . . . . . . . . . . . . . . . . . . . 32
4.9 The marketing channels’ distribution in the clusters returned in the nothing-
to-pledge category . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.1 The total income for each channel’s pledges . . . . . . . . . . . . . . . . . . . 37

5.2 The proposed measures plotted with their business value and implementation
feasibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

iv
1 | Introduction

This report details our project in the NTNU course TDT4259. During the project we were
asked to analyse UNICEF’s data on individual donors to offer insights and measures meant
to increase the funds which support their charitable work. This report covers every stage of
the data mining process, including the development of our data strategy, the methods used,
the results, and our analysis and interpretation of the results.

1.1 UNICEF

United Nations Children’s Fund (UNICEF) is the world’s largest children’s fund. UNICEF
was established to provide relief for struggling children and mothers in the aftermath of
World War II. They are currently present in 192 countries around the world. UNICEF’s main
objective is to provide humanitarian and developmental aid to impoverished children world-
wide, with a particular emphasis on underdeveloped countries. In 2019, UNICEF provided safe
facilities for nearly 28 million births, treated 4.1 million children for severe acute malnutrition,
intervened to prevent 5.1 million cases of child marriage, gave access to clean drinking water
for 18.3 million people, and much more [1]. UNICEF’s primary mission is to work towards a
world where no children die from preventable causes, and every child has a safe and healthy
childhood.

UNICEF is a United Nations (UN) agency headquartered in New York City with seven regional
offices across Europe, Asia, Africa and Central America. Additionally, there are UNICEF Na-
tional Committees in 36 developed countries which operate as independent Non-Governmental
Organizations (NGOs) at arms length from the UN controlled agency. The Norwegian Na-
tional Committee (UNICEF Norway) represents UNICEF in Norway, raising awareness of
children’s rights issues, promoting children’s rights domestically, and raising funds to support
UNICEF’s international projects.

1
1.2 Motivation

UNICEF relies entirely on private and governmental donations to fund their operations. While
approximately two thirds of UNICEF’s funds originate from governmental contributions, pri-
vate donations from individuals, businesses and other institutions remain a significant source
of UNICEF’s funding [1]. Consequently, UNICEF is engaged in many activities meant to
increase private contributions, including social-media campaigns, celebrity ambassadorships,
as well as supporting volunteer fundraisers. With the emergence of business intelligence,
UNICEF has been able to use a data driven approach to inform their strategic decision mak-
ing, and they would like to extend the benefit they get from their data.

This project will focus on UNICEF Norway’s individual givers and their associated data.
Individual givers primarily donate money either through one time donations, or recurring
pledge donations. In their effort to enhance the lives of children worldwide, UNICEF Norway
seeks to use their data to increase funding from both existing givers and non-givers. UNICEF
Norway’s Fundraising Department for Individual givers has identified a few key hypotheses
(their wording) surrounding individual donors:

• How can we avoid losing regular donors and pledge donors?

• How can we make one-time donors regular donors?
• Which marketing activities should be directed at various donor segments?
• How can we get higher donations? What characterizes and motivates high value givers?
• Which cold lists are most relevant for direct marketing activities?

1.3 Project task

UNICEF Norway has enlisted the help of the authors of this paper to analyze their informa-
tion on individual givers, dating back to 2017. The analysis should produce a sophisticated
prediction model meant to answer one or more of the above questions. A successful model
should highlight measures that UNICEF can either monetize or use to cut costs. UNICEF
wants the group to document significant insights and validate them through relevant research,
and information from Statistisk Sentralbyrå (SSB), if possible. UNICEF emphasized that a
narrow prescriptive analysis is of greater interest than a broader, but less complex analysis—
the group is not expected to answer more than one or two of the above questions. UNICEF
and the group agree that the use of machine learning methods will be important for the
project’s success, though the choice of method is left to the group.

2
1.4 The team

The team consists of 5 students studying at NTNU in Trondheim. The group’s members have
both shared and individual expertise. Table 1.1 highlights some of the relevant experience
and motivating factors of each member.
Member About
Eirik Eirik has a Bachelor’s degree in Informatics, and is pursuing a masters degree in Artifi-
cial Intelligence. His main academic interests are deep learning and visual computing.
Lars Martin Lars Martin has a background in Computer Science and Artificial intelligence. He is
looking to expand his knowledge of machine learning and its applications.
Eric Eric studies Industrial Ecology where he is working on mapping the battery minerals
supply chain from a systems perspective. Eric has a bachelors of Philosophy, and sees
the need for quantitative evidence to support reasoned conclusions.
Andine Andine has a Bachelor’s degree in Computer Engineering. She is now interested in
learning more about data analysis.
Olav Olav has a Bachelor’s degree in Informatics, and is currently pursuing software archi-
tecture and information systems with a keen interest in security. He is also looking
into how enterprises can improve their routines.

Table 1.1: A brief description of the group’s members

1.4.1 Roles and responsibilities

Data analytics was a mostly novel topic for the group’s members. Thus, the group decided
to withhold any wide-reaching role-assignments during the project’s beginning. Instead, the
group agreed to explore the step-by-step data analytics process together, such that everyone
could get a lasting impression of each of the process’ various stages. For each stage, a joint
task board was updated by the group to organize the upcoming work. Members were then
free to choose the tasks which they felt matched their skill set and piqued their interest the
most. As the project went on, the group found it necessary to assign roles as part of the data
strategy (see 2.2.9 Roles and responsibilities). However, we kept our commitment to letting
everyone experience each step of the data analytics process in detail, so members were usually
not bound by their roles unless the current step of the process asked for it.

3
2 | Background

2.1 Project objectives

As our primary objective, the group decided to test one of UNICEF’s suggested hypotheses:

Project hypothesis: What marketing activities should be directed at which segments of

donors.

This hypothesis was to be answered by completing a list of project and outcome goals, which
were made by the group based on our findings in the empathy and define stage of a modified
design thinking process (see 2.2.1 Purpose and approach for details). The purpose of creating
these goals was to decompose the hypothesis into clearly defined deliverables with specified
requirements and expectations.

Project goals

These goals relate to the deliverables which should be complete at the project’s conclusion:

• The project must deliver a model that offers actionable insights based on the available
data from UNICEF and other trusted sources
• The group must present and argue for key measures based on the model’s findings
• The model must be expandable with new data
• The model should be a machine learning based model. The group must document and
support their choice of model.
• The group must demonstrate that the model meets quantified performance targets based
on an agreed upon test methodology
• Significant findings must be visualized through explanatory figures

Outcome goals

These goals relate to the effect we want to see from our project deliverables once they have
been put to use:

4
• The model must offer new insights which UNICEF can use to more effectively reach
donors and non-donors through marketing activities.
• The project’s visualizations should be intuitive and reduce the time spent on looking
through the model’s output.
• The method / model is used in additional donor analysis by UNICEF

2.1.1 How we made our choice

Our choice of hypotheses was based primarily on our own background, the given dataset,
and how we believed those two factors would combine to be able to solve the various hy-
potheses. When considering these factors, the aforementioned hypothesis came up on top for
a few reasons: first and foremost the data seemed very relevant to this hypothesis. As we
describe further in 3.1 The dataset, the data contains fields specifically about the marketing
channel which accompanies individual givers’ gifts. Campaigns are also referenced in dona-
tions, which we believed would give insights into which campaigns produced donations from
different groups of people.

The second motivation behind our choice was that the group’s experience with machine learn-
ing seemed to match well with this specific hypothesis. Ideas for possible models were nu-
merous, even before the hypothesis had been selected. These factors combined to guide our
choice, Figure 2.1 shows the various hypotheses plotted in a business analytics leverage matrix
(as described in [2]). Based on our skill-set and our impression of the data set, our hypothesis
of choice scored high on analytics feasibility, while placing second on business value. We
should point out that this represents our own evaluation of value and feasibility, and does not
purport to represent UNICEF Norway’s views. Ideally, this process would be done with more
involvement from the organization, but circumstances required us to make our best, educated
approximation.

5
Figure 2.1: Our business analytics leverage matrix of UNICEF’s hypotheses

2.2 Data-strategy

Taking time to plan out a project before diving into the work is a necessary step in all fields.
As businesses rush to make use of data analytics technology, many projects fail at the design-
of-process phase [3]. To avoid this trap, we have used an intentional, planned process which
is based on established methodologies in the data science industry. We took the principles
of design thinking as our guiding strategy, to which we added certain elements from Cross-
Industry Standard Process for Data Mining (CRISP-DM), and Business Analytics Model
(BAM).

These three methodologies share key elements which combined well to make a high resolution,
structured approach to designing our data strategy. Design thinking is a broad concept with
application in many fields. The central concepts of design thinking are to empathize with
the user to identify solutions tailored to their needs, as well as a hands-on approach to
finding solutions via repetitive cycles of prototyping and testing. We have modeled our use of
design thinking on Hasso Plattner Institute of Design at Standford University’s version of the
design thinking process [4]. Additionally, several techniques described in a 2018 case study of
business analytics application to a foodbank charity[2] have been useful in describing UNICEF
Norway’s business problem and determining applicable data techniques for their situation.

6
2.2.1 Purpose and approach

The purpose behind our data-strategy is to help us answer our project objectives convincingly
and effectively. To accomplish this goal, we believed a data-strategy grounded in design
thinking would be appropriate. This conclusion was based on the following observations:

• Design thinking puts the human perspective up front. By adopting such a perspective
we should be more able to see the needs of UNICEF.
• Design thinking emphasises the value of an understanding of the goals, needs, chal-
lenges and motivation of UNICEF before searching for solutions. Applying this to data
analytics should give us a better foundation to base or solution on.
• Design thinking values thorough and well-prepared testing of prototypes to document
performance and find areas to improve. This will help us design a rigorous system
to assess the performance of our model based on the needs of UNICEF which were
identified earlier.
• The simplicity of the design thinking process made it stand out as an approachable
method for our first data analytics project.
• Design thinking already enjoys a frequent and successful presence in data analytics, as
documented by [5].

2.2.2 What is design thinking

Design thinking is an iterative process with origins from the design industry. The process
consists of attempting to understand the user, questioning our assumptions, and redefining
problems to try to identify alternative strategies and solutions [4]. There are many variants
of the design thinking process, which span from three to seven phases. According to Hasso
Plattner Institute of Design at Stanford, the five phases of design thinking are as follows: (1)
Empathize, (2) Define, (3) Ideate, (4) Prototype, and (5) Test. The phases do not have to
follow a specific order and can be done in parallel. The goal of the design thinking process is
to improve products by analyzing and understanding how users interact with products and
investigating the conditions in which they operate.

2.2.3 Deviations from design thinking

Certain aspects of design thinking were not followed, either due to incompatibility with the
project or by choice. First and foremost we did not have the benefit of working closely together
with UNICEF, e.g. witnessing how data powers their work, getting first-hand accounts of
what they want from data, and performing interviews, ideation and testing together. Design
thinking also emphasises the development of ideas which defy convention during the ideate
stage, with some sources proclaiming the crazier the better. The group however, considered
a more conservative approach to be the best fit for our project given our inexperience on
the topic and our impression that UNICEF’s hypothesis could be effectively answered with
existing methods. Below, we present the various stages of our design thinking strategy.

7
2.2.4 Empathize

In order to design our data strategy to the needs of the people involved, we had to understand
the point of view of both UNICEF Norway and the individual donors who support them. The
empathize stage of design thinking is specifically made to fulfil this need. Our goals during
this stage was to understand how UNICEF Norway and their donors operate, how data is
involved and used, and what requirements UNICEF Norway has relating to the data and
our analysis of it. During the later stages of design thinking, this information helped us
expand our understanding of the hypothesis we selected. We will continue this section with
a summary of our findings during the empathize stage.

As explained in Section 1.1, UNICEF is the world’s largest children’s fund and their mission
is to make sure no child undergoes preventable hardships. UNICEF relies on voluntary con-
tributions to uphold this mission. UNICEF Norway’s Fundraising Department for Individual
givers are constantly looking to raise more revenue to support their cause.

Figure 2.2: The Business Model Canvas for UNICEF Norway. Yellow relates to UNICEF’s donors,
blue their beneficiaries

Drawing on the BAM methodology we started with a rich picture diagram of UNICEF’s brand
framework and the relationship between UNICEF Norway and their donors, attempting to
represent the motivation that drives people to contribute. The brand framework is based

8
on an online survey with nationally representatives across age, gender and region with 803
respondents [6].

Figure 2.3: Rich Picture Diagram of UNICEF’s brand framework

Explanation of the Rich Picture Diagram’s flow: Awareness is the base that brand knowledge is
built on. A brand’s growth and decline (penetration) starts at awareness. Aided, spontaneous
and top of mind awareness are the 3 stratified measures of awareness used across countless
brands to identify the strength of their overall brand with consumers. Mental and physical
availability are used as measurements of the strength of brand penetration in the traditional
consumer world. Mental availability refers to the probability of a potential supporter noticing,
recognizing, and thinking of your brand in a relevant situation. To better understand a brand’s
mental availability, UNICEF tries to identify deep-seated mental structures which potential
donors have as associations with UNICEF. These can be literal structures, like UNICEF name
and logo, or more indirect and complex, such as associations with children in need. Brand
assets should be simple, consistent, easy to remember and should trigger instinctive responses

9
when seen. This is because 95% of decisions are subconscious and most of them are made
with the emotional brain [6]. Strong awareness is not enough to bring along strong emotional
connotations, only communication can.

UNICEF and data

With the emergence of data analytics, UNICEF has seen the huge impacts that data driven
decision making can have on their bottom line. For us to help UNICEF make well-founded
decisions based on their data, it was an important part of the empathize step to understand
how data is used at UNICEF. We had the benefit of questioning a business analyst at UNICEF
with first hand experience on this topic. Here are some highlights from what we learned:

• UNICEF collects internal data from Payments / ERP systems, and connections with
payments in CRM system. Personal data is given by the customers and Google Ana-
lytics. External data comes from Market research, reports and SSB.
• Data is typically analyzed through reports and business intelligence systems (Qlik,
Power BI) and in Salesforce Dashboards.
• Results from the analysis should Improve the business results; offer insight, measures,
predictions; and prioritize what to do, or what not to do.
• An ideal model should give data-driven knowledge / better insight for the stakeholder,
so the most optimal decision can be taken.

2.2.5 Define

In the define stage we used the knowledge we gained while empathizing to define UNICEF’s
needs both in general and relating to our project. This information then helped us decide
upon the long term and short term goals of our project, formalized as project goals and
outcome goals. These goals were made together as a group, and then approved by UNICEF.
The results of the define stage are found in 2.1 Project objectives.

2.2.6 Ideate

Once we gained an understanding of UNICEF, their goals and problems, as well as an un-
derstanding of the data, we were ready to move to the the ideate step. This step involved
generating as many ideas as possible for how the data could be analysed to create solutions
for the business problem. The methods we considered during this step are covered in 3.3
Choosing the method.

2.2.7 Prototype

The prototype stage is where the fruits of the ideate step were implemented. Our most
important efforts in this stage was to select the frameworks and environments through which

10
we would implement our prototypes. Our goal was that a complete prototype would, if
it passed testing, be in a state that was largely ready for delivery to UNICEF. This was
important to avoid repeating work after a prototype had been approved. Due to the nature
of the project, our prototyping was not limited to the method. Careful pre-processing of the
data was essential to the successful execution of the project, possibly even more important
than the method itself. Consequently, hypothesizing which subsets of the data were relevant
to our analysis, was a significant body of work during prototyping, with similar effects on
testing.

2.2.8 Test

The test stage is where our prototyping is evaluated to make sure they possess the capabilities
that UNICEF require. Naturally, this made it a very significant part of the data strategy,
requiring well prepared procedures, metrics and measurements of success. Consequently, we
spent a significant amount of time fine-tuning our test protocols. This was done in stages
as we came closer and closer to a prototype. In the first stage (during define) we identified
general benchmarks and targets based on the project objectives that any solution would have
to meet. Then, once ideate was complete, a more detailed test methodology with procedures
and tests targeted directly at our prototype(s) was made. See Section 3.2 for details on our
testing and evaluation methods.

2.2.9 Data integrity

Ensuring that our data maintained accuracy and consistency during our analysis was an
important challenge to overcome to meet our project’s goals. Contaminated, inaccurate,
redundant, or missing data could cripple our analysis no matter how sophisticated it might
be. Consequently, we made significant efforts to understand the dataset and addressing its
shortcomings. These efforts are discussed in length in Section 3.1.2.

Roles and responsibilities

New roles were needed to complete our data-strategy. These roles were not very big in
scope, and generally did not keep their assignee away from other activities on a frequent
basis. Nonetheless, these roles were important to ensure sound data management. Table 2.1
presents and explains the new roles alongside their assignee.

11
Member Responsibility Explanation
Eirik Data quality Responsible for ensuring that the data is fit for use.
Lars Martin Data analytics Responsible for organizing the planning and execution of the data
analysis.
Olav Data availability Responsible for ensuring the entire group has access to essential data.
Andine Testing Responsible for organizing the planning and execution of the testing
and validation of the method.
Eric Interpretation and Responsible for managing the interpretation of the results from our
communication analysis and the communication and visualization of these findings.

Table 2.1: Roles relating to our data-strategy

12
3 | Method

3.1 The dataset

The provided data contains six tables detailing donors and the marketing campaigns directed
at these. After looking over this data, we made a few small modifications to make a new
default dataset for the project. These modifications were limited to removing completely
empty attributes, like the address field, or constant fields like currency.

3.1.1 The Tables

This section includes explanations of all the tables included in the dataset. See Figure 3.1 for
a relational diagram of the dataset.

Donor File Contacts

This table contains information about a giver’s age, date of birth, gender and donations,
which tells us when the giver first donated and their previous donations. We also see the sum
of their donations, and how many times in total they have donated. The giver ID and the
contact ID are unique IDs for a giver; the giver ID is used in the user interface in UNICEF’s
systems, and the contact ID is used internally.

Payment File Cash

Here we also have the contact/giver IDs, so we can link this table (many to one) to the
donor file contacts table. One row here describes one cash payment/donation to UNICEF,
including the amount given and which medium was used for paying (bank account, postal,
etc..). Here we can also get to know which campaign lead to this donation, and which channel
this campaign used. The opportunity ID is a unique key for each of these payments. It is
always connected with a contact ID and a campaign name.

13
Payment File Pledge

This table gives an accord of pledge payments (recurring payments). We have one row per
payment, so most pledges are listed multiple times. Again, we have the contact/giver IDs,
the unique opportunity ID, and the name of the campaign associated with the pledge. Here,
amount describes how much money was given on the listed date. The pledge type is either
listed as UNICEF Fadder or SMS Livredder. The recurring donation ID allows us aggregate
all pledge gifts given in a specific pledge.

Pledge File Contacts

This table gives an overview of the pledges listed in the Payment File Pledge table. Here we
get the date the pledge was established and the date the pledge closed, if applicable. The date
of the first and last payments, how much money is given on each donation, and how often
one donates, is also listed. Additionally, we get the number of payments transacted in the
pledge, and the total amount given. The recurring donation ID is unique for a given pledge.
Importantly, the pledges also include a field, Marketing Channel, which describes the channel
that compelled the donor to start pledging (e.g. post, SMS, DRTV, etc.).

Campaign File

This table contains campaigns, their names and, if relevant, their parent campaign name
and ID. We also get the start and end date of the campaign. Many campaigns are ongoing
so the end date is not always present. The campaign goal only signifies whether it is an
emergency campaign; according to our UNICEF contact, other values are not relevant. Some
campaigns have a campaign type, which signifies the medium used to reach potential donors
(such as Direct Mail, SMS and others). The activity field says which approach the campaign
has towards gathering donations, whether it be celebrations, company partnerships or others.
Many campaigns also have specific themes, such as improving education, children’s health in
specific countries, protection, or similar. There are also campaign texts, which serve as slogans
for the campaign. The agency field describes who is responsible for hosting the campaign,
either in-house for UNICEF’s employees/volunteers or agency for third parties (or empty).

Campaign With Campaign Members

This table lists campaigns with campaign members. These campaigns include donors as
members, which will give the donor a MemberID for the campaign. This ID is unique only
within the campaign. Other than this, all fields are equal to the campaigns in the campaign
file. For our purposes, the more bare-bones campaign file was sufficient.

14
Figure 3.1: The different data tables and their relations

3.1.2 Data quality and integrity

In order to measure the quality of our data, we have chosen five metrics[7] to go by.

Accuracy

We presume the data to be accurate. This is because the data originates from UNICEF’s
internal storage, upon which the whole business relies on the accuracy of donation information.
However, for the sake of caution, we also looked for outliers in the data, but only found one
outlier which caused concern, namely a one-year old pledge giver. However, the remaining
data for this giver seemed believable, so we decided to keep the data. The donor account was
likely made by family members on behalf of their young relative.

Relevance

Our primary objective, to test what marketing activities should be directed at which segments
of donors, heavily relies on the relation between marketing campaigns and donors. Our
data contains information about what campaign initially led a donor into donating, which
marketing channel was used, how much the donor has donated, and what type of donation

15
Table Column Empty
Campaign Parent Campaign 96%
Campaign Type 95%
Agency 92%
End Date 86%
Campaign Goal 83%
Campaign Theme 62%
Payment Cash Channel 93%
Campaign with members Campaign Type 84%
Campaign Goal 78%

Table 3.1: Columns with over 50% empty fields

was made (pledge or one time donation). Consequently, the data is very relevant to the task.

Completeness

Some fields in the provided data are frequently empty. We analyzed 1000 samples from each
table to identify lacking fields. In table 3.1, we list the columns with more than 50% empty
data.

Timeliness

Our data only contains information about donors who have given after 2016. For each donor,
we are provided with data for specific donations given after 2016, alongside cumulative data
about their contributions prior to 2017. Since the most recent pledges are the most important
to our analysis, the timeliness of the data is acceptable, however, data about specific donations
before 2017 would still be useful. Also note that the total paid amount is just for donations
after 2015.

Reliability

After our initial removal of empty fields, we uploaded the data to a shared drive. If something
is ever altered here, we would know to look into it (as Google Drive keeps track of alterations).
This ensures our data is not changed over the course’s progression, so the core data will always
be the same. While we will certainly not use all rows or columns, the data itself will not be
changed from what is on the drive. We also recognize that some data is captured more reliably
than other data. The amount of money given for donations is critical for UNICEF, and
discrepancies here could give the organization a lot of trouble. Meanwhile, which campaign
a given donation belongs to might not be as critical, and assumptions might be made from
UNICEF’s side based on time and location. Thus, fields which relate to the financial aspects of
donations can be considered more reliable than data about marketing channels, for example.

16
3.2 Evaluation criteria

It was necessary to develop a set of criteria to evaluate the effectiveness of our analysis.
Definitive criteria of effectiveness in any analysis are useful in determining the relative value
of different methods. The relative value of different approaches allowed us to focus on the
most effective methods. A model may be the best of its peers and nevertheless not be worth
implementing if it is too complicated to implement. Thus to accurately assess value it is
important to consider the totality of what the method brings to the table. At the same time
it could be that there are several models that each have high absolute value and may each have
their own place, then one might begin considering the complexity and value of combinations
of methods. This section will discuss the criteria that we have decided to use to evaluate
candidate models.

The best evaluation of any model is to measure its ability to achieve its purpose in the real
world. In this situation that would mean being implemented in marketing campaigns by
UNICEF and showing measurable improvement in UNICEF’s marketing effectiveness. This
should be the eventual validation of our model, should UNICEF decide to use it. In the
short term, however, we needed criteria that could be used immediately by us to evaluate our
models, and to present to UNICEF to give them confidence in our recommendations. This
meant developing a testing methodology that could be applied using only the data we already
have.

A large practical hurdle in designing a test criteria for prediction is that the data we have only
contains positive responses to UNICEF campaigns. Without any data that tells us the poten-
tial donors who were contacted and did not donate, we could not test a model that predicts
whether a particular type of individual would respond or not to a particular type of campaign.
Given what data we had, we instead had to evaluate whether our model was able to identify
features of individual givers which could be related to the effectiveness of various marketing
activities. We therefore decided to use the clusters given by each method, and evaluating the
distribution of marketing channels for the cluster’s members compared to the distribution of
the complete subset used in clustering. Large deviations in these distributions would imply
that the model has identified giver characteristics which correspond to a marketing channel
preference.

Clustering is an unsupervised type of machine learning, which does not assume what sort
of classifications we want the model to output. Clustering effectiveness is then a measure
of the efficacy of the distinctions that the model draws in the data. Does it separate the
input data into meaningful groups which are internally consistent (low variance), yet distinct
(high variance) from one another? Another measure of the effectiveness of clustering is using
measures which compare clusters to one another. The Silhouette Coefficient gives a high score
to dense and well separated clusters [8]. It is defined using two scores:

• The mean distance between a sample and all other points in the same class
• The mean distance between a sample and all other points in the nearest cluster

17
The score is a value between -1 and 1. -1 meaning poor clustering, 0 indicates overlapping
clusters, while 1 indicates dense clustering. The Silhouette Coefficient is normally applied to
the results of a cluster analysis. This coefficient is generally higher for convex clusters than
other clustering methods, but since density based clusters were quickly disqualified when
selecting methods, this was not problematic. Figure 3.2 describes the Silhouette coefficient’s
formula.
b−a
s=
max(a, b)

Figure 3.2: a: the mean distance between a sample and all other points in the same class. b: the
mean distance between a sample and all other points in the next nearest cluster

The Davies-Bouldin index is an alternative metric for evaluating clustering algorithms that is
simpler than the computation of Silhouette scores [8]. A lower Davies-Bouldin index indicates
that a model has dense and well separated clusters. This index refers to the average similarity
between clusters. In this case, the similarity is a measure that compares the distance between
clusters with the size of the clusters themselves. With zero as the lowest possible score, values
closer to zero indicates a better model. The Davies-Bouldin index is generally higher for
convex clusters, the same as the Silhouette Coefficient. See Figure 3.3 for the formula.
si + sj
Rij =
dij
k
1X
DB = maxRij
k
i=1

Figure 3.3: si : the average distance between each point of cluster i and the centroid of that cluster.
di j: the distance between cluster centroids i and j

Finally, the Calinski-Harabasz index gives higher scores to a models with more separated
clusters. The formula gives the ratio of the between-clusters dispersion and the inter-cluster
dispersion [8]. Dispersion is defined as the sum of distances squared. See Figure 3.4 for the
formal description.
tr(Bk ) nE − k
s= ×
tr(Wk ) k−1

Figure 3.4: The score, s, is defined as the ratio of the between-clusters dispersion mean (Bk ) and the
within-cluster dispersion (Wk ) for a set of data E of size nE .

After the clustering has been done, we tested if there was a robust connection between the
clusters of donors and the distribution of campaign features that were associated with the
respective donations. From start to finish, this process entailed the following:

18
1. Calculate the distribution of marketing channels for the whole group of donors.
2. Cluster the donors into k donor clusters, each of which is a distinct donor segment.
3. Calculate the distribution of campaign classes associated with each cluster.
4. Compare the difference between the proportional distribution of the campaign classes
of the entire subset against each individual cluster.
5. Compare the difference between the proportional distributions of the campaign classes
between each other.
6. Analyze and explain differences based on cluster characteristics and research.

A larger variation in the campaign distributions of the clustered data should mean that the
clusters are distinct from one another in ways that correlate with donor responses to different
campaigns.

3.3 Choosing the method

Our choice of method took inspiration from the field of market segmentation and market
clustering. These fields are central in marketing activities and offered established methods
which have seen success in problems very similar to ours through many decades. We made
the choice early on that our method would be retrieved from a software library as opposed to
implementing the method ourselves. The following factors were key in that decision:

• Libraries offer pre-made methods, thus saving time and guaranteeing correctness.
• Libraries typically offer a suite of methods, which will enable us to test multiple methods
and see which fits best.
• Libraries often include functionality beyond the methods themselves, including pre-
processing, visualization, and testing.
• Libraries usually include documentation, guides, and established communities which
can be helpful if we encounter any issues.

With this in mind we made the following requirements, before selecting the plug-in:

• A library with multiple promising methods should be heavily favoured

• The library must be documented
• The library must feature clustering algorithms
• Ideally, the library should include visualization and testing functionality, or be trivially
compatible with libraries with those functions

19
With these criteria established, or choices were quickly narrowed down to Scikit-learn and
Pycaret. We eventually chose Sci-Kit learn as it had by far the best documentation, a large
community to help with potential issues, and it was familiar to many in the group. With this
library selected, we began preliminary testing to select the specific clustering method among
Scikit’s 10-or-so candidates. Multiple methods were immediately discarded as their inductive
bias rendered them unable to produce good results on the tested subset. With a reduced
number of methods, more in-depth analysis could be made, following our extensive evaluation
procedure. In the end, the seasoned K-means algorithm produced, by far, the most consistent
results both on heuristics and visual inspections.

3.4 Tools

Here we will briefly explain the tools we used for data management and analysis.

Power BI

Power BI Desktop is a free tool primarily for data visualization and analytics. It provides a
simple interface for viewing large amounts of data, and automatically relates columns across
tables (if the column name is the same). Power BI provides visual objects. These are a range
of different diagrams, scripting blocks, or matrices. Power BI allows us to input whatever
columns we want to, and gives simple point-and-click interfaces for aggregation.

Python

For the analysis, we used python, and loaded the data by reading the .csv files. The python
packages we had to install to get started with the data were pandas and matplotlib. To
implement the machine learning methods, we used Scikit-learn, a free machine learning library.
It offered classification, regression and clustering algorithms, of which the latter is what we
used in our analysis. For visualization of the results, we used pyplot (a group of functionality
within matplotlib) and seaborn, a visualization library for statistical data.

3.5 Data pre-processing

With the preparatory work behind us, data pre-processing could begin. Our goal was to ex-
tract meaningful feature-vectors from the dataset which could form the basis of the clustering
we were after. Our approach was data driven: we started to devise hypotheses about which
grouping of data would produce meaningful feature vectors to cluster on. For k-means to
be effective, we also had to consider the dimensionality of the vectors and correlation among
its fields, both of which should be kept low. This resulted in three subsets, listed in Table
3.2, based on three hypotheses. The subsets use features exclusively from the Donor File
Contacts, Pledge File Contacts, and Payment File Cash tables.

20
Description Hypothesis Fields
Cash-to-pledge The motivations for cash donors to transition into Age, Total number of cash
giver pledge donations are diverse and linked to their gifts before pledging. Aver-
profile. This includes the duration and contribu- age value of cash gifts before
tions of their pledge(s) pledging
Nothing-to- The motivations for non-givers to transition into Age, Total number of pledge
pledge giver pledge donations are diverse and linked to their gifts, Average value of pledge
profile. This includes the duration and contribu- gifts
tions of their pledge(s)
Nothing to one The motivations for non-givers to start giving cash Age, Total number of cash
(or more) time donations are diverse and linked to their profile. gifts, Average value of cash
giver This includes the frequency and amount of dona- gifts
tions

Table 3.2: The feature vectors used to cluster, and our associated hypotheses

One question we needed to answer was whether excluding gender from the clustering would
detract from the analysis (we had seen indications of differences between the giving tendencies
of each gender). Since measuring distance between genders is not possible, including it in the
vectors themselves was not an option. Thus, for the first subset we tested, we made three
versions, one for each gender, and a third which combined them both. The results from this
test showed that gender did not add to the analytic insights, as seen in Section 4.3. Thus,
moving forward, clustering was done on both genders together.

21
4 | Analysis

In this section we go through each of the feature vectors we extracted from the dataset, and
highlight the immediate findings from our method. We also offer an explanation of how we
decided the number of clusters.

4.1 Data subsets

In our attempt to segment the giver base of UNICEF into meaningful distinct categories, we
started by dividing givers according to their gift history. UNICEF givers make either cash
gifts, pledge gifts, or both. UNICEF would like to increase the number of pledge givers, so
we chose to analyse how UNICEF reach out to potential givers in the future to increase the
number of pledge gifts. When analysing the existing data of pledge givers we realized that
for those givers who had a history of making cash gifts before they made any pledge gifts, we
could make use of their cash donation history in addition to their demographic information
to classify them. In this way, the cash-to-pledge giver encompasses those givers which have
given at least one cash gift before they started their first pledge. Of the givers who are
making pledge gifts, approximately 20% started out as cash givers and were persuaded to
later upgrade to a pledge gift. As Figure 4.1 shows, UNICEF has a substantial pool of over
63000 cash givers who are not currently making pledge gifts. These givers can be targeted
with the campaign insights derived from the cash-to-pledge analysis. Meanwhile, the pledge
givers who did not make a cash gift before pledging were classified on the basis of their pledge
gifts alone. This group of 62781 givers have been termed the nothing-to-pledge givers and the
analysis of these givers should provide insights UNICEF can use to target new givers in the
future. The nothing-to-cash subset had to be dismissed due to shortcomings of the data, this
is discussed later in Section 4.1.3.

22
Figure 4.1: Area of figure segment and numbers indicate number of givers in UNICEF data subsets.
Overlapping area of givers who made both cash and pledge gifts is divided according to which gift
type was made first

Choosing cluster features

Both the cash-to-pledge and the nothing-to-pledge givers were clustered according to their
demographic information and giving history. For the cash-to-pledge givers their giving history
features were based on their cash gifts before they became pledge givers, while the nothing-to-
pledge givers used their pledge gift history. The total number of gifts given and the average
amount per gift were taken to be the characteristics of gifts that were most orthogonal to
one another, and when combined represent the value of the individual giver to UNICEF.
The demographic information we had available about the givers were gender, postal code and
age. After some testing using these features the gender and postal code were found to not
increase the utility of the segmentation and so were excluded from the clustering. Thus the
three features of age, total gifts given, and average gift value were used as the basis for the
clustering algorithm.

4.1.1 Choosing k

Before any clustering can happen, k, the number of clusters must be selected. To do this we
made use of both visualization, intuition and heuristics. A common heuristic is the elbow
method, where a score representing the quality of the clustering is calculated for the data
set over a range of k values. When the scores are plotted against k, it is possible to observe
the increase in quality for each additional step of k. An elbow like shape indicates a point of
diminishing returns. As there is no definitive metric for evaluating the clusters, we sampled
three metrics, the silhouette score, the Calsinki-Harabasz index, and the Davies Bouldin score.
Figure 4.2 presents the elbow plots of the 3 methods for the two subsets.

23
Figure 4.2: Plot of the clustering scores for different values of k.

From these plots a choice of k = 4 presented itself to us as the preferable option on the cash-
to-pledge subset. This method does not provide a definite right answer and for each data set
there were higher k values that may also have made some sense, however, the lower number of
clusters was chosen as it appears significant in all three indexes. Additionally, our own visual
inspection of the clusters made it clear that cluster categories were most apparent with k = 4.
The nothing-to-pledge data seemed to lean towards a k of 9, though our visualizations for
k = 4 were promising. Thus, for the sake of consistency and reduced complexity, we continued
with k = 4 for both subsets.

4.1.2 Cluster characteristics

Here we present the distinguishing features of each of the 4 clusters returned from our method.
The clustering was done on Min-Max-scaled feature vectors[9] consisting of the age of the giver,
the average value of the cash gifts made by that giver, and total number of cash gifts made
by that giver. Figure 4.3 presents how the clusters members are distributed for each pair of
the three dimensions clustered on.

24
Figure 4.3: Visualization of the defining features of the clusters along the three feature axis. For both
subsets the clustering categories are visible along the age and number of gifts axis

The plot shows that the algorithm clustered dominantly on the number of gifts and the age of
the giver, while the average gift amount did not appear to influence the clustering on either
subset. From this we decide to name our clusters according to their dominant features. Young
givers with many gifts (YM cluster), young givers with fewer gifts (YF), older givers with few
gifts (OF), and older givers with many gifts (OM). In some cases we may add a suffix, c or p,
to distinguish if the giver transitioned from a cash giver or non giver. Figure 4.4 shows the
relative size of each cluster (only relative within the subsets, not across), plotted on coordinate
system with Mean Age and number of gifts as the axes.

25
Figure 4.4: Visualization of the relative size of clusters within the subsets. Note that the size between
subsets is not proportional

4.1.3 Data set linking

After the data was clustered, the clusters were linked back with the source data in order to
get the marketing channel information needed for analysis.

Cash-to-pledge and Nothing-to-pledge

In the cash-to-pledge and nothing-to-pledge subsets, the clusters are linked with the marketing
information from all the pledges that a person in the cluster has committed. This was done
instead of only including the marketing information from the first pledge because the provided
data only contained pledges that were still active after 2016.

Nothing-to-cash

In the nothing-to-cash data subset, linking was difficult because the marketing channels as-
sociated with the cash payments were not useful, as the values were either too general, vague
or overlapping (e.g. the Inspired Gifts, Digital channels, and Web store channel). A different
field did contain information analogous to the marketing channels associated with pledges,
though this data was so sparse that it could not support a generalisable analysis. Because of
these setbacks, we were unable to continue with the nothing-to-cash subset.

26
4.2 Results

4.2.1 Cash-to-pledge

Examining the distribution of the features within the clusters fills in the picture of the char-
acteristics. Figure 4.5 presents the distribution of the features within the subsets as a whole,
which we use as point of comparison for the distribution within the clusters. The violin plot
in Figure 4.6 is split by the recorded gender of the giver with the male givers on the left in
blue and the female on the right in red. The plot displays the distribution within the clusters,
compared to the complete subset, for each feature clustered on.

Figure 4.5: The market channel distribution for the cash-to-pledge and nothing-to-pledge subsets

27
Figure 4.6: Violin plot: The mean value is marked with the white dot. The thicker black ’box’ marks
the middle 50% while the black centre line shows the maximum and minimum value. The distribution
of the values is given via a kernel density estimate function, split by the recorded gender of the giver.

The age distribution is relatively uniform both before and after clustering, with the clusters
splitting into two pairs above and below age 50, with a small amount of overlap between the
Young Many givers and the Old Few Givers. The Number of gifts given, however, is much
less uniform: a uniform distribution is initially present, but only up to around 70 gifts (giver
or take 10), where many givers are centered. This trait is reflected in the clustering. The
Average_gift dimension is the least descriptive here, as the majority of pledges group around
the mean of 250 NOK—though YF givers show substantially less variance than other clusters.

28
On the other hand, OM givers show a significant amount gifts above the mean alongside high
variance, though this is largely driven by a few high givers.

Now comes the real test, adding the marketing channel, which was done by including all chan-
nels associated with the pledges of each cluster member. Figure 4.7 presents the distributions
within each cluster (columns on the same row sum to 1), alongside the expected age and
number of gifts for each cluster. To compare, Figure 4.5 presents the channels’ distribution
on the complete subset. Table 4.1 presents data from Norsk Mediebarometer 2019 [10] which
was used to support our analysis. The data, captured by SSB, presents average use of internet
and TV for various age groups in Norway.
Age Avg Daily TV-use Avg Daily Internet use
25-44 51 minutes 228 minutes
45-64 102 minutes 141 minutes
65-79 160 minutes 54 minutes

Table 4.1: Average daily use of Internet and TV in minutes for various age groups. Captured by SSB
for Norsk Mediebarometer 2019 [10]

29
Figure 4.7: The marketing channels’ distribution in the clusters returned within the cash-to-pledge
category

We make the following key observations:

30
• Face-to-face (F2F) is far more prevalent with Few-pledge givers. We believe this may be
explained by the lack of initiative on the giver’s part in the F2F channel. Since givers
do not encounter F2F representatives on every street corner, many-givers will have a
hard time concentrating their gifts through this channel. Luckily, this does not stop
them from donating elsewhere. Few-givers, however, will not take initiative to donate
as frequently as Many-givers, thus F2F is more dominant.
• SMS is used by all groups, but YM givers and OF givers clearly uses this channel
more frequently than the other groups. Still, the difference are less striking than other
channels. This may indicate that the age and number of gifts of a giver is not enough
to reliably decide whether SMS is appropriate or not.
• Post is only used substantially by OM givers. We believe this may be because these
givers, on average, started giving many years ago when mail was more ubiquitous. This,
alongside their age could make post the best channel to approach OM givers. Another
explanation is that highly dedicate givers are more likely to open mail from UNICEF,
due to their commitment. This is supported by the fact that YM givers score higher on
the Post-channel than OF givers.
• Web is used far more by young givers than older ones, likely due to increased comfort
with this web-technology. SSB’s data in Table 4.1 substantiates this hypothesis.
• Telemarketing and TV-events are are more successful at targeting Many-givers. We
believe telemarketing is more effective for this group because they are less likely to end
the call once they hear it is from UNICEF than Few-givers. We are not quite sure why
OM givers are so likely to use the TM channel, but a possible explanation is deliberate
targeting from the telemarketers working for UNICEF.
• OF givers show a strong preference for F2F, SMS, and DRTV (Direct response TV
ads) channels. Their F2F and SMS channels are consistent with previous observations,
e.g. that Few-givers often respond to F2F. Their affinity for DRTV is very distinct,
however. Our hypothesis is that these givers spend more time than most watching
TV, which makes DRTV an effective channel to reach this group. This hypothesis is
supported by the data from Norsk Mediebarometer 2019, seen in Table 4.1.

4.2.2 Nothing-to-pledge

The nothing-to-pledge data set encompasses those givers whose first contribution to UNICEF
was a pledge. The feature vectors used for clustering include the giver’s age, number of
gifts and average gift amount. After analysing these vectors, a k of four was again the best
choice. Multiple similarities were indeed discovered with the cash-to-pledge category; average
gift did not impact the clustering in any significant way, while and age and number of gifts
did. Thus, our giver labels—OF, OM, YM, YF—return, with the occasional n or c suffix
for clarity. Figure 4.8 displays the distribution for the cluster features, compared to the
distribution within the subset. Figure 4.9 presents the market channel distribution for each
cluster, alongside the cluster’s expected age and number of gifts. Figure 4.5 gives a point of
comparison with the market channel distribution of the complete subset.

31
Figure 4.8: Violin plot: The mean value is marked with the white dot. The thicker black ’box’ marks
the middle 50% while the black centre line shows the maximum and minimum value. The distribution
of the values is given via a kernel density estimate function, split by the recorded gender of the giver.

32
Figure 4.9: The marketing channels’ distribution in the clusters returned in the nothing-to-pledge
category
33
We make the following key observations:

• Post (not shown) was only used for 30 pledges.

• Compared to cash-to-pledge, the TV event channel shows the largest difference. For
Many-givers, this channel is by far the most common.
• Telemarketing is strikingly absent compared to cash-to-pledge givers. This may indicate
deliberate targeting of cash givers by UNICEF.
• Once more, F2F is most significantly represented within the Few-giver category. F2F is
especially common among YF givers. This may indicate that it is an especially effective
way of gaining new Y-givers.
• The SMS channel sees a clear decline with Young-givers compared to its cash-to-pledge
counterpart.
• DRTV shows similar representation as in the cash-to-pledge category. Older-givers still
use this channel the most, however this time around OM givers use it the most, quite
unlike cash-to-pledge.
• Web also shows a similar representation. Younger-givers use this channel twice as much
as the older-givers.

4.3 Briefly on gender

We noticed during early visualization of the dataset that there were clear differences between
male and female givers. Female givers generally give less than men, but there are more
female givers in total, making them the largest contributing sex from age 30 and up. We
were considering ways to account for this, should it prove significant in clustering. Since
kmeans relies on distance, adding a gender feature to the clustering vectors would sabotage
the distance measure, and by extension the results. An alternative worth considering was
splitting the subsets by gender beforehand, and offering separate clusters for male and female
givers. As you know, we did not follow this approach; from the violin plots in Figure 4.6
and 4.8, you can see why. The distribution estimates have been made for each gender, and
while there are some differences for a given cluster and feature, these are within reasonable
limits. This makes sense considering that the observed difference between the sexes (age and
gift value) are both attributes which are clustered on. Thus a male giver will probably have
more in common with a female in the same cluster than a male from another.

34
5 | Interpretation and recommenda-
tions

In this chapter we discuss our recommended measures based on our analysis. We then outline
an implementation plan and discuss the limits of our analysis and how these may be addressed
in the future.

5.1 Recommended measures & implementation plan

Our implementation plan covers specific measures grounded in our analysis, with accompa-
nying stakeholder actions and schedule. Regrettably, the plan does not include a column
delegating the responsibility for each measure to specific UNICEF employees as we only have
contact with one of their workers. Consequently, UNICEF should make these delegations
before following the plan to ensure each measure is monitored and controlled by a capable
individual. Additionally, we do not have an explicit time schedule, as what is a realistic
schedule largely depends on UNICEF’s available resources, which we are only slightly familiar
with. The measures are, however, ranked by priority to assist UNICEF in implementing them.
Table 5.1 presents our measures1 .

ID Measure Stakeholder actions Insights

A YM givers can be used Identify YM givers will- These givers display similar willingness to donate in
to test the effectiveness ing to give feedback on almost all channels. Thus, when trying new cam-
of new campaigns. campaigns. paigns, channel bias will have the least influence
YM givers. Poor performance with this group is
thus an indicator of a weak campaign
B YM and OM givers can Recruit YM and OM These givers rarely give based on F2F campaigns.
be valuable representa- givers for F2F cam- Simultaneously, their enthusiasm for UNICEF is
tives in F2F campaigns paigns clear. Thus, one way to increase their contributions
to F2F campaigns is to let them front them.

1
Label reminder: Y and O specify giver age, young and older; F and M relate to the donor’s number of
gifts, few or many. Occasionally, the suffix c or n is used to describe the pledge givers’ origin, did they initially
give a cash gift before pledging (c) or had they not given anything prior to their first pledge (n)?

35
ID Measure Stakeholder actions Insights
C YFc givers who do not Identify YFc givers, ig- YF givers almost never respond to post campaigns.
respond to post cam- nore those who do use Such outreach can therefor be considered mostly a
paigns should stop re- post waste of resources.
ceiving them
D Post campaigns should Identify OFc-givers. OMc givers show a significant affinity for post cam-
be targeted at older Tailor mail-campaigns paigns. This may indicate unused potential among
givers to Few-givers OFc givers who currently do not exhibit the same
tendencies.
E Web campaigns should Understand which Older givers, likely due to discomfort with technol-
be tailored to younger campaigns are most ogy, do not respond well to Web campaigns. For
givers effective for Y givers younger givers, however, it is among the strongest
channels.
F Focus telemarketing on Get phone numbers of Many givers respond far more to telemarketing than
Many-givers (cash-to- Many-givers and make few-givers. Probably because they are less likely
pledge) specific telemarketing to disconnect when they realize it is a call from
strategies for this group UNICEF.
G DRTV marketing Understand how to DRTV is one of the few channels which OFc givers
should be used to best reach older givers respond strongly to.
recruit older givers with DRTV
H SMS should be used Identify most effective The SMS channel was the most popular for OFc and
with targeted cam- campaigns for each YMc givers and top 3 for YFc and OMc givers. This
paigns for all cash-to- giver category popularity makes it ideal to test the effectiveness of
pledge giver types targeting strategies across all giver types.
I Implement measures to Identify what keeps Older givers clearly shy away from the web channel
make Older givers more Older givers away from compared to younger ones
comfortable with the this channel
Web channel
J Use post to reach Get necessary contact Post was extremely effective with OMc givers. Its
newly acquired Older, information absence with nothing-to-pledge givers indicates un-
nothing-to-pledge used potential
givers
K Send repeated notices Understand how many These givers are very generous in TV events. Mak-
to YMn and OMn givers are usually aware ing sure they know about is thus very important
givers about upcoming of upcoming events
TV-events
L Telemarket towards Find necessary contact This was a very effective channel among cash-to-
nothing-to-pledge information pledge Many-givers. Indicates unused potential.
Many-givers

Table 5.1: The measures from our analysis, with required actions, and underlying analytical insights

36
5.1.1 Implementation schedule

To offer a reasonable implementation schedule, we plotted each measure on a Business val-

ue/Implementation feasibility matrix. Implementation feasibility was estimated based on the
perceived complexity and resource intensity of each measure. Business value was determined
by the perceived room for improvement and the total income of each channel, as seen in Figure
5.1. Figure 5.2 presents the business value and implementation feasibility for each measure.

Figure 5.1: The total income for each channel’s pledges

37
Figure 5.2: The proposed measures plotted with their business value and implementation feasibility

Again, since we are not particularly familiar with UNICEF’s resources, we will refrain from
proposing an explicit time schedule. However, with Figure 5.2, we are able to prioritize
which measures should be implemented first. Our main prioritization strategy is risk-neutral,
leaning slightly to risk-averse, meaning we prefer a measure with high feasibility and moderate
value to a measure with moderate feasibility but high value. However, to cover a risk-friendly
strategy, we also created a secondary, risk-seeking plan, alongside a risk-averse strategy for
completeness. The plans are separated into three phases. Phase 1 includes the measures which
we believe maximize value for the underlying risk-strategy. Phase 2 are those measures which
do not offer as good of a value proposition as Phase 1, while still being clearly worthwhile.
Phase 3 include those measures which are the lowest priority based on their feasibility and
value, and the risk-strategy. For each phase and risk-strategy, measures are ordered by priority
in Table 5.2. Note that the splits between phases are unavoidably arbitrary and based on
our subjective opinion. We do not assume that beginning a phase will require that it be
completed, though we suggest the measures are performed as they are ordered below.
Risk-averse Risk neutral Risk seeking
Phase 1 J, D, A, F B, D, G, J, A G, E, B, D
Phase 2 K, L, B, C, G E, F, K, L J, A, I, H, F
Phase 3 E, H, I H, C, I L, K, C

Table 5.2: Measures, ordered by priority, using a risk-averse, risk-neutral, and risk-seeking strategy

38
5.2 Answering the hypothesis

With our extensive data on the preferred marketing channels of our 8 cluster categories, we
have a solid foundation to answer the hypothesis provided by UNICEF: Which marketing
activities should be directed at which donor segments?. To answer this question, we have
made two types of observations, first we look at which marketing channels are most effective
for the four Few-giver categories. Then we compare the marketing channels between the
four Few-giver categories and their corresponding Many-giver category to understand the
marketing channels which become most relevant for a Few-giver, when they begin turning
into a Many-giver. Our observations are:

• OFc givers are best recruited through F2F, SMS, and DRTV.
• OFc givers, once they transition to OMc status, are best reached through SMS, Post,
TM, and TV-events.
• YFc givers are best reached through F2F, Web, and SMS.
• When transitioning to YMc status, YFc givers are best reached through SMS, Web, and
TV-events. Though DRTV, and TM are strong channels as well.
• OFn givers center around the F2F, SMS, DRTV, and TV-Event channels.
• Transitioning OFn givers strongly respond to the TV-event channel, followed by DRTV.
• YFn givers strongly prefer F2F, followed by Web and TV-Event.
• Transitioning YFn givers focus mostly on TV-events and Web.

5.3 Limitations and future analysis

5.3.1 Data

Our analysis is more limited by the data than what we would like. As we have no data before
2017 about specific pledges or donations—we only have aggregated data—we can only retrieve
marketing channel from individual payments made in 2017 or later. Because of this, only the
channels a person has used after 2017 have been included in the analysis. One consequence of
this is that looking at the marketing channel which originally recruited a new pledge donor is
difficult. As a result of this, our nothing-to-pledge analysis is unable to describe what channels
can be used to recruit new donators, instead it serves more as as an indicator about how to
keep people pledging. An area for future analysis could be to include specific data about
pledges and donations from before 2017. This would provide far more accurate information
about the marketing activities which effectively recruit new members.

Another area of future analysis are nothing to cash givers, people that went from not donating
to donating with cash. This category was one we believed to be especially important, but
the data was too incomplete for us to pursue it. To analyze this category, we would suggest

39
looking at the channels for some of the first cash donations made by each nothing to cash
giver. Currently, the data detailing marketing channels for cash donations is very incomplete,
so expanding it is necessary. This could help identify what channels are most effective at
recruiting new givers of different categories (e.g. old vs young), and if some channels are more
likely to produce Many-givers compared to others. Finally, more detailed demographic in-
formation, like employment status, education, and marital status would significantly increase
the number of possible subsets to cluster on, which could offer a more insightful analysis.

A potentially large downside of our current data is that it does not include negative results,
i.e. people who were approached through a certain channel, but declined to start giving.
This data can be hard to come by, yet highly valuable as it makes it far easier to make
determinations on what channels are effective at recruiting new donors. Take our nothing-
to-pledge subset, for example, with this additional data a Young Nothing and Old Nothing
cluster could appear and highlight new information about the marketing channels. Or perhaps
a nothing-to-nothing vector set could be made and clustered to see if some channels have a
higher failure rate with certain non-givers compared to others.

5.3.2 Method

The K-means clustering method is by no means an infallible clustering algorithm. The

method’s biggest strengths are usually thought to be its simplicity and accessibility. This
simplicity can come at significant cost, however. The method prefers clusters which are ap-
proximately the same size, and it can only create convex cluster geometries. The method
is unable to consider correlation between input features and overlapping cluster borders.
With that said, these limitations are theoretical, in practice, for our project and dataset,
the method was not significantly held back by these weaknesses; of the methods we tested,
K-means was the only one which produced meaningful results. Still, there is one shortcoming
which, once addressed, could produce highly valuable insights on UNICEF’s data. Soft clus-
tering (aka fuzzy clustering) does not assign data points to just one cluster, unlike K-means’
hard-clustering. Instead, each data point is given a degree of membership in all clusters.
This membership can be thought of as the probability that the data point belongs to the
various clusters. Methods which offer this functionality, e.g. Fuzzy C-means clustering, are
very common within market cluster analysis, and it is easy to see why. In our case, soft
clustering could show how close a specific Few-giver is to becoming a Many giver, based on
how large their membership is in the two Many clusters, compared to the two Few-clusters.
At the macro perspective, the analysis also sees benefits: when calculating the distribution of
market channels for each cluster, soft clustering allows for a more nuanced tallying. Whereas
K-means’ hard clustering strictly limits the channels of a giver to their assigned cluster, soft
clustering methods can intelligently spread the giver’s channels between the clusters based on
their membership in each cluster. Consequently, the channels of a transitioning young, few
giver will be weighted less than the channels of an established young many giver by the YM
cluster.

40
5.3.3 Future analysis

To complement our analysis, we believe exploring the timestamps for a giver’s pledges can
offer valuable, complementary information on the giver and their cluster. Since our clustering
deliberately did not make use of time-date, this dimension of the givers is currently not present
in our analysis. When accounting for this added dimension, we see the following opportunities:

• Explore gift frequency over time for the general dataset to identify periods of high and
low activity.
• Explore gift frequency over time for specific clusters to identify periods of high and low
activity compared to the general dataset.
• Create prediction models which predict gift frequency of individual givers and clusters.
Measures can be implemented based on this, e.g. to counteract a predicted decrease in
donations for a giver or cluster.
• Explore gift frequency over time for specific channels to identify periods of high and low
activity and find measures to increase channel effectiveness for each cluster.

One final area to complement our analysis is to examine the remaining hypotheses given by
UNICEF at the start of the project, which were all prioritized on a business analytics leverage
matrix in Figure 2.1. Continuing to explore these hypotheses can give valuable complementary
information. We especially think that the hypothesis "How can we make one-time donors
regular donors" is relevant. Not only was it one of the highest evaluated hypotheses, we also
believe it complements our current hypothesis (Which marketing activities should be directed
at various donor segments?), given that one-time donors are a very important donor segment
for UNICEF and that specific marketing activities may be central in turning them into regular
givers.

5.4 Conclusion

This project has taught us a huge amount on data analytics in practice. Compared to when we
started the project, we now know much more about the importance of a good data strategy,
data pre-processing and quality, and data-driven decision making. Seeing how design thinking
can enhance business analytics was also a positive surprise. We want to thank UNICEF and
NTNU for giving us this opportunity and hope that our results and measures will prove useful
to UNICEF’s noble cause.

41
References

[1] UNICEF, “Unicef annual report, 2019.” https://www.unicef.org/media/74016/file/

UNICEF-annual-report-2019.pdf, 2020. Accessed: 2020-09-18.

[2] R. V. Giles A. Hindle, “Developing a business analytics methodology: A case study in

the foodbank sector.” European Journal of Operational Research, 2020.

[3] P. et al., “Combining process guidance and industrial feedback for successfully deploying
big data projects.,” 2017.

[4] R. F. D. . T. Y. Siang, “What is design thinking and why is it so popular?.”

https://www.interaction-design.org/literature/article/what-is-design-
thinking-and-why-is-it-so-popular, 2020.

[5] D. Dennehy, F. Adam, and F. Carton, Leveraging Design Thinking to Innovate. 04 2016.

[6] UNICEF, “Unicef brand barometer study norway 2018,” 2018.

[7] S. Shen, “7 steps to ensure and sustain data quality.” https://towardsdatascience.

com/7-steps-to-ensure-and-sustain-data-quality-3c0040591366, 2019.

[8] scikit learn, “Clustering.” https://scikit-learn.org/stable/modules/clustering.

html, 2020.

[9] scikit learn, “Scaling features to a range.” https://scikit-learn.org/stable/

modules/preprocessing.html#scaling-features-to-a-range, 2020.

[10] SSB, “Norsk mediebarometer 2019: Bruk av ulike medier, etter medietype, kjønn,
alder, statistikkvariabel og år,” 2020. retrieved 13/11/2020 from https://www.ssb.
no/statbank/table/12947.

(FT Press Analytics) Merrill Warkentin (Ed.) - The Best Thinking in Business Analytics From The Decision Sciences Institute (2015, Pearson FT Press) - Libgen - Li
No ratings yet
(FT Press Analytics) Merrill Warkentin (Ed.) - The Best Thinking in Business Analytics From The Decision Sciences Institute (2015, Pearson FT Press) - Libgen - Li
287 pages
Elements of Data Strategy
No ratings yet
Elements of Data Strategy
317 pages
Guide To Intelligent Data Science: Michael R. Berthold Christian Borgelt Frank Höppner Frank Klawonn Rosaria Silipo
100% (1)
Guide To Intelligent Data Science: Michael R. Berthold Christian Borgelt Frank Höppner Frank Klawonn Rosaria Silipo
427 pages
Data Visualization Essentials
No ratings yet
Data Visualization Essentials
90 pages
Social Appropriation in Business Sciences - A Contribution To Health, Social Problems and Business Management
No ratings yet
Social Appropriation in Business Sciences - A Contribution To Health, Social Problems and Business Management
116 pages
RBM Handbook
100% (1)
RBM Handbook
146 pages
Introduction To Data Science
100% (1)
Introduction To Data Science
200 pages
Product Analytics: Applied Data Science Techniques For Actionable Consumer Insights (Pearson Business Analytics Series) 1st Edition Rodrigues Digital
No ratings yet
Product Analytics: Applied Data Science Techniques For Actionable Consumer Insights (Pearson Business Analytics Series) 1st Edition Rodrigues Digital
74 pages
ESEUR Draft
No ratings yet
ESEUR Draft
447 pages
Modern Business Analytics: Practical Data Science For Decision-Making - Ebook PDF
100% (1)
Modern Business Analytics: Practical Data Science For Decision-Making - Ebook PDF
40 pages
Fundamentals Big DAta Read
100% (1)
Fundamentals Big DAta Read
61 pages
Eda Report
No ratings yet
Eda Report
12 pages
Data Analysis New1
No ratings yet
Data Analysis New1
36 pages
Product Analytics: Applied Data Science Techniques For Actionable Consumer Insights (Pearson Business Analytics Series) 1st Edition Rodrigues Instant Download
No ratings yet
Product Analytics: Applied Data Science Techniques For Actionable Consumer Insights (Pearson Business Analytics Series) 1st Edition Rodrigues Instant Download
139 pages
MSC Thesis Rosa Nuijten Impact Investing 2
No ratings yet
MSC Thesis Rosa Nuijten Impact Investing 2
56 pages
Template ISS499 2 - 2
No ratings yet
Template ISS499 2 - 2
67 pages
SDG Academic Paper
No ratings yet
SDG Academic Paper
86 pages
Data Mining Fraud
No ratings yet
Data Mining Fraud
32 pages
Big Data and Social Science-Dikompresi
No ratings yet
Big Data and Social Science-Dikompresi
81 pages
Guide To Intelligent Data Analysis
No ratings yet
Guide To Intelligent Data Analysis
398 pages
Advantages of Bootstrap Forest For Yield Analysis
No ratings yet
Advantages of Bootstrap Forest For Yield Analysis
32 pages
Model Thinking
100% (1)
Model Thinking
103 pages
Statistical Modeling For Management 1. Publ Edition Hutcheson Download Full Chapters
100% (1)
Statistical Modeling For Management 1. Publ Edition Hutcheson Download Full Chapters
153 pages
Tajudin Mohammed
No ratings yet
Tajudin Mohammed
78 pages
Statistical Modeling For Management 1. Publ Edition Hutcheson Newest Edition 2025
No ratings yet
Statistical Modeling For Management 1. Publ Edition Hutcheson Newest Edition 2025
90 pages
Sallis-Research Methods and Data Analysis
No ratings yet
Sallis-Research Methods and Data Analysis
263 pages
Thesis - Predicting Tax Codes Using Machine Learning
No ratings yet
Thesis - Predicting Tax Codes Using Machine Learning
29 pages
Predictive Modeling
No ratings yet
Predictive Modeling
27 pages
Data Science Project Lifecycle
No ratings yet
Data Science Project Lifecycle
55 pages
Measuring The Success of Learning Through Technology - Ebook
No ratings yet
Measuring The Success of Learning Through Technology - Ebook
235 pages
Bon Data Analytics Apps
No ratings yet
Bon Data Analytics Apps
173 pages
IBA Koole First Chapters
No ratings yet
IBA Koole First Chapters
78 pages
Rapport Bi
No ratings yet
Rapport Bi
94 pages
BI Module 2
No ratings yet
BI Module 2
11 pages
Population - Data - ICPD30 - ThinkPiece - 050724 - FINAL - WEB
No ratings yet
Population - Data - ICPD30 - ThinkPiece - 050724 - FINAL - WEB
40 pages
Phase 5 Aut251106 (6) - 1 1
No ratings yet
Phase 5 Aut251106 (6) - 1 1
25 pages
2.thesis Dashboard
No ratings yet
2.thesis Dashboard
90 pages
ADANCO 2.0.1 User Manual
No ratings yet
ADANCO 2.0.1 User Manual
61 pages
Lecture 9 - Decision Making With Data Science
No ratings yet
Lecture 9 - Decision Making With Data Science
19 pages
Wnew Project
No ratings yet
Wnew Project
61 pages
Full Text 01
No ratings yet
Full Text 01
56 pages
Chapter 12 2 - Unlocked 1
No ratings yet
Chapter 12 2 - Unlocked 1
1 page
TG 2-Pedro
No ratings yet
TG 2-Pedro
41 pages
DM Fraud
No ratings yet
DM Fraud
32 pages
Thesis - MC Change Point Detection
No ratings yet
Thesis - MC Change Point Detection
65 pages
Data Science Team 7 Report 1
No ratings yet
Data Science Team 7 Report 1
29 pages
Kenobi Password Manager
No ratings yet
Kenobi Password Manager
41 pages
Baytar C. The Future of Data Mining 2022
No ratings yet
Baytar C. The Future of Data Mining 2022
156 pages
Sas Semma
100% (1)
Sas Semma
39 pages
Thesis 2
No ratings yet
Thesis 2
74 pages
DataScience Project Report
No ratings yet
DataScience Project Report
21 pages
Analysis and Presentation For Bank Marketing Data: Vinay Kumar MS by Research Scholar IIT Kharagpur +91-8348575432
No ratings yet
Analysis and Presentation For Bank Marketing Data: Vinay Kumar MS by Research Scholar IIT Kharagpur +91-8348575432
20 pages
Business Undestanding and Data Collection
No ratings yet
Business Undestanding and Data Collection
27 pages
Rapport PFE Balghouthi Hazemespdsi20201678154298523
No ratings yet
Rapport PFE Balghouthi Hazemespdsi20201678154298523
66 pages
Lecture 2 - The Dataset Presentation
No ratings yet
Lecture 2 - The Dataset Presentation
35 pages
6061 Smartergovermentpaper
No ratings yet
6061 Smartergovermentpaper
184 pages
Chicago Crime Reduction via Data Science
No ratings yet
Chicago Crime Reduction via Data Science
29 pages
Data-Driven Mobile App Development
No ratings yet
Data-Driven Mobile App Development
52 pages
Book DescriptionPublication Date
No ratings yet
Book DescriptionPublication Date
8 pages
Lecture 4 - No-Code and Low-Code Tools
No ratings yet
Lecture 4 - No-Code and Low-Code Tools
29 pages
FTC Report: Big Data: A Tool For Inclusion or Exclusion? Understanding The Issues
No ratings yet
FTC Report: Big Data: A Tool For Inclusion or Exclusion? Understanding The Issues
50 pages