
Impact Evaluation
Impact evaluation is a process used to determine how effective a program is in producing the
desired changes or outcomes. It answers one key question: What is the impact (or causal
effect) of a program on an outcome of interest? The main goal is to understand the direct results
of a program, including whether it actually caused the changes it aimed for.

Comparing Impact Evaluation to Other Methods:

1.​ Monitoring: This is an ongoing process that tracks a program’s progress, activities, and
performance over time. It helps program managers understand what’s happening in the
program but doesn’t focus on measuring long-term changes caused by the program.
○​ Example: If you are running a school lunch program, monitoring might track how
many meals are served daily and whether the program is staying within budget.
2.​ Evaluation: This is a periodic and detailed assessment of a program. Evaluations are
usually done at specific points in time and can be used to answer different types of
questions:
○​ Descriptive questions: These ask what is happening (e.g., What activities are
being carried out?).
○​ Normative questions: These compare what is happening to what should be
happening (e.g., Are the targets of the program being met?).
○​ Cause-and-effect questions: These are concerned with determining whether the
program caused a specific outcome (e.g., Did the school lunch program improve
students’ health?).

What Makes Impact Evaluation Special?

Impact evaluations focus on answering cause-and-effect questions. They aim to figure out what
specific outcomes or changes were caused directly by the program, rather than by other factors.
For example:

●​ Example: If a program offers scholarships to students, an impact evaluation would examine if those scholarships actually increased school attendance and academic achievement.
●​ Another Example: If a government builds new roads, an impact evaluation would try to
measure if the new roads helped increase people’s income by improving access to
markets.

Understanding Causality in Impact Evaluation:

The key feature of impact evaluation is its focus on causality. This means determining whether a
program was directly responsible for a change, or if other factors played a role.
To determine this, impact evaluations need to compare the program group (those who
participated in the program) with a counterfactual group (a group who did not participate, but
who would have been in the program if it were available to them). This comparison helps
determine what would have happened to the participants if they hadn’t been part of the
program.

For example:

●​ Example: Let’s say there is a program where children are given vitamin supplements to
improve their health. To measure the impact, we’d compare the health outcomes of
children who received the supplements (program participants) with those who didn’t
(non-participants, the counterfactual group). By comparing these two groups, we can
estimate how much of the health improvement is due to the vitamin supplements.

Choosing the Right Evaluation Method:

The way you design the evaluation depends on how the program is set up. Important factors
include:

●​ Program Resources: Do you have enough resources to serve everyone eligible for the
program?
●​ Program Type: Is the program targeted to a specific group or available to everyone?
●​ Timing: Is the program implemented all at once or gradually over time?

Based on these factors, you can decide which evaluation method is the best fit. For instance:

●​ Randomized Assignment: Participants are randomly assigned to either the treatment (program) or control group (no program).
●​ Matching: You match program participants with similar non-participants to compare their outcomes. (Both approaches are sketched in the example below.)
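The two designs can be illustrated with a minimal, hypothetical Python sketch; the data, variable names, and the simple nearest-neighbor matching rule are invented for illustration and are not part of the source text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical roster of 1,000 eligible units with one baseline characteristic.
units = pd.DataFrame({
    "unit_id": np.arange(1000),
    "baseline_score": rng.normal(50, 10, size=1000),
})

# Option 1 - Randomized assignment: a coin flip decides who receives the
# program, so treatment and control groups are similar on average by design.
units["assigned_treatment"] = rng.integers(0, 2, size=len(units))

# Option 2 - Matching: if instead the first 200 units enrolled themselves,
# pair each participant with the non-participant whose baseline score is
# closest, to build a comparable comparison group.
participants = units.iloc[:200]
non_participants = units.iloc[200:]
matched_comparison_ids = [
    non_participants.loc[
        (non_participants["baseline_score"] - p.baseline_score).abs().idxmin(),
        "unit_id",
    ]
    for p in participants.itertuples()
]
```

In practice, matching is usually done on several baseline characteristics at once (for example, via a propensity score), but the basic idea is the same.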

In Summary:

Impact evaluation is all about determining whether a program has had a meaningful effect on its
intended outcomes. By using careful comparisons, we can figure out if the changes were truly
due to the program itself or other factors. The evaluation method used will depend on how the
program operates and the resources available.

Prospective vs. Retrospective Impact Evaluation:
Impact evaluations can be divided into two main types: prospective and retrospective. The
main difference between them lies in when they are planned and how they are carried out.

1. Prospective Impact Evaluation (Evaluating Before the Program Starts)

A prospective impact evaluation is designed before a program is implemented. This means that the evaluation is planned alongside the program’s design, and baseline data is collected before the program begins. The key steps for prospective evaluations are:

●​ Baseline Data: Data on the groups (treatment and comparison) is collected before the
program starts. This helps in measuring the outcomes after the program and
understanding what changes occurred.
●​ Program Design and Evaluation Design are Aligned: Since the evaluation is planned
early, it aligns with the program’s goals, ensuring that the right questions are asked and
the program’s success can be measured effectively.
●​ Clear Goals and Success Measures: The program’s success is defined during the
planning stage. This focuses on what the program intends to achieve, and helps in
measuring if those goals were met.
●​ Better Causal Inference: Prospective evaluations have the best chance of generating
valid estimates of the program's impact because the treatment and comparison groups
are identified before the program starts, and there are more ways to establish
comparisons.

Example: Imagine a new scholarship program is being planned. The prospective evaluation
would start by collecting baseline data on students’ academic performance and school
attendance before the scholarships are offered. This way, when the scholarships are provided,
the evaluation can compare the academic changes in students who received the scholarships
(treatment group) and those who didn’t (comparison group) to measure the true impact of the
program.

2. Retrospective Impact Evaluation (Evaluating After the Program is Already Running)

A retrospective impact evaluation is done after a program has already been implemented. In
this type of evaluation, the program is assessed based on what happened after it has been put
in place. The key challenges and characteristics of retrospective evaluations are:

●​ Lack of Baseline Data: Since the evaluation is happening after the program is
implemented, there’s no baseline data to compare what was happening before the
program started. This makes it harder to measure how the program has truly impacted
the outcomes.
●​ Limited Options for Creating a Comparison Group: In retrospective evaluations, the
treatment group (those who received the program) has already been chosen. It is difficult
to find a comparison group that is similar to the treatment group without bias, which
makes it harder to draw clear conclusions about causality.
●​ Reliance on Existing Data: Since the program is already running, retrospective
evaluations rely on the data that’s available. This can be problematic if the data isn’t
complete or well-organized.
●​ Quasi-Experimental Methods: Because the evaluation was not planned in advance, retrospective evaluations often have to rely on quasi-experimental methods that depend on stronger assumptions, which can weaken the evidence or make it debatable.

Example: Let’s say a health program has been running for a few years without any planned
evaluation. A retrospective evaluation would now look at the health outcomes of people who
participated in the program and compare them with those who didn’t. However, because there
was no baseline data, it would be harder to prove that any improvements in health were caused
by the program, and not by other factors.

Why is Prospective Evaluation Better?

1.​ Baseline Data: With prospective evaluations, you collect baseline data before the
program starts, which helps establish clear benchmarks for measuring change.
2.​ Clear Program Goals: Designing the evaluation alongside the program helps make sure
the program’s goals and success measures are clearly defined.
3.​ Better Comparison Groups: By identifying the treatment and comparison groups ahead
of time, prospective evaluations provide more reliable and valid comparisons, leading to
stronger conclusions about causality.

Challenges with Retrospective Evaluations:

●​ Lack of Baseline Data: It's harder to know what the situation was like before the
program started, making it tough to assess real impact.
●​ Data Limitations: Since the data is already available, there might be gaps or
inconsistencies that can weaken the evaluation's findings.
●​ Quasi-Experimental Methods: Often, retrospective evaluations have to rely on
methods that aren’t as precise as those used in prospective evaluations, making the
results less certain.

In Summary:

●​ Prospective evaluations are planned before the program starts, collect baseline data,
and align the evaluation with program goals, making them more reliable and capable of
providing valid conclusions.
●​ Retrospective evaluations assess programs after they are implemented, often with
limited baseline data, and tend to rely on assumptions or quasi-experimental methods,
making the results more uncertain.
Efficacy Studies vs. Effectiveness Studies:
In impact evaluation, we often see two key types of studies: efficacy studies and
effectiveness studies. Both aim to evaluate the impact of a program, but they do so in different
ways and under different conditions.

1. Efficacy Studies (Testing Under Ideal Conditions)

Efficacy studies focus on testing whether a program can work under controlled, ideal
conditions. These studies are often done in pilot programs or small-scale trials, where
researchers have full control over the implementation and closely monitor the process. The
main goal is to test if the program works theoretically and whether it can achieve its desired
outcomes when everything goes as planned.

Key Characteristics:

●​ Controlled Environment: The program is implemented in a carefully controlled environment, often with high involvement from researchers to ensure everything goes as planned.
●​ Pilot or Small-Scale: Efficacy studies are often conducted as small-scale pilots, which
means they may not represent larger, real-world conditions.
●​ Proof of Concept: The study tests if the program is viable and whether it can work
under ideal conditions.
●​ Limited Generalizability: The results of efficacy studies are often specific to the pilot or
controlled environment and may not apply to larger or more diverse groups.

Example:

Imagine a new medical treatment for a disease is tested in a specialized hospital with expert
staff. The treatment shows promising results in this ideal setting. However, if the same treatment
were rolled out to an average hospital with fewer resources and less experienced staff, the
results might not be the same. The efficacy study tells us that the treatment works under ideal
conditions, but we can’t be sure it will work in a broader context.

2. Effectiveness Studies (Testing Under Normal Conditions)

Effectiveness studies, on the other hand, aim to assess whether a program works in real-world
conditions. These studies evaluate the program’s performance when it is implemented in a
more typical or regular setting, without the tight control of a research environment. The goal is to
understand how well the program works in practice and whether its effects can be generalized
to a larger population.

Key Characteristics:
●​ Real-World Conditions: The program is implemented as it would be in the general
population, using regular channels and processes.
●​ Generalizability: Effectiveness studies aim to produce results that are generalizable to a
larger group or population, making them more relevant for policy makers and
decision-makers.
●​ Focus on Scale: These studies are concerned with how a program performs when
expanded beyond small pilots or controlled settings.
●​ External Validity: The results from effectiveness studies are expected to apply to a
broader population beyond the specific group studied.

Example:

Consider a government program providing financial support to poor families. An effectiveness study would look at how the program works when it is implemented on a larger scale, in
real-world conditions, in different regions. The aim is to see if the program can deliver the same
positive outcomes on a larger scale, such as improving the financial situation and well-being of
a broader population.

Why the Difference Matters:

●​ Efficacy Studies are crucial for testing new ideas or innovative programs in
controlled settings. They provide evidence that a program has potential, but they often
don’t tell us if the program will work when implemented widely.
●​ Effectiveness Studies are essential for understanding whether a program works in
the real world, in normal settings. The results from these studies are more applicable
for policy makers who want to know if a program can be expanded or replicated on a
larger scale.

Testing for Generalizability:

●​ Sometimes, researchers conduct multisite evaluations to test the generalizability of a program. For example, the “graduation” approach to alleviating extreme poverty was
tested in multiple countries (Ethiopia, Ghana, Honduras, India, Pakistan, and Peru) to
see if the results from Bangladesh could be applied to different settings. This kind of
study helps to understand if a program's success in one country can be replicated in
others.

Example of a Multisite Evaluation:

A study by Banerjee and colleagues, published in 2015, tested the Graduation Approach
(which provides cash, assets, training, and support to the poorest families) in six countries. The
results showed significant improvements in various outcomes, such as income, food security,
and mental health, though the impacts varied by country. This kind of evaluation helps
determine if a program can be effective across different contexts, making the results more
generalizable.
In Summary:

●​ Efficacy Studies: Test whether a program can work under ideal, controlled
conditions. They are good for testing new programs or ideas, but their results may not
apply to larger or real-world settings.
●​ Effectiveness Studies: Assess whether a program works in real-world conditions,
aiming to produce results that can be generalized to a larger population and used to
inform policy decisions.

Complementary Approaches to Impact Evaluation


Impact evaluations focus on understanding cause-and-effect relationships, but they can be
enhanced by using other approaches, such as monitoring, simulations, mixed methods, and
process evaluations. These help improve the evaluation by filling in gaps and providing more
context.

●​ Monitoring: Tracks program activities to ensure it’s being implemented as planned. It checks if the right people are receiving the program and helps make sure the evaluation is accurate.

Example: If a job training program is being implemented, monitoring helps track how
many people are attending the training and if it’s reaching the intended audience.​

●​ Ex Ante Simulations: These are predictions made before a program starts, based on available data. They simulate the expected effects of a program, helping to assess its potential impact and make better design choices (a small simulation sketch appears after this list).

Example: Before launching a new health campaign, simulations might predict how
effective different strategies (like advertising or free check-ups) will be in improving
health outcomes.​

●​ Process Evaluations: Help understand how and why a program works (or doesn’t work)
by focusing on its implementation and context. This can help policymakers understand
why certain outcomes were achieved.​

Example: A program providing nutrition education might use process evaluation to
explore whether the materials were well received or if there were barriers preventing
people from attending.​
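To make the ex ante simulation idea concrete, here is a rough, hypothetical Python sketch; the baseline check-up rate and the assumed effects of each strategy are invented numbers, not figures from the source.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_people = 10_000
baseline_checkup_rate = 0.30   # assumed current share of people getting check-ups

# Assumed effects (in percentage points) of two candidate campaign strategies.
assumed_effects = {
    "advertising": 0.05,
    "free_checkups": 0.12,
}

for strategy, lift in assumed_effects.items():
    # Simulate whether each person gets a check-up under this strategy.
    gets_checkup = rng.random(n_people) < (baseline_checkup_rate + lift)
    print(f"{strategy}: simulated check-up rate = {gets_checkup.mean():.1%}")
```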
Why These Approaches Matter:

●​ Impact evaluations alone can miss key details about how a program is working or why
certain results happened. Using additional methods helps fill in these gaps and improves
the overall evaluation.


Mixed Methods in Impact Evaluation


Mixed methods combine both quantitative (numerical) and qualitative (descriptive) data to
provide a fuller understanding of a program’s impact. These approaches help generate ideas,
focus research questions, and offer insights during and after the program.

●​ Qualitative Methods: These involve gathering in-depth, non-numerical data through interviews, focus groups, and observations. Although not statistically representative, qualitative data helps explain why certain results occurred.

Example: If a job training program shows increased employment rates, interviews with
participants can reveal what specifically helped them succeed, such as personal
mentorship or practical workshops.​

Types of Mixed Methods Approaches (Creswell, 2014):

1.​ Convergent Parallel: Both types of data (quantitative and qualitative) are collected at
the same time to cross-check and provide early insights into the program’s effectiveness.​

Example: A health program collects survey data (quantitative) and interviews with
participants (qualitative) to understand the overall impact.​

2.​ Explanatory Sequential: The qualitative data explains the results found in the
quantitative data. It helps understand why some outcomes are better or worse than
expected.​

Example: After finding that a school program improved student performance, interviews
with teachers and students explain which aspects of the program were most beneficial.​

3.​ Exploratory Sequential: Qualitative methods are used first to generate ideas and
develop hypotheses. Then, quantitative data is collected to test those hypotheses.​

Example: Focus groups with community members help identify needs, followed by
surveys to measure the extent of those needs across a larger population.​


Process Evaluations

Process evaluations focus on how a program is implemented and whether it follows its
original design. They help assess the program's operations, identify areas for
improvement, and provide valuable insights during the early stages of a program or pilot.
These evaluations are often cost-effective and quick to carry out.

●​ Purpose: They test if the program is operating as planned and if it aligns with its
intended goals. This helps identify operational problems early, allowing for
adjustments before the program is fully implemented.​

Example: In Tanzania, the government piloted a community-based cash transfer
program. A process evaluation helped identify issues, like delays in payments
and beneficiary selection problems, which were then addressed to improve the
program.​

Key Components of Process Evaluation:

●​ Program Objectives: Understanding the program’s goals and context.


●​ Design and Implementation: Describing how the program is set up and
operates.
●​ Operations and Changes: Tracking any changes made during implementation.
●​ Basic Data: Collecting data on operations like financials and coverage.
●​ Challenges: Identifying events that may have impacted the program.

Why Process Evaluation Matters:

Before applying an impact evaluation, it’s important to ensure the program is functioning
as intended. If the operational processes haven’t been validated, resources may be
wasted, or the program may change during the evaluation, affecting the results.
Cost-Benefit and Cost-Effectiveness Analysis

Cost-benefit analysis (CBA) and cost-effectiveness analysis (CEA) help assess a program’s
financial value and its efficiency in achieving specific outcomes.

1.​ Cost-Benefit Analysis (CBA): Compares the total benefits of a program to its total
costs. It tries to measure everything in monetary terms to decide if the benefits outweigh
the costs.​

Example: If a health program costs $1 million and delivers $1.5 million in benefits (like
fewer hospital visits), the cost-benefit ratio helps determine if the program is worth the
investment.​

2.​ Cost-Effectiveness Analysis (CEA): Compares the cost of two or more programs that
aim to achieve the same outcome, helping to identify which program is the most efficient
in achieving the goal.​

Example: Comparing two educational programs to see which one improves student test
scores for the lowest cost.​

Why Combine Cost with Impact Evaluation?

After assessing the impact of a program, adding cost information helps answer two questions:

●​ Cost-Benefit: What benefit does a program deliver for its cost?


●​ Cost-Effectiveness: How do various program alternatives compare in terms of cost for
achieving a common outcome?

Once impact and cost data are available, cost-effectiveness analysis helps policymakers
identify the most efficient investments.
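As a toy illustration, the hypothetical figures from the examples above (plus some invented per-student costs) can be combined into the two calculations like this:

```python
# Cost-benefit: a health program costing $1 million that delivers
# $1.5 million in monetized benefits.
cost = 1_000_000
benefits = 1_500_000
print("benefit-cost ratio:", benefits / cost)   # 1.5 -> benefits exceed costs
print("net benefit:", benefits - cost)          # 500,000

# Cost-effectiveness: two hypothetical education programs that raise test
# scores by different amounts at different costs per student.
programs = {
    "program_a": {"cost_per_student": 50, "score_gain": 4.0},
    "program_b": {"cost_per_student": 30, "score_gain": 3.0},
}
for name, p in programs.items():
    cost_per_point = p["cost_per_student"] / p["score_gain"]
    print(f"{name}: ${cost_per_point:.2f} per test-score point gained")
```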

Example: Comparing Education Programs

In a study comparing different education interventions, researchers analyzed the cost-effectiveness of improving student learning outcomes. They found that programs focusing
on teacher accountability and pedagogical reforms were the most cost-effective, while simply
adding more teachers had limited impact on test scores.
Ethical Considerations in Impact Evaluation

Ethical Issues in Designing Impact Evaluations​


When designing an impact evaluation, it's important to consider several ethical issues:

1.​ Ethics of Investment: Spending public resources on programs without knowing their
effectiveness might be seen as unethical. Impact evaluations help determine a
program's effectiveness, making public investments more ethical.​

2.​ Assigning Program Benefits: Evaluations should not influence how benefits are
assigned. However, evaluations can help ensure the program's rules for eligibility are
fair, transparent, and equitable.​

3.​ Randomized Assignment: Some programs use random selection to decide who gets
benefits. This can raise ethical concerns about denying benefits to some people. But
since programs often can't serve everyone at once, random assignment ensures
fairness by giving equally eligible participants a fair chance to receive the program.​

4.​ Research Ethics: Evaluations involve studying human subjects, so ethical guidelines for
research on people must be followed to protect their rights and well-being. Review
boards or ethics committees usually monitor this.​

5.​ Transparency in Research: Impact evaluations should be clear, objective, and reproducible. To ensure transparency:

○​ Research plans should be made public in advance.


○​ Data and methods should be shared after the study, allowing others to verify or
replicate the work while protecting participants' privacy.

These are the key ethical concerns in conducting impact evaluations, focusing on fairness, transparency, and the protection of human subjects.
Impact Evaluation for Policy Decisions

Role in Policy Making​


Impact evaluations help policy makers make informed decisions about programs. They are
especially useful for:

●​ Curtailing inefficient programs


●​ Scaling up successful interventions
●​ Adjusting program benefits
●​ Choosing between program alternatives

Evaluations are particularly helpful for testing new and unproven approaches, as seen with the
Mexican conditional cash transfer program. The evaluation results were key in scaling the
program nationally and internationally.

Types of Policy Questions

1.​ Effectiveness of a Program​


The basic impact evaluation tests whether a program works by comparing groups that
received the program (treatment group) with those that didn’t (comparison group). The
main challenge is ensuring these groups are similar, which is vital for a valid evaluation.​

2.​ Testing Design Innovations​


Evaluations can also test new ideas or improvements within an existing program without
a comparison group. For example, testing if a change in program design can boost
effectiveness or reduce costs.​

3.​ Testing Program Alternatives​


Evaluations can compare different ways of delivering a program to find the most
effective or cost-effective approach. For example, testing which outreach method
(mailing, house-to-house visits, or text messages) works best.​

4.​ Subgroup Comparisons​


Sometimes evaluations examine whether a program works better for certain groups, like
comparing its impact on male vs. female students. This requires large enough sample
sizes for each subgroup.​

Policy Applications of Evaluations​


Impact evaluations can influence decisions on:

●​ Continuing, reforming, or ending programs
●​ Scaling up successful pilot programs

They can also bring insights from one country to another and explore deeper questions, such as how behavior influences outcomes.

Generalizing Results​
One key challenge is generalizability: can results from one evaluation be applied to other
settings? By comparing multiple evaluations across different contexts, we can identify patterns
and build more reliable conclusions. This approach, called the cluster approach, groups
evaluations based on common research questions, helping policymakers make better decisions.

Examples of Evaluation Clusters

●​ World Bank's Strategic Impact Evaluation Fund (SIEF) and other initiatives use this
cluster approach to fill knowledge gaps. For instance, research on early childhood
development has shown certain programs work, but more research is needed on how to
scale them cost-effectively.



Deciding Whether to Carry Out an Impact Evaluation

Not all programs need an impact evaluation. It’s important to use them selectively when the
question requires a detailed examination of causality. Impact evaluations can be expensive,
especially if you need to collect your own data, so it's important to be strategic with your budget.

Key Questions to Ask Before Starting an Impact Evaluation:

1.​ What is at stake?​


Will the results influence major decisions like budget allocation or program expansion? If
the impact is small or affects only a few people, it might not be worth the cost. For
example, a small program in a clinic might not justify an impact evaluation, but a national
teacher pay reform that could affect all primary teachers would have much higher stakes.​

2.​ Is there any evidence showing that the program works?​


If you already have evidence or know the program's potential impact (from similar
programs in similar settings), an impact evaluation may not be necessary unless you're
testing new innovations. However, if no evidence exists, it might be worth doing a pilot
program with an impact evaluation to gather data.​

3.​ What characteristics make the program worth evaluating?​


The program should meet several criteria:​

○​ Innovative: It’s testing a new, promising approach.


○​ Replicable: It can be scaled or applied in other settings.
○​ Strategically Relevant: The evaluation will inform key decisions, like program
expansion or budgeting.
○​ Untested: There’s little known about the program's effectiveness or design.
○​ Influential: The results will influence policy decisions.
4.​ Do we have the necessary resources?​
Conducting a high-quality impact evaluation requires proper technical resources (data,
time, and finances) and institutional support. It also requires commitment from both the
evaluation team and the policymakers involved. The team must work together to design
a robust evaluation that provides meaningful results.​

Conclusion:​
If the questions above suggest that an impact evaluation is worthwhile and the necessary
resources are available, then you’re on the right track. This book is designed to help you and
your evaluation team navigate the process successfully.

Preparation of an Evaluation
Initial Steps in Setting Up an Evaluation
This chapter outlines the initial steps to take when preparing for an evaluation. These steps
include:

1.​ Constructing a Theory of Change:​


The theory of change explains how a program or project is expected to achieve its
intended outcomes. It outlines the pathway from program activities to desired results.​

2.​ Developing a Results Chain:​


A results chain is a tool that helps visualize the theory of change. It breaks down the
program's logic, showing the sequence of inputs, activities, outputs, outcomes, and
impacts. This helps clarify how the program will achieve its goals.​

3.​ Specifying the Evaluation Question(s):​


Clear and precise evaluation questions guide the entire evaluation process. These
questions should be directly linked to the program’s goals and help focus the evaluation
on what matters most.​

4.​ Selecting Indicators to Assess Performance:​


Identifying the right indicators is crucial for assessing the program’s success. These
indicators should align with the program’s objectives and help measure both outputs
(what the program delivers) and outcomes (the actual changes or impacts).​

When Should These Steps Be Taken?


These initial steps are most effective when applied at the beginning of a program or reform,
ideally during the design phase. This is when the program's goals, strategies, and outcomes are
being defined.

By engaging a wide range of stakeholders (such as policymakers, program implementers, and evaluators), these steps help establish a shared vision of the program’s goals and how they will be achieved. This engagement builds consensus around the evaluation’s focus, the key questions it aims to answer, and the program’s implementation.
Constructing a Theory of Change
A Theory of Change explains how and why a program is expected to achieve its goals. It
shows the steps or sequence of events that lead to the desired outcomes, highlighting the key
activities, assumptions, and causal links between them. This theory helps clarify how a program
will produce its results, which is especially important for impact evaluations that focus on cause
and effect.

When to Develop a Theory of Change?

The best time to develop a theory of change is at the beginning of a program when all
stakeholders (such as program designers, policymakers, and implementers) can come together
to agree on the program’s objectives and the best way to achieve them. This ensures that
everyone has a shared understanding of how the program will work and what it aims to achieve.

Key Components of a Theory of Change

1.​ Causal Logic:​


It outlines the logical sequence of how activities will lead to the intended outcomes. For
example, how a particular intervention will bring about changes in behavior, health, or
education.​

2.​ Assumptions and Conditions:​


It explores the conditions and assumptions needed for the program to succeed. For
example, does the program rely on certain conditions, like community cooperation or
availability of resources?​

3.​ Inputs and Outputs:​


The theory of change clearly defines what will be provided (inputs) and what the
program will deliver (outputs). It then shows how these will lead to the expected
outcomes.​

Example: Piso Firme Project in Mexico

In Mexico, the Piso Firme program aimed to improve the living conditions of poor families by
replacing dirt floors with cement ones. The theory of change for this program was as follows:

1.​ Input: The government provides cement and materials, and the community helps install
the floors.​

2.​ Output: Households receive a cement floor.​


3.​ Outcome: The program hoped that cement floors would reduce the transmission of
parasites, which thrive on dirt floors, and thus decrease health issues like diarrhea and
malnutrition.​

4.​ End Goal: Health improvements, better nutrition, and even better cognitive development
in children.​

This was based on the assumption that dirt floors were a major source of parasite
transmission, which causes illnesses in children. By replacing dirt with cement floors, the
program aimed to interrupt this cycle, leading to improved health and overall well-being.

Why is a Theory of Change Important?

A theory of change helps clarify how the program works and what it aims to achieve. It also
identifies research questions to explore, like the impact on health outcomes in this case. For
example, in the Piso Firme project, the evaluation asked whether cement floors really reduced
diarrhea and malnutrition, improving overall health and happiness.

By having a clear theory of change, stakeholders can better understand the program's logic,
make adjustments, and ensure that the right outcomes are measured.

Developing a Results Chain


A Results Chain is a tool used to visualize and outline the steps of a program's Theory of
Change. It helps to show how the resources and activities in a program lead to the final
outcomes. This method is simple and effective, and it provides a clear picture of how a program
is supposed to work, from beginning to end.

Key Elements of a Results Chain

1.​ Inputs:​
These are the resources available to the program, such as the budget, staff, and
materials.​

2.​ Activities:​
These are the actions or work that are carried out to transform the inputs into tangible
outputs. For example, providing training or distributing materials.​

3.​ Outputs:​
These are the direct results of the program’s activities — the goods and services
produced. For example, if a program provides education, the output might be the number
of people trained.​

4.​ Outcomes:​
These are the short-term to medium-term results that occur once beneficiaries use
the program’s outputs. These are typically not directly controlled by the program but
depend on how the beneficiaries react to the program (e.g., improved knowledge or
skills).​

5.​ Final Outcomes:​


These are the long-term goals of the program. They represent the ultimate impact
and whether the program's objectives were achieved, such as reducing poverty or
improving health. These outcomes are influenced by various factors, including the
beneficiaries' behavior and external conditions.​

How Does the Results Chain Work?

The results chain shows how a program works in two stages:

●​ Implementation (Supply Side): This includes inputs, activities, and outputs, which are
within the control of the program.
●​ Results (Demand Side + Supply Side): These include the outcomes and final outcomes,
which are influenced by both the program’s implementation and the behavior of the
beneficiaries.

Example: Results Chain of a Health Program

Let’s imagine a program that provides vaccinations to children in a community:

1.​ Inputs:
○​ Budget for vaccines, healthcare staff, and transportation.
2.​ Activities:
○​ Conduct vaccination campaigns, educate parents about the importance of vaccines, and administer vaccines.
3.​ Outputs:
○​ Number of children vaccinated, number of educational sessions held.
4.​ Outcomes:
○​ Increased immunity in children, fewer cases of preventable diseases in the community.
5.​ Final Outcomes:
○​ Reduced mortality and morbidity in children from vaccine-preventable diseases, improved overall health in the community.
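The same example can be written out as a simple data structure, pairing each stage of the results chain with one possible monitoring indicator; the indicator names below are invented for illustration.

```python
# The vaccination results chain as an ordered list of
# (stage, description, illustrative indicator) tuples.
results_chain = [
    ("inputs", "Budget for vaccines, healthcare staff, and transportation",
     "budget_disbursed_usd"),
    ("activities", "Vaccination campaigns and parent education sessions",
     "campaigns_conducted"),
    ("outputs", "Children vaccinated, educational sessions held",
     "children_vaccinated"),
    ("outcomes", "Increased immunity, fewer preventable diseases",
     "disease_incidence_rate"),
    ("final outcomes", "Reduced child mortality and morbidity",
     "under5_mortality_rate"),
]

for stage, description, indicator in results_chain:
    print(f"{stage:>14}: {description}  [tracked by: {indicator}]")
```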

Why is a Results Chain Useful?

1.​ Clarifies the Causal Logic:​


It makes clear how activities lead to outputs, and how these outputs contribute to
outcomes and long-term impacts.​

2.​ Identifies Risks and Assumptions:​


It helps to uncover hidden assumptions or risks that could prevent the program from
achieving its goals.​

3.​ Improves Program Design:​


A good results chain can help policymakers and program managers see where the
program might be weak or unclear and help them improve the design.​

4.​ Facilitates Monitoring and Evaluation:​


By clearly outlining what needs to be tracked at each stage (inputs, activities, outputs,
outcomes), the results chain helps in selecting the right indicators for monitoring and
evaluating the program.​

In short, a results chain helps to organize the steps needed to achieve a program’s goals,
shows the links between actions and results, and identifies the resources, activities, and
expected outcomes involved. It is a powerful tool for improving and evaluating program
effectiveness.
Specifying Evaluation Questions
An evaluation question is the central focus of any effective evaluation. It helps narrow down
the research and ensures the evaluation directly addresses key policy interests. In the case of
impact evaluation, the question typically asks, “What is the impact (or causal effect) of a
program on a specific outcome?” The goal is to identify the changes caused by the program,
program modality, or design innovation.

Steps to Formulate Evaluation Questions

1.​ Start with a Clear Policy Interest:​


The evaluation question should be directly linked to the policy issue you want to
explore. This ensures the evaluation is focused on answering the most relevant question
for decision-makers.​

2.​ Testable Hypothesis:​


The evaluation question needs to be a testable hypothesis. This means that it should
be framed in a way that allows you to measure the difference between the results in
the treatment group (those who participated in the program) and the comparison
group (those who did not).​

3.​ Using the Theory of Change:​


The Theory of Change and Results Chain will help shape the evaluation question.
They guide you in understanding the program’s intended outcomes and help create a
hypothesis based on those outcomes.​

4.​ Different Types of Evaluation Questions:​


The evaluation question can focus on various aspects of the program, such as:​

○​ Effectiveness: Does the program achieve its intended results (e.g., improved
health, education, etc.)?
○​ Cost-effectiveness: Is one program model more cost-efficient than another?
○​ Behavioral Change: Does a program lead to changes in behavior, such as
increased enrollment or better health practices?

Examples of Evaluation Questions

●​ Health Insurance Subsidy Program (HISP):​


The evaluation question might be, “What is the effect of HISP on poor households'
out-of-pocket health expenditures?” This question focuses on whether the subsidy
program reduces the financial burden of healthcare on low-income families.​

●​ High School Mathematics Reform:​


In this case, the evaluation question is, “What is the effect of the new mathematics
curriculum on students' test scores?” This is based on the belief that improving the
curriculum will lead to better student performance.​

Box 2.2: Mechanism Experiments

Sometimes, it’s not necessary to test the full program right away. Instead, you can test a
mechanism — a part of the program's causal pathway — to understand if the underlying
assumptions are correct.

For example, imagine you're concerned about obesity in poor neighborhoods. One potential
cause is lack of access to nutritious food. Instead of launching a full program to provide food
subsidies, you could first test the mechanism by offering free baskets of fruits and vegetables
to see if this actually increases consumption.

●​ Mechanism Question: Does giving free vegetables lead to healthier eating habits in
residents of poor neighborhoods?

Box 2.3: High School Mathematics Reform Example

Let’s break down how the evaluation question was formulated for the High School
Mathematics Reform program.

●​ Theory of Change:​

○​ Inputs: Budget, teachers, training facilities.
○​ Activities: Designing the new curriculum, training teachers.
○​ Outputs: Number of teachers trained, textbooks delivered.
○​ Outcomes: Improved teacher usage of new methods and better student performance.
○​ Final Outcomes: Increased high school graduation rates, better job opportunities for students.
●​ Evaluation Question:​
The evaluation question was framed as, “What is the effect of the new mathematics
curriculum on test scores?”​

Why Is Formulating a Clear Evaluation Question Important?

1.​ Focus:​
The question narrows the focus of the evaluation, ensuring it targets the key aspects of
the program that need to be assessed.​

2.​ Testable Hypothesis:​


A well-defined evaluation question helps create a testable hypothesis, making it
possible to measure the program’s effects using data from treatment and comparison
groups.​

3.​ Inform Policy:​


The evaluation question helps policymakers understand whether the program is
effective and how it impacts key outcomes, such as health, education, or economic
status.​

In summary, specifying the evaluation question is critical for guiding an evaluation. It ensures
that the research is focused on the most important aspects of the program and helps generate
clear, actionable findings.

Selecting Outcome and Performance Indicators


When preparing for an impact evaluation, it is essential to define the outcome indicators that
will be used to assess the program’s effectiveness. These indicators help determine whether a
program has been successful and also assist in power calculations for sample size
determination, as discussed in Chapter 15.

Steps to Selecting Indicators

1.​ Establish Clear Objectives:​


After selecting the main indicators, define clear objectives regarding program success.
This step involves specifying the anticipated effects of the program on the outcome
indicators, such as expected changes in test scores or an increase in the adoption rate
of a health insurance policy.​

2.​ Agree on Outcome Indicators and Effect Sizes:​


Both the research team and policy team need to reach an agreement on the primary outcome indicators that will measure the success of the program. It is also important to define the effect sizes, which represent the magnitude of the expected change (e.g., a 5-point increase in test scores or a 10% reduction in out-of-pocket health expenditures). These effect sizes serve as the basis for power calculations (a minimal power-calculation sketch follows this list).

3.​ Ex Ante Simulations:​


If data are available, ex ante simulations can be conducted to predict different
outcome scenarios. These simulations help estimate the likely effect sizes across
various indicators and assess cost-benefit or cost-effectiveness in advance. They also
provide benchmarks for comparing alternative interventions.​
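As a minimal sketch of the power calculation these effect sizes feed into, the example below assumes the statsmodels library, an invented 5-point gain on test scores, and an invented standard deviation of 25 points.

```python
from statsmodels.stats.power import TTestIndPower

expected_gain = 5.0    # assumed minimum effect the evaluation should detect
outcome_sd = 25.0      # assumed standard deviation of the outcome
effect_size = expected_gain / outcome_sd   # standardized effect size (Cohen's d)

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,    # significance level
    power=0.80,    # desired statistical power
    ratio=1.0,     # equal treatment and comparison group sizes
)
print(f"required sample size per group: {n_per_group:.0f}")
```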

SMART Indicators

It’s crucial to ensure that the selected indicators are effective measures of program
performance. A commonly used approach to ensure this is the SMART framework, which
ensures that each indicator meets the following criteria:

●​ Specific: The indicator should measure the desired outcome as precisely as possible.
●​ Measurable: The indicator must be something that can be easily quantified or obtained.
●​ Attributable: The indicator should be linked directly to the program’s efforts, so you can
trace the outcomes to the intervention.
●​ Realistic: The data for the indicator should be obtainable in a reasonable timeframe and
at a reasonable cost.
●​ Targeted: The indicator must relate to the target population (i.e., those intended to
benefit from the program).

Tracking the Results Chain

Indicators should be identified not just at the outcome level, but throughout the entire results
chain to ensure that the program’s causal logic can be tracked. This includes monitoring both:

●​ Implementation Indicators: These track whether the program has been carried out as
planned, whether it has reached the target population, and if it has been delivered on
time.
●​ Outcome Indicators: These measure whether the program has achieved the intended
outcomes. Even if the focus is on outcomes, tracking implementation indicators is still
essential to explain why certain results were or were not achieved.

Without indicators across the entire results chain, an evaluation risks becoming a “black box”
that simply identifies whether outcomes were achieved, but cannot explain the reasons behind
the success or failure.

Checklist for Data Collection

Once the indicators have been selected, the next step is to consider the practical aspects of
gathering data to measure those indicators. Below is a checklist to ensure that the indicators
can be reliably produced and used in the evaluation:

●​ Are the indicators clearly specified?​


The indicators should align with the core evaluation questions and be consistent with
program design documents and the results chain.​
●​ Are the indicators SMART?​
Ensure that the indicators are Specific, Measurable, Attributable, Realistic, and
Targeted.​

●​ What is the data source for each indicator?​


Identify whether data will come from surveys, reviews, administrative data, or another
source.​

●​ How frequently will data be collected?​


Include a clear timeline for when data should be collected and ensure it aligns with
program activities.​

●​ Who is responsible for data collection?​


Designate individuals or teams responsible for collecting data, verifying its quality, and
ensuring compliance with ethical standards.​

●​ Who will handle data analysis and reporting?​


Define who will analyze the data, how often analysis will be conducted, and the
frequency of reporting.​

●​ What resources are needed for data collection?​


Ensure the necessary resources (funding, personnel, tools) are in place to collect
reliable data, as this can be the most expensive part of the evaluation.​

●​ Is there appropriate documentation?​


Make plans for documenting the data collection process, including how data will be
stored and ensuring confidentiality.​

●​ What risks are involved?​


Identify potential risks and challenges in the data collection process, including issues
related to timing, data quality, and the impact of these risks on the evaluation.​

Conclusion

In summary, selecting the right outcome and performance indicators is a crucial step in
designing an effective impact evaluation. These indicators should be SMART, tied directly to
program goals, and span the full results chain to ensure that the causal logic of the program can
be tracked. Clear objectives and effect sizes must be defined early to guide the evaluation, and
practical considerations for data collection should be carefully planned to ensure reliable and
valid results.
Causal Inference and Counterfactuals in Impact Evaluation
Causal Inference is the process of determining whether a program or intervention directly
causes a change in an outcome. For example, we may want to know if a vocational training
program leads to higher income for participants. It's not enough to just observe that someone's
income increased after completing the program; other factors, such as their effort or changes in
the job market, could also be responsible.

To establish causality, impact evaluations use specific methods to rule out other explanations
and isolate the program's effect. The goal is to determine how much of the observed change in
outcomes (like income) is due to the program itself, and not to other factors.

The basic formula for measuring impact is:

Δ = (Y | P = 1) − (Y | P = 0)

●​ Y represents the outcome (e.g., income).


●​ P = 1 means the person received the program (e.g., vocational training).
●​ P = 0 means the person did not receive the program.

This formula tells us that the causal impact (Δ) is the difference between the outcome (income)
with the program (P = 1) and the same outcome without the program (P = 0). In essence, we're
trying to measure what would have happened to the person if they hadn’t participated in the
program.

Example: Imagine a person completes a vocational training program. The goal is to compare
their income after the program (P = 1) to what their income would have been without the
program (P = 0). If we could observe both scenarios at the same time, we would know exactly
how much the program changed their income, without being influenced by any other factors.

However, it's impossible to observe both scenarios for the same person at the same time, so
impact evaluations often use counterfactuals (what would have happened in the absence of
the program) to estimate the causal effect.

By comparing groups that did and didn’t receive the program, or using statistical methods to
simulate the counterfactual, we can isolate the program’s true impact on outcomes.
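A minimal worked example of this comparison, with invented income figures for a treatment and a comparison group:

```python
import numpy as np

# Hypothetical post-program incomes.
income_with_program = np.array([42_000, 39_500, 41_200, 40_300])     # P = 1
income_without_program = np.array([31_000, 29_800, 30_500, 30_200])  # P = 0

# Impact estimate: difference in average outcomes between the two groups.
delta = income_with_program.mean() - income_without_program.mean()
print(f"estimated impact = {delta:,.0f}")
```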

What is the Counterfactual?


The counterfactual is the outcome that would have occurred if a person hadn’t participated in
the program. In other words, it’s what happens without the treatment or program. Since we
can only observe the person after they’ve joined the program (the "with program" outcome), we
can’t directly observe what would have happened if they hadn't joined the program (the "without
program" outcome). This is called the counterfactual problem.

Example: Miss Unique and the Cash Transfer Program

Let’s look at Miss Unique, a newborn baby who receives a cash transfer program for her
mother to take her to health checkups. The goal of the program is to improve Miss Unique’s
health (e.g., her height at age 3) by making sure she gets health services.

●​ What we can measure: We can measure Miss Unique’s height at age 3 after she has
received the cash transfer.
●​ What we can’t measure: We cannot know what Miss Unique’s height would have
been if her mother hadn’t received the cash transfer. This is the counterfactual — the
"what would have happened" scenario that we can’t directly observe.

To evaluate the impact of the program, we want to compare:

1.​ Miss Unique’s height with the cash transfer.


2.​ Miss Unique’s height without the cash transfer (this is the counterfactual, but it’s
unknown).

Why is this a problem?

Since Miss Unique actually received the cash transfer, we can’t know her height in a world
where she didn’t receive it. This makes it hard to say if the program caused any change in her
height. We would need to compare her to someone else who is very similar, but it’s impossible
to find someone who is exactly the same as Miss Unique. Every person has unique
circumstances, so just comparing Miss Unique to another child might not be accurate.

The "Perfect Clone" Example

In an ideal world, to solve the counterfactual problem, we could create a perfect clone of Miss
Unique. This clone would be exactly the same as Miss Unique in every way (same family,
same health, same background, etc.), but the clone wouldn’t receive the cash transfer.

●​ We could then compare Miss Unique’s height (after receiving the program) to the
clone’s height (without the program). The difference would show the program’s impact.

For example:

●​ Miss Unique (with cash transfer): 3 feet tall.
●​ Perfect Clone (without cash transfer): 2.9 feet tall.

The impact of the cash transfer would be the difference in their heights: 0.1 feet.

However, it’s impossible to find a perfect clone of someone because there are always
differences between people. Even identical twins have differences.

Real-Life Challenges

In the real world, we can’t find these perfect clones, so we use other methods to try to estimate
the counterfactual. For example, we compare Miss Unique to other children who didn’t receive
the cash transfer, but we have to be careful because those children may not be exactly like Miss
Unique. They may live in different areas, have different parents, or other factors that could affect
their health.

Summary

The counterfactual is the key concept in impact evaluations because it represents what would
have happened without the program. Since we can’t observe the counterfactual directly
(because a person can’t exist in two states at once), we estimate it by comparing people who
participated in the program to those who didn’t, while trying to account for other differences
between them.

The problem described above is called the “counterfactual problem” in impact evaluations. It refers to the difficulty of knowing what would have happened to a person if they hadn’t participated in a program (since we can only see what actually happened to them after they joined the program).

The following example makes this clearer:

Example: Vocational Training Program

Imagine we want to evaluate the effect of a vocational training program on a person's income.
Let's say that after the program, a person’s income increased from $30,000 to $40,000. The
question is: Did the program cause the income increase?

●​ With the Program: After completing the program, the person’s income is $40,000.​

●​ Without the Program: What if they hadn’t taken the program? Would their income still
have increased, or would it have stayed the same? This is what we don’t know because
we can’t see the "alternate reality" where the person didn’t participate in the program.​

The Counterfactual Problem


The counterfactual is the scenario where the person did not participate in the program. But
since we can’t go back in time and observe this alternative outcome, we have to find a way to
estimate it.

How Do We Solve This Problem?

We can't directly observe the "without program" income (since we only have the "with program"
data), so we compare the person’s income with other people who didn’t participate in the
program. These people form the comparison group.

Example of a Comparison Group:

Let’s say you find a group of people who are similar to the person in the training program (same
age, same education level, similar job history, etc.), but they didn’t take the training.

●​ You observe that these people’s income stayed around $30,000, the same as the
income of the person before they joined the program.

Now, you can compare:

●​ The person’s actual income after the program ($40,000).


●​ The person’s expected income without the program (which is like the income of
similar people, say $30,000).

So, in this case, you could say that the program likely contributed to the increase in income,
since the comparison group (people who didn’t take the program) did not show any increase in
income.

Why Is This Hard?

●​ We can never truly know what would have happened to the person without the
program. This is why we need to use statistical methods to estimate the counterfactual.
We try to find someone similar to the person in the program and estimate what their
income would have been if they hadn't participated.

In short:

●​ Without the program: We can’t observe directly, so we estimate it using other people
who didn’t participate.
●​ With the program: We observe directly (income after the program).

This difference is what we try to measure in impact evaluation to determine if the program
caused the change.
Estimating the Counterfactual:

The Problem

In impact evaluation, we are trying to figure out what would have happened to a group of
participants if they had not received the program or treatment. This is the counterfactual. We
can’t observe a person at the same time in two states (with and without the program), so we use
a comparison group to estimate the counterfactual.

Key Concepts for Estimating the Counterfactual:

1.​ Treatment Group: This group receives the program or treatment. For example, people
who receive the extra pocket money.
2.​ Comparison (Control) Group: This group does not receive the program. They are used
to estimate what would have happened to the treatment group without the program.

The goal is to compare the treatment group to a comparison group that is as similar as
possible, except for the fact that one group receives the program and the other does not.

The Main Challenge:

We need to make sure the treatment group and comparison group are statistically
identical, meaning that their characteristics are the same on average. If they are identical
except for receiving the program, then any differences in outcomes can be attributed to the
program itself.

Conditions for a Valid Comparison Group:

For the comparison group to be valid and provide a good estimate of the counterfactual, it must
meet these three conditions:

1.​ Same Characteristics:​

○​ The treatment and comparison groups must have the same average
characteristics in the absence of the program.
○​ For example, if we are comparing candy consumption, both groups should have
the same average age, gender, and preferences for candy, so any differences in
candy consumption are because of the program, not other factors.
2.​ The Comparison Group is Unaffected by the Program:​

○​ The treatment should not affect the comparison group, either directly or indirectly.
○​ For example, if the treatment group is given extra pocket money and this leads to
more trips to the candy store, we need to ensure the comparison group is not
affected by these trips. Otherwise, we wouldn’t be able to isolate the impact of
the pocket money itself.
3.​ Same Reaction to the Program:​

○​ The treatment and comparison groups should respond to the program in the
same way.
○	If the treatment group’s income increases by $100 because of a training program, we assume the comparison group’s income would also have increased by $100 had they received the same training. When this holds, any difference in outcomes between the groups can be attributed to the program.

Example: Pocket Money and Candy Consumption

Let’s go back to the example of Mr. Fulanito receiving extra pocket money and consuming
more candy.

●​ Treatment Group: This group (say, Mr. Fulanito) receives the extra pocket money.
●​ Comparison Group: This group does not receive the extra pocket money, but they
should be similar to Mr. Fulanito in terms of age, preferences for candy, and other
characteristics.

Now, let’s go through the process:

1.​ Treatment Group Outcome (Y | P = 1):​

○​ The average candy consumption of the treatment group (those who received
pocket money).
○​ For example, they consume 6 candies on average.
2.​ Comparison Group Outcome (Y | P = 0):​

○​ The average candy consumption of the comparison group (those who did not
receive pocket money).
○​ For example, they consume 4 candies on average.
3.​ Estimate the Impact:​

○​ The impact of the program (extra pocket money) is the difference between the
two averages.
○​ Impact = 6 (candies for treatment group) – 4 (candies for comparison group) = 2
candies.

In this case, we estimate that the pocket money program caused an increase of 2 candies in
candy consumption on average.
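A minimal sketch of this difference-in-means calculation in Python, using made-up candy counts chosen so that the group averages match the 6 and 4 above:

```python
# Difference-in-means impact estimate for the pocket-money example.
# The candy counts below are made up so that the group averages match
# the 6 (treatment) and 4 (comparison) used in the text.

treatment = [5, 6, 7, 6]      # candies eaten by children who got extra pocket money
comparison = [4, 3, 5, 4]     # candies eaten by similar children who did not

avg_treatment = sum(treatment) / len(treatment)      # 6.0
avg_comparison = sum(comparison) / len(comparison)   # 4.0

impact = avg_treatment - avg_comparison              # 2.0 candies
print(f"Estimated impact: {impact} candies")
```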
Why is this Comparison Group Important?

If we don’t have a valid comparison group, the estimated impact could be biased. This means
we could be measuring not only the effect of the program but also the effect of other differences
between the groups.

For example, if the comparison group is much older or lives in a different area where candy is
cheaper, their candy consumption might be different for reasons unrelated to the program. This
could distort the estimate of the program’s true impact.

Conclusion

In summary, the key to estimating the counterfactual and determining the true impact of a
program is finding a valid comparison group. This comparison group must:

1.​ Have the same average characteristics as the treatment group.


2.​ Not be affected by the program.
3.​ React to the program in the same way.

When we find such a group, we can confidently say that any difference in outcomes between
the two groups is due to the program itself.

Counterfeit Counterfactual Estimate 1: Comparing Outcomes Before and After a Program

The idea behind before-and-after comparisons (also known as pre-post or reflexive
comparisons) is to evaluate the impact of a program by looking at the outcomes for participants
before and after they participate in the program.

However, this method can often lead to misleading or counterfeit estimates of the
counterfactual. Here's why:

In this method, you are comparing the outcomes before the program (the baseline) and after
the program has been implemented. The assumption is that if the program hadn't existed, the
outcome for participants would have stayed the same as it was before the program. But this
assumption is usually not valid because outcomes can change due to other factors, not just the
program itself.

Example: Evaluating a Microfinance Program


Let's say we are evaluating the impact of a microfinance program that provides loans to poor
farmers to help them buy fertilizers and increase their rice yields. Here's how the
before-and-after comparison would look:

1.​ Before the program: In Year 0 (before the program starts), the farmers have an
average rice yield of 1,000 kg per hectare.
2.​ After the program: After one year (Year 1), with the microloan, the farmers' rice yield
has increased to 1,100 kg per hectare.

A before-and-after comparison would calculate the program’s impact as:

Δ = 1,100 kg/ha − 1,000 kg/ha = 100 kg/ha

So, the before-and-after estimate suggests that the program increased rice yields by 100 kg
per hectare.
The Problem with Before-and-After Comparisons

The problem with this method is that it assumes that without the program, the farmers' yield
would have stayed the same at 1,000 kg per hectare (the baseline). But this assumption is
incorrect because there are many factors that can affect the outcome (such as weather or
market conditions) that are not accounted for in this analysis.

For example:

●​ If there was a drought in the year the program was implemented, the yield would have
likely been lower without the program, perhaps around 900 kg per hectare. In this case,
the actual program impact would be 1,100 kg - 900 kg = 200 kg (which is larger than the
100 kg estimated from the before-and-after comparison).​

●​ If rainfall improved in the year the program was implemented, the yield would have
likely been higher even without the program, perhaps 1,200 kg per hectare. In this case,
the actual impact of the program would be 1,100 kg - 1,200 kg = -100 kg (a negative
impact).​

Thus, the true impact could be larger or smaller than the 100 kg estimate, depending on
factors like weather (rainfall or drought) or other external influences.
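The same arithmetic can be laid out in a short Python sketch that contrasts the before-and-after estimate with the true impact under the two hypothetical weather scenarios above (all yield figures are the illustrative numbers from the example):

```python
# How the before-and-after estimate compares with the true impact
# under the two hypothetical weather scenarios described above.

baseline_yield = 1000          # kg/ha in Year 0, before the program
observed_yield = 1100          # kg/ha in Year 1, with the program

before_after_estimate = observed_yield - baseline_yield   # 100 kg/ha

# Counterfactual Year 1 yields (what would have happened WITHOUT the program)
# under each scenario; these are the illustrative numbers from the text.
counterfactuals = {"drought": 900, "good rainfall": 1200}

for scenario, counterfactual_yield in counterfactuals.items():
    true_impact = observed_yield - counterfactual_yield
    bias = before_after_estimate - true_impact
    print(f"{scenario}: true impact = {true_impact:+d} kg/ha, "
          f"before-after estimate = {before_after_estimate:+d} kg/ha, "
          f"bias = {bias:+d} kg/ha")
```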

Why It's a "Counterfeit" Estimate

The before-and-after method uses the baseline (Year 0) as the counterfactual (the "what
would have happened" scenario if the program hadn't existed). However, this is an incorrect
assumption because external factors (like weather) could have changed the outcome, making
the baseline an unreliable estimate of the counterfactual.

In summary:

●​ Before-and-after comparisons are risky because they ignore the fact that other
factors can affect the outcome over time.
●​ It assumes the outcome would have stayed the same without the program, but that's
rarely the case because of factors like weather, economic changes, etc.
●​ This leads to counterfeit estimates of the program’s true impact.

Key Takeaway

Before-and-after comparisons can give misleading results because they fail to account for the
many factors that can affect outcomes over time. Instead of using the baseline as the
counterfactual, a more valid method requires using a comparison group (a group not receiving
the program) that is similar in all aspects except for the program itself.
Counterfeit Counterfactual Estimate 2: Comparing Enrolled and
Nonenrolled (Self-Selected) Groups
This is another method of estimating the impact of a program, but it has its own pitfalls. The idea
here is to compare the outcomes of people who voluntarily choose to participate in a program
(the "enrolled" group) to those who choose not to participate (the "nonenrolled" group).
However, this approach can give you a counterfeit estimate of the program's true impact
because the two groups may not be comparable.

1. The Basic Idea of this Estimate

You want to compare two groups of people:

●​ The Enrolled Group: These are people who voluntarily chose to join the program.
●​ The Nonenrolled Group: These are people who decided not to join, even though they
were eligible for the program.

By comparing these two groups, you’re trying to estimate the counterfactual outcome for the
enrolled group. That is, you want to know how the enrolled group would have performed if they
hadn't participated in the program (the counterfactual situation).

In theory, if the two groups (enrolled vs. nonenrolled) were identical in all important ways except
for program participation, the difference in outcomes could be attributed to the program. But, in
reality, this is rarely the case.

2. The Problem with Comparing Enrolled vs. Nonenrolled Groups

The issue with using this approach is that the two groups are fundamentally different, and
these differences can affect the outcome you're measuring. Specifically, people who choose to
enroll in a program are often different from those who don’t, in ways that are hard to observe
and measure. Here are a couple of key reasons why:

●	Motivation: People who choose to participate in a program may be more motivated to succeed than those who don’t. For example, if we’re evaluating a vocational training
program, the enrolled group might be more driven to improve their skills and income
than those who choose not to enroll. This higher motivation could lead them to perform
better in the labor market, even without the training program.​
●​ Expectations of Benefit: Enrolled individuals might believe they will benefit more from
the program, which could push them to work harder or pursue more opportunities. Those
who don’t enroll may not see the value of the program or might have lower expectations,
affecting their outcomes.​

●	Skills or Background: In some cases, program administrators might select participants based on certain preexisting characteristics like qualifications or motivation. This can
create an admission bias, where those who are chosen to participate in the program
may already be better suited to succeed (even without the program).​

3. Why This Leads to a Counterfeit Estimate

When you compare the enrolled group to the nonenrolled group, you are essentially comparing
two groups that are different in many ways. The counterfactual estimate you get from this
comparison is not valid because the nonenrolled group is not a fair representation of what
would have happened to the enrolled group if they had not participated in the program.

For example:

●​ If people who chose to enroll in a vocational training program already had better skills or
a higher level of motivation, then the difference in outcomes (e.g., higher income for
the enrolled group) might not be due to the program, but because the enrolled group
was already more likely to succeed than the nonenrolled group.

This means you could overestimate the program’s impact. You might wrongly attribute a higher
income or better performance to the program when, in fact, the enrolled group would have
performed better anyway, just because of their initial advantages.

4. The Key Concept: Selection Bias

What you’re dealing with here is called selection bias. This happens when the reason people
choose to enroll in the program is related to factors that affect the outcome, even in the absence
of the program.

In other words:

●​ If people who are more motivated or more skilled are the ones who enroll, then any
improvement in their outcomes can’t necessarily be attributed to the program itself. The
pre-existing differences (motivation, skills, etc.) are the real cause of their better
outcomes.
5. How This Impacts the Counterfactual Estimate

If you rely on comparing the enrolled group to the nonenrolled group, you will likely get a
biased estimate of the program’s true impact. This is because the enrolled group and
nonenrolled group are not comparable. The difference in their outcomes is not entirely due to
the program—it also reflects the underlying differences in their characteristics.

The estimate you get is a counterfeit estimate because it assumes that the nonenrolled group
represents the true counterfactual (i.e., what would have happened to the enrolled group if they
hadn’t participated in the program), but it does not. The nonenrolled group is likely to have very
different characteristics that affect their outcomes in ways that the enrolled group does not
share.

Key Takeaways

●​ Enrolled vs. Nonenrolled Comparison leads to selection bias, because the two
groups differ in ways that affect the outcome.
●​ These differences are not accounted for, meaning any difference you observe in
outcomes between the groups could be due to pre-existing differences (e.g.,
motivation, skills) rather than the program itself.
●​ The counterfactual estimate you get from this comparison is incorrect and
counterfeit, because the nonenrolled group is not a valid comparison for the enrolled
group.
●​ This method can overestimate the program’s impact, as it assumes the nonenrolled
group would have had the same outcomes as the enrolled group if they had participated
in the program.

In conclusion, comparing enrolled and nonenrolled groups can be misleading unless you
account for the differences between the two. Methods like randomized controlled trials
(RCTs) or techniques like propensity score matching are often used to overcome this issue by
making the groups more comparable, which helps produce a more accurate estimate of the
program's impact.

Numerical example for Counterfeit Counterfactual Estimate 2:

Example Scenario: Vocational Training Program


Imagine we have a vocational training program designed to help young, unemployed individuals
increase their income. The program is offered to a group of eligible youths, but only some
choose to enroll, while others decide not to.

We want to evaluate the impact of this vocational training program on income. Specifically, we
want to know how much more (if anything) the enrolled individuals are earning after completing
the program compared to those who didn’t enroll.

The Key Problem: Self-Selection Bias

The problem arises because people who choose to enroll in the program (the "enrolled" group)
might be fundamentally different from those who don't enroll (the "nonenrolled" group). These
differences may affect their income, regardless of the program.

The Two Groups:

●​ Enrolled Group (Treatment Group): 100 individuals who voluntarily chose to enroll in
the vocational training program.
●​ Nonenrolled Group (Comparison Group): 100 individuals who were eligible for the
program but decided not to enroll.

Let’s say, based on some survey data, we know the following about their average incomes
before the program:

Group | Average Income Before Program (US$) | Average Income After Program (US$)
Enrolled | 5,000 | 7,000
Nonenrolled | 5,000 | 6,200

Step 1: Calculate the Difference in Income (Before and After)

First, let's calculate the average income change for both groups (before vs. after):

●​ Enrolled Group:​

○​ Before: $5,000
○​ After: $7,000
○​ Change in Income: $7,000 - $5,000 = $2,000 increase
●​ Nonenrolled Group:​
○​ Before: $5,000
○​ After: $6,200
○​ Change in Income: $6,200 - $5,000 = $1,200 increase

Step 2: Compare the Two Groups

If we directly compare the enrolled group to the nonenrolled group, it looks like the program
had a pretty strong impact:

●​ The enrolled group saw a $2,000 increase in income.


●​ The nonenrolled group saw a $1,200 increase in income.

So, the program appears to have increased income by $800 more for those who participated
($2,000 vs. $1,200).
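For reference, a small Python sketch that reproduces the naive $800 figure from the table above; the next part of this section explains why this number is a counterfeit estimate:

```python
# Naive comparison of enrolled vs. nonenrolled youths, using the table above.
# This reproduces the $800 figure, which the text below argues is a
# counterfeit estimate because of self-selection.

income = {
    "enrolled":    {"before": 5000, "after": 7000},
    "nonenrolled": {"before": 5000, "after": 6200},
}

change_enrolled = income["enrolled"]["after"] - income["enrolled"]["before"]            # 2000
change_nonenrolled = income["nonenrolled"]["after"] - income["nonenrolled"]["before"]   # 1200

naive_difference = change_enrolled - change_nonenrolled                                 # 800
print(f"Naive 'impact' estimate: ${naive_difference}")
```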

But, here's the issue:

Why This Estimate is Counterfeit

The $800 difference might seem like the program’s impact, but the comparison is not valid. The enrolled group and the nonenrolled group are likely to be different in important ways, even before the program started. Even though both groups happened to have the same average income at baseline, unobserved differences (such as motivation or skills) could be the reason the enrolled group saw a larger increase in income. The groups are not equivalent to begin with!

Here are some potential reasons for these differences:

●​ Motivation: The enrolled group may be more motivated to improve their life, so even
without the program, they may have pursued other avenues to increase their income.
●​ Skills: The enrolled group might have higher skills or experience, making them more
likely to earn more, even before the program.
●​ External Factors: The nonenrolled group might be facing more financial hardship or
have fewer resources to improve their income, making them less likely to take part in the
program.

Because the two groups were not randomly selected and are likely different in important ways,
the income difference we observe could be due to those differences, not the program itself.

Step 3: Selection Bias Explained

The income difference observed between the groups is due to selection bias. People who
chose to enroll in the program were likely different (in motivation, skills, etc.) from those who
chose not to. This makes it impossible to attribute the entire difference in income to the
program itself.

For example:
●​ The enrolled group may have been motivated to improve their income even without the
program, while the nonenrolled group might have been less motivated, leading to
different outcomes.
●​ If the nonenrolled group had chosen to participate, they might have experienced a
larger or smaller increase in income than we see with the enrolled group.

Step 4: The True Impact of the Program (The Correct Approach)

To properly estimate the program’s impact, we need to account for the differences between
the enrolled and nonenrolled groups. A randomized controlled trial (RCT) or matching
techniques could help us find a more accurate comparison group (nonenrolled individuals who
are similar to the enrolled individuals in terms of skills, motivation, and other characteristics).

For instance, if we randomly assigned people to either the enrolled or nonenrolled group, we
would have more confidence that any differences in income are due to the program and not
other factors.

Step 5: Conclusion

So, comparing the enrolled and nonenrolled groups without addressing the differences
between them results in a counterfeit estimate of the program's impact. The difference in
income of $800 could be due to factors other than the program itself, such as motivation or skill
levels, and is therefore not a reliable estimate of the program’s true impact.

To avoid this, we would need a method that accounts for these differences, such as random
assignment or matching to ensure we're comparing individuals who are truly similar in every
way except for program participation. This would allow us to make a more accurate estimate of
the program's effect.

Summary:

●	Self-Selection Bias: Participants are likely to be different from nonparticipants, which makes direct comparisons misleading.
●​ Counterfeit Estimate: The $800 difference may not be due to the program but rather to
pre-existing differences.
●​ Correct Approach: Use methods like random assignment to avoid selection bias and
get a true estimate of the program’s impact.
Randomized Assignment: Evaluating
Programs Based on the Rules of
Assignment
In the previous sections, we discussed two "counterfeit" methods for estimating program
impacts, such as before-and-after comparisons and enrolled versus non-enrolled group
comparisons. Both of these methods have high risks of bias. Now, we are moving toward more
reliable methods for evaluating program impacts. One of the strongest and fairest approaches
to estimating program impact is randomized assignment, commonly known as randomized
controlled trials (RCTs).

What is Randomized Assignment?

Randomized assignment involves using a random process—essentially like a lottery—to decide who participates in a program and who doesn’t. This process ensures that everyone who
is eligible for a program has an equal chance of being selected for the program.

●​ Example: Imagine a program that is designed to help unemployed youth find jobs. If
there are more eligible youth than available slots, a random lottery is used to select
who will get the opportunity to participate in the program. Those selected become the
treatment group, and those not selected form the control group.

Why is Randomized Assignment a Good Method?


1.​ Fair and Transparent: By using a lottery or random selection, the process is fair.
Everyone who is eligible has the same chance of participating, and there is no favoritism
or manipulation in the selection. This transparency is essential for avoiding accusations
of bias or corruption. For example, if the program is being implemented in a community,
people will trust the fairness of the process because they know everyone has an equal
shot.​

2.​ The Gold Standard of Impact Evaluation: Randomized assignment is considered the
gold standard for evaluating the impact of social programs because it gives us the best
estimate of what would have happened to the program participants if they hadn’t
participated. This is important because it helps us create an accurate counterfactual
(what would have happened in the absence of the program).​

3.​ Avoids Selection Bias: In other methods (like comparing enrolled and non-enrolled
groups), participants might differ systematically from non-participants in ways that affect
the outcome. Randomized assignment eliminates this risk because participants are
chosen randomly and are therefore more likely to be similar to the non-participants
(control group) in key characteristics.​

Randomized Assignment and Scarce Resources

In most cases, programs have limited resources and cannot serve every eligible person. This is
where randomized assignment is particularly useful. Let’s say you have a program that aims
to help the poorest 20% of households in a country. If you cannot reach everyone at once due
to budget or capacity constraints, you can randomly select participants from the eligible
population.

●​ Example: If an education program can only provide school materials to 500 schools, but
there are thousands of eligible schools, a random lottery can be used to choose the
500 schools that will receive the materials.

This method ensures that the selection process is fair and transparent, and there is no room
for arbitrary decisions, favoritism, or corruption in choosing participants.

Randomized Assignment: Practical Implementation

1.​ Target Population Larger than Available Slots: In many cases, the population of
eligible participants is larger than the number of spots available in the program. For
example, if a youth employment training program can only enroll a limited number of
youth, a random lottery can determine who gets in. This is often the case for programs
with budget constraints or capacity limitations.​

2.​ Program Rules for Assignment: When a program has more applicants than slots, the
program needs to decide how to allocate the slots. If the program is designed to serve a
specific population, but demand exceeds capacity, a fair and transparent way to allocate
slots is by using a random process.​

○​ Example: If a rural road improvement program can only pave a few roads in a
given year, and there are many eligible roads in need of improvement,
randomized assignment ensures that the selection of roads is done fairly.

Using Randomized Assignment in Practice

Programs often face the challenge of allocating benefits to a large pool of potential participants.
In such cases, even if there is a way to rank participants (e.g., based on income), the rankings
can be imprecise. Randomly assigning participants within a specific range (e.g., households
with incomes close to the threshold) ensures fairness and prevents errors in allocation.

●	Example: If a poverty-targeted program wants to help households with incomes in the bottom 20%, but income measurements are imperfect, a random lottery could be used
to assign benefits to households near the threshold (e.g., between the 15th and 25th
percentiles). This ensures fairness in the process even if the exact level of poverty is
uncertain.

Why Randomized Assignment is a Powerful Tool

1.​ Eliminates Bias: Because assignment is random, there are no systematic differences
between those who receive the program and those who don’t, at least not due to the
selection process itself. This removes the risk of selection bias, where the groups differ
in ways that could affect the outcome (such as motivation, skills, etc.).​

2.	Fairness in Resource Allocation: Randomized assignment is a widely accepted method for fairly distributing limited resources, especially when demand exceeds supply.
People can trust the system because the process is objective and free of favoritism.​

3.​ Clear and Transparent Process: A public lottery is a clear and transparent way of
making decisions about program participation. When the process is open to everyone, it
minimizes the risk of misunderstanding or accusations of unfairness.​

Example from Practice

Box 4.1: Randomized Assignment as a Valuable Operational Tool

This box provides two examples of how randomized assignment has been used as a fair and
transparent method to allocate program benefits, even outside of impact evaluations.

1.​ Côte d'Ivoire: After a crisis, the government introduced a temporary employment
program for youth, offering jobs like road rehabilitation. Because demand exceeded the
available spots, a public lottery was used to fairly select participants. Applicants drew
numbers publicly, and those with the lowest numbers were selected. This process
helped ensure fairness in a post-conflict environment.​

2.​ Niger: In 2011, the government launched a national safety net project but had more
eligible poor households than benefits available. Due to limited data, a public lottery
was used to select beneficiary villages within targeted areas. Village names were drawn
randomly, ensuring fairness and transparency. This method continued to be used in later
phases of the project due to its success in promoting fairness.​

In both cases, using a randomized assignment approach (lottery) ensured that the allocation
was fair, transparent, and widely accepted by local authorities and participants.
Summary

●	Randomized assignment (or randomized controlled trials) is a powerful tool for evaluating program impact because it provides a reliable way to estimate the
counterfactual and helps to eliminate biases like selection bias.
●​ It is also a fair and transparent method for allocating limited resources in programs
when demand exceeds supply.
●​ The use of random selection (like lotteries) ensures that everyone has an equal chance
of receiving the program, which builds trust and prevents accusations of favoritism.
●​ Randomized assignment is considered the gold standard in impact evaluation and is
increasingly used to assess the effectiveness of social programs.

Why Does Randomized Assignment Produce an Excellent Estimate of the Counterfactual?

Randomized assignment is an effective way to estimate the impact of a program by ensuring that the treatment and comparison groups are statistically equivalent. This means that, through
random selection, both groups will have similar characteristics (both observed and unobserved)
before the program starts, which is key to accurately estimating the counterfactual (what would
have happened without the program).

1.	Random Assignment Process: In a randomized assignment, individuals are randomly assigned to either the treatment group (receiving the program) or the comparison group
(not receiving the program). With a large enough number of units, randomization
ensures that the groups are statistically similar in all respects (e.g., gender, age,
characteristics). Even unmeasured factors like motivation or personality are likely to be
equally distributed between both groups.​

2.​ Estimation of Impact: Once the program is implemented, any differences in outcomes
between the two groups can be attributed to the program itself, because the groups were
identical before the program started. This allows for a true estimate of the program's
impact.​

3.​ Why It's Effective: By randomly assigning participants, we can be confident that any
observed differences in outcomes are due to the program, rather than other external
factors or biases. This eliminates the risk of selection bias, which could occur if
participants were chosen based on certain characteristics.​

4.​ Simplified Process: The impact is calculated by simply comparing the average outcome
in the treatment group to that in the comparison group, and this difference represents the
true impact of the program. Randomized assignment, therefore, ensures that the
counterfactual is accurate, leading to more reliable estimates of a program's
effectiveness.​

In essence, randomized assignment helps create two groups that are as similar as possible,
making it easier to attribute differences in outcomes to the program itself.
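A small simulation (purely illustrative, with made-up numbers) shows why this works: when an unobserved trait such as motivation raises both the chance of enrolling and the outcome, a self-selected comparison is biased, while a randomized comparison recovers the true effect.

```python
# Illustrative simulation: unobserved motivation raises income AND, in the
# self-selected scenario, the chance of enrolling. All numbers are made up.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

motivation = rng.normal(0, 1, n)      # unobserved trait
true_effect = 2_000                   # true program impact on income

# Scenario A: self-selection -- motivated people enroll more often.
enroll_self = rng.random(n) < 1 / (1 + np.exp(-motivation))
income_self = 30_000 + 3_000 * motivation + true_effect * enroll_self

# Scenario B: randomized assignment -- a coin flip decides enrollment.
enroll_rand = rng.random(n) < 0.5
income_rand = 30_000 + 3_000 * motivation + true_effect * enroll_rand

naive = income_self[enroll_self].mean() - income_self[~enroll_self].mean()
rct = income_rand[enroll_rand].mean() - income_rand[~enroll_rand].mean()

print(f"True effect:              {true_effect:,.0f}")
print(f"Self-selected comparison: {naive:,.0f}   (biased upward by motivation)")
print(f"Randomized comparison:    {rct:,.0f}   (close to the true effect)")
```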

External and Internal Validity

In randomized assignments, internal validity and external validity are key concepts to ensure
that the impact estimates are both accurate and generalizable.

1.​ Internal Validity:​


Internal validity ensures that the estimated impact of the program is due to the program
itself, not other confounding factors. This means that the comparison group accurately
represents the counterfactual (what would have happened without the program).
Randomized assignment helps achieve internal validity because the treatment and
comparison groups are statistically equivalent at the baseline, meaning any difference in
outcomes can only be attributed to the program.​

2.​ External Validity:​


External validity ensures that the results of the evaluation can be generalized to the
entire population of eligible units, not just the sample in the study. To ensure external
validity, random sampling is used to select a representative evaluation sample from the
population. If the sample accurately reflects the broader population, the results can be
generalized to other units in the population.​

3.​ Trade-offs Between Internal and External Validity:​


In some cases, there may be a trade-off between internal and external validity. For
example, if an impact evaluation uses a nonrandom sample of the population, it can
achieve internal validity but might not be able to generalize the results to the larger
population, thus limiting external validity. Similarly, a random sample can provide
external validity, but if treatment assignment isn't randomized, internal validity is
compromised.​

In essence, randomized assignment ensures internal validity by creating equivalent treatment and comparison groups, while external validity depends on the representativeness of the
sample used in the study.
By combining random sampling (for external validity) and randomized assignment (for internal
validity), evaluations can both measure accurate impacts and generalize those results to the
larger population.

Examples​

Here are the examples explained in short:

1.​ Conditional Cash Transfers and Education in Mexico (Progresa/Prospera):​


The Mexican government ran the Progresa program, offering cash transfers to poor
mothers in rural areas, conditioned on their children's school enrollment and health
checkups. To evaluate its impact, two-thirds of the localities were randomly selected to
receive the program in the first two years, while the remaining served as a comparison
group. The evaluation showed a 3.4% increase in enrollment, with the largest increase
seen among girls, particularly those who had completed grade 6. This was likely due to
the larger transfer amounts given to girls to keep them in school.​

2.​ Youth Employment in Northern Uganda:​


The Youth Opportunities Program in Uganda aimed to reduce youth unemployment by
funding business activities and vocational training. Proposals were randomly assigned to
receive funding. After four years, youth in the treatment group were more likely to
practice skilled trades, earn more money, and accumulate more capital compared to the
comparison group. However, there was no impact on social cohesion or antisocial
behavior.​
3.​ Water and Sanitation Interventions in Bolivia:​
The Bolivian government randomly assigned water and sanitation interventions to rural
communities in need. Due to resource limitations, only some communities could receive
the intervention. A public lottery system was used to randomly select which communities
would benefit. This ensured fairness and transparency, and the remaining communities
were promised future funding after the evaluation.​

4.​ Spring Water Protection in Kenya:​


In Kenya, a program aimed at improving water quality through spring protection
technology was evaluated using randomized assignment. 100 out of 200 springs were
randomly chosen to receive the treatment. The results showed that spring protection
significantly reduced water contamination and decreased child diarrhea by 25%.​

5.​ HIV Education to Curb Teen Pregnancy in Kenya:​


In Kenya, a study tested two different HIV/AIDS education programs to reduce unsafe
sexual behavior and teen pregnancies. Schools were randomly assigned to receive one
of the two programs. The results showed that the program focusing on relative HIV risks
(age- and gender-disaggregated data) significantly reduced teenage pregnancy by 28%,
while the other program had no effect.​

1. Internal Validity:

Internal validity is concerned with whether the program or treatment itself (in your case, the
treatment group) is responsible for the observed effects. In other words, we want to ensure that
any differences in outcomes (e.g., employment rates, income, etc.) between the treatment
group and the control group are due to the program itself and not due to other external factors
(like individual characteristics, motivations, etc.).

●​ Treatment Group: The group that receives the program.


●​ Control Group: The group that does not receive the program.

Internal Validity Example:

●​ If, after the job training program, the treatment group finds more jobs than the control
group, internal validity ensures that the difference is caused by the program itself
(random assignment helps create two statistically equivalent groups to eliminate bias).
●​ If there’s no bias in how the two groups were formed, we can confidently attribute the
difference to the program, making the internal validity strong.

2. External Validity:
External validity is concerned with how well the results of your study can be generalized to
other people, settings, or times. Essentially, can the results of this program be applied to
other groups outside of your study?

●​ Treatment Group: The people who participated in the program.


●​ Control Group: The people who did not participate in the program.

External Validity Example:

●​ If the study was done on unemployed youth in one city, we’d want to know whether the
results would apply to other groups—like unemployed adults, people in rural areas, or
people in different countries.
●​ If the study sample (treatment and control groups) is representative of the larger
population, external validity is high, and we can generalize the findings to similar
groups outside of the study.

To summarize:

●​ Internal validity is about the study's design (was the treatment the cause of the
difference?), and it depends on how well the random assignment created equivalent
groups (treatment vs. control). It is about the accuracy of the cause-and-effect
relationship within the study itself.​

●​ External validity is about generalizability (can the results be applied to the broader
population or different settings?). It depends on how well the study sample represents
the larger population you're trying to generalize to.​

So:

●​ Internal Validity = Can we trust that the differences between the groups are due to the
program itself and not other factors?
●​ External Validity = Can we apply the results of this study to other groups or settings
beyond the study?


When randomized assignment can be used:
1. When the eligible population exceeds the available program spaces:
●​ Situation: When there are more eligible participants than there is capacity to serve
them, a lottery system can be used to select the treatment group.
●​ Example: Suppose a government wants to provide school libraries to public schools, but
there’s only enough budget for one-third of them. A lottery is held where each school has
a 1 in 3 chance of being selected. The schools that win the lottery get the library
(treatment group), while the remaining schools without a library serve as the comparison
group.
●​ Purpose: This ensures that the comparison group is statistically equivalent to the
treatment group, and no ethical issues arise because the schools left out are essentially
part of the natural limitation due to budget constraints.

2. When a program needs to be gradually phased in:

●​ Situation: If a program is rolled out gradually and will eventually cover the entire eligible
population, randomizing the order in which people receive the program can create a
valid comparison group for evaluating impacts.
●​ Example: If the health ministry wants to train 15,000 nurses across three years, it could
randomly assign one-third of nurses to be trained in each year. After the first year, those
trained in year 1 become the treatment group, while those trained in year 3 are the
comparison group (since they haven’t been trained yet). This allows for the evaluation of
the effects of receiving the program for different amounts of time.

Key Point:

Randomized assignment in these cases helps ensure that the treatment and comparison groups
are statistically equivalent, and it provides a way to estimate the counterfactual (what would
have happened without the program). In both scenarios, either due to limited program spaces or
gradual implementation, randomized assignment can be used to assess the true impact of a
program while maintaining fairness and validity in the evaluation.

How to randomly assign treatment:


Step 1: Define Eligible Units

●​ Situation: The first step is to identify the population of units that are eligible for the
program. A unit can be a person, a school, a health center, a business, or even a whole
village, depending on the program.
●​ Example: If you’re evaluating a teacher training program for primary school teachers,
then only primary school teachers should be considered as eligible units. Teachers from
other levels (like secondary school teachers) would not be part of the eligible units.

Step 2: Select the Evaluation Sample

●​ Situation: After defining the eligible units, you may not need to include all of them in the
evaluation due to practical constraints (like budget or time). In this case, you randomly
select a sample from the eligible units based on the evaluation’s needs.
●​ Example: If your eligible population includes thousands of teachers, you might select a
sample of 1,000 teachers from 200 schools to evaluate, as it would be more
cost-effective than assessing every teacher in the country.

Step 3: Randomize Assignment to Treatment

●​ Situation: Once you have the evaluation sample, the next step is to randomly assign the
units to either the treatment group or the comparison group. Here are some methods to
do this:
1.​ Flipping a coin: For a 50/50 split between treatment and comparison groups, flip
a coin for each unit. Decide beforehand whether heads or tails will assign a unit
to the treatment group.
2.​ Rolling a die: If you want to assign one-third of the sample to the treatment
group, roll a die. For instance, decide that a roll of 1 or 2 means the unit goes into
the treatment group, and 3 to 6 means the comparison group.
3.​ Drawing names from a hat: Write the names of all units on pieces of paper, mix
them up, and draw the required number of names for the treatment group.
4.​ Automated process: For larger samples (like over 100 units), use a random
number generator (via software or a spreadsheet) to assign units to the treatment
or comparison group. For example, you might assign the 40 highest random
numbers to the treatment group.

Key Points:

●	Documentation & Transparency: The randomization process must be transparent and documented to ensure the process was truly random and unbiased. Whether using a
coin flip, dice, or random number generator, the assignment rule must be decided
beforehand and clearly communicated.
●​ Automation for Large Samples: If you need to assign treatment to many units,
automation (using computers) can be more efficient and accurate, but it’s important to
follow the pre-decided rule and log all computations for transparency and replication.

By following these steps, you ensure a fair and valid randomization process that helps evaluate
the program’s effects effectively.​
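As a concrete sketch of the automated option in Step 3, the snippet below gives every unit a random number and sends the units with the 40 highest draws to the treatment group; the unit IDs, the seed, and the 40-out-of-100 split are illustrative placeholders:

```python
# Automated random assignment: one random number per unit, highest draws
# go to treatment. The seed is fixed and logged so the draw can be replicated.

import random

random.seed(20240101)                                    # documented for replication

unit_ids = [f"school_{i:03d}" for i in range(1, 101)]    # 100 eligible units (placeholders)
draws = {unit: random.random() for unit in unit_ids}     # one random number per unit

# Pre-decided rule: the 40 highest draws go to the treatment group.
ranked = sorted(unit_ids, key=lambda u: draws[u], reverse=True)
treatment_group = set(ranked[:40])
comparison_group = set(ranked[40:])

assert len(treatment_group) == 40 and len(comparison_group) == 60
```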


At what level randomized assignment is performed:


1. Level of Assignment (Individual, Household, Community, etc.)

●	Situation: Randomized assignment can be done at different levels depending on the program’s implementation. This could include assigning treatment at the individual,
household, business, community, or regional level.
●​ Example: If a health program is being implemented at health clinics, you would
randomly select clinics and assign some to the treatment group and others to the
comparison group.

2. Higher-Level Assignment (Region/Province Level)

●​ Challenge: When the level of randomization is higher (e.g., provinces or regions), it may
become harder to perform a valid impact evaluation due to a small number of regions.
This can make it difficult to balance characteristics between the treatment and
comparison groups.
●​ Example: If a country has only six provinces, and three are randomly assigned to the
treatment group, it may not be sufficient to ensure balanced groups. External factors (like
weather or local events) may affect regions differently, leading to biased results.
●​ Key Point: For unbiased impact estimates, it is crucial that factors such as rainfall, which
can vary over time, are balanced across treatment and comparison groups.

3. Lower-Level Assignment (Individual/Household Level)

●	Challenge: When randomized assignment is done at the individual or household level, there is a higher risk of spillovers (when the treatment group affects the comparison
group) and imperfect compliance (when people in the comparison group receive the
treatment or vice versa).
●	Example: If children in a treatment group receive deworming medicine, children in nearby comparison households may also benefit (a spillover effect), because treated neighbors are less likely to transmit worms to them.
●​ Key Point: To reduce spillovers, it’s important to assign treatment in a way that
minimizes contact between the treatment and comparison groups. For example,
ensuring that treatment and comparison households are located far apart can help
isolate the program's effects.

4. Choosing the Right Level

●​ Optimal Level: Ideally, randomized assignment should be done at the lowest level
possible to maximize the sample size of the treatment and comparison groups, as long
as spillovers can be minimized.
●​ Example: In a community-level program, it's important to consider whether individuals
within the same community will affect each other’s outcomes, especially if the program is
designed to target specific individuals but the entire community ends up benefiting.

Key Point: When determining the level of randomized assignment, one must consider the
trade-offs between ensuring a large sample size and minimizing the risks of spillovers or
imperfect compliance. Randomizing at lower levels (such as individuals or households) can lead
to more accurate results but requires careful management to avoid these issues.
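A short sketch of what assignment at a higher (cluster) level looks like in practice: whole communities are randomized, and every household inherits its community's assignment, which limits spillovers between neighbors. Community and household names are placeholders.

```python
# Cluster-level randomization: randomize whole communities, then give every
# household the assignment of its community. All names are placeholders.

import random

communities = [f"village_{i:02d}" for i in range(1, 21)]    # 20 eligible communities
households = {v: [f"{v}_hh_{j}" for j in range(1, 51)] for v in communities}

random.seed(7)
treated_villages = set(random.sample(communities, k=10))    # half the clusters treated

# Every household inherits its community's assignment.
assignment = {
    hh: ("treatment" if v in treated_villages else "comparison")
    for v, hhs in households.items()
    for hh in hhs
}
```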

Estimating Impact under Randomized Assignment

Once the evaluation sample is selected and treatment is randomly assigned, estimating the
program's impact is straightforward. Here's a breakdown of how it works:

1.​ Measuring Outcomes: After the program has been implemented for a certain period,
you need to measure the outcomes for both the treatment group and the comparison
group. These groups are compared to determine the effect of the program.​

2.​ Calculating Impact: The impact of the program is simply the difference between the
average outcomes of the treatment and comparison groups.​

○​ Formula:​
Impact = Average Outcome (Treatment Group) - Average Outcome (Comparison
Group)​
○​ Example:​

■​ Average Outcome for Treatment Group: 100


■​ Average Outcome for Comparison Group: 80
■​ Impact of Program: 100 - 80 = 20​
In this example, the program’s impact is 20, meaning that the treatment
group experienced an outcome 20 units higher than the comparison
group.
3.​ Assumption:​
In the basic example, it’s assumed that everyone in the treatment group receives the
treatment, and no one in the comparison group does.​

4.​ Incomplete Compliance (Realistic Scenario):​


In real-world settings, not all units in the treatment group will necessarily receive the
treatment (imperfect compliance), and some units in the comparison group might gain
access to the program.​

○​ For example, in a teacher training program, it’s possible that not all teachers
assigned to the treatment group receive the training, or a teacher in the
comparison group may attend a training session.
5.​ Even in this case, randomized assignment still allows for an unbiased estimate of the
program's impact, though interpreting the results will require considering the degree of
compliance and crossover between groups.​

Key Point:

The basic estimation of impact is straightforward—subtracting the average outcome of the comparison group from the treatment group. However, real-world factors like incomplete
compliance may complicate interpretation, but randomized assignment still helps in obtaining an
unbiased estimate.
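To go one step beyond the point estimate, the sketch below computes the difference in means together with a two-sample t-test (assuming NumPy and SciPy are available); the simulated outcomes are drawn so their averages roughly match the 100 versus 80 example:

```python
# Difference in means plus a quick check of statistical significance,
# using simulated outcomes whose averages are close to 100 (treatment)
# and 80 (comparison).

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
treatment_outcomes = rng.normal(loc=100, scale=15, size=500)
comparison_outcomes = rng.normal(loc=80, scale=15, size=500)

impact = treatment_outcomes.mean() - comparison_outcomes.mean()   # roughly 20
t_stat, p_value = stats.ttest_ind(treatment_outcomes, comparison_outcomes)

print(f"Estimated impact: {impact:.1f}")
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```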

Instrumental Variables (IV)

Evaluating Programs When Not Everyone Complies with Their Assignment
1.​ Full Compliance Assumption:​

○	In randomized assignment (as discussed in Chapter 4), we assume perfect compliance—everyone assigned to the treatment group receives the treatment
and no one in the comparison group does.
○​ This is often easier in controlled settings like medical trials but is less realistic in
real-world social programs, where compliance can be imperfect.
2.​ Challenges in Real-World Programs:​

○	Voluntary Enrollment: Many programs allow people to choose whether to participate, making it difficult to ensure full compliance.
○​ Universal Coverage: Some programs have enough budget to cover the entire
eligible population, so excluding people for evaluation would be unethical.
3.​ Instrumental Variables (IV) Method:​

○	IV is used when there is imperfect compliance or voluntary participation in a program.
○​ Key Concept: An instrumental variable (IV) is an external factor that influences
a participant’s likelihood of receiving the treatment but is not controlled by the
participant and is unrelated to their characteristics.
4.​ Using IV for Program Evaluation:​

○	The IV method relies on an external source of variation to determine who participates in the program. This source of variation must meet certain conditions
to produce valid impact estimates.
○​ Randomized assignment can serve as a good instrument, as it satisfies the
necessary IV conditions, even when not all individuals comply with their assigned
treatment.
5.​ Applications of IV:​

○	Imperfect Compliance: IV can extend randomized assignment methods when not everyone complies with their treatment assignment.
○​ Voluntary or Universal Programs: IV can be used in programs that offer
voluntary enrollment or universal coverage, where random assignment or
exclusion isn’t feasible.

In summary, IV is a powerful tool for evaluating programs where compliance is not guaranteed
or where participants can choose their treatment. By using an external instrument (like random
assignment), researchers can still estimate program impacts effectively.

Types of Impact Estimates


In impact evaluation, the goal is to estimate the effect of a program by comparing outcomes
between a treatment group and a comparison group. Here’s a breakdown of key concepts:

1.​ Full Compliance Assumption:​

○	In ideal randomized trials, we assume full compliance: everyone in the treatment group participates and no one in the comparison group does.
○​ In such cases, we estimate the Average Treatment Effect (ATE), which is the
overall impact for the entire population.
2.​ Real-World Programs and Imperfect Compliance:​

○	In practice, full compliance is rare. In many programs, individuals can decide whether to participate, leading to imperfect compliance.
○​ In such cases, there are two key impact estimates:
■​ Intention-to-Treat (ITT): The impact of being offered the program,
regardless of whether individuals in the treatment group actually enroll.
■​ Treatment-on-the-Treated (TOT): The impact for those who actually
participate in the program.
3.​ Intention-to-Treat (ITT):​

○​ Definition: The ITT measures the difference in outcomes between the group
offered the treatment (treatment group) and the group not offered the treatment
(comparison group), even if not everyone in the treatment group actually receives
the treatment.
○​ Example: In the Health Insurance Subsidy Program (HISP), all households in
treatment villages were eligible for insurance, but only 90% enrolled. The ITT
compares the outcomes of all households in treatment villages (whether they
enrolled or not) with the outcomes in comparison villages (where no households
enrolled).
4.​ Treatment-on-the-Treated (TOT):​
○​ Definition: The TOT measures the impact on individuals who actually receive the
treatment. It is based only on those who participated, not just those offered the
program.
○​ Example: In the HISP, the TOT would estimate the impact for the 90% of
households in treatment villages that actually enrolled in the health insurance
program. It provides insight into the effect of receiving the treatment, rather than
just being offered it.
5.​ When ITT and TOT Differ:​

○​ If there is full compliance (everyone in the treatment group participates), ITT and
TOT estimates are the same.
○​ However, if there is non-compliance (not all offered the treatment participate),
ITT and TOT will differ because ITT includes both participants and
non-participants, while TOT only includes those who actually receive the
treatment.
6.​ Instrumental Variables Example - Sesame Street:​

○​ Study: Kearney and Levine (2015) used an instrumental variables (IV) approach
to evaluate the impact of the TV show Sesame Street on school readiness.
○​ Instrument: They used households’ proximity to a television tower (which
affected access to UHF channels) as an instrument for participation. The
distance to the tower wasn’t related to household characteristics, but it influenced
whether they could watch Sesame Street.
○​ Results: The study found that children in areas where the show was accessible
were more likely to advance through primary school on time, with notable effects
for African-American, non-Hispanic children, boys, and children from
economically disadvantaged backgrounds.

Key Takeaways:

● ITT estimates the impact of offering a program (irrespective of whether individuals participate).
● TOT estimates the impact on those who actually participate.
● Both estimates are important for understanding program effectiveness, especially when compliance is imperfect; the sketch below shows how the two relate under one-sided noncompliance.
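Under one-sided noncompliance (incomplete take-up in the treatment group, and no treatment in the comparison group), the TOT can be recovered by rescaling the ITT by the take-up rate. A minimal sketch in Python, reusing the 90 percent HISP take-up mentioned above but with a purely hypothetical ITT value:

```python
# Minimal sketch: recovering TOT from ITT under one-sided noncompliance.
# The 90% take-up rate echoes the HISP example above; the ITT value is
# purely hypothetical and used only for illustration.

itt = 9.0          # hypothetical ITT: effect of being *offered* the program
take_up = 0.90     # share of the treatment group that actually enrolled

# With no enrollment in the comparison group, non-enrollees in the treatment
# group contribute no effect, so the ITT is "diluted" by the take-up rate.
# Rescaling by take-up recovers the effect on actual enrollees.
tot = itt / take_up

print(f"ITT (offered the program): {itt:.2f}")
print(f"TOT (actually enrolled):   {tot:.2f}")   # 9.0 / 0.9 = 10.0
```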

Imperfect Compliance in Program Evaluation


In real-world evaluations, full compliance with program assignments (where everyone in the
treatment group receives treatment and no one in the comparison group does) is often not
achievable. Imperfect compliance can arise in various forms and impacts the estimation of
program effects. Below are key points related to imperfect compliance:
1.​ Full Compliance vs. Imperfect Compliance:​

○​ Full Compliance: All individuals assigned to the treatment group participate, and
none of the comparison group participates.
○​ Imperfect Compliance: Some individuals assigned to the treatment group may
not participate, or individuals from the comparison group may manage to
participate.
2.​ Impact Estimation with Imperfect Compliance:​

○ In the ideal scenario, we estimate the Average Treatment Effect (ATE) by comparing the treatment group’s outcomes with the comparison group’s outcomes, assuming full compliance.
3.​ Imperfect Compliance Case 1 - Non-Enrollment in the Treatment Group:​

○ Example: In a teacher-training program, some teachers assigned to the treatment group don’t show up for training.
○ Solution: In this case, we estimate the Treatment on the Treated (TOT), which calculates the impact for those who actually participate in the program. The TOT estimate focuses only on the teachers in the treatment group who actually attended the training.
4.​ Imperfect Compliance Case 2 - Comparison Group Receives Treatment:​

○ Example: In the teacher-training program, some teachers in the comparison group might somehow participate in the training.
○​ Problem: This leads to bias because the comparison group is no longer a valid
counterfactual. The average outcome for the comparison group is impacted by
some of its members receiving the treatment, making it impossible to estimate
the true counterfactual.
5.​ Intention-to-Treat (ITT) Estimate:​

○​ The ITT compares outcomes between the treatment group (those assigned
treatment, regardless of participation) and the comparison group (those assigned
no treatment, regardless of participation).
○​ Usefulness: ITT can still be useful for measuring the impact of offering the
program (especially when participants self-select) and is a common estimate
when noncompliance is primarily on the treatment side.
○​ Example: If some teachers in the treatment group don’t enroll, ITT compares the
outcomes of all teachers offered the training with the outcomes of the comparison
group.
6.​ Bias Due to Noncompliance in Comparison Group:​

○ If individuals in the comparison group receive the treatment, the comparison group's average outcomes are affected, leading to biased estimates.
○​ Example: Motivated teachers in the comparison group attending the training
could increase the average outcome in the comparison group, leading to an
underestimation of the treatment effect.
7.​ Local Average Treatment Effect (LATE):​

○ Definition: LATE is the impact on a specific group of individuals—the compliers—those who would have participated if assigned to the treatment group and would not have participated if assigned to the comparison group.
○​ When to Use LATE: LATE is particularly useful when both the treatment and
comparison groups have noncompliance. It focuses on the impact for the
subgroup that complied with their assignment.
○​ Example: In the teacher-training program, LATE would estimate the impact only
for those teachers in the treatment group who would have enrolled in the
program if assigned to the treatment group and would not have enrolled if
assigned to the comparison group.
8.​ TOT as a Special Case of LATE:​

○ TOT is essentially a LATE in cases where noncompliance is only on the treatment side (i.e., only in the treatment group). It estimates the impact for those who actually participated in the treatment program.

Summary:

●​ Imperfect Compliance occurs when individuals don’t fully adhere to their treatment or
comparison group assignments.
●​ Intention-to-Treat (ITT) measures the effect of being offered the program, regardless of
participation.
●​ Treatment-on-the-Treated (TOT) estimates the impact only for those who actually
participate in the program.
●​ Local Average Treatment Effect (LATE) provides the treatment effect for a specific
group (the compliers) and is used when there is noncompliance in both treatment and
comparison groups.

Randomized Assignment of a Program and Final Take-Up


This passage explains the evaluation of a job-training program when individuals are randomly
assigned to either the treatment (program) or comparison (no program) group. There are three
types of individuals:

1.​ Enroll-if-assigned: These individuals will enroll in the program if assigned to the
treatment group but will not enroll if assigned to the comparison group.
2.​ Never: These individuals will never enroll in the program, even if assigned to the
treatment group.
3.​ Always: These individuals will find a way to enroll in the program regardless of their
assignment, even if they are in the comparison group.

In the treatment group, the Enroll-if-assigned and Always individuals will enroll, while the
Never group will not. In the comparison group, the Always individuals will enroll, but the
Enroll-if-assigned and Never groups will not. The challenge in evaluating the program lies in
identifying these groups since some individuals can’t be easily distinguished based on their
behavior alone. This makes it difficult to measure the true impact of the program.

Estimating Impact under Randomized Assignment with Imperfect Compliance

In this passage, the focus is on estimating the Local Average Treatment Effect (LATE) in a
randomized program evaluation where imperfect compliance exists. This means individuals may
not fully comply with their assigned treatment group, either due to voluntary participation or
other factors.

The estimation process involves two steps:

1.​ Intention-to-Treat (ITT) Estimate: This is the first step where we simply compare the
outcomes (e.g., wages) between those assigned to the treatment group and those in the
comparison group, irrespective of whether they actually enrolled in the program. For
instance, if the treatment group’s average wage is $110 and the comparison group’s
average wage is $70, the ITT impact is $40.​

2.​ Local Average Treatment Effect (LATE) Estimate: The next step is to estimate the
impact of the program specifically for the Enroll-if-assigned group, those who would
only enroll if assigned to the treatment group. To do this, the ITT impact is adjusted for
the proportions of the three types of individuals (Never, Always, and Enroll-if-assigned).
For example, if 90% of the treatment group enrolls, and 10% do not (the Never group),
while 10% of the comparison group enrolls (the Always group), then the difference of
$40 in the ITT estimate must come from the 80% Enroll-if-assigned group. Thus, the
LATE for the Enroll-if-assigned group is $50, derived by adjusting the $40 ITT by the
80% enrollment rate.​

The key challenge in these evaluations is that it’s difficult to distinguish between the three
groups (Never, Always, Enroll-if-assigned) for individual participants, as enrollment decisions
are not always observable.
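A minimal sketch of this two-step calculation, plugging in the wage and enrollment figures from the worked example above:

```python
# Worked sketch of the two-step LATE calculation described above.

# Step 1: Intention-to-Treat (ITT) -- compare average outcomes by assignment.
avg_wage_treatment_group = 110.0   # everyone assigned to treatment, enrolled or not
avg_wage_comparison_group = 70.0
itt = avg_wage_treatment_group - avg_wage_comparison_group                # = 40

# Step 2: rescale by the share of compliers (Enroll-if-assigned).
enrollment_rate_treatment = 0.90    # Always + Enroll-if-assigned
enrollment_rate_comparison = 0.10   # Always only
complier_share = enrollment_rate_treatment - enrollment_rate_comparison  # = 0.80

late = itt / complier_share         # 40 / 0.8 = 50
print(f"ITT: {itt:.0f}")
print(f"LATE for the Enroll-if-assigned group: {late:.0f}")
```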
Instrumental Variables (IV) Approach

To estimate LATE effectively, we use randomized assignment as an Instrumental Variable (IV). The random assignment serves as a predictor for actual program enrollment, but is not
correlated with other factors (e.g., ability or motivation) that might influence outcomes. This IV
approach allows us to recover the LATE by isolating the impact of the program from other
potential confounding factors.

Example: School Voucher Program in Colombia

An example of using IV in practice is the PACES program in Colombia, where secondary school
vouchers were randomly assigned through a lottery. Researchers used the lottery outcome as
an IV to estimate the effect of the vouchers on educational and social outcomes. Even with
some noncompliance (e.g., 90% of lottery winners used the voucher), the randomized
assignment allowed for a reliable estimate of the treatment effect.

In summary, the randomized assignment helps estimate impacts despite imperfect compliance
by acting as an IV to predict enrollment and recover LATE. The final estimate provides insights
into the program’s effect on those who comply with their assignment to treatment.

Interpreting the Estimate of the Local Average Treatment Effect (LATE)

When evaluating the impact of a program, it's crucial to understand the difference between an
Average Treatment Effect (ATE) and a Local Average Treatment Effect (LATE). The LATE is
particularly important when considering imperfect compliance, where some individuals may
not follow their treatment assignment.

1.​ Understanding the Population of Interest:​

○​ The LATE estimate provides the impact of the program on a specific subgroup of
the population: those who comply with their assignment (Enroll-if-assigned).
○​ These compliers are different from Never and Always types. The Never group
(those who do not participate even if assigned to the treatment group) may
include people who expect little benefit from the program. The Always group
(those who would enroll even if assigned to the comparison group) may include
highly motivated individuals who are likely to benefit the most from participation.
2.​ For example, in a teacher-training program, the Never group might consist of teachers
who feel they don’t need training, have a higher opportunity cost (like a second job), or
face less supervision. On the other hand, the Always group might include teachers who
are highly motivated or are under strict supervision, making them more likely to enroll in
the training even if they were assigned to the comparison group.​

3.​ Limitations of the LATE Estimate:​

○​ The LATE estimate applies only to the Enroll-if-assigned group, and does not
reflect the impact on the Never or Always groups.
○​ For example, if the ministry of education offers a second round of teacher training
and forces the Never group to participate, we do not know how their outcomes
would compare to those in the first round. Similarly, the LATE estimate does not
provide insights into the impact on the Always group (the most self-motivated
teachers).
4.​ The LATE estimate should not be generalized to the entire population, as it only applies
to the subgroup of individuals who would participate if assigned to the treatment group.​

In summary, the LATE estimate gives the program’s impact only for the compliers—those who
enroll in the program if assigned to the treatment group—but does not apply to those who never
enroll or those who always find a way to participate. Therefore, the LATE estimate is not
representative of the entire population, and its interpretation should be confined to the specific
group of compliers.

Randomized Promotion as an Instrumental Variable


In the previous section, we explored how randomized assignment can be used to estimate
impact, even with imperfect compliance. Here, we propose a similar approach for evaluating
programs with universal eligibility, open enrollment, or when the program administrators
cannot control who participates and who does not. This approach, called randomized
promotion (or encouragement design), introduces an additional incentive for a random set
of individuals to enroll in the program. This promotion serves as an instrumental variable (IV),
providing an external source of variation that affects the probability of receiving the treatment
but is unrelated to the participants’ characteristics.

Voluntary Enrollment Programs and Noncompliance Types

In a voluntary enrollment program, individuals who are interested can choose to enroll. For
example, consider a job-training program where individuals can enroll freely. But since not all
will choose to participate, we encounter different types of individuals:

●​ Always: These are individuals who will enroll in the program regardless of any external
influence.
●​ Never: These individuals will never enroll in the program, regardless of external
incentives.
●​ Compliers or Enroll-if-promoted: These individuals will enroll only if encouraged or
promoted, such as through an additional incentive or outreach. Without the incentive,
they would not enroll in the program.

Example of Randomized Promotion in a Job-Training Program

Imagine a job-training program with an open enrollment policy where anyone can sign up.
However, many unemployed individuals may not know about the program or may lack the
incentive to participate. To address this, an outreach worker is hired to randomly visit a subset
of unemployed people and encourage them to enroll in the program.

●​ The outreach worker does not force participation but instead incentivizes a random
group to participate.
●​ The non-visited group is also free to enroll, but they have to seek out the program on
their own.
●​ If the outreach effort works, those who are visited by the outreach worker are more likely
to enroll than those who are not visited.

Using Randomized Promotion to Estimate Impact

To evaluate the impact of the job-training program, we cannot simply compare those who
enrolled with those who did not. The enrollees are likely different from the non-enrollees in ways
that affect their outcomes, such as education or motivation.

However, since the outreach worker's visits are randomized, we can compare the group that
was visited (promoted) with the group that was not visited (non-promoted). This random
assignment helps us create a valid comparison group because:

●​ Both groups (promoted and non-promoted) contain individuals who are Always enrolled
and individuals who are Never enrolled, based on their individual characteristics.
●​ The key difference is that in the promoted group, individuals who are compliers
(Enroll-if-promoted) are more likely to enroll because of the extra encouragement, while
the non-promoted group has these same individuals, but without the added incentive to
participate.

The variation between the two groups, with one group being encouraged to enroll, allows us to
estimate the Local Average Treatment Effect (LATE). Specifically, the LATE estimate tells us
the impact of the program on the Enroll-if-promoted group, which is the group that only enrolls
because of the random promotion.

Conditions for Valid Randomized Promotion

For this approach to be valid, several conditions need to be met:


1.​ Effectiveness of Promotion: The outreach efforts should significantly increase
enrollment for the Enroll-if-promoted group. If the promotion does not effectively
encourage enrollment, the groups will not differ enough to generate a meaningful
estimate of impact.
2.​ Independence of Promotion from Final Outcomes: The promotion itself should not
directly affect the outcomes of interest (e.g., earnings) since we are interested in the
impact of the training program, not the outreach strategy itself. If the promotion includes
incentives (e.g., money to enroll), this could bias the outcome.

LATE Estimate from Randomized Promotion

Randomized promotion creates a random difference between the promoted and non-promoted
groups, making it an effective instrumental variable (IV). This helps us estimate the impact of
the program on the compliers (Enroll-if-promoted), but the result is still a LATE estimate. Just
like in randomized assignment with imperfect compliance, this estimate applies only to the
specific subgroup of individuals who are compliers and should not be generalized to the whole
population. The Always and Never groups, who behave differently, are not included in the
estimate.

In summary, randomized promotion is a strategy that can be used when a program has open
enrollment and it is possible to randomly encourage some individuals to participate. This
strategy allows us to use random promotion as an instrumental variable to estimate impact in an
unbiased way. However, as with randomized assignment with imperfect compliance, the impact
evaluation based on randomized promotion provides a LATE estimate, which is a local estimate
of the impact on a specific subgroup of the population, the Enroll-if-promoted group. This
estimate cannot be directly extrapolated to the entire population, as it does not account for the
Always or Never groups.

You Said “Promotion”?

Randomized promotion aims to increase participation in a voluntary program by encouraging a randomly selected subsample of the population. The promotion can take different forms, such as:

●​ Information campaigns: Reaching individuals who didn’t enroll because they were
unaware or didn't fully understand the program's content.
●​ Incentives: Offering small gifts, prizes, or transportation to motivate enrollment.

This strategy relies on the instrumental variable (IV) method to provide unbiased estimates of
program impact. It randomly assigns an encouragement to participate in the program, which
helps evaluate programs that are open to anyone eligible.

Key Concept

Randomized promotion is an instrumental variable method that allows for unbiased estimation
of program impact. It randomly encourages a selected group to participate, making it especially
useful for evaluating programs with open eligibility.

Conditions for Valid Randomized Promotion

For randomized promotion to provide a valid estimate of the program’s impact, several
conditions must be met:

1.​ Promoted and Nonpromoted Groups Must Be Similar:


○​ The average characteristics of the promoted and nonpromoted groups must be
statistically equivalent. This equivalence is ensured by randomly assigning
promotion to the individuals in the evaluation sample.
2.​ The Promotion Should Not Directly Affect Outcomes:
○​ The promotion itself should not influence the outcomes of interest. This ensures
that any observed changes in outcomes are due to the program itself and not the
promotion.
3.​ Promotion Must Substantially Increase Enrollment:
○​ The promotion must lead to a substantial increase in enrollment among the
promoted group compared to the nonpromoted group. This can be verified by
comparing enrollment rates between the two groups.

The Randomized Promotion Process

The process of randomized promotion involves several steps:

1.​ Define Eligible Units: Identify the individuals eligible for the program.​

2.​ Select the Evaluation Sample: Randomly select individuals from the population to be
included in the evaluation. This can be a subset of the population or, in some cases, the
entire population if data is available.​

3. Randomize Promotion: Randomly assign individuals to the promoted or nonpromoted groups. This random assignment ensures that both groups are equivalent in terms of characteristics.

4.​ Enrollment: After the promotion campaign, observe who enrolls in the program.​

In the nonpromoted group, only individuals in the Always category will enroll. However, it’s not
possible to distinguish between the Never and Enroll-if-promoted groups because they both
do not enroll.
In the promoted group, both Enroll-if-promoted and Always individuals will enroll, while the
Never individuals will not. In this group, we can identify the Never group, but we cannot
distinguish between Enroll-if-promoted and Always individuals.

Types of Units in the Population

Once the population is identified, we can classify units into three groups:

1.​ Always: Individuals who will always enroll in the program, regardless of promotion.
2.​ Enroll-if-promoted: Individuals who will only enroll if they receive additional promotion
or encouragement.
3.​ Never: Individuals who will never enroll in the program, even if promoted.

These types—Always, Enroll-if-promoted, and Never—are intrinsic characteristics of individuals, often related to factors like motivation and information, and cannot easily be observed by the program evaluation team.

Summary

Randomized promotion provides a creative way to evaluate voluntary programs by encouraging a random subset of individuals to participate. By ensuring the promotion is random
and effective, it allows us to identify the Local Average Treatment Effect (LATE) for the
Enroll-if-promoted group. However, as with other instrumental variable methods, the estimate
applies only to this specific subgroup, and conclusions cannot be directly extrapolated to the
entire population. This approach is especially useful when random assignment to the program is
not possible.

Estimating Impact under Randomized Promotion

Imagine a scenario where we are evaluating a program using randomized promotion. Suppose
there are 10 individuals per group in a study. In the nonpromoted group, 30% of individuals
enroll (which means 3 individuals, all of whom are "Always" enrollees). In the promoted group,
80% of individuals enroll (which means 3 "Always" individuals and 5 "Enroll-if-promoted"
individuals).

Step 1: Compute the Difference in Outcomes

●​ The average outcome in the nonpromoted group is 70, and in the promoted group, it is
110.
●​ The difference between the average outcomes is 40 (110 - 70).

Now, we can attribute the difference of 40 in outcomes:


●​ Never group: We know that this group does not enroll in either the promoted or
nonpromoted group, so they don't contribute to this difference.
●​ Always group: Since the Always group enrolls in both the promoted and nonpromoted
groups, they also do not contribute to the difference.
●​ Enroll-if-promoted group: This is the group that only enrolls when promoted, and they
are the ones responsible for the 40-point difference.

Step 2: Calculate the Local Average Treatment Effect (LATE)

To estimate the LATE, we need to understand that the Enroll-if-promoted group represents
50% of the population in the promoted group (5 out of 10). The impact on this group is
calculated by dividing the total difference (40) by the percentage of individuals in the population
who are Enroll-if-promoted (50% or 0.5).

●​ The LATE is 40 / 0.5 = 80.

Thus, the Local Average Treatment Effect (LATE) for the Enroll-if-promoted group is 80.

This LATE estimate is valid because the promotion was assigned randomly, ensuring that the
promoted and nonpromoted groups have similar characteristics. Therefore, the observed
differences in outcomes can be attributed to the program's impact on the Enroll-if-promoted
individuals.
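A minimal sketch of this calculation, using the enrollment rates and outcomes from the example above:

```python
# Sketch of the LATE calculation under randomized promotion, using the
# figures from the worked example above.

avg_outcome_promoted = 110.0
avg_outcome_nonpromoted = 70.0
diff_outcomes = avg_outcome_promoted - avg_outcome_nonpromoted        # = 40

enroll_rate_promoted = 0.80      # Always + Enroll-if-promoted
enroll_rate_nonpromoted = 0.30   # Always only
diff_enrollment = enroll_rate_promoted - enroll_rate_nonpromoted      # = 0.50

# The promotion acts as the instrument: dividing the outcome difference by
# the enrollment difference yields the LATE for the Enroll-if-promoted group.
late = diff_outcomes / diff_enrollment                                 # 40 / 0.5 = 80
print(f"LATE for the Enroll-if-promoted group: {late:.0f}")
```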

Important Notes

●​ The impact calculated here is specific to the Enroll-if-promoted group. This estimate
cannot be directly extrapolated to other groups (like the Never or Always groups)
because they are likely to be very different from the Enroll-if-promoted group in terms
of their characteristics, such as motivation or information.
●​ In this case, while the promoted group showed a higher average outcome (110), this
increase in outcomes is entirely due to the individuals who enrolled because of the
promotion. The Always and Never groups did not contribute to this impact.

Box 5.3: Randomized Promotion of Education Infrastructure Investments in Bolivia

In 1991, Bolivia scaled up a successful Social Investment Fund (SIF) aimed at improving rural
infrastructure, including education, health, and water. As part of the impact evaluation for the
education component, randomized promotion was used to encourage communities in the
Chaco region to apply for funding.

●​ Promoted communities received extra visits and encouragement from program staff.
●​ Non-promoted communities could apply independently.
The evaluation showed that the program succeeded in improving the physical infrastructure
of schools (e.g., electricity, sanitation, and textbooks), but it had little effect on educational
outcomes. However, there was a small reduction (about 2.5%) in the dropout rate.

The use of randomized promotion provided valuable insights into how physical infrastructure
improvements affect school quality. These findings helped adjust future priorities in Bolivia’s
education investment strategy.

Limitations of the Randomized Promotion Method

While randomized promotion is a useful strategy for evaluating voluntary or universally eligible programs, it has certain limitations:

1.​ Effectiveness of Promotion:​

○ The promotion must increase enrollment effectively. If the promotion doesn’t significantly change enrollment rates, there will be no difference between the promoted and nonpromoted groups, making it impossible to detect any impact.
○​ The promotion campaign needs careful planning and piloting to ensure that it
actually works as intended.
2.​ Estimates Only for a Subset of the Population:​

○​ The LATE (Local Average Treatment Effect) estimates are only for those
individuals who enroll in the program only when encouraged (the
Enroll-if-promoted group). This is a subset of the entire population.
○​ If the program's goal is to help people who would enroll without encouragement
(the Always group), the randomized promotion method will not estimate impacts
for this group. The LATE estimate applies only to individuals who enroll when
encouraged.
○​ In some cases, the Always group may be the target group for the program,
meaning that the randomized promotion approach won't fully capture the impact
on them.

Conclusion

Randomized promotion is an effective strategy for evaluating voluntary programs, especially when random assignment to treatment is not feasible. By creating an instrumental variable (IV)
that increases the likelihood of participation, it allows evaluators to estimate the Local Average
Treatment Effect (LATE) for the Enroll-if-promoted subgroup. However, it's important to
recognize that the estimates from this method are specific to this subgroup and may not apply to
the broader population, particularly those who always or never enroll in the program.

Regression Discontinuity Design

Regression Discontinuity Design (RDD) – A Clear Explanation

What is Regression Discontinuity Design (RDD)?

Regression Discontinuity Design (RDD) is an evaluation method used to measure the causal
impact of programs or interventions when eligibility is determined by a threshold on a
continuous variable (e.g., income, test score, age). Essentially, this method takes advantage of
a situation where the eligibility for a program is based on whether a certain value crosses a
specific cutoff. It compares people who are just above and just below this cutoff to assess the
impact of the program.

RDD is particularly useful because it helps evaluate program effectiveness in situations where
random assignment (like in randomized controlled trials) is not feasible.

How Does RDD Work?

Here’s the general process:

1.​ Continuous Eligibility Index: The program or policy uses a continuous index to
determine eligibility. This could be a score, income level, age, or any other measurable
factor.​

○ Example: A program targeting people below a certain poverty score or students who score above a specific test score.

2.​ Threshold or Cutoff: A specific value in the eligibility index is defined as the cutoff.
Individuals just below this value may be eligible for the program, while those just above
may not be.​

○ Example: For a scholarship, only students who score 90 or higher in an exam may qualify. Students with a score of 89 are not eligible.

3.​ Comparison of Groups: The individuals just above the cutoff are very similar to those
just below it, except for their eligibility for the program. RDD compares these two groups
(treated vs untreated) to determine the causal impact of the program.​

Main Conditions for RDD

To use RDD effectively, the following conditions must be met:


1.​ Smooth (Continuous) Index: The eligibility index must be continuous or smooth. For
example, a test score or income is continuous because it can take many values (e.g.,
80.5, 81, 81.1). In contrast, discrete variables like employment status or car ownership
are not suitable for RDD.​

2.​ Clearly Defined Cutoff: There must be a clear, unambiguous cutoff that separates
those who are eligible from those who are not. For instance, if the poverty index is used,
only households with scores below 50 are considered eligible.​

3.​ Unique Cutoff for the Program: The cutoff should only be used for the program being
evaluated. If the same threshold is used for multiple programs, it could confuse the
impact measurement for a single program.​

4.​ Non-Manipulability of the Score: The score that determines eligibility should not be
easily manipulated. This ensures that the assignment to treatment (program
participation) is random around the cutoff, making it possible to draw valid conclusions.​

Example: Fertilizer Subsidy for Farmers

Let’s use the example of an agriculture program targeting small farms:

●​ Eligibility: The program provides fertilizer subsidies to farms with fewer than 50
hectares of land.​

●​ Index: The number of hectares a farm has is the continuous eligibility index.​

●​ Cutoff: Farms with fewer than 50 hectares qualify for the subsidy, while those with 50
hectares or more do not.​

Now, let’s say we have:

●​ Farms with 48, 49, and 49.9 hectares that are eligible for the subsidy.​

●​ Farms with 50, 50.1, and 50.2 hectares that are ineligible.​

RDD would compare the outcomes (e.g., rice yields) of farms just below the cutoff (49.9
hectares) with those just above the cutoff (50.1 hectares). Since these farms are very similar in
all aspects except for the subsidy (fertilizer), the difference in their outcomes can be attributed to
the impact of the fertilizer subsidy itself.

Impact Measurement:
●​ The average rice yield for farms just below 50 hectares is compared to those just above
50 hectares.​

●​ Any difference in rice yield between these groups is considered the effect of the fertilizer
subsidy.​
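A minimal sketch of this comparison on simulated data (the farm sizes, yields, effect size, and bandwidth below are all hypothetical and chosen only to illustrate the mechanics):

```python
import numpy as np

# Sketch of a sharp RDD comparison around the 50-hectare cutoff on simulated
# data. All numbers are invented for illustration; this is not program data.
rng = np.random.default_rng(0)

n = 2000
hectares = rng.uniform(20, 80, n)              # continuous eligibility index
eligible = hectares < 50                       # sharp cutoff at 50 hectares
true_effect = 0.4                              # hypothetical effect of the subsidy on yield

# In this simulation, yield varies mildly with farm size plus noise;
# eligible farms additionally receive the subsidy effect.
rice_yield = 5.0 - 0.02 * hectares + true_effect * eligible + rng.normal(0, 0.3, n)

# Compare farms just below and just above the cutoff (bandwidth of 2 hectares).
bandwidth = 2.0
just_below = (hectares >= 50 - bandwidth) & (hectares < 50)
just_above = (hectares >= 50) & (hectares < 50 + bandwidth)

impact = rice_yield[just_below].mean() - rice_yield[just_above].mean()
print(f"Estimated impact at the cutoff: {impact:.2f} (true value {true_effect})")
```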

Visualizing the Example:

● Baseline (before the program): Rice yields are plotted against the number of hectares of land. At baseline, yields vary smoothly with farm size, with no jump at the 50-hectare cutoff.

●​ Follow-up (after the program): You compare the yield after the subsidy is given. The
farms that received the subsidy (those just under 50 hectares) may show a noticeable
increase in rice yields, while those just above the cutoff (ineligible farms) do not.​

Key Insights from the RDD:

●​ Local Average Treatment Effect (LATE): The impact estimated by RDD is valid only
near the cutoff (around 50 hectares). So, we can be confident in the results for
medium-sized farms just below the cutoff, but the results may not apply to very small
farms (e.g., 10 or 20 hectares).​

●​ No Need for Control Group: Since the program rules assign eligibility strictly based on
the cutoff, there’s no need for a traditional control group in this evaluation. The
comparison group (farms just above the cutoff) acts as a valid counterfactual.​

Advantages and Limitations of RDD:

Advantages:

●​ Causal Inference: RDD is one of the best quasi-experimental methods for estimating
causal effects because it uses a natural cutoff to compare very similar individuals.​

● No Randomization Needed: RDD can be used in cases where randomized controlled trials are not feasible.

Limitations:

●​ Local Results: The impact estimated by RDD is local to the region around the cutoff,
which means it may not be generalized to all potential participants (e.g., very small or
very large farms).​

●​ Data Requirements: RDD requires a large number of observations near the cutoff to
provide accurate estimates.​

Conclusion

Regression Discontinuity Design (RDD) is a powerful tool for evaluating programs that use a
clear eligibility index and cutoff. By comparing individuals or units just above and just below the
cutoff, RDD allows researchers to estimate the causal impact of the program. However, the
results are most reliable for the "local" area around the cutoff, and the method assumes that the
index cannot be manipulated.

Example Recap:

In the fertilizer subsidy program for farms:

●​ Farms just under 50 hectares receive subsidies (treatment group).​

●​ Farms just over 50 hectares do not receive subsidies (comparison group).​

●​ The difference in rice yields between these groups is attributed to the fertilizer subsidy,
giving us an estimate of the program’s impact.​

Fuzzy Regression Discontinuity Design (RDD)


The concept of Fuzzy Regression Discontinuity Design (RDD) builds on the core idea of
sharp RDD but accounts for the possibility of noncompliance on either side of the cutoff. In
sharp RDD, units comply strictly with the eligibility rules, meaning those on one side of the cutoff
receive the treatment, and those on the other side do not. However, in fuzzy RDD, some units
that qualify for the program may opt out, and others who are not eligible might somehow
manage to participate.

Key Differences between Sharp and Fuzzy RDD:

●​ Sharp RDD: Full compliance with the treatment assignment based on the cutoff. If a unit
is eligible, they must participate.​

●​ Fuzzy RDD: Some units do not comply with the eligibility assignment. For example,
those who qualify may choose not to participate, and some who do not qualify might find
a way to participate. In this case, we apply the instrumental variable (IV) approach to
account for this noncompliance.​

The instrumental variable in a fuzzy RDD is eligibility as determined by the cutoff on the index (for instance, falling below a poverty-index threshold). This instrument helps identify the local average treatment effect (LATE), which is valid only for the subpopulation around the cutoff that complies with the treatment assignment.
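In practice, the fuzzy-RDD estimate at the cutoff amounts to dividing the jump in the outcome at the cutoff by the jump in the probability of participation, in the spirit of the IV logic above. A minimal sketch with purely hypothetical numbers:

```python
# Sketch of the fuzzy-RDD logic with hypothetical numbers (not from the text):
# the outcome discontinuity at the cutoff is scaled by the discontinuity in
# the probability of actually participating (the "first stage").

jump_in_outcome = 6.0          # outcome jump at the eligibility cutoff
jump_in_participation = 0.75   # participation rises by 75 percentage points at the cutoff

late_at_cutoff = jump_in_outcome / jump_in_participation
print(f"Fuzzy-RDD LATE at the cutoff: {late_at_cutoff:.2f}")
```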

Example: Social Safety Net in Jamaica

The Jamaica PATH program provides a clear example of using RDD to evaluate the
effectiveness of a social safety net program targeting low-income households. Here's how the
researchers evaluated the program:

1.​ Eligibility Based on a Poverty Index:​


The program used a poverty index to identify eligible households. If a household’s index
score was below a certain threshold, they were eligible for the program.​

2.​ RDD Application:​


The researchers applied RDD by comparing households just below and just above the
eligibility threshold. They found that the treatment group (households eligible for the
program) experienced an increase in school attendance and health care visits.​

3.​ Fuzzy RDD Challenges:​


If some households did not follow the treatment assignment (noncompliance), this could
affect the validity of the evaluation. Therefore, researchers might have had to use fuzzy
RDD and account for this noncompliance using the eligibility index as an instrumental
variable.​

Validating RDD

Before using RDD, it is crucial to verify that there is no manipulation of the eligibility index.
Manipulation might occur if individuals or administrators adjust the index to gain access to the
program.

1.​ Density Tests: By plotting the distribution of the eligibility index, researchers can check
for signs of manipulation. If there is a bunching of units just below the cutoff and a
scarcity just above, that might suggest manipulation (e.g., people reporting lower poverty
scores to qualify for benefits).​

2.​ Participation Tests: Checking the relationship between the eligibility index and actual
program participation helps confirm whether the program was administered as planned.​

Example: Health Insurance Subsidy Program (HISP)

The HISP study offers another example where RDD was used to evaluate the impact of a
health insurance subsidy program. Here's the process:

1.​ Eligibility Criteria: A poverty index with a cutoff score of 58 determines who is eligible
for the health insurance subsidy. Households with scores below 58 are considered poor
and eligible for the program.​

2.​ Density and Participation Check: No manipulation is found around the cutoff, as the
density of households across the poverty index is smooth, and only households below
the cutoff participate in the program.​

3.​ RDD Evaluation:​

○ The follow-up analysis shows a discontinuity at the cutoff (poverty index of 58), indicating that those just below the cutoff have significantly lower health expenditures as a result of the subsidy.

○ A regression analysis further confirms the impact of the program, showing a reduction of $9.03 in health expenditures for eligible households (significant at the 1% level).

Graphical Representation of RDD:

●​ Figure 6.4: Displays the potential manipulation of the eligibility index.​

○​ In Panel A (no manipulation), the distribution of the eligibility index is smooth.​

○​ In Panel B (with manipulation), you can see a bunching effect, indicating that
some households might have manipulated their eligibility score to qualify for the
program.​

Conclusion

RDD, whether sharp or fuzzy, provides a robust framework for evaluating programs that use
eligibility thresholds. Fuzzy RDD is particularly useful when there’s noncompliance with
treatment assignment. By using the instrumental variable approach, researchers can estimate
the local average treatment effect (LATE) for the population near the cutoff.

In practice, fuzzy RDD is applied when noncompliance is suspected. In contrast, sharp RDD is
valid when there is strict compliance, meaning the eligibility criteria are strictly adhered to. Both
designs require thorough checks for manipulation and careful validation of program
participation.

Limitations and Interpretation of the Regression Discontinuity Design (RDD) Method

Regression Discontinuity Design (RDD) is a powerful quasi-experimental method for estimating the local average treatment effect (LATE) at the eligibility cutoff. However, like any evaluation technique, RDD has its limitations and specific conditions for optimal application.

1. Local Average Treatment Effect (LATE)

RDD provides an estimate of the treatment effect specifically for the group of individuals
around the cutoff (the local population), rather than the entire population. This can be a strength
or a limitation, depending on the policy question:

●​ Strength: If the policy question is about marginal decision-making (e.g., Should the
program be expanded or contracted near the eligibility cutoff?), then RDD gives the
exact estimate needed.​

●​ Limitation: If the question is about the overall effectiveness of the program for the
entire population, RDD may not provide a representative estimate, as it only applies to
those near the cutoff.​

Interpretation Issue: The generalizability of the results is limited to those close to the cutoff
score. Individuals far from the cutoff may have different characteristics or responses to the
program, which makes extrapolating the results less reliable for the broader population.

2. Imperfect Compliance and Fuzzy RDD

Another key challenge arises when there is noncompliance with the assignment rule. This
occurs when individuals who are supposed to receive the treatment (based on the eligibility
index) do not participate, or when individuals who are supposed to be in the control group
manage to participate in the program. This leads to fuzzy RDD, where the eligibility index
becomes an instrumental variable for participation in the program.
●​ Instrumental variable methodology: In this case, the eligibility cutoff serves as an
instrument for whether individuals receive the treatment. However, this means that the
estimated treatment effect only applies to those marginally compliant with the eligibility
rule (i.e., those close to the cutoff), rather than the broader population.​

Interpretation Issue: The findings are localized to those on the margin of eligibility and may
not apply to those who are always compliant (always-takers) or never compliant (never-takers).

3. Statistical Power and Bandwidth Choice

RDD typically relies on a smaller sample of units close to the cutoff, which can lower the
statistical power of the analysis compared to methods that use larger samples (e.g.,
randomized controlled trials). To address this, researchers must choose an appropriate
bandwidth around the cutoff:

●​ A larger bandwidth may increase sample size, but it could also introduce greater
heterogeneity between treatment and comparison units, potentially biasing the results.​

●​ A smaller bandwidth may lead to fewer observations, reducing the power of the analysis.​

Practical Tip: To mitigate this challenge, researchers often perform robustness checks by
testing the results using different bandwidths. This helps assess the sensitivity of the estimates
to the choice of bandwidth.
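A minimal sketch of such a robustness check: re-estimate the simple below-versus-above comparison for several bandwidths and see how stable the estimate is (simulated data; all numbers are hypothetical):

```python
import numpy as np

# Sketch of a bandwidth robustness check for an RDD estimate on simulated
# data. `index` is the eligibility index, `outcome` the outcome of interest.

def rdd_estimate(index, outcome, cutoff, bandwidth):
    """Difference in mean outcomes just below vs. just above the cutoff."""
    below = (index >= cutoff - bandwidth) & (index < cutoff)
    above = (index >= cutoff) & (index < cutoff + bandwidth)
    return outcome[below].mean() - outcome[above].mean()

rng = np.random.default_rng(1)
index = rng.uniform(0, 100, 5000)
# Hypothetical data: a downward trend plus a true jump of 2.0 below the cutoff of 50.
outcome = 10 - 0.05 * index + 2.0 * (index < 50) + rng.normal(0, 1, 5000)

# Wider bandwidths add observations but mix in units farther from the cutoff.
for bw in (1, 2, 5, 10):
    print(f"bandwidth = {bw:>2}: estimate = {rdd_estimate(index, outcome, 50, bw):.2f}")
```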

4. Functional Form Sensitivity

RDD relies on a regression model to estimate the treatment effect, and the functional form of
the relationship between the eligibility index and the outcome of interest plays a crucial role. If
the relationship is non-linear, but the model assumes a linear form, it could lead to incorrect
conclusions.

Practical Tip: Researchers should test the sensitivity of their results to different functional forms
(e.g., linear, quadratic, cubic) to ensure the robustness of their estimates. Failure to account for
complex relationships can lead to incorrect conclusions about the existence and magnitude of
a discontinuity at the cutoff.
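A sketch of a functional-form check along the same lines: fit polynomials of increasing degree on each side of the cutoff and compare the implied discontinuity (again on simulated, hypothetical data; a real analysis would also report standard errors):

```python
import numpy as np

# Sketch of a functional-form sensitivity check on simulated data.
rng = np.random.default_rng(4)
index = rng.uniform(0, 100, 5000)
cutoff = 50
# Hypothetical data: a mildly curved relationship plus a true jump of 2.0 below the cutoff.
outcome = (10 - 0.05 * index + 0.001 * (index - cutoff) ** 2
           + 2.0 * (index < cutoff) + rng.normal(0, 1, 5000))

def jump_at_cutoff(degree):
    """Fit a polynomial of the given degree on each side and evaluate the gap at the cutoff."""
    below, above = index < cutoff, index >= cutoff
    fit_below = np.polyfit(index[below], outcome[below], degree)
    fit_above = np.polyfit(index[above], outcome[above], degree)
    return np.polyval(fit_below, cutoff) - np.polyval(fit_above, cutoff)

for degree in (1, 2, 3):
    print(f"degree {degree}: estimated discontinuity = {jump_at_cutoff(degree):.2f}")
```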

5. Manipulation of the Eligibility Rule

For RDD to provide valid results, the eligibility rule and cutoff must be precisely defined and
resistant to manipulation. If the eligibility index can be manipulated by program participants,
enumerators, or other stakeholders (e.g., by altering reported values of assets or income), this
could produce a discontinuity in the distribution of the eligibility index around the cutoff, undermining the assumptions of the RDD.
Example of Manipulation: In some cases, if participants know that a small adjustment (e.g., a
minor change in reported income or assets) could make them eligible for the program, they
might manipulate their eligibility score. This results in a bunched distribution of scores just
below the cutoff, which would undermine the validity of the RDD.

Practical Tip: Researchers should test for manipulation by checking the distribution of the
eligibility index around the cutoff (e.g., using density tests) to ensure there is no unusual
concentration of participants just below the cutoff.
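A rough sketch of such a density check: count observations in narrow bins just below and just above the cutoff and flag obvious bunching (a formal test, such as McCrary's density test, would be preferable in practice; the data here are simulated and hypothetical):

```python
import numpy as np

# Rough bunching check around the cutoff on simulated data. `index` holds
# eligibility scores; a ratio well above 1 suggests bunching just below the cutoff.

def bunching_ratio(index, cutoff, width=1.0):
    just_below = np.sum((index >= cutoff - width) & (index < cutoff))
    just_above = np.sum((index >= cutoff) & (index < cutoff + width))
    return just_below / max(just_above, 1)

rng = np.random.default_rng(2)
clean_scores = rng.uniform(0, 100, 10_000)        # smooth index, no manipulation
# Simulate manipulation: 3% of units shift their score to just below the cutoff of 50.
manipulated = np.where(rng.random(10_000) < 0.03,
                       rng.uniform(48, 50, 10_000),
                       clean_scores)

print(f"no manipulation:   ratio = {bunching_ratio(clean_scores, 50):.2f}")   # close to 1
print(f"with manipulation: ratio = {bunching_ratio(manipulated, 50):.2f}")    # well above 1
```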

6. Uniqueness of the Eligibility Rule

RDD works best when the eligibility rule is specific and unique to the program being evaluated.
If the same eligibility index is used for multiple programs (e.g., multiple welfare or poverty
programs), it becomes difficult to isolate the effect of one program from the effects of others.
This issue arises when targeting rules overlap, and the eligibility cutoff might not be unique to a
single program.

Interpretation Issue: When eligibility rules are not unique, it becomes challenging to attribute
the observed effect to the program of interest. Multiple programs targeting the same individuals
can confound the results.

Summary of RDD Limitations:

1.​ Locality of Estimates: The estimates apply to those near the cutoff, not the entire
population.​

2.​ Noncompliance: Fuzzy RDD requires an instrumental variable approach, but the results
are only relevant for those who comply with the eligibility rule.​

3.​ Statistical Power: A small sample size near the cutoff can reduce statistical power, and
bandwidth selection is crucial.​

4.​ Sensitivity to Functional Form: Incorrect functional forms can distort results;
robustness checks are essential.​

5.​ Manipulation: Manipulation of eligibility can undermine RDD validity.​

6.​ Eligibility Rule Uniqueness: If the eligibility rule is shared across multiple programs, it’s
hard to isolate the effect of a specific program.​

Difference-in-Differences

The Difference-in-Differences (DD) method is a technique used in impact evaluation when a
program is implemented, but there is no clear rule for assignment or randomization. It is typically
employed when the program's assignment rules are less transparent or not feasible for more
precise methods like randomized controlled trials (RCTs), instrumental variables (IV), or
regression discontinuity design (RDD). This method uses two groups: a treatment group (those
who receive the program) and a comparison group (those who do not). The method compares
the changes in outcomes over time between these two groups.

Key Concepts:

1.​ Treatment Group: The group of individuals or entities receiving the program or
intervention.​

2.​ Comparison Group: The group of individuals or entities that do not receive the
program, but otherwise face similar conditions.​

3. Before-and-After Comparison: A basic method where we compare the outcomes before and after the intervention for both the treatment and comparison groups.

4.​ Counterfactual Estimate: The DD method uses the comparison group to estimate what
would have happened to the treatment group if they had not received the intervention.​

Steps in the Difference-in-Differences Method:

1.​ Measure Outcomes Before and After the Program:​

○​ For both the treatment and comparison groups, the outcome of interest (e.g.,
employment rate) is measured before and after the intervention.​

2.​ Calculate Changes in Outcomes:​

○​ For the treatment group: Measure the change in the outcome from before to
after the intervention. This is denoted as (B - A).​
○​ For the comparison group: Measure the change in the outcome over the same
period. This is denoted as (D - C).​

3.​ Compute the Difference-in-Differences (DD):​

○ The estimate of the program’s impact is calculated by subtracting the change in the comparison group from the change in the treatment group:
DD Impact = (B − A) − (D − C)
○​ This accounts for any time-varying factors that could affect both groups, such as
changes in the economy or other external influences.​

Example:

Imagine a road repair program where the goal is to improve access to labor markets, and
employment rates are used as the outcome measure. If certain districts (treatment group)
receive the program while others (comparison group) do not, we compare the changes in
employment rates over time:

●​ For the treatment group: The employment rate goes from 60% (A) before the program
to 74% (B) after the program.​

●​ For the comparison group: The employment rate goes from 78% (C) to 81% (D) after
the program.​

The change in the treatment group is (B - A) = 74% - 60% = 14%.​


The change in the comparison group is (D - C) = 81% - 78% = 3%.

Thus, the DD Impact is:

DD Impact = (B − A) − (D − C) = 14% − 3% = 11%

This suggests that the program led to an 11% increase in the employment rate, after accounting
for the general time trend that also affected the comparison group.
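A minimal sketch of this arithmetic, followed by the equivalent interaction regression that is commonly used to implement DD (the regression uses hypothetical unit-level data and the statsmodels library; it is an illustration, not the text's own procedure):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# --- DD by hand, using the employment rates from the example above ---
A, B = 0.60, 0.74   # treatment group: before, after
C, D = 0.78, 0.81   # comparison group: before, after
dd_impact = (B - A) - (D - C)
print(f"DD impact: {dd_impact:.2%}")        # 14% - 3% = 11%

# --- Equivalent treat x post interaction regression on hypothetical data ---
rng = np.random.default_rng(3)
n = 4000
treat = rng.integers(0, 2, n)               # 1 = treatment district, 0 = comparison
post = rng.integers(0, 2, n)                # 1 = after the program, 0 = before
# Simulated outcome built to mirror the rates above, plus noise.
employed = 0.78 - 0.18 * treat + 0.03 * post + 0.11 * treat * post + rng.normal(0, 0.05, n)
df = pd.DataFrame({"employed": employed, "treat": treat, "post": post})

model = smf.ols("employed ~ treat * post", data=df).fit()
print(model.params["treat:post"])           # coefficient close to the 0.11 DD impact
```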

Assumptions:

●​ Parallel Trends Assumption: The key assumption in the DD method is that, in the
absence of the program, the treatment and comparison groups would have experienced
the same trend over time. This means that any difference in their outcomes can be
attributed to the program.​
Advantages and Limitations:

●​ Advantages:​

○​ Useful when randomized assignment is not feasible.​

○​ Helps control for unobserved factors that are constant over time within each
group.​

●​ Limitations:​

○​ Requires the parallel trends assumption, which may not hold in all cases.​

○​ Not always possible to find a good comparison group.​

In summary, the DD method is a powerful tool when randomization isn't possible, as it combines
before-and-after comparisons with comparisons between treatment and control groups to better
estimate program impacts. However, it relies on strong assumptions, and results can be biased
if those assumptions are violated.

How is the Difference-in-Differences (DD) Method Helpful?

The Difference-in-Differences (DD) method is particularly useful in impact evaluation when the
program's assignment is not randomized or clearly defined, which might otherwise introduce
bias in comparing treated and non-treated groups. One of the primary challenges in such
situations is that the treatment and comparison groups may have different characteristics that
could explain differences in outcomes, rather than the program itself.

Here’s how the DD method helps:

1.​ Controlling for Time-Invariant Differences:​


The main advantage of the DD method is its ability to control for time-invariant
differences between the treatment and comparison groups. These are characteristics
that do not change over time, such as an individual's birth year, a region's geographical
location, or a community's baseline health or education level. These characteristics,
whether observed or unobserved, could influence the outcomes being measured, but
they do not vary over time. By focusing on changes over time, the DD method helps
"cancel out" the effect of these constant factors, both observed and unobserved.​

For example: If we are studying the impact of a road repair program on employment,
characteristics such as the baseline infrastructure of a district (which doesn't change
over time) might influence employment. By comparing the before-and-after changes in
the treatment and comparison groups, the DD method helps isolate the effect of the
program from these time-invariant factors.​

2.​ Comparing Trends Between Groups, Not Just After Outcomes:​


Instead of comparing outcomes at a single point in time (after the intervention), the DD
method compares trends over time for the treatment and comparison groups. This
comparison accounts for any underlying trends that might affect both groups. By doing
so, it helps ensure that any observed differences between the groups after the
intervention are more likely to be attributed to the program itself, rather than pre-existing
trends or factors that were already in place before the intervention.​

3.​ Dealing with Unobserved Characteristics:​


The DD method can also help control for unobserved characteristics that do not change
over time. These might include personality traits, family background, or historical cultural
factors, which could affect an individual's or district's outcomes. By comparing changes
rather than levels of outcomes, we remove the bias introduced by these unobserved,
time-invariant factors.​

The Difference-in-Differences (DiD) method is a powerful statistical tool that helps evaluate
the causal impact of a treatment or intervention in observational settings where randomized
controlled trials are not feasible. Here's a breakdown of the key points related to the "Equal Trends" assumption and how it can be tested:

The "Equal Trends" Assumption in Difference-in-Differences

The core assumption of DiD is that, in the absence of treatment, the treatment group and the
comparison group would have followed parallel trends over time. This means that any
difference in the trends between these two groups after the intervention can be attributed to the
treatment effect. This assumption is crucial because:

●​ What It Implies: Without the program or treatment, the outcomes for both the treatment
and comparison groups should have evolved in the same way (parallel trends).​

●​ What Goes Wrong If This Assumption Is Violated: If the groups would have followed
different trends in the absence of treatment, the comparison of post-treatment
differences would lead to a biased estimate of the treatment effect. Specifically, you
might overestimate or underestimate the impact of the treatment, as the counterfactual
for the treatment group (i.e., what would have happened to them without the treatment)
is incorrectly modeled using the comparison group.​

Example:

If a road repair program occurs in a treatment area at the same time a new seaport is
constructed, it would be impossible to separate the effects of the two events using DiD because
the comparison group may have had different trends or experiences (e.g., the impact of the
seaport) that could distort the interpretation of the program’s effects.

Validity Check for the “Equal Trends” Assumption

While you can’t directly observe what would have happened to the treatment group without the
program, there are ways to test the validity of the equal trends assumption:

1.​ Pre-treatment Trend Comparison:​

○​ To ensure that the trends of both groups were similar before the treatment, you
should compare the changes in outcomes for the treatment and comparison
groups before the intervention.​

○ If both groups showed similar trends in the pre-intervention period, it increases confidence that their future trends would have been parallel in the absence of the intervention.

2.​ Placebo Test:​

○​ This test helps verify that no unaccounted-for differences are driving the results.​

○​ Fake treatment groups are created (e.g., using a cohort that was not affected
by the intervention) to check if any pre-existing differences between the groups
were in fact influencing the outcome.​

○​ If this “fake” treatment group doesn’t show any effect, it supports the assumption
of parallel trends for the actual groups.​

3.​ Testing with a Fake Outcome:​

○​ Another variation of the placebo test is to check the assumption with an outcome
that is unaffected by the treatment. For example, if the intervention is supposed
to influence school attendance, you can check if the treatment has any impact
on something unrelated, such as number of siblings. A significant effect here
would indicate a flawed comparison group.​

4.​ Comparison of Multiple Comparison Groups:​

○​ If different comparison groups (e.g., eighth graders vs. sixth graders) yield similar
results, it strengthens the case for the validity of the parallel trends assumption.​

○​ If they yield different results, it suggests that the assumption might not hold.​
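A minimal sketch of the pre-treatment trend comparison (check 1 above): fit a simple time trend to each group's pre-intervention outcomes and compare the slopes (the yearly values below are hypothetical):

```python
import numpy as np

# Sketch of a pre-trend check: compare pre-intervention slopes for the
# treatment and comparison groups. The yearly group averages are hypothetical.

years = np.array([2010, 2011, 2012, 2013])
treatment_pre = np.array([52.0, 53.1, 54.0, 55.2])     # treatment group, before the program
comparison_pre = np.array([60.5, 61.4, 62.6, 63.5])    # comparison group, before the program

slope_treatment = np.polyfit(years, treatment_pre, 1)[0]
slope_comparison = np.polyfit(years, comparison_pre, 1)[0]

print(f"pre-trend slope, treatment:  {slope_treatment:.2f} per year")
print(f"pre-trend slope, comparison: {slope_comparison:.2f} per year")
# Similar slopes support (but cannot prove) the equal-trends assumption; a formal
# version would regress the outcome on group-by-time interactions in the pre-period.
```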

Box Examples: Application of the DiD Method

Water Privatization and Infant Mortality in Argentina (Box 7.3)

●​ Researchers used DiD to analyze the effects of water privatization on child mortality
rates.​

● They showed equal pre-intervention trends in mortality rates across municipalities before privatization.

●​ Placebo Test: They tested a fake outcome (mortality from causes unrelated to water),
and found no impact, suggesting that the program had a valid effect on mortality due to
water-related diseases.​

●​ The study found that privatization was associated with reduced child mortality,
particularly in the poorest areas where the water network expansion was greatest.​

School Construction in Indonesia (Box 7.4)

●​ The evaluation focused on the impacts of a large-scale school construction program.​

●​ They tested the equal trends assumption by comparing age cohorts (18–24 vs. 12–17
years) in districts where school construction happened, and found no significant
differences in educational attainment pre-program.​

●​ The results confirmed that the program had a positive impact on educational attainment
and wages for younger cohorts, showing parallel trends in the absence of the
intervention.​

Evaluating the Impact of HISP: Example

In the case of the Health Insurance Subsidy Program (HISP), DiD was used to evaluate how
the program affected household health expenditures:
● Before and After Comparison: Health expenditures are compared for enrolled and nonenrolled households before and after the program.

● Regression Analysis: Both simple and multivariate linear regressions confirmed a significant reduction of $8.16 in household health expenditures as a result of the program.

Conclusion

The difference-in-differences method is useful for controlling for both observed and
unobserved time-invariant characteristics that could otherwise confound the results.
However, its validity hinges on the assumption of equal trends in the pre-intervention period.
The various testing methods, including pre-treatment comparisons, placebo tests, and using
multiple comparison groups, help assess the robustness of this assumption and ensure that the
estimated treatment effects are not biased.

The Difference-in-Differences (DiD) method, while useful for estimating the impact of an
intervention, has several limitations that can lead to biased or invalid estimates of treatment
effects, even when the assumption of equal trends holds. Let’s break down these limitations in
more detail:

Key Limitations of Difference-in-Differences


1.​ Unaccounted Factors That Affect the Groups Differently:​

○​ Explanation: The DiD method assumes that the only difference between the
treatment and comparison groups is the treatment itself. However, if there are
any other factors that affect one group more than the other at the same time
the intervention occurs, and if these factors are not controlled for in the
regression, the results will be biased.​

○​ Example: If you're evaluating the impact of subsidized fertilizer on rice production, but there is a drought in year 1 that disproportionately affects the
subsidized farmers (treatment group), then the difference in rice production
between the treatment and comparison groups could be due to the drought, not
just the fertilizer subsidy. In this case, the DiD method would incorrectly attribute
the difference in production solely to the fertilizer subsidy, leading to an invalid
estimate.​
2.​ Time-Varying Factors:​

○​ The DiD method assumes that the only thing that changes over time for the two
groups is the intervention itself. However, if there are any time-varying factors
that affect the groups differently, such as natural disasters, policy changes, or
regional economic shifts, the assumption of equal trends can be violated.​

○​ Example: If a new government policy is implemented in the treatment area during the same period as the intervention, it could distort the estimated
treatment effect. If this policy affects only the treatment group, the DiD method
could incorrectly attribute the difference in outcomes to the treatment, when it
was actually due to the policy.​

3.​ Failure to Control for Confounding Variables:​

○​ If a study fails to account for variables that influence the treatment and
comparison groups differently over time, multivariate regression analysis
(which is typically used to control for confounding variables) might not fully adjust
for those differences. In this case, the DiD estimate will still be biased.​

○​ Example: If socioeconomic changes, such as a rise in income in the treatment area, are not included as control variables in the regression, then the estimated
effect of the intervention could be skewed.​

4.​ Group-Specific Shocks:​

○​ If a group-specific shock (such as a local economic boom or a natural disaster) hits one of the groups around the time of the intervention, and it is not accounted
for in the regression, this could also distort the treatment effect. Essentially, the
estimate would capture both the shock and the intervention.​

○​ Example: In an evaluation of a program to increase access to healthcare, if the treatment area experiences a sudden influx of migrant workers or an economic
boom at the same time, this would interfere with the estimate of the program’s
impact unless such factors are explicitly included in the analysis.​

Conclusion

The Difference-in-Differences (DiD) method is not foolproof, and even when trends are equal
before the intervention, several factors can still introduce bias into the estimation. These
include:
●​ Unaccounted external factors (like droughts or policy changes) that affect the
treatment and comparison groups differently.​

●​ Time-varying shocks that influence one group more than the other.​

●​ Failure to control for confounding variables that change over time.​

To mitigate these issues, researchers must be diligent in identifying and controlling for all
potential confounding factors and external shocks that could impact the treatment and
comparison groups differently during the study period. If these factors are not accounted for, the
DiD method may produce invalid or biased estimates.

MATCHING
Matching: Constructing an Artificial Comparison Group
Matching is a statistical technique used to create a comparison group for estimating the impact
of a treatment or program when there is no clear assignment rule (e.g., randomization). The
goal is to find individuals from the non-treatment group (comparison group) who are as similar
as possible to those in the treatment group based on certain observed characteristics.

How Matching Works:

●​ Data with Treated and Non-Treated Groups: For example, suppose you are trying to evaluate the effect of a job training program on income. The dataset includes individuals who enrolled in the program (treatment group) and those who did not (comparison group).​

●​ Matching Process: Matching uses statistical techniques to identify non-treated individuals who have similar characteristics (such as age, gender, education, etc.) to the
treated individuals. These matched non-treated individuals then become the comparison
group, which allows you to estimate what would have happened to the treated
individuals without the program.​

Challenges in Matching:
1.​ Curse of Dimensionality: When you try to match on too many characteristics (e.g., age,
education, employment history), it becomes difficult to find exact matches for each
treated individual. This is called the "curse of dimensionality."​

2.​ Large Data Sets: If there are many characteristics or if each characteristic takes on
many values, finding a good match can be challenging unless you have a very large
dataset.​

3.​ Balancing Characteristics: If too few characteristics are matched, the treatment and
comparison groups might still differ in important ways. If too many characteristics are
used, it can be hard to find matches.​

Example in Figure 8.1:

●​ The figure shows matching on four characteristics: age, gender, months unemployed,
and whether the individual has a secondary school diploma.​

●​ Matching tries to find a non-treated individual who has similar characteristics to the
treated individuals. For instance, if the treatment group includes individuals with certain
combinations of these four characteristics, the matching process finds non-treated
individuals who have the closest combination.​

In summary, matching helps create a comparison group by finding non-treated individuals with
similar characteristics to those in the treatment group, allowing for more reliable estimation of
the treatment effect. However, increasing the number of characteristics makes the matching
process harder and can lead to difficulties in finding good matches.
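
A minimal sketch of matching on observed characteristics in the spirit of the Figure 8.1 description follows. The four columns, the simulated data, and the use of scikit-learn's nearest-neighbor search are illustrative choices, not prescribed by the text.

```python
# Pair each treated person with the most similar non-treated person on
# age, gender, months unemployed, and secondary school diploma (simulated data).
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
n = 500
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "age": rng.integers(18, 60, n),
    "female": rng.integers(0, 2, n),
    "months_unemployed": rng.integers(0, 24, n),
    "secondary_diploma": rng.integers(0, 2, n),
})
X_cols = ["age", "female", "months_unemployed", "secondary_diploma"]

# Standardize so no single characteristic dominates the distance metric.
X = (df[X_cols] - df[X_cols].mean()) / df[X_cols].std()
treated, control = X[df["treated"] == 1], X[df["treated"] == 0]

# For each treated unit, find the nearest non-treated unit.
nn = NearestNeighbors(n_neighbors=1).fit(control)
_, idx = nn.kneighbors(treated)
matched_controls = df[df["treated"] == 0].iloc[idx.ravel()]
print(matched_controls.head())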

Propensity Score Matching (PSM)


Propensity Score Matching (PSM) is a method that addresses the "curse of dimensionality"
when matching treatment and control groups in observational studies. Instead of matching
individuals on every characteristic directly (which can be difficult when there are many
variables), PSM calculates a propensity score, which is the probability of a unit (individual)
being treated (e.g., enrolled in a program) based on their observed characteristics.

Key Concepts in Propensity Score Matching:

1.​ Propensity Score:​

○​ The propensity score is a single value that summarizes the likelihood of a unit
receiving treatment based on observed characteristics. This score ranges from 0
to 1.​

○​ It is computed using a statistical model that uses baseline (pre-treatment) data, ensuring that only characteristics unaffected by the treatment are considered.​

2.​ Matching Using Propensity Scores:​

○​ After computing the propensity score for each individual in both the treatment
(enrolled) and control (non-enrolled) groups, individuals in the treatment group
are matched with individuals in the control group who have similar propensity
scores.​

○​ The aim is to form a comparison group that resembles the treatment group as
closely as possible on the observed characteristics that influence the likelihood of
receiving treatment.​

3.​ Estimating the Impact:​

○​ The impact of the treatment is estimated by comparing the outcomes of the matched pairs: one individual from the treatment group and one from the control
group.​

○​ The average treatment effect (ATE) is derived from the difference in outcomes
between these matched groups.​

4.​ Local Average Treatment Effect (LATE):​

○​ If some treated units cannot find a close match due to a lack of common
support (i.e., no units in the control group have a similar propensity score), the
analysis may only provide estimates for those units that can be matched — the
local average treatment effect (LATE). This refers to the treatment effect for
those individuals for whom a match exists.​

Steps in Propensity Score Matching (PSM):

1.​ Estimate Propensity Scores:​

○​ For each individual, estimate the probability of being treated based on observed
characteristics (using a statistical model like logistic regression).​

2.​ Check for Common Support:​


○​ Ensure there is overlap between the propensity scores of the treatment and
control groups. If there's no overlap, matching is not possible for those units.​

3.​ Match Treatment and Control Units:​

○​ Match treated units with non-treated units that have similar propensity scores.​

4.​ Estimate the Treatment Effect:​

○​ Calculate the difference in outcomes between the matched treatment and control
units. This gives an estimate of the program's impact.​

5.​ Address Lack of Common Support:​

○​ If no match is found for certain treated or untreated units (because their propensity scores are too extreme), restrict the analysis to units with common support (a code sketch of these steps follows).​
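
Here is a minimal sketch of these PSM steps on simulated data. The two baseline covariates, the logistic-regression propensity model, and the nearest-neighbor matching are illustrative assumptions; a real analysis would check balance and standard errors more carefully.

```python
# Propensity score matching: estimate scores, enforce common support,
# match treated units to the closest control, and average outcome differences.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
n = 2000
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "educ_years": rng.integers(6, 18, n),
})
# Treatment probability depends on observed baseline characteristics.
p_true = 1 / (1 + np.exp(-(-4 + 0.03 * df["age"] + 0.15 * df["educ_years"])))
df["treated"] = rng.binomial(1, p_true)
df["income"] = 100 + 2 * df["educ_years"] + 10 * df["treated"] + rng.normal(0, 5, n)

# Step 1: estimate propensity scores from baseline covariates with a logit model.
X = df[["age", "educ_years"]]
df["pscore"] = LogisticRegression().fit(X, df["treated"]).predict_proba(X)[:, 1]

# Step 2: restrict to common support (overlapping range of propensity scores).
lo = max(df.loc[df.treated == 1, "pscore"].min(), df.loc[df.treated == 0, "pscore"].min())
hi = min(df.loc[df.treated == 1, "pscore"].max(), df.loc[df.treated == 0, "pscore"].max())
cs = df[(df.pscore >= lo) & (df.pscore <= hi)]

# Steps 3-4: match each treated unit to the control with the closest score,
# then average the outcome differences over the matched pairs.
t, c = cs[cs.treated == 1], cs[cs.treated == 0]
nn = NearestNeighbors(n_neighbors=1).fit(c[["pscore"]])
_, idx = nn.kneighbors(t[["pscore"]])
impact = (t["income"].to_numpy() - c["income"].to_numpy()[idx.ravel()]).mean()
print(f"Estimated impact for the matched treated units: {impact:.2f}")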

Challenges and Considerations:

1.​ Unobserved Characteristics:​

○​ Matching can only account for observed characteristics. If there are unobserved
factors influencing both treatment assignment and outcomes (e.g., individual
motivations), the results may be biased.​

2.​ Pre-Treatment Data:​

○​ Only pre-treatment data (before the program starts) should be used for
calculating the propensity score. Using post-treatment data (which could be
influenced by the program) would bias the results.​

3.​ Matching on Relevant Characteristics:​

○​ The quality of the matching depends on having data on the relevant characteristics that determine treatment assignment. If we don't understand what
influences treatment decisions, the matched comparison group might not be
valid.​

4.​ Combining with Difference-in-Differences (DiD):​

○​ If baseline data (pre-intervention) are available, combining PSM with difference-in-differences (DiD) can reduce potential biases. DiD helps account
for unobserved confounding by comparing changes over time between the
treated and control groups.​

Figure 8.2: Lack of Common Support

●​ The figure illustrates the distribution of propensity scores for both the treatment
(enrolled) and control (non-enrolled) groups.​

●​ If the propensity score distributions do not overlap well, meaning treated units with high
propensity scores cannot be matched to control units, there is a lack of common
support.​

●​ In this case, the treatment effect is only estimated for those units where both the treated
and control groups have similar propensity scores.​

Conclusion:

Propensity Score Matching (PSM) offers a way to estimate treatment effects when random
assignment isn't possible. It reduces bias by matching treated and control units with similar
propensity scores. However, it relies on the assumption that all relevant characteristics are
observed, and there must be common support between the treatment and control groups to
ensure valid comparisons.

Combining Matching with Other Methods


In impact evaluation, matching methods, while useful, can sometimes be limited by biases or
data challenges. To overcome these issues, matching can be combined with other statistical
techniques, such as difference-in-differences (DiD) and the synthetic control method.
These combinations help improve the robustness and accuracy of the impact estimates by
addressing potential biases and confounding factors.

Matched Difference-in-Differences (DiD)

The matched difference-in-differences (DiD) method combines the strengths of matching and
difference-in-differences to provide a more reliable estimate of program effects. This is
particularly useful when there are baseline data on outcomes and concerns about unobserved
characteristics that could bias results.
Steps in Matched Difference-in-Differences:

1.​ Perform Matching: Match treatment and control units based on observed
characteristics (e.g., demographics, socio-economic factors).​

2.​ Calculate First Difference (Treatment Group): For each treated unit, compute the
change in the outcome between the "before" and "after" periods (i.e., the difference in
outcomes for each individual before and after treatment).​

3.​ Calculate Second Difference (Control Group): For each matched control unit,
compute the same change in the outcome between the before and after periods.​

4.​ Difference-in-Differences: Subtract the second difference from the first difference. This
difference accounts for time-related changes that might affect both the treatment and
control groups similarly.​

5.​ Average the Double Differences: Finally, calculate the average of these differences to estimate the program's impact (see the sketch after these steps).​
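
A minimal sketch of these matched DiD steps follows, using simulated before/after outcomes and two baseline matching variables. All column names and numbers are hypothetical.

```python
# Matched difference-in-differences: match on baseline characteristics, take the
# before/after change for each treated unit and its matched control, then average
# the double differences.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)
n = 600
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "age": rng.integers(18, 65, n),
    "baseline_income": rng.normal(200, 20, n),
})
# Outcomes before and after; the true program effect is +15 for treated units.
df["y_before"] = df["baseline_income"] + rng.normal(0, 5, n)
df["y_after"] = df["y_before"] + 10 + 15 * df["treated"] + rng.normal(0, 5, n)

# Step 1: match on observed baseline characteristics (standardized).
X_cols = ["age", "baseline_income"]
X = (df[X_cols] - df[X_cols].mean()) / df[X_cols].std()
t, c = df[df.treated == 1], df[df.treated == 0]
nn = NearestNeighbors(n_neighbors=1).fit(X[df.treated == 0])
_, idx = nn.kneighbors(X[df.treated == 1])
matched_c = c.iloc[idx.ravel()]

# Steps 2-5: first difference for treated units, second difference for matched
# controls, double difference pair by pair, then the average.
d_treated = (t["y_after"] - t["y_before"]).to_numpy()
d_control = (matched_c["y_after"] - matched_c["y_before"]).to_numpy()
print("Matched DiD estimate:", (d_treated - d_control).mean())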

Advantages of Matched DiD:

●​ Reduces Bias: By combining matching and DiD, this method reduces the bias that may
arise from unobserved factors that could affect both program participation and outcomes.​

●​ Control for Time Effects: DiD accounts for time trends that might affect both the
treatment and control groups similarly, ensuring that the estimated effect reflects the
program’s impact, not just general trends.​

Real-World Examples:

●​ Rural Roads and Market Development in Vietnam (Box 8.1): A study used matched
DiD to evaluate the impact of a rural road program on local market development. The
researchers matched treatment communes with control communes and used DiD to
estimate how the road rehabilitation affected market conditions.​

●​ Cement Floors and Child Health in Mexico (Box 8.2): Another study combined
matching with DiD to assess the impact of the Piso Firme program, which replaced dirt
floors with cement floors in households. The method helped estimate improvements in
child health, maternal happiness, and other welfare indicators.​

The Synthetic Control Method


The synthetic control method is a powerful technique used to estimate the effects of an
intervention or event on a single treated unit (e.g., a country, a firm, or a hospital) by comparing
it to a "synthetic" control group. This method is typically used when there is only one treated
unit and no clear comparison group.

How Synthetic Control Works:

●​ Instead of using multiple untreated units, the synthetic control method constructs an
artificial comparison unit by weighting untreated units so that their characteristics
match those of the treated unit as closely as possible.​

●​ The synthetic control is a weighted average of the untreated units that closely
resembles the treated unit in terms of pre-treatment characteristics, allowing for a valid
comparison of post-treatment outcomes.​

●​ The method is particularly useful when the treated unit is unique and no other unit in the
sample is a good match.​

Advantages of the Synthetic Control Method:

●​ Ideal for Unique Cases: It’s particularly useful when dealing with the impact of policies
or interventions that only affect a single unit (e.g., a single country or region).​

●​ Constructs a Better Comparison Group: The synthetic control is not a single unit but a
weighted average of several untreated units, making it a more flexible and reliable
comparison.​

Summary of Combined Methods:

1.​ Matched Difference-in-Differences (DiD):​

○​ Combines matching (to account for differences between treatment and control
groups) with DiD (to control for time trends and potential confounding).​

○​ Ideal for when there is baseline data on outcomes and unobserved factors may
influence both treatment and outcomes.​

○​ Example: Evaluating the impact of road rehabilitation in Vietnam on local market development or the effects of cement floors on child health in Mexico.​

2.​ Synthetic Control Method:​


○​ Used when the intervention targets a single unit, such as a country or region.​

○​ Constructs a synthetic control group by weighting untreated units to resemble the treated unit in pre-treatment characteristics.​

○​ Ideal for estimating impacts when only one treated unit is available for study.​

Conclusion:

By combining matching with other methods like difference-in-differences and the synthetic
control method, researchers can significantly reduce biases and improve the accuracy of impact
estimates. These combined approaches are valuable tools in evaluating interventions where
randomization is not feasible, and they allow for a more nuanced understanding of how
programs affect different units.

Synthetic Control Method


The Synthetic Control Method (SCM) is used to estimate the impact of an intervention or
event on a unit (such as a country, company, or region) by comparing it to a constructed
"synthetic" unit. This synthetic unit is an artificial combination of untreated units that closely
resemble the treated unit before the intervention. SCM is particularly useful when there’s only
one treated unit (like a single country or region) and no natural comparison group.

●​ How it works: Instead of comparing a treated unit to just one untreated unit or a group
of untreated units, SCM creates a synthetic comparison by weighting untreated units in
such a way that their pre-treatment characteristics (e.g., GDP, unemployment rate)
closely match those of the treated unit. This synthetic unit represents what would have
happened to the treated unit without the intervention (a weighting sketch follows the example below).​

●​ Example (from the text): The economic effects of terrorism in Spain’s Basque Country
were studied using SCM. The Basque Country's economy was significantly impacted by
terrorism, so SCM combined other regions to create a synthetic Basque Country that
could reflect what the Basque economy might have looked like without the conflict. This
way, they could isolate the impact of terrorism.
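
A minimal sketch of the weighting idea behind the synthetic control method follows. The donor pool, time periods, simulated data, and the use of scipy's SLSQP optimizer are illustrative assumptions, not a reproduction of the Basque Country study.

```python
# Choose non-negative weights on untreated "donor" regions (summing to one) so the
# weighted average tracks the treated region's pre-intervention outcomes; the
# post-intervention gap is the estimated effect.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
T_pre, T_post, n_donors = 10, 5, 8
donors_pre = rng.normal(100, 10, (T_pre, n_donors))            # donor regions, pre-period
treated_pre = donors_pre[:, :3].mean(axis=1) + rng.normal(0, 1, T_pre)
donors_post = donors_pre.mean(axis=0) + rng.normal(0, 2, (T_post, n_donors))
treated_post = donors_post[:, :3].mean(axis=1) - 5             # a true drop of 5 after the event

def pre_period_gap(w):
    # Squared distance between the treated unit and the weighted donors before the event.
    return ((treated_pre - donors_pre @ w) ** 2).sum()

w0 = np.full(n_donors, 1 / n_donors)
res = minimize(
    pre_period_gap, w0, method="SLSQP",
    bounds=[(0, 1)] * n_donors,
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1}],
)
weights = res.x

# The estimated effect is the post-period gap between the treated unit and its synthetic control.
synthetic_post = donors_post @ weights
print("Estimated effect in each post period:", treated_post - synthetic_post)
```
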
Limitations of the Matching Method
The matching method is a widely used technique for estimating the impact of programs or
interventions, but it has several important limitations that must be considered. Let’s go over the
key challenges highlighted in the passage:

1. Need for Large, Extensive Datasets

Matching methods require extensive data on a large sample of units (e.g., households, regions,
etc.). This is because the method relies on comparing the treated units to non-treated ones
based on observed characteristics, and for a meaningful comparison, a broad set of
characteristics is necessary. In smaller datasets or cases with limited data, matching may not
produce reliable or valid results.

●​ Problem: Even when large datasets are available, there may not be enough overlap
between the treated and untreated groups in terms of observable characteristics (this is
called lack of common support). In these cases, the matching method can't find
suitable matches, which weakens the reliability of the estimated impact.​

2. Inability to Account for Unobserved Characteristics

One of the most significant limitations of matching methods is that they can only match units
based on observed characteristics. It is impossible to incorporate unobserved factors (i.e.,
factors that are not included in the data) into the matching process.

●​ Problem: If there are differences between the treated and comparison groups in
unobserved characteristics (e.g., motivation, individual preferences, or hidden biases)
that affect both participation in the program and the outcome, then the matching results
will be biased. This could lead to misleading conclusions about the impact of the
intervention.​

●​ Assumption: Matching methods rely on the assumption that there are no unobserved
confounders (unmeasured variables) that influence both treatment assignment and the
outcome. This is a strong assumption and, importantly, it cannot be tested. If this
assumption is violated, the estimated treatment effect may be biased.​

3. Comparison with Other Evaluation Methods

Matching is often considered less robust than other methods like randomized controlled
trials (RCTs), instrumental variable (IV) methods, and regression discontinuity designs
(RDD). This is because:
●​ RCTs do not rely on assumptions about unobserved characteristics, as participants are
randomly assigned to treatment or control groups. This randomization eliminates the risk
of bias due to unobserved factors.​

●​ IV methods and RDD do not require the assumption of no unobserved confounding factors in the same way matching does. These methods can identify causal effects in the
presence of such unobserved factors, at least under certain conditions.​

4. Ex Post Matching

The limitations of matching are particularly problematic when the matching is done after the
program has already started (referred to as ex post matching). In these cases, the matching is
performed based on characteristics that were observed after the intervention had already been
implemented.

●​ Problem: This introduces a risk of post-treatment bias. If the characteristics being matched on were influenced by the treatment itself (for example, if program participation
changed the individuals' behavior or circumstances), the matched comparison groups
might not be truly comparable to the treated group, thus invalidating the results.​

5. Dependence on Baseline Data

Matching works best when there is baseline data available on the characteristics of individuals
or units before they received the treatment. If such data are available, matching on those
baseline characteristics helps ensure that the treated and untreated groups are similar prior to
the intervention, reducing the risk of bias in estimating the treatment effect.

●​ Problem: Without baseline data (i.e., data collected before the intervention), matching
becomes more risky because the characteristics you match on might already be
influenced by the program itself. In such cases, matching is unlikely to provide a valid
estimate of the causal effect.​

6. The Importance of Pre-program Design

Ideally, impact evaluations are best designed before the program starts, as this allows for the
collection of baseline data and the possibility of using more rigorous methods (like RCTs). Once
the program has already started and if there is no way to influence how it is allocated (for
example, when the treatment is non-randomly assigned), conducting a valid evaluation
becomes more challenging.

Summary of Limitations:
●​ Data requirements: Matching requires large datasets with extensive baseline data.​

●​ Unobserved factors: It can’t account for unmeasured or hidden factors that may
influence both participation and outcomes, leading to potential bias.​

●​ Comparative robustness: Matching is less robust compared to randomized controlled trials, instrumental variables, or regression discontinuity designs.​

●​ Ex post matching risks: Matching performed after the treatment has started is risky
and may lead to biased estimates if the characteristics being matched on were affected
by the treatment.​

●​ Dependence on baseline data: Matching is most effective when baseline data (pre-treatment) are available to ensure comparability.​

In practice, matching is often used when other more robust methods (like RCTs or IVs) are not
feasible, but it requires careful consideration of its limitations to avoid drawing invalid
conclusions.
Tab 2
Regression Discontinuity Design (RDD)

What is RDD?​
Regression Discontinuity Design (RDD) is a method used to measure the causal impact of a
program or intervention when eligibility is determined by a specific threshold on a continuous
variable (e.g., income, test scores, age). It compares individuals just above and just below this
threshold to assess the program’s effect, making it useful when random assignment isn't
possible.

How Does RDD Work?

1.​ Continuous Eligibility Index: The program uses a continuous variable (e.g., test
scores, income) to determine eligibility.​

2.​ Threshold (Cutoff): A specific cutoff separates eligible from ineligible individuals (e.g.,
test score ≥ 90 for a scholarship).​

3.​ Comparison of Groups: Individuals just above and just below the cutoff are very
similar, except for program eligibility. RDD compares these groups to estimate the
program's impact (see the sketch after this list).​
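
A minimal sketch of a sharp RDD estimate, using the scholarship-style cutoff of 90 from the description above. The simulated data, the bandwidth of 10, and the local linear specification are illustrative assumptions; a real analysis would justify the bandwidth and check alternative functional forms.

```python
# Sharp RDD: fit a local linear regression with separate slopes on each side of the
# cutoff and read off the jump in the outcome at the cutoff.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 3000
df = pd.DataFrame({"score": rng.uniform(50, 130, n)})
df["eligible"] = (df["score"] >= 90).astype(int)   # sharp design: eligible == treated
# The outcome rises smoothly with the score, plus a jump of 4 at the cutoff (the true effect).
df["outcome"] = 20 + 0.1 * df["score"] + 4 * df["eligible"] + rng.normal(0, 2, n)

BANDWIDTH = 10
local = df[(df["score"] >= 90 - BANDWIDTH) & (df["score"] <= 90 + BANDWIDTH)].copy()
local["centered"] = local["score"] - 90

# The coefficient on `eligible` is the estimated jump at the cutoff (a local effect).
fit = smf.ols("outcome ~ eligible * centered", data=local).fit()
print("RDD estimate at the cutoff:", fit.params["eligible"])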

Main Conditions for RDD:

●​ Smooth Index: The eligibility index must be continuous (e.g., income, test score).​

●​ Clearly Defined Cutoff: The cutoff must be clear and unambiguous.​

●​ Unique Cutoff: The cutoff should be specific to the program being evaluated.​

●​ Non-Manipulability of the Score: The eligibility score should not be easily manipulated.​

Impact Measurement:

●​ Local Average Treatment Effect (LATE): RDD estimates the impact near the cutoff.
Results may not apply to individuals far from the cutoff.​

●​ No Separate Control Group Needed: Individuals just on the ineligible side of the cutoff serve as a valid counterfactual for those just on the eligible side.​

Advantages of RDD:
●​ Causal Inference: RDD is one of the best quasi-experimental methods for estimating
causal effects.​

●​ No Randomization Needed: Useful when randomized controlled trials are not feasible.​

Fuzzy Regression Discontinuity Design (RDD)

What is Fuzzy RDD?​


In RDD, if all individuals comply with their treatment assignment based on their eligibility index
(i.e., those below the cutoff receive treatment, and those above do not), the design is called
“sharp.” However, if there is noncompliance—some individuals eligible for the program don’t
participate, or some ineligible individuals find a way to participate—the RDD is considered
“fuzzy.” Fuzzy RDD occurs when eligibility does not guarantee treatment.

Correcting for Noncompliance in Fuzzy RDD

In the case of fuzzy RDD, we use the instrumental variable approach to correct for noncompliance. An indicator for being on the eligible side of the cutoff serves as the instrument for actual participation, much as randomized assignment does in randomized controlled trials with imperfect compliance. The key drawback of fuzzy RDD is that the impact estimate becomes localized: it is valid only for the subgroup of the population near the cutoff whose participation is driven by eligibility.
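
A minimal sketch of this correction, written as a manual two-stage least squares on simulated data. The cutoff direction, bandwidth, and take-up rates are hypothetical, and a real analysis would use a dedicated IV routine to obtain correct standard errors; this only illustrates the logic.

```python
# Fuzzy RDD via manual 2SLS: crossing the cutoff instruments for actual participation.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 3000
df = pd.DataFrame({"score": rng.uniform(50, 130, n)})
df["above_cutoff"] = (df["score"] >= 90).astype(int)
# Fuzzy design: crossing the cutoff raises the chance of participating, but not to 1.
p_take = 0.1 + 0.7 * df["above_cutoff"]
df["participated"] = rng.binomial(1, p_take)
df["centered"] = df["score"] - 90
df["outcome"] = 20 + 0.1 * df["score"] + 4 * df["participated"] + rng.normal(0, 2, n)

local = df[df["centered"].abs() <= 10].copy()

# First stage: predict participation from the eligibility indicator (and the score).
first = smf.ols("participated ~ above_cutoff + centered", data=local).fit()
local["participated_hat"] = first.fittedvalues

# Second stage: regress the outcome on predicted participation.
second = smf.ols("outcome ~ participated_hat + centered", data=local).fit()
print("Fuzzy RDD (IV) estimate:", second.params["participated_hat"])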

Limitations and Interpretation of RDD

RDD Limitations

●​ Locality: Results apply only to those near the cutoff.​

●​ Noncompliance: Fuzzy RDD estimates effects only for compliers near the cutoff.​

●​ Statistical Power: Small sample sizes reduce power, and bandwidth choice is critical.​

●​ Functional Form: Incorrect functional forms distort results; robustness checks are
needed.​

●​ Manipulation: Eligibility manipulation can invalidate results.​

●​ Eligibility Uniqueness: Overlapping eligibility rules complicate the attribution of effects.​


Difference-in-Differences (DD)

The Difference-in-Differences (DD) method is used to estimate the impact of a program when
random assignment isn’t possible. It compares the changes in outcomes over time between a
treatment group (those receiving the program) and a comparison group (those not receiving the
program).

Key Concepts:

●​ Treatment Group: Individuals receiving the program.​

●​ Comparison Group: Individuals not receiving the program but otherwise facing similar
conditions.​

●​ Before-and-After Comparison: Comparing outcomes before and after the program for
both groups.​

●​ Counterfactual Estimate: The comparison group estimates what would have happened
to the treatment group without the program.​

Steps in DD Method:

1.​ Measure Outcomes Before and After: For both groups, measure the outcome of
interest before and after the intervention.​

2.​ Calculate Changes:​

○​ For the treatment group: (B - A)​

○​ For the comparison group: (D - C)​

3.​ Compute DD Impact:​


DD Impact = (B − A) − (D − C)​
This accounts for external factors affecting both groups.​

Example:​
A road repair program improves employment rates.

●​ Treatment Group: Employment increases from 60% (A) to 74% (B).​

●​ Comparison Group: Employment increases from 78% (C) to 81% (D).​


The DD Impact = (74% - 60%) - (81% - 78%) = 14% - 3% = 11%.​
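
The same arithmetic, written as a tiny function for reference (values taken from the example above):

```python
# Difference-in-differences: (B - A) - (D - C).
def dd_impact(b_after, a_before, d_after, c_before):
    return (b_after - a_before) - (d_after - c_before)

# Treatment group: 60% -> 74%; comparison group: 78% -> 81%.
print(dd_impact(0.74, 0.60, 0.81, 0.78))  # ~0.11, i.e. an 11 percentage point impact
```
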
Assumptions:

●​ Parallel Trends Assumption: In the absence of the program, the treatment and
comparison groups would have followed the same trend over time.​

Advantages:

●​ Useful when randomization isn’t feasible.​

●​ Controls for unobserved factors constant over time within each group.​

Limitations:

●​ Relies on the parallel trends assumption, which may not always hold.​

●​ Finding a good comparison group can be difficult.

How DD is Helpful

1.​ Control for Time-Invariant Differences:​


DD isolates the program's effect by comparing changes over time between treated and
non-treated groups, controlling for factors that don't change (e.g., baseline health,
region).​

2.​ Compare Trends, Not Just Outcomes:​


DD compares pre- and post-treatment trends for both groups, ensuring observed
differences are due to the program, not pre-existing trends.​

3.​ Account for Unobserved Characteristics:​


DD focuses on outcome changes, controlling for unobserved, time-invariant factors that
could bias results.​

The Equal Trends Assumption

DD assumes that, in the absence of treatment, both groups would have followed similar trends
over time. If they would have had different trends, DD estimates may be biased.

Example: If, during a road repair program, one of the areas also gains a nearby seaport, the two groups would experience different trends and the DD estimate would be skewed.
Testing the Equal Trends Assumption

1.​ Pre-treatment Trend Comparison:​


Check if groups had similar trends before treatment.​

2.​ Placebo Test:​


Use a fake treatment group to verify no other influences.​

3.​ Fake Outcome Test:​


Test for impacts on an unrelated outcome to check validity.​

4.​ Multiple Comparison Groups:​


Compare multiple groups to strengthen the assumption.​

Key Limitations of Difference-in-Differences (DiD)

1.​ Unaccounted Factors Affecting Groups Differently​

○​ Issue: DiD assumes the only difference between treatment and comparison
groups is the intervention itself. If other factors affect one group more than the
other at the same time, results may be biased.
2.​ Time-Varying Factors​

○​ Issue: DiD assumes that changes over time only come from the intervention. If
other factors (e.g., policy changes, natural disasters) affect the groups differently,
the equal trends assumption can be violated.
3.​ Failure to Control for Confounding Variables
○​ Issue: DiD may still be biased if important confounders are not included in the
analysis, such as socioeconomic factors that change over time.​

4.​ Group-Specific Shocks


○​ Issue: A group-specific shock (like a local economic boom or disaster) occurring
around the time of the intervention could distort the treatment effect if not
controlled for.​
