Intro To Prog Eval
Impact Evaluation
Impact evaluation is a process used to determine how effective a program is in producing the
desired changes or outcomes. It answers one key question: What is the impact (or causal
effect) of a program on an outcome of interest? The main goal is to understand the direct results
of a program, including whether it actually caused the changes it aimed for.
To see where impact evaluation fits, it helps to distinguish two related activities:
1. Monitoring: This is an ongoing process that tracks a program’s progress, activities, and
performance over time. It helps program managers understand what’s happening in the
program but doesn’t focus on measuring long-term changes caused by the program.
○ Example: If you are running a school lunch program, monitoring might track how
many meals are served daily and whether the program is staying within budget.
2. Evaluation: This is a periodic and detailed assessment of a program. Evaluations are
usually done at specific points in time and can be used to answer different types of
questions:
○ Descriptive questions: These ask what is happening (e.g., What activities are
being carried out?).
○ Normative questions: These compare what is happening to what should be
happening (e.g., Are the targets of the program being met?).
○ Cause-and-effect questions: These are concerned with determining whether the
program caused a specific outcome (e.g., Did the school lunch program improve
students’ health?).
Impact evaluations focus on answering cause-and-effect questions. They aim to figure out what
specific outcomes or changes were caused directly by the program, rather than by other factors.
The key feature of impact evaluation is its focus on causality. This means determining whether a
program was directly responsible for a change, or if other factors played a role.
To determine this, impact evaluations need to compare the program group (those who
participated in the program) with a counterfactual group (a group who did not participate, but
who would have been in the program if it were available to them). This comparison helps
determine what would have happened to the participants if they hadn’t been part of the
program.
● Example: Let’s say there is a program where children are given vitamin supplements to
improve their health. To measure the impact, we’d compare the health outcomes of
children who received the supplements (program participants) with those who didn’t
(non-participants, the counterfactual group). By comparing these two groups, we can
estimate how much of the health improvement is due to the vitamin supplements.
The way you design the evaluation depends on how the program is set up. Important factors
include:
● Program Resources: Do you have enough resources to serve everyone eligible for the
program?
● Program Type: Is the program targeted to a specific group or available to everyone?
● Timing: Is the program implemented all at once or gradually over time?
Based on these factors, you can decide which evaluation method is the best fit.
In Summary:
Impact evaluation is all about determining whether a program has had a meaningful effect on its
intended outcomes. By using careful comparisons, we can figure out if the changes were truly
due to the program itself or other factors. The evaluation method used will depend on how the
program operates and the resources available.
Prospective vs. Retrospective Impact Evaluation:
Impact evaluations can be divided into two main types: prospective and retrospective. The
main difference between them lies in when they are planned and how they are carried out.
A prospective impact evaluation is designed at the same time as the program itself, before the
program is implemented. Its key features are:
● Baseline Data: Data on the groups (treatment and comparison) is collected before the
program starts. This helps in measuring the outcomes after the program and
understanding what changes occurred.
● Program Design and Evaluation Design are Aligned: Since the evaluation is planned
early, it aligns with the program’s goals, ensuring that the right questions are asked and
the program’s success can be measured effectively.
● Clear Goals and Success Measures: The program’s success is defined during the
planning stage. This focuses on what the program intends to achieve, and helps in
measuring if those goals were met.
● Better Causal Inference: Prospective evaluations have the best chance of generating
valid estimates of the program's impact because the treatment and comparison groups
are identified before the program starts, and there are more ways to establish
comparisons.
Example: Imagine a new scholarship program is being planned. The prospective evaluation
would start by collecting baseline data on students’ academic performance and school
attendance before the scholarships are offered. This way, when the scholarships are provided,
the evaluation can compare the academic changes in students who received the scholarships
(treatment group) and those who didn’t (comparison group) to measure the true impact of the
program.
A retrospective impact evaluation is done after a program has already been implemented. In
this type of evaluation, the program is assessed based on what happened after it has been put
in place. The key challenges and characteristics of retrospective evaluations are:
● Lack of Baseline Data: Since the evaluation is happening after the program is
implemented, there’s no baseline data to compare what was happening before the
program started. This makes it harder to measure how the program has truly impacted
the outcomes.
● Limited Options for Creating a Comparison Group: In retrospective evaluations, the
treatment group (those who received the program) has already been chosen. It is difficult
to find a comparison group that is similar to the treatment group without bias, which
makes it harder to draw clear conclusions about causality.
● Reliance on Existing Data: Since the program is already running, retrospective
evaluations rely on the data that’s available. This can be problematic if the data isn’t
complete or well-organized.
● Quasi-Experimental Methods: Because prospective methods are not possible,
retrospective evaluations often use methods that rely on assumptions, which can
weaken the evidence or make it debatable.
Example: Let’s say a health program has been running for a few years without any planned
evaluation. A retrospective evaluation would now look at the health outcomes of people who
participated in the program and compare them with those who didn’t. However, because there
was no baseline data, it would be harder to prove that any improvements in health were caused
by the program, and not by other factors.
Advantages of Prospective Evaluations:
1. Baseline Data: With prospective evaluations, you collect baseline data before the
program starts, which helps establish clear benchmarks for measuring change.
2. Clear Program Goals: Designing the evaluation alongside the program helps make sure
the program’s goals and success measures are clearly defined.
3. Better Comparison Groups: By identifying the treatment and comparison groups ahead
of time, prospective evaluations provide more reliable and valid comparisons, leading to
stronger conclusions about causality.
Limitations of Retrospective Evaluations:
● Lack of Baseline Data: It's harder to know what the situation was like before the
program started, making it tough to assess real impact.
● Data Limitations: Since the data is already available, there might be gaps or
inconsistencies that can weaken the evaluation's findings.
● Quasi-Experimental Methods: Often, retrospective evaluations have to rely on
methods that aren’t as precise as those used in prospective evaluations, making the
results less certain.
In Summary:
● Prospective evaluations are planned before the program starts, collect baseline data,
and align the evaluation with program goals, making them more reliable and capable of
providing valid conclusions.
● Retrospective evaluations assess programs after they are implemented, often with
limited baseline data, and tend to rely on assumptions or quasi-experimental methods,
making the results more uncertain.
Efficacy Studies vs. Effectiveness Studies:
In impact evaluation, we often see two key types of studies: efficacy studies and
effectiveness studies. Both aim to evaluate the impact of a program, but they do so in different
ways and under different conditions.
Efficacy studies focus on testing whether a program can work under controlled, ideal
conditions. These studies are often done in pilot programs or small-scale trials, where
researchers have full control over the implementation and closely monitor the process. The
main goal is to test if the program works theoretically and whether it can achieve its desired
outcomes when everything goes as planned.
Key Characteristics:
● Controlled, Ideal Conditions: The program is implemented under close researcher
supervision, often as a pilot or small-scale trial.
● Proof of Concept: The aim is to show that the program can achieve its desired outcomes
when everything goes as planned.
● Limited Generalizability: Because conditions are ideal, the results may not carry over to
routine, large-scale implementation.
Example:
Imagine a new medical treatment for a disease is tested in a specialized hospital with expert
staff. The treatment shows promising results in this ideal setting. However, if the same treatment
were rolled out to an average hospital with fewer resources and less experienced staff, the
results might not be the same. The efficacy study tells us that the treatment works under ideal
conditions, but we can’t be sure it will work in a broader context.
Effectiveness studies, on the other hand, aim to assess whether a program works in real-world
conditions. These studies evaluate the program’s performance when it is implemented in a
more typical or regular setting, without the tight control of a research environment. The goal is to
understand how well the program works in practice and whether its effects can be generalized
to a larger population.
Key Characteristics:
● Real-World Conditions: The program is implemented as it would be in the general
population, using regular channels and processes.
● Generalizability: Effectiveness studies aim to produce results that are generalizable to a
larger group or population, making them more relevant for policy makers and
decision-makers.
● Focus on Scale: These studies are concerned with how a program performs when
expanded beyond small pilots or controlled settings.
● External Validity: The results from effectiveness studies are expected to apply to a
broader population beyond the specific group studied.
Why Both Types of Studies Matter:
● Efficacy Studies are crucial for testing new ideas or innovative programs in
controlled settings. They provide evidence that a program has potential, but they often
don’t tell us if the program will work when implemented widely.
● Effectiveness Studies are essential for understanding whether a program works in
the real world, in normal settings. The results from these studies are more applicable
for policy makers who want to know if a program can be expanded or replicated on a
larger scale.
Example: A study conducted in 2007 by Banerjee and colleagues tested the Graduation Approach
(which provides cash, assets, training, and support to the poorest families) in six countries. The
results showed significant improvements in various outcomes, such as income, food security,
and mental health, though the impacts varied by country. This kind of evaluation helps
determine if a program can be effective across different contexts, making the results more
generalizable.
In Summary:
● Efficacy Studies: Test whether a program can work under ideal, controlled
conditions. They are good for testing new programs or ideas, but their results may not
apply to larger or real-world settings.
● Effectiveness Studies: Assess whether a program works in real-world conditions,
aiming to produce results that can be generalized to a larger population and used to
inform policy decisions.
Complementary Approaches: Impact evaluations can be combined with other methods that
answer related questions:
● Ex Ante Simulations: These are predictions made before a program starts, based on
available data. They simulate the expected effects of a program, helping to assess its
potential impact and make better design choices.
Example: Before launching a new health campaign, simulations might predict how
effective different strategies (like advertising or free check-ups) will be in improving
health outcomes.
● Process Evaluations: Help understand how and why a program works (or doesn’t work)
by focusing on its implementation and context. This can help policymakers understand
why certain outcomes were achieved.
Example: A program providing nutrition education might use process evaluation to
explore whether the materials were well received or if there were barriers preventing
people from attending.
Why These Approaches Matter:
● Impact evaluations alone can miss key details about how a program is working or why
certain results happened. Using additional methods helps fill in these gaps and improves
the overall evaluation.
Mixed Methods: Quantitative impact evaluations are often combined with qualitative data
collection. Three common designs are:
1. Convergent Parallel: Both types of data (quantitative and qualitative) are collected at
the same time to cross-check and provide early insights into the program’s effectiveness.
Example: A health program collects survey data (quantitative) and interviews with
participants (qualitative) to understand the overall impact.
2. Explanatory Sequential: The qualitative data explains the results found in the
quantitative data. It helps understand why some outcomes are better or worse than
expected.
Example: After finding that a school program improved student performance, interviews
with teachers and students explain which aspects of the program were most beneficial.
3. Exploratory Sequential: Qualitative methods are used first to generate ideas and
develop hypotheses. Then, quantitative data is collected to test those hypotheses.
Example: Focus groups with community members help identify needs, followed by
surveys to measure the extent of those needs across a larger population.
Process Evaluations
Process evaluations focus on how a program is implemented and whether it follows its
original design. They help assess the program's operations, identify areas for
improvement, and provide valuable insights during the early stages of a program or pilot.
These evaluations are often cost-effective and quick to carry out.
● Purpose: They test if the program is operating as planned and if it aligns with its
intended goals. This helps identify operational problems early, allowing for
adjustments before the program is fully implemented.
Example: In Tanzania, the government piloted a community-based cash transfer
program. A process evaluation helped identify issues, like delays in payments
and beneficiary selection problems, which were then addressed to improve the
program.
Before applying an impact evaluation, it’s important to ensure the program is functioning
as intended. If the operational processes haven’t been validated, resources may be
wasted, or the program may change during the evaluation, affecting the results.
Cost-Benefit and Cost-Effectiveness Analysis
Cost-benefit analysis (CBA) and cost-effectiveness analysis (CEA) help assess a program’s
financial value and its efficiency in achieving specific outcomes.
1. Cost-Benefit Analysis (CBA): Compares the total benefits of a program to its total
costs. It tries to measure everything in monetary terms to decide if the benefits outweigh
the costs.
Example: If a health program costs $1 million and delivers $1.5 million in benefits (like
fewer hospital visits), the cost-benefit ratio helps determine if the program is worth the
investment.
2. Cost-Effectiveness Analysis (CEA): Compares the cost of two or more programs that
aim to achieve the same outcome, helping to identify which program is the most efficient
in achieving the goal.
Example: Comparing two educational programs to see which one improves student test
scores for the lowest cost.
After assessing the impact of a program, adding cost information helps answer two questions:
Do the program’s benefits outweigh its costs, and does it achieve its outcomes at a lower cost
than the alternatives? Once impact and cost data are available, cost-effectiveness analysis
helps policymakers identify the most efficient investments.
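To make the arithmetic concrete, here is a minimal Python sketch using the hypothetical figures above; the education-program numbers are invented purely for illustration:

costs = 1_000_000          # program cost in dollars (from the health example above)
benefits = 1_500_000       # monetized benefits, e.g., fewer hospital visits (hypothetical)
benefit_cost_ratio = benefits / costs
print(f"Benefit-cost ratio: {benefit_cost_ratio:.2f}")   # 1.50 -> benefits exceed costs

# Cost-effectiveness: cost per unit of outcome for two programs with the same goal.
program_a = {"cost": 200_000, "score_gain": 4.0}   # hypothetical education program A
program_b = {"cost": 150_000, "score_gain": 2.5}   # hypothetical education program B
for name, p in [("A", program_a), ("B", program_b)]:
    print(f"Program {name}: ${p['cost'] / p['score_gain']:,.0f} per test-score point")

Note that program A costs more in total but is the more cost-effective option per test-score point, which is exactly the kind of comparison CEA is meant to surface.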
Ethical Considerations in Impact Evaluation
1. Ethics of Investment: Spending public resources on programs without knowing their
effectiveness might be seen as unethical. Impact evaluations help determine a
program's effectiveness, making public investments more ethical.
2. Assigning Program Benefits: Evaluations should not influence how benefits are
assigned. However, evaluations can help ensure the program's rules for eligibility are
fair, transparent, and equitable.
3. Randomized Assignment: Some programs use random selection to decide who gets
benefits. This can raise ethical concerns about denying benefits to some people. But
since programs often can't serve everyone at once, random assignment ensures
fairness by giving equally eligible participants a fair chance to receive the program.
4. Research Ethics: Evaluations involve studying human subjects, so ethical guidelines for
research on people must be followed to protect their rights and well-being. Review
boards or ethics committees usually monitor this.
In short, the key ethical concerns in conducting impact evaluations are fairness, transparency,
and the protection of human subjects.
Impact Evaluation for Policy Decisions
Evaluations are particularly helpful for testing new and unproven approaches, as seen with the
Mexican conditional cash transfer program. The evaluation results were key in scaling the
program nationally and internationally.
Generalizing Results
One key challenge is generalizability: can results from one evaluation be applied to other
settings? By comparing multiple evaluations across different contexts, we can identify patterns
and build more reliable conclusions. This approach, called the cluster approach, groups
evaluations based on common research questions, helping policymakers make better decisions.
● World Bank's Strategic Impact Evaluation Fund (SIEF) and other initiatives use this
cluster approach to fill knowledge gaps. For instance, research on early childhood
development has shown certain programs work, but more research is needed on how to
scale them cost-effectively.
Deciding Whether to Conduct an Impact Evaluation
Not all programs need an impact evaluation. It’s important to use them selectively when the
question requires a detailed examination of causality. Impact evaluations can be expensive,
especially if you need to collect your own data, so it's important to be strategic with your budget.
Conclusion:
If these considerations suggest that an impact evaluation is worthwhile and the necessary
resources are available, then you’re on the right track. The chapters that follow are designed to
help you and your evaluation team navigate the process successfully.
Preparation of an Evaluation
Initial Steps in Setting Up an Evaluation
This chapter outlines the initial steps to take when preparing for an evaluation: constructing a
theory of change, developing a results chain, specifying evaluation questions, and selecting
indicators.
Constructing a Theory of Change
The best time to develop a theory of change is at the beginning of a program when all
stakeholders (such as program designers, policymakers, and implementers) can come together
to agree on the program’s objectives and the best way to achieve them. This ensures that
everyone has a shared understanding of how the program will work and what it aims to achieve.
In Mexico, the Piso Firme program aimed to improve the living conditions of poor families by
replacing dirt floors with cement ones. The theory of change for this program was as follows:
1. Input: The government provides cement and materials, and the community helps install
the floors.
2. Output: Dirt floors in participating households are replaced with cement floors.
3. Outcome: Cement floors interrupt parasite transmission, so young children suffer fewer
episodes of diarrhea and illness.
4. End Goal: Health improvements, better nutrition, and even better cognitive development
in children.
This was based on the assumption that dirt floors were a major source of parasite
transmission, which causes illnesses in children. By replacing dirt with cement floors, the
program aimed to interrupt this cycle, leading to improved health and overall well-being.
A theory of change helps clarify how the program works and what it aims to achieve. It also
identifies research questions to explore, like the impact on health outcomes in this case. For
example, in the Piso Firme project, the evaluation asked whether cement floors really reduced
diarrhea and malnutrition, improving overall health and happiness.
By having a clear theory of change, stakeholders can better understand the program's logic,
make adjustments, and ensure that the right outcomes are measured.
Developing a Results Chain
A results chain sets out the sequence of steps through which a program is expected to achieve
its objectives. Its main elements are:
1. Inputs:
These are the resources available to the program, such as the budget, staff, and
materials.
2. Activities:
These are the actions or work that are carried out to transform the inputs into tangible
outputs. For example, providing training or distributing materials.
3. Outputs:
These are the direct results of the program’s activities — the goods and services
produced. For example, if a program provides education, the output might be the number
of people trained.
4. Outcomes:
These are the short-term to medium-term results that occur once beneficiaries use
the program’s outputs. These are typically not directly controlled by the program but
depend on how the beneficiaries react to the program (e.g., improved knowledge or
skills).
5. Final Outcomes:
These are the longer-term goals the program ultimately aims to achieve, such as lasting
improvements in health, learning, or income.
● Implementation (Supply Side): This includes inputs, activities, and outputs, which are
within the control of the program.
● Results (Demand Side + Supply Side): These include the outcomes and final outcomes,
which are influenced by both the program’s implementation and the behavior of the
beneficiaries.
In short, a results chain helps to organize the steps needed to achieve a program’s goals,
shows the links between actions and results, and identifies the resources, activities, and
expected outcomes involved. It is a powerful tool for improving and evaluating program
effectiveness.
Specifying Evaluation Questions
An evaluation question is the central focus of any effective evaluation. It helps narrow down
the research and ensures the evaluation directly addresses key policy interests. In the case of
impact evaluation, the question typically asks, “What is the impact (or causal effect) of a
program on a specific outcome?” The goal is to identify the changes caused by the program,
program modality, or design innovation. Evaluation questions can focus on, for example:
○ Effectiveness: Does the program achieve its intended results (e.g., improved
health, education, etc.)?
○ Cost-effectiveness: Is one program model more cost-efficient than another?
○ Behavioral Change: Does a program lead to changes in behavior, such as
increased enrollment or better health practices?
Sometimes, it’s not necessary to test the full program right away. Instead, you can test a
mechanism — a part of the program's causal pathway — to understand if the underlying
assumptions are correct.
For example, imagine you're concerned about obesity in poor neighborhoods. One potential
cause is lack of access to nutritious food. Instead of launching a full program to provide food
subsidies, you could first test the mechanism by offering free baskets of fruits and vegetables
to see if this actually increases consumption.
● Mechanism Question: Does giving free vegetables lead to healthier eating habits in
residents of poor neighborhoods?
Consider, for example, how the evaluation question would be formulated for a High School
Mathematics Reform program: starting from the program’s theory of change, the question is
narrowed down to the specific causal effect the evaluation needs to measure.
A well-specified evaluation question matters for two reasons:
1. Focus:
The question narrows the focus of the evaluation, ensuring it targets the key aspects of
the program that need to be assessed.
2. Actionable Findings:
A focused question helps the evaluation generate clear, actionable findings that
decision-makers can use.
In summary, specifying the evaluation question is critical for guiding an evaluation. It ensures
that the research is focused on the most important aspects of the program and helps generate
clear, actionable findings.
SMART Indicators
It’s crucial to ensure that the selected indicators are effective measures of program
performance. A commonly used approach to ensure this is the SMART framework, which
ensures that each indicator meets the following criteria:
● Specific: The indicator should measure the desired outcome as precisely as possible.
● Measurable: The indicator must be something that can be easily quantified or obtained.
● Attributable: The indicator should be linked directly to the program’s efforts, so you can
trace the outcomes to the intervention.
● Realistic: The data for the indicator should be obtainable in a reasonable timeframe and
at a reasonable cost.
● Targeted: The indicator must relate to the target population (i.e., those intended to
benefit from the program).
Indicators should be identified not just at the outcome level, but throughout the entire results
chain to ensure that the program’s causal logic can be tracked. This includes monitoring both:
● Implementation Indicators: These track whether the program has been carried out as
planned, whether it has reached the target population, and if it has been delivered on
time.
● Outcome Indicators: These measure whether the program has achieved the intended
outcomes. Even if the focus is on outcomes, tracking implementation indicators is still
essential to explain why certain results were or were not achieved.
Without indicators across the entire results chain, an evaluation risks becoming a “black box”
that simply identifies whether outcomes were achieved, but cannot explain the reasons behind
the success or failure.
Once the indicators have been selected, the next step is to consider the practical aspects of
gathering the data to measure them: where the data will come from, how often they can be
collected, and at what cost, so that the indicators can be reliably produced and used in the
evaluation.
Conclusion
In summary, selecting the right outcome and performance indicators is a crucial step in
designing an effective impact evaluation. These indicators should be SMART, tied directly to
program goals, and span the full results chain to ensure that the causal logic of the program can
be tracked. Clear objectives and effect sizes must be defined early to guide the evaluation, and
practical considerations for data collection should be carefully planned to ensure reliable and
valid results.
Causal Inference and Counterfactuals in
Impact Evaluation
Causal Inference is the process of determining whether a program or intervention directly
causes a change in an outcome. For example, we may want to know if a vocational training
program leads to higher income for participants. It's not enough to just observe that someone's
income increased after completing the program; other factors, such as their effort or changes in
the job market, could also be responsible.
To establish causality, impact evaluations use specific methods to rule out other explanations
and isolate the program's effect. The goal is to determine how much of the observed change in
outcomes (like income) is due to the program itself, and not to other factors.
Δ = (Y | P = 1) − (Y | P = 0)
This formula tells us that the causal impact (Δ) is the difference between the outcome (income)
with the program (P = 1) and the same outcome without the program (P = 0). In essence, we're
trying to measure what would have happened to the person if they hadn’t participated in the
program.
Example: Imagine a person completes a vocational training program. The goal is to compare
their income after the program (P = 1) to what their income would have been without the
program (P = 0). If we could observe both scenarios at the same time, we would know exactly
how much the program changed their income, without being influenced by any other factors.
However, it's impossible to observe both scenarios for the same person at the same time, so
impact evaluations often use counterfactuals (what would have happened in the absence of
the program) to estimate the causal effect.
By comparing groups that did and didn’t receive the program, or using statistical methods to
simulate the counterfactual, we can isolate the program’s true impact on outcomes.
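To see why the counterfactual cannot be observed directly, here is a minimal Python sketch with invented income figures, in which we pretend to know both potential outcomes for each person, something only a simulation allows:

import numpy as np

# Potential outcomes for five hypothetical people (invented numbers):
# y0 = income without the program, y1 = income with the program.
y0 = np.array([28_000, 30_000, 31_000, 29_000, 32_000], dtype=float)
y1 = y0 + 5_000          # pretend the program raises everyone's income by $5,000

true_impact = (y1 - y0).mean()   # 5,000 -- knowable only because this is a simulation
print("True impact:", true_impact)

# In reality each person either participates (P = 1) or not (P = 0),
# so only one of the two potential outcomes is ever observed per person.
p = np.array([1, 0, 1, 0, 1])
observed = np.where(p == 1, y1, y0)
estimate = observed[p == 1].mean() - observed[p == 0].mean()
print("Estimate from observed data:", estimate)   # close to, but not exactly, 5,000 in a tiny sample

The gap between the true impact and the estimate in this tiny example comes from the two small groups not being perfectly comparable, which is exactly the problem a good comparison group is meant to solve.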
Let’s look at Miss Unique, a newborn baby who receives a cash transfer program for her
mother to take her to health checkups. The goal of the program is to improve Miss Unique’s
health (e.g., her height at age 3) by making sure she gets health services.
● What we can measure: We can measure Miss Unique’s height at age 3 after she has
received the cash transfer.
● What we can’t measure: We cannot know what Miss Unique’s height would have
been if her mother hadn’t received the cash transfer. This is the counterfactual — the
"what would have happened" scenario that we can’t directly observe.
Since Miss Unique actually received the cash transfer, we can’t know her height in a world
where she didn’t receive it. This makes it hard to say if the program caused any change in her
height. We would need to compare her to someone else who is very similar, but it’s impossible
to find someone who is exactly the same as Miss Unique. Every person has unique
circumstances, so just comparing Miss Unique to another child might not be accurate.
In an ideal world, to solve the counterfactual problem, we could create a perfect clone of Miss
Unique. This clone would be exactly the same as Miss Unique in every way (same family,
same health, same background, etc.), but the clone wouldn’t receive the cash transfer.
● We could then compare Miss Unique’s height (after receiving the program) to the
clone’s height (without the program). The difference would show the program’s impact.
However, it’s impossible to find a perfect clone of someone because there are always
differences between people. Even identical twins have differences.
Real-Life Challenges
In the real world, we can’t find these perfect clones, so we use other methods to try to estimate
the counterfactual. For example, we compare Miss Unique to other children who didn’t receive
the cash transfer, but we have to be careful because those children may not be exactly like Miss
Unique. They may live in different areas, have different parents, or other factors that could affect
their health.
Summary
The counterfactual is the key concept in impact evaluations because it represents what would
have happened without the program. Since we can’t observe the counterfactual directly
(because a person can’t exist in two states at once), we estimate it by comparing people who
participated in the program to those who didn’t, while trying to account for other differences
between them.
The problem described above is known as the “counterfactual problem” in impact evaluations.
It refers to the difficulty of knowing what would have happened to a person if they hadn’t
participated in a program, since we can only see what actually happened to them after they
joined the program.
Imagine we want to evaluate the effect of a vocational training program on a person's income.
Let's say that after the program, a person’s income increased from $30,000 to $40,000. The
question is: Did the program cause the income increase?
● With the Program: After completing the program, the person’s income is $40,000.
● Without the Program: What if they hadn’t taken the program? Would their income still
have increased, or would it have stayed the same? This is what we don’t know because
we can’t see the "alternate reality" where the person didn’t participate in the program.
We can't directly observe the "without program" income (since we only have the "with program"
data), so we compare the person’s income with other people who didn’t participate in the
program. These people form the comparison group.
Let’s say you find a group of people who are similar to the person in the training program (same
age, same education level, similar job history, etc.), but they didn’t take the training.
● You observe that these people’s income stayed around $30,000, the same as the
income of the person before they joined the program.
So, in this case, you could say that the program likely contributed to the increase in income,
since the comparison group (people who didn’t take the program) did not show any increase in
income.
● We can never truly know what would have happened to the person without the
program. This is why we need to use statistical methods to estimate the counterfactual.
We try to find someone similar to the person in the program and estimate what their
income would have been if they hadn't participated.
In short:
● Without the program: We can’t observe directly, so we estimate it using other people
who didn’t participate.
● With the program: We observe directly (income after the program).
This difference is what we try to measure in impact evaluation to determine if the program
caused the change.
Estimating the Counterfactual:
The Problem
In impact evaluation, we are trying to figure out what would have happened to a group of
participants if they had not received the program or treatment. This is the counterfactual. We
can’t observe a person at the same time in two states (with and without the program), so we use
a comparison group to estimate the counterfactual.
1. Treatment Group: This group receives the program or treatment. For example, people
who receive the extra pocket money.
2. Comparison (Control) Group: This group does not receive the program. They are used
to estimate what would have happened to the treatment group without the program.
The goal is to compare the treatment group to a comparison group that is as similar as
possible, except for the fact that one group receives the program and the other does not.
We need to make sure the treatment group and comparison group are statistically
identical, meaning that their characteristics are the same on average. If they are identical
except for receiving the program, then any differences in outcomes can be attributed to the
program itself.
For the comparison group to be valid and provide a good estimate of the counterfactual, it must
meet these three conditions:
1. Same Average Characteristics:
○ The treatment and comparison groups must have the same average
characteristics in the absence of the program.
○ For example, if we are comparing candy consumption, both groups should have
the same average age, gender, and preferences for candy, so any differences in
candy consumption are because of the program, not other factors.
2. The Comparison Group is Unaffected by the Program:
○ The treatment should not affect the comparison group, either directly or indirectly.
○ For example, if the treatment group is given extra pocket money and this leads to
more trips to the candy store, we need to ensure the comparison group is not
affected by these trips. Otherwise, we wouldn’t be able to isolate the impact of
the pocket money itself.
3. Same Reaction to the Program:
○ The treatment and comparison groups should respond to the program in the
same way.
○ If one group’s income increases by $100 due to a training program, then we
expect the comparison group’s income to also increase by $100 if they had
received the same training. If this happens, any difference in outcomes between
the groups can be attributed to the program.
Let’s go back to the example of Mr. Fulanito receiving extra pocket money and consuming
more candy.
● Treatment Group: This group (say, Mr. Fulanito) receives the extra pocket money.
● Comparison Group: This group does not receive the extra pocket money, but they
should be similar to Mr. Fulanito in terms of age, preferences for candy, and other
characteristics.
To estimate the impact, we compare average outcomes:
1. Treatment Group Outcome (Y | P = 1):
○ The average candy consumption of the treatment group (those who received
pocket money).
○ For example, they consume 6 candies on average.
2. Comparison Group Outcome (Y | P = 0):
○ The average candy consumption of the comparison group (those who did not
receive pocket money).
○ For example, they consume 4 candies on average.
3. Estimate the Impact:
○ The impact of the program (extra pocket money) is the difference between the
two averages.
○ Impact = 6 (candies for treatment group) – 4 (candies for comparison group) = 2
candies.
In this case, we estimate that the pocket money program caused an increase of 2 candies in
candy consumption on average.
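A few lines of Python reproduce this calculation; the individual candy counts are invented so that the group averages match the example:

# Hypothetical candy counts chosen so the group averages match the example (6 vs. 4).
treatment_candies = [6, 7, 5, 6, 6]
comparison_candies = [4, 4, 5, 3, 4]

mean_treatment = sum(treatment_candies) / len(treatment_candies)     # 6.0
mean_comparison = sum(comparison_candies) / len(comparison_candies)  # 4.0

impact = mean_treatment - mean_comparison
print(f"Estimated impact: {impact} extra candies on average")        # 2.0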
Why is this Comparison Group Important?
If we don’t have a valid comparison group, the estimated impact could be biased. This means
we could be measuring not only the effect of the program but also the effect of other differences
between the groups.
For example, if the comparison group is much older or lives in a different area where candy is
cheaper, their candy consumption might be different for reasons unrelated to the program. This
could distort the estimate of the program’s true impact.
Conclusion
In summary, the key to estimating the counterfactual and determining the true impact of a
program is finding a valid comparison group. This comparison group must have the same
average characteristics as the treatment group, be unaffected by the program, and react to the
program in the same way the treatment group would.
When we find such a group, we can confidently say that any difference in outcomes between
the two groups is due to the program itself.
Counterfeit Counterfactual Estimate 1: Before-and-After Comparisons
One seemingly natural shortcut is to compare the same group before and after the program.
However, this method can often lead to misleading or counterfeit estimates of the
counterfactual. Here’s why:
In this method, you are comparing the outcomes before the program (the baseline) and after
the program has been implemented. The assumption is that if the program hadn't existed, the
outcome for participants would have stayed the same as it was before the program. But this
assumption is usually not valid because outcomes can change due to other factors, not just the
program itself.
Example: Consider a program that offers microloans to rice farmers.
1. Before the program: In Year 0 (before the program starts), the farmers have an
average rice yield of 1,000 kg per hectare.
2. After the program: After one year (Year 1), with the microloan, the farmers' rice yield
has increased to 1,100 kg per hectare.
So, the before-and-after estimate suggests that the program increased rice yields by 100 kg
per hectare.
The Problem with Before-and-After Comparisons
The problem with this method is that it assumes that without the program, the farmers' yield
would have stayed the same at 1,000 kg per hectare (the baseline). But this assumption is
incorrect because there are many factors that can affect the outcome (such as weather or
market conditions) that are not accounted for in this analysis.
For example:
● If there was a drought in the year the program was implemented, the yield would have
likely been lower without the program, perhaps around 900 kg per hectare. In this case,
the actual program impact would be 1,100 kg - 900 kg = 200 kg (which is larger than the
100 kg estimated from the before-and-after comparison).
● If rainfall improved in the year the program was implemented, the yield would have
likely been higher even without the program, perhaps 1,200 kg per hectare. In this case,
the actual impact of the program would be 1,100 kg - 1,200 kg = -100 kg (a negative
impact).
Thus, the true impact could be larger or smaller than the 100 kg estimate, depending on
factors like weather (rainfall or drought) or other external influences.
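The rice-yield numbers above can be put into a short Python sketch to show how the before-and-after estimate diverges from the true impact under the two hypothetical weather scenarios:

# Figures from the rice-yield example (kg per hectare).
yield_before = 1_000              # Year 0, before the microloan program
yield_after = 1_100               # Year 1, with the program

naive_estimate = yield_after - yield_before           # 100 kg/ha

# Hypothetical counterfactuals: what Year 1 yields might have been without the program.
counterfactual_drought = 900      # a drought year would have lowered yields
counterfactual_good_rain = 1_200  # better rainfall would have raised yields anyway

impact_if_drought = yield_after - counterfactual_drought        # +200 kg/ha
impact_if_good_rain = yield_after - counterfactual_good_rain    # -100 kg/ha

print("Before-and-after estimate:", naive_estimate)
print("True impact if there was a drought:", impact_if_drought)
print("True impact if rainfall improved:", impact_if_good_rain)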
The before-and-after method uses the baseline (Year 0) as the counterfactual (the "what
would have happened" scenario if the program hadn't existed). However, this is an incorrect
assumption because external factors (like weather) could have changed the outcome, making
the baseline an unreliable estimate of the counterfactual.
In summary:
● Before-and-after comparisons are risky because they ignore the fact that other
factors can affect the outcome over time.
● It assumes the outcome would have stayed the same without the program, but that's
rarely the case because of factors like weather, economic changes, etc.
● This leads to counterfeit estimates of the program’s true impact.
Key Takeaway
Before-and-after comparisons can give misleading results because they fail to account for the
many factors that can affect outcomes over time. Instead of using the baseline as the
counterfactual, a more valid method requires using a comparison group (a group not receiving
the program) that is similar in all aspects except for the program itself.
Counterfeit Counterfactual Estimate 2: Comparing Enrolled and
Nonenrolled (Self-Selected) Groups
This is another method of estimating the impact of a program, but it has its own pitfalls. The idea
here is to compare the outcomes of people who voluntarily choose to participate in a program
(the "enrolled" group) to those who choose not to participate (the "nonenrolled" group).
However, this approach can give you a counterfeit estimate of the program's true impact
because the two groups may not be comparable.
● The Enrolled Group: These are people who voluntarily chose to join the program.
● The Nonenrolled Group: These are people who decided not to join, even though they
were eligible for the program.
By comparing these two groups, you’re trying to estimate the counterfactual outcome for the
enrolled group. That is, you want to know how the enrolled group would have performed if they
hadn't participated in the program (the counterfactual situation).
In theory, if the two groups (enrolled vs. nonenrolled) were identical in all important ways except
for program participation, the difference in outcomes could be attributed to the program. But, in
reality, this is rarely the case.
The issue with using this approach is that the two groups are fundamentally different, and
these differences can affect the outcome you're measuring. Specifically, people who choose to
enroll in a program are often different from those who don’t, in ways that are hard to observe
and measure, such as their underlying motivation or skills.
When you compare the enrolled group to the nonenrolled group, you are essentially comparing
two groups that are different in many ways. The counterfactual estimate you get from this
comparison is not valid because the nonenrolled group is not a fair representation of what
would have happened to the enrolled group if they had not participated in the program.
For example:
● If people who chose to enroll in a vocational training program already had better skills or
a higher level of motivation, then the difference in outcomes (e.g., higher income for
the enrolled group) might not be due to the program, but because the enrolled group
was already more likely to succeed than the nonenrolled group.
This means you could overestimate the program’s impact. You might wrongly attribute a higher
income or better performance to the program when, in fact, the enrolled group would have
performed better anyway, just because of their initial advantages.
What you’re dealing with here is called selection bias. This happens when the reason people
choose to enroll in the program is related to factors that affect the outcome, even in the absence
of the program.
In other words:
● If people who are more motivated or more skilled are the ones who enroll, then any
improvement in their outcomes can’t necessarily be attributed to the program itself. The
pre-existing differences (motivation, skills, etc.) are the real cause of their better
outcomes.
How This Impacts the Counterfactual Estimate
If you rely on comparing the enrolled group to the nonenrolled group, you will likely get a
biased estimate of the program’s true impact. This is because the enrolled group and
nonenrolled group are not comparable. The difference in their outcomes is not entirely due to
the program—it also reflects the underlying differences in their characteristics.
The estimate you get is a counterfeit estimate because it assumes that the nonenrolled group
represents the true counterfactual (i.e., what would have happened to the enrolled group if they
hadn’t participated in the program), but it does not. The nonenrolled group is likely to have very
different characteristics that affect their outcomes in ways that the enrolled group does not
share.
Key Takeaways
● Enrolled vs. Nonenrolled Comparison leads to selection bias, because the two
groups differ in ways that affect the outcome.
● These differences are not accounted for, meaning any difference you observe in
outcomes between the groups could be due to pre-existing differences (e.g.,
motivation, skills) rather than the program itself.
● The counterfactual estimate you get from this comparison is incorrect and
counterfeit, because the nonenrolled group is not a valid comparison for the enrolled
group.
● This method can overestimate the program’s impact, because it treats the nonenrolled
group’s outcomes as if they showed what the enrolled group would have experienced
had they not participated in the program.
In conclusion, comparing enrolled and nonenrolled groups can be misleading unless you
account for the differences between the two. Methods like randomized controlled trials
(RCTs) or techniques like propensity score matching are often used to overcome this issue by
making the groups more comparable, which helps produce a more accurate estimate of the
program's impact.
Worked Example: Suppose some eligible individuals enroll in a vocational training program
while others do not. We want to evaluate the impact of this program on income. Specifically, we
want to know how much more (if anything) the enrolled individuals are earning after completing
the program compared to those who didn’t enroll.
The problem arises because people who choose to enroll in the program (the "enrolled" group)
might be fundamentally different from those who don't enroll (the "nonenrolled" group). These
differences may affect their income, regardless of the program.
● Enrolled Group (Treatment Group): 100 individuals who voluntarily chose to enroll in
the vocational training program.
● Nonenrolled Group (Comparison Group): 100 individuals who were eligible for the
program but decided not to enroll.
Let’s say, based on some survey data, we know the groups’ average incomes before and after
the program. First, we calculate the average income change for both groups:
● Enrolled Group:
○ Before: $5,000
○ After: $7,000
○ Change in Income: $7,000 - $5,000 = $2,000 increase
● Nonenrolled Group:
○ Before: $5,000
○ After: $6,200
○ Change in Income: $6,200 - $5,000 = $1,200 increase
If we directly compare the enrolled group to the nonenrolled group, the program appears to
have increased income by $800 more for those who participated ($2,000 vs. $1,200).
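The calculation behind the $800 figure can be written out in a few lines of Python, using the averages from the example:

# Average incomes from the example (hypothetical survey data).
enrolled_before, enrolled_after = 5_000, 7_000
nonenrolled_before, nonenrolled_after = 5_000, 6_200

change_enrolled = enrolled_after - enrolled_before            # $2,000
change_nonenrolled = nonenrolled_after - nonenrolled_before   # $1,200

apparent_impact = change_enrolled - change_nonenrolled        # $800
print(f"Apparent impact: ${apparent_impact}")
# Caution: this $800 mixes any true program effect with selection bias,
# because the enrolled group may differ in motivation, skills, and so on.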
The $800 difference might seem like the program’s impact, but the comparison is not valid. The
enrolled group and the nonenrolled group are likely to be different in important ways, even
before the program started. These differences could be the reason why the enrolled group saw
a larger increase in income. The groups are not equivalent to begin with!
● Motivation: The enrolled group may be more motivated to improve their life, so even
without the program, they may have pursued other avenues to increase their income.
● Skills: The enrolled group might have higher skills or experience, making them more
likely to earn more, even before the program.
● External Factors: The nonenrolled group might be facing more financial hardship or
have fewer resources to improve their income, making them less likely to take part in the
program.
Because the two groups were not randomly selected and are likely different in important ways,
the income difference we observe could be due to those differences, not the program itself.
The income difference observed between the groups is due to selection bias. People who
chose to enroll in the program were likely different (in motivation, skills, etc.) from those who
chose not to. This makes it impossible to attribute the entire difference in income to the
program itself.
For example:
● The enrolled group may have been motivated to improve their income even without the
program, while the nonenrolled group might have been less motivated, leading to
different outcomes.
● If the nonenrolled group had chosen to participate, they might have experienced a
larger or smaller increase in income than we see with the enrolled group.
To properly estimate the program’s impact, we need to account for the differences between
the enrolled and nonenrolled groups. A randomized controlled trial (RCT) or matching
techniques could help us find a more accurate comparison group (nonenrolled individuals who
are similar to the enrolled individuals in terms of skills, motivation, and other characteristics).
For instance, if we randomly assigned people to either the enrolled or nonenrolled group, we
would have more confidence that any differences in income are due to the program and not
other factors.
Conclusion
So, comparing the enrolled and nonenrolled groups without addressing the differences
between them results in a counterfeit estimate of the program's impact. The difference in
income of $800 could be due to factors other than the program itself, such as motivation or skill
levels, and is therefore not a reliable estimate of the program’s true impact.
To avoid this, we would need a method that accounts for these differences, such as random
assignment or matching to ensure we're comparing individuals who are truly similar in every
way except for program participation. This would allow us to make a more accurate estimate of
the program's effect.
Randomized Assignment
1. How It Works: When a program cannot serve everyone who is eligible, a lottery gives
every eligible unit an equal chance of being selected into the program.
● Example: Imagine a program that is designed to help unemployed youth find jobs. If
there are more eligible youth than available slots, a random lottery is used to select
who will get the opportunity to participate in the program. Those selected become the
treatment group, and those not selected form the control group.
2. The Gold Standard of Impact Evaluation: Randomized assignment is considered the
gold standard for evaluating the impact of social programs because it gives us the best
estimate of what would have happened to the program participants if they hadn’t
participated. This is important because it helps us create an accurate counterfactual
(what would have happened in the absence of the program).
3. Avoids Selection Bias: In other methods (like comparing enrolled and non-enrolled
groups), participants might differ systematically from non-participants in ways that affect
the outcome. Randomized assignment eliminates this risk because participants are
chosen randomly and are therefore more likely to be similar to the non-participants
(control group) in key characteristics.
In most cases, programs have limited resources and cannot serve every eligible person. This is
where randomized assignment is particularly useful. Let’s say you have a program that aims
to help the poorest 20% of households in a country. If you cannot reach everyone at once due
to budget or capacity constraints, you can randomly select participants from the eligible
population.
● Example: If an education program can only provide school materials to 500 schools, but
there are thousands of eligible schools, a random lottery can be used to choose the
500 schools that will receive the materials.
This method ensures that the selection process is fair and transparent, and there is no room
for arbitrary decisions, favoritism, or corruption in choosing participants.
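A lottery like this can be implemented in a few lines of Python; the school names, the pool of 2,000 eligible schools, and the seed are all hypothetical:

import random

random.seed(2024)   # a fixed seed so the draw can be replicated and audited

eligible_schools = [f"school_{i:04d}" for i in range(1, 2_001)]  # hypothetical pool of 2,000 eligible schools
slots = 500                                                      # materials available for 500 schools

treatment_schools = random.sample(eligible_schools, slots)       # lottery winners receive the materials
selected = set(treatment_schools)
comparison_schools = [s for s in eligible_schools if s not in selected]

print(len(treatment_schools), "treatment schools,", len(comparison_schools), "comparison schools")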
1. Target Population Larger than Available Slots: In many cases, the population of
eligible participants is larger than the number of spots available in the program. For
example, if a youth employment training program can only enroll a limited number of
youth, a random lottery can determine who gets in. This is often the case for programs
with budget constraints or capacity limitations.
2. Program Rules for Assignment: When a program has more applicants than slots, the
program needs to decide how to allocate the slots. If the program is designed to serve a
specific population, but demand exceeds capacity, a fair and transparent way to allocate
slots is by using a random process.
○ Example: If a rural road improvement program can only pave a few roads in a
given year, and there are many eligible roads in need of improvement,
randomized assignment ensures that the selection of roads is done fairly.
Programs often face the challenge of allocating benefits to a large pool of potential participants.
In such cases, even if there is a way to rank participants (e.g., based on income), the rankings
can be imprecise. Randomly assigning participants within a specific range (e.g., households
with incomes close to the threshold) ensures fairness and prevents errors in allocation.
1. Eliminates Bias: Because assignment is random, there are no systematic differences
between those who receive the program and those who don’t, at least not due to the
selection process itself. This removes the risk of selection bias, where the groups differ
in ways that could affect the outcome (such as motivation, skills, etc.).
2. Fairness: Every eligible participant gets the same chance of receiving the program, which
is an equitable way to allocate benefits when not everyone can be served at once.
3. Clear and Transparent Process: A public lottery is a clear and transparent way of
making decisions about program participation. When the process is open to everyone, it
minimizes the risk of misunderstanding or accusations of unfairness.
Two examples show how randomized assignment has been used as a fair and
transparent method to allocate program benefits, even outside of impact evaluations.
1. Côte d'Ivoire: After a crisis, the government introduced a temporary employment
program for youth, offering jobs like road rehabilitation. Because demand exceeded the
available spots, a public lottery was used to fairly select participants. Applicants drew
numbers publicly, and those with the lowest numbers were selected. This process
helped ensure fairness in a post-conflict environment.
2. Niger: In 2011, the government launched a national safety net project but had more
eligible poor households than benefits available. Due to limited data, a public lottery
was used to select beneficiary villages within targeted areas. Village names were drawn
randomly, ensuring fairness and transparency. This method continued to be used in later
phases of the project due to its success in promoting fairness.
In both cases, using a randomized assignment approach (lottery) ensured that the allocation
was fair, transparent, and widely accepted by local authorities and participants.
Summary
1. Equivalent Groups: Randomized assignment produces a treatment group and a
comparison group that are statistically identical, on average, before the program begins.
2. Estimation of Impact: Once the program is implemented, any differences in outcomes
between the two groups can be attributed to the program itself, because the groups were
identical before the program started. This allows for a true estimate of the program's
impact.
3. Why It's Effective: By randomly assigning participants, we can be confident that any
observed differences in outcomes are due to the program, rather than other external
factors or biases. This eliminates the risk of selection bias, which could occur if
participants were chosen based on certain characteristics.
4. Simplified Process: The impact is calculated by simply comparing the average outcome
in the treatment group to that in the comparison group, and this difference represents the
true impact of the program. Randomized assignment, therefore, ensures that the
counterfactual is accurate, leading to more reliable estimates of a program's
effectiveness.
In essence, randomized assignment helps create two groups that are as similar as possible,
making it easier to attribute differences in outcomes to the program itself.
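A quick simulation illustrates why randomly assigned groups end up statistically equivalent on average; the baseline income figures below are invented:

import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# A hypothetical baseline characteristic for each eligible household.
baseline_income = rng.normal(10_000, 3_000, n)

# Randomized assignment: every household has the same 50% chance of treatment.
treated = rng.random(n) < 0.5

# Because assignment ignores all characteristics, the two groups should look
# nearly identical, on average, before the program starts.
print("Treatment mean baseline income: ", round(baseline_income[treated].mean()))
print("Comparison mean baseline income:", round(baseline_income[~treated].mean()))

The same logic applies to characteristics that are never measured (such as motivation), which is what makes the comparison group a credible counterfactual.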
In randomized assignments, internal validity and external validity are key concepts to ensure
that the impact estimates are both accurate and generalizable.
1. Internal Validity:
Internal validity is concerned with whether the program or treatment itself (in your case, the
treatment group) is responsible for the observed effects. In other words, we want to ensure that
any differences in outcomes (e.g., employment rates, income, etc.) between the treatment
group and the control group are due to the program itself and not due to other external factors
(like individual characteristics, motivations, etc.).
● If, after the job training program, the treatment group finds more jobs than the control
group, internal validity ensures that the difference is caused by the program itself
(random assignment helps create two statistically equivalent groups to eliminate bias).
● If there’s no bias in how the two groups were formed, we can confidently attribute the
difference to the program, making the internal validity strong.
2. External Validity:
External validity is concerned with how well the results of your study can be generalized to
other people, settings, or times. Essentially, can the results of this program be applied to
other groups outside of your study?
● If the study was done on unemployed youth in one city, we’d want to know whether the
results would apply to other groups—like unemployed adults, people in rural areas, or
people in different countries.
● If the study sample (treatment and control groups) is representative of the larger
population, external validity is high, and we can generalize the findings to similar
groups outside of the study.
To summarize:
● Internal validity is about the study's design (was the treatment the cause of the
difference?), and it depends on how well the random assignment created equivalent
groups (treatment vs. control). It is about the accuracy of the cause-and-effect
relationship within the study itself.
● External validity is about generalizability (can the results be applied to the broader
population or different settings?). It depends on how well the study sample represents
the larger population you're trying to generalize to.
So:
● Internal Validity = Can we trust that the differences between the groups are due to the
program itself and not other factors?
● External Validity = Can we apply the results of this study to other groups or settings
beyond the study?
When randomized assignment can be used:
1. When the eligible population exceeds the available program spaces:
● Situation: When there are more eligible participants than there is capacity to serve
them, a lottery system can be used to select the treatment group.
● Example: Suppose a government wants to provide school libraries to public schools, but
there’s only enough budget for one-third of them. A lottery is held where each school has
a 1 in 3 chance of being selected. The schools that win the lottery get the library
(treatment group), while the remaining schools without a library serve as the comparison
group.
● Purpose: This ensures that the comparison group is statistically equivalent to the
treatment group, and no ethical issues arise because the schools left out are essentially
part of the natural limitation due to budget constraints.
2. When a program needs to be phased in gradually:
● Situation: If a program is rolled out gradually and will eventually cover the entire eligible population, randomizing the order in which people receive the program can create a valid comparison group for evaluating impacts.
● Example: If the health ministry wants to train 15,000 nurses across three years, it could
randomly assign one-third of nurses to be trained in each year. After the first year, those
trained in year 1 become the treatment group, while those trained in year 3 are the
comparison group (since they haven’t been trained yet). This allows for the evaluation of
the effects of receiving the program for different amounts of time.
Key Point:
Randomized assignment in these cases helps ensure that the treatment and comparison groups
are statistically equivalent, and it provides a way to estimate the counterfactual (what would
have happened without the program). In both scenarios, either due to limited program spaces or
gradual implementation, randomized assignment can be used to assess the true impact of a
program while maintaining fairness and validity in the evaluation.
Step 1: Define the Eligible Units
● Situation: The first step is to identify the population of units that are eligible for the program. A unit can be a person, a school, a health center, a business, or even a whole village, depending on the program.
● Example: If you’re evaluating a teacher training program for primary school teachers,
then only primary school teachers should be considered as eligible units. Teachers from
other levels (like secondary school teachers) would not be part of the eligible units.
Step 2: Select the Evaluation Sample
● Situation: After defining the eligible units, you may not need to include all of them in the evaluation due to practical constraints (like budget or time). In this case, you randomly select a sample from the eligible units based on the evaluation’s needs.
● Example: If your eligible population includes thousands of teachers, you might select a
sample of 1,000 teachers from 200 schools to evaluate, as it would be more
cost-effective than assessing every teacher in the country.
Step 3: Randomly Assign Units to Treatment or Comparison
● Situation: Once you have the evaluation sample, the next step is to randomly assign the units to either the treatment group or the comparison group. Here are some methods to do this (a minimal code sketch follows the list):
1. Flipping a coin: For a 50/50 split between treatment and comparison groups, flip
a coin for each unit. Decide beforehand whether heads or tails will assign a unit
to the treatment group.
2. Rolling a die: If you want to assign one-third of the sample to the treatment
group, roll a die. For instance, decide that a roll of 1 or 2 means the unit goes into
the treatment group, and 3 to 6 means the comparison group.
3. Drawing names from a hat: Write the names of all units on pieces of paper, mix
them up, and draw the required number of names for the treatment group.
4. Automated process: For larger samples (like over 100 units), use a random
number generator (via software or a spreadsheet) to assign units to the treatment
or comparison group. For example, you might assign the 40 highest random
numbers to the treatment group.
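Below is a minimal sketch of the automated option (method 4). The unit names, the number of units, and the number of treatment slots are all hypothetical; the key idea is that each unit gets a random number and the highest-ranked units are assigned to treatment.

```python
# Hedged sketch of automated random assignment with a random number generator.
import random

random.seed(42)  # fix the seed so the assignment is documented and reproducible

units = [f"school_{i}" for i in range(1, 31)]        # 30 hypothetical eligible schools
scored = [(random.random(), u) for u in units]       # draw a random number for each unit
scored.sort(reverse=True)                            # rank units by their random number

n_treatment = 10                                     # e.g., budget for 10 libraries
treatment = [u for _, u in scored[:n_treatment]]     # highest random numbers -> treatment
comparison = [u for _, u in scored[n_treatment:]]    # remaining units -> comparison

print("Treatment group:", treatment)
```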
Key Points:
By following these steps, you ensure a fair and valid randomization process that helps evaluate
the program’s effects effectively.
When Randomized Assignment Can Be Used:
1. When the Eligible Population Exceeds Available Spots:
○ Situation: When there are more eligible participants than available spots, use a lottery to randomly select participants.
○ Example: If only a third of schools can receive a library due to budget
constraints, a lottery is held where 1 in 3 schools are selected as the treatment
group (with libraries), and the others form the comparison group (without
libraries).
○ Purpose: Ensures fairness and that the comparison group is similar to the
treatment group.
2. When a Program Needs to Be Phased In Gradually:
○ Situation: For programs rolled out over time, randomize the order of who
receives the program first to create a valid comparison group.
○ Example: If nurses are trained over three years, randomly assign them to be
trained in year 1, 2, or 3. Nurses trained in year 1 are the treatment group, and
those trained in year 3 are the comparison group.
○ Purpose: Allows for an evaluation of the program's impact over different
timeframes.
1. Step 1: Define Eligible Units
○ Situation: Identify the eligible population (e.g., primary school teachers for a teacher training program).
○ Example: Teachers from other levels (e.g., secondary school) are excluded.
2. Step 2: Select the Evaluation Sample
○ Randomly select a sample of eligible units to include in the evaluation (the full population can be used if data and budget allow).
3. Step 3: Randomly Assign Units to Treatment or Comparison
○ Methods:
■ Coin flip: For a 50/50 split, flip a coin to assign to treatment or
comparison.
■ Rolling a die: For a 1/3 split, decide beforehand how the die will allocate
participants.
■ Drawing names: Write names on paper, randomly draw for treatment
group.
■ Automated process: Use random number generators for larger samples.
○ Key Points: Ensure transparency and documentation of the randomization
process.
Choosing the Level of Randomized Assignment:
● Challenge: When the level of randomization is higher (e.g., provinces or regions), it may
become harder to perform a valid impact evaluation due to a small number of regions.
This can make it difficult to balance characteristics between the treatment and
comparison groups.
● Example: If a country has only six provinces, and three are randomly assigned to the
treatment group, it may not be sufficient to ensure balanced groups. External factors (like
weather or local events) may affect regions differently, leading to biased results.
● Key Point: For unbiased impact estimates, it is crucial that factors such as rainfall, which
can vary over time, are balanced across treatment and comparison groups.
● Optimal Level: Ideally, randomized assignment should be done at the lowest level
possible to maximize the sample size of the treatment and comparison groups, as long
as spillovers can be minimized.
● Example: In a community-level program, it's important to consider whether individuals
within the same community will affect each other’s outcomes, especially if the program is
designed to target specific individuals but the entire community ends up benefiting.
Key Point: When determining the level of randomized assignment, one must consider the
trade-offs between ensuring a large sample size and minimizing the risks of spillovers or
imperfect compliance. Randomizing at lower levels (such as individuals or households) can lead
to more accurate results but requires careful management to avoid these issues.
Once the evaluation sample is selected and treatment is randomly assigned, estimating the
program's impact is straightforward. Here's a breakdown of how it works:
1. Measuring Outcomes: After the program has been implemented for a certain period,
you need to measure the outcomes for both the treatment group and the comparison
group. These groups are compared to determine the effect of the program.
2. Calculating Impact: The impact of the program is simply the difference between the
average outcomes of the treatment and comparison groups.
○ Formula:
Impact = Average Outcome (Treatment Group) - Average Outcome (Comparison
Group)
3. Full Compliance Assumption: This simple calculation assumes that everyone assigned to the treatment group actually receives the treatment and that no one in the comparison group does.
4. Incomplete Compliance: In practice, not all units assigned to the treatment group receive the treatment, and some units in the comparison group may receive it.
○ For example, in a teacher training program, it’s possible that not all teachers assigned to the treatment group receive the training, or a teacher in the comparison group may attend a training session.
5. Even in this case, randomized assignment still allows for an unbiased estimate of the
program's impact, though interpreting the results will require considering the degree of
compliance and crossover between groups.
Key Point:
○ Optimal Level: Randomizing at the lowest level possible maximizes sample size
and ensures accurate results, while managing risks like spillovers and imperfect
compliance.
○ Example: In a community-level program, ensure individuals do not interact in
ways that influence each other’s outcomes.
○ Key Point: When determining the level, weigh the need for a large sample size
against the risk of spillovers and imperfect compliance.
1. Measuring Outcomes:
○ After the program is implemented, measure outcomes for both the treatment and comparison groups.
2. Calculating Impact:
○ The impact is the difference between the average outcomes of the treatment and comparison groups.
○ Formula: Impact = Average Outcome (Treatment Group) - Average Outcome (Comparison Group)
3. Full Compliance Assumption:
○ In the simple example, it assumes everyone in the treatment group receives the treatment, and no one in the comparison group does.
4. Incomplete Compliance (Real-World Scenario):
○ Not all units in the treatment group may receive the treatment, and some in the
comparison group might get the treatment.
In summary, IV is a powerful tool for evaluating programs where compliance is not guaranteed
or where participants can choose their treatment. By using an external instrument (like random
assignment), researchers can still estimate program impacts effectively.
3. Intention-to-Treat (ITT):
○ Definition: The ITT measures the difference in outcomes between the group offered the treatment (treatment group) and the group not offered the treatment (comparison group), even if not everyone in the treatment group actually receives the treatment.
○ Example: In the Health Insurance Subsidy Program (HISP), all households in
treatment villages were eligible for insurance, but only 90% enrolled. The ITT
compares the outcomes of all households in treatment villages (whether they
enrolled or not) with the outcomes in comparison villages (where no households
enrolled).
4. Treatment-on-the-Treated (TOT):
○ Definition: The TOT measures the impact on individuals who actually receive the
treatment. It is based only on those who participated, not just those offered the
program.
○ Example: In the HISP, the TOT would estimate the impact for the 90% of
households in treatment villages that actually enrolled in the health insurance
program. It provides insight into the effect of receiving the treatment, rather than
just being offered it.
5. When ITT and TOT Differ:
○ If there is full compliance (everyone in the treatment group participates), ITT and
TOT estimates are the same.
○ However, if there is non-compliance (not all offered the treatment participate),
ITT and TOT will differ because ITT includes both participants and
non-participants, while TOT only includes those who actually receive the
treatment.
6. Instrumental Variables Example - Sesame Street:
○ Study: Kearney and Levine (2015) used an instrumental variables (IV) approach
to evaluate the impact of the TV show Sesame Street on school readiness.
○ Instrument: They used households’ proximity to a television tower (which
affected access to UHF channels) as an instrument for participation. The
distance to the tower wasn’t related to household characteristics, but it influenced
whether they could watch Sesame Street.
○ Results: The study found that children in areas where the show was accessible
were more likely to advance through primary school on time, with notable effects
for African-American, non-Hispanic children, boys, and children from
economically disadvantaged backgrounds.
Key Takeaways:
1. Types of Compliance:
○ Full Compliance: All individuals assigned to the treatment group participate, and none of the comparison group participates.
○ Imperfect Compliance: Some individuals assigned to the treatment group may
not participate, or individuals from the comparison group may manage to
participate.
2. Impact Estimation with Imperfect Compliance:
○ The ITT compares outcomes between the treatment group (those assigned
treatment, regardless of participation) and the comparison group (those assigned
no treatment, regardless of participation).
○ Usefulness: ITT can still be useful for measuring the impact of offering the
program (especially when participants self-select) and is a common estimate
when noncompliance is primarily on the treatment side.
○ Example: If some teachers in the treatment group don’t enroll, ITT compares the
outcomes of all teachers offered the training with the outcomes of the comparison
group.
6. Bias Due to Noncompliance in Comparison Group:
○ If some units assigned to the comparison group manage to participate (the Always group), the simple treatment-versus-comparison contrast no longer isolates the effect of the offer, and the LATE approach described below is needed to recover the effect for compliers.
Summary:
● Imperfect Compliance occurs when individuals don’t fully adhere to their treatment or
comparison group assignments.
● Intention-to-Treat (ITT) measures the effect of being offered the program, regardless of
participation.
● Treatment-on-the-Treated (TOT) estimates the impact only for those who actually
participate in the program.
● Local Average Treatment Effect (LATE) provides the treatment effect for a specific
group (the compliers) and is used when there is noncompliance in both treatment and
comparison groups.
When compliance is imperfect, individuals fall into three types:
1. Enroll-if-assigned: These individuals will enroll in the program if assigned to the treatment group but will not enroll if assigned to the comparison group.
2. Never: These individuals will never enroll in the program, even if assigned to the
treatment group.
3. Always: These individuals will find a way to enroll in the program regardless of their
assignment, even if they are in the comparison group.
In the treatment group, the Enroll-if-assigned and Always individuals will enroll, while the
Never group will not. In the comparison group, the Always individuals will enroll, but the
Enroll-if-assigned and Never groups will not. The challenge in evaluating the program lies in
identifying these groups since some individuals can’t be easily distinguished based on their
behavior alone. This makes it difficult to measure the true impact of the program.
1. Intention-to-Treat (ITT) Estimate: This is the first step where we simply compare the
outcomes (e.g., wages) between those assigned to the treatment group and those in the
comparison group, irrespective of whether they actually enrolled in the program. For
instance, if the treatment group’s average wage is $110 and the comparison group’s
average wage is $70, the ITT impact is $40.
2. Local Average Treatment Effect (LATE) Estimate: The next step is to estimate the
impact of the program specifically for the Enroll-if-assigned group, those who would
only enroll if assigned to the treatment group. To do this, the ITT impact is adjusted for
the proportions of the three types of individuals (Never, Always, and Enroll-if-assigned).
For example, if 90% of the treatment group enrolls, and 10% do not (the Never group),
while 10% of the comparison group enrolls (the Always group), then the difference of
$40 in the ITT estimate must come from the 80% Enroll-if-assigned group. Thus, the LATE for the Enroll-if-assigned group is $50, obtained by dividing the $40 ITT by the 0.80 share of compliers ($40 / 0.8 = $50).
The key challenge in these evaluations is that it’s difficult to distinguish between the three
groups (Never, Always, Enroll-if-assigned) for individual participants, as enrollment decisions
are not always observable.
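The arithmetic of the wage example above can be written out in a few lines. This is only a sketch of the calculation described in the text (ITT of $40, 90% enrollment in the treatment group, 10% enrollment in the comparison group); it is not a full estimation procedure.

```python
# Hedged sketch of the ITT -> LATE adjustment, using the numbers from the text.
itt = 110 - 70                      # ITT: mean wage, assigned-to-treatment minus comparison
enroll_rate_treatment = 0.90        # share enrolling when assigned to treatment
enroll_rate_comparison = 0.10       # "Always" types who enroll even in the comparison group

complier_share = enroll_rate_treatment - enroll_rate_comparison  # Enroll-if-assigned = 0.80
late = itt / complier_share         # LATE for compliers

print(f"ITT = {itt}, complier share = {complier_share:.2f}, LATE = {late:.1f}")
# -> ITT = 40, complier share = 0.80, LATE = 50.0
```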
Instrumental Variables (IV) Approach
An example of using IV in practice is the PACES program in Colombia, where secondary school
vouchers were randomly assigned through a lottery. Researchers used the lottery outcome as
an IV to estimate the effect of the vouchers on educational and social outcomes. Even with
some noncompliance (e.g., 90% of lottery winners used the voucher), the randomized
assignment allowed for a reliable estimate of the treatment effect.
In summary, the randomized assignment helps estimate impacts despite imperfect compliance
by acting as an IV to predict enrollment and recover LATE. The final estimate provides insights
into the program’s effect on those who comply with their assignment to treatment.
1. The LATE estimate provides the impact of the program on a specific subgroup of the population: those who comply with their assignment (Enroll-if-assigned).
○ These compliers are different from Never and Always types. The Never group
(those who do not participate even if assigned to the treatment group) may
include people who expect little benefit from the program. The Always group
(those who would enroll even if assigned to the comparison group) may include
highly motivated individuals who are likely to benefit the most from participation.
2. For example, in a teacher-training program, the Never group might consist of teachers
who feel they don’t need training, have a higher opportunity cost (like a second job), or
face less supervision. On the other hand, the Always group might include teachers who
are highly motivated or are under strict supervision, making them more likely to enroll in
the training even if they were assigned to the comparison group.
3. The LATE estimate applies only to the Enroll-if-assigned group, and does not reflect the impact on the Never or Always groups.
○ For example, if the ministry of education offers a second round of teacher training
and forces the Never group to participate, we do not know how their outcomes
would compare to those in the first round. Similarly, the LATE estimate does not
provide insights into the impact on the Always group (the most self-motivated
teachers).
4. The LATE estimate should not be generalized to the entire population, as it only applies
to the subgroup of individuals who would participate if assigned to the treatment group.
In summary, the LATE estimate gives the program’s impact only for the compliers—those who
enroll in the program if assigned to the treatment group—but does not apply to those who never
enroll or those who always find a way to participate. Therefore, the LATE estimate is not
representative of the entire population, and its interpretation should be confined to the specific
group of compliers.
In a voluntary enrollment program, individuals who are interested can choose to enroll. For
example, consider a job-training program where individuals can enroll freely. But since not all
will choose to participate, we encounter different types of individuals:
● Always: These are individuals who will enroll in the program regardless of any external
influence.
● Never: These individuals will never enroll in the program, regardless of external
incentives.
● Compliers or Enroll-if-promoted: These individuals will enroll only if encouraged or
promoted, such as through an additional incentive or outreach. Without the incentive,
they would not enroll in the program.
Imagine a job-training program with an open enrollment policy where anyone can sign up.
However, many unemployed individuals may not know about the program or may lack the
incentive to participate. To address this, an outreach worker is hired to randomly visit a subset
of unemployed people and encourage them to enroll in the program.
● The outreach worker does not force participation but instead incentivizes a random
group to participate.
● The non-visited group is also free to enroll, but they have to seek out the program on
their own.
● If the outreach effort works, those who are visited by the outreach worker are more likely
to enroll than those who are not visited.
To evaluate the impact of the job-training program, we cannot simply compare those who
enrolled with those who did not. The enrollees are likely different from the non-enrollees in ways
that affect their outcomes, such as education or motivation.
However, since the outreach worker's visits are randomized, we can compare the group that
was visited (promoted) with the group that was not visited (non-promoted). This random
assignment helps us create a valid comparison group because:
● Both groups (promoted and non-promoted) contain individuals who are Always enrolled
and individuals who are Never enrolled, based on their individual characteristics.
● The key difference is that in the promoted group, individuals who are compliers
(Enroll-if-promoted) are more likely to enroll because of the extra encouragement, while
the non-promoted group has these same individuals, but without the added incentive to
participate.
The variation between the two groups, with one group being encouraged to enroll, allows us to
estimate the Local Average Treatment Effect (LATE). Specifically, the LATE estimate tells us
the impact of the program on the Enroll-if-promoted group, which is the group that only enrolls
because of the random promotion.
Randomized promotion creates a random difference between the promoted and non-promoted
groups, making it an effective instrumental variable (IV). This helps us estimate the impact of
the program on the compliers (Enroll-if-promoted), but the result is still a LATE estimate. Just
like in randomized assignment with imperfect compliance, this estimate applies only to the
specific subgroup of individuals who are compliers and should not be generalized to the whole
population. The Always and Never groups, who behave differently, are not included in the
estimate.
In summary, randomized promotion is a strategy that can be used when a program has open
enrollment and it is possible to randomly encourage some individuals to participate. This
strategy allows us to use random promotion as an instrumental variable to estimate impact in an
unbiased way. However, as with randomized assignment with imperfect compliance, the impact
evaluation based on randomized promotion provides a LATE estimate, which is a local estimate
of the impact on a specific subgroup of the population, the Enroll-if-promoted group. This
estimate cannot be directly extrapolated to the entire population, as it does not account for the
Always or Never groups.
Promotion (encouragement) can take different forms, for example:
● Information campaigns: Reaching individuals who didn’t enroll because they were unaware or didn’t fully understand the program’s content.
● Incentives: Offering small gifts, prizes, or transportation to motivate enrollment.
This strategy relies on the instrumental variable (IV) method to provide unbiased estimates of
program impact. It randomly assigns an encouragement to participate in the program, which
helps evaluate programs that are open to anyone eligible.
Key Concept
Randomized promotion is an instrumental variable method that allows for unbiased estimation
of program impact. It randomly encourages a selected group to participate, making it especially
useful for evaluating programs with open eligibility.
For randomized promotion to provide a valid estimate of the program’s impact, the evaluation follows several steps:
1. Define Eligible Units: Identify the individuals eligible for the program.
2. Select the Evaluation Sample: Randomly select individuals from the population to be included in the evaluation. This can be a subset of the population or, in some cases, the entire population if data is available.
3. Randomize Promotion: Randomly assign the evaluation sample to a promoted group (which receives the outreach or encouragement) and a nonpromoted group (which does not).
4. Enrollment: After the promotion campaign, observe who enrolls in the program.
Once the eligible population is identified, we can classify units into three groups:
1. Always: Individuals who will always enroll in the program, regardless of promotion.
2. Enroll-if-promoted: Individuals who will only enroll if they receive additional promotion or encouragement.
3. Never: Individuals who will never enroll in the program, even if promoted.
In the nonpromoted group, only individuals in the Always category will enroll. However, it’s not possible to distinguish between the Never and Enroll-if-promoted groups because they both do not enroll.
In the promoted group, both Enroll-if-promoted and Always individuals will enroll, while the Never individuals will not. In this group, we can identify the Never group, but we cannot distinguish between Enroll-if-promoted and Always individuals.
Worked Example
Imagine a scenario where we are evaluating a program using randomized promotion. Suppose
there are 10 individuals per group in a study. In the nonpromoted group, 30% of individuals
enroll (which means 3 individuals, all of whom are "Always" enrollees). In the promoted group,
80% of individuals enroll (which means 3 "Always" individuals and 5 "Enroll-if-promoted"
individuals).
● The average outcome in the nonpromoted group is 70, and in the promoted group, it is
110.
● The difference between the average outcomes is 40 (110 - 70).
To estimate the LATE, we need to understand that the Enroll-if-promoted group represents
50% of the population in the promoted group (5 out of 10). The impact on this group is
calculated by dividing the total difference (40) by the percentage of individuals in the population
who are Enroll-if-promoted (50% or 0.5).
Thus, the Local Average Treatment Effect (LATE) for the Enroll-if-promoted group is 80.
This LATE estimate is valid because the promotion was assigned randomly, ensuring that the
promoted and nonpromoted groups have similar characteristics. Therefore, the observed
differences in outcomes can be attributed to the program's impact on the Enroll-if-promoted
individuals.
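The same calculation, using the numbers from this randomized promotion example, looks like this. All values are taken from the hypothetical scenario above.

```python
# Hedged sketch of the randomized-promotion LATE calculation
# (10 hypothetical individuals per group, outcomes as given in the text).
mean_promoted = 110
mean_nonpromoted = 70
enroll_rate_promoted = 0.8        # 8 of 10 enroll (3 Always + 5 Enroll-if-promoted)
enroll_rate_nonpromoted = 0.3     # 3 of 10 enroll (all Always)

itt = mean_promoted - mean_nonpromoted                           # effect of the promotion offer
complier_share = enroll_rate_promoted - enroll_rate_nonpromoted  # Enroll-if-promoted = 0.5
late = itt / complier_share

print(f"Difference = {itt}, complier share = {complier_share}, LATE = {late}")
# -> Difference = 40, complier share = 0.5, LATE = 80.0
```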
Important Notes
● The impact calculated here is specific to the Enroll-if-promoted group. This estimate
cannot be directly extrapolated to other groups (like the Never or Always groups)
because they are likely to be very different from the Enroll-if-promoted group in terms
of their characteristics, such as motivation or information.
● In this case, while the promoted group showed a higher average outcome (110), this
increase in outcomes is entirely due to the individuals who enrolled because of the
promotion. The Always and Never groups did not contribute to this impact.
In 1991, Bolivia scaled up a successful Social Investment Fund (SIF) aimed at improving rural
infrastructure, including education, health, and water. As part of the impact evaluation for the
education component, randomized promotion was used to encourage communities in the
Chaco region to apply for funding.
● Promoted communities received extra visits and encouragement from program staff.
● Non-promoted communities could apply independently.
The evaluation showed that the program succeeded in improving the physical infrastructure
of schools (e.g., electricity, sanitation, and textbooks), but it had little effect on educational
outcomes. However, there was a small reduction (about 2.5%) in the dropout rate.
The use of randomized promotion provided valuable insights into how physical infrastructure
improvements affect school quality. These findings helped adjust future priorities in Bolivia’s
education investment strategy.
Limitations of randomized promotion:
○ The LATE (Local Average Treatment Effect) estimates are only for those individuals who enroll in the program only when encouraged (the Enroll-if-promoted group). This is a subset of the entire population.
○ If the program's goal is to help people who would enroll without encouragement
(the Always group), the randomized promotion method will not estimate impacts
for this group. The LATE estimate applies only to individuals who enroll when
encouraged.
○ In some cases, the Always group may be the target group for the program,
meaning that the randomized promotion approach won't fully capture the impact
on them.
Regression Discontinuity Design (RDD)
Regression Discontinuity Design (RDD) is an evaluation method used to measure the causal
impact of programs or interventions when eligibility is determined by a threshold on a
continuous variable (e.g., income, test score, age). Essentially, this method takes advantage of
a situation where the eligibility for a program is based on whether a certain value crosses a
specific cutoff. It compares people who are just above and just below this cutoff to assess the
impact of the program.
RDD is particularly useful because it helps evaluate program effectiveness in situations where
random assignment (like in randomized controlled trials) is not feasible.
RDD relies on the following key elements:
1. Continuous Eligibility Index: The program or policy uses a continuous index to determine eligibility. This could be a score, income level, age, or any other measurable factor.
2. Threshold or Cutoff: A specific value in the eligibility index is defined as the cutoff.
Individuals just below this value may be eligible for the program, while those just above
may not be.
3. Comparison of Groups: The individuals just above the cutoff are very similar to those
just below it, except for their eligibility for the program. RDD compares these two groups
(treated vs untreated) to determine the causal impact of the program.
For RDD to produce valid estimates, several conditions must hold:
1. Continuous Eligibility Index: Eligibility must be determined by a continuous index (such as a poverty score) that ranks units.
2. Clearly Defined Cutoff: There must be a clear, unambiguous cutoff that separates those who are eligible from those who are not. For instance, if the poverty index is used, only households with scores below 50 are considered eligible.
3. Unique Cutoff for the Program: The cutoff should only be used for the program being
evaluated. If the same threshold is used for multiple programs, it could confuse the
impact measurement for a single program.
4. Non-Manipulability of the Score: The score that determines eligibility should not be
easily manipulated. This ensures that the assignment to treatment (program
participation) is random around the cutoff, making it possible to draw valid conclusions.
● Eligibility: The program provides fertilizer subsidies to farms with fewer than 50
hectares of land.
● Index: The number of hectares a farm has is the continuous eligibility index.
● Cutoff: Farms with fewer than 50 hectares qualify for the subsidy, while those with 50
hectares or more do not.
Around the 50-hectare cutoff, we would compare:
● Farms with 48, 49, and 49.9 hectares, which are eligible for the subsidy.
● Farms with 50, 50.1, and 50.2 hectares, which are ineligible.
RDD would compare the outcomes (e.g., rice yields) of farms just below the cutoff (49.9
hectares) with those just above the cutoff (50.1 hectares). Since these farms are very similar in
all aspects except for the subsidy (fertilizer), the difference in their outcomes can be attributed to
the impact of the fertilizer subsidy itself.
Impact Measurement:
● The average rice yield for farms just below 50 hectares is compared to those just above
50 hectares.
● Any difference in rice yield between these groups is considered the effect of the fertilizer
subsidy.
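A minimal sketch of this comparison is shown below. The (hectares, yield) pairs and the 2-hectare bandwidth are hypothetical; a real RDD analysis would typically fit a regression on each side of the cutoff rather than compare raw averages.

```python
# Hedged sketch of a near-cutoff RDD comparison with hypothetical data.
farms = [
    (48.0, 5.6), (49.0, 5.8), (49.9, 5.9),   # just below the cutoff (subsidized)
    (50.1, 5.1), (50.2, 5.0), (51.0, 4.9),   # just above the cutoff (not subsidized)
]
cutoff = 50.0
bandwidth = 2.0  # compare only farms within 2 hectares of the cutoff

below = [y for h, y in farms if cutoff - bandwidth <= h < cutoff]   # eligible farms
above = [y for h, y in farms if cutoff <= h <= cutoff + bandwidth]  # ineligible farms

impact_at_cutoff = sum(below) / len(below) - sum(above) / len(above)
print(f"Estimated impact near the cutoff: {impact_at_cutoff:.2f}")
```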
Graphical illustration:
● Baseline (before the program): Rice yields are plotted against the number of hectares of land. You would typically see yield per hectare decline as farm size increases (i.e., smaller farms tend to have higher yields per hectare).
● Follow-up (after the program): You compare the yield after the subsidy is given. The
farms that received the subsidy (those just under 50 hectares) may show a noticeable
increase in rice yields, while those just above the cutoff (ineligible farms) do not.
● Local Average Treatment Effect (LATE): The impact estimated by RDD is valid only
near the cutoff (around 50 hectares). So, we can be confident in the results for
medium-sized farms just below the cutoff, but the results may not apply to very small
farms (e.g., 10 or 20 hectares).
● No Need for Control Group: Since the program rules assign eligibility strictly based on
the cutoff, there’s no need for a traditional control group in this evaluation. The
comparison group (farms just above the cutoff) acts as a valid counterfactual.
Advantages:
● Causal Inference: RDD is one of the best quasi-experimental methods for estimating
causal effects because it uses a natural cutoff to compare very similar individuals.
Limitations:
● Local Results: The impact estimated by RDD is local to the region around the cutoff,
which means it may not be generalized to all potential participants (e.g., very small or
very large farms).
● Data Requirements: RDD requires a large number of observations near the cutoff to
provide accurate estimates.
Conclusion
Regression Discontinuity Design (RDD) is a powerful tool for evaluating programs that use a
clear eligibility index and cutoff. By comparing individuals or units just above and just below the
cutoff, RDD allows researchers to estimate the causal impact of the program. However, the
results are most reliable for the "local" area around the cutoff, and the method assumes that the
index cannot be manipulated.
Example Recap:
● Farms just below the 50-hectare cutoff (eligible for the fertilizer subsidy) are compared with farms just above it (ineligible).
● The difference in rice yields between these groups is attributed to the fertilizer subsidy, giving us an estimate of the program’s impact.
There are two variants of RDD:
● Sharp RDD: Full compliance with the treatment assignment based on the cutoff. If a unit is eligible, it participates; if not, it does not.
● Fuzzy RDD: Some units do not comply with the eligibility assignment. For example, those who qualify may choose not to participate, and some who do not qualify might find a way to participate. In this case, we apply the instrumental variable (IV) approach to account for this noncompliance.
The instrumental variable in a fuzzy RDD is eligibility itself: an indicator of whether a unit's score on the index (for instance, the poverty index) falls on the eligible side of the cutoff. The instrument helps identify the local average treatment effect (LATE), which is only valid for the subpopulation around the cutoff that complies with the treatment assignment.
The Jamaica PATH program provides a clear example of using RDD to evaluate the effectiveness of a social safety net program targeting low-income households.
Validating RDD
Before using RDD, it is crucial to verify that there is no manipulation of the eligibility index.
Manipulation might occur if individuals or administrators adjust the index to gain access to the
program.
1. Density Tests: By plotting the distribution of the eligibility index, researchers can check for signs of manipulation. If there is a bunching of units just below the cutoff and a scarcity just above, that might suggest manipulation (e.g., people reporting lower poverty scores to qualify for benefits). A minimal sketch of this check follows the list.
2. Participation Tests: Checking the relationship between the eligibility index and actual
program participation helps confirm whether the program was administered as planned.
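Here is a minimal sketch of the density check described in point 1. The scores, the cutoff of 58, and the bin width are hypothetical; formal approaches (such as the McCrary density test) are used in practice, but even a simple binned count can reveal suspicious bunching.

```python
# Hedged sketch of a bunching check on a hypothetical eligibility index.
from collections import Counter

scores = [52, 54, 55, 56, 57, 57, 57, 57, 58, 59, 60, 61, 63, 64]  # hypothetical scores
cutoff = 58
bin_width = 2

bins = Counter((s - cutoff) // bin_width for s in scores)  # bin scores relative to the cutoff
for b in sorted(bins):
    low, high = cutoff + b * bin_width, cutoff + (b + 1) * bin_width
    print(f"[{low}, {high}): {bins[b]} households")
# A spike just below the cutoff relative to just above it would suggest manipulation.
```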
Example: Health Insurance Subsidy Program (HISP)
The HISP study offers another example where RDD was used to evaluate the impact of a
health insurance subsidy program. Here's the process:
1. Eligibility Criteria: A poverty index with a cutoff score of 58 determines who is eligible
for the health insurance subsidy. Households with scores below 58 are considered poor
and eligible for the program.
2. Density and Participation Check: No manipulation is found around the cutoff, as the
density of households across the poverty index is smooth, and only households below
the cutoff participate in the program.
3. Impact at the Cutoff:
○ The follow-up analysis shows a discontinuity at the cutoff (poverty index of 58), indicating that households just below the cutoff have significantly lower out-of-pocket health expenditures, thanks to the subsidy.
○ In a hypothetical scenario with manipulation (Panel B of the source figure), you would instead see a bunching effect, indicating that some households might have manipulated their eligibility score to qualify for the program.
Conclusion
RDD, whether sharp or fuzzy, provides a robust framework for evaluating programs that use
eligibility thresholds. Fuzzy RDD is particularly useful when there’s noncompliance with
treatment assignment. By using the instrumental variable approach, researchers can estimate
the local average treatment effect (LATE) for the population near the cutoff.
In practice, fuzzy RDD is applied when noncompliance is suspected. In contrast, sharp RDD is
valid when there is strict compliance, meaning the eligibility criteria are strictly adhered to. Both
designs require thorough checks for manipulation and careful validation of program
participation.
RDD provides an estimate of the treatment effect specifically for the group of individuals
around the cutoff (the local population), rather than the entire population. This can be a strength
or a limitation, depending on the policy question:
● Strength: If the policy question is about marginal decision-making (e.g., Should the
program be expanded or contracted near the eligibility cutoff?), then RDD gives the
exact estimate needed.
● Limitation: If the question is about the overall effectiveness of the program for the
entire population, RDD may not provide a representative estimate, as it only applies to
those near the cutoff.
Interpretation Issue: The generalizability of the results is limited to those close to the cutoff
score. Individuals far from the cutoff may have different characteristics or responses to the
program, which makes extrapolating the results less reliable for the broader population.
Another key challenge arises when there is noncompliance with the assignment rule. This
occurs when individuals who are supposed to receive the treatment (based on the eligibility
index) do not participate, or when individuals who are supposed to be in the control group
manage to participate in the program. This leads to fuzzy RDD, where the eligibility index
becomes an instrumental variable for participation in the program.
● Instrumental variable methodology: In this case, the eligibility cutoff serves as an
instrument for whether individuals receive the treatment. However, this means that the
estimated treatment effect only applies to those marginally compliant with the eligibility
rule (i.e., those close to the cutoff), rather than the broader population.
Interpretation Issue: The findings are localized to those on the margin of eligibility and may
not apply to those who are always compliant (always-takers) or never compliant (never-takers).
RDD typically relies on a smaller sample of units close to the cutoff, which can lower the
statistical power of the analysis compared to methods that use larger samples (e.g.,
randomized controlled trials). To address this, researchers must choose an appropriate
bandwidth around the cutoff:
● A larger bandwidth may increase sample size, but it could also introduce greater
heterogeneity between treatment and comparison units, potentially biasing the results.
● A smaller bandwidth may lead to fewer observations, reducing the power of the analysis.
Practical Tip: To mitigate this challenge, researchers often perform robustness checks by
testing the results using different bandwidths. This helps assess the sensitivity of the estimates
to the choice of bandwidth.
RDD relies on a regression model to estimate the treatment effect, and the functional form of
the relationship between the eligibility index and the outcome of interest plays a crucial role. If
the relationship is non-linear, but the model assumes a linear form, it could lead to incorrect
conclusions.
Practical Tip: Researchers should test the sensitivity of their results to different functional forms
(e.g., linear, quadratic, cubic) to ensure the robustness of their estimates. Failure to account for
complex relationships can lead to incorrect conclusions about the existence and magnitude of
a discontinuity at the cutoff.
For RDD to provide valid results, the eligibility rule and cutoff must be precisely defined and
resistant to manipulation. If the eligibility index can be manipulated by program participants,
enumerators, or other stakeholders (e.g., by altering reported values of assets or income), this
could lead to a discontinuity in the eligibility index that undermines the assumptions of the
RDD.
Example of Manipulation: In some cases, if participants know that a small adjustment (e.g., a
minor change in reported income or assets) could make them eligible for the program, they
might manipulate their eligibility score. This results in a bunched distribution of scores just
below the cutoff, which would undermine the validity of the RDD.
Practical Tip: Researchers should test for manipulation by checking the distribution of the
eligibility index around the cutoff (e.g., using density tests) to ensure there is no unusual
concentration of participants just below the cutoff.
RDD works best when the eligibility rule is specific and unique to the program being evaluated.
If the same eligibility index is used for multiple programs (e.g., multiple welfare or poverty
programs), it becomes difficult to isolate the effect of one program from the effects of others.
This issue arises when targeting rules overlap, and the eligibility cutoff might not be unique to a
single program.
Interpretation Issue: When eligibility rules are not unique, it becomes challenging to attribute
the observed effect to the program of interest. Multiple programs targeting the same individuals
can confound the results.
In summary, the main challenges in interpreting RDD results are:
1. Locality of Estimates: The estimates apply to those near the cutoff, not the entire population.
2. Noncompliance: Fuzzy RDD requires an instrumental variable approach, but the results
are only relevant for those who comply with the eligibility rule.
3. Statistical Power: A small sample size near the cutoff can reduce statistical power, and
bandwidth selection is crucial.
4. Sensitivity to Functional Form: Incorrect functional forms can distort results;
robustness checks are essential.
5. Manipulation of the Eligibility Index: If participants or administrators can manipulate scores around the cutoff, the design’s assumptions break down; density tests help detect this.
6. Eligibility Rule Uniqueness: If the eligibility rule is shared across multiple programs, it’s hard to isolate the effect of a specific program.
Difference-in-Differences
The Difference-in-Differences (DD) method is a technique used in impact evaluation when a
program is implemented, but there is no clear rule for assignment or randomization. It is typically
employed when the program's assignment rules are less transparent or not feasible for more
precise methods like randomized controlled trials (RCTs), instrumental variables (IV), or
regression discontinuity design (RDD). This method uses two groups: a treatment group (those
who receive the program) and a comparison group (those who do not). The method compares
the changes in outcomes over time between these two groups.
Key Concepts:
1. Treatment Group: The group of individuals or entities receiving the program or
intervention.
2. Comparison Group: The group of individuals or entities that do not receive the
program, but otherwise face similar conditions.
3. Before-and-After Data: The outcome of interest is measured for both groups before and after the intervention.
4. Counterfactual Estimate: The DD method uses the comparison group to estimate what would have happened to the treatment group if they had not received the intervention.
How the impact is calculated:
○ For both the treatment and comparison groups, the outcome of interest (e.g., employment rate) is measured before and after the intervention.
○ For the treatment group: Measure the change in the outcome from before to after the intervention. This is denoted as (B - A).
○ For the comparison group: Measure the change in the outcome over the same period. This is denoted as (D - C).
○ The DD impact is the difference between these two changes: DD = (B - A) - (D - C).
Example:
Imagine a road repair program where the goal is to improve access to labor markets, and
employment rates are used as the outcome measure. If certain districts (treatment group)
receive the program while others (comparison group) do not, we compare the changes in
employment rates over time:
● For the treatment group: The employment rate goes from 60% (A) before the program
to 74% (B) after the program.
● For the comparison group: The employment rate goes from 78% (C) to 81% (D) after
the program.
The DD estimate is (74 - 60) - (81 - 78) = 14 - 3 = 11, suggesting that the program led to an 11 percentage point increase in the employment rate, after accounting for the general time trend that also affected the comparison group.
Assumptions:
● Parallel Trends Assumption: The key assumption in the DD method is that, in the
absence of the program, the treatment and comparison groups would have experienced
the same trend over time. This means that any difference in their outcomes can be
attributed to the program.
Advantages and Limitations:
● Advantages:
○ Helps control for unobserved factors that are constant over time within each
group.
● Limitations:
○ Requires the parallel trends assumption, which may not hold in all cases.
In summary, the DD method is a powerful tool when randomization isn't possible, as it combines
before-and-after comparisons with comparisons between treatment and control groups to better
estimate program impacts. However, it relies on strong assumptions, and results can be biased
if those assumptions are violated.
The Difference-in-Differences (DiD) method is a powerful statistical tool that helps evaluate the causal impact of a treatment or intervention in observational settings where randomized controlled trials are not feasible. Here's a breakdown of the key points related to the "Equal Trends" assumption and the ways it can be tested:
● What It Implies: Without the program or treatment, the outcomes for both the treatment
and comparison groups should have evolved in the same way (parallel trends).
● What Goes Wrong If This Assumption Is Violated: If the groups would have followed
different trends in the absence of treatment, the comparison of post-treatment
differences would lead to a biased estimate of the treatment effect. Specifically, you
might overestimate or underestimate the impact of the treatment, as the counterfactual
for the treatment group (i.e., what would have happened to them without the treatment)
is incorrectly modeled using the comparison group.
Example:
If a road repair program occurs in a treatment area at the same time a new seaport is
constructed, it would be impossible to separate the effects of the two events using DiD because
the comparison group may have had different trends or experiences (e.g., the impact of the
seaport) that could distort the interpretation of the program’s effects.
While you can’t directly observe what would have happened to the treatment group without the
program, there are ways to test the validity of the equal trends assumption:
1. Comparing pre-intervention trends:
○ To ensure that the trends of both groups were similar before the treatment, compare the changes in outcomes for the treatment and comparison groups before the intervention.
○ This test helps verify that no unaccounted-for differences are driving the results.
2. Placebo test with a fake treatment group:
○ Fake treatment groups are created (e.g., using a cohort that was not affected by the intervention) to check if any pre-existing differences between the groups were in fact influencing the outcome.
○ If this “fake” treatment group doesn’t show any effect, it supports the assumption of parallel trends for the actual groups.
3. Placebo test with a fake outcome:
○ Another variation of the placebo test is to check the assumption with an outcome that is unaffected by the treatment. For example, if the intervention is supposed to influence school attendance, you can check if the treatment has any impact on something unrelated, such as number of siblings. A significant effect here would indicate a flawed comparison group.
4. Using different comparison groups:
○ If different comparison groups (e.g., eighth graders vs. sixth graders) yield similar results, it strengthens the case for the validity of the parallel trends assumption.
○ If they yield different results, it suggests that the assumption might not hold.
Example: Water privatization and child mortality
● Researchers used DiD to analyze the effects of water privatization on child mortality rates.
● Placebo Test: They tested a fake outcome (mortality from causes unrelated to water),
and found no impact, suggesting that the program had a valid effect on mortality due to
water-related diseases.
● The study found that privatization was associated with reduced child mortality,
particularly in the poorest areas where the water network expansion was greatest.
Example: School construction, education, and wages
● Researchers tested the equal trends assumption by comparing age cohorts (18–24 vs. 12–17 years) in districts where school construction happened, and found no significant differences in educational attainment pre-program.
● The results confirmed that the program had a positive impact on educational attainment
and wages for younger cohorts, showing parallel trends in the absence of the
intervention.
In the case of the Health Insurance Subsidy Program (HISP), DiD was used to evaluate how
the program affected household health expenditures:
● Before and After Comparison: The analysis compares health expenditures for enrolled and nonenrolled households before and after the program.
Conclusion
The difference-in-differences method is useful for controlling for both observed and
unobserved time-invariant characteristics that could otherwise confound the results.
However, its validity hinges on the assumption of equal trends in the pre-intervention period.
The various testing methods, including pre-treatment comparisons, placebo tests, and using
multiple comparison groups, help assess the robustness of this assumption and ensure that the
estimated treatment effects are not biased.
The Difference-in-Differences (DiD) method, while useful for estimating the impact of an
intervention, has several limitations that can lead to biased or invalid estimates of treatment
effects, even when the assumption of equal trends holds. Let’s break down these limitations in
more detail:
○ Explanation: The DiD method assumes that the only difference between the
treatment and comparison groups is the treatment itself. However, if there are
any other factors that affect one group more than the other at the same time
the intervention occurs, and if these factors are not controlled for in the
regression, the results will be biased.
○ The DiD method assumes that the only thing that changes over time for the two
groups is the intervention itself. However, if there are any time-varying factors
that affect the groups differently, such as natural disasters, policy changes, or
regional economic shifts, the assumption of equal trends can be violated.
○ If a study fails to account for variables that influence the treatment and
comparison groups differently over time, multivariate regression analysis
(which is typically used to control for confounding variables) might not fully adjust
for those differences. In this case, the DiD estimate will still be biased.
Conclusion
The Difference-in-Differences (DiD) method is not foolproof, and even when trends are equal
before the intervention, several factors can still introduce bias into the estimation. These
include:
● Unaccounted external factors (like droughts or policy changes) that affect the
treatment and comparison groups differently.
● Time-varying shocks that influence one group more than the other.
To mitigate these issues, researchers must be diligent in identifying and controlling for all
potential confounding factors and external shocks that could impact the treatment and
comparison groups differently during the study period. If these factors are not accounted for, the
DiD method may produce invalid or biased estimates.
MATCHING
Matching: Constructing an Artificial Comparison
Group
Matching is a statistical technique used to create a comparison group for estimating the impact
of a treatment or program when there is no clear assignment rule (e.g., randomization). The
goal is to find individuals from the non-treatment group (comparison group) who are as similar
as possible to those in the treatment group based on certain observed characteristics.
● Data with Treated and Non-Treated Groups: For example, you're trying to evaluate the
effect of a job training program on income. The dataset includes individuals who enrolled
in the program (treatment group) and those who did not (comparison group).
Challenges in Matching:
1. Curse of Dimensionality: When you try to match on too many characteristics (e.g., age,
education, employment history), it becomes difficult to find exact matches for each
treated individual. This is called the "curse of dimensionality."
2. Large Data Sets: If there are many characteristics or if each characteristic takes on
many values, finding a good match can be challenging unless you have a very large
dataset.
3. Balancing Characteristics: If too few characteristics are matched, the treatment and
comparison groups might still differ in important ways. If too many characteristics are
used, it can be hard to find matches.
● For example, suppose we match on four characteristics: age, gender, months unemployed, and whether the individual has a secondary school diploma.
● Matching tries to find a non-treated individual who has similar characteristics to each treated individual. If the treatment group includes individuals with certain combinations of these four characteristics, the matching process finds non-treated individuals who have the closest combination.
In summary, matching helps create a comparison group by finding non-treated individuals with
similar characteristics to those in the treatment group, allowing for more reliable estimation of
the treatment effect. However, increasing the number of characteristics makes the matching
process harder and can lead to difficulties in finding good matches.
Propensity Score Matching (PSM)
○ The propensity score is a single value that summarizes the likelihood of a unit receiving treatment based on observed characteristics. This score ranges from 0 to 1.
○ After computing the propensity score for each individual in both the treatment
(enrolled) and control (non-enrolled) groups, individuals in the treatment group
are matched with individuals in the control group who have similar propensity
scores.
○ The aim is to form a comparison group that resembles the treatment group as
closely as possible on the observed characteristics that influence the likelihood of
receiving treatment.
○ The average treatment effect (ATE) is derived from the difference in outcomes
between these matched groups.
○ If some treated units cannot find a close match due to a lack of common
support (i.e., no units in the control group have a similar propensity score), the
analysis may only provide estimates for those units that can be matched — the
local average treatment effect (LATE). This refers to the treatment effect for
those individuals for whom a match exists.
The basic steps are:
○ For each individual, estimate the probability of being treated based on observed characteristics (using a statistical model like logistic regression).
○ Match treated units with non-treated units that have similar propensity scores.
○ Calculate the difference in outcomes between the matched treatment and control
units. This gives an estimate of the program's impact.
Key caveats:
○ Matching can only account for observed characteristics. If there are unobserved factors influencing both treatment assignment and outcomes (e.g., individual motivations), the results may be biased.
○ Only pre-treatment data (before the program starts) should be used for
calculating the propensity score. Using post-treatment data (which could be
influenced by the program) would bias the results.
● Common support can be assessed by plotting the distribution of propensity scores for both the treatment (enrolled) and control (non-enrolled) groups.
● If the propensity score distributions do not overlap well, meaning treated units with high
propensity scores cannot be matched to control units, there is a lack of common
support.
● In this case, the treatment effect is only estimated for those units where both the treated
and control groups have similar propensity scores.
Conclusion:
Propensity Score Matching (PSM) offers a way to estimate treatment effects when random
assignment isn't possible. It reduces bias by matching treated and control units with similar
propensity scores. However, it relies on the assumption that all relevant characteristics are
observed, and there must be common support between the treatment and control groups to
ensure valid comparisons.
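Below is a minimal sketch of the PSM steps described above on a tiny hypothetical dataset, using scikit-learn's logistic regression for the propensity model and simple nearest-neighbor matching on the score. A real application would check common support and covariate balance, and would use far more observations.

```python
# Hedged sketch of propensity score matching on hypothetical data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: age, years of schooling (hypothetical pre-treatment characteristics).
X = np.array([[25, 12], [30, 10], [22, 14], [40, 8],
              [35, 11], [28, 12], [45, 9], [33, 13]])
treated = np.array([1, 1, 1, 1, 0, 0, 0, 0])                   # 1 = enrolled, 0 = not enrolled
outcome = np.array([820, 790, 860, 700, 750, 780, 650, 810])   # e.g., monthly income

# Step 1: estimate propensity scores from pre-treatment characteristics only.
pscore = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 2: match each treated unit to the untreated unit with the closest score.
controls = np.where(treated == 0)[0]
effects = []
for i in np.where(treated == 1)[0]:
    j = controls[np.argmin(np.abs(pscore[controls] - pscore[i]))]
    effects.append(outcome[i] - outcome[j])

# Step 3: average the matched differences (effect on the treated among matched units).
print(f"Estimated effect: {np.mean(effects):.1f}")
```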
The matched difference-in-differences (DiD) method combines the strengths of matching and
difference-in-differences to provide a more reliable estimate of program effects. This is
particularly useful when there are baseline data on outcomes and concerns about unobserved
characteristics that could bias results.
Steps in Matched Difference-in-Differences:
1. Perform Matching: Match treatment and control units based on observed
characteristics (e.g., demographics, socio-economic factors).
2. Calculate First Difference (Treatment Group): For each treated unit, compute the
change in the outcome between the "before" and "after" periods (i.e., the difference in
outcomes for each individual before and after treatment).
3. Calculate Second Difference (Control Group): For each matched control unit,
compute the same change in the outcome between the before and after periods.
4. Difference-in-Differences: Subtract the second difference from the first difference. This
difference accounts for time-related changes that might affect both the treatment and
control groups similarly.
5. Average the Double Differences: Finally, calculate the average of these double differences across matched pairs to estimate the program's impact (a short code sketch of these steps appears below).
● Reduces Bias: By combining matching and DiD, this method reduces the bias that may arise from time-invariant unobserved factors that could affect both program participation and outcomes.
● Control for Time Effects: DiD accounts for time trends that might affect both the
treatment and control groups similarly, ensuring that the estimated effect reflects the
program’s impact, not just general trends.
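As noted in step 5, the calculation itself is simple. Below is a minimal sketch assuming a pandas DataFrame in which each row is one matched treatment/control pair with before and after outcomes; the column names and numbers are hypothetical.

import pandas as pd

def matched_did_impact(matched_pairs: pd.DataFrame) -> float:
    # First difference: before-to-after change for each treated unit
    first_diff = matched_pairs["treat_after"] - matched_pairs["treat_before"]
    # Second difference: the same change for its matched control unit
    second_diff = matched_pairs["control_after"] - matched_pairs["control_before"]
    # Difference-in-differences for each pair, averaged across all pairs
    return float((first_diff - second_diff).mean())

# Example with hypothetical values for two matched pairs
pairs = pd.DataFrame({
    "treat_before": [10.0, 12.0], "treat_after": [15.0, 18.0],
    "control_before": [11.0, 12.5], "control_after": [13.0, 14.5],
})
print(matched_did_impact(pairs))  # ((5 - 2) + (6 - 2)) / 2 = 3.5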
Real-World Examples:
● Rural Roads and Market Development in Vietnam (Box 8.1): A study used matched
DiD to evaluate the impact of a rural road program on local market development. The
researchers matched treatment communes with control communes and used DiD to
estimate how the road rehabilitation affected market conditions.
● Cement Floors and Child Health in Mexico (Box 8.2): Another study combined
matching with DiD to assess the impact of the Piso Firme program, which replaced dirt
floors with cement floors in households. The method helped estimate improvements in
child health, maternal happiness, and other welfare indicators.
● Instead of comparing the treated unit to a single untreated unit or an unweighted group of untreated units, the synthetic control method constructs an artificial comparison unit by weighting untreated units so that their characteristics match those of the treated unit as closely as possible.
● The synthetic control is a weighted average of the untreated units that closely
resembles the treated unit in terms of pre-treatment characteristics, allowing for a valid
comparison of post-treatment outcomes.
● The method is particularly useful when the treated unit is unique and no other unit in the
sample is a good match.
● Ideal for Unique Cases: It’s particularly useful when dealing with the impact of policies
or interventions that only affect a single unit (e.g., a single country or region).
● Constructs a Better Comparison Group: The synthetic control is not a single unit but a
weighted average of several untreated units, making it a more flexible and reliable
comparison.
○ Matched difference-in-differences: combines matching (to account for differences between treatment and control groups) with DiD (to control for time trends and potential confounding). Ideal when there are baseline data on outcomes and unobserved factors may influence both treatment and outcomes.
○ Synthetic control method: ideal for estimating impacts when only one treated unit is available for study.
Conclusion:
By combining matching with other methods like difference-in-differences and the synthetic
control method, researchers can significantly reduce biases and improve the accuracy of impact
estimates. These combined approaches are valuable tools in evaluating interventions where
randomization is not feasible, and they allow for a more nuanced understanding of how
programs affect different units.
● How it works: Instead of comparing a treated unit to just one untreated unit or a group of untreated units, SCM creates a synthetic comparison by weighting untreated units in such a way that their pre-treatment characteristics (e.g., GDP, unemployment rate) closely match those of the treated unit. This synthetic unit represents what would have happened to the treated unit without the intervention (a small numerical sketch follows below).
● Example (from the text): The economic effects of terrorism in Spain’s Basque Country
were studied using SCM. The Basque Country's economy was significantly impacted by
terrorism, so SCM combined other regions to create a synthetic Basque Country that
could reflect what the Basque economy might have looked like without the conflict. This
way, they could isolate the impact of terrorism.
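As a small numerical sketch of the idea, the code below chooses non-negative weights that sum to one so that a weighted average of untreated units matches the treated unit's pre-treatment characteristics. All data values are hypothetical, and this is a simplified version of the method, which in practice also matches on pre-treatment outcomes.

import numpy as np
from scipy.optimize import minimize

# Hypothetical pre-treatment characteristics: rows = characteristics,
# columns = untreated units
treated = np.array([2.0, 55.0, 0.12])
untreated = np.array([
    [1.5, 2.8, 2.2],
    [48.0, 70.0, 52.0],
    [0.15, 0.08, 0.11],
])

def loss(w):
    # Distance between the treated unit and the weighted average of untreated units
    return np.sum((treated - untreated @ w) ** 2)

n_units = untreated.shape[1]
result = minimize(
    loss,
    x0=np.full(n_units, 1.0 / n_units),          # start from equal weights
    bounds=[(0.0, 1.0)] * n_units,               # weights are non-negative
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # and sum to 1
)
weights = result.x

# The synthetic control's post-treatment outcome is the same weighted average
# applied to the untreated units' post-treatment outcomes (hypothetical values)
post_outcomes_untreated = np.array([3.1, 3.6, 3.3])
print("Weights:", weights.round(3))
print("Synthetic outcome:", round(post_outcomes_untreated @ weights, 3))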
Limitations of the Matching Method
The matching method is a widely used technique for estimating the impact of programs or
interventions, but it has several important limitations that must be considered. Let’s go over the
key challenges highlighted in the passage:
1. Large Data Requirements
Matching methods require extensive data on a large sample of units (e.g., households, regions,
etc.). This is because the method relies on comparing the treated units to non-treated ones
based on observed characteristics, and for a meaningful comparison, a broad set of
characteristics is necessary. In smaller datasets or cases with limited data, matching may not
produce reliable or valid results.
● Problem: Even when large datasets are available, there may not be enough overlap
between the treated and untreated groups in terms of observable characteristics (this is
called lack of common support). In these cases, the matching method can't find
suitable matches, which weakens the reliability of the estimated impact.
2. Matching Only on Observed Characteristics
One of the most significant limitations of matching methods is that they can only match units
based on observed characteristics. It is impossible to incorporate unobserved factors (i.e.,
factors that are not included in the data) into the matching process.
● Problem: If there are differences between the treated and comparison groups in
unobserved characteristics (e.g., motivation, individual preferences, or hidden biases)
that affect both participation in the program and the outcome, then the matching results
will be biased. This could lead to misleading conclusions about the impact of the
intervention.
● Assumption: Matching methods rely on the assumption that there are no unobserved
confounders (unmeasured variables) that influence both treatment assignment and the
outcome. This is a strong assumption and, importantly, it cannot be tested. If this
assumption is violated, the estimated treatment effect may be biased.
3. Less Robust than Other Methods
Matching is often considered less robust than other methods, such as randomized controlled trials (RCTs), instrumental variable (IV) methods, and regression discontinuity designs (RDD). This is because:
● RCTs do not rely on assumptions about unobserved characteristics, as participants are
randomly assigned to treatment or control groups. This randomization eliminates the risk
of bias due to unobserved factors.
4. Ex Post Matching
The limitations of matching are particularly problematic when the matching is done after the
program has already started (referred to as ex post matching). In these cases, the matching is
performed based on characteristics that were observed after the intervention had already been
implemented.
Matching works best when there is baseline data available on the characteristics of individuals
or units before they received the treatment. If such data are available, matching on those
baseline characteristics helps ensure that the treated and untreated groups are similar prior to
the intervention, reducing the risk of bias in estimating the treatment effect.
● Problem: Without baseline data (i.e., data collected before the intervention), matching
becomes more risky because the characteristics you match on might already be
influenced by the program itself. In such cases, matching is unlikely to provide a valid
estimate of the causal effect.
Impact evaluations are best designed before the program starts, as this allows for the collection of baseline data and the possibility of using more rigorous methods (like RCTs). Once the program has already started, and there is no way to influence how it is allocated (for example, when treatment is assigned non-randomly), conducting a valid evaluation becomes more challenging.
Summary of Limitations:
● Data requirements: Matching requires large datasets with extensive baseline data.
● Unobserved factors: It can’t account for unmeasured or hidden factors that may
influence both participation and outcomes, leading to potential bias.
● Ex post matching risks: Matching performed after the treatment has started is risky
and may lead to biased estimates if the characteristics being matched on were affected
by the treatment.
In practice, matching is often used when other more robust methods (like RCTs or IVs) are not
feasible, but it requires careful consideration of its limitations to avoid drawing invalid
conclusions.
Regression Discontinuity Design (RDD)
What is RDD?
Regression Discontinuity Design (RDD) is a method used to measure the causal impact of a
program or intervention when eligibility is determined by a specific threshold on a continuous
variable (e.g., income, test scores, age). It compares individuals just above and just below this
threshold to assess the program’s effect, making it useful when random assignment isn't
possible.
1. Continuous Eligibility Index: The program uses a continuous variable (e.g., test
scores, income) to determine eligibility.
2. Threshold (Cutoff): A specific cutoff separates eligible from ineligible individuals (e.g.,
test score ≥ 90 for a scholarship).
3. Comparison of Groups: Individuals just above and just below the cutoff are very
similar, except for program eligibility. RDD compares these groups to estimate the
program's impact.
Key Conditions for a Valid RDD:
● Smooth Index: The eligibility index must be continuous (e.g., income, test score).
● Unique Cutoff: The cutoff should be specific to the program being evaluated.
● Non-Manipulability of the Score: The eligibility score should not be easily manipulated.
Impact Measurement:
● Local Average Treatment Effect (LATE): RDD estimates the impact near the cutoff.
Results may not apply to individuals far from the cutoff.
● No Randomized Control Group Needed: Individuals just on the ineligible side of the cutoff serve as a valid counterfactual for those just on the eligible side.
Advantages of RDD:
● Causal Inference: RDD is one of the best quasi-experimental methods for estimating
causal effects.
● No Randomization Needed: Useful when randomized controlled trials are not feasible.
In the case of fuzzy RDD, we use the instrumental variable approach to correct for
noncompliance. The eligibility index serves as the instrumental variable, just as randomized
assignment does in randomized controlled trials. The key drawback of fuzzy RDD is that the
impact estimate becomes localized—valid only for the subgroup of the population near the cutoff
who participate based on eligibility.
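To make the mechanics concrete, below is a minimal sketch of a sharp RDD estimate on simulated data: units with a score at or above a cutoff of 90 receive a hypothetical scholarship, and the impact is read off the jump in outcomes at the cutoff using a local linear regression. The cutoff, bandwidth, and true effect are all made up for illustration, and in practice the bandwidth and functional form should be checked for robustness (see the limitations below).

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
score = rng.uniform(50, 130, n)                      # continuous eligibility index
treated = (score >= 90).astype(float)                # sharp cutoff at 90
outcome = 20 + 0.3 * score + 5 * treated + rng.normal(0, 3, n)  # simulated effect = 5

bandwidth = 15
keep = np.abs(score - 90) <= bandwidth               # local sample around the cutoff
centered = score[keep] - 90

# Local linear regression with separate slopes on each side of the cutoff;
# the coefficient on `treated` is the estimated jump (impact) at the cutoff
X = sm.add_constant(np.column_stack([
    treated[keep],
    centered,
    treated[keep] * centered,
]))
model = sm.OLS(outcome[keep], X).fit()
print("Estimated impact at the cutoff:", round(model.params[1], 2))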
RDD Limitations
● Statistical Power: Small sample sizes reduce power, and bandwidth choice is critical.
● Functional Form: Incorrect functional forms distort results; robustness checks are
needed.
The Difference-in-Differences (DD) method is used to estimate the impact of a program when
random assignment isn’t possible. It compares the changes in outcomes over time between a
treatment group (those receiving the program) and a comparison group (those not receiving the
program).
Key Concepts:
● Comparison Group: Individuals not receiving the program but otherwise facing similar
conditions.
● Before-and-After Comparison: Comparing outcomes before and after the program for
both groups.
● Counterfactual Estimate: The comparison group estimates what would have happened
to the treatment group without the program.
Steps in DD Method:
1. Measure Outcomes Before and After: For both groups, measure the outcome of interest before and after the intervention.
2. Compute Each Group's Change Over Time: Calculate the before-to-after change in the outcome for the treatment group and for the comparison group.
3. Subtract the Comparison Group's Change: The difference between the two changes is the DD estimate of the program's impact (a numerical sketch follows below).
Example:
A road repair program aims to improve employment rates in treated areas; the DD estimate compares the change in employment in areas with repaired roads to the change in comparable areas without the program.
● Parallel Trends Assumption: In the absence of the program, the treatment and
comparison groups would have followed the same trend over time.
Advantages:
● Controls for unobserved factors constant over time within each group.
Limitations:
● Relies on the parallel trends assumption, which may not always hold.
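As referenced in the steps above, the basic DD calculation is just two subtractions. The sketch below uses hypothetical employment rates for the road repair example.

# Hypothetical average employment rates before and after the program
treatment_before, treatment_after = 0.60, 0.74
comparison_before, comparison_after = 0.58, 0.65

first_difference = treatment_after - treatment_before        # change in treated areas
second_difference = comparison_after - comparison_before     # change in comparison areas

# Subtracting the comparison group's change removes common time trends,
# leaving the estimated impact of the road repair program
dd_impact = first_difference - second_difference
print(f"DD estimate of the program's impact: {dd_impact:.2f}")  # 0.14 - 0.07 = 0.07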
How DD is Helpful
DD differences out fixed differences between the groups, but it assumes that, in the absence of treatment, both groups would have followed similar trends over time. If the groups would have followed different trends, DD estimates may be biased.
Example: If a new seaport opens near one of the areas during a road repair program, that area would follow a different employment trend, skewing the DD estimate.
Testing the Equal Trends Assumption
The equal trends assumption cannot be verified directly, because we never observe what would have happened to the treatment group without the program; it is therefore important to consider the ways it can be violated:
1. Events That Affect One Group More Than the Other
○ Issue: DiD assumes the only difference between treatment and comparison groups is the intervention itself. If other factors affect one group more than the other at the same time, results may be biased.
2. Time-Varying Factors
○ Issue: DiD assumes that changes over time only come from the intervention. If
other factors (e.g., policy changes, natural disasters) affect the groups differently,
the equal trends assumption can be violated.
3. Failure to Control for Confounding Variables
○ Issue: DiD may still be biased if important confounders, such as socioeconomic factors that change over time, are not included in the analysis (a regression sketch that includes such controls follows below).
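As referenced in point 3, one common way to bring measured confounders into a DiD analysis is to estimate it as a regression with a treatment-by-period interaction plus control variables. The sketch below uses simulated data and illustrative variable names; including observed controls does not fix bias from unmeasured confounders, it only adjusts for the ones you can measure.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),    # 1 = treatment group
    "post": rng.integers(0, 2, n),       # 1 = after the program
    "income": rng.normal(50, 10, n),     # an observed, time-varying confounder
})
# Simulated outcome with a true program effect of 4
df["outcome"] = (2 * df["treated"] + 3 * df["post"]
                 + 4 * df["treated"] * df["post"]
                 + 0.1 * df["income"] + rng.normal(0, 1, n))

# The coefficient on the treated:post interaction is the DiD estimate
model = smf.ols("outcome ~ treated * post + income", data=df).fit()
print("DiD estimate:", round(model.params["treated:post"], 2))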