Early Childhood Program Evaluations: A Decision-Maker's Guide
National Forum on Early Childhood Program Evaluation
A collaborative project involving Harvard University, Columbia University, Georgetown
University, Johns Hopkins University, Northwestern University, University of Nebraska,
and University of Wisconsin
Jack P. Shonkoff, M.D., Co-Chair
Julius B. Richmond FAMRI Professor of Child Health and Development; Director, Center on the Developing Child, Harvard University

Greg J. Duncan, Ph.D., Co-Chair
Edwina S. Tarry Professor of Human Development and Social Policy; Faculty Fellow, Institute for Policy Research, Northwestern University

Jeanne Brooks-Gunn, Ph.D.
Virginia and Leonard Marx Professor of Child Development and Education; Co-director, National Center for Children and Families, Columbia University

Bernard Guyer, M.D., M.P.H.
Zanvyl Kreiger Professor of Children’s Health, Johns Hopkins Bloomberg School of Public Health

Katherine Magnuson, Ph.D.
Assistant Professor, School of Social Work, University of Wisconsin-Madison

Deborah Phillips, Ph.D.
Professor of Psychology and Associated Faculty, Public Policy Institute; Co-Director, Research Center on Children in the U.S., Georgetown University

Helen Raikes, Ph.D.
Professor, Family and Consumer Sciences, University of Nebraska-Lincoln

Hirokazu Yoshikawa, Ph.D.
Professor of Education, Harvard Graduate School of Education
… age 5 years;
■■ conducting rigorous analyses of the findings of well-designed studies of programs designed to improve outcomes for young children; and
■■ communications to assure both broad and targeted dissemination of high-quality information.
PARTNERS
The FrameWorks Institute
The National Conference of State Legislatures
The National Governors Association Center for Best Practices
SPONSORS
The Buffett Early Childhood Fund
The McCormick Tribune Foundation
An Anonymous Donor
Please note: The content of this paper is the sole responsibility of the authors and does not necessarily represent the opinions of
the funders and partners.
Suggested citation: National Forum on Early Childhood Program Evaluation (2007). Early Childhood Program Evaluations: A Decision-
Maker’s Guide. http://www.developingchild.harvard.edu
© December 2007, National Forum on Early Childhood Program Evaluation, Center on the Developing Child at Harvard University
Early Childhood Program Evaluations
Despite increasing demands for evidence-based early childhood services, the evaluations of interventions such as Head Start or home-visiting programs frequently contribute more heat than light to the policy-making process. This dilemma is illustrated by the intense debate that often ensues among dueling experts who reach different conclusions from the same data about whether a program is effective or whether its impacts are large enough to warrant a significant investment of public and/or private funds.
Because the interpretation of program evaluation research is so often highly politicized, it is es-
sential that policymakers and civic leaders have the independent knowledge needed to be able to
evaluate the quality and relevance of the evidence provided in reports. This guide helps prepare
decision-makers to be better consumers of evaluation information. It is organized around five key
questions that address both the substance and the practical utility of rigorous evaluation research.
The principles we discuss are relevant and applicable to the evaluation of programs for individuals
of any age, but in our examples and discussion we focus specifically on early childhood.
1. Is the evaluation design strong enough to produce trustworthy evidence?
Evaluations that randomly assign children to either receive program services or to a no-treatment comparison group provide the most compelling evidence of a program’s likely effects. Other approaches can also yield strong evidence, provided they are done well.

2. What program services were actually received by participating children and families and comparison groups?
Program designers often envision a model set of services, but children or families who are enrolled in “real” programs rarely have perfect attendance records, and the quality of the services received rarely lives up to their designers’ hopes. Thus, knowing the reality of program delivery “on the ground” is vital for interpreting evaluation results. At the same time, sometimes a comparison group is able to access services in their community that are similar to those provided as part of the intervention. If so, then differences between the services provided to the program and contrast groups may be smaller than would exist in a community where those services are not available.

3. How much impact did the program have?
The differences between outcomes for children and/or families who received services and those of the comparison group are often expressed as “effect sizes.” This section will explain what these mean and how to think about them.

4. Do the program’s benefits exceed its costs?
A key “bottom line” issue for any intervention is whether the benefits it generates exceed the full costs of running the program. This document will explain how costs and benefits are determined and what they mean for a program that is being considered for implementation.

5. How similar are the programs, children, and families in the study to those in your constituency or community?
Program evaluations have been conducted in virtually every state and with children of diverse ethnicities and socioeconomic backgrounds. Knowing how the characteristics and experiences of comparison-group children compare to the characteristics and experiences of children in your own constituency or community is important for determining the relevance of any evaluation findings.
For guidelines and explanations that can help leaders use these key questions to determine the relevance
of program evaluations for policy decisions, please continue.
Checklist #2, cont.

■■ Examine multiple characteristics of the program that was delivered (e.g., intensity, duration, skills and credentials of the service providers, and participation rates). If important services were not provided as intended, the program is not likely to be as effective as hoped. Remember that the evaluation assesses the program as delivered, not as designed.
■■ Look carefully for lessons about program improvement. Do the reports include a section on implications for other programs? Is there information about implementation or program design that can be translated into practical guidelines for further program refinement?
■■ Find out as much as you can about the experiences of the evaluation’s control group. Often the “does a program work” question should be rephrased as “does the program work in comparison to the experience of those who didn’t receive the same services?”

…realistic to expect that a program implemented in your own community would be lucky enough to avoid all of the problems encountered by the poor-implementation sites. Thus, impacts that are averaged across all locations are probably a better guide to what to expect than impacts attained by only the best sites.

In some circumstances, a program could be implemented exactly as intended, but the participation rates could still be low. This may be a sign that the program is not attractive or accessible to potential participants. An example would be a parent outreach service connected to an early education program that offers home visits in the afternoon, when most working parents cannot participate because of difficulty in adjusting their work schedules. Another example is a program whose services are not a good fit with the cultural norms of the particular population being targeted (e.g., home-based services for a cultural group that may have strong values concerning privacy of the home). In such circumstances, the failure to “take up” the home visitation piece does not necessarily mean that this program component could not be beneficial to families. It may simply mean that the program delivery needs to be designed to fit with the daily routines, values, and preferences of the specific group being served. Issues related to language for families who do not speak English are also very important in this context.

Participation (sometimes called program “take up”) refers to the services that children and families actually receive. The measurement of participation has two dimensions—how many of the parents or children participated and, for those who were involved, how much service they received. The first dimension is measured by take-up fractions (i.e., the number of families who were engaged divided by the total number of possible participants). Every evaluation should include information about how many families never enrolled or dropped out of the program. The second dimension includes measures of program “dosage,” such as numbers of visits, hours of service received, and weeks, months, or years of program participation. In addition to including information about these two dimensions of participation, studies are even more useful if they include data from systematically conducted interviews or focus groups that describe what parents and children actually experienced.

Do implementation or take-up problems point to more promising practices? No intervention is perfect. Changing behavior and shifting the course of children’s development is challenging, and even promising programs can be strengthened. Increasingly, contemporary intervention programs are turning to “continuous improvement” or “action research” frameworks, guided by a knowledge base that assists service providers and policymakers in improving program effectiveness. To this end, a supplementary set of inquiries beyond the simple “did it work?” question can be very useful. This approach is particularly important for evaluations of programs that must be provided, such as public schools. Don’t hesitate to contact evaluators directly and ask, “What do the data tell us about how the program can be improved?”

What type of services did the comparison children receive? Another important question about program receipt is the extent to which children in the comparison group were able to access similar services. Good evaluations detail exactly what services or programs were received by children and families in the comparison group. In some studies, children in the comparison group could not have participated in a similar program because it was not available to them. In other studies, however, children and families in the comparison group were able to seek out and access similar programs. Over time and across communities, there is considerable variation in the extent to which alternative programs and services are available to comparison group children. Sometimes the contrast between the program and comparison group service experiences is quite small, and thus the program may appear to be less effective.

For example, a couple of decades ago, most children who were not assigned to participate in an early education program simply stayed home and were cared for by their mothers. The world has changed dramatically since that time, and most young children today—even infants—do not spend all of their time at home. In fact, child care and family support services are pervasive throughout the nation, although there is striking variability in their quality, accessibility, and affordability. These changes have important implications for drawing lessons from program evaluations that were conducted in the past or for guidance …
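The two dimensions of participation described in this checklist—take-up and dosage—boil down to simple arithmetic. A minimal sketch in Python (the family counts and visit logs are hypothetical, invented purely for illustration):

```python
# Hypothetical service logs: visits received by each family that enrolled.
# Two additional eligible families never enrolled at all, so they have no log.
visits_per_family = {"family_a": 12, "family_b": 30, "family_c": 0, "family_d": 22}
eligible_families = 6  # total number of possible participants

# Dimension 1: take-up fraction = engaged families / possible participants.
engaged = [f for f, visits in visits_per_family.items() if visits > 0]
take_up_fraction = len(engaged) / eligible_families

# Dimension 2: "dosage" among those who actually participated.
average_dosage = sum(visits_per_family[f] for f in engaged) / len(engaged)

print(f"take-up: {take_up_fraction:.0%}")              # take-up: 50%
print(f"average dosage: {average_dosage:.1f} visits")  # average dosage: 21.3 visits
```

Note that family_c enrolled but received no visits; whether such drop-outs count as “engaged” is exactly the kind of reporting detail a good evaluation should spell out.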
…subtracting the outcomes of the control group from the outcomes of the treatment group, we get an effect (e.g., raising SAT scores by 20 points). By dividing that effect by the study’s “standard deviation” (which indicates how widely dispersed the results are from the mean), we get an effect size—a fraction that indicates how large the effects are in comparison to the scale of results.

The SAT test, for example, is scaled with a standard deviation of 100, so a program that boosted SAT scores by 10 points would have an effect size of 0.1, or one-tenth of a standard deviation—which is considered very small. IQ tests are typically scaled with a standard deviation of 15, so a program that boosted IQ scores by 10 points would have an effect size of 0.66, or two-thirds of a standard deviation—which is much larger. Generally speaking, the larger the effect size, the better. Conventional guidelines consider effect sizes of at least 0.8 as “large”; 0.3 to 0.8 as “moderate”; and less than 0.3 as “small.” Nevertheless, since inexpensive programs can hardly be expected to perform miracles, we will …

…mean that if we could somehow conduct 100 evaluation trials, we would expect to confirm those impacts in 95 of them. That is a good bet that the impacts are real.

As the number of children or families in the treatment and control groups increases, smaller effect sizes become more statistically significant, simply because a larger sample means a lower probability of a chance finding. Typically, evaluations involving fewer than 100 children require very large effect sizes to be judged statistically significant, while evaluations based on several thousand children are much more likely to calculate small effects as statistically significant. All other things being equal, bigger studies are better. Even in large studies, however, small effect sizes imply that the program is not likely to change outcomes very much, so policymakers should consider carefully the cost required to achieve small benefits.

Pattern of results. Good program evaluations present or summarize results for all of the outcomes they measure, not just the ones that produced statistically significant impacts. It is unrealistic to expect that even highly effective programs will produce statistically significant impacts on all of the measured outcomes. And a quirk of the standard practice of applying tests of statistical significance is that even if a program were completely ineffective, for every 100 outcomes tested, you would still expect five of them to emerge as statistically significant simply by chance! “Cherry picking” small numbers of statistically significant results can be very deceptive. Generally speaking, it is the overall pattern of results that matters the most.

Relevance. In reading evaluation reports, it is always useful to ask how relevant the measured program outcomes are to the desired outcomes for your constituents or community. Of the outcomes measured, which do you care most about? Was the program more effective for those outcomes than for others? If you care about boosting children’s school achievement, are most of the achievement impacts in the evaluation statistically significant? If one purpose of the intervention is to save money for school districts, did the program produce statistically significant impacts on school-related measures that have financial effects, such as grade failure and enrollment in special education? Use these kinds of questions to guide your assessment of the program’s relevance to your goals for the health and development of children.

“Intent to treat” impacts. In evaluations of interventions in which substantial numbers of children or families fail to take up any of the offered services, there is an important technical detail that must be addressed. Should program effects be considered for only those who receive the services or for all families who are offered the program, regardless of whether they participate? This question is illustrated in programs designed to promote residential mobility among public housing residents, in which between one-quarter and one-half of the families that are offered financial assistance and mobility counseling fail to take advantage of the offer. Thus, an evaluation of child and family outcomes influenced by the mobility program faces a choice—should outcome differences between the program and comparison group be calculated across all families offered the chance to move, or only for those families that actually moved in conjunction with the program?

Effects assessed across all children or families offered program services, regardless of whether they actually used them, are called “intent to treat” (or ITT) impacts. They answer the vital policy question about the effects of the program on all families that are offered services. Suppose, however, that services are highly effective for those who participate, but only a small fraction of the targeted children or families actually use them. The intent to treat impact estimates will show that the overall impact on targeted families is small and will point to implementation or program take-up as a key problem in program design.

“Treatment on the treated” impacts. Under certain circumstances, it is also possible to isolate program impacts on the subset of families that actually use the services and compare them to families that did not use similar services. These are sometimes called “treatment on the treated” (or TOT) impacts, and amount to scaling up intent-to-treat estimates in proportion to program take-up. Treatment-on-the-treated estimates address important policy questions about program impacts on the children or families who actually use the services. If program take-up is not a concern and you want to concentrate on how a program affects children or families who participate in it, then TOT estimates are most relevant. Finally, when comparing across studies it is important to compare like with like—ITT with ITT impacts or TOT with TOT impacts.

Subgroup effects. Some programs are more effective for some subsets of children or families than for others. For example, an intensive program designed to help low birth-weight babies was found to be considerably more effective for children whose birth weights were close to normal than for children with very low birth weights, some of whom exhibited serious neurological problems. It is common for evaluations to report effects on various subgroups of participants. These findings may be useful for forecasting potential program impacts on the children, particularly if the measured impacts are largest among subgroups with characteristics similar to likely participants in your own community.

■■ … provides a valuable judgment of how likely it is that an estimated impact is real and truly different from zero.
■■ Distrust evaluations that report only measures with statistically significant impacts. Every rigorous evaluation is likely to generate a mix of significant and non-significant findings. The overall pattern of effects is most important.
■■ It is important to understand whether the offer of services (ITT) or the receipt of services (TOT) is being evaluated, and whether there are some groups of participants that may benefit from the program more than others.
…be important, as reductions in obesity and smoking rates can be linked to savings in health expenditures.

Return on investment. Economists tell us that the most profitable investments are not necessarily generated by programs that produce the biggest …

…value, independent of their financial return. For example, if the policy goal is reducing crime or high school drop-out rates, policymakers and the public may simply be interested in achieving the goal, regardless of what any cost-benefit analysis might show. In other cases, investments in children who are highly vulnerable (such as those who have been abused or seriously neglected) may be justified solely because of their humanitarian significance, independent of the long-term financial gains that may be realized from better health and developmental outcomes. In such cases, cost-effectiveness studies that tell us how to deliver services in the most efficient manner will be more useful than cost-benefit studies that assess their economic payback.

■■ … not the only measure of a program’s worth. Some public investments are made as a matter of social responsibility. In such cases, costs are viewed in terms of efficiency.
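The contrast drawn in this section can be made concrete: a cost-benefit study asks whether monetized benefits exceed full costs, while a cost-effectiveness study compares delivery options by cost per unit of outcome, without converting the outcome to dollars. A minimal sketch with invented, purely illustrative numbers:

```python
# Cost-benefit view (hypothetical figures, per child served).
program_cost = 8_000.0         # full cost of running the program, in dollars
monetized_benefits = 12_000.0  # e.g., projected savings from better outcomes

net_benefit = monetized_benefits - program_cost  # positive => benefits exceed costs
print(f"net benefit per child: ${net_benefit:,.0f}")  # net benefit per child: $4,000

# Cost-effectiveness view: the outcome (here, an effect size on a
# school-readiness measure) is never converted to dollars; instead we
# compare delivery models on cost per unit of outcome achieved.
delivery_models = {"center-based": (8_000.0, 0.40), "home-based": (5_000.0, 0.30)}
for model, (cost, es) in delivery_models.items():
    print(f"{model}: ${cost / es:,.0f} per unit of effect size")
```

The second view answers “how do we deliver this service most efficiently?” even when, as with protective services for abused or neglected children, no one is asking whether the program pays for itself.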
Checklist #5

■■ Look for specific information about the program. Can you form a clear picture of the services offered and how they differ from what is currently available in your community? Does this match the way in which your own community would provide these services?
■■ Consider the constituency or population for whom you might provide a particular program. How well does the study sample approximate this population? If it does not overlap substantially with your own constituency, examine the study carefully to determine which aspects of the program, if any, might need to be adapted to fit your community’s needs.

Let’s say you are a businessman in Cleveland, Ohio, wanting to know whether a successful program that was evaluated in Hawaii in 1990 would work as well for your community today. Your first question should be: What kinds of children or families would receive services if the program were implemented in Cleveland? Would it be targeted toward children from low-income families? Children of immigrant parents from particular groups? Children with disabilities? The more precisely you can characterize the intended recipients of the services and how the services differ from what is currently available in your community, the easier it will be to determine the relevance of the findings of a given evaluation study. The more closely the use of services by children in the study’s comparison group matches that of children in your own community, the more relevant the study findings will be.

Next, compare the characteristics of the Cleveland target population with those of the children or families in the Hawaiian program evaluation. On how many dimensions (e.g., poverty status, inner-city location, languages used at home and in other settings, parent education levels, cultural beliefs, and parenting practices) are they similar? On what dimensions are they different? If the study was conducted years ago, the circumstances for children with identical characteristics today may differ in important ways. Both the nature and the extent of the diversity of your target group of families in Cleveland are important to consider.

Finally, carefully examine the description of the program. Is it tailored to the particular group in that study in a specific way (e.g., in its language, materials, cultural values, staffing, or approach)? Is it difficult to imagine how the program might be “refitted” for your community? Does it require specially trained and qualified staff who may be too scarce or costly in your community? Some programs might be easier to adapt than others. For example, an intervention that provides a high-quality preschool experience might be easier to reproduce than a child literacy intervention that is based on folk tales among a particular cultural group.

There is much to be learned from rigorous evaluations of early childhood interventions. Applying those lessons to one’s own community, however, requires a careful eye toward understanding which aspects of the interventions are most likely to be replicable given your current situation, target population, and goals.