Arlene G. Fink - Evaluation Fundamentals


This book is dedicated to the ones I love: John C. Beck and, of course, Ingvard.

Copyright © 2015 by SAGE Publications, Inc.

All rights reserved. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including
photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.

Printed in the United States of America

Library of Congress Cataloging-in-Publication Data

Fink, Arlene.
Evaluation fundamentals : insights into program effectiveness, quality, and value / Arlene Fink, University of California at Los Angeles, the
Langley Research Institute. — Third edition.

pages cm

Includes bibliographical references and index.

ISBN 978-1-4522-8200-8 (pbk. : alk. paper) —


ISBN 978-1-4833-1283-5 (web pdf)

1. Medical care—Evaluation. 2. Public health—Evaluation.


3. Medical care—Quality control. 4. Health planning.
5. Outcome assessment (Medical care) I. Title.

RA399.A1F563 2015
362.1—dc23 2013038069

This book is printed on acid-free paper.

14 15 16 17 18 10 9 8 7 6 5 4 3 2 1

FOR INFORMATION:

SAGE Publications, Inc.


2455 Teller Road
Thousand Oaks, California 91320
E-mail: [email protected]

SAGE Publications Ltd.


1 Oliver’s Yard
55 City Road
London EC1Y 1SP
United Kingdom

SAGE Publications India Pvt. Ltd.


B 1/I 1 Mohan Cooperative Industrial Area
Mathura Road, New Delhi 110 044
India

SAGE Publications Asia-Pacific Pte. Ltd.


3 Church Street
#10-04 Samsung Hub
Singapore 049483

Acquisitions Editor: Vicki Knight


Assistant Editor: Katie Guarino
Editorial Assistant: Jessica Miller

Production Editor: Stephanie Palermini


Copy Editor: Janet Ford

Typesetter: C&M Digitals (P) Ltd.


Proofreader: Susan Schon

Indexer: Sylvia Coates


Cover Designer: Anupama Krishnan

Marketing Manager: Nicole Elliott

Brief Contents

Preface

About the Author

1. Program Evaluation: A Prelude

2. Evaluation Questions and Evidence of Merit

3. Designing Program Evaluations

4. Sampling

5. Collecting Information: The Right Data Sources

6. Evaluation Measures

7. Managing Evaluation Data

8. Analyzing Evaluation Data

9. Evaluation Reports

Answers to Exercises

Index

Detailed Contents

Preface
What Is New in the Third Edition

About the Author

1. Program Evaluation: A Prelude

A Reader’s Guide to Chapter 1


What Is Program Evaluation?

The Program or Intervention


Program Objectives and Outcomes

Program Characteristics
Program Impact
Program Costs

Program Quality
Program Value
Evaluation Methods
Evaluation Questions and Hypotheses
Evidence of Merit: Effectiveness, Quality, Value

Designing the Evaluation


Selecting Participants for the Evaluation

Collecting Data on Program Merit


Managing Data So That It Can Be Analyzed
Analyzing Data to Decide on Program Merit

Reporting on Effectiveness, Quality, and Value


Who Uses Evaluations?
Baseline Data, Formative Evaluation, and Process Evaluation

Baseline Data

Interim Data and Formative Evaluation

Process or Implementation Evaluation


Summative Evaluation
Qualitative Evaluation
Mixed-Methods Evaluation
Participatory and Community-Based Evaluation
Evaluation Frameworks and Models
The PRECEDE-PROCEED Framework

RE-AIM

The Centers for Disease Control’s Framework for Planning and Implementing Practical Program Evaluation

Logic Models
Right-to-Left Logic Model

Left-to-Right Logic Model

Evaluation Reports Online
Summary and Transition to the Next Chapter on Evaluation Questions and Evidence of Program Merit
Exercises
References and Suggested Readings
Suggested Websites

2. Evaluation Questions and Evidence of Merit

A Reader’s Guide to Chapter 2


Evaluation Questions and Hypotheses
Evaluation Questions: Program Goals and Objectives
Evaluation Questions: Participants
Evaluation Questions: Program Characteristics
Evaluation Questions: Financial Costs
Evaluation Questions: The Program’s Environment
Evidence of Merit

Sources of Evidence
Evidence by Comparison
Evidence From Expert Consultation: Professionals, Consumers, Community Groups

Evidence From Existing Data and Large Databases (“Big Data”)


Evidence From the Research Literature
When to Decide on Evidence
Program Evaluation and Economics
The QEV Report: Questions, Evidence, Variables
Summary and Transition to the Next Chapter on Designing Program Evaluations
Exercises
References and Suggested Readings

3. Designing Program Evaluations

A Reader’s Guide to Chapter 3


Evaluation Design: Creating the Structure
Experimental Designs

The Randomized Controlled Trial or RCT

Parallel Controls

Wait-List Controls

Factorial Designs
Randomizing and Blinding

Random Assignment

Random Clusters

Improving on Chance: Stratifying and Blocking

Blinding
Nonrandomized Controlled Trials

Parallel Controls

The Problem of Incomparable Participants: Statistical Methods Like ANCOVA to the Rescue
Observational Designs
Cross-Sectional Designs
Cohort Designs

Case Control Designs
A Note on Pretest-Posttest Only or Self-Controlled Designs
Comparative Effectiveness Research and Evaluation
Commonly Used Evaluation Designs
Internal and External Validity
Internal Validity Is Threatened
External Validity Is Threatened
Summary and Transition to the Next Chapter on Sampling
Exercises
References and Suggested Readings

4. Sampling

A Reader’s Guide to Chapter 4


What Is a Sample?
Why Sample?
Inclusion and Exclusion Criteria or Eligibility
Sampling Methods
Simple Random Sampling

Random Selection and Random Assignment


Systematic Sampling

Stratified Sampling
Cluster Sampling
Nonprobability or Convenience Sampling
The Sampling Unit
Sample Size
Power Analysis and Alpha and Beta Errors
The Sampling Report
Summary and Transition to the Next Chapter on Collecting Information
Exercises
References and Suggested Readings
Suggested Websites

5. Collecting Information: The Right Data Sources

A Reader’s Guide to Chapter 5


Information Sources: What’s the Question?
Choosing Appropriate Data Sources
Data Sources or Measures in Program Evaluation and Their Advantages and Disadvantages

Self-Administered Surveys

Achievement Tests
Record Reviews
Observations

Interviews
Computer-Assisted Interviews

Physical Examinations

Large Databases
Vignettes
The Literature

Guidelines for Reviewing the Literature
Assemble the Literature
Identify Inclusion and Exclusion Criteria
Select the Relevant Literature
Identify the Best Available Literature
Abstract the Information
Consider the Non-Peer-Reviewed Literature
Summary and Transition to the Next Chapter on Evaluation Measures
Exercises
References and Suggested Readings

6. Evaluation Measures

A Reader’s Guide to Chapter 6


Reliability and Validity
Reliability
Validity
A Note on Language: Data Collection Terms
Checklist for Creating a New Measure
Checklist for Selecting an Already Existing Measure
The Measurement Chart: Logical Connections
Summary and Transition to the Next Chapter on Managing Evaluation Data
Exercises
References and Suggested Readings

7. Managing Evaluation Data

A Reader’s Guide to Chapter 7


Managing Evaluation Data: The Road to Data Analysis
Drafting an Analysis Plan
Creating a Codebook or Data Dictionary
Establishing Reliable Coding

Measuring Agreement: The Kappa


Entering the Data
Searching for Missing Data
What to Do When Participants Omit Information
Cleaning the Data
Outliers

When Data Are in Need of Recoding


Creating the Final Database for Analysis
Storing and Archiving the Database
Summary and Transition to the Next Chapter on Data Analysis
Exercises
References and Suggested Readings

8. Analyzing Evaluation Data

A Reader’s Guide to Chapter 8


A Suitable Analysis: Starting With the Evaluation Questions
Measurement Scales and Their Data

Categorical Data
Ordinal Data
Numerical Data
Selecting a Method of Analysis
Hypothesis Testing and p Values: Statistical Significance
Guidelines for Hypothesis Testing, Statistical Significance, and p Values
Clinical or Practical Significance: Using Confidence Intervals
Establishing Clinical or Practical Significance
Risks and Odds
Odds Ratios and Relative Risk
Qualitative Evaluation Data: Content Analysis

Assembling the Data


Learning the Contents of the Data
Creating a Codebook or Data Dictionary

Entering and Cleaning the Data


Doing the Analysis
Meta-Analysis
Summary and Transition to the Next Chapter on Evaluation Reports
Exercises
References and Suggested Readings

9. Evaluation Reports

A Reader’s Guide to Chapter 9


The Written Evaluation Report
Composition of the Report

Introduction

Methods

Results

Conclusions or Discussion

Recommendations
The Abstract

The Executive Summary


Reviewing the Report for Quality and Ethics
Oral Presentations
Recommendations for Slide Presentations
Posters
Ethical Evaluations
Evaluations That Need Institutional Review or Ethics Board Approval
Evaluations That Are Exempt From IRB Approval
What the IRB Will Review

Informed Consent
The Internet and Ethical Evaluations

Communication Between the Evaluator and the Participant


Communication Between the Participant and the Website
Communication Between the Website and the Evaluator

Data Protection

Sample Questionnaire: Maintaining Ethically Sound Online Data Collection
Example: Consent Form for an Online Survey
Research Misconduct
Exercises
Suggested Websites

Answers to Exercises

Index

Preface

Over the past decade, program evaluation has made important contributions to practice, policy, administration, education, and research in the health, nursing, medicine, psychology, criminal justice, social work, and education fields. Program evaluation is an unbiased investigation of a program’s merits,
including its effectiveness, quality, and value. Programs are designed to improve health, education, and social
and psychological well-being. An effective program provides substantial benefits to individuals, communities,
and societies, and these benefits are greater than their costs. A high-quality program meets its users’ needs
and is based on sound theory and the best available research evidence. Value is measured by how efficiently a
high-quality program achieves important individual, community, and societal outcomes.
This book explains the evaluator’s goals and methods in order to provide readers with the skills they need
to conduct and participate in program evaluations. Its nine chapters contain the following uniform elements:

• Brief reader’s guides to the topics covered in the chapter


• Examples illustrating all major concepts reviewed
• Separate guidelines for conducting key evaluation activities
• Examples of forms to use in completing evaluation tasks
• Checklists highlighting key concepts
• Summaries of the main ideas and topics covered
• Exercises and answer key
• Lists of suggested readings and online resources

The chapters are organized according to the main tasks involved in conducting an evaluation: selecting and
justifying evaluation questions or hypotheses and evidence of program merit, designing the evaluation,
sampling participants, selecting information sources, ensuring reliable and valid measurement, managing data,
analyzing data, and reporting the results in written and oral form. Among the special topics covered are the
following:

• What program evaluation is and how it differs from other research, development, and practice
• The origins and purposes of evaluation questions and hypotheses
• The meaning of program effectiveness, quality, and value
• Reliability and validity of research design and research measurement
• True experiments and quasi-experiments
• Sample size
• Effect size
• Qualitative evaluation methods, including participatory evaluation

• Community-based evaluation
• Data management
• Confidence intervals
• Statistical and practical significance
• Meta-analysis
• Evaluation frameworks and models
• Ethical evaluation and institutional review boards
• Informed consent forms
• Research misconduct
• Evaluation reports online
• Evaluation models online
• Economic evaluations: cost-effectiveness, cost-benefit, cost utility
• Factorial designs
• Codebooks
• Data entry
• On-and offline presentations of evaluation findings

This book is recommended for anyone who is responsible for either selecting or evaluating programs or
policies in health (nursing, public health, medicine); education; social work; and criminal justice. It is also
intended for students and faculty in courses designed for program funders, planners and administrators, and
policy makers. The examples presented throughout the book are taken from actual program evaluations conducted worldwide; they include both younger and older participants and cover the costs and quality of programs in public health and medicine, education, psychology, social work, and criminal justice.

What Is New in the Third Edition


• Over 50 new references and hundreds of new examples, most of them accessible online and covering a
range of topics from health and psychology to education and social work
• Expanded definition of program evaluation to reflect the most recent thinking about quality, safety,
equity, accessibility, and value
• Extensive discussion on how to design and report on evaluations that not only provide information on
improvement and effectiveness, but also on program quality and value
• Addition of two new evaluation frameworks: PRECEDE-PROCEED and logic models
• Guidelines for searching for and evaluating the quality of the literature contained in online libraries (e.g., PubMed, ERIC, Web of Science)
• Standards for selecting the “best available evidence,” which is the key to evidence-based or science-based
practice

• Guide to comparative effectiveness program evaluation and its similarities to and differences from other evaluations in design and purpose
• New section on ethical considerations for online evaluations and surveys
• Expanded section on qualitative methods
• Discussion, explanation, and examples of mixed-methods research
• Guide to presenting evaluation results in poster form
• Discussion and rationale for using standardized reporting checklists, such as CONSORT (for RCTs) and TREND (for non-RCTs)

On a personal note, I would like to thank my editor, Vicki Knight, who has been a champ in supporting
me in my vagaries. She has been extraordinary (and fun) throughout the writing and editing processes.
I would also like to acknowledge the special role of the reviewers. I am so very grateful for their time and
excellent comments. Their reviews were crucial to my revision and the writing of this edition. I am most
grateful to the following people:

Patricia Gonzalez, San Diego State University

Moya L. Alfonso, Jiann-Ping Hsu College of Public Health

Sharon K. Drake, Iowa State University

Young Ik Cho, University of Wisconsin–Milwaukee

Mesut Akdere, University of Wisconsin–Milwaukee

Marie S. Hammond, Tennessee State University

Lillian Wichinsky, University of Arkansas at Little Rock

An accompanying website at www.sagepub.com/fink3e includes mobile-ready eFlashcards and links to suggested websites.

About the Author

Arlene Fink (PhD) is professor of medicine and public health at the University of California, Los Angeles,
and president of the Langley Research Institute. Her main interests include evaluation and survey research
and the conduct of research literature reviews as well as the evaluation of their quality. Dr. Fink has conducted
scores of literature reviews and evaluation studies in public health, medicine, and education. She is on the
faculty of UCLA’s Robert Wood Johnson Clinical Scholars Program and is a scientific and evaluation advisor
to UCLA’s Gambling Studies and IMPACT (Improving Access, Counseling, & Treatment for Californians
with Prostate Cancer) programs. She consults nationally and internationally for agencies such as L’Institut de Promotion de la Prévention Secondaire en Addictologie (IPPSA) in Paris, France, and Peninsula Health in
Victoria, Australia. Professor Fink has taught and lectured extensively all over the world and is the author of
over 135 peer-reviewed articles and 15 textbooks.

Purpose of This Chapter

Program evaluation is an unbiased exploration of a program’s merits, including its effectiveness, quality, and value. What does a program evaluation look like? Why do one? This chapter provides
examples of program evaluations and describes their purposes, methods, and uses. It also discusses
formative and summative evaluations and qualitative and mixed-methods research. The chapter
introduces comparative and cost-effectiveness evaluations and describes their methods and
importance. The chapter also provides examples of program evaluation models and offers guidelines
for identifying relevant online evaluation reports.

1. Program Evaluation: A Prelude

A Reader’s Guide to Chapter 1

What Is Program Evaluation?


The Program or Intervention
Program Objectives and Outcomes
Program Characteristics
Program Impact
Program Costs
Program Quality
Program Value

Evaluation Methods
Posing evaluation questions and hypotheses; deciding on evidence of merit: effectiveness, quality,
value; designing the evaluation; selecting participants for the evaluation; collecting data on program
merit; managing data so that it can be analyzed; analyzing data to decide on program merit; reporting
on effectiveness, quality, and value.

Who Uses Evaluations?

Baseline Data, Formative Evaluation, and Process Evaluations

Summative Evaluation

Qualitative Evaluation

Mixed-Methods Evaluation

Participatory and Community-Based Evaluation

Evaluation Frameworks and Models


PRECEDE-PROCEED, RE-AIM, Centers for Disease Control Framework, Practical Program
Evaluation, and Logic Models

Evaluation Reports Online

Exercises

Suggested References and Readings

Suggested Websites

What Is Program Evaluation?

Program evaluation is an unbiased exploration of a program’s merits, including its effectiveness, quality, and
value. An effective program provides substantial benefits to individuals, communities, and societies and these
benefits are greater than their human and financial costs. A high-quality program meets its users’ needs and is
based on sound theory and the best available research evidence. A program’s value is measured by its worth to
individuals, the community, and society.

The Program or Intervention

At the core of an evaluation is a program, intervention, treatment, or policy. Programs, interventions, and
treatments are systematic efforts to achieve explicit objectives for improving health, education, and well-
being. They occur in all fields, including medicine, education, business, and law, and they involve individuals,
communities, and society. A program may be relatively small (e.g., a course in web design for seniors in two
high schools; a new community health center for persons over 75 years of age), or relatively large (e.g., a
nation’s health plan or a global initiative to eliminate poverty). Programs can take place in differing
geographic and political settings, and they vary in their purposes, structures, organization, and constituents. A
policy is a system of laws, regulatory measures, courses of action, and funding priorities associated with
private and public governing agencies, trusts, and boards. A policy can be used to support or discontinue
programs, interventions, treatments, and evaluations.
Are the following objectives likely to come from program evaluations?

Objective: To determine the effectiveness of an abuse-prevention curriculum designed to empower women with mental retardation to become effective decision makers.

The answer is yes. The evaluation is for an abuse-prevention program.


What about this objective?

Objective: To investigate the effectiveness of acupuncture compared with sham acupuncture and with no
acupuncture in patients with migraine.

This objective is also likely to come from an evaluation. The investigators compare three interventions:
acupuncture, sham acupuncture, and no acupuncture. (No acupuncture is considered an intervention because
the absence of acupuncture does not mean the absence of anything at all. The no acupuncture group may be
on medication or other forms of therapy.)
Finally, is this objective likely to come from a program evaluation?

Objective: To assess whether the Acquittal Project has effectively exonerated wrongfully convicted people through DNA testing.

Similarly, this objective is likely to come from a program evaluation. The program, the Acquittal Project, is
designed to achieve a specific objective: to exonerate wrongfully convicted people.
Now, consider whether this objective is typical of program evaluations.

Objective: To clarify the concepts of coping with pain and quality of life, and to present a literature
review of the strategies that children with recurrent headaches use to cope with their pain; the impact of
recurrent headaches on children’s quality of life; and the influence of personal characteristics (i.e., age,
family support) on headaches, coping, and quality of life in children.

No. This objective is not typical of program evaluations. The researchers are not planning to investigate the
effectiveness, quality, or value of a specific program.

Program Objectives and Outcomes

A program’s objectives are its anticipated outcomes—for example, to improve skills in primary school math,
prevent gambling problems in adolescents, or provide increased access to social services for young families.
The aim of a major program evaluation is to provide data on a program’s progress toward achieving its
objectives.
The ultimate, desired outcomes of most social programs are usually lofty goals, such as providing efficient,
high-quality health care and education to all people. These outcomes are often difficult to measure (or
achieve) because of a lack of consensus on definitions, and because evaluators rarely have sufficient time to
observe and assess the programs accurately. As a result, many evaluations focus on the extent to which
programs achieve more easily measured goals and objectives, such as improving 4th grade reading skills,
helping adolescents stop smoking, or teaching older adults to become better consumers of online health
information. The idea is that if programs can foster the achievement of these interim objectives,
accomplishment of the loftier outcomes may eventually become possible.

Program Characteristics

Evaluations answer questions about a program’s characteristics and social and cultural contexts. Typical
questions of this type include: Is the individual, communal, or societal need for the program explained? Is it
justified? Who is responsible for program development and program funding? Which principles of learning,
social justice, or health-behavior change guide the program development? Was the program implemented as
planned? What were the barriers to implementation? Were changes made to the original objectives? If so,
why were the changes needed, and who made them? What is the duration of the program? What is its
content?

Program Impact

Evaluators often examine a program’s impact—that is, the scope of its effects, the duration of its outcomes, and the extent of its influence in varying settings and among different groups of people. For example,
consider the evaluations of two programs to improve mental health status. Evaluation A reports that Program
A improved mental health status for its participants, and that the gains were sustained for at least 3 years;
moreover, when Program A was tried out in another country, participants in that country also improved.
Evaluation B reports that Program B also improved mental health status and sustained the improvement for 3 years, but for fewer participants. When Program B was tested in another country, the evaluators found few gains. The evaluators of Programs A and B agreed that Program A had greater impact because
its benefits reached more people over a longer period of time.

Program Costs

Evaluations are also concerned with how much a program costs and the relationship of cost to effectiveness
and benefit. Program costs include any risks or problems that adversely affect program participants. For
instance, program participants in a group therapy program may feel embarrassed about revealing personal
information, or they may become unexpectedly ill from the treatment being evaluated. Program costs also
include the financial costs of facilities, staff, and equipment. Typical questions about costs include: If two
programs achieve similar outcomes, which one is least costly? For each dollar spent, how much is saved on
future use of services?

Program Quality

High-quality programs meet their users’ needs and are based on accepted theories of human behavior and
the best available research evidence. They have sufficient funding to ensure that their objectives are achieved
and have strong leadership, trained staff, and a supportive environment.
Commonly asked questions about program quality include:

• Has the program been studied systematically before implementation so that its risks and benefits are
predictable?
• Is the program grounded in theory or supported by the best available research?
• Does the program provide a safe, healthy, and nurturing environment for all participants?
• Is the infrastructure well-developed, and is the fiscal management sound?
• How well does the program develop and nurture positive relationships among staff, participants,
parents, and communities?
• Does the program recruit, hire, and train a diverse staff who value each participant and can deliver
services as planned at the highest level?
• Has the program established a partnership with communities in order to achieve program goals?
• Does the program have a coherent mission and a plan for increasing capacity so that the program is
sustained or continues to grow?
• Is a system in place for measuring outcomes and using that information for program planning,
improvement, and evaluation?

• Are resources appropriately allocated so that each component of the program and its evaluation are likely
to produce unbiased and relevant information?

Program Value

Value is defined as the importance, worth, or usefulness of something. The word “evaluation” in program
evaluation implies that the discipline’s purpose is to analyze and judge the value of a program. The term value
is subjective, and the whole enterprise of program evaluation is based on identifying strategies to minimize
the subjectivity—bias—that can consume the process of analyzing or judging a program’s merit.
Despite the term’s subjectivity, most evaluators agree that “value” should be defined to suit the recipients of
services (students, patients) rather than the suppliers (teachers, nurses, physicians, social workers,
psychologists, funders, policy makers). Typical evaluation questions about program value include:

• Do the program’s benefits outweigh its risks and costs?


• Does the program meet a need that no other service can or does provide?
• Does the program provide the most improvement and benefits possible with its available resources?

Evaluation Methods

Program evaluators use many of the methods social and health scientists, educators, and psychologists rely on
to gather reliable and valid evidence. These methods typically include the following activities:

1. Selecting questions or formulating hypotheses about program and participant characteristics,
outcomes, impact, and costs

2. Deciding on evidence of program merit: effectiveness, quality, value

3. Designing the evaluation

4. Selecting participants for the evaluation

5. Collecting data on program merit

6. Managing data so that it can be analyzed

7. Analyzing data to decide on merit

8. Reporting the results

Evaluation Questions and Hypotheses

Evaluations directly or indirectly answer questions about a program’s implementation, outcomes, impact,
and costs. Some evaluations design their studies to test hypotheses rather than ask questions, although the
two are related.
Typical evaluation questions and hypotheses include:

• Question: Did the program achieve its goals and objectives?
Hypothesis: When compared to a similar program, Program A will achieve significantly more goals and
objectives than Program B.
• Question: Which program characteristics (e.g., theoretical foundation, use of technology, funding) are
most likely responsible for the best and worst outcomes?
Hypothesis: The online course will achieve significantly better results than the traditional course.
• Question: For which individuals or groups was the program most effective?
Hypothesis: Boys will learn as quickly as girls.
• Question: How applicable are the program’s objectives and activities to other participants in other
settings?
Hypothesis: Participants in other schools will do as well as participants in the local school.
• Question: How enduring were the program’s outcomes?
Hypothesis: Participants will maintain their gains over a five-year period after the program’s conclusion.
• Question: What are the relationships among the costs of the program and its outcomes?
Hypothesis: For every dollar spent, there will be at least one reading level improvement.
• Question: To what extent did social, political, and financial support influence the program’s outcomes
and acceptability?
Hypothesis: Local support is associated with greater program satisfaction.
• Question: Is the program cost-effective?
Hypothesis: New Program A and old Program B achieve similar outcomes, but Program A costs less to
implement.
• Question: Were there any unanticipated outcomes (beneficial as well as harmful)?
This is a research question. No hypothesis is associated with it because the evaluators have no basis for
stating one. They do not have a theory or any research evidence to support assumptions about outcomes.

Some evaluations answer just a few questions or test just a few hypotheses, while others answer many
questions and test numerous hypotheses.

Evidence of Merit: Effectiveness, Quality, Value

Evidence of merit consists of the facts and information that demonstrate a program’s effectiveness, quality,
and value. Consider each of the following six possible indications of merit for a program whose objective is
“to improve children’s dietary and other health habits”:

1. Testimony from children in the program (and from their parents and teachers) that their habits have
improved.

2. The evaluator’s observations of improved health habits (e.g., through studies of children’s choices of
snacks during and between meals).

3. Proof of children’s improved health status found in physical examinations by a nurse practitioner or a
physician.

4. The evaluator’s finding of statistically significant differences in habits and in health status between
children who are in the program compared with children who are not. Children in the program do
significantly better.

5. The evaluator’s finding of statistically significant and sustained differences in habits and in health status
between children who are in the program compared with children who are not. Children in the
program continue to do significantly better over time.

6. Statistical and qualitative evidence that Program A achieves the same aims as Program B, and
demonstrates that it is less costly.

Which of these indications of merit is best? How much and what types of evidence are needed? Merit is a
subjective term: It varies across individuals, communities, institutions, and policy makers.
The evaluator’s challenge is to identify evidence that is unbiased, convincing to the evaluation’s users and
funders, and possible to collect with the available resources. For instance, evaluators are unlikely to be able to
provide data on sustained program benefits in evaluations that are scheduled to last a year or less even if they
have the resources, and even if that is what the users indicate they want. Bias in evaluations often comes from
faulty research methods or failure to properly implement the program or the evaluation.
Many evaluators consult and form partnerships with users and funders to ensure that the evidence they
plan to collect is appropriate and likely to meet expectations. Evaluators find that working with clients
typically creates mutual respect, promotes client cooperation with data collection during the evaluation’s
implementation, and improves the usefulness of the results.

Designing the Evaluation

An evaluation’s design is its structure. Evaluators do their best to design a project so that any benefits that
appear to result from the program are real and not influenced by expectation or preference. A standard
evaluation design includes comparing the participants in a new program with participants in an alternative
program. The comparison can occur once or several times. For example, suppose five universities plan to
participate in an evaluation of a new program to teach the basic principles of program evaluation to Education
Corps trainees. In designing the study, the evaluator has to answer questions like these (a short random-assignment sketch follows the list):

• Which program is a fair comparison to the “new” one? Evaluators sometimes compare the new program
to an already existing one with similar characteristics, or they compare the new program to “usual
practice.” If the resources are available, they may compare the new program to an already existing
program and also to usual practice. Another option is to compare two versions of the new program and
usual practice. For instance, a crime prevention program for teens may compare a smartphone app with
peer counseling [version 1], the same app without the counseling [version 2], and the school’s usual
monthly webinar [usual practice].
• Which criteria are appropriate for including institutions? (size, resources, location)

• Which criteria are appropriate for excluding institutions? (unwillingness to implement program as
planned; lack of staff commitment to the evaluation)
• What should I measure? (understanding of principles of program evaluation, application of the
principles when designing an evaluation)
• When should I measure learning? (before and after program participation? How long after program
participation?)
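
The comparison is fair only if expectation or preference does not determine which sites receive the new program. A common safeguard, taken up in Chapter 3, is to assign sites to the study arms at random. The short Python sketch below illustrates the idea; the university names, arm labels, and seed value are hypothetical and are not taken from an actual evaluation.

import random

# Hypothetical participating sites and study arms.
sites = ["University A", "University B", "University C", "University D", "University E"]
arms = ["new program", "existing program", "usual practice"]

random.seed(2015)      # a fixed seed makes the allocation reproducible and auditable
random.shuffle(sites)  # put the sites in a random order
assignment = {site: arms[i % len(arms)] for i, site in enumerate(sites)}

for site, arm in sorted(assignment.items()):
    print(site, "->", arm)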

Selecting Participants for the Evaluation

Suppose you are asked to evaluate a program to provide school-based mental health services to children
who have witnessed or have been victims of violence in their communities. Here are some questions you need
to ask:

• Who should be included in the evaluation? (Which grades should be included? How much exposure to
violence should eligible children have?)
• Who should be excluded? (Should children be excluded if, in the opinion of the mental health clinician,
they are probably too disruptive to participate in the program’s required group therapy sessions?)
• How many children should be included? (What is a sufficient number of participants to allow the
evaluation to detect change in children’s behavior if the program is effective?)
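
The last question, about a sufficient number of participants, is usually answered with a power analysis (taken up in Chapter 4). The sketch below, which uses the Python statsmodels library, shows the general idea; the effect size, alpha, and power values are hypothetical planning assumptions rather than figures from this evaluation.

from statsmodels.stats.power import TTestIndPower

# Hypothetical planning values: a standardized effect of 0.4, a two-sided
# significance level of .05, and an 80% chance of detecting the effect.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.4, alpha=0.05, power=0.80)
print(f"About {n_per_group:.0f} children per group would be needed.")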

Collecting Data on Program Merit

Conclusions about a program’s merit, quality, and value come from the data an evaluator collects to answer
questions and test hypotheses. Data collection includes

• identifying the variables (individual knowledge, attitudes, or behaviors; community practices; social
policies) that are the program’s target outcomes;
• identifying the characteristics of the participants who will be affected by the program (men between the
ages of 45 and 54; rural and urban communities);
• selecting, adapting, or creating measures of the variables (knowledge tests, direct observations of
behavior; analysis of legal documents);
• demonstrating the reliability (consistency) and validity (accuracy) of the measures;
• administering the measures; and
• analyzing and interpreting the results.

Some common measures or sources of evaluation data are

• literature reviews;
• archival reviews (school and medical records);
• existing databases, such as those maintained by governments and schools;
• self-administered questionnaires (including in-person, mailed, and online surveys);

• interviews (in-person and on the telephone);
• achievement tests;
• observations;
• physical examinations; and
• hypothetical vignettes or case studies.

Managing Data So That It Can Be Analyzed

Data management includes the following activities (a short sketch illustrating several of them follows the list):

• Drafting an analysis plan that defines the variables to be analyzed


• Creating a codebook
• Establishing the reliability of the coders or coding
• Entering data into a database and validating the accuracy of the entry
• Reviewing the evaluation’s database for incomplete or missing data
• Cleaning the data
• Creating the final data set for analysis
• Storing and archiving the data set and its operations manual
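
The short Python sketch below, a hypothetical illustration rather than part of any evaluation described in this book, shows several of these steps using the pandas and scikit-learn libraries: a simple codebook, recoding a missing-value code, cleaning an out-of-range score, and checking coder agreement with the kappa statistic. All variable names and values are invented.

import numpy as np
import pandas as pd
from sklearn.metrics import cohen_kappa_score

# A simple codebook: each variable, what it means, and how it is coded.
codebook = {
    "id": "Participant identifier",
    "group": "1 = program, 2 = comparison",
    "pretest": "Health-habits score at baseline (0 to 100); 999 = missing",
}

raw = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "group": [1, 2, 1, 2],
    "pretest": [62, 999, 48, 150],  # 999 codes an omitted answer; 150 is impossible
})

clean = raw.copy()
clean["pretest"] = clean["pretest"].replace(999, np.nan)          # recode missing data
clean.loc[~clean["pretest"].between(0, 100), "pretest"] = np.nan  # clean out-of-range scores

# Two coders independently classify the same open-ended answers;
# kappa summarizes their agreement beyond chance.
coder_1 = ["improved", "no change", "improved", "worse"]
coder_2 = ["improved", "no change", "no change", "worse"]
print("Kappa:", round(cohen_kappa_score(coder_1, coder_2), 2))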

Analyzing Data to Decide on Program Merit

Data analysis consists of the descriptive (qualitative) and statistical (quantitative) methods used to
summarize information about a program’s effectiveness. The choice of which method of analysis to use is
dependent on several considerations (a short sketch follows the list):

• The characteristics of the evaluation question and evidence of effectiveness (Do the questions ask about
differences over time among groups or about associations between program characteristics and benefits?
If the questions ask about differences, then a statistical method that tests for differences is needed. If the
questions ask about associations, then different statistical methods are probably warranted.)
• How the variables are expressed statistically: categorically (“did or did not pass the test”); with ordinals
(Stage I, II, or III of a disease; ratings on a scale ranging from 1 to 5); or numerically (average scores on
a mathematics test)
• How the variables are expressed qualitatively (e.g., themes from an analysis of a focus group)
• The reliability and validity of the data
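
As a rough illustration of the first two considerations, the Python sketch below (using the scipy library and hypothetical numbers) applies a chi-square test to a categorical outcome, counts of participants who did or did not pass, and a t test to a numerical outcome, scores on a mathematics test.

from scipy import stats

# Categorical outcome: how many participants in each group passed or did not pass.
program_counts = [40, 10]     # passed, did not pass
comparison_counts = [28, 22]
chi2, p_categorical, dof, expected = stats.chi2_contingency([program_counts, comparison_counts])
print("Chi-square p value:", round(p_categorical, 3))

# Numerical outcome: mathematics scores for each group.
program_scores = [72, 80, 68, 75, 83, 77]
comparison_scores = [65, 70, 62, 74, 69, 66]
t_statistic, p_numerical = stats.ttest_ind(program_scores, comparison_scores)
print("t test p value:", round(p_numerical, 3))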

Reporting on Effectiveness, Quality, and Value

An evaluation report describes, justifies, and explains the purposes of the evaluation, the program, the
setting, and the methods that are used to arrive at unbiased conclusions about effectiveness, quality, and value.
The methods include descriptions and explanations of the evaluation question and evidence-selection
processes, the research design, sampling strategy, data collection, and data analysis. The report also states the results and arrives at conclusions about program merit based on the evidence. Many scholarly journals also require proof that the evaluation respected and protected participants from risk. This is done by asking the evaluator to state that the evaluation received a formal review by an ethics board.
Evaluation reports may be oral or may be presented in written form, as books, monographs, or articles.
Consider the summaries in Example 1.1.

Example 1.1 Summaries of Program Evaluations

1. Evaluation of a Healthy Eating Program for Professionals Who Care for Preschoolers (Hardy, King,
Kelly, Farrell, & Howlett, 2010)

Background. Early childhood services are a convenient setting for promoting healthy eating and physical
activity as a means of preventing overweight and obesity. This evaluation examined the effectiveness of a
program to support early childhood professionals in promoting healthy eating and physical activity among
children in their care.

Setting and Participants. The evaluation included 15 intervention and 14 control preschools with 430
children whose average age was 4.4 years.

Methods. Preschools were randomly allocated to the intervention or a control program. The evaluators did
not know which schools were in each program. They collected data before and after program implementation
on children’s lunchbox contents; fundamental movement skills (FMS); preschool policies, practices, and staff
attitudes; knowledge and confidence related to physical activity; healthy eating; and recreational screen time.

Results. Using statistical methods, the evaluators found that over time, FMS scores for locomotion and object
control, and total FMS scores significantly improved in the intervention group compared with the control
group by 3.4, 2.1, and 5.5 points (respectively). The number of FMS sessions per week increased in the
intervention group compared with the control group by 1.5. The lunchbox audit showed that children in the
intervention group significantly reduced sweetened drinks by 0.13 servings.

Conclusion. The findings suggest that the program effectively improved its targeted weight-related
behaviors.

2. A Psychological Intervention for Children With Symptoms of Posttraumatic Stress Disorder (Stein et al.,
2003)

Context. Are psychological interventions effective for children with symptoms of posttraumatic stress
disorder (PTSD) resulting from personally witnessing or being personally exposed to violence?

Objective. To evaluate the effectiveness of a collaboratively designed school-based intervention for reducing
children’s symptoms of PTSD and depression resulting from exposure to violence.

Design. A randomized controlled trial conducted during one academic school year.

Setting and Participants. Sixth-grade students who reported exposure to violence and had clinical levels of
symptoms of PTSD at two large middle schools in a large U.S. city.

Intervention. Students were randomly assigned to a 10-session standardized cognitive-behavioral therapy
early intervention group (61 students), or to a wait-list delayed intervention comparison group (65 students)
conducted by trained school mental health clinicians.

Main Outcome Measures. Students were assessed before the intervention and 3 months after the
intervention on measures assessing child reported symptoms of PTSD and depression.

Results. The evaluation found that compared with the wait-list delayed intervention group (no intervention),
after 3 months of intervention, students who were randomly assigned to the early intervention group had
significantly lower scores on symptoms of PTSD and depression.

Conclusion. A standardized 10-session cognitive-behavioral group intervention can significantly decrease


symptoms of PTSD and depression in students who are exposed to violence and can be effectively delivered
on school campuses by trained school-based mental health clinicians.

As the summaries in Example 1.1 show, doing an evaluation involves the use of multiple skills in research design, statistics, data collection, and interpretation. Since very few individuals have
perfected all of these skills, evaluators almost always work in teams, as is illustrated in Example 1.2.

Example 1.2 Program Evaluation as an Interdisciplinary Discipline

• A 4-year evaluation of a new workplace literacy program was conducted by a team composed of two
professional evaluators, a survey researcher, a statistician, and two instructors. The evaluation team also
consulted an economist and an expert in information science.
• A 3-year evaluation of a 35-project program to improve access to and use of social services for low-income
women relied on two professional evaluators, a social worker, an epidemiologist, a nurse practitioner, and
an economist.
• An evaluation of a program using nurses to screen community dwelling elderly individuals for
hypertension, vision, and hearing disorders relied on a nurse, a nurse practitioner, a statistician, and a
professional evaluator.

Who Uses Evaluations?

At least seven different groups use the information that results from program evaluations:

1. Government agencies

2. Program developers (a director of a community health clinic, a curriculum committee, or a nursing school’s curriculum committee)

3. Communities (geographically intact areas, such as a city’s “skid row”; people with a shared health-related problem, such as HIV/AIDS; or individuals with a common culture, such as Armenians or Latinos)

4. Policy makers (i.e., elected officials; the school board)

5. Program funders (philanthropic foundations or trusts and the various agencies of the National Institutes
of Health)

6. Students, researchers, and other evaluators (specific to schools and universities, government agencies,
businesses, and public agencies)

7. Individuals interested in new programs

Baseline Data, Formative Evaluation, and Process Evaluation

The need for a program is demonstrated when there is a gap between what individuals or communities need
and their current services. Baseline data are collected to document program participants’ status before they
begin the program. Interim data, which are collected during the course of the program, show the program’s
progress in meeting the participants’ needs. These interim data are used to evaluate the program while in its
formative stage.

Baseline Data

Baseline data are collected before the start of the program to describe the characteristics of participants
(e.g., their social, educational, health status, and demographic features, such as age), information that is
important later on when the evaluator is interpreting the effects of the program. Example 1.3 illustrates some
of the reasons program evaluators collect baseline data.

Example 1.3 Baseline Data and Program Evaluation

The Agency for Drug and Alcohol Misuse has published extensive guidelines for identifying and counseling
adolescents whose alcohol use is interfering with their everyday activities, such as attendance at school. An
evaluation of the guideline’s effectiveness is being conducted nationwide. Before the evaluators begin the
formal evaluation process, they collect baseline data on the extent to which health care professionals in
different settings (e.g., schools and community clinics) already follow the practices recommended by the
guidelines, the prevalence of alcohol misuse among adolescents in the communities of interest, and the
number of adolescents that are likely to use services in the evaluation’s proposed duration of 3 years.

Interim Data and Formative Evaluation

In formative evaluation, data are collected after the start of a program, but before its conclusion—for
example, 12 months after the beginning of a 3-year intervention. An evaluator collects interim data to describe the progress of the program while it is still developing or “forming.” Formative evaluation data are
mainly useful to program developers and funders. Program developers and funders may want to know if a new
program is feasible as is, or whether it needs to be improved. A feasible program is one that can be
implemented according to plan and is likely to be beneficial.
Data from formative evaluations are always preliminary, and require cautious interpretation. Example 1.4
illustrates why evaluators need to take care in interpreting formative findings.

Example 1.4 Formative Evaluation and Interim Data: Proceed With Caution

In a 3-year study of access to prenatal care, the results of a 14-month formative evaluation found that three of
six community clinics had opened on schedule and were providing services to needy women exactly as
planned. Preliminary data also revealed that 200 women had been served in the clinics and that the
proportion of babies born weighing less than 2,500 grams (5.5 pounds) was 4%, well below the state’s average
of 6%. The evaluators concluded that progress was definitely being made toward improving access to prenatal
care. After 3 years, however, the evaluation results were quite different. The remaining three scheduled clinics
had never opened, and one of the original three clinics had closed. Many fewer women were served than had
been anticipated, and the proportion of low birth weight babies was 6.6%.

As this example shows, data from a formative evaluation can be misleading. Good interim results may be
exhilarating, but poor ones can adversely affect staff morale. With programs of relatively short duration—say,
2 years or less—the collection of interim data is expensive and probably not very useful. Consider the
evaluation described in Example 1.5.

Example 1.5 Questions Asked in a Formative Evaluation of a Program for Critically Ill Children

Many experts agree that the emergency medical services needed by critically ill and injured children differ in
important ways from those needed by adults. As a result, a number of health regions have attempted to
reorganize their emergency services to provide better care to children. One region commissioned a 3-year
evaluation of its program. It was specifically concerned with the characteristics and effectiveness of a soon to
be implemented intervention to prevent transfers from adult inpatient or intensive care units and to maximize
quality of care for children with cardiopulmonary arrest in hospital emergency departments and intensive care
units.

In planning the evaluation, the evaluators decided to check a sample of medical records in 15 of the state’s 56
counties to see whether sufficient information was available for them to use the records as a main source of
data. Also, the evaluators planned to release preliminary findings after 12 months, which involved reviews of
records as well as interviews with physicians, hospital administrators, paramedics, and patients’ families. An
expert’s review of the evaluation’s design raised these questions for the evaluators:

1. Does the description of this evaluation as a “3-year evaluation” mean that there will be 3 years of data
collection, or do the 3 years include evaluation planning, implementation, and reporting as well as data
collection? Assume interim data are promised in a year. Can you develop and validate medical record
review forms in time to collect enough information to present meaningful findings?

2. Can you develop, validate, and administer the survey forms in the time available?

3. To what extent will the interim and preliminary analyses answer the same or similar questions? If they
are very different, will you have sufficient time and money to effectively conduct both?

4. Will a written or oral interim report be required? How long will that take to prepare?

Some program evaluations are divided into two phases. In Phase 1, the evaluation is designed to focus on
feasibility and improvement, and in Phase 2, it focuses on effectiveness, cost, and value. Some funders prefer
to have Phase 1 done internally (i.e., by the participating schools or clinics), and Phase 2 done externally (by professional evaluation consultants). External evaluations are presumed to be more objective and less prone to bias than internal evaluations. Increasingly, however, many agencies and institutions involve program
evaluators in both study phases for continuity and efficiency.

Process or Implementation Evaluation

A process evaluation is concerned with the extent to which planned activities are implemented, and its
findings may be reported at any time. Process evaluations are almost always useful. For example, in an
evaluation of three interventions to increase the rates at which women returned to follow up on Pap smears, a
process evaluation concluded that implementation of the intervention protocols was less than perfect and thus
introduced a bias into the results of the outcome evaluation. This study is the subject of Example 1.6.

Example 1.6 Process or Implementation Evaluation: Follow-Up of Abnormal Pap Smears

During the course of a 2-year evaluation, all women were to be surveyed at least once regarding whether they
received the program and the extent to which they understood its purposes and adhered to its requirements.
Telephone interviews after 18 months revealed that 74 of 100 women (74%) in the slide-tape intervention
had seen the entire 25-minute presentation, 70 of 111 (63%) had received mailed reminders from their
physicians’ offices to come back for another Pap smear, and 32 of 101 (about 32%) had received phone calls
from their physicians’ offices. These findings helped explain the apparent failure of the third intervention to
achieve positive results when compared with the other two.

Summative Evaluation

Summative evaluations are historical studies that are compiled after the program has been in existence for a while (say, two years) or after all program activities have officially ceased. These evaluations sum up and
qualitatively assess the program’s development and achievements.
Summative evaluations are descriptive rather than experimental studies. Funding sources sometimes
request these evaluations because summative reports usually contain details on how many people the program
served, how the staff was trained, how barriers to implementation were overcome, and if participants were
satisfied and likely to benefit. Summative evaluations often provide a thorough explanation of how the
program was developed and the social and political context in which the program and its evaluation were
conducted.

Qualitative Evaluation

Qualitative evaluations collect data through interviews, direct observations, and review of written documents
(for example, private diaries). The aim of these evaluations is to provide personalized information on the
dynamics of a program and on participants’ perceptions of the program’s outcomes and impact.
Qualitative evaluation is useful for examining programs where the goals are in the process of being defined,
and for testing out the workability of particular evaluation methods. Because they are personalized, qualitative
methods may add emotion to otherwise purely statistical findings and provide a means of gauging outcomes
when reliable and valid measures of those outcomes are unlikely to be available in time for inclusion in the
evaluation report.
Qualitative methods are employed in program evaluations to complement the usual sources of data (such as
standardized surveys and medical record reviews, physical examinations, and achievement tests). Example 1.7
illustrates four uses of qualitative methods in program evaluation.

Example 1.7 Uses of Qualitative Methods in Program Evaluation

1. To evaluate the effectiveness of a campaign to get heroin addicts to clean their needles with bleach, the
evaluators spend time in a heroin “shooting gallery.” They do not have formal observation measures,
although they do take notes. The evaluators discuss what they have seen, and although needles are being
cleaned, agree that the addicts use a common dish to rinse needles and dilute the drug before shooting.
The evaluators recommend that the community’s program should be altered to take into account the
dangers of this practice.

2. To evaluate the quality and effectiveness of an education counseling program for mentally ill adults, the
evaluation team lives for 3 months in each of five different residential communities. After taping more
than 250 counseling sessions, the evaluators examine the tapes to determine whether certain counseling
approaches were used consistently. They conclude that the quality of the counseling varies greatly both
within and among the communities, which helps to explain the overall program’s inconsistent results.

3. To evaluate the impact of a school-based health program for homeless children, the evaluators teach a cohort of children to keep diaries over a 3-year period. The evaluation finds that children in the program
are much more willing to attend to the dangers of smoking and other drug use than are children in schools
without the program. The evaluators do an analysis of the content of the children’s diaries. They find that
children in the program are especially pleased to participate. The evaluators conclude that the children’s
enjoyment may be related to the program’s positive outcomes.

4. An evaluation of the impact on the county of a program to improve access to and use of prenatal care
services asks “opinion leaders” to give their views. These people are known in the county to have expertise
in providing, financing, and evaluating prenatal care. The interviewers encourage the leaders to raise any
issues of concern. The leaders share their belief that any improvements in prenatal care are probably due
to medical advances rather than to enhanced access to services. After the interviews are completed, the
evaluators conclude that major barriers to access and use continue to exist even though statistical registries
reveal a decline in infant mortality rates for some groups of women.

In the first evaluation in Example 1.7, the evaluators are observers at the heroin shooting gallery. They rely
on their observations and notes to come to agreement on their recommendations. In the second illustration,
the evaluators tape the sessions, and then interpret the results. The interpretations come after the data are
collected; the evaluators make no effort to state evaluation questions in advance of data collection. In the third
illustration, diaries are used as a qualitative tool, allowing participants to say how they feel in their own words.
In the fourth illustration in Example 1.7, experts are invited to give their own views; the evaluators make little
attempt to require the opinion leaders to adhere to certain topics.

Mixed-Methods Evaluation

Mixed methods is most commonly interpreted as a type of research in which qualitative and quantitative or
statistical data are combined within a single study. Example 1.8 outlines at least three reasons for mixing
methods: to better understand experimental study results, to incorporate user perspectives into program
development, and to answer differing research questions within the same study. Consider these examples.

Example 1.8 Reasons for Mixed-Methods Evaluations

1. Mixed Methods to Incorporate User Perspectives into Program Development

The study’s main purpose was to develop online education to improve people’s use of web-based health
information. The investigators convened five focus groups and conducted in-depth interviews with 15 people
to identify preferences for learning [user perspectives]. They asked participants questions about the value of
audio and video presentations. Using the information from the groups and interviews, the investigators
developed an online education tutorial and observed its usability in a small sample. Once they had evidence
that the education was ready for use in the general population, they evaluated its effectiveness by using
statistical methods to compare the knowledge, self-efficacy, and Internet use among two groups. Group 1 was
assigned to use the newly created online tutorial, and Group 2 was given a printed checklist containing tips
for wise online health information searches.

2. Mixed Methods to Answer Different Research Questions (Marczinski & Stamates, 2012; Yu, 2012)

A. The investigators in this study wanted to find out if alcohol consumed with an artificially sweetened
mixer (e.g., diet soft drink) results in higher breath alcohol concentrations (BrACs) compared with the
same amount of alcohol consumed with a similar beverage containing sugar [Research Question 1].
They were also interested in determining if individuals were aware of the differences [Research
Question 2]. BrACs were recorded, as were self-reported ratings of subjective intoxication, fatigue,
impairment, and willingness to drive. Performance was assessed using a signaled go/no-go reaction
time task. Based on the results, the investigators found that mixing alcohol with a diet soft drink
resulted in elevated BrACs, as compared with the same amount of alcohol mixed with a sugar-
sweetened beverage. Individuals were unaware of these differences, a factor that may increase the
safety risks associated with drinking alcohol.

B. A mixed-methods project was devoted to understanding college students’ justification for digital
piracy. The project consisted of two studies, a qualitative one and a quantitative one. Qualitative
interviews were conducted to identify main themes in students’ justification for digital piracy; the
findings were subsequently tested in a quantitative manner using a different sample of students.

3. Mixed Methods to Better Understand Experimental Results

The investigators found that experimental program participants reported significantly more discomfort with
study participation than did control program participants. This finding surprised the evaluation team. To
help them understand the findings, the team conducted interviews with each of the experimental program
participants and asked them about the causes of their discomfort.

Participatory and Community-Based Evaluation

A participatory evaluation invites representatives of the organizations and communities that will be affected by
the evaluation’s findings to join the evaluation team as partners in some or all of the evaluation activities.
Proponents of community-based evaluations assert that when community participation is encouraged, there
are at least four reasons why an evaluation’s findings can be particularly useful in helping to reduce disparities
in health, education, and well-being based on characteristics, such as race, ethnicity, age, sexual orientation,
socioeconomic status, and geography.

1. Participation helps to improve the quality and validity of research by giving it a basis in local
knowledge, culture, and history. In participatory evaluations, public concerns are viewed ecologically—
that is, in their political and social context as well as in their clinical setting.

2. Including the expertise of community members enhances the relevance of the evaluation questions, the
quality and quantity of data gathered, and the use of the data. Community members as well as
researchers “own” the data and therefore want to see the data used.

3. Participatory evaluation projects can assist in providing community members with resources and
possible employment opportunities. For example, community members can help evaluators in
translating surveys and in conducting interviews.

4. Participatory evaluations can lead to improvements in the health and well-being of communities by
studying and addressing important community needs and increasing community members’ power and
control over the research process. The community can keep the evaluators on track, preventing them
from taking an approach that is too academic or theoretical.

Participatory evaluators must be skilled in working with diverse groups of individuals. They must learn how
to lead meetings, encourage consensus, and inform participants about the objectives and purposes of
evaluation studies in general and their own evaluations in particular. At the same time, participatory
evaluators must lead the process of collecting unbiased data and interpreting those data objectively. Not all
programs—no matter how well-intentioned—are effective, and even those that have positive effects may not
be cost-effective, and so the participatory evaluator must be prepared to be the bearer of bad news.
Participatory evaluations themselves also tend to be extremely costly because they are labor-intensive: They
require individuals from the evaluation team and the community to spend time agreeing on evidence of
effectiveness and assisting with technical activities, including research design, data collection, and report
writing.
Example 1.9 provides illustrations of participatory evaluation in action.

Example 1.9 Participatory Evaluations in Action

1. An evaluation of a cancer control program involves the community in all phases of the project, from the
development of the grant proposal through to interpretation of the data. The purpose of the project is to
evaluate the effectiveness of a culturally appropriate intervention as a means of increasing breast and
cervical cancer screening practices among the community’s women. The results show a community-wide
impact on cancer-related knowledge, attitudes, and behaviors; increased research capabilities; and
improvements to the health systems and services available to the community.

2. A mental health intervention is designed to diminish symptoms of depression in urban schoolchildren who
have witnessed or participated in community violence. A group of parents assist the evaluators in
developing evaluation questions, translating some of the surveys into Spanish and Russian, and collecting
data from other parents. They also review the evaluation’s findings and comment on them. The comments
are incorporated into the final report of the intervention’s effectiveness.

3. The directors of a health care clinic, interested in improving patient education, intend to organize a series
of staff seminars and then evaluate whether patient education improves after all staff have attended the
seminars. As part of the evaluation, the evaluation team convenes a series of four noon meetings with
clinic staff to identify the nature and extent of current problems in the clinic’s education for patients and
to examine alternative solutions. The clinic staff agrees to form a committee to work with the evaluators
and decide on evidence of effectiveness for the seminars and the patient education. The staff also agrees to
advise the evaluators on questions to ask patients about their experiences at the clinic and to review and
comment on the report of the evaluation’s findings.

In the first illustration in Example 1.9, members of the community are actively included in all phases of the
evaluation study, including the writing of the proposal for funding and the interpretation of the data. In the
second instance, parents work with the evaluators on many activities, including the formulation of evaluation
questions, data collection, and reporting. They are not necessarily involved in designing the evaluation (e.g.,
determining which children are eligible for participation and the characteristics of the control or comparative
intervention) or in the data analysis. The third illustration is a participatory evaluation because the staff and
evaluators work together to decide on evidence of effectiveness, identify appropriate questions to ask patients
about their experiences, and review the evaluation report.

Evaluation Frameworks and Models


The PRECEDE-PROCEED Framework

Evaluation frameworks provide guidance for program planners and evaluators, helping ensure that the
evaluation’s overall design considers the origins and contexts of the programs examined. One commonly used
framework is the PRECEDE-PROCEED Framework (Figure 1.1).
The acronym PRECEDE stands for predisposing, reinforcing, and enabling constructs in
education/environmental diagnosis and evaluation. The acronym PROCEED stands for policy, regulatory,
and organizational constructs in educational and environmental development. Although developed to study
how effectively programs bring about changes in health behavior, PRECEDE-PROCEED is increasingly
being used in education and psychology as well.
The PRECEDE-PROCEED model begins at the far right of the figure and, moving counterclockwise,
has eight phases.

1. Social assessment to determine perceptions of people’s needs and quality of life. For instance, evaluators
use focus groups with parents, students, and teachers to find out how to improve attendance at after-school
programs.

2. Epidemiological assessment to identify the problems that are most important in the community. For
instance, evaluators conduct interviews with providers at local clinics to find out why neighborhood
children visit the clinics; evaluators review county records to study inoculation rates; program planners
conduct interviews with children and families to learn more about their culture, family history, and
lifestyle.

3. Educational and ecological assessment to identify the factors that might be needed to foster changes in
behaviors. These may include assessments of knowledge, beliefs, and self-efficacy (referred to as
predisposing factors); social support (reinforcing factors); and programs and services necessary for good
outcomes to be realized (enabling factors).

4. Administrative and policy assessment and intervention alignment to review policies and resources that
facilitate or hinder program implementation.

5–8. Implementation and evaluation of process, impact, and outcomes. Using the assessments as a guide,
program developers implement programs and evaluators study the programs’ activities and immediate
and long-term outcomes.

RE-AIM

The acronym RE-AIM stands for reach, efficacy (or effectiveness), adoption, implementation, and maintenance.
Reach refers to the percentage of potential participants who are exposed to an intervention and how
representative they are of others who might benefit from the program. Efficacy, or effectiveness, concerns both
the intended effects of a program and the possible unintended outcomes. Adoption refers to the participation
rate of eligible subjects and how well the setting and the people who deliver the intervention reflect future
participants. Implementation denotes the extent to which various components of the program are delivered as
intended. Maintenance is related to two questions: What are the program’s long-term effects? To what extent
is the program continued after the completion of the evaluation? All five of these dimensions are considered
equally important in the RE-AIM framework.

The Centers for Disease Control’s Framework for Planning and Implementing Practical Program Evaluation

Figure 1.2 illustrates the framework for planning and implementing “practical” program evaluation
recommended by the U.S. Centers for Disease Control and Prevention (CDC). This framework consists of
six steps for accomplishing the evaluation (e.g., beginning with engaging stakeholders) and includes four
standards for assessing the evaluation: accuracy, utility, feasibility, and propriety.

Figure 1.2 The CDC’s Framework for Planning and Implementing Practical Program Evaluation

The three frameworks described above share several important features. First, they are all more precisely
described as frameworks rather than models. That is, their purpose is to provide guidance in program
planning and evaluation. Strictly speaking, models “predict” behavior or outcomes and are based on
theoretical expectations or empirical evidence gained through experience and experiment. Frameworks leave
the theories and methods of implementation to the evaluator.
These three frameworks are all-inclusive. PRECEDE-PROCEED, for example, contains a comprehensive
set of factors that should be considered in program planning and evaluation. It is unlikely, however, that
evaluators will ever find themselves involved in all aspects of program development and evaluation as
characterized in this framework. In many cases, evaluators are called in to appraise the merits and
effectiveness of existing programs. In other cases, evaluators are asked to be part of research teams that are
developing interventions. No one really expects an individual evaluator to be the expert in the planning
process or in the development of a program. The evaluator’s primary domain is collecting and interpreting
valid data on the implementation of the program and its effectiveness.
Frameworks such as PRECEDE-PROCEED, RE-AIM, and the CDC’s approach to practical evaluation
may be useful in encouraging evaluators to pay attention to the origins and development of the programs they
examine (even if the evaluator had little to do with establishing the need for the programs or with their
implementation). Any knowledge an evaluator gains may help to design more realistic and relevant studies.
The CDC’s framework is different from PRECEDE-PROCEED and RE-AIM in that it incorporates
standards for a good evaluation that specifically include propriety. Propriety refers to the legal and ethical
considerations involved in evaluation research. With the exception of very small, local studies, and some
larger studies conducted under certain circumstances, most evaluation studies are now required by law and
institutional practice to demonstrate their ethical nature in writing, specifying how they will show respect for
their participants and protect participants’ privacy.

Logic Models

A logic model is a planning tool to clarify and graphically display what your evaluation intends to do and
what it hopes to accomplish. The most basic model consists of a depiction (often in graphic form) and an
explanation of the resources that go into a program, the activities it undertakes, and the changes or benefits
that result. The relationships are logical. In most cases, the relationships have not been tested empirically.
Figure 1.3 shows the components of a basic logic model developed by the Centers for Disease Control and
Prevention.
According to its supporters, a logic model describes the sequence of events presumed to produce or
generate benefits or change over time. It portrays the chain of reasoning that links investments to results.
Additionally, a logic model is termed a systems model because it shows the connection of interdependent parts
that together make up the whole.

There is no single correct way to create a logic model. The stage of development of the program (i.e.,
planning, implementation, or maintenance) leads to one of two approaches used to create the model: right-
to-left or left-to-right (Figure 1.4).

Right-to-Left Logic Model. This approach, also called reverse logic, starts with desired outcomes and requires
working backwards to develop activities and inputs. Usually applied in the planning stage, this approach
ensures that program activities logically lead to the specified outcomes if the arrow bridges are well-founded.
As you work backward from right to left in the logic model, ask the question: “How?” This approach is also helpful
for a program in the implementation stage that still has some flexibility in its program activities.

Figure 1.4 Right-to-Left Logic Model

Left-to-Right Logic Model. This approach (Figure 1.5), also called forward logic, may be used to evaluate a
program in the implementation or maintenance stage that does not already have a logic model. You start by
articulating the program inputs and activities. To move to the right in your model, you continually ask the
question, “Why?” You can also think of this approach as an “If-then” progression.
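
To make these components concrete, the sketch below captures a basic logic model as a simple data structure and prints its forward, if-then reading. It is written in Python purely for illustration; the program, inputs, activities, outputs, and outcomes are all hypothetical, and a real logic model is ordinarily drawn as a diagram rather than written as code.

# A minimal, hypothetical logic model represented as a Python dictionary.
logic_model = {
    "inputs": ["funding", "trained health educators", "classroom space"],
    "activities": ["deliver eight nutrition lessons", "send materials home to parents"],
    "outputs": ["number of lessons delivered", "number of students attending"],
    "outcomes": ["improved knowledge of healthy eating",   # short term
                 "healthier food choices at school",       # intermediate
                 "lower rates of childhood obesity"],      # long term
}

# Forward (left-to-right, "if-then") reading of the model.
components = ["inputs", "activities", "outputs", "outcomes"]
for left, right in zip(components, components[1:]):
    print(f"IF the {left} are in place ({', '.join(logic_model[left])}),")
    print(f"  THEN we expect the {right}: {', '.join(logic_model[right])}.")

# A reverse (right-to-left) reading starts from the desired outcomes and asks
# "How?" at each step, working backward to the activities and inputs required.

Reading the printed statements from top to bottom reproduces the forward logic; starting from the outcomes and asking "How?" at each step reproduces the reverse logic used in the planning stage.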

Evaluation Reports Online

Published evaluation reports provide information on promising programs and effective research methods. You
can find evaluation reports in their entirety online. One good place to start is PubMed, the free access
bibliographic database of the National Library of Medicine (http://www.ncbi.nlm.nih.gov/pubmed). If you
search for program evaluation, you will be given detailed options as shown in Figure 1.6.
Suppose you are interested in reviewing program evaluations in nursing. You enter the words: “program
evaluation, nursing.” This yields over 27,000 articles pertaining to program evaluation in nursing. Most
reviewers would find this number overwhelming.
PubMed allows you to filter the search or narrow it down so that it produces fewer studies, but studies
that are more likely to be on target. For example, you can ask PubMed to provide only evaluations that are
clinical trials, published in the past five years and for which a full text is available (Figure 1.7). This produces
223 evaluations. You can reduce the number of citations even more by adding other filters (publication date
within one year rather than five) or other terms like “community-based.” If you simply add “community-
based,” you are left with 22 evaluations to consider (Figure 1.8). The challenge is to weigh the need for a
comprehensive search against the available resources (time, skill).
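
Readers who run many such searches may prefer to script them against PubMed’s public E-utilities interface instead of pointing and clicking. The short Python sketch below is only an illustration: it assumes the requests library is installed, and the filter tags shown (publication type and date) follow standard PubMed query syntax, which should be verified against the current E-utilities documentation before use.

import requests

# Count and list PubMed records for a filtered program-evaluation search.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

query = (
    '"program evaluation" AND nursing '
    "AND clinical trial[pt] "          # limit to clinical trials
    'AND "last 5 years"[dp]'           # limit by publication date
)

response = requests.get(
    ESEARCH_URL,
    params={"db": "pubmed", "term": query, "retmode": "json", "retmax": 20},
    timeout=30,
)
result = response.json()["esearchresult"]
print("Records found:", result["count"])
print("First PubMed IDs:", result["idlist"])

Adding further terms to the query (for example, community-based) narrows the count in the same way as the filters illustrated in Figure 1.8.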

Figure 1.5 Left-to-Right Logic Model

Figure 1.6 Evaluation Search Terms Provided by PubMed

Source: Pubmed at http://www.ncbi.nlm.nih.gov/pubmed

Figure 1.7 Restricting a Search in PubMed to Type of Study, Years Published, and Availability of Free
Text

Source: Pubmed at http://www.ncbi.nlm.nih.gov/pubmed

Another useful database for evaluation reports is ERIC (Education Resources Information Center): Go to
ERIC and enter the words, “program evaluation.” If you limit your search to the last three years (as of
December 2012), you will get 9,127 potentially usable evaluations. Similar to PubMed, ERIC (an online
library of education research and information, sponsored by the Institute of Education Sciences (IES) of the
U.S. Department of Education) gives you the opportunity to filter your search so that you do not have to
review over 9,000 articles, many of which will not be relevant. For instance, if you restrict the search to
evaluations that are primarily of interest to teachers, you will find 147 articles; for researchers, there are 38.
Figure 1.9 shows an ERIC search for journal articles published in the last six months (December 2012) on
the effectiveness of elementary education programs. The search uncovers 10 articles.
Other databases containing program evaluations in the social, psychological, health, and educational fields
include the Web of Science and PsycINFO. Almost all databases have tutorials to help you use a search
strategy that will retrieve the articles and evaluations you need. Learning to be efficient in conducting online
literature reviews is becoming an increasingly important evaluation skill (Fink, 2013).

Figure 1.8 Filtering Evaluation Reports

Source: Pubmed at http://www.ncbi.nlm.nih.gov/pubmed

Figure 1.9 An ERIC Search for Recent Articles Evaluating Elementary Education Programs

Source: Eric.ed.gov sponsored by the Institute of Education Sciences (IES) of the U.S. Department of Education.

Summary and Transition to the Next Chapter on Evaluation Questions and Evidence of Merit

Program evaluation is an unbiased exploration of a program’s merits, including its effectiveness, quality, and
value. An effective program provides substantial benefits to individuals, communities, and societies, and these
benefits are greater than their human and financial costs. A high-quality program meets its users’ needs and is
based on sound theory and the best available research evidence. A program’s value is measured by its worth to
individuals, the community, and society.
To conduct evaluations, researchers pose questions and decide on evidence of effectiveness, quality, and
value; choose study designs and sampling methods; and collect, analyze, interpret, and report information.
The information produced by program evaluations is used by the financial supporters of the programs as well
as by consumers (patients, students, teachers, and health care professionals), program developers, policy
makers, and other evaluators and health researchers.
Collecting and analyzing interim or formative evaluation data is expensive and may produce misleading
results; evaluators who choose to collect interim data should proceed with caution. However, formative
evaluation data are helpful in determining feasibility. Process evaluations are useful because they provide data
on what and when something was done, which is helpful in understanding the program’s dynamics.
Evaluations may use qualitative or statistical data or both in the same study. Participatory evaluations
involve the community in all phases of the evaluation. Comparative effectiveness evaluations compare two
existing programs in naturalistic settings and provide information designed to help consumers make informed
choices.
Before evaluators can decide on the evaluation’s design and data collection methods, they must choose
evaluation questions and decide on evidence of program effectiveness. The next chapter discusses how to
select and state evaluation questions and choose appropriate, justifiable evidence.

Exercises

Exercise 1

Directions

Read the two evaluations below and using only the information offered, answer these questions:

1. What are the evaluation questions?

2. What is the evidence of merit?

3. What data collection measures are being used?

1. Effectiveness of Home Visitation by public-health nurses in prevention of the recurrence of child physical
abuse and neglect (MacMillan et al., 2005)

Objective: Recurrence of child maltreatment is a major problem, yet little is known about approaches to
reduce this risk in families referred to child protection agencies. Since home visitation by nurses for
disadvantaged first-time mothers has proved effective in prevention of child abuse and neglect, the aim is to
investigate whether this approach might reduce recurrence.

Programs: 163 families with a history of one child being exposed to physical abuse or neglect are randomly
assigned so as to compare standard treatment with a program of home visitation by nurses in addition to
standard treatment. The main outcome was recurrence of child physical abuse and neglect based on a
standardized review of child protection records.

Results: At 3-year follow-up, recurrence of child physical abuse and neglect did not differ between the
intervention and standard treatment groups. However, hospital records showed significantly higher recurrence
of physical abuse and/or neglect.

2. Evaluating a Mental Health Intervention for Schoolchildren Exposed to Violence: A Randomized Controlled Trial (Stein et al., 2003)

Objective: To evaluate the effectiveness of a collaboratively designed school-based intervention for reducing
children’s symptoms of posttraumatic stress disorder (PTSD) and depression that resulted from exposure to
violence.

Program: Students were randomly assigned to a 10-session standardized cognitive-behavioral therapy (the
Cognitive-Behavioral Intervention for Trauma in Schools) early intervention group or to a wait-list delayed
intervention comparison group conducted by trained school mental health clinicians.

Results: Compared with the wait-list delayed intervention group (no intervention), after three months of
intervention, students who were randomly assigned to the early intervention group had significantly lower
scores on the Child PTSD Symptom Scale, the Child Depression Inventory, Pediatric Symptom Checklist,
and the Teacher-Child Rating Scale. At six months, after both groups had received the intervention, the
differences between the two groups were not significantly different for symptoms of PTSD and depression.

Exercise 2

Directions

Define program evaluation.

Exercise 3

Directions

Explain whether each of these is an evaluation study or not.

1. The purpose of the study was to evaluate a randomized, culturally tailored intervention to prevent high-risk
sexual behaviors related to HIV among Latina women residing in urban areas.

2. The researchers aimed to determine the effectiveness of an intervention regarding the use of spit
tobacco (ST) designed to promote ST cessation and discourage ST initiation among male high school
baseball athletes.

3. To study drivers’ exposure to distractions, unobtrusive video camera units were installed in the vehicles
of 70 volunteer drivers over 1-week time periods.

References and Suggested Readings

Fink, A. (2013). Conducting research literature reviews: From the Internet to paper. Thousand Oaks, CA: Sage.
Galvagno, S. M., Jr., Haut, E. R., Zafar, S. N., Millin, M. G., Efron, D. T., Koenig, G. J., Jr., . . . Haider,
A. H. (2012, April 18). Association between helicopter vs ground emergency medical services and
survival for adults with major trauma. JAMA: The Journal of the American Medical Association, 307(15),
1602–1610.
Hammond, G. C., Croudace, T. J., Radhakrishnan, M., Lafortune, L., Watson, A., McMillan-Shields, F.,
& Jones, P. B. (2012, September). Comparative effectiveness of cognitive therapies delivered face-to-face
or over the telephone: An observational study using propensity methods. PLoS One, 7(9).
Hardy, L., King, L., Kelly, B., Farrell, L., & Howlett, S. (2010). Munch and move: Evaluation of a preschool
healthy eating and movement skill program. International Journal of Behavioral Nutrition and Physical
Activity, 7(1), 80.
MacMillan, H. L., Thomas, B. H., Jamieson, E., Walsh, C. A., Boyle, M. H., Shannon, H. S., & Gafni, A.
(2005). Effectiveness of home visitation by public-health nurses in prevention of the recurrence of child
physical abuse and neglect: A randomised controlled trial. The Lancet, 365(9473), 1786–1793.
Marczinski, C. A., & Stamates, A. L. (2012). Artificial sweeteners versus regular mixers increase breath
alcohol concentrations in male and female social drinkers. Alcoholism: Clinical and Experimental Research,
37(4), 696–702.
Porter, M. E. (2010). What is value in health care? New England Journal of Medicine, 363(26), 2477–2481.
Stein, B. D., Jaycox, L. H., Kataoka, S. H., Wong, M., Tu, W., Elliott, M. N., & Fink, A. (2003, August 6).
A mental health intervention for schoolchildren exposed to violence: A randomized controlled trial.
JAMA: The Journal of the American Medical Association, 290(5), 603–611.
Volpp, K. G., Loewenstein, G., & Asch, D. A. (2012). Assessing value in health care programs. JAMA: The
Journal of the American Medical Association, 307(20), 2153–2154.
Yu, S. (2012, October 1). College students’ justification for digital piracy: A mixed methods study. Journal of
Mixed Methods Research, 6(4), 364–378.

Suggested Websites

PubMed: http://www.ncbi.nlm.nih.gov/pubmed

ERIC: http://www.eric.ed.gov

PsycINFO

Web of Science

The PRECEDE-PROCEED model (Community Tool Box): http://ctb.ku.edu/en/tablecontents/sub_section_main_1008.aspx

The RE-AIM framework: http://www.re-aim.org and http://www.cdc.gov/aging/caregiving/assuring.htm

CDC’s Practical Framework: http://www.cdc.gov/eval/materials/frameworkoverview.PDF

Logic Models: http://www.cdc.gov/nccdphp/dnpao/hwi/programdesign/logic_model.htm and http://www.childtrends.org/files/child_trends-2008_02_19_eva18programquality.pdf

Purpose of This Chapter

A program evaluation is an unbiased exploration of a program’s effectiveness, quality, and value.


Evaluations answer questions like: Did the program benefit all participants? Did the benefits
endure? Is the program sustainable? Did the program meet the needs of the community, and was it
done more efficiently than current practice? The answers to these questions require not just a “yes”
or a “no,” but evidence for the answers. This chapter begins with a discussion of commonly asked
evaluation questions and hypotheses. It continues with an explanation of how to select and justify
evidence of program merit.

Evaluation questions focus on programs, participants, outcomes, impact, and costs and provide data
on whether a program achieves its objectives, in what context, with whom, and at what cost.
Evidence of merit may be based on statistical significance, but the evidence should also have
practical or clinical meaning. Sources of practical or clinical significance include experts and
consumers, large databases (“big data”) and the research literature. Experts can provide information
on what to expect from a program, and consumers can tell the evaluator what is acceptable. Large
databases provide information on populations and programs that can assist in program development
(“What have others done?”) and guide the evidence-selection process (“What did others achieve?”).

The relationships among evaluation questions and hypotheses, evidence of merit, and independent
and dependent variables are also examined in this chapter. Their connection is illustrated through
the use of a special reporting form: The QEV or Questions, Evidence, Variables Report.

2
Evaluation Questions and Evidence of Merit

A Reader’s Guide to Chapter 2

Evaluation Questions and Hypotheses


Evaluation Questions: Program Goals and Objectives
Evaluation Questions: Participants
Evaluation Questions: Program Characteristics
Evaluation Questions: Financial Costs
Evaluation Questions: The Program’s Environment
Evidence of Merit
Sources of Evidence
Evidence by Comparison

Evidence From Expert Consultation: Professionals, Consumers, Community Groups


Evidence From Existing Data and Large Databases
Evidence From the Research Literature

When to Decide on Evidence

Program Evaluation and Economics

The QEV Report: Questions, Evidence, Variables

Summary and Transition to the Next Chapter on Designing Program Evaluations

Exercises

References and Suggested Readings

Evaluation Questions and Hypotheses

Program evaluations are done to provide unbiased information on a program’s effectiveness, quality, and
value. They provide answers to questions like: Did the program achieve beneficial outcomes with all its
participants? In what specific ways did participants benefit, and what were the costs? Evaluation questions are
sometimes accompanied by hypotheses, such as: “No difference in program benefits or costs will be found
between men and women,” or “Younger people will benefit more than older people, but at increased cost.”
Almost all evaluations ask the question: Did the program achieve its objectives?

Evaluation Questions: Program Goals and Objectives

A program’s goals are usually relatively general and long-term, as shown in Example 2.1.

Example 2.1 Typical Program Goals

• For the public or the community at large

Optimize health status, education, and well-being

Improve quality of life

Foster improved physical, social, and psychological functioning

Support new knowledge about health and health care and social and economic well-being

Enhance satisfaction with health care and education

• For practitioners

Promote research

Enhance knowledge

Support access to new technology and practices

Improve the quality of care or services delivered

Improve education

• For institutions

Improve quality of leadership

Optimize ability to deliver accessible high-quality health care and superior education

Acquire on-going funding

• For the system

Expand capacity to provide high-quality health care and education

Support the efficient provision of care and education

Ensure respect for the social, economic, and health care needs of all citizens

A program’s objectives are its specific planned outcomes, to be achieved relatively soon (within six months
or a year), although their sustainability can be monitored over time (every year for 10 years).

Consider the program objectives in the brief description given in Example 2.2 of an online tutorial.

Example 2.2 The Objectives of a Tutorial to Teach People to Become Savvy Consumers of
Online Health Information

Many people go online for information about their health. Research is consistent in finding that online health
information varies widely in quality from site to site. In a large city, the Department of Health, Education,
and Social Services sponsored the development of a brief online tutorial to assist the community in becoming
better consumers of health information [the general, long-range goal]. They conducted focus groups to find out
what people wanted to learn and how they like to learn and used the findings as well as a model of adult
learning to guide the program’s instructional features. The evaluators did a preliminary test of the program,
and found it worked well with the test group.
The tutorial had these three objectives [hoped-for program outcomes for the present time].

At the conclusion of the tutorial, the learner will be able to:

• list at least five criteria for defining the quality of health information;
• name three high-quality online health information sites; and
• give an unbiased explanation of the symptoms, causes, and treatment of a given health problem using
online information.

These objectives became the basis for the evaluation questions, specifically, whether the program is
effective in achieving each of the three objectives (Example 2.3).

Example 2.3 Program Objectives and Evaluation Questions

The evaluation team designed a study to find out if an online health information skills tutorial achieved each
of its objectives. The team invited people between 20 and 35 years of age who had completed at least two
years of college to participate in the evaluation. One group had access to the tutorial, and the other was given
a link to a respected site that had a checklist of factors to look for in high-quality health information searches.

The evaluators asked these questions.

1. How did tutorial participants compare to the comparison group on each of the objectives?

2. If the comparisons are in favor of the tutorial group, were they sustained over a 12-month period?

Study hypotheses are often associated with one or more evaluation questions. A hypothesis is a tentative
explanation for an observation or a scientific problem that can be tested by further investigation. Hypotheses
are not arbitrary or based on intuition. They are derived from data collected in previous studies or from a
review of the literature. For example, consider this evaluation question, hypothesis, and justification for the
evaluation of the online tutorial:

Evaluation Question: How did tutorial participants compare to the control participants on each of the
program’s objectives?

Hypothesis: The tutorial participants will perform significantly better than the control participants on each
objective.

Justification: The research literature suggests that adults learn best if their needs and preferences are
respected. This evaluation conducted focus groups to identify needs and preferences. These were
incorporated into the tutorial. Also, a model of adult learning guided the program’s development. A
preliminary test of the program suggested that it would be effective in its targeted population. The
comparison group intervention, although it relied on a well-known and respected site, did not contain
incentives to learn.

Evaluation questions can be exploratory. Exploratory questions are asked when preliminary data or research
is not available to support hypothesis generation. For example, suppose the evaluators of the online tutorial
are interested in finding out if differences exist in the achievement of the tutorial’s objectives among frequent
and infrequent Internet users. If no information is available to justify a hypothesis (e.g., frequent users will do
better), the evaluators can decide to ask an exploratory question, such as “Do frequent Internet users do better
than infrequent users?” Exploratory questions are often interesting and can contribute new knowledge.
However, they also consume resources (data collection and analysis time) and result in information that may
be of little interest to anyone except the evaluators. Like all scientific studies, the best evaluations rely on
the scientific method to minimize bias; what distinguishes evaluations from other research is that they are
primarily practical studies whose main purpose is to inform decisions about programs and policies.

Evaluation Questions: Participants

Evaluation questions often ask about the demographic and social, economic, and health characteristics of
the evaluation’s participants. The participants include all individuals, institutions, and communities affected
by the evaluation’s findings. Participation in an evaluation of a school program, for instance, can involve
students, teachers, parents, the principal, the school nurse, and representatives of the school’s governing
boards and local council.
Questions about evaluation participants are illustrated in Example 2.4.

Example 2.4 Evaluation Questions and Participants

The developer of a new program evaluation course for first- and second-year graduate students was concerned
with finding out whether the program was effective for all students or only a subset. One measure of
effectiveness is the student’s ability to prepare a satisfactory evaluation plan. The evaluator asked the
following evaluation questions:

• What are the demographic characteristics (age, sex) of each year’s students?
• Is the program equally effective for differing students (for example, males and females)?
• Do first- and second-year students differ in their learning?
• At the end of their second year, did the current first-year students maintain their learning?

Evaluation questions should be answerable with the resources available. Suppose that the evaluation
described in Example 2.4 is only a one-year study. In that case, the evaluator cannot answer the question
about whether this year’s first-year students maintained their learning over the next year. Practical
considerations often dampen the ambitions of an evaluation.

Evaluation Questions: Program Characteristics

A program’s characteristics include its content, staff, and organization. The following are typical questions
about a program’s characteristics:

Evaluation Questions About Program Characteristics

What is the content of the program? The content includes the topics and concepts that are covered. In an
English course, the topics might include writing essays and research papers and naming the parts of a
sentence. In a math class, the concepts to cover may include negative and imaginary numbers.

What is the theory or empirical foundation that guides program development? Many programs are built
on theories or models of instruction, decision making, or behavior change. Some program developers also
rely on previous studies (their own or others) to provide information about previous programs to guide
current program development and evaluation.

Who is in charge of delivering the content? Content may be delivered by professionals (teachers,
physicians, social workers, nurses) or members of the community (trained peer counselors).

How are the participants “grouped” during delivery? In nearly all evaluations, the program planners or
evaluators decide how to assemble participants to deliver the intervention. For instance, in an education
program, participants may be in classrooms, the cafeteria, or the gym. In a health setting, participants may
attend a clinic or physician’s office.

How many sessions or events are to be delivered and for how long? The intervention may be a 10-minute
online weight-loss program accessed once a month, a weight-loss App (application software) accessed on
an as-needed basis, or a medically supervised weight-loss program requiring monthly visits to a clinic.

How long does the intervention last? Is it a one-year program? A one-year program in which participants
agree to be followed-up for 5 years?

Evaluation Questions: Financial Costs

Program evaluations can be designed to answer questions about the costs of producing program outcomes.
A program’s costs consist of any outlay, including money, personnel, time, and facilities (e.g., office
equipment and buildings). The outcomes may be monetary (e.g., numbers of dollars or Euros saved) or
substantive (e.g., years of life saved, or gains in reading or math). When questions focus on the relationship
between costs and monetary outcomes, the evaluation is termed a cost-benefit analysis. When questions are
asked about the relationship between costs and substantive outcomes, the evaluation is called a cost-
effectiveness analysis. The distinction between evaluations concerned with cost-effectiveness and those
addressing cost-benefit is illustrated by these two examples:

• Cost-effectiveness evaluation: What are the comparative costs of Programs A and B in providing the
means for pregnant women to obtain prenatal care during the first trimester?
• Cost-benefit evaluation: For every $100 spent on prenatal care, how many dollars are saved on neonatal
intensive care?

In the past, program evaluations usually ignored questions about costs. Among the reasons are the
difficulties inherent in defining costs and measuring benefits and in adding an economic analysis to an already
complex evaluation design. Additionally, evaluators questioned why they should study the costs of an
intervention when its effectiveness had not yet been proved.
Conducting cost studies requires knowledge of accounting, economics, and statistics. It is often wise to
include an economist on the evaluation team if you plan to analyze costs.
Example 2.5 illustrates the types of questions that program evaluators pose about the costs, effects,
benefits, and efficiency of health care programs.

Example 2.5 Evaluation Questions: Costs

• What is the relationship between the cost and the effectiveness of three prenatal clinic staffing models:
physician-based, mixed staffing, and clinical nurse specialists with physicians available for consultation?
Costs include number of personnel, hourly wages, number of prenatal appointments made and kept, and
number of hours spent delivering prenatal care. Outcomes include maternal health (such as complications
at the time of delivery), neonatal health (such as birth weight), and patient satisfaction.
• How efficient are health care centers’ ambulatory clinics? Efficiency is defined as the relationship between
the use of practitioner time and the size of a clinic, waiting times for appointments, time spent by faculty
in the clinic, and time spent supervising house staff.
• How do the most profitable private medical practices differ from the least profitable in terms of types of
ownership, collection rates, no-show rates, percentage of patients without insurance coverage, charge for a
typical follow-up visit, space occupancy rates, and practitioner costs?
• To what extent does each of three programs to control hypertension produce an annual savings in reduced
health care claims that is greater than the annual cost of operating the program? The benefits are costs per
hypertensive client (the costs of operating the program in each year, divided by the number of hypertensive
employees being monitored and counseled that year). Because estimates of program costs are produced
over a given 2-year period, but estimates of savings are produced in a different (later) period, benefits have
to be adjusted to a standard year. To do this, one must adjust the total claims paid in each calendar year by
the consumer price index for medical care costs to a currency standard of a 2013 dollar. The costs of
operating the programs are similarly adjusted to 2013 dollars, using the same index.
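
The inflation adjustment described in the last question above is simple arithmetic: each year’s nominal figure is multiplied by the ratio of the 2013 medical-care consumer price index to that year’s index. The Python sketch below uses invented cost figures and invented index values purely for illustration.

# Hypothetical example: expressing program costs from several years in 2013 dollars.
cpi_medical = {2010: 388.4, 2011: 400.3, 2012: 414.9, 2013: 425.1}   # invented index values
nominal_costs = {2010: 150_000, 2011: 162_000, 2012: 171_000}        # invented program costs

costs_in_2013_dollars = {
    year: cost * (cpi_medical[2013] / cpi_medical[year])
    for year, cost in nominal_costs.items()
}

for year, adjusted in sorted(costs_in_2013_dollars.items()):
    print(f"{year}: ${adjusted:,.0f} in 2013 dollars")

The savings side of the comparison (claims paid) would be adjusted with the same index so that costs and benefits are expressed in the same standard-year dollars.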

As these questions illustrate, evaluators must define costs and effectiveness or benefits and, when
appropriate, describe the monetary value calculations. Evaluators who answer questions about program costs
sometimes perform a sensitivity analysis when measures are not precise or the estimates are uncertain. For
example, in a study of the comparative cost-effectiveness of two state-funded school-based health care
programs, the evaluators may analyze the influence of increasing each program’s funding, first by 5% and then
by 10%, to test the sensitivity of each program’s effectiveness to changes in funding level. Through this analysis,
the evaluators will be able to tell whether or not increases in effectiveness keep pace with increases in costs.
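
In its simplest form, such a sensitivity analysis just recomputes a cost-effectiveness ratio under each funding scenario. The Python sketch below uses entirely invented numbers for one school-based program; in practice, the effectiveness figures under the 5% and 10% scenarios would be estimated from program data or the literature rather than assumed.

# Hypothetical sensitivity analysis: does effectiveness keep pace with funding?
scenarios = {
    "baseline":     {"cost": 500_000, "students_helped": 200},
    "funding +5%":  {"cost": 525_000, "students_helped": 205},
    "funding +10%": {"cost": 550_000, "students_helped": 208},
}

for name, s in scenarios.items():
    ratio = s["cost"] / s["students_helped"]   # dollars per student helped
    print(f"{name:>12}: ${ratio:,.0f} per student helped")

# With these invented numbers the cost per student rises as funding rises,
# suggesting that gains in effectiveness are not keeping pace with added costs.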

Evaluation Questions: The Program’s Environment

All programs take place in particular institutional, social, cultural, and political environments. For instance,
Program A, which aims to improve the preventive health care practices of children under age 14, takes place
in rural schools and is funded by the national government and the district. Program B has the same aim, but
it takes place in a large city and is supported by the city and a private foundation. The social, cultural, and
political values of the communities in which the program and the evaluation take place may differ even if the
programs have the same aim. It is these values that are likely to have an effect on the choice of questions and
evidence of merit.
Environmental matters can get complicated. If an evaluation takes place over several years (say, 3 years or
longer), the social and political context can change. New people and policies may emerge, and these may
influence the program and the evaluation. Among the environmental changes that have affected programs in
health care, for example, are alterations in reimbursement policies for hospitals and physicians, the
development of new technologies, and advances in medical science. In fact, technology and the need for new
types of workers and professionals have altered the context in which most programs operate and will continue
to operate.

Evaluation questions about environment include the following:

• Setting: Where did the program take place? A school? A clinic? Was the location urban or rural?
• Funding: Who funded the program? Government? Private Foundation or Trust?
• The managerial structure: Who is responsible for overseeing the program, and what is the reporting
structure? How effective is the managerial structure? If the individuals or groups who are
running the program were to leave, would the program continue to be effective?
• The political context: Is the political environment (within and outside the institution) supportive of the
success of the program? Is the program’s support secure?

Evidence of Merit

Evidence of merit consists of the facts and observations that are designed to convince the users of an
evaluation that the program’s benefits outweigh its risks and costs. For example, consider a program to teach
students to conduct evaluations. The program’s developers hope to instruct students in many evaluation skills,
one of which is how to formulate evaluation questions. The developers also anticipate that the program effects
will last over time.
Consider the following evaluation questions and their associated evidence.

• Evaluation question: Did the program achieve its learning objectives?


Evidence: Of all the students in the new program, 90% will learn to formulate evaluation questions.
Learning to formulate questions means identifying and justifying program goals, objectives, and benefits
and stating the questions in a comprehensible manner. Evidence that the questions are comprehensible
will come from review by at least three potential users of the evaluation.
• Evaluation question: Did the program’s effects last over time?
Evidence: No decreases in learning will be found between the students’ second and first years.

In this case, unless 90% of students learn to formulate questions by the end of the first year and first-year
students maintain their learning over a one-year period, the evaluator cannot say the program is effective.
Evidence should be specific. The more specific it is, the less likely you are to encounter later disagreement.
Specific evidence (90% of students) is also easier to measure than ambiguous evidence (almost all students).
Ambiguity arises when uniformly accepted definitions or levels of performance are unavailable. For example,
in the question “Has the Obstetrical Access and Utilization Initiative improved access to prenatal care for
high-risk women?” the terms improved access to prenatal care and high-risk women are potentially ambiguous.
To clarify these terms and thus eliminate ambiguity, the evaluators might find it helpful to engage in a
dialogue like the one presented in Example 2.6.

Example 2.6 Clarifying Terms: A Dialogue Between Evaluators

Evaluator 1: “Improved” means bettered or corrected.


Evaluator 2: For how many women and over what duration of time must care be bettered? Will all women
be included? 100% of teens and 90% of the other women?
Evaluator 1: “Improved access” means more available and convenient care.

Evaluator 2: What might render care more available and convenient? I did a systematic review of the
literature and found that care can be made more available and convenient if some or all the
following occur: changes in the health care system to include the provision of services
relatively close to clients’ homes; shorter waiting times at clinics; for some women, financial
help, assistance with transportation to care, and aid with child care; and education regarding
the benefits of prenatal care and compliance with nutrition advice.
Evaluator 1: “High-risk women,” according to the Centers for Disease Control, are women whose health
and birth outcomes have a higher-than-average chance of being poor.
Evaluator 2: Which, if not all, of the following women will you include? Teens? Users of drugs or alcohol?
Smokers? Low-income women? Women with health problems, such as gestational diabetes or
hypertension?

The evaluators in Example 2.6 do not arbitrarily clarify the ambiguous terms. Instead, they rely on the
research literature and a well-known public health agency for their definitions of improved access and high
risk. Evaluators must use trustworthy sources (the research literature, well-known experts) if the evaluation
itself is to be credible.
After they have clarified the question “Has the Obstetrical Access and Utilization Initiative improved
access to prenatal care for high-risk women?” the evaluators might develop standards, such as those listed in
Example 2.7.

Example 2.7 Evidence of Merit: Access to and Use of Prenatal Care Services

• At least four classes in nutrition and “how to be a parent” will be implemented, especially for teenagers.
• All clinics will provide translation assistance in English, Spanish, Hmong, and Vietnamese.
• Over a 5-year period, 80% of all pregnant women without transportation to clinics and community health
centers will receive transportation.

Notice that the evidence refers to changes in the structure of how health care is provided: specially
designed education, translation assistance, and transportation. A useful way to think about evidence of
effectiveness, especially for health care programs, is to decide whether you want to focus on structure, process,
or outcomes.
The structure of care refers to the environment in which health care is given as well as the characteristics of
the health care practitioners (including the number of practitioners and their educational and demographic
backgrounds), the setting (a hospital or doctor’s office, for example), and the organization of care (e.g., how
departments and teams are run).
The process of care refers to what is done to and for patients and includes the procedures and tests used by
the health care team in prevention, diagnosis, treatment, and rehabilitation.
The outcomes of care are the results for the patient of being in the health care system. These include
measures of morbidity and mortality; social, psychological, and physical functioning; satisfaction with care;
and quality of life.
Example 2.8 presents illustrative standards for the evaluation question “Has the Obstetrical Access and
Utilization Initiative improved access to care for high-risk women?”

Example 2.8 Structure, Process, and Outcome Evidence

• Structure evidence: All waiting rooms will have special play areas for patients’ children.
• Process evidence: All physicians will justify and apply the guidelines prepared by the College of Obstetrics
and Gynecology for the number and timing of prenatal care visits to all women.
• Outcome evidence: Significantly fewer low birth weight babies will be born in the experimental group than
in the control group, and the difference will be at least as large as the most recent findings reported in the
research literature.

Sources of Evidence

Evidence of program effectiveness comes from (1) studies designed to compare the new program to an
older one or usual practice; (2) the advice of experts, including professionals and community members; (3)
information in existing databases; and (4) a review of the research literature.

Evidence by Comparison

Evaluators compare programs by systematically observing one or more groups of participants, one group of
which is in the new program. If participants in Program A benefit more than participants in Program B, and
the comparison is carefully chosen, it may be possible to argue that Program A is more effective.
It is important to note that even if an evaluation finds differences, and the differences favor
participants in the new program, you cannot automatically assume that the new program is effective. At least
three questions must be asked before any such judgment is possible:

1. Were the programs comparable to begin with? One program may have better resources or commitment
from the staff than the other.

2. Were the participants comparable to begin with? By coincidence, the individuals in one program might
be smarter, healthier, more cooperative, or otherwise different from those in the comparison program.

3. Is the magnitude of the difference in outcomes large enough to be meaningful? With very large
samples, small differences (in scores on a standardized test of achievement, for example) can be
statistically, but not practically or clinically significant. Also, the scale on which the difference is based
may not be meaningful. A score of 12 versus a score of 10 is only important if there is research evidence
that people with scores of 12 are observably different from people with scores of 10. Are they smarter?
Healthier?

Suppose an evaluator is asked to study the effectiveness of an 8-week cognitive-behavioral therapy program
for children with measurable symptoms of depression. The evaluation design consists of an experimental
group of children who receive the program and a control group of children who do not. In a study design of
this type, the participants in the control group may get an alternative program, may get no program, or may
continue doing what they have been doing all along (usual practice). To guide the evaluation design, the
evaluator hypothesizes that the children who make up the two groups are the same in terms of their symptoms
before and after the program.
The evaluator administers standardized measures of depression symptoms to all the children in both groups
before the experimental program begins and within one week of its conclusion. After analyzing the data using
traditional statistical tests, the evaluator finds that in fact the children in the experimental program improve
(have better scores on the depression symptom measure) whereas the scores of those participants in the
control program do not improve. Using these statistical findings, the evaluator rejects the hypothesis that the
two groups are the same after the program and concludes that because they differ statistically, the program is
effective.
Some of the participant children’s teachers, however, challenge the evaluator’s conclusion by asking if the
statistical difference is clinically meaningful. The teachers are not convinced that the improvement in scores
in the experimental participants means much. After all, the depression symptoms measure is not perfect, and
the gains indicated may disappear over time. Through this experience, the evaluator learns that if you rely
solely on statistical significance as evidence of effectiveness, you may be challenged to prove that the statistics
mean something practical in clinical terms.
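
The teachers’ challenge can be made concrete with a small calculation. The sketch below uses invented depression scores and the scipy library (an illustrative choice, not a requirement) to show how a modest difference between two large groups can be statistically significant even though it may matter little in practical terms.

```python
# Minimal sketch (hypothetical data): statistical significance alone
# does not guarantee clinical or practical significance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Hypothetical depression symptom scores for two large groups whose true
# means differ by only 1 point on a scale with a standard deviation of 10.
experimental = rng.normal(loc=19.0, scale=10.0, size=2000)
control = rng.normal(loc=20.0, scale=10.0, size=2000)

result = stats.ttest_ind(experimental, control)
mean_difference = experimental.mean() - control.mean()

print(f"Mean difference: {mean_difference:.2f} points")
print(f"p-value: {result.pvalue:.4f}")   # likely below .05 because the samples are large
# Whether a 1-point improvement matters to patients is a clinical judgment
# that the p-value by itself cannot answer.
```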
The difference between statistical and clinical or practical significance is particularly important in program
evaluations that focus on impact, outcomes, and costs. Assume that students’ scores on a standardized test of
achievement increase from 150 to 160. Does this 10-point increase mean that students are actually more
capable? How much money are policy makers willing to allocate for schools to improve scores by 10 points?
On the other hand, consider a group of people in which each loses 10 pounds after being on a diet for 6
months. Depending on where each person started (some may need to lose no more than 10 pounds, but some
may need to lose 50), a loss of 10 pounds may be more or less clinically significant.
Another way to think of evidence of clinical significance is through the concept of effect size. Consider this
conversation between two evaluators:

Evaluator A: We have been asked to evaluate a web-based program for high school students whose aim is
to decrease their risky behaviors (such as drinking and driving; smoking) through interactive
education and the use of online support groups. We particularly want students to stop
smoking. How will we know if the program is effective? Has anyone done a study like this?
Evaluator B: I don’t know of anyone offhand, but we can e-mail some people I know who have worked
with online programs to reduce health risks. Maybe they have some data we can use. Also, we
can do a search of the literature.
Evaluator A: What do we look for?
Evaluator B: Programs with evidence of statistically significant reductions in proportions of students who
smoke. Once we find them, we will have to decide if the reduction is large enough to be
clinically as well as statistically meaningful. What proportion of teens in a given program need
to quit smoking for us to be convinced that the program is effective? For instance, if 10% of
teens quit, is that good enough, or do we expect to see 15% or even 20%?

Evaluator B is getting at the concept of effect size when she talks about observing a sufficiently large
number of students who quit smoking so that the outcome, if statistically significant, is also clinically
meaningful.
Effect size is a way of quantifying the size of the difference between two groups. It places the emphasis on
the most important aspect of a program—the size of the effect—rather than its statistical significance. The
difference between statistical significance and effect as evidence of merit is illustrated by these two questions:

Question 1: Does participation in Program A result in a smaller percentage of smokers than Program B,
and when using standard statistical tests is the difference significant? Any reduction in percentage is
acceptable, if the difference is statistically significant.

Question 2: Does Program A result in 25% fewer smokers or an even greater reduction [effect] than
Program B? Anything less than 25% is not clinically important even if it is statistically significant.
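
To make Evaluator B’s point concrete, the hypothetical sketch below compares the proportions of smokers in two programs and then checks the observed reduction against a prespecified clinical threshold of 25% fewer smokers, as in Question 2. The numbers and the use of scipy are illustrative assumptions, not data from an actual evaluation.

```python
# Hypothetical sketch: is the difference between two programs statistically
# significant, and does it also meet a prespecified clinical threshold
# (at least 25% fewer smokers in Program A than in Program B)?
from scipy.stats import norm

smokers_a, n_a = 90, 300    # Program A: 30% of students still smoke
smokers_b, n_b = 126, 300   # Program B: 42% of students still smoke

p_a, p_b = smokers_a / n_a, smokers_b / n_b
pooled = (smokers_a + smokers_b) / (n_a + n_b)
se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
z = (p_a - p_b) / se
p_value = 2 * (1 - norm.cdf(abs(z)))     # two-sided z test of equal proportions

relative_reduction = 1 - (p_a / p_b)     # proportionate drop in smokers, A vs. B

print(f"Smoking rates: A = {p_a:.0%}, B = {p_b:.0%}")
print(f"p-value: {p_value:.4f}")
print(f"Relative reduction: {relative_reduction:.0%} "
      f"({'meets' if relative_reduction >= 0.25 else 'does not meet'} the 25% criterion)")
```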

How do you determine a desirable effect size? As Evaluator B (above) suggests, one source of information
is data from other programs. These data provide a guide to what similar programs should be able to achieve.
Unfortunately, such data are not always available, especially if the program you are evaluating is relatively
innovative in design or objectives. Without existing data, you may have to conduct a small scale or pilot study
to get estimates of effect sizes you can hope for in your program.
A typical effect size calculation considers the standardized mean (average) difference between two groups. In other words:

Effect size = (Mean of Group 1 - Mean of Group 2) / Pooled standard deviation of the two groups

A rule of thumb for interpreting effect sizes is that a “small” effect size is .20, a “medium” effect size is .50,
and a “large” effect size is .80. However, not only do evaluators want to be sure that the effect is meaningful,
they want to be certain they have a large enough number of people in their study to detect a meaningful effect
if one exists. The technique for determining sample sizes to detect an effect of a given size (small, medium, or
large) is called power analysis (see chapter 4).
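
As a concrete illustration of the calculation and the rule of thumb just described, here is a minimal sketch that computes the standardized mean difference from made-up group statistics and attaches one of the conventional labels to the result.

```python
# Minimal sketch: the effect size (standardized mean difference) described
# above, computed from hypothetical group means and standard deviations.
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

def label(d):
    d = abs(d)
    if d >= 0.80:
        return "large"
    if d >= 0.50:
        return "medium"
    if d >= 0.20:
        return "small"
    return "trivial"

# Hypothetical example: experimental versus control depression scores.
d = cohens_d(mean1=15.0, sd1=6.0, n1=50, mean2=18.0, sd2=6.5, n2=50)
print(f"Effect size d = {d:.2f} ({label(d)})")
```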

Evidence From Expert Consultation: Professionals, Consumers, Community Groups

Experts can assist evaluators in deciding on evidence of effectiveness and in confirming the practical or
clinical significance of the findings. An expert is any individual or representative of a professional, consumer,
or community group who is likely to use or has an interest in the results of an evaluation. Evaluators use a
variety of techniques to consult with and promote agreement among experts. These usually include the
selection of representative groups of experts who then take part in structured meetings. For example, for an
evaluator who is concerned with selecting evidence of effectiveness for a program to improve the quality of
instruction in health policy and health services research, an appropriate group of advisers would include
experts in those fields, experts in education, and consumers of health services research and policy (such as
representatives from the public).
The fields of health and medicine make extensive use of expert panels. For example, the National Institutes
of Health has used consensus development conferences to help resolve issues related to knowledge about and
use of particular medical technologies, such as intraocular lens implantation, as well as the care of patients
with specific conditions, such as depression, sleep disorders, traveler’s diarrhea, and breast cancer. The
American College of Physicians, the American Heart Association, the Institute of Medicine, and the Agency
for Healthcare Research and Quality are some of the many organizations that consistently bring experts
together to establish “guidelines” for practice concerning common problems, such as pain, high blood
pressure, and depression.
The main purpose of seeking consensus among experts is to define levels of agreement on controversial
subjects and unresolved issues. When no comparison group data are available, these methods are extremely
germane to setting evidence against which to judge the effectiveness of new programs. True consensus
methods, however, are often difficult to implement, because they typically require extensive reviews of the
literature on the topic under discussion as well as highly structured methods.
The use of expert panels has proved to be an effective technique in program evaluation for setting evidence
of performance, as illustrated in Example 2.9.

Example 2.9 Using Experts to Decide on Evidence

Sixteen U.S. teaching hospitals participated in a 4-year evaluation of a program to improve outpatient care in
their group practices. Among the study’s major goals were improvements in amount of faculty involvement in
the practices, in staff productivity, and in access to care for patients. The evaluators and representatives from
each of the hospitals used evidence for care established by the Institute of Medicine as a basis for setting
evidence of program effectiveness before the start of the study. After 2 years, the evaluators presented interim
data on performance and brought experts from the 16 hospitals together to come to consensus on evidence for
the final 2 years of the study. To guide the process, the evaluators prepared a special form, part of which
appears in Figure 2.1.

Figure 2.1 Selected Portions of a Form Used in Deciding on Evidence of Merit

It is interesting to note that a subsequent survey of the participants in the evidence-setting process
discussed in Example 2.9 found that they did not use the interim data to make their choices: No association
was found between how well a medical center had previously performed with respect to each of 36 selected
indications of evidence and the choice of a performance level for the remaining 2 years of the evaluation.
Interviews with experts at the medical centers revealed that the evidence they selected came from informed
estimates of what performance might yet become and from the medical centers’ ideals; the experts considered
the interim data to be merely suggestive.
Program evaluators use a number of methods when they rely on panels of experts to promote
understanding of issues, topics, and evidence for evaluation, but the most productive of these depend on a few
simple practices, as discussed in the following guidelines:

Guidelines for Expert Panels

1. The evaluator should clearly specify the evaluation questions. If the questions are not clearly specified, the
experts may help in clarification and in specification. Here are examples:

Not quite ready for evidence setting: Was the program effective with high-risk women?

More amenable to evidence setting: Did the program reduce the proportion of low-weight births among
low-income women?

Evidence: Significantly fewer low-weight births are found in the experimental versus the control group.

2. The evaluator should provide data to assist the experts. These data can be about the participants in the
experimental program, the intervention itself, and the costs and benefits of participation. The data can
come from published literature, from ongoing research, or from financial and statistical records. For
example, in an evaluation of a program to improve birth weight among infants born to low-income
women, experts might make use of information about the extent of the problem in the country. They
might also want to know how prevalent low-weight births are among poor women and, if other
interventions have been used effectively, what were their costs.

3. The evaluator should select experts based on their knowledge, their influence, or how they will use the findings.
The number of experts an evaluator chooses is necessarily dependent on the evaluation’s resources and the
evaluator’s skill in coordinating groups. (See Example 2.10 for two illustrations concerning the choice of
experts.)

4. The evaluator should ensure that the panel process is carefully structured and skillfully led. A major purpose of
the expert panel is to come to agreement on the criteria for appraising a program’s performance. To
facilitate agreement, and to distinguish the panel process from an open-ended committee meeting, the
evaluator should prepare an agenda for the panel in advance, along with the other materials noted above
(such as literature reviews and other presentations of data). When possible, the evaluator should try to
focus the panel on particular tasks, such as reviewing a specific set of data and rating the extent to which
those data apply to the current program. For example, the evaluator might give the experts data on past
performance (e.g., 10 of 16 hospitals had continuous quality improvement systems for monitoring the
quality of inpatient care) and then ask them to rate the extent to which that evidence should still apply
(e.g., on a 5-point scale ranging from strongly agree to strongly disagree).
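
One way to summarize ratings such as those described in guideline 4 is sketched below. The ratings and the agreement rule (at least 80% of panelists choosing 4 or 5 on the 5-point scale) are hypothetical; real panels use many different decision rules.

```python
# Hypothetical sketch: summarizing expert panel ratings collected on a
# 5-point scale (1 = strongly disagree ... 5 = strongly agree).
from statistics import median

ratings = {
    "Evidence 1: 10 of 16 hospitals have quality monitoring systems": [5, 4, 4, 5, 3, 4, 5, 4],
    "Evidence 2: All sites report quarterly access data": [2, 3, 4, 2, 3, 3, 2, 4],
}

for evidence, scores in ratings.items():
    agree = sum(1 for s in scores if s >= 4) / len(scores)
    status = "retain" if agree >= 0.80 else "discuss further"
    print(f"{evidence}: median = {median(scores)}, "
          f"{agree:.0%} rated 4 or 5 -> {status}")
```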

Example 2.10 Choosing Experts to Guide the Choice of Evaluation Evidence

• The New Dental Clinic wants to improve patient satisfaction. A meeting was held at which three patient representatives, including a nurse, a physician, and a technician, defined the “satisfied patient” and decided how much time to allow the clinic to produce satisfied patients.
• The primary goals of the Adolescent Outreach Program are to teach teens about preventive health care and
to make sure that all needed health care services (such as vision screening and immunizations) are
provided. A group of teens participated in a teleconference to help the program developers and evaluators
decide on the best ways to teach teens and to set evidence of learning achievement. Also, physicians,
nurses, teachers, and parents participated in a conference to determine the types of services that should be
provided and how many teens should receive them.

Evidence From Existing Data and Large Databases (“Big Data”)

Large databases, such as those maintained by the U.S. Centers for Disease Control and Prevention (CDC) and by other nations’ government agencies and registries, come from surveys of whole populations and contain information on individual and collective health, education, and social functioning.
The rules and regulations for gaining access to these databases vary. Some evaluators also compile their own
data sets and may make them available to other investigators.
The information in large databases (and their summaries and reports) can provide benchmarks against
which evaluators measure the effectiveness of new programs. For instance, an evaluator of a hypothetical
drivers’ education program might say something like this: “I used the county’s Surveillance Data Set to find
out about the use of seat belts. The results show that in this county about 5 out of 10 drivers between the ages
of 18 and 21 years do not use seat belts. An effective driver education program should be able to reduce that number to 2 drivers out of 10 within 5 years.”
Suppose you were asked to evaluate a new program to prevent low-weight births in your county or region.
If you know the current percentage of low-weight births in the county, then you can use that figure as a
benchmark for evaluating the effectiveness of a new program that aims to lower the rate. Example 2.11
illustrates how evaluators use existing data as evidence of effectiveness.

Example 2.11 Using Existing Data as Evidence

• The Obstetrical Access and Utilization Initiative serves high-risk women and aims to reduce the numbers
of births of babies weighing less than 2,500 grams (5.5 pounds). One evaluation question asks, “Is the birth
of low-weight babies prevented?” In the state, 6.1% of babies are low birth weight, but this percentage
includes babies born to women who are considered to be at low or medium risk. The evidence used to show that low-weight births are prevented is as follows: “No more than 6.1% of babies will be born
weighing less than 5.5 pounds.”
• The city’s governing council decides that the schools should become partners with the community’s health
care clinics in developing and evaluating a program to reduce motor vehicle crashes among children and
young adults between the ages of 10 and 24 years. According to the Centers for Disease Control’s findings
from the Youth Risk Behavior Surveillance System (accessed through the CDC website at
http://www.cdc.gov), the leading cause of death (31% of all deaths) among youth of this age is motor
vehicle accidents. Council members, community clinic representatives, teachers and administrators from
the schools, and young people meet to discuss evidence for program effectiveness. They agree that they
would like to see a statistically and clinically meaningful reduction in deaths due to motor vehicle crashes
over the program’s 5-year trial period. They use the 31% figure as the baseline against which to evaluate
any reduction.

When you use data from large databases as a benchmark to evaluate a local program’s effectiveness, you
must make certain that the data are applicable to the local setting. The only data available to you may have
been collected a long time ago or under circumstances that are very different from those surrounding the
program you are evaluating, and so they may not apply. For example, data collected from an evaluation
conducted with men may not apply to women, and data on older men may not apply to younger men. Data
from an evaluation conducted with hospitalized patients may not apply to people in the community.
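
A hypothetical sketch of how a benchmark from an existing database might be used appears below: a program’s observed rate of low-weight births is compared with the statewide figure of 6.1% cited in Example 2.11. The counts are invented, and scipy’s binomtest is only one of several reasonable choices for this comparison.

```python
# Hypothetical sketch: comparing a program's observed rate of low-weight
# births against a 6.1% benchmark taken from an existing database.
from scipy.stats import binomtest

benchmark_rate = 0.061
low_weight_births = 21      # invented program data
total_births = 500

observed_rate = low_weight_births / total_births
result = binomtest(low_weight_births, total_births, p=benchmark_rate,
                   alternative="less")   # is the program's rate below the benchmark?

print(f"Observed rate: {observed_rate:.1%} versus benchmark {benchmark_rate:.1%}")
print(f"One-sided p-value: {result.pvalue:.3f}")
```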

Evidence From the Research Literature

The research literature consists of all peer-reviewed evaluation reports. Most, but not all, of these reports are
either published or expected to be published. Evaluators should use only the most scientifically rigorous
evaluations as the basis for applying evidence of effectiveness from one program to another. They must be
careful to check that the evaluated program included participants and settings that are similar to the new
program. Example 2.12 illustrates how evaluators can use the literature in setting evidence in program
evaluations.

Example 2.12 Using the Literature to Find and Justify Effectiveness Evaluation Evidence

The Community Cancer Center has inaugurated a new program to help families deal with the depressive
symptoms that often accompany a diagnosis of cancer in a loved one. A main program objective is to reduce
symptoms of depression among participating family members.

The evaluators want to convene a group of potential program participants to assist in developing evidence of
program effectiveness. Specifically, the evaluators want assistance in defining “reduction in symptoms.” They
discover, however, that it is nearly impossible to find a mutually convenient time for a meeting with potential
participants. Also, the Center does not have the funds to sponsor a face-to-face meeting. Because of these
constraints, the evaluators decide against a meeting and instead turn to the literature. They plan to share their
findings with participants.

The evaluators go online to find research articles that describe the effectiveness of programs to reduce
depressive symptoms in cancer patients. Although they find five published articles, only one of the programs
has the same objectives and similar participants as the Community Cancer Center’s program, and it took
place in an academic cancer center. Nevertheless, given the quality of the evaluation and the similarities
between the two programs, the evaluators believe that they can apply this other program’s evidence to the
present program. This is what the evaluators found in the article:

At the 6-month assessment period, family members in the first group had significantly lower self-
reported symptoms of depression on the Depression Scale than did family members in the second group
(8.9 versus 15.5). The mean difference between groups adjusted for baseline scores was –7.0 (95%
confidence interval, –10.8 to –3.2), an effect size of 1.08 standard deviations. These results suggest that
86% of those who underwent the program reported lower depressive symptom scores at 6 months than would
have been expected if they had not undergone the program.

The evaluators decide to use the same measure of depressive symptoms (the Depression Scale) as did the
evaluator of the published study and to use the same statistical test to determine the significance of the
results.

When adopting evidence from the literature, you must compare the characteristics of the program you are
evaluating and the program or programs whose evaluation evidence you plan to adopt. You need to make
certain that the participants, settings, interventions, and primary outcomes are similar, if not identical. Then,
when you conduct your evaluation, you must choose the same measures or instruments and statistical
methods to interpret the findings.
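
If the evaluators in Example 2.12 wanted to use the published effect size of 1.08 standard deviations to plan their own study, a rough sample-size calculation such as the following could help. The formula is the standard normal-approximation for comparing two group means; the alpha and power values are conventional assumptions rather than requirements.

```python
# Rough sketch: how many participants per group are needed to detect the
# effect size reported in the literature (1.08 SD), using the standard
# normal-approximation formula for a two-group comparison of means.
import math
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

print(n_per_group(1.08))   # a large published effect -> a small required sample
print(n_per_group(0.50))   # a medium effect would require a larger sample
```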

When to Decide on Evidence

The evaluation should have the evidence of effectiveness in place before continuing with design and analysis.
Consider the following two examples.

Example 1

Program goal: To teach nurses to abstract medical records reliably

Evaluation question: Have nurses learned to abstract medical records reliably?

Evidence: 90% of all nurses learn to abstract medical records reliably

Program effects on: Nurses

Effects measured by: Reliable abstraction

Design: A survey of nurses’ abstractions

Data collection: A test of nurses’ ability to abstract medical records

Statistical analysis: Computation of the percentage of nurses who abstract medical records reliably

Example 2

Program goal: To teach nurses to abstract medical records reliably

Evaluation question: Have nurses learned to abstract medical records reliably?

Evidence: A statistically significant difference in learning is observed between nurses at Medical Center A
and nurses at Medical Center B. Nurses at Medical Center A participated in a new program, and the
difference is in their favor

Program effects on: Nurses at Medical Center A

Effects measured by: Reliable abstraction

Design: A comparison of two groups of nurses

Data collection: A test of nurses’ ability to abstract medical records

Statistical analysis: A t-test to compare average abstraction scores between nurses at Medical Center A and
nurses at Medical Center B

The evaluation questions and evidence contain within them the independent and dependent variables on
which the evaluation’s design, measurement, and analysis are subsequently based. Independent variables are
sometimes called explanatory or predictor variables because they are present before the start of the program
(that is, they are independent of it). Evaluators use independent variables to explain or predict outcomes. In
the example above, reliable abstraction of medical records (the outcome) is to be explained by nurses’
participation in a new program (the independent variable). In evaluations, the independent variables often are
the program (experimental and control), demographic features of the participants (such as sex, income,
education, experience), and other characteristics of the participants that might affect outcomes (such as
physical, mental, and social health; knowledge).
Dependent variables, also termed outcome variables, are the factors the evaluator expects to measure as a
consequence of being in the program or control group. In program evaluations, these variables include health
status, functional status, knowledge, skills, attitudes, behaviors, costs, and efficiency.
The evaluation questions and evidence necessarily contain the independent and dependent variables:
specifically, the people or groups on whom the program is designed to have effects and the measures of those effects, as
illustrated in Example 2.13.

Example 2.13 Questions, Evidence, and Independent and Dependent Variables

Program goal: To teach nurses to abstract medical records reliably

Evaluation question: Have nurses learned to abstract medical records reliably?

Evidence: A statistically significant difference in learning is observed between nurses at Medical Center A and
nurses at Medical Center B. Nurses at Medical Center A participated in a new program, and the difference is
in their favor

Independent variable: Participation versus no participation in a new program

Dependent variable or what will be measured at end of program participation: Reliable abstraction

Program Evaluation and Economics

In the past, evaluators typically regarded the costs of programs as almost irrelevant because the focus was on investigating effectiveness. In recent years, it has become increasingly important to demonstrate that not only is
a program effective, but it is worth financial investment. For example, nearly 9,000 more cost-effectiveness
articles were published from 2002 to 2012 than from 1992 to 2002.

Four common types of cost analyses:

• Cost-effectiveness evaluation: Program A is effective and is currently the lowest-cost program.


• Cost-benefit analysis: Program A has merit if its benefits are equal to or exceed its costs; the benefit-to-
cost ratio of Program A is equal to or greater than 1.0 and exceeds the benefit-to-cost ratio of Program
B.
• Cost minimization analysis: Programs A and B have identical benefits, but Program A has lower costs.

• Cost utility analysis: Program A produces N (the evaluation figures out exactly how many) quality-
adjusted life years at lower cost than Program B. The quality-adjusted life year (QALY) is a measure of
disease burden, including both the quality and the quantity of life lived. It is used in assessing the value
for money of a medical intervention. The QALY is based on the number of years of life that would be
added by the intervention.

Example 2.14 illustrates the uses of economic evidence in evaluations.

Example 2.14 Evidence Used in Economic Evaluations

RISK-FREE is a new anti-smoking program. The evaluators have three study aims and associated
hypotheses. Two of the study aims (Aims 2 and 3) pertain to an economic evaluation.

• Aim 1: To evaluate the comparative effectiveness of an App to prevent smoking in young adults relative
to the current program, which consists of lectures, videos, and discussion
Hypothesis 1: When baseline levels are controlled for, the experimental students (using the App) will
have a significantly lower probability of starting to smoke over a 12-month period than students in the
current program.
Hypothesis 2: When baseline levels are controlled for, the experimental students will have significantly
better quality of life over a 12-month period than students in the current program.

As part of the evaluation, the evaluators will also examine proximal outcomes (factors that may help or hinder
achievement of the effectiveness end points), testing two additional hypotheses:

Hypothesis 3: When baseline levels are controlled for, the experimental students will demonstrate
significantly greater self-efficacy and knowledge over a 12-month period than students in the current
program.
Hypothesis 4: When baseline levels are controlled for, the experimental teachers will demonstrate
significantly greater knowledge and more positive attitudes than teachers in the current program.
• Aim 2: To evaluate the comparative costs of the App relative to the current program
Hypothesis 5: When baseline levels are controlled for, the experimental students will have significantly
lower need for counseling and net (intervention + nonintervention) costs over a 12-month period than
students in the current program.
• Aim 3: To evaluate the cost-effectiveness of the App relative to usual care
Hypothesis 6: The experimental intervention will be cost-effective relative to the current program, based
on generally accepted threshold values for incremental cost-effectiveness ratios.

If Aim 2 demonstrates that the App is cost saving (lower costs with equal outcomes) or cost neutral (equal costs with better outcomes), then the App is cost-effective by definition, and the Aim 3 analyses are unnecessary.
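
The Aim 3 hypothesis refers to an incremental cost-effectiveness ratio. A bare-bones sketch of that calculation follows, with invented cost and QALY figures and an illustrative willingness-to-pay threshold; real economic evaluations involve discounting, uncertainty analysis, and many other refinements.

```python
# Bare-bones sketch of an incremental cost-effectiveness ratio (ICER)
# using invented costs and quality-adjusted life years (QALYs).
app_cost, app_qalys = 120_000.0, 46.0          # experimental program (the App)
usual_cost, usual_qalys = 90_000.0, 44.5       # current program

incremental_cost = app_cost - usual_cost
incremental_qalys = app_qalys - usual_qalys
icer = incremental_cost / incremental_qalys    # cost per additional QALY

threshold = 50_000.0                           # illustrative willingness-to-pay per QALY
verdict = "cost-effective" if icer <= threshold else "not cost-effective"
print(f"ICER: ${icer:,.0f} per QALY gained ({verdict} at a ${threshold:,.0f} threshold)")
```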

The QEV Report: Questions, Evidence, Variables

The relationships among evaluation questions, evidence, and variables can be depicted in a reporting form like
the one shown in Figure 2.2. As the figure shows, the evaluation questions appear in the first column of the
QEV (questions, evidence, variables) report form, followed in subsequent columns by the evidence associated
with each question, the independent variables, and the dependent variables.
The QEV report in Figure 2.2 shows information on an evaluation of an 18-month program combining
diet and exercise to improve health status and quality of life for persons 75 years of age or older who are living
at home. Participants will be randomly assigned to the experimental or control groups according to the streets
where they live. Participants in the evaluation who need medical services can choose one of two clinics
offering differing models of care delivery, one that is primarily staffed by physicians and one that is primarily
staffed by nurses. The evaluators will be investigating whether any differences exist between male and female
participants after program participation and the role of patient mix in those differences. (Patient mix refers to
those characteristics of patients that might affect outcomes; these include demographic characteristics,
functional status scores, and presence of chronic disorders, such as diabetes and hypertension.) The evaluators
will also be analyzing the cost-effectiveness of the two models of health care delivery.

Figure 2.2 The QEV Reporting Form

This evaluation has three questions: one about the program’s influence on quality of life, one about the
program’s influence on health status, and one about the cost-effectiveness of two methods for staffing clinics.
Each of the three questions has one or more evidence statements associated with it. The independent variables for the
questions about quality of life and health status are gender, group participation, and patient mix, and each of
these terms is explained. The dependent variables are also explained in the QEV report. For example, the
report notes that “quality of life” includes social contacts and support, financial support, and perceptions of
well-being.
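
Evaluators who prefer to keep QEV information in an analyzable form rather than on a paper form could store each row as a simple record, as in this sketch. The field values paraphrase the quality-of-life question described above; nothing here is prescribed by the QEV format itself.

```python
# Sketch: one row of a QEV (questions, evidence, variables) report stored
# as a plain Python record so it can be printed, filtered, or exported.
qev_rows = [
    {
        "question": "Does the program improve quality of life?",
        "evidence": "Statistically and clinically meaningful improvement "
                    "in quality-of-life scores over 18 months",
        "independent_variables": ["group (experimental vs. control)",
                                  "gender", "patient mix"],
        "dependent_variables": ["quality of life (social contacts and support, "
                                "financial support, perceived well-being)"],
    },
]

for row in qev_rows:
    print(row["question"])
    print("  Evidence:", row["evidence"])
    print("  IVs:", ", ".join(row["independent_variables"]))
    print("  DVs:", ", ".join(row["dependent_variables"]))
```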

Summary and Transition to the Next Chapter on Designing Program Evaluations

A program evaluation is conducted to determine whether a given program has merit. Is it worth the costs, or
will a more efficient program accomplish even more? Evaluation questions are the evaluation’s centerpiece;
they ask about the extent to which program goals and objectives have been met and the degree, duration,
costs, and distribution of benefits and harms. The evaluation questions can also ask about the program’s social
and political environment and the implementation and effectiveness of different program activities and
management strategies.
Program evaluations are designed to provide convincing evidence of effectiveness, quality, and value. The
evaluator must decide on the questions or hypotheses and evidence in advance of any evaluation activities
because both together prescribe the evaluation’s design, data collection, and analysis. Evidence of effectiveness
comes from comparing programs, the opinions of experts, and reviews of the literature, past performance, and
existing, usually large, databases.
The next chapter tells you how to design an evaluation so that you will be able to link any changes found in
knowledge, attitudes, and behaviors to a new or experimental program and not to other competing events.
For example, suppose you are evaluating a school campaign that aims to encourage high school students to
drink water instead of energy drinks. You might erroneously conclude that your program is effective if you
observe a significant increase in water-drinking among program participants unless your evaluation’s design is
sufficiently sophisticated to distinguish between the effects of the program and those of other sources of
education, such as social media, the Internet, and television. The next chapter discusses the most commonly
used evaluation research designs.

Exercises

Directions

Read the evaluation descriptions below, and, using the information offered, list the evaluation questions,
associated evidence of program merit, and the independent and dependent variables.

1. Gambling and College Students


College students experience high rates of problem and pathological gambling, yet little research has
investigated methods for reducing gambling in this population. This study sought to examine the
effectiveness of brief intervention strategies. Seventeen college students were assigned randomly to an
assessment-only control, 10 minutes of brief advice, one session of motivational enhancement therapy
(MET), or one session of MET plus three sessions of cognitive–behavioral therapy (CBT). The three
interventions were designed to reduce gambling. Gambling was assessed at baseline, after 6 weeks, and at the
9th month using the Addiction Severity Index–gambling (ASI-G) module, which also assesses days and
dollars wagered.

2. Drug Education and Elementary School


The evaluators conducted a short-term evaluation of the revised D.A.R.E. (Drug Abuse Resistance
Education) curriculum. They examined D.A.R.E.’s effects on three substances, namely students’ lifetime and
30-day use of tobacco, alcohol, and marijuana, as well as their school attendance. The study comprised
students in 17 urban schools, each of which served as its own control; 5th graders in the 2006–2007 school
year constituted the comparison group (n = 1490), and those enrolled as 5th graders in the 2007–2008 school
year constituted the intervention group (n = 1450). The evaluators found no intervention effect on students’ substance use for any of the substance use outcomes assessed. They did find that students were more likely to
attend school on days they received D.A.R.E. lessons and that students in the intervention group were more
likely to have been suspended. Study findings provide little support for the implementation and dissemination
of the revised D.A.R.E. curriculum.

References and Suggested Readings

Institute of Medicine. (2000). Crossing the quality chasm. Washington, DC: National Academy Press.
Katz, C., Bolton, S. L., Katz, L. Y., Isaak, C., Tilston-Jones, T., & Sareen, J. A. (2013). Systematic review of
school-based suicide prevention programs: Depression and anxiety. ADAA Journal. Malden, MA: Wiley.
doi: 10.1002/da.22114
Moodie, M. L., Herbert, J. K., de Silva-Sanigorski, A. M., Mavoa, H. M., Keating, C. L., Carter, R. C.,…
Swinburn, B. A. (2013). The cost-effectiveness of a successful community-based obesity prevention
program: The Be Active Eat Well program. Obesity. Silver Spring, MD. Retrieved from
http://onlinelibrary.wiley.com.ezproxy.auckland.ac.nz/doi/10.1002/oby.20472/pdf
Simon, E., Dirksen, C. D., & Bogels, S. M. (2013). An explorative cost-effectiveness analysis of school-
based screening for child anxiety using a decision analytic model. European Child & Adolescent Psychiatry.
Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/23539355
Smit, E. S., Evers, S. M., de Vries, H., & Hoving, C. (2013). Cost-effectiveness and cost-utility of Internet-
based computer tailoring for smoking cessation. Journal of Medical Internet Research, 15(3), 57.

Purpose of This Chapter

An evaluation’s design is its unique research structure. The structure has five components:

1. Evaluation questions and hypotheses

2. Evidence of program merit: effectiveness, quality, value

3. Criteria for deciding on who is eligible to participate in the evaluation

4. Rules for assigning participants to programs

5. Rules for the timing and frequency of measurement

Each of these components is discussed in detail in this chapter. The chapter also explains the uses
and limitations of experimental (randomized controlled trials and nonrandomized controlled trials
or quasi-experiments) and observational (cohorts, case control, cross-sectional) evaluation designs.
In addition, the chapter explains how to minimize research design bias through blocking,
stratification, and blinding.

The chapter also discusses and explains comparative effectiveness research (CER). CER is
characterized by evaluating programs in their natural rather than experimental settings.
Comparative effectiveness evaluations aim to provide data that individuals, communities, and
society can use to make informed decisions about alternative programs.

In the real world, evaluators almost always have to deal with practical and methodological
challenges that prevent them from conducting the perfect study. These challenges, if not confronted
head-on, may “threaten” or bias the validity of the evaluation’s findings. This chapter discusses the
most typical threats to an evaluation design’s internal and external validity and explains how to
guard against these threats using statistical and other more field-friendly techniques.

3
Designing Program Evaluations

A Reader’s Guide to Chapter 3

Evaluation Design: Creating the Structure

How the design’s structure is composed of the evaluation questions, hypotheses, and evidence of
program merit. When designing their evaluations, evaluators decide on the characteristics of eligible
participants, how they will be assigned to programs, and how often and when data will be collected
to measure program effects.

Designs for Program Evaluation

Experimental evaluation designs (randomized control trial with parallel groups and with a wait list);
factorial designs; observational evaluation designs (including cross-sectional surveys, cohorts, and
case controls); time series designs (pretest-posttest); comparative effectiveness evaluations; and
blocking, stratifying, and blinding procedures.

Internal and External Validity

How experimental and observational studies control for bias so that findings are valid.

The Evaluation Design Report: Questions, Evidence of Merit, and Independent Variables

Summary and Transition to the Next Chapter on Sampling

Exercises

References and Suggested Readings

Evaluation Design: Creating the Structure

An evaluation design is a structure that is created especially to produce unbiased information on a program’s
effectiveness, quality, and value. Biases in evaluations lead to systematic errors in the findings. They usually
come about for a variety of reasons: preexisting differences in the participants who are compared in experimental evaluations; participants’ exposure to factors apart from the program (such as maturing physically or emotionally); withdrawals or exclusions of people entered into the evaluation; and how outcomes are
measured, interpreted, and reported.
An evaluation’s design always has five structural components.

1. Evaluation questions and hypotheses;

2. Evidence of program merit: effectiveness, quality, value;

3. Criteria for deciding on who is eligible to participate in the evaluation;

4. Rules for assigning participants to programs; and

5. Rules for the timing and frequency of measurement.

Example 3.1 describes three basic evaluation designs. As you can see, the designs are applied to an
evaluation of the same program and have the same structural components. However, there are differences
between the designs. The main difference between Design 1 and Design 2 is in the way that the schools are
assigned: at random or not at random. Random assignment means that all experimental units (schools in this
case) have an equal chance of being in the experimental or control program. The main difference between
both Designs 1 and 2 and Design 3 is that in Design 3, the evaluator uses data from an already existing
program database: no new data are collected and assignment is not an issue.

Example 3.1 Three Basic Evaluation Designs for One Program: Spanish-Language Health Education for 5th Graders

Program: A new Spanish-language health education program for 5th graders is being evaluated in ten of the district’s elementary schools. If effective, it will be introduced throughout the district. Five of the schools will continue with their regular English-language health education program, and the other five will participate in the new program. A main program objective is to improve students’ knowledge of nutrition.
The program developers anticipate that the new program will be more effective in improving this knowledge.
The evaluators will use statistical techniques to test the hypothesis that no differences in knowledge will be
found. They hope to prove that assumption to be false. Students will be given a knowledge test within 1
month of the beginning of the program and after 1 year to determine how much they have learned.

1. Experimental Evaluation Design With Random Assignment

Evaluation question: Does the program achieve its objective?

Hypothesis: No differences will be found in knowledge between students in the new and standard program
over a one-year period

Evidence: Statistically significant improvement over time favoring the new program (rejecting the
hypothesis that no differences exist)

Eligibility: Students who read at the 5th-grade level or better in Spanish

Assignment to programs: Using a computer-generated table of random numbers, five schools are
randomly assigned to the new program and five to the standard program

Measurement frequency and timing: A test of students’ knowledge within 1 month of the beginning of the
program, and a test of students’ knowledge 1 year after completion of the program

2. Experimental Evaluation Design Without Random Assignment

Evaluation question: Does the program achieve its objective?

Hypothesis: No differences will be found in knowledge between students in the new and standard program

Evidence: Statistically significant improvement over time favoring the new program (rejecting the
hypothesis that no differences exist) over a one-year period

Eligibility: Students who read at the 5th-grade level or better in Spanish

Assignment to programs: Evaluators create 5 pairs of schools with schools in each pair matched so that they
have similar student demographics and resources. Using a computer-generated table of random numbers,
one of each pair of schools is assigned to the new or standard program resulting in 5 schools in the
experimental group and 5 in the control group

Measurement frequency and timing: A test of students’ knowledge within 1 month of the beginning of the
program, and a test of students’ knowledge 1 year after completion of the program

3. Observational Evaluation Design

Evaluation question: Does the program achieve its objective?

Hypothesis: No differences will be found in knowledge between students in the new and standard program

Evidence: Statistically significant improvement over time favoring the new program (rejecting the
hypothesis that no differences exist) over a one-year period

Eligibility: Students who read at the 5th-grade level or better in Spanish

Assignment to programs: The evaluation occurs after the program has already been completed, and all data
have been entered into the program’s database. The evaluators develop an algorithm to identify which
students were in the experimental and which were in the comparison group. They also use the algorithm
to identify the students who provided complete knowledge test data. Not all students will have completed
all tests; some may have completed only parts of each test

Measurement frequency and timing: The evaluators were not consulted on measurement policy. With luck,
the program team administered knowledge tests at regular intervals and large numbers of students
completed all tests with few unanswered questions

Example 3.1 illustrates three basic program evaluation designs: (1) experimental evaluation designs with
random assignment into programs (the randomized controlled trial or RCT), (2) experimental evaluation
designs without random assignment into experimental and comparison groups, and (3) observational
evaluation designs that use data that already exist.
Random assignment is a method that relies on chance to assign participants to programs. Suppose the
evaluation is comparing how well people in Program A learn when compared to people in Program B. With
random assignment, every eligible person has an equal chance of ending up in one of the two programs. The
assignment takes place by using a random numbers table or a computer-generated random sequence.
A nonrandomized or quasi-experimental evaluation design is one in which experimental and comparison
groups are created without random assignment. In the second illustration in Example 3.1, the schools are
matched and then randomly assigned. Matching is just one way of creating groups for nonrandomized trials.
Other options include allocating participants to programs by their date of birth, or medical or school record
number, or assigning every other person on a list of eligible participants to the new or comparison program.
Nonrandom assignment does not ensure that participants have an equal chance of receiving the experimental
or comparison programs.
Randomized and nonrandomized controlled trials are frequently contrasted with observational evaluation
designs. Observational or descriptive designs are different from controlled trials because the evaluation is
conducted using an existing database: no new data are collected. In the third illustration in Example 3.1, the
evaluators do their analysis after all program data have been collected; they have no say in which information
is collected, from whom, or how often the information is collected.

Experimental Designs
The Randomized Controlled Trial or RCT

An RCT is an experimental evaluation design in which eligible individuals (doctors, lawyers, students) or
groups (schools, hospitals, communities) are assigned at random to one of several programs or interventions.
The RCT is considered by many to be the gold standard of designs because when implemented properly, it
can be counted on to rule out inherent participant characteristics that may affect the program’s outcomes. Put
another way, if participants are assigned to experimental and control programs randomly, then the two groups
will probably be alike in all important ways before they participate. If they are different afterwards, the
difference can be reasonably linked to the program. If the evaluation design is robust, it may be possible to say
that the program caused the outcome.
Suppose the evaluators of a health literacy program in the workplace hope to improve employees’ writing
skills. They recruit volunteers to participate in a six-week writing program and compare their writing skills to
those of other workers who are, on average, the same age and have similar educational backgrounds and
writing skills. Also, suppose that after the volunteers complete the six-week program, the evaluators compare
the writing of the two groups and find that the experimental group performs much better. Can the evaluators
claim that the literacy program is effective? Possibly. But, the nature of the design is such that you cannot
really tell if some other factors that the evaluators did not measure are the ones that are responsible for the
apparent program success. The volunteers may have done better because they were more motivated to achieve
(that is why they volunteered), have more home-based social support, and so on.
A better way to evaluate the workplace literacy program is to (1) randomly assign all eligible workers (e.g.,
those who score below a certain level on a writing test) to the experimental program or to a comparable
control program, and (2) then compare changes in writing skills over time. With random assignment, all the
important factors (such as motivation and home support) are likely to be equally distributed between the two
groups. Then, if writing test scores are significantly different after the evaluation is concluded, and the scores
favor the experimental group, the evaluators will be on firmer ground in concluding that the program is
effective (Example 3.2).

Example 3.2 An Effective Literacy Program: Hypothetical Example

Conclusion: Experimental program effectively improved writing skills when compared to a comparable
program.

In sum, RCTs are quantitative controlled experiments in which evaluators compare two or more programs, interventions, or practices among eligible individuals or groups who are randomly assigned to receive them.
Two commonly used randomized control designs are:

1. parallel controls in which two (or more) groups are randomly constituted, and they are studied at the
same time (parallel to one another); and

2. wait-list controls where one group receives the program first and others are put on a waiting list; then if
the program appears to be effective, participants on the waiting list receive it. Participants are randomly
assigned to the experimental and wait-list groups.

Parallel Controls. An evaluation design using parallel controls is one where programs are compared to each
other at the same time (in parallel). The design requires three steps:

1. The evaluators assess the eligibility of potential participants.

• Some people are excluded because they do not satisfy the evaluation’s inclusion criteria (must be a
certain age; have a particular medical condition) or they satisfy the exclusion criteria (refuse to give
their e-mail address; do not have reliable internet access).

• Some eligible people decide not to participate. They change their mind, become ill, or are too busy.

2. The eligible and willing participants are enrolled in the evaluation study.

3. These same participants are randomly assigned to the experiment (one or more programs) or to an
alternative (the control, which can be one or more comparison programs).

How does this work in practice? Suppose an evaluation is planned to compare the effectiveness of three
programs for women whose partners are substance abusers. The three interventions are:

1. Computerized partner substance abuse screening measure plus a list of local resources

2. A list of local resources only

3. Usual care

Evidence of effectiveness is a significant difference in quality of life among the three groups over a one-year
period. The evaluators anticipate that the difference will be in favor of the new and most intensive
intervention: computerized partner substance abuse screen and resource list. Women are eligible for the
evaluation if they are at least 18 years of age and have easy access to a telephone. Women are excluded from
participation if they are accompanied by their partner during the selection process and cannot be safely
separated at the enrollment site. Figure 3.1 shows how the design for this study played out over the course of
the evaluation.
As you can see, of the 3537 women who were contacted originally, 2708 or 76% were eligible for
randomization. About 87% of the 2708 completed the one-year follow-up.

Wait-List Controls. With a wait-list control design, both groups are measured for eligibility, but one is
randomly assigned to be given the program now—the experimental group—and the other—the control—is
put on a waiting list. After the experimental group completes the program, both groups are measured a
second time. Then the control receives the program and both groups are measured again (Figure 3.2).

Here is how the wait-list design is used:

1. Compare Group 1 (experimental group) and Group 2 (control group) at baseline (the pretest). If
random assignment has “worked,” the two groups should not differ from one another.

2. Give Group 1—the experimental group—the program.

3. Assess the outcomes for Groups 1 and 2 at the end of the program. If the program is “working,” expect
to see a difference in outcomes favoring the experimental group.

4. Give the program to Group 2.

5. Assess the outcomes a second time. If the program is “working,” Group 2 should catch up to Group 1,
and both should have improved in their outcomes (Figure 3.3).

Wait-list control designs are practical when programs are repeated at regular intervals, as they are in
schools with a semester system. Students, for example, can be randomly assigned to Group 1 or Group 2,
with Group 1 participating in the first semester. Group 2 can then participate in the second semester.
Wait-list control designs are most likely to yield unbiased results if the evaluator waits until the
experimental group’s progress ceases before introducing the program to the control. This wait is called a
“wash-out” period. Once the experimental group has achieved as much as it is going to, then it is fair to
implement the program in the control. When the control has completed the program, another wash-out
period may be required so that the two groups are compared at their “true” maximum achievement points.
Unfortunately, the amount of time needed for effects to wash out in either the experimental or control group
is not usually known in advance.

Figure 3.1 Randomized Control Trial Flow

Figure 3.2 Wait-List Control Group Evaluation

Factorial Designs

Factorial designs enable evaluators to measure the effects of varying features of a program or practice to see
which combination works best. In the example that follows, the evaluators are concerned with finding out if the response rate to web-based surveys can be improved by notifying prospective responders in advance by e-mail and/or pleading with them to respond. The evaluators design a study to solve the response-rate problem using a two-by-two (2 X 2) factorial design in which participants either are notified about the survey in advance by e-mail or are not prenotified, and they either receive a pleading or a nonpleading request to respond. The factors (they are also
independent variables) are: Pleading (Factor 1) and Notifying (Factor 2). Each factor has two “levels”: Plead
versus Don’t Plead and Notify in Advance versus Don’t Notify in Advance.

Figure 3.3 Wait-List Control Evaluation Design

In a two-by-two or 2 X (times) 2 factorial design, there are four study groups: (1) prenotification e-mail
and pleading invitation e-mail; (2) prenotification e-mail and nonpleading invitation; (3) no prenotification
e-mail and pleading invitation; (4) no prenotification and nonpleading invitation. In the diagram above, the
empty cells are placeholders for the number of people in each category (such as the number of people in the
groups under the categories “Plead” and “Notify in Advance” compared to the number of people under
“Plead” and “Don’t Notify in Advance”).
With this design, evaluators can study main effects (plead versus don’t plead) or interactive effects
(prenotification and pleading). The outcome in this study is the response rate. The study is a randomized controlled trial if participants are assigned to the four groups at random. In a randomized controlled trial, the unit of random assignment can be individuals or groups (clusters) of individuals.
Factorial designs can include many factors and many levels. It is the number of levels that describes the
type of the design. For instance, in a study of psychotherapy versus behavior modification in outpatient,
inpatient, and day treatment settings, there are two factors (treatment and setting), with one factor having
two levels (psychotherapy versus behavior modification) and one factor having three levels (inpatient, day
treatment, and outpatient). This design is a 2 X 3 factorial design.
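
A small sketch of how the four cells of the two-by-two design described above might be filled by random assignment, with the cell sizes tabulated afterward, is shown below. The participant list is invented, and dealing a shuffled list into the cells is only one of several reasonable allocation schemes.

```python
# Sketch: randomly assigning participants to the four cells of a 2 x 2
# factorial design (notify in advance: yes/no; pleading invitation: yes/no).
import itertools
import random

random.seed(7)
participants = [f"person_{i}" for i in range(1, 201)]   # invented list
random.shuffle(participants)

cells = list(itertools.product(["notify in advance", "don't notify"],
                               ["plead", "don't plead"]))

# Deal the shuffled list into the four cells so group sizes stay balanced.
assignment = {person: cells[i % 4] for i, person in enumerate(participants)}

for cell in cells:
    size = sum(1 for c in assignment.values() if c == cell)
    print(f"{cell[0]} / {cell[1]}: {size} participants")
```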

Randomizing and Blinding

Random Assignment. Randomization is considered to be the primary method of ensuring that participant
groups are alike at baseline, that is, before they participate in a program. The idea behind randomization is
that if chance—which is what random means—dictates the allocation to programs, all important factors will
be equally distributed between and among experimental and control groups. No single factor will dominate
any of the groups, possibly influencing program outcomes. That is, to begin with each group will be as smart,
as motivated, as knowledgeable, and as self-efficacious as the other. As a result, any differences between or
among groups that are observed later, after program participation, can reasonably be assigned to the program
rather than to the differences that were there at the beginning. In evaluation terms, randomized controlled
trials result in unbiased estimates of a program’s effects.
How does random assignment work? Consider the following commonly used method (Example 3.3).

Example 3.3 Random Assignment

1. An algorithm or set of rules is applied to a table of random numbers, which are usually generated by
computer. For instance, if the evaluation design includes an experimental and a control group and an
equal probability of being assigned to each, then the algorithm could specify using the random number 1
for assignment to the experimental group and 2 for assignment to the control group.

2. As each eligible person enters the study, he or she is assigned one of the numbers (1 or 2).

3. The random assignment procedure should be designed so that members of the evaluation team who have
contact with participants cannot influence the allocation process. For instance, random assignments to
experimental or control groups can be placed in advance in a set of sealed envelopes by someone who will
not be involved in their distribution. Each envelope is numbered so that all can be accounted for by the
end of the evaluation. As a participant comes through the system, his or her name is recorded, the
envelope is opened, and then the assignment (1 or 2) is recorded next to the person’s name.

4. It is crucial in randomized controlled trials that the evaluators prevent interference with randomization.
Who would tamper with assignment? Sometimes, members of the evaluation team may feel pressure to
ensure that the neediest people receive the experimental program. One method of avoiding this is to
ensure that tamper-proof procedures are in place. If the research team uses envelopes, they should ensure
the envelopes are opaque (so no one can see through them) and sealed. In large studies, randomization is
typically done away from the site.

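A minimal code sketch of the procedure in Example 3.3 appears below; it is an illustration only, written in Python, and the participant identifiers are hypothetical. The coding of 1 for the experimental group and 2 for the control group follows the example.

import random

random.seed(20150101)  # fixing the seed makes the allocation list reproducible and auditable

participants = ["P001", "P002", "P003", "P004", "P005", "P006"]  # hypothetical eligible participants

# Generate the concealed allocation list in advance (the electronic equivalent of sealed envelopes):
# 1 = experimental group, 2 = control group, each with equal probability.
allocation = [random.choice([1, 2]) for _ in participants]

# As each eligible person enters the study, record his or her assignment.
for person, group in zip(participants, allocation):
    print(person, "-> group", group)
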
Random Clusters. Some evaluations randomly assign clusters of individuals (such as families or communities)
rather than individuals to the experimental or control groups.
Suppose an evaluation aims to study the effectiveness of a college-based smokeless tobacco cessation
program for college athletes. The evaluators define effectiveness as reported cessation of smokeless tobacco
use in the previous 30 days. Current users of smokeless tobacco (defined as those who use more than once per
month and within the past month) are eligible for the evaluation. The evaluators select 16 colleges with an
average of 23 smokeless tobacco users, and randomly assign 8 colleges to an experimental program and 8 to a
control. Both the experimental and control groups are asked about their smokeless tobacco use before the
evaluation begins and three months after its conclusion. The evaluators then compare the experimental and
control groups to find out if differences existed in tobacco use in the previous 30 days.
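
A brief Python sketch (an illustration only; the college names are hypothetical) shows how the 16 colleges might be randomly allocated, 8 to the experimental program and 8 to the control:

import random

random.seed(2015)
colleges = [f"College_{i:02d}" for i in range(1, 17)]  # the 16 participating colleges

random.shuffle(colleges)
experimental_colleges = colleges[:8]  # first 8 shuffled colleges receive the cessation program
control_colleges = colleges[8:]       # remaining 8 serve as controls

print("Experimental:", sorted(experimental_colleges))
print("Control:", sorted(control_colleges))
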
Please note that in this example, data on the outcome (cessation of smokeless tobacco use in the previous
30 days) were collected from individual students, but randomization was done by college—not by student. Is
this OK? The answer depends on how the evaluation deals with the potential problems caused by
randomizing with one unit (colleges) and analyzing data from another (students).
Here are seven questions to answer when reporting on a cluster randomized trial. The answers to these
questions have implications for analyzing the data and coming to conclusions about whether the evaluation’s
conclusions apply to the cluster or to the individual.
Questions about clusters:

1. Do the evaluation questions refer to the cluster level, the individual participant level, or both?

2. Do the programs pertain to the cluster level, the individual participant level, or both?

3. Do the outcome measures pertain to the cluster level, the participant level, or both?

4. How were people assigned to programs: by cluster or by participant?

5. How were individuals chosen for participation within each cluster? For example, were all classrooms
within the experimental cluster of schools chosen? A sample? Were students sampled in some
classrooms so that within any single classroom, some students were in the experimental group, but
others were in the control?

6. How were participants informed about the evaluation? By cluster? Individually?

7. How do the clusters and individuals in the experimental and control group each compare at baseline?
At follow-up? That is, did people differ in dropout rates by group, cluster, and individual?

Improving on Chance: Stratifying and Blocking. Despite all efforts to do the right thing, chance may dictate
that experimental and control groups differ on important variables at baseline even though they were
randomized.
RCTs can gain power to detect a difference between experimental and control programs (assuming one is
actually present) if special randomization procedures are used to balance the number of participants in each group (blocked randomization) and the distribution of baseline variables that might influence the outcomes
(stratified blocked randomization).
Why are special procedures necessary if random assignment is supposed to take care of the number of
people in each group or the proportion of people in each with certain characteristics? The answer is that, by chance, one group may end up being larger than the other (or drop-out rates may differ), or the groups may end up differing in characteristics such as age and gender. Good news: This happens less frequently in large studies. Bad news:
The problem of unequal distribution of variables becomes even more complicated when clusters of people
(schools or families) rather than individuals are assigned. In this case, the evaluator has little control over the
individuals within each cluster, and the number of clusters (over which he or she does have control) is usually
relatively small (e.g., 5 schools or 10 clinics). Accordingly, some form of constraint like stratification is almost
always recommended in RCTs in which allocation is done by cluster.
Suppose a team of evaluators wants to be certain that the number of participants in each group is balanced.
The team could use blocks of predetermined size. For example, if the block’s size is six, the team will randomly
assign 3 people to one group and 3 to the other group until the block of 6 is completed. This means that in an
evaluation with 30 participants, 15 will be assigned to each group, and in a study of 33, the disproportion can
be no greater than a ratio of 18:15.
Now, suppose the team wants to be certain that important independent or predictor variables are balanced
between the experimental and control group. That is, the team wants to be sure that each group is equally
healthy and motivated to stay in the study. The team can use a technique called stratification. Stratification
means dividing participants into segments, or strata. Participants can be divided into strata defined by age group, gender, or educational level. For instance, in a study of a program to improve knowledge of how
to prevent infection from HIV/AIDS, having access to reliable transportation to attend education classes is a
strong predictor of outcome. When evaluating such a program, it is probably a good idea to have similar
numbers of people who have transportation (determined at baseline) assigned to each group. This can be
done by dividing the study sample at baseline into participants with or without transportation (stratification
by access to transportation), and then carrying out a blocked randomization procedure within each of these two strata.

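To make the procedure concrete, here is an illustrative Python sketch (not from the original text) of stratified blocked randomization with a block size of six; the participant identifiers and strata are invented.

import random

random.seed(7)

def blocked_assignments(n_participants, block_size=6):
    # Return group labels ("E" = experimental, "C" = control) balanced within each block.
    assignments = []
    while len(assignments) < n_participants:
        block = ["E"] * (block_size // 2) + ["C"] * (block_size // 2)
        random.shuffle(block)
        assignments.extend(block)
    return assignments[:n_participants]

# Hypothetical participants, divided at baseline into strata by access to transportation.
strata = {
    "has_transportation": ["P01", "P02", "P03", "P04", "P05", "P06", "P07"],
    "no_transportation": ["P08", "P09", "P10", "P11", "P12", "P13"],
}

# Carry out a separate blocked randomization within each stratum.
for stratum, people in strata.items():
    for person, group in zip(people, blocked_assignments(len(people))):
        print(stratum, person, "->", group)
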
Blinding. In some randomized studies, neither the participants nor the evaluators know which participants are in the experimental or the control groups: This is called a double-blind experiment. When participants do not know, but evaluators do know, which participants are in the experimental or the control groups, this is called a single-blind trial. Participants, people responsible for program implementation or for assessing program outcomes,
and statistical analysts are all candidates for being blinded in a study.
Design experts maintain that blinding is as important as randomization in ensuring unbiased study results.
Randomization, they say, eliminates confounding variables before the program is implemented—at baseline
—but it cannot do away with confounding variables that occur as the study progresses.
A confounding variable is an extraneous variable in a statistical or research model that is associated with both the independent variable and the dependent variable (outcome) and that was not originally considered or controlled for. For example, age,
educational level, and motivation are baseline variables that can confound program outcomes. Confounding
during the course of a study can occur if experimental participants get extra attention, or the control group
catches on to the experiment.
Confounders can lead to a false conclusion that the dependent variables are in a causal relationship with the
independent or predictor variables. For instance, suppose research shows that drinking coffee (independent or
predictor variable) is associated with heart attacks (the dependent variable). One possibility is that drinking
coffee causes heart attack. Another is that having heart attacks causes people to drink more coffee. A third
explanation is that some other confounding factor, like smoking, is responsible for heart attacks and is also
associated with drinking coffee.
RCTs are generally expensive, requiring large teams of skilled evaluators. They tend to be disruptive in
ordinary or real-world settings, and they are designed to answer very specific research questions. However,
when implemented skillfully, they can provide strong evidence that Program A, when compared to Program
B, causes participants with selected characteristics in selected settings to achieve selected benefits that last a
given time.

Nonrandomized Controlled Trials


Parallel Controls

Nonrandomized controlled trials (sometimes called quasi-experiments) are designs in which one group
receives the program and one does not; the assignment of participants to groups is not controlled by the evaluator; and assignment is not random.
Nonrandomized controlled trials rely on participants who (1) volunteer to join the study, (2) are
geographically close to the study site, or (3) conveniently turn up (at a clinic or school) while the study is
being conducted. As a result, people or groups in a nonrandomized trial may self-select, and the evaluation
findings may be biased because they are dependent on participant choice rather than chance.
Nonrandomized trials rely on a variety of methods to ensure that the participating groups are as similar to
one another as possible (equivalent) at baseline or before intervention. Among the strategies used to ensure equivalence is one called matching.
Matching requires selecting pairs of participants who are comparable to one another at the outset on important confounding variables. For example, suppose an evaluator was interested in comparing the effectiveness of two programs to improve patient safety among medical school residents. The evaluator decides to match the two groups and compiles a list of potentially confounding variables (e.g., previous knowledge of patient safety guidelines, motivation), measures participants on these variables, and then assigns them to
programs so that each group is equally knowledgeable and motivated.
Matching is often problematic. For one thing, it is sometimes difficult to identify the variables to match,
and even if you do identify them, you may not have access to reliable measures of these variables (such as a
measure of motivation to learn, or a measure of patient safety) or the time and resources to administer them.
Also, for each variable you do measure, another equally important variable (experience, knowledge, health)
may not be measured. Evaluators who use matching typically use statistical techniques to “correct” for
baseline differences and account for the many variables that matching excludes.
Other methods for allocating participants to study groups in nonrandomized evaluations include assigning
each potential participant a number and using an alternating sequence in which every other individual (1, 3, 5,
etc.) is assigned to the experimental group and the alternate participants (2, 4, 6, etc.) are assigned to the
control. Another option is to assign groups in order of appearance so, for example, patients who attend the
clinic on Monday, Wednesday, and Friday are in the experimental group, and those attending on Tuesday,
Thursday, and Saturday are assigned to the control.
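
As a simple illustration (not drawn from the text’s examples), alternating assignment can be written in a few lines of Python; note that because the sequence is predictable, it is not random assignment.

# Assign every other person on the enrollment list to the experimental group ("E")
# and the alternate participants to the control group ("C"). The IDs are hypothetical.
enrollment_order = ["P01", "P02", "P03", "P04", "P05", "P06"]
assignments = {person: ("E" if i % 2 == 0 else "C") for i, person in enumerate(enrollment_order)}
print(assignments)
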
Strong nonrandomized designs have many desirable features. They can provide information about
programs when it is inappropriate or too late to randomize participants. Also, when compared to RCTs,
nonrandomized trials tend to fit more easily into and accurately reflect how a program is likely to function in
the real world. RCTs require strict control over the environment, and to get that control, the evaluator has to
be extremely stringent with respect to selection and exclusion of study participants. As a result, RCT findings
generally apply to a relatively small population in constrained settings.
Well-designed nonrandomized trials are difficult to plan and implement and require the highest level of
evaluation expertise. Many borrow techniques from RCTs, including blinding. Many others use sophisticated
statistical methods to enhance confidence in the findings.

The Problem of Incomparable Participants: Statistical Methods Like ANCOVA to the Rescue

Randomization is designed to reduce disparities between experimental and control groups by balancing
them with respect to all characteristics (such as participants’ age, sex, or motivation) that might affect a
study’s outcome. With effective randomization, the only difference between study groups is whether they are
assigned to receive an experimental program or not. If discrepancies in outcomes are subsequently found by
statistical comparisons (for example, the experimental group improves significantly), they can be attributed to
the fact that some people received the new program while others did not.
In nonrandomized studies, the evaluator cannot assume that the groups are balanced before they receive (or
do not receive) a program or intervention. But if the participants are different, then how can the evaluator
who finds a difference between experimental and control outcomes separate the effects of the intervention
from differences in study participants? One answer is to consider taking care of potential confounders during
the data analysis phase using statistical methods like analysis of covariance and propensity score methods. A confounding variable can distort the apparent relation between the independent variable and the dependent variable. An evaluator who studies the cause of heart attacks will
probably consider age as a probable confounding variable because heart attacks occur more frequently in older
people.
Analysis of covariance (ANCOVA) is a statistical procedure that can be used to adjust for preexisting
differences in participants’ backgrounds that can confound the outcomes. These differences, which are likely to exist between experimental and control groups in nonrandomized evaluations, typically include factors such as age, gender, educational background, severity of illness, type of illness, and motivation. The
evaluator uses ANCOVA to adjust for the covariates/confounders so that each group is equal in age, severity
of illness, motivation to learn, and so on.
Which covariates are potentially linked to outcomes? To find out, the evaluator reviews the literature,
conducts preliminary analyses of study data or other databases, and gathers expert opinion on which
preexisting characteristics of participants can influence their willingness to participate in and benefit from
study inclusion.
Suppose an evaluator is interested in comparing the satisfaction of people in a new communications
program with people in a traditional one. The evaluator hypothesizes that people who participated in the
traditional program for at least 3 years will be less satisfied than participants who joined more recently. The
hypothesis is based on two recent studies about participant satisfaction in new programs. So, the evaluator’s
analysis uses length of participation as a covariate, because research suggests that this variable affects
satisfaction.
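
An illustrative Python sketch of such an analysis of covariance (using the numpy, pandas, and statsmodels packages; the data and variable names are invented and not taken from the text) follows:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 120

# Hypothetical data: program type, years of prior participation (the covariate), and satisfaction (the outcome).
program = rng.choice(["new", "traditional"], size=n)
years = rng.integers(0, 6, size=n)
satisfaction = (
    70
    + 5 * (program == "new")    # assumed program effect
    - 2 * years                 # longer participation assumed to lower satisfaction
    + rng.normal(0, 5, size=n)  # random noise
)
df = pd.DataFrame({"program": program, "years": years, "satisfaction": satisfaction})

# ANCOVA: compare programs on satisfaction while adjusting for length of participation.
model = smf.ols("satisfaction ~ C(program) + years", data=df).fit()
print(model.summary())
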
ANCOVA is one among several statistical methods for controlling confounders. Among them are
sensitivity analysis, instrumental variables, and propensity score analysis. Each of these makes assumptions
about the nature and distribution of the data that are collected to measure the independent, dependent, and
confounding variables. Statistical expertise is essential in selecting and implementing the appropriate strategy
to control for potential confounders.

Observational Designs

When evaluators use observational designs, they describe events (cross-sectional or survey designs), or examine
them as they occur (e.g., cohort designs), or describe events that have already taken place (e.g., case control
designs).

Cross-Sectional Designs

Evaluators use cross-sectional designs to collect baseline information on experimental and control groups,
to guide program development, and to poll the public. Cross-sectional designs enable evaluators to develop
portraits of one or many groups at one period of time. Example 3.4 illustrates the use of cross-sectional
designs in program evaluations.

Example 3.4 Cross-Sectional or Survey Designs and Program Evaluation

• A test of student knowledge of program evaluation principles: The Curriculum Committee wanted information
on entering graduate students’ knowledge of the methods and uses of program evaluation in improving the
public’s health. The committee asked the evaluator to prepare and administer a brief test to all students.
The results revealed that 80% could not distinguish between cross-sectional and experimental evaluation
designs, and only 20% could list more than one source for setting standards of program merit. The
Curriculum Committee used these findings to develop a course of instruction to provide entering graduate
students with the skills they need to conduct and review evaluations.
Comment: The test is used to survey students for information on needed curriculum content.
• A review of the care of hypertensive patients: Evaluators conducted a review of the medical records of 2,500
patients in three medical centers to find out about the management of patients with hypertension. They
are using the data to determine whether a program is needed to improve dissemination of practice
guidelines for this common medical problem and to determine how that program should be evaluated.

Comment: The review or survey of medical records is used to guide program development.
• Describe the study sample: Evaluators compared participants who did and did not meet criteria for
posttraumatic stress disorder (PTSD).

Comment: Cross-sectional surveys are most commonly used to describe an evaluation study’s participants
and compare their characteristics before, during, and after program participation with those of the
comparison groups.

Cohort Designs

A cohort is a group of people who have something in common (they share a health problem or have
participated in a program) and who remain part of a study over an extended period of time. Consider, for
example, an evaluator who is concerned with the need to educate pregnant women about the potentially
harmful effects of smoking on their children’s IQ. First, the evaluator identifies the cohort. Then he follows
them for a given period of time (Example 3.5). Because the design is forward in time, and the evaluator waits
for events to unfold, it is called a prospective design.

Example 3.5 A Cohort Design

The aim of the study was to examine the effects of tobacco smoking in pregnancy on children’s IQ at the age
of five. A prospective follow-up study was conducted on 1,782 women and their offspring sampled from the
National Birth Cohort Registry. Five years later, the children were tested with the Wechsler Preschool and Primary Scale of Intelligence-Revised (WPPSI-R). The evaluators compared the IQ of children whose
mothers smoked with the IQ of children whose mothers did not smoke.

In this prospective observational study, the evaluators do not introduce a program. The study focuses on
comparing the children of smokers and nonsmokers, and the evaluators choose their own measures (e.g., the Wechsler Preschool and Primary Scale of Intelligence-Revised).
High-quality prospective or longitudinal studies are expensive to conduct, especially if the evaluator is
concerned with outcomes that are hard to predict. Studying unpredictable outcomes requires large samples
and numerous measures. Also, evaluators who do prospective cohort studies have to be on guard against loss
of subjects over time or attrition. Longitudinal studies of children, for example, are characterized by attrition
because over time, families lose interest, move far away, change their names, and so on. If a large number of
people drop out of a study or cannot be found, the sample that remains may be very different from the sample
that left. The people who left may differ (in age, motivation, health, or mobility) from those who remain; the remaining sample may, for example, be more motivated or less mobile, and these differences may be related in unpredictable ways to any observed outcomes.

Case Control Designs

Case control designs are used to explain why a phenomenon currently exists by comparing the histories of
two different groups, one of which is involved in the phenomenon. For example, in an evaluation that
hypothesizes that people who smoke (the independent variable) are more likely to be diagnosed with lung
cancer (the outcome), the cases would be persons with lung cancer, the controls would be persons without lung cancer, and some of each group would be smokers. If a larger proportion of the cases than of the controls smoke, this suggests, but does not conclusively show, that the hypothesis is valid. Because the
evaluators look back over time, the design is retrospective.
The cases in case control designs are individuals who have been chosen on the basis of some characteristic
or outcome (e.g., motor vehicle accident). The controls are individuals without the characteristic or outcome.
The histories of cases and controls are analyzed and compared in an attempt to uncover one or more
characteristics present in the cases (such as a new program or policy) that is not present in the controls and
which also preceded the outcome (such as drinking alcohol before the motor vehicle accident).
Here is one example:
A team of evaluators invented a toolkit to prevent patients from falling in hospitals, a relatively frequent and
serious problem among people of all ages. The evaluation demonstrated that the toolkit significantly reduced falls.
Despite the kit’s effectiveness, however, some patients fell anyway. Why? To answer this question and obtain data to improve the kit, the evaluators conducted a case control study.
The cases included hospital patients who fell at four acute care hospitals where the kit was in place. Controls were
randomly selected from patients who were admitted to intervention units within the study’s 6-month period and did
not fall. Cases and controls were matched for gender, age (within five years), first total fall score, and unit length of
stay up to the time of the fall. The evaluators relied on data from the original study’s databases for all its information.
No new data were collected.
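
A rough Python sketch of this kind of matching (the patient data below are invented, and the real study used additional matching criteria and existing databases) might look like this:

import pandas as pd

# Hypothetical patients: the cases fell; the candidates did not.
cases = pd.DataFrame({"id": ["C1", "C2"], "gender": ["F", "M"], "age": [72, 58]})
candidates = pd.DataFrame({
    "id": ["K1", "K2", "K3", "K4"],
    "gender": ["F", "F", "M", "M"],
    "age": [70, 80, 60, 75],
})

# Match each case to a control of the same gender whose age is within five years.
matches = {}
available = candidates.copy()
for _, case in cases.iterrows():
    eligible = available[
        (available["gender"] == case["gender"])
        & ((available["age"] - case["age"]).abs() <= 5)
    ]
    if not eligible.empty:
        control = eligible.iloc[0]
        matches[case["id"]] = control["id"]
        available = available.drop(control.name)  # each control is used only once

print(matches)  # prints {'C1': 'K1', 'C2': 'K3'}
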
Case control designs can establish that a predictor variable (e.g., being dizzy or not using a cane when getting up from bed) precedes an outcome (falling). They are often less expensive than many other designs because the data have already been collected. But this advantage must be weighed against the fact that the data
were collected for another purpose. They may not include the variables, measures, or participants that the
evaluators might prefer if they were responsible for the original design and data collection.
Case control evaluators who rely on very large databases can provide information on many thousands of
people in real-world settings without having to cope with the logistics and expense of experimental studies.
To avoid bias, however, case control evaluators need to ensure consistency within and among databases
because those who provide the data almost always differ in how they organize information and in the terms and codes they use.

A Note on Pretest-Posttest Only or Self-Controlled Designs

In a simple pretest-posttest only design (also called self-controlled design), each participant is measured on
some important program variable (e.g., quality of life) and serves as his or her own control. These designs do
not have a control group. Participants are usually measured twice (at baseline and after program
participation), but they may be measured multiple times before and after as well. For instance, the evaluators
of a new one-year program decided to collect data five weeks before participation, at enrollment, and at
intervals of 3, 6, and 12 months after program completion. The data were subsequently used to modify the
program based on participant feedback. At the conclusion of the 12-month period, the evaluators felt
confident that the program was ready for implementation.
Pretest-Posttest designs are not appropriate for effectiveness evaluations. They tend to be relatively small in
size and duration, and because they have no control group, they are subject to many potential biases. For
instance, participants may become excited about taking part in an experiment, this excitement may help
motivate performance, and without a comparison group, you cannot measure and compare the excitement.
Also, between the pretest and the posttest, participants may mature physically, emotionally, and intellectually,
thereby affecting the program’s outcomes. Without a comparison, you cannot tell if any observed changes are
due to the new program or to maturation. Finally, self-controlled evaluations may be affected by historical
events, including changes in program administration and policy. You need a comparison group that is also
subject to the same historical events to understand the likely reason for any change you observe: Is the change
due to the program or the event?

Comparative Effectiveness Research and Evaluation

Comparative effectiveness evaluation designs are used in the health sciences (public health, medicine, nursing),
but they are appropriate for all disciplines (education, psychology, social work). These evaluations aim to
compare the health outcomes, clinical effectiveness, risks, and benefits of two or more available health or
medical treatments, programs, or services.
Comparative effectiveness evaluations or research (CER) have four defining characteristics:

1. Two programs are directly compared with one another. For instance, Program A to improve
immunization rates is compared to Program B to improve immunization rates.

2. The standard of effectiveness is explicit. For example, Program A is effective when compared to Program B if it achieves all of its five stated objectives and also costs less per participant than Program B.

3. The patients, clinicians, and programs who participate in the evaluation must be representative of usual
practice. Programs A and B take place in naturalistic settings (physicians’ offices, clinics, and hospitals)
and often involve already existing activities (reminders to patients to get immunized). When they do,
they are observational studies.

4. The goal of the evaluation is to help everyday patients, clinicians, and policy makers make informed
choices by providing the best available evidence on the effectiveness, quality, and value of alternative
programs.

Evaluations that compare two existing programs are not new. One unique characteristic of comparative
effectiveness evaluation or research (CER) is its emphasis on having study activities take place in usual
practice rather than in controlled settings. In typical evaluations, the evaluators alter or control the
environment. They may include only people with certain characteristics (for instance, people without chronic
illness) and exclude others (for example, people with high blood pressure or depression). In CER, all these
people are eligible to be in the study because in real life, the physician or hospital does not exclude them from
care.
A second unique characteristic of CER is found in its primary purpose: to provide information for
improved decision making. The questions CER practitioners ask when designing their studies are: Will the
findings provide a patient, clinician, or policy maker with information to make a more informed choice than
they would have without the evaluation? Did the evaluation promote informed decision making? Example 3.6
summarizes two comparative effectiveness evaluations.

Example 3.6 Comparative Effectiveness Evaluations in Action

1. Are helicopter or ground emergency medical services more effective for adults who have suffered a major
trauma? (Galvagno et al., 2012)

Helicopter services are a limited and expensive resource, and their effectiveness is subject to debate. The
evaluators compared the effectiveness of helicopter and ground emergency services in caring for trauma
patients. They defined one standard of effectiveness as survival to hospital discharge. Data were collected
from the American College of Surgeons National Trauma Data Bank on 223,475 patients older than 15
years, having an injury severity score considered “high,” and sustaining blunt or penetrating trauma that
required transport to U.S. Level I or Level II trauma centers.

The study found that a total of 61,909 patients were transported by helicopter, and 161,566 patients were
transported by ground. Overall, 7,813 patients (12.6%) transported by helicopter died compared with 17,775
patients (11%) transported by ground services. Helicopter transport, when compared with ground transport,
was associated with improved odds of survival for patients transported to both Level I and Level II trauma
centers. These findings were significant. Concurrently, fewer patients transported by helicopter left Level II
trauma centers against medical advice. These findings were also significant.

After controlling for multiple known confounders, the evaluators concluded that among patients with major
trauma admitted to Level I or Level II trauma centers, that transport by helicopter compared with ground
services was associated with improved survival to hospital discharge.

2. Is face-to-face or over-the-telephone delivery of cognitive behavioral therapy more effective and cost-
effective for patients with common mental disorders? (Hammond et al., 2012)

The investigators collected data over a two-year period on 39,227 adults who were in a program to improve
access to psychological services. Patients received two or more sessions of cognitive behavioral therapy (CBT).
The over-the-telephone delivery program was hypothesized to be no worse in improving outcomes than the
face-to-face intervention. This is called a “non-inferiority” effectiveness standard.

Patients usually complete questionnaires for depression, anxiety, and work and social adjustment, and the
evaluators used these existing questionnaire data. As hypothesized, they found that the over-the-telephone
intervention was no worse than the face-to-face intervention, except in the case of the most ill patients. The
telephone intervention, however, cost 36.3% less than the face-to-face intervention. The evaluators concluded
that, given the non-inferiority of the telephone intervention, the evaluation provided evidence for better
targeting of psychological therapies for people with common mental disorders.

The first illustration in Example 3.6 is a comparative effectiveness evaluation because

1. two programs (helicopter and ground emergencies) are compared to one another;

2. the standard of effectiveness is explicit: survival to hospital discharge;

3. the evaluators use data taken from actual emergency events. They do not create a special environment
to observe what happens to trauma patients; and

4. the evaluators provide information on which of the two interventions is more effective to help inform
choices between the two.

The second illustration is a comparative effectiveness evaluation because

1. two programs are compared: face-to-face versus over-the-telephone cognitive behavioral therapy;

2. the standard is explicit: Over-the-telephone intervention will not be inferior to face-to-face (“non-
inferiority”);

3. the evaluators use data that were collected as a routine part of treatment; they do not collect new data;
and

4. the evaluators provide data on effectiveness and cost-effectiveness to better inform decisions about which program to use.

CER may use experimental or observational research designs, depending on the study’s purposes. CER studies are costly and tend to involve large numbers of people in many settings, requiring teamwork and statistical expertise.

Commonly Used Evaluation Designs

Program evaluators in the real world have to make trade-offs when designing a study. Table 3.1 describes the
benefits and concerns of six basic evaluation designs.

Table 3.1 Six Basic Program Evaluation Designs: Benefits and Concerns

Internal and External Validity

Internal validity refers to the extent to which the design and conduct of an evaluation are likely to have prevented bias. A study has internal validity if, when you report that Program A causes Outcome A, you can support the claim with valid evidence. An evaluation study has external validity if its results are applicable—generalizable—to other programs, populations, and settings. A study has external validity if, when you report that Program A can be used in Setting B, you can support the claim with valid evidence.

Internal Validity Is Threatened

Just as the best laid plans of mice and men (and women) often go awry, evaluations, regardless of how well
they are planned, lose something in their implementation. Randomization may not produce equivalent study
groups, for example, or people in one study group may drop out more often than people in the other. Factors such as less-than-perfect randomization and attrition can threaten or bias an evaluation’s findings. There are
at least eight common threats to internal validity:

1. Selection of participants. This threat occurs when biases result from the selection or creation of groups
of people (those in Program A and those in Program B) that are not equivalent. Either the random
assignment did not work, or attempts to match groups or control for baseline confounders were ineffective. As a result, participants are inherently different to begin with, and the effects of these
differences on program outcomes are difficult to explain, trace, and measure. Selection can interact with
history, maturation, and instrumentation.

2. History. Unanticipated events occur while the evaluation is in progress, and this history jeopardizes
internal validity. For instance, the effects of a school-based program to encourage healthier eating may
be affected by a healthy eating campaign on a popular children’s television show.

3. Maturation. Processes (e.g., physical and emotional growth) occur within participants inevitably as a
function of time, threatening internal validity. Children in a three-year school-based physical education
program mature physically, for example.

4. Testing. This threat can occur because taking one test has an effect on the scores of a subsequent test.
For instance, after a three-week program, participants are given a test. They recall their answers on a
pretest, and this influences their responses to the second test. The influence may be positive (they learn
from the test) or negative (they recall incorrect answers).

5. Instrumentation. Changes in a measuring instrument or changes in observers or scorers cause an effect that can bias the results. Evaluator A makes changes to the questions in a structured interview given to participants at Time 1; she administers the changed questions at Time 2. Or, no changes are made in
interview questions, but Evaluator A is replaced by Evaluator B; each evaluator has differing interview
styles.

6. Statistical regression. Regression occurs when participants are selected on the basis of extreme scores (very ill or in great need of assistance) and they then regress or go back toward an average score. Regression to the mean is a statistical artifact: The apparent change is due not to the program but to the tendency of extreme scores to move toward the average when measurement is repeated.

7. Attrition (drop-out) or loss to follow up. This threat to internal validity refers to the differential loss of
participants from one or more groups on a nonrandom basis. Participants in one group drop out more
frequently than participants in the others or more are lost to follow-up, for example. The resulting two
groups, which were alike to begin with, now differ.

8. Expectancy. A bias is caused by the expectations of the evaluator or the participants or both.
Participants in the experimental group expect special treatment, while the evaluator expects to give it to
them (and sometimes does). Blinding is one method of dealing with expectancy. A second is to ensure
that a standardized process is used in delivering the program.

External Validity Is Threatened

Threats to external validity are most often the consequence of the way in which participants or respondents
are selected and assigned. They also occur whenever respondents are tested, surveyed, or observed. They may
become alert to the kinds of behaviors that are expected or favored. There are at least four relatively common
sources of external invalidity:

1. Interaction effects of selection biases and the experimental treatment. This threat to external validity
occurs when an intervention or program and the participants are a unique mixture, one that may not be
found elsewhere. The threat is most apparent when groups are not randomly constituted. Suppose a
large company volunteers to participate in an experimental program to improve the quality of
employees’ leisure time activities. The characteristics of the company, such as leadership and priorities,
are related to the fact that it volunteered for the experiment and may interact with the program so that
the two elements combined make the study unique; therefore, the particular blend of company and
program can limit the applicability of the findings to other companies.

2. Reactive effects of testing. This bias occurs because participants are pretested, which sensitizes them to
the treatment that follows. In fact, pretesting has such a powerful effect that teachers sometimes use
pretests to get students ready for their next class assignment. A pretested group is different from one
that has not been pretested, and so the results of an evaluation using a pretest may not be generalizable.

3. Reactive effects of experimental arrangements or the Hawthorne effect. This threat to external validity
can occur when people know that they are participating in an experiment. Sometimes known as the
Hawthorne Effect, this threat is caused when people behave uncharacteristically because they are aware
that their circumstances are different. (They are being observed by cameras in the classroom, for
instance).

4. Multiple program interference. This threat results when participants are in other complementary
activities or programs that interact. Participants in an experimental mathematics program are also
taking a physics class and both teach differential calculus; hence, the possible interference.

How do the threats to internal and external validity promote bias? Consider The Work and Stress
Program, a yearlong program to help reduce on-the-job stress. Eligible people can enroll in one of two
variations of the program. To find out if participants are satisfied with the quality of the program, both
groups complete an in-depth questionnaire at the end of the year, and the evaluator compares the results.
This is a nonrandomized controlled trial. Its internal validity is potentially marred by the fact that the
participants in the groups may be different from one another at the onset. More stressed people may choose
one program over the other, for example. Also, because of the initial differences, the attrition or loss to
follow-up may be affected if the only people to continue with the program are the motivated ones. The failure
to create randomly constituted groups will jeopardize the study’s external validity by the interactive effects of
selection.

Summary and Transition to the Next Chapter on Sampling

This chapter focuses on ways to structure an evaluation to produce unbiased information of program merit.
The structure, or evaluation design, includes the evaluation questions and hypotheses, criteria for participant
selection, and rules for assigning participants to programs and administering measures. The chapter also
discusses the benefits and concerns associated with experimental and observational designs. Randomized
controlled trials are recommended for evaluators who want to establish causation, but these designs are
difficult to implement. Nonrandomized trials and observational designs are often used in evaluations, and if
rigorously implemented can provide useful, high-quality data. Comparative effectiveness research aims to support informed decision making by providing data on the effectiveness, quality, and value of alternative programs.
The next chapter discusses sampling—that is, what you need to do when you cannot obtain an entire given
population for your evaluation. Among the issues discussed are how to obtain a random sample and how to
select a large enough sample so that if program differences exist, you will find them.

Exercises

Exercise 1

Directions

Read the following description of an evaluation of the effectiveness of a school-based intervention for
reducing children’s symptoms of PTSD and depression resulting from exposure to violence.

1. Name the design

2. List the evaluation questions or hypotheses

3. Identify the evidence of merit

4. Describe the eligibility criteria

5. Describe whether and how participants were assigned

6. Discuss the timing and frequency of the measures

Sixth-grade students at two large middle schools in Los Angeles who reported exposure to violence and
had clinical levels of symptoms of PTSD were eligible for participation in the evaluation. Students were
randomly assigned to a 10-session standardized cognitive-behavioral therapy (the Cognitive-Behavioral
Intervention for Trauma in Schools) early intervention group (n = 61) or to a wait-list delayed intervention
comparison group (n = 65) conducted by trained school mental health clinicians. Students were evaluated
before the intervention and 3 months after the intervention on measures assessing child-reported symptoms
of PTSD and depression, parent-reported psychosocial dysfunction, and teacher-reported classroom
problems.

Exercise 2

Name the threats to internal and external validity in the following two study discussions.

1. The Role of Alcohol in Boating Deaths


Although many potentially confounding variables were taken into account, we were unable to adjust for
other variables that might affect risk, such as the boater’s swimming ability, the operator’s boating skills and
experience, use of personal flotation devices, water and weather conditions, and the condition and
seaworthiness of the boat. Use of personal flotation devices was low among control subjects (about 6.7% of
adults in control boats), but because such use was assessed only at the boat level and not for individuals, it was
impossible to include it in our analyses. Finally, although we controlled for boating exposure with the random
selection of control subjects, some groups may have been underrepresented.

2. Violence Prevention in the Emergency Department


The study design would not facilitate a blinding process that may provide more reliable results. The study
was limited by those youth who were excluded, lost to follow-up, or had incomplete documents.
Unfortunately, the study population has significant mobility and was commonly unavailable when the case
managers attempted to interview them. The study was limited by the turnover of case managers.

Note: In addition to the “limitations” discussed above, the evaluators cite other study problems. For
example, the evaluators say, “This study and the results noted were limited by the duration of case
management and follow-up to 6 months. Perhaps, a longer period of at least 1 year may produce improved
results.” They also state, “All of the components of the evaluation tool, except for Future Expectations and
Social Competence, were validated, but the combination has not been.”

Exercise 3

Describe the advantages and limitations of commonly used research designs, including

• Randomized controlled trials with concurrent and wait-list control groups

• Quasi-experimental or nonrandomized designs with concurrent control groups
• Observational designs, including cohorts, case controls, and cross-sectional surveys

References and Suggested Readings

Galvagno, S. M., Jr., Haut, E. R., Zafar, S. N., Millin, M. G., Efron, D. T., Koenig, G. J., Jr., … Haider, A.
H. (2012, April 18). Association between helicopter vs ground emergency medical services and survival
for adults with major trauma. JAMA: The Journal of the American Medical Association, 307(15), 1602–1610.
Hammond, G. C., Croudace, T. J., Radhakrishnan, M., Lafortune, L., Watson, A., McMillan-Shields, F.,
& Jones, P. B. (2012). Comparative effectiveness of cognitive therapies delivered face-to-face or over the
telephone: An observational study using propensity methods. PLoS One, 7(9).

For Examples of Randomized Controlled Trials:

Baird, S. J., Garfein, R. S., McIntosh, C. T., & Ozler, B. (2012). Effect of a cash transfer programme for
schooling on prevalence of HIV and herpes simplex type 2 in Malawi: A cluster randomised trial. Lancet,
379(9823), 1320–1329. doi: 10.1016/s0140–6736(11)61709–1
Buller, M. K., Kane, I. L., Martin, R. C., Grese, A. J., Cutter, G. R., Saba, L. M., & Buller, D. B. (2008).
Randomized trial evaluating computer-based sun safety education for children in elementary school.
Journal of Cancer Education, 23, 74–79.
Butler, R. W., Copeland, D. R., Fairclough, D. L., Mulhern, R. K., Katz, E. R., Kazak, A. E., . . .Sahler, O.
J. (2008). A multicenter, randomized clinical trial of a cognitive remediation program for childhood
survivors of a pediatric malignancy. Journal of Consulting and Clinical Psychology, 76, 367–378.
DuMont, K., Mitchell-Herzfeld, S., Greene, R., Lee, E., Lowenfels, A., Rodriguez, M., & Dorabawila, V.
(2008). Healthy Families New York (HFNY) randomized trial: Effects on early child abuse and neglect.
Child Abuse & Neglect, 32, 295–315.
Fagan, J. (2008). Randomized study of a prebirth coparenting intervention with adolescent and young fathers.
Family Relations, 57, 309–323.
Johnson, J. E., Friedmann, P. D., Green, T. C., Harrington, M., & Taxman, F. S. (2011). Gender and
treatment response in substance use treatment-mandated parolees. Journal of Substance Abuse Treatment,
40(3), 313–321. doi: 10.1016/j.jsat.2010.11.013
Nance, D. C. (2012). Pains, joys, and secrets: Nurse-led group therapy for older adults with depression. Issues
in Mental Health Nursing, 33(2), 89–95. doi: 10.3109/01612840.2011.624258
Poduska, J. M., Kellam, S. G., Wang, W., Brown, C. H., Ialongo, N. S., & Toyinbo, P. (2008). Impact of
the Good Behavior Game, a universal classroom-based behavior intervention, on young adult service use
for problems with emotions, behavior, or drugs or alcohol. Drug and Alcohol Dependence, 95, S29–S44.
Rdesinski, R. E., Melnick, A. L., Creach, E. D., Cozzens, J., & Carney, P. A. (2008). The costs of
recruitment and retention of women from community-based programs into a randomized controlled
contraceptive study. Journal of Health Care for the Poor and Underserved, 19, 639–651.

Swart, L., van Niekerk, A., Seedat, M., & Jordaan, E. (2008). Paraprofessional home visitation program to
prevent childhood unintentional injuries in low-income communities: A cluster randomized controlled
trial. Injury Prevention, 14(3), 164–169.
Thornton, J. D., Alejandro-Rodriguez, M., Leon, J. B., Albert, J. M., Baldeon, E. L., De Jesus, L. M., …
Sehgal, A. R. (2012). Effect of an iPod video intervention on consent to donate organs: A randomized
trial. Annals of Internal Medicine, 156(7), 483–490. doi: 10.1059/0003–4819–156–7-201204030–00004

For Examples of Nonrandomized Trials


or Quasi-Experimental Studies:

Corcoran, J. (2006). A comparison group study of solution-focused therapy versus “treatment-as-usual” for
behavior problems in children. Journal of Social Service Research, 33, 69–81.
Cross, T. P., Jones, L. M., Walsh, W. A., Simone, M., & Kolko, D. (2007). Child forensic interviewing in
Children’s Advocacy Centers: Empirical data on a practice model. Child Abuse & Neglect, 31, 1031–1052.
Gatto, N. M., Ventura, E. E., Cook, L. T., Gyllenhammer, L. E., & Davis, J. N. (2012). L.A. Sprouts: A
garden-based nutrition intervention pilot program influences motivation and preferences for fruits and
vegetables in Latino youth. Journal of the Academy of Nutrition and Dietetics, 112(6), 913–920. doi:
10.1016/j.jand.2012.01.014
Hebert, R., Raiche, M., Dubois, M. F., Gueye, N. R., Dobuc, N., Tousignant, M., & PRISMA Group
(2010). Impact of PRISMA, a coordination-type integrated service delivery system for frail older people
in Quebec (Canada): A quasi-experimental study. Journals of Gerontology Series B-Psychological Sciences and
Social Sciences, 65(1), 107–118. doi: 10.1093/geronb/gbp027
Kutnick, P., Ota, C., & Berdondini, L. (2008). Improving the effects of group working in classrooms with
young school-aged children: Facilitating attainment, interaction and classroom activity. Learning and
Instruction, 18, 83–95.
Orthner, D. K., Cook, P., Sabah, Y., & Rosenfeld, J. (2006). Organizational learning: A cross-national pilot-
test of effectiveness in children’s services. Evaluation and Program Planning, 29, 70–78.
Pascual-Leone, A., Bierman, R., Arnold, R., & Stasiak, E. (2011). Emotion-focused therapy for incarcerated
offenders of intimate partner violence: A 3-year outcome using a new whole-sample matching method.
Psychotherapy Research, 21(3), 331–347. doi: 10.1080/10503307.2011.572092
Rice, V. H., Weglicki, L. S., Templin, T., Jamil, H., & Hammad, A. (2010). Intervention effects on tobacco
use in Arab and non-Arab American adolescents. Addictive Behaviors, 35(1), 46–48. doi:
10.1016/j.addbeh.2009.07.005
Struyven, K., Dochy, F., & Janssens, S. (2008). The effects of hands-on experience on students’ preferences
for assessment methods. Journal of Teacher Education, 59, 69–88.

For Examples of Cohort Designs:

Brown, C. S., & Lloyd, K. (2008). OPRISK: A structured checklist assessing security needs for mentally
disordered offenders referred to high security psychiatric hospital. Criminal Behaviour and Mental Health,
18, 190–202.
Chauhan, P., & Widom, C. S. (2012). Childhood maltreatment and illicit drug use in middle adulthood:
The role of neighborhood characteristics. (Special Issue 03). Development and Psychopathology, 24, 723–
738. doi:10.1017/S0954579412000338
Kemp, P. A., Neale, J., & Robertson, M. (2006). Homelessness among problem drug users: Prevalence, risk
factors and trigger events. Health & Social Care in the Community, 14, 319–328.
Kerr, T., Hogg, R. S., Yip, B., Tyndall, M. W., Montaner, J., & Wood, E. (2008). Validity of self-reported
adherence among injection drug users. Journal of the International Association of Physicians in AIDS Care,
7(4), 157–159.
Pletcher, M. J., Vittinghoff, E., Kalhan, R., Richman, J., Safford, M., Sidney, S., … Kertesz, S. (2012).
Association between marijuana exposure and pulmonary function over 20 years. JAMA: The Journal of the
American Medical Association, 307(2), 173–181. doi: 10.1001/jama.2011.1961
Van den Hooven, E. H., Pierik, F. H., de Kluizenaar, Y., Willemsen, S. P., Hofman, A., van Ratinjen, S.
W., … Vrijheid, J. S. (2012). Air pollution exposure during pregnancy, ultrasound measures of fetal
growth, and adverse birth outcomes: A prospective cohort study. Environmental Health Perspectives,
120(1), 150–156. doi: 10.1289/ehp.1003316
White, H. R., & Widom, C. S. (2003). Does childhood victimization increase the risk of early death? A 25-
year prospective study. Child Abuse & Neglect, 27, 841–853.

For Examples of Case Control Studies:

Belardinelli, C., Hatch, J. P., Olvera, R. L., Fonseca, M., Caetano, S. C., Nicoletti, M., … Soares, J. C.
(2008). Family environment patterns in families with bipolar children. Journal of Affective Disorders, 107,
299–305.
Bookle, M., & Webber, M. (2011). Ethnicity and access to an inner city home treatment service: A case-
control study. Health & Social Care in the Community, 19(3), 280–288. doi: 10.1111/j.1365–
2524.2010.00980.x
Davis, C., Levitan, R. D., Carter, J., Kaplan, A. S., Reid, C., Curtis, C., … Kennedy, J. L.(2008). Personality
and eating behaviors: A case-control study of binge eating disorder. International Journal of Eating
Disorders, 41, 243–250.
Hall, S. S., Arron, K., Sloneem, J., & Oliver, C. (2008). Health and sleep problems in Cornelia de Lange
syndrome: A case control study. Journal of Intellectual Disability Research, 52, 458–468.
Menendez, C. C., Nachreiner, N. M., Gerberich, S. G., Ryan, A. D., Erkal, S., McGovern, P. M., … Feda,
D. M. (2012). Risk of physical assault against school educators with histories of occupational and other
violence: A case-control study. Work, 42(1), 39–46.

For Examples of Cross-Sectional Studies:

Belardinelli, C., Hatch, J. P., Olvera, R. L., Fonseca, M., Caetano, S. C., Nicoletti, M., … Soares, J. C.
(2008). Family environment patterns in families with bipolar children. Journal of Affective Disorders,
107(1–3), 299–305.
Carmona, C. G. H., Barros, R. S., Tobar, J. R., Canobra, V. H., & Montequín, E. A. (2008). Family
functioning of out-of-treatment cocaine base paste and cocaine hydrochloride users. Addictive Behaviors,
33, 866–879.
Cooper, C., Robertson, M. M., & Livingston, G. (2003). Psychological morbidity and caregiver burden in
parents of children with Tourette’s disorder and psychiatric comorbidity. Journal of the American Academy
of Child & Adolescent Psychiatry, 42, 1370–1375.
Davis, C., Levitan, R. D., Carter, J., Kaplan, A. S., Reid, C., Curtis, C., … Kennedy, J. L. (2008).
Personality and eating behaviors: A case-control study of binge eating disorder. International Journal of
Eating Disorders, 41, 243–250.
Joice, S., Jones, M., & Johnston, M. (2012). Stress of caring and nurses’ beliefs in the stroke rehabilitation
environment: A cross-sectional study. International Journal of Therapy & Rehabilitation, 19(4), 209–216.
Kypri, K., Bell, M. L., Hay, G. C., & Baxter, J. (2008). Alcohol outlet density and university student
drinking: A national study. Addiction, 103, 1131–1138.
Meijer, J. H., Dekker, N., Koeter, M. W., Quee, P. J., van Beveren, M. J. N., & Meijer, C. J. (2012).
Cannabis and cognitive performance in psychosis: A cross-sectional study in patients with non-affective
psychotic illness and their unaffected siblings. Psychological Medicine, 42(4), 705–716. doi:
10.1017/s0033291711001656
Schwarzer, R., & Hallum, S. (2008). Perceived teacher self-efficacy as a predictor of job stress and burnout.
Applied Psychology: An International Review, 57 (Suppl. 1), 152–171.

Purpose of This Chapter

Why do evaluators typically use samples of participants rather than entire populations? This chapter
answers this question and also discusses the advantages and limitations of five commonly used
evaluation sampling methods: random, systematic, stratified, cluster, and convenience sampling.
The chapter also explains the difference between the unit of sampling (e.g., schools and hospitals) and the unit of data analysis (e.g., individual students and individual patients) and discusses how to calculate an
appropriate sample size. A form for reporting on sampling strategy is also offered. This form is
designed to show the logical relationships among the evaluation questions, standards, independent
variables, sampling strata, inclusion and exclusion criteria, dependent variables, measures, sampling
methods, and sample size.

4
Sampling

A Reader’s Guide to Chapter 4

What Is a Sample?

Why Sample?

Inclusion and Exclusion Criteria or Eligibility

Sampling Methods

Simple random sampling, random selection and random assignment, systematic sampling, stratified
sampling, cluster sampling, and nonprobability or convenience sampling

The Sampling Unit

Sample Size

Power analysis and alpha and beta errors

The Sampling Report

Summary and Transition to the Next Chapter on Collecting Information

Exercises

References and Suggested Readings

What Is a Sample?

A sample is a portion or subset of a larger group called a population. The evaluator’s target population consists
of the institutions, persons, problems, and systems to which or to whom the evaluation’s findings are to be
applied or generalized. For example, suppose 500 students who are 14 to 16 years of age are selected to be in
an evaluation of a program to improve computer programming skills. If the program is effective, the school
system’s curriculum planners will offer it to all 14- to 16-year-olds in the system. The target population is
students between the ages of 14 and 16. The system may include thousands of 14- to 16-year-old students, but
only a sample of 500 are included in the evaluation.
Consider the two target populations and samples in Example 4.1.

Example 4.1 Two Target Populations and Two Samples

Evaluation 1

Target population: All homeless veterans throughout the United States

Program: Outreach, provision of single-room occupancy housing, medical and financial assistance, and job
training: the REACH-OUT Program

Sample: 500 homeless veterans in four of the fifty U.S. states who received outpatient medical care between
April 1 and June 30

Evaluation 2

Target population: All newly diagnosed breast cancer patients

Program: Education in Options for Treatment

Sample: Five hospitals in three of fifty U.S. states; within each hospital, 15 physicians; for each physician, 20
patients seen between January 1 and July 31 who are newly diagnosed with breast cancer

The REACH-OUT Program in Evaluation 1 is designed for all homeless veterans. The evaluator plans to
select a sample of 500 homeless veterans in four states between April 1 and June 30. The findings are to be
applied to all homeless veterans in all 50 states. Women newly diagnosed with breast cancer are the targets of
an educational program in Evaluation 2. The evaluators will select five hospitals and, within them, 20 patients
seen from January through July by each of 15 doctors. The findings are to be applied to all patients newly
diagnosed with breast cancer.

Why Sample?

Evaluators sample because it is efficient to do so and because it contributes to the precision of their research.
Samples can be studied more quickly than entire target populations, and they are also less expensive to
assemble. In some cases, it would be impossible to recruit a complete population for an evaluation, even if the
necessary time and financial resources were available. For example, it would be futile for an evaluator to
attempt to enroll all homeless veterans in an investigation of a program that targets them (see Example 4.1,
above). Sampling is also efficient in that it allows the evaluator to invest resources in important evaluation
activities, such as monitoring the quality of data collection and standardizing the implementation of the
program rather than in the collection of data on an unnecessarily large number of individuals or groups.
Sampling enables the evaluator to focus precisely on the characteristics of interest. For example, suppose an
evaluator wants to compare older and younger veterans with differing health and functional statuses. A
stratified sampling strategy can give the evaluator just what he or she needs. A sample of the population with
precise characteristics is actually more suitable for many evaluations than the entire population.

Inclusion and Exclusion Criteria or Eligibility

A sample is a part of a larger population; the evaluation's findings from the sample are used to make
inferences or extrapolations about that larger population. For instance, if an evaluation is intended to
investigate the impact of an educational program on women’s knowledge of their options for surgical
treatment for cancer, and not all women with cancer are to be included in the program, the evaluator has to
decide on the characteristics of the women who will be the focus of the study. Will the evaluation concentrate
on women of a specific age? Women with a particular kind of cancer? Example 4.2 presents hypothetical
inclusion and exclusion criteria for an evaluation of such a program.

Example 4.2 Inclusion and Exclusion Criteria for a Sample of Women to Be Included in an Evaluation of a Program for Surgical Cancer Patients

Inclusion: Using the U.S. Medicare claims database, of all patients hospitalized during 2013, those with
diagnosis or procedure codes related to breast cancer; for patients with multiple admissions, only the
admission with the most invasive surgery

Exclusion: Women under the age of 65 (because women under 65 who receive Medicare in the United States
are generally disabled or have renal disease), men, women with only a diagnostic biopsy or no breast surgery,
women undergoing bilateral mastectomy, women without a code for primary breast cancer at the time of the
most invasive surgery, women with a diagnosis of carcinoma in situ, and women with metastases to regions
other than the axillary lymph nodes

The evaluator of this program has set criteria for the sample of women who will be included in the
evaluation and for which its conclusions will be appropriate. The sample will include women over 65 years of
age who have had surgery for breast cancer. The findings of the evaluation will not be applicable to women
under age 65, to women with other types of cancer, or to women who had only a diagnostic biopsy, no breast
surgery, or a bilateral mastectomy.
The independent variables are the evaluator’s guide to determining where to set inclusion and exclusion
criteria. For example, suppose that in an evaluation of the effects on teens of a preventive health care
program, one of the evaluation questions asks whether boys and girls benefit equally from program
participation. In this question, the independent variable is sex and the dependent (outcome) variable is
benefit. If the evaluator plans to sample boys and girls, he or she must set inclusion and exclusion criteria.
Inclusion criteria for this evaluation might include boys and girls under 18 years of age who are likely to
attend all of the educational activities for the duration of the evaluation and who are able to read at levels
appropriate for their ages. Teens might be excluded if they already participate in another preventive health
care program, if they do not speak English, or if their parents object to their participation. If these
hypothetical inclusion and exclusion criteria are used to guide sampling eligibility, then the evaluation’s
findings can be generalized only to English-speaking boys and girls under age 18 who read appropriately for
their age and tend to be compliant with school attendance requirements. The evaluation is not designed to
enable the findings to be applicable to teens who have difficulty reading or speaking English and who are
unlikely to be able to complete all program activities.

Sampling Methods

Sampling methods are usually divided into two types: probability sampling and convenience sampling.
Probability sampling is considered the best way to ensure the validity of any inferences made about a
program’s effectiveness and its generalizability. In probability sampling, every member of the target
population has a known probability of being included in the sample.
In convenience sampling, participants are selected because they are available. Thus, in this kind of
sampling some members of the target population have a chance of being chosen whereas others do not, even
if they meet inclusion criteria. As a result, the data that are collected from a convenience sample may not be
applicable at all to the target group as a whole. For example, suppose an evaluator who is studying the quality
of care provided by student health services decides to interview all students who come for care during the
week of December 26. Suppose also that 100 students come and all agree to be interviewed: a perfect
response rate. The problem is that in some parts of the world, the end of December is associated with
increased rates of illness from respiratory viruses as well as with high numbers of skiing accidents; moreover,
many universities are closed during that week, and most students are not around. Thus, the data collected by
the happy evaluator with the perfect response rate could very well be biased, because the evaluation excluded
many students simply because they were not on campus (and, if they were ill, may have received care
elsewhere).

Simple Random Sampling

In simple random sampling, every subject or unit has an equal chance of being selected. Because of this
equality of opportunity, random samples are considered relatively unbiased. Typical ways of selecting a simple
random sample include applying a table of random numbers (available free online) or a computer-generated
list of random numbers to lists of prospective participants.

Suppose an evaluation team wants to select 10 social workers at random from a list (or sampling frame) of 20
names. Using a table of random numbers (Figure 4.1), the evaluators first identify a row of numbers at
random and then a column. Where the two intersect, they begin to identify their sample.

1. How they randomly identify the row: A member of the evaluation team tosses a die twice. On the first
toss, the die’s number is 3; the number on the second toss is 5. Starting with the first column in the
table, these numbers correspond to the third block and the fifth row of that block, or the row containing
numbers 1 4 5 7 5 in Figure 4.1.

2. How they randomly identify the column: A member of the team tosses the die twice and gets 2 and 1.
These numbers identify the second block of columns and the first column beginning with 1. The
starting point for this sample is where the row 1 4 5 7 5 and the column beginning with 1 intersect, at 3
5 4 9 0.

3. How they select the sample: The evaluators need 10 social workers from their list of 20, so they look for
two-digit numbers between 01 and 20. Moving down from the second column (starting with the numbers
below 3 5 4 9 0, and beginning with the first number, 7), the 10 numbers between 01 and 20 that appear
are 12, 20, 09, 01, 02, 13, 18, 03, 04, and 16. These are the social workers selected for the random sample.
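In practice, many evaluators generate this kind of selection with statistical software rather than a printed table. The sketch below shows one way to draw a comparable simple random sample in Python; the list of 20 social workers is hypothetical, and the seed is fixed only so that the illustration can be reproduced.

import random

# Hypothetical sampling frame of 20 social workers (placeholder names)
frame = ["Social worker {:02d}".format(i) for i in range(1, 21)]

random.seed(20150401)                # fixed seed so the example is reproducible
sample = random.sample(frame, k=10)  # simple random sample without replacement

for name in sorted(sample):
    print(name)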

Random Selection and Random Assignment

In any given evaluation, the process of random selection may be different from the process of random
assignment, as is illustrated in Example 4.3.

Figure 4.1 A Portion of a Table of Random Numbers

Note: Numbers that are bold and underlined = sample of ten, consisting of numbers between 01 and 20.

1 = Two rolls of a die yield 3 (block) and 5 (row).

2 = Two rolls of a die yield 2 (block) and 1 (column).

3 = Intersection of 1 and 2.

4 = Start here to get the sample.

Example 4.3 Random Selection and Random Assignment: Two Illustrations

Evaluation A

Evaluation A had six months to identify a sample of teens to participate in an evaluation of an innovative
school-based preventive health care program. At the end of the six months, all eligible teens were assigned to
the innovative program or the traditional (control) program. Assignment was based on how close to a
participating school each teen lived. The evaluators used this method because participation meant that
students would be required to attend several after-school activities, and no funding was available for
transportation.

Evaluation B

Evaluation B had six months to identify a sample of teens to participate in an evaluation of an innovative
school-based preventive health care program. At the end of the six months, a sample was randomly selected
from the population of all teens who were eligible. Half the sample was randomly assigned to the innovative
program, and half was assigned to the traditional (control) program.

In the example of Evaluation A, the evaluators selected all eligible teens and then assigned them to the
experimental and control groups according to how close each one lived to a participating school. In
Evaluation B, the evaluators selected a random sample of all those who were eligible and then randomly
assigned the participants either to the experimental group or the control group. Of these two methods, the
latter is usually considered preferable to the former. Random selection means that every eligible person has an
equal chance of becoming part of the sample; if all are included because they just happen to appear during the
time allocated for sampling, biases may be introduced. Random assignment can also guard against bias.
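Evaluation B's two-step procedure can be sketched as follows. The numbers of eligible and selected teens are hypothetical; the point is that selection from the eligible population and assignment to programs are separate random steps.

import random

random.seed(42)

# Hypothetical pool of 1,000 eligible teens, identified by ID number
eligible_ids = list(range(1, 1001))

# Step 1: random selection of 200 participants from all who are eligible
selected = random.sample(eligible_ids, k=200)

# Step 2: random assignment of half to the innovative program and half to the control
random.shuffle(selected)
innovative_group = selected[:100]
control_group = selected[100:]

print(len(innovative_group), len(control_group))  # 100 100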

Systematic Sampling

Suppose an evaluator is to select a sample of 500 psychologists from a list with the names of 3,000
psychologists. In systematic sampling, the evaluator divides 3,000 by 500 to yield 6, and then selects every
sixth name on the list. An alternative is to select a number at random—say, by tossing a die. If the toss comes
up with the number 5, for example, the evaluator selects the fifth name on the list, then the tenth, the
fifteenth, and so on, until he or she has the necessary 500 names.
Systematic sampling is not an appropriate method to use if a repeating pattern is a natural component of the
sampling frame. For example, if the frame is a list of names, names beginning with certain letters of the
alphabet (in English, Q or X) might be excluded because they appear infrequently.
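A minimal sketch of systematic sampling in Python, assuming a hypothetical list of 3,000 names and a random starting point chosen within the first sampling interval:

import random

random.seed(5)

# Hypothetical sampling frame of 3,000 psychologists (placeholder names)
frame = ["Psychologist {}".format(i) for i in range(1, 3001)]

interval = len(frame) // 500             # 3,000 / 500 = 6
start = random.randint(0, interval - 1)  # random start within the first interval

sample = frame[start::interval]          # every sixth name, beginning at the random start
print(len(sample))                       # 500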

Stratified Sampling

A stratified random sample is one in which the population is divided into subgroups, or strata, and a random
sample is then selected from each group. For example, in a program to teach women about options for the
treatment of breast cancer, the evaluator might choose to sample women of differing general health status (as
indicated by scores on a 32-item test), age, and income (high = +, medium = 0, and low = –). In this case,
health status, age, and income are the strata. This sampling blueprint can be depicted as shown in Figure 4.2.
The evaluator chooses the strata, or subgroups, based on his or her knowledge of their relationship to the
dependent variable or outcome measure—in this case, the options chosen by women with breast cancer. That
is, the evaluator in this example has evidence to support the assumption that general health status, age, and
income influence women’s choices of treatment. The evaluator must be able to justify his or her selection of
the strata with evidence from the literature or other sources of information (such as historical data or expert
opinion).

Figure 4.2 Sampling Blueprint for a Program to Educate Women in Options for Breast Cancer
Treatment

If the evaluator neglects to use stratification in the choice of a sample, the final results may be confounded.
For example, if the evaluation neglects to distinguish among women with different characteristics, good and
poor performance may be averaged among them, and the program will seem to have no effect even if women
in one or more groups benefited. In fact, the program actually might have been very successful with certain
women, such as those over age 75 who have moderate incomes and General Health Status scores between 25
and 32.
When evaluators do not use stratification, they may apply statistical techniques (such as analysis of
covariance and regression) retrospectively (after the data have already been collected) to correct for
confounders or covariates on the dependent variables or outcomes. Evaluators generally agree, however, that
it is better to anticipate confounding variables by sampling prospectively than to correct for them
retrospectively, through analysis. Statistical corrections require very strict assumptions about the nature of the
data, and the sampling plan may not have been designed for these assumptions. With few exceptions, using
statistical corrections afterward results in a loss of power, or the ability to detect true differences between
groups (such as the experimental and control groups).
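For readers who want to see what the retrospective approach looks like, here is a minimal sketch, using simulated and entirely hypothetical data, of a regression model that adjusts a program comparison for a covariate such as age. As noted above, building such variables into the sampling plan prospectively is generally preferable.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated, hypothetical data: 200 women, half in each program group
df = pd.DataFrame({
    "group": np.repeat(["experimental", "control"], 100),
    "age": rng.integers(40, 90, size=200),
})
# Outcome constructed so that both program group and age influence it
df["knowledge_score"] = (
    60 + 10 * (df["group"] == "experimental") - 0.3 * df["age"]
    + rng.normal(0, 8, size=200)
)

# Retrospective adjustment: regress the outcome on group while controlling for age
adjusted = smf.ols("knowledge_score ~ group + age", data=df).fit()
print(adjusted.params)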
The strata in stratified sampling are subsets of the independent variables. If the independent variables are
sex, health status, and education, the strata are how each of these is defined. For example, sex is defined as
male and female. A variable like health status can be defined in many ways, depending on the measures
available to collect data and the needs of the evaluation. For example, health status may be defined as a
numerical score on some measure or may be rated as excellent, very good, good, fair, or poor.
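A prospective stratified sample like the one in the blueprint can be drawn with standard data-analysis tools. The sketch below uses Python's pandas library with made-up strata labels and a hypothetical frame of 1,200 eligible women; it assumes each stratum in the frame contains at least the 10 women requested.

import numpy as np
import pandas as pd

rng = np.random.default_rng(2015)

# Hypothetical sampling frame: one row per eligible woman, with her strata values
frame = pd.DataFrame({
    "health_status": rng.choice(["1-12", "13-24", "25-32"], size=1200),
    "age_group": rng.choice(["under 65", "65-74", "75+"], size=1200),
    "income": rng.choice(["high", "medium", "low"], size=1200),
})

# Draw 10 women at random from every stratum (every combination of the three variables)
stratified_sample = frame.groupby(
    ["health_status", "age_group", "income"]
).sample(n=10, random_state=2015)

print(len(stratified_sample))  # 27 strata x 10 women = 270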

Cluster Sampling

Cluster sampling is used in large evaluations—those that involve many settings, such as universities,
hospitals, cities, states, and so on. In cluster sampling, the population is divided into groups, or clusters, which
are then randomly selected (and, where appropriate, randomly assigned); the individuals within the selected
clusters can themselves also be randomly selected and assigned. For example,
suppose that California’s counties are trying out a new program to improve emergency care for critically ill
and injured children, and the control program is the traditional emergency medical system. If you want to use
random cluster sampling to evaluate this program, you can consider each county to be a cluster and select and
assign counties at random to the new program or the traditional program. Alternatively, you can randomly
select children’s hospitals and other facilities treating critically ill children within counties and randomly
assign them to the experimental system or the traditional system (assuming this is considered ethical).
Example 4.4 gives an illustration of the use of cluster sampling in a survey of Italian parents’ attitudes toward
AIDS education in their children’s schools.

Example 4.4 Cluster Sampling in a Study of Attitudes of Italian Parents Toward AIDS Education

Epidemiologists from 14 of Italy’s 21 regions surveyed parents of 725 students from 30 schools chosen by a
cluster sampling technique from among the 292 classical, scientific, and technical high schools in Rome. Staff
visited each school and selected students using a list of random numbers based on the school’s size. Each of
the students selected for participation was given a letter to deliver to his or her parents explaining the goals of
the study and when they would be contacted.
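In the California example, the clusters are counties. A minimal sketch of random selection and assignment of clusters might look like the following; the county names and the numbers selected are placeholders.

import random

random.seed(58)

# Hypothetical list of counties serving as clusters
counties = ["County {}".format(chr(65 + i)) for i in range(20)]  # County A ... County T

# Randomly select 10 counties, then randomly assign 5 to each program
selected = random.sample(counties, k=10)
random.shuffle(selected)
new_emergency_program = selected[:5]
traditional_program = selected[5:]

print(new_emergency_program)
print(traditional_program)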

Nonprobability or Convenience Sampling

Convenience samples are samples where the probability of selection is unknown. Evaluators use
convenience samples simply because they are easy to obtain. This means that some people have no chance of
selection because they are not around to be chosen. Convenience samples are considered to be biased,
or not representative of the target population, unless proved otherwise.
In some cases, evaluators can perform statistical analyses to demonstrate that convenience samples are
actually representative. For example, suppose that during the months of July and August, an evaluator
conducts a survey of the needs of county institutions concerned with critically ill and injured children.
Because many county employees take their yearly vacations in July and August, the respondents may be
different from those who would have answered the survey during other times of the year. If the evaluator
wants to demonstrate that those employees who were around to respond and those who were not available in
July and August are not different, he or she can compare the two groups on key variables, such as time on the
job and experience with critically ill and injured children. If this comparison reveals no differences, the
evaluator is in a relatively stronger position to assert that, even though the sample was chosen on the basis of
convenience, the characteristics of the participants do not differ on certain key variables (such as length of
time on the job) from those of the target population.
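One way to carry out such a comparison is with a simple two-group test on a key variable. The sketch below uses hypothetical "years on the job" data and SciPy's independent-samples t test; a nonsignificant result is consistent with, though it does not prove, the claim that the convenience sample resembles the rest of the target population on that variable.

from scipy import stats

# Hypothetical years on the job for employees who responded in July and August...
july_august_respondents = [4.0, 6.5, 3.2, 8.1, 5.5, 7.0, 2.8, 6.1, 5.9, 4.8]
# ...and for employees who were away and did not respond
unavailable_employees = [5.1, 4.4, 7.2, 3.9, 6.8, 5.0, 8.3, 4.7, 5.6, 6.0]

t_statistic, p_value = stats.ttest_ind(july_august_respondents, unavailable_employees)
print("t = {:.2f}, p = {:.3f}".format(t_statistic, p_value))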

The Sampling Unit

A major concern in sampling is the unit to be sampled. Example 4.5 illustrates the concept of the sampling
unit.

Example 4.5 What Is the Target? Who Is Sampled?

An evaluation of a new program is concerned with measuring the program’s effectiveness in altering
physicians’ practices pertaining to acute pain management for children who have undergone operations. The
target population is all physicians who care for children undergoing operations. The evaluation question is
“Have physicians improved their pain management practices for children?” The evidence of effectiveness is
that physicians in the experimental group show significant improvement over a 1-year period and significantly
greater improvement than physicians in a control group. Resources are available for 20 physicians to
participate in the evaluation.
The evaluators randomly assign 10 physicians to the experimental group and 10 physicians to the control.
They plan to find out about pain management through a review of the medical records of 10 patients of each
of the physicians in the experimental and control groups, for a total sample of 200 patients. (This is
sometimes called a nested design.) A consultant to the evaluation team says that, in actuality, the evaluators are
comparing the practices of 10 physicians against those of 10 physicians, and not the care of 100 patients
against that of 100 patients. The reason is that characteristics of the care of the patients of any single
physician will be very highly related. The consultant advises the evaluators to correct for this lack of
“independence” among patients of the same physician by using one of the statistical methods available for
correcting for cluster effects. Another consultant, in contrast, advises the evaluators to use a much larger
number of patients per physician and suggests a statistical method for selecting the appropriate number.
Because the evaluators do not have enough money to enlarge the sample, they decide to “correct” statistically
for the dependence among patients.

In this example, the evaluators want to be able to apply the evaluation’s findings to all physicians who care
for children undergoing surgery, but they have enough resources to include only 20 physicians in the
evaluation. In an ideal world, the evaluators would have access to a very large number of physicians, but in the
real world, they have only the resources to study 10 patients per physician and access to statistical methods to
correct for possible biases. These statistical methods enable evaluators to provide remedies for the possible
dependence among the patients of a single physician, among students at a single institution, among health
care workers at a single hospital, and so on.
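One commonly used family of statistical remedies is the mixed-effects (multilevel) model, which adds a random effect for the physician so that the correlation among a physician's patients is modeled rather than ignored. The sketch below simulates hypothetical nested data and fits such a model with Python's statsmodels library; it illustrates the general idea rather than the specific method the consultants had in mind.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

# Hypothetical nested data: 20 physicians with 10 patients each (200 records)
physician_id = np.repeat(np.arange(1, 21), 10)
group = np.where(physician_id <= 10, "experimental", "control")

# Pain-management scores include a physician-level component, so patients
# of the same physician are correlated
physician_effect = rng.normal(0, 5, size=20)[physician_id - 1]
pain_score = (70 + 8 * (group == "experimental")
              + physician_effect + rng.normal(0, 10, size=200))

df = pd.DataFrame({"pain_score": pain_score,
                   "group": group,
                   "physician_id": physician_id})

# A random intercept for physician accounts for the dependence among
# patients treated by the same physician
result = smf.mixedlm("pain_score ~ group", data=df, groups=df["physician_id"]).fit()
print(result.summary())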

Sample Size

Power Analysis and Alpha and Beta Errors

An evaluation’s ability to detect an effect is referred to as its power. A power analysis is a statistical method
for identifying a sample size that will be large enough to allow the evaluator to detect an effect if one actually
exists. A commonly used evaluation research design is one where the evaluator compares two randomly
assigned groups to find out whether differences exist between them. Accordingly, a typical evaluation
question is “Does Program A differ from Program B in its ability to improve quality of life?” To answer this
question accurately, the evaluator must be sure that enough people are in each program group so that if a
difference is actually present, it will be uncovered. Conversely, if there is no difference between the two
groups, the evaluator does not want to conclude falsely that there is one. To begin the process of making sure
that the sample size is adequate to detect any true differences, the evaluator’s first step is to reframe the
appropriate evaluation questions into null hypotheses. Null hypotheses state that no difference exists between
groups, as illustrated in Example 4.6.

Example 4.6 The Null Hypothesis in a Program to Improve Quality of Life

Question: Does Experimental Program A improve quality of life?

Evidence: A statistically significant difference is found in quality of life between Experimental Program A’s
participants and Control Program B’s participants, and the difference is in Program A’s favor

Data source: The Quality of Life Assessment, a 30-minute self-administered questionnaire with 100
questions. Scores on the Assessment range from 1 to 100, with 100 meaning excellent quality of life

Null hypothesis: No difference in quality of life exists between participants in Program A and participants in
Program B. In other words, the average scores on the Quality of Life Assessment obtained by participants in
Program A and in Program B are equal

When an evaluator finds that differences exist among programs, but in reality there are no differences, that
is called an alpha error or Type I error. A Type I error is analogous to a false-positive test result; that is, a result
indicating that a disease is present when in actuality it is not. When an evaluator finds that no differences
exist among programs, but in reality differences exist, that is termed a beta error or Type II error. A Type II
error is analogous to a false-negative test result; that is, a result indicating that a disease is not present when in
actuality it is. The relationship between what the evaluator finds and the true situation can be depicted as
shown in Figure 4.3.

Figure 4.3 Type I and Type II Errors: Searching for a True Difference

To select sample sizes that will maximize the power of their evaluations, evaluators must rely on formulas
whose use requires an understanding of hypothesis testing and a basic knowledge of statistics. Evaluators
using these formulas usually must perform the following steps:

• State the null hypothesis.


• Set a level (alpha or α) of statistical significance, usually .05 or .01, and decide whether the test is to be
one- or two-tailed. A two-tailed test asks whether the obtained mean is either significantly greater than or
significantly less than the predicted mean, that is, whether a difference exists in either direction. A
one-tailed test asks whether the obtained mean differs from the predicted mean in one specified direction
only, completely disregarding the possibility of a difference in the other direction.
• Decide on the smallest acceptable meaningful difference (e.g., the difference in average scores between
groups must be at least 15 points).
• Set the power (1 − β) of the evaluation, or the chance of detecting a difference (usually 80% or 90%).
• Estimate the standard deviation (assuming that the distribution of the data is normal) in the population.

Some researchers have proposed alternative sample size calculations based on confidence intervals. A
confidence interval is a range, computed from sample data, that has a given probability of containing the
unknown parameter (such as the mean). Common confidence intervals are 90%, 95%, and 99%.
Calculating sample size is a technical activity that requires some knowledge of statistics. Several easy-to-use
programs for calculating sample size are currently available for free online. To find one, enter the search term
“sample size” into any search engine.
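As one illustration, the free statsmodels library for Python includes power calculations for common designs. The sketch below assumes a hypothetical smallest meaningful difference of 15 points and an estimated population standard deviation of 30, which together give a standardized effect size of 0.5.

from statsmodels.stats.power import TTestIndPower

effect_size = 15 / 30     # smallest meaningful difference / estimated standard deviation

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,           # two-tailed significance level
    power=0.80,           # 1 - beta, the chance of detecting a true difference
)
print(round(n_per_group))  # roughly 64 participants per group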

The Sampling Report

Evaluators can use the sampling report (SR) form (Figure 4.4) for planning and explaining their evaluations.
The SR contains the evaluation questions and evidence, the independent variables and strata, the evaluation
design, inclusion and exclusion criteria, the dependent variable, the data source, the criteria for level of
acceptable statistical and clinical (or educational or practical) significance, the sampling methods, and the size
of the sample.
The form in Figure 4.4 shows the use of the SR for one evaluation question asked in an 18-month program
combining diet and exercise to improve health status and quality of life for persons 65 years of age or older.

An experimental group of elderly people who still live at home will receive the program and another group
will not. To be eligible, participants must be able to live independently. People who are under 65 years of age
and those who do not speak English or Spanish are not eligible. Participants will be randomly assigned to the
experimental or control groups according to the streets on which they live (that is, participants living on
Street A will be randomly assigned, as will participants living on Streets B, C, and so on). The evaluators will
be investigating whether the program effectively improves quality of life for men and women equally. A
random sample of men and women will be selected from all who are eligible, but no two will live on the same
street. Then men and women will be assigned at random to the experimental group or the control group.

Figure 4.4 The Sampling Report Form

Summary and Transition to the
Next Chapter on Collecting Information

This chapter discusses the advantages and limitations of probability and convenience sampling and how to
think about the sampling unit and sample size. The next chapter discusses the evaluator's data collection
choices.

Exercises

Directions

For each of the following situations, choose the best sampling method from among these choices:

A. Simple random sampling

B. Stratified sampling

C. Cluster sampling

D. Systematic sampling

E. Convenience sampling

Situation 1

The Rehabilitation Center has 40 separate family counseling groups, each with about 30 participants. The
director of the Center has noticed a decline in attendance rates and has decided to try out an experimental
program to improve them. The program is very expensive, and the Center can afford to finance only a 250-
person program at first. If the evaluator randomly selects individuals from among all group members, this will
create friction and disturb the integrity of some of the groups. As an alternative, the evaluator has suggested a
plan in which five of the groups—150 people—will be randomly selected to take part in the experimental
program and five groups will participate in the control.

Situation 2

The Medical Center has developed a new program to teach patients about cardiovascular fitness. An
evaluation is being conducted to determine how effective the program is with males and females of different
ages. The evaluation design is experimental, with concurrent controls. In this design, the new and traditional
cardiovascular programs are compared. About 310 people signed up for the winter seminar. Of these, 140 are
between 45 and 60 years old, and 62 of these 140 were men. The remaining 170 are between 61 and 75 years
old, and 80 of these are men. The evaluators randomly selected 40 persons from each of the four subgroups
and randomly assigned every other person to the new program and the remainder to the old program.

Situation 3

A total of 200 health education teachers signed up for a continuing education program. Only 50 teachers
from this group, however, are to participate in an evaluation of the program’s impact. The evaluator assigns
each participant a number from 001 to 200 and, using a table, selects 50 names by moving down columns of
three-digit random numbers and taking the first 50 numbers within the range 001 to 200.

References and Suggested Readings

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Dawson, B., & Trapp, R. G. (2005). Basic and clinical biostatistics (3rd ed.). New York: Lange
Medical/McGraw-Hill.
Hulley, S. B., Cummings, S. R., Browner, W. S., Grady, D., Hearst, N., & Newman, T. B. (2006).
Designing clinical research (2nd ed.). Philadelphia, PA: Lippincott Williams & Wilkins.

Suggested Websites

These websites provide definitions and explanations of sampling methods. For other sites, search for
statistical sampling; how to sample; sampling methods; sampling strategies.

http://www.cliffsnotes.com/study_guide/Populations-Samples-Parameters-and-Statistics.topicArticleId-267532,articleId-267478.html

http://www.ma.utexas.edu/users/parker/sampling/srs.htm

http://www.census.gov/history/www/innovations/data_collection/developing_sampling_techniques.html

http://www.youtube.com/watch?v=HldwaasDP2A

Purpose of This Chapter

An evaluation's purpose is to produce unbiased information on a program's merits. This chapter
discusses evaluation information sources, such as self-administered questionnaires (on and offline),
achievement tests, record reviews, observations, interviews (in person and over landlines or mobile
phones), large databases, performance tests, vignettes, physical examinations, and literature. The
advantages and limitations of each of these data sources are discussed. Evaluators frequently go to
the published literature for information on the data sources that other investigators use. Therefore,
this chapter explains how to identify and search the literature for potentially useful data sources and
measures.

5
Collecting Information

The Right Data Sources

A Reader’s Guide to Chapter 5

Information Sources: What’s the Question?

Choosing Appropriate Data Sources

Data Sources or Measures in Program Evaluation and Their Advantages and Disadvantages

Self-administered surveys, achievement tests, record reviews, observations, interviews, computer-assisted interviews, physical examinations, large databases, vignettes, and the literature

Guidelines for Reviewing the Literature

Summary and Transition to the Next Chapter on Evaluation Measures

Exercises

References and Suggested Readings

Information Sources: What’s the Question?

An evaluator is asked to study the effectiveness of the At-Home Program, a program whose main objectives
include providing high-quality health care and improving the health status and quality of life for elderly
persons over 75 years of age who are still living at home. At-Home, the experimental program, consists of a
comprehensive assessment of each elderly participant by an interdisciplinary team of health care providers and
a home visit every three months by a nurse practitioner to monitor progress and reevaluate health and
function. The control program consists of elderly people in the same communities who are not visited by the
nurse practitioner and continue with their usual sources of care. What information should the evaluator
collect to find out whether the At-Home Program’s objectives have been achieved? Where should the
information come from? The first place the evaluator should go for answers is to the evaluation questions and
evidence of effectiveness and their associated independent and dependent variables. Consider the illustrations
from the evaluation of the At-Home Program in Example 5.1.

Example 5.1 Evaluation Questions, Evidence, Variables, and Data Sources

Evaluation Question 1

Question: Has quality of care improved for diabetic patients in the At-Home and control programs?

Evidence: A statistically and clinically meaningful improvement in quality of care is observed in experimental
diabetic patients compared with control diabetic patients

Independent variable: Group participation (experimental diabetics and control diabetics)

Potential data sources: Physical examination, medical record review, surveys of health care practitioners and
patients

Dependent variable: Quality of care

Potential data sources: Medical record review and surveys of health care practitioners and patients

Evaluation Question 2

Question: Is quality of life improved for At-Home and control program participants? One aspect of good
quality of life is the availability of a network of family and friends

Evidence: A statistically significant difference between experimental and control programs in the availability
of a network of family and friends

Independent variable: Group participation (experimental and control)

Potential data source: Names or ID codes of participants in the experimental and control groups

Dependent variable: Availability of a network of family and friends

Potential data sources: Surveys of patients and their family members and friends, surveys of health care
practitioners, reviews of diaries kept by patients, observations

To answer the first question in Example 5.1, the evaluator has at least two implicit tasks: identifying
persons with diabetes and assigning them to the experimental and control groups. Patients with diabetes can
be identified through physical examination, medical record review, or surveys of health care practitioners and
patients. Quality of care for diabetes can be measured through medical record review or through a survey of
health care practitioners.
To identify persons in the experimental and control groups for the second question in Example 5.1, the
evaluator would examine the study’s database. To measure the adequacy of each patient’s network, the
evaluator could survey all patients and ask them to keep records or diaries of outings with their friends and
lists of their daily activities. The evaluator might also survey the patients’ friends and family members.
Given the range of choices for each evaluation question, on what basis, for instance, should the evaluator
choose to interview patients’ families rather than administer a questionnaire to the study’s participants? How
should the evaluator decide between medical record reviews and physical exams as data sources? Answering
these questions is at the heart of a program evaluation’s data collection.

Choosing Appropriate Data Sources

Evaluators have access to an arsenal of methods for collecting information. Among these are self-administered
questionnaires (on and offline), performance and achievement tests, face-to-face and telephone interviews,
observations, analysis of existing databases and vital statistics (such as infant mortality rates), the literature,
and personal, medical, financial, and other statistical records. There are advantages and limitations to using
each of these. To choose appropriate data sources for your program evaluation, you need to ask the following
questions:

Guidelines: Questions to Ask in Choosing Data Sources

• What variables need to be measured? Are they defined and specific enough to measure?
• Can you borrow or adapt a currently available measure, or must you create a new one?
• If an available measure seems to be appropriate, has it been tried out in circumstances that are similar to
those in which your evaluation is being conducted?
• Do you have the technical skills, financial resources, and time to create a valid measure?
• If no measure is available or appropriate, can you develop one in the time allotted?
• Do you have the technical skills, financial resources, and time to collect information with the chosen
measure?
• Are participants likely to be able to fill out forms, answer questions, and provide information called for by
the measure?
• In an evaluation that involves direct services to patients or students and also uses information from
medical, school, or other confidential records, can you obtain permission to collect data in an ethical way?
• To what extent will users of the evaluation’s results (e.g., practitioners, students, patients, program
developers, and sponsors) have confidence in the sources of information on which they are based?

Example 5.2 shows what can happen when evaluators neglect to answer these questions.

Example 5.2 (Not) Collecting Evaluation Information: A Case Study

The evaluators of an innovative third-year core surgery clerkship prepared a written examination to find out
whether students learned to test donor-recipient compatibility before transfusion of red blood cells. The
examination (to be given before and after the clerkship) included questions about the mechanisms involved in
and consequences of transfusing incompatible red cells, the causes of incompatible transfusions, what to do
about Rh-negative females who may bear children, how to identify unusual red cell antibodies, and what to
do when no compatible blood is available.

The evaluators also planned to prepare a measure of students’ understanding of ethical issues in blood
transfusion that would consist of 10 hypothetical scenarios with ethical components. They intended to
compare students’ responses to standards of ethics set by the University Blood Bank.

Finally, the evaluators anticipated distributing a self-administered survey to students before and after their
participation in the clerkship to find out if their attitudes toward transfusion medicine changed. The results of
the evaluators’ activities were to be presented at a special meeting of the School of Medicine’s Curriculum
Committee 1 year after the start of the innovative program.

The evaluators’ report turned out to be very brief. Although they were able to locate a number of achievement
tests with questions about donor-recipient compatibility, and thus did not have to prepare these measures
“from scratch,” they could not find an appropriate time to give all students a premeasure and postmeasure
survey. This meant that the evaluators had incomplete information on the performance of many students,
with only pretests for some and only posttests for others. In addition, the evaluators found that developing
the measure involving scenarios took about nine months because it was more difficult to create the scenarios
than they had anticipated. A sample of students found the original scenarios hard to understand and
ambiguous, so the evaluators had to rewrite and retest them. In the end, the scenario measure was not even
ready for use at reporting time.

Finally, many students refused to complete the attitude questionnaire. Anecdotal information suggested that
students felt they were overloaded with tests and questionnaires and that they did not believe this additional
one was important. Because of the poor quality of the data, the evaluators were unable to provide any
meaningful information about the third-year surgery clerkship’s effectiveness.

In the vignette presented above, the evaluators encountered difficulties for three major reasons:

1. They did not have enough time to collect data on students' achievement before and after the clerkship.

2. They did not have enough time and possibly lacked the skills to prepare the scenarios for a planned
measure.

3. They chose a method of information collection that was not appropriate for the participants as
demonstrated by their unwillingness to complete it.

Data Sources or Measures in Program Evaluation and Their Advantages and Disadvantages

Self-Administered Surveys

In self-administered surveys (or questionnaires), respondents answer questions or respond to items,
whether on paper or online. Typical self-administered questionnaire items might look like those shown in
Figure 5.1, whereas a typical item in an online self-administered questionnaire might look like the one in
Figure 5.2.

Figure 5.1 Examples of Items in a Self-Administered Survey

Figure 5.2 Example of an Item in an Online Self-Administered Survey

Source: Powered by Survey Monkey.

Self-administered surveys in hardcopy format (whether distributed through the mail, in a classroom or
clinic, or in some other way) differ from online surveys because of who does the scoring and data entry. The
responses that participants fill in on paper questionnaires must be scanned into a database or entered into it
by hand, whereas online survey responses are automatically entered into a database, and aspects of the
responses can be analyzed instantly (e.g., number of responses, number of correct responses, how many males,
how many females, etc.). From the evaluator’s perspective, online surveys are efficient because they save
paper, save the time needed for data entry (and the errors that may go with it), and do away with the need to
find storage space for completed forms. However, to maximize the efficiency of online surveys, evaluators
must be sure that they have the technical expertise necessary to prepare and advertise online questionnaires
and also ensure respondent privacy. Because of the strict rules regarding participant privacy, evaluators using
online surveys may need to acquire dedicated computers or tablets and set up firewalls and other protections.
These activities consume resources and require advance planning.

Advantages

• Many people are accustomed to completing questionnaires, regardless of where or how they are
administered.
• Many questions and rating scales are available for adaptation.
• Online questionnaires can reach large groups of people at a relatively low cost.
• Online surveys produce immediate findings.

Disadvantages

• Survey respondents may not always tell the truth.


• The self-administered survey format is not suitable for obtaining explanations of behavior.
• Some respondents may fail to answer some or even all questions, leaving the evaluator with incomplete
information.
• Some people ignore unsolicited online surveys because they think they receive too many requests.
Advance planning is essential to ensure that your survey is not automatically deleted or put into spam or
junk mail.

Achievement Tests

Educational accomplishment is frequently assessed by tests to measure individuals' knowledge,
understanding, and application of theories, principles, and facts. To assess higher levels of learning, such as
the evaluation of evidence or the synthesis of information from varying sources, other methods, such as
observation of performance or analysis of essays and scientific studies, are more appropriate.
Most evaluations of educational programs rely to some extent on achievement tests. Many of these employ
multiple-choice questions, as in the example shown in Figure 5.3.

Record Reviews

Record reviews are analyses of an individual’s documented behavior. The documentation may be in print,
online, or in audio or video form.
Records come in two types. The first type consists of those that are already in place (e.g., a medical record).
The second type is one that the evaluator asks participants to create specifically for a given program. For
example, an evaluator of a nutrition education program may ask participants to keep a diary of how much
food they consume during the study's six-month data collection period.

Figure 5.3 Example of a Multiple-Choice Question on an Achievement Test

Evaluators also use data from existing records because doing so reduces the burden that occurs when people
are required to complete surveys or achievement tests. In other words, existing records are unobtrusive
measures. Why ask people to spend their time answering questions when the answers are already available in
their records? Birth date, sex, place of birth, and other demographic variables are often found in accessible
records and need not be asked directly of people.
Records are also relatively accurate sources of data on behavior. For example, evaluators interested in the
effects of a program on school attendance can ask students, teachers, or parents about attendance. But, school
records are probably more reliable because they do not depend on people’s recall, and they are updated
regularly. For the same reason, records are also good places to find out about actual practices. Which
treatments are usually prescribed? Check a sample of medical records. How many children were sent to which
foster families? Check the foster agency files.
Records are not always easily accessible, and so evaluators should be cautious in relying on them. If an
evaluator plans to use records from more than one site (e.g., two hospitals or four clinics), the evaluation team
may first have to learn how to access the data in each system and then create a mechanism for linking the
data. This learning process can take time, it may be costly, and it is often associated with completing paper
work to guarantee the appropriate uses of the data obtained from the records.
Diaries are a special type of record in which people are specifically asked to record (e.g., in writing or on
film) certain activities. Diaries are typically used in evaluations of programs involving diet, sleep, substance
abuse, and alcohol use. People are notoriously inaccurate when asked to recall how much they ate or drank,
but keeping a diary improves the quality of their information.
Example 5.4 describes the use of diaries in an evaluation of a combined dietary, behavioral, and physical
activity program to treat childhood obesity. The evaluators asked obese children to keep 2-day food records
three times during the course of the study.

Example 5.4 Food Diaries and Obese Children (Nemet et al., 2005)

This evaluation examined the short- and long-term effects of a 3-month program among an experimental and
control group of obese children. All children were instructed on how to keep a 2-day food record and were
evaluated for understanding and accuracy through administration of a 24-hour recall before initiation of the
study. Children kept three 2-day food records (at baseline, at the end of the 3-month program, and 1 year
later). The food record data were reviewed by the project nutritionist and checked for omissions (for example,
whether dressing was used on a salad listed as ingested with no dressing) and errors (for example,
inappropriate portion size). All children completed the baseline, 3-month, and 1-year food records.

Advantages

• Obtaining data from existing records can be relatively unobtrusive because daily activities in schools,
prisons, clinics, and so on need not be disturbed.
• Records are often a relatively reliable storehouse of actual practice or behavior.
• If data are needed on many demographic characteristics (e.g., age, sex, insurance status), records are
often the best source.
• Data obtained from records, such as medical and school records, reduce participants' research burden.

Disadvantages

• Finding information in records is often time consuming. Even if records are electronic, the evaluator
may have to learn how to use multiple systems and go through an elaborate process to gain access to
them.
• Reviews can vary from recorder to recorder. To ensure consistency, each record reviewer needs to be
trained, and more than one reviewer is recommended. The savings from not having to collect new data may
be offset by these other expenses.
• Certain types of information are rarely recorded (e.g., functional or mental status; time spent with clients
“after hours”).
• Records do not provide data on the appropriateness of a practice or on the relationship between what
was done (process) and results (outcomes and impact).

Observations

Observations are appropriate for describing the environment (e.g., the size of an examination room or the
number, types, and dates of magazines in the office waiting room) and for obtaining global portraits of the
dynamics of a situation (e.g., a typical problem-solving session among medical students or a “day in the life”
of a clinic). A commonly used observational technique is the time-and-motion study, where measures are
taken of the amount of time spent by patients and physicians as they go through the health care system.
Usually, time-and-motion studies are used to measure the efficiency of care. Figure 5.4 displays a portion of a
typical observation form.

Advantages

• Observation provides evaluators with an opportunity to collect firsthand information.


• Observation may provide evaluators with information that they did not anticipate collecting.

Disadvantages

• Evaluation personnel must receive extensive training and follow a very structured format in order to
collect dependable observations.
• The process of observation is both a labor-intensive and time-consuming method.
• Observers can influence the environment they are studying, and observed individuals may behave
differently than they might otherwise because they are being watched.

Figure 5.4 Portion of an Observation Form

Interviews

Interviews can be conducted in person or over the telephone. Figure 5.5 presents an excerpt from a typical
patient face-to-face interview questionnaire.

Advantages

• Interviews allow the evaluator to ask respondents about the meanings of their answers.
• Interviews can be useful for collecting information from people who may have difficulty reading or
seeing.

Figure 5.5 Portion of an Interview Form

Disadvantages

• The process of conducting interviews is both a time-consuming and labor-intensive method.


• Interviewers must receive extensive training before they begin interviewing, and their work must be
monitored on an ongoing basis.
• Interviewers may need special skills to interpret responses that are “off the record.”

Computer-Assisted Interviews

Computer-assisted telephone interviewing (CATI) is an efficient method for conducting interviews. With
CATI, the interviewer reads instructions and questions to the respondent directly from a computer monitor
and enters the responses directly into the computer. The computer, not the interviewer, controls the
progression of the interview questions. Because no paper copies of the interview are produced, CATI
eliminates the evaluator’s need to find secure storage space for completed questionnaires.
CATI software programs enable the researcher to enter all telephone numbers and call schedules into the
computer. When an interviewer logs on, he or she is prompted with a list of phone numbers to call, including
new scheduled interviews and callbacks. For example, suppose the interviewer calls someone at 8:00 a.m. but
receives no answer. The CATI program can automatically reschedule the call for some other time. CATI
programs are also available that enable specially trained interviewers to contact respondents with unique
needs. For instance, suppose your study sample consists of people who speak different languages. CATI
allows multilingual interviewers to log on with certain keywords; the computer then directs them to their
unique set of respondents.
The major advantage of CATI is that once the data are collected they are immediately available for
analysis. However, having easy access to data may not always be a blessing. Such availability may tempt some
evaluators to analyze the data before they have completed data collection, and the preliminary results may be
misleading. The main value of easy access to data, especially in the early stages of data collection, is that it
gives the evaluator the means to check on the characteristics of the respondents and to monitor the quality of
the CATI interviewers in obtaining complete data.
Intensive interviewer training is crucial when CATI is used in field studies. Interviewers must first learn
how to use the CATI software and handle computer problems should they arise during the course of an
interview. For instance, what should the interviewer do if the computer freezes? Further, interviewers need to
practice answering the questions that respondents invariably pose regarding the study’s objectives, methods,
human subject protections, and incentives. In fact, given the complexity of CATI, interviewer training may
take up to a week. Evaluators should probably consider using CATI primarily when their investigations are
well funded, because it is a relatively expensive and specialized form of data collection.
CATI takes two forms. In one, the interviews are conducted from a lab, a facility furnished with banks of
telephone calling stations equipped with computers linked to a central server. The costs of building such a lab
are extremely high because it must have soundproof cubicles and either a master computer that stores the data
from the individual computers or linkage to a server. Additional resources are needed to cover the cost of
leasing CATI software and hiring a programmer to install it. Training for this type of CATI is expensive,
because the interviewers require a great deal of practice. Numerous incidental costs are also associated with
establishing a CATI lab, including those for headsets, seats and desks, instructional manuals, and service
contracts for the hardware.
The second type of CATI system uses software programs that are run on laptops or tablets. With this type
of CATI, the evaluator needs only a laptop and wireless connection. This type of CATI is appropriate for
studies with a variety of funding levels because it is portable and relatively inexpensive. The portability of
laptops and tablets, however, raises concerns about privacy. Laptops and tablets are sometimes shared or
stolen, either of which can endanger the confidentiality of participant data. In anticipation of these concerns,
evaluators who use laptops or tablets for CATI should dedicate them to a single study, enforce strict privacy
safeguards, and give interviewers special training to ensure proper CATI implementation and privacy
protection.

Physical Examinations

Physical examinations are invaluable sources of data for health program evaluators because they produce
primary data on health status. However, physical examinations for evaluation purposes may intrude on the
time and privacy of the physician and the patient, and because they are labor intensive, they are an expensive
data source.

Large Databases

Evaluators often use large databases (“big data”) to help program planners explore the need for particular
programs and to set evaluation standards by studying previous performance. Do the data show that students
are underperforming in math? Is every eligible patient receiving flu shots? Large databases can also be used
for observational evaluations. For example, an evaluator can use a school’s statistics database to compare
students who participated last year in Programs A and B. Did Program A improve attendance as anticipated?

Observational evaluations may require the creation of separate data sets. Suppose a database contains
information on students in Programs A, B, and C. The evaluator who wants to compare students only in
Programs A and B will have to create a separate dataset that contains information on just the needed students
and programs.
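A minimal sketch of this step, using the Python pandas library and hypothetical column names (program, attendance_rate), might look like the following.

import pandas as pd

# Hypothetical extract from the school's statistics database.
students = pd.DataFrame({
    "student_id": [101, 102, 103, 104, 105, 106],
    "program": ["A", "B", "C", "A", "B", "C"],
    "attendance_rate": [0.92, 0.85, 0.78, 0.88, 0.90, 0.81],
})

# Create a separate dataset containing only Program A and Program B students.
ab_students = students[students["program"].isin(["A", "B"])].copy()

# Compare mean attendance in the two programs.
print(ab_students.groupby("program")["attendance_rate"].mean())
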
Governments and researchers compile databases to keep track of individual and community health as well
as to describe and monitor large health care systems. Among the most familiar datasets of this kind in the
United States are those compiled by the Centers for Disease Control and Prevention, the Centers for Medicare and Medicaid Services, and the National Center for Education Statistics.
The analysis of data from existing databases is called secondary data analysis. Tutorials for using specific
U.S. databases are available online at the appropriate websites. The National Center for Health Statistics
(http://www.cdc.gov/nchs/), for instance, offers a tutorial for accessing and using the National Health and
Nutrition Examination Survey (NHANES). This database contains information on the health and
nutritional status of adults and children.
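As a sketch of what secondary analysis can look like in practice, the following Python fragment reads a downloaded NHANES demographics file with the pandas library. The file name and variable name shown here are assumptions; confirm them against the NHANES documentation for the survey cycle you are using.

import pandas as pd

# Read a downloaded NHANES demographics file (SAS transport format).
# The file name (DEMO_G.xpt) and the variable RIDAGEYR (age in years)
# are assumptions; check the NHANES documentation for your survey cycle.
demographics = pd.read_sas("DEMO_G.xpt", format="xport")

# A simple secondary analysis: the age distribution of adult respondents.
adults = demographics[demographics["RIDAGEYR"] >= 18]
print(adults["RIDAGEYR"].describe())
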
Evaluators use secondary data because they are comparatively economical. Although professional skill is needed
in creating datasets and analyzing data, the costs in time and money are probably less than the resources
needed for primary data collection. Primary data are collected by evaluators to meet the specific needs of their
project.

Advantages

• Sometimes primary data collection simply is not necessary because existing data are sufficient to solve the problem or answer the question.
• Using existing data can be less expensive and less time consuming than collecting new data.
• Secondary sources of information can yield more accurate data for some variables than can primary sources. For instance, data collected by governments or international agencies from surveys of health behaviors and health problems tend to be accurate because the data collection methods and processes have been refined over time with large numbers of people.
• Secondary data are especially useful in the exploratory phase of large studies or program planning efforts.
They can be used to determine the prevalence of a problem, and to study if certain members of a given
population are more susceptible to the problem than others.

Disadvantages

• The definitions of key variables that the original data collectors use may be different from your
requirements. Definitions of terms like quality of life or self-efficacy may vary considerably from time to
time and country to country.
• The information in the database may not be presented exactly as you need it: The original researchers
collected information on people’s health behaviors in the past year, for example, but you need data on
those behaviors for the last month. The original researchers asked for categorical responses (often,
sometimes, rarely), but you want continuous data (number of drinks per day; blood pressure readings).
• The data in the database may have come from a community that is somewhat different from your
community in age, socioeconomic status, health behaviors, health policies, and so on.
• The reliability of published statistics can vary over time. Systems for collecting data change.
Geographical or administrative boundaries are changed by government, or the basis for stratifying a
sample is altered.
• The data that are available may be out of date by the time you gain access to them. Large databases are typically updated only periodically, say every five to ten years. This time lag can be significant if new policies were put in place between the original data collection and the evaluator's access.

Vignettes

A vignette is a short scenario used to collect data about "what if" situations. Example 5.6 describes how a group of researchers explored the impact of a doctor's ethnicity, age, and gender on patients' judgments. Study participants were given one of eight photos of a "doctor" that varied in terms of ethnic group (Asian versus White), age (older versus younger), and gender (male versus female).

Example 5.6 Vignettes: Influence of Physicians' Ethnicity, Age, and Sex

The evaluators used a factorial design involving photographs of a doctor who varied in terms of ethnicity
(White versus Asian), age (old versus young), and sex (male versus female). This required eight separate
photographs. The age groups were defined broadly with “young” composed of doctors aged between 25 and
35 and “old” comprising doctors aged between 50 and 65. Patients were asked to rate the photograph in terms
of expected behavior of the doctor, expected behavior of the patient, and patient ease with the doctor. Eight individuals who fit the required ethnicity, age, and gender criteria were identified. Background features, lighting, hairstyle, expression, makeup, and so forth were kept as consistent as possible. Photographs were
taken using a digital camera.

Participants were presented with one of the eight photographs followed by this statement: “Imagine that you
have been feeling tired and run down for a while; you see this doctor for the FIRST time.” They were then
asked to rate the picture (doctor), using scales ranging from “not at all” [1] to “extremely” [5] in terms of
three broad areas: expected behavior of the doctor, expected behavior of the patient, and expected ease of the
patient.
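The eight photographs correspond to a fully crossed 2 × 2 × 2 factorial design. A short, hypothetical sketch such as the following (not part of the original study) generates the combinations and assigns vignettes to participants in a balanced way.

import itertools
import random

ethnicity = ["White", "Asian"]
age_group = ["younger", "older"]
sex = ["male", "female"]

# All eight combinations of the three two-level factors.
vignettes = list(itertools.product(ethnicity, age_group, sex))

# Assign vignettes so that each photo is seen about equally often.
participants = ["P{:03d}".format(i) for i in range(1, 41)]
random.shuffle(participants)
assignments = {person: vignettes[i % len(vignettes)]
               for i, person in enumerate(participants)}
print(len(vignettes))          # 8
print(assignments["P001"])     # e.g., ('Asian', 'older', 'female')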

Advantages

• Vignettes can be fun for participants.


• Vignettes can be efficient. They enable the researcher to vary important factors (e.g., age and gender) one at a time. Not every research participant has to review a scenario containing every factor, as long as each participant reviews some factors and every factor is reviewed by someone.

Disadvantages

• Producing vignettes requires technical and artistic (writing) skills if the results are to be convincing.
• Sampling can become complicated when both the factors and the participants vary.
• Vignettes are hypothetical. The scenarios they describe may never occur, and, even if they do, the person
responding to them may not act as indicated in the hypothetical setting. Vignettes are self-reports, not
actual behavior.

The Literature

Evaluators turn to the literature for reasons that range from gathering ideas for research designs and data
collection and analysis methods to comparing data and conclusions across research. The broad term the
literature refers to all published and unpublished reports of studies or statistical findings. Published reports are
often easier to locate than those that remain unpublished, because they have appeared in books and journals
that are accessible in libraries and online. In addition, since published reports have the advantage of public
scrutiny, their authors’ methods (and their conclusions) are likely to be more dependable. Reports on
evaluation studies may be published in peer-reviewed journals, in books, or as stand-alone reports or
monographs produced by local, state, or national agencies.
Program evaluators generally use the literature for the following reasons.

Reasons for Evaluators’ Use of the Literature

1. To identify and justify evidence of effectiveness: The literature can provide information on the past
performance of programs and populations. Evaluators may use this information in planning an evaluation
and as a yardstick against which to compare the findings of an evaluation once it has been completed.

2. To define variables: The literature is a primary source of information about the ways others have defined
and measured commonly used key variables, such as child abuse and neglect; high-risk behaviors;
comorbid conditions; social, physical, and emotional functioning; quality of care; and quality of life.

3. To determine sample size: Power calculations, which evaluators use to arrive at sample sizes that are large
enough to reveal true differences in programs (if they exist), require estimation of the variance—a measure
of dispersion—in the sample or population. Sometimes, however, evaluators have no readily available data
on the variance in the sample of interest. They can conduct a pilot study to obtain the data, but appropriate data may be available in the literature, enabling the evaluators to build on and expand the work of others (a sample size calculation of this kind is sketched after this list).

4. To obtain examples of designs, measures, and ways of analyzing and presenting data: Evaluators can use the
literature as a source of information on research methods and data collection, analysis, and reporting
techniques.

5. To determine the significance of the evaluation and its findings: Evaluators often use the literature to justify
the need for particular programs and for their evaluation questions. They also use the literature to show
whether their evaluation findings confirm or contradict the results of other studies and to identify areas in
which little or no knowledge is currently available.

6. To conduct meta-analyses: Meta-analysis is a technique in which the results of two or more randomized
controlled trials are pooled and reanalyzed. The idea is that by combining the results of relatively small,
local studies, one can increase the power of the findings. The use of meta-analysis is predicated on data
from experimental studies. Given that not all experiments are of equal quality, understanding how to
review and interpret the literature is an important first step in conducting a meta-analysis.
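As noted in reason 3 above, variance estimates taken from the literature feed directly into sample size calculations. The following sketch uses the statsmodels library; the mean difference and standard deviation are hypothetical values of the kind an evaluator might extract from published studies.

from statsmodels.stats.power import TTestIndPower

# Hypothetical values drawn from published studies: an expected difference of
# 5 points between programs and a standard deviation of 12 points.
mean_difference = 5.0
standard_deviation = 12.0
effect_size = mean_difference / standard_deviation   # standardized effect (Cohen's d)

# Sample size per group for a two-group comparison with 80% power, alpha = .05.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size, alpha=0.05,
                                   power=0.80, alternative="two-sided")
print(round(n_per_group))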

Guidelines for Reviewing the Literature

As an evaluator, you should take the following six steps in conducting a literature review:

1. Assemble the literature.

2. Identify inclusion and exclusion criteria.

3. Select the relevant literature.

4. Identify the “best” literature.

5. Abstract the information.

6. Consider the non-peer-reviewed literature.

Assemble the Literature

Hundreds of online bibliographic databases are available to program evaluators. A commonly used health-
related database is the National Library of Medicine’s database, PubMed. Other databases include ERIC
(Educational Resources Information Center) and the Web of Science. Some databases are proprietary—that is, they require a paid subscription, which many universities and large public agencies provide for their members. One such database is
the American Psychological Association’s PsycINFO, which includes citations from the literature of
psychology and the social sciences.
The key to an efficient literature search is specificity. If you search for all articles published by Jones from
1980 through 2013 in English about evaluations of the quality of education in the United States, you are
much more likely to get what you want than if you search for all published evaluations of the quality of
education in the United States. If Jones has published articles about evaluations of U.S. education that have
appeared in the public policy or social science literature, then certain articles may not turn up if you rely solely
on a search of an education-related database. To conduct a more comprehensive search, you should
investigate all potentially relevant databases, scrutinize the references in key articles, and ask experts in the
field to recommend references and bibliographic databases.
To search any database efficiently, you must carefully specify the variables and populations that pertain to
your interest. A first step is to decide on specific criteria for including a study in the literature review or
excluding it from the literature review. Once you have established these criteria, you can employ terms that
describe these criteria to guide the search. These are called search terms or keywords.
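Many bibliographic databases can also be searched programmatically. The sketch below queries PubMed through the National Library of Medicine's public E-utilities interface; the keywords are illustrative only, and a real search would be tailored to your inclusion and exclusion criteria.

import requests

# Search PubMed through the National Library of Medicine's E-utilities service.
# The search terms below are illustrative only.
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",
    "term": '"program evaluation" AND "prenatal care"',
    "retmax": 100,
    "retmode": "json",
}
response = requests.get(url, params=params)
result = response.json()["esearchresult"]
print(result["count"])     # number of matching citations
print(result["idlist"])    # PubMed IDs of the first 100 matches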

Identify Inclusion and Exclusion Criteria

The inclusion and exclusion criteria define whether a study is appropriate or inappropriate for review.
These criteria usually include attention to the variables and populations of concern, where and when the study
was published, and its methodological quality, as can be seen in Example 5.7, which illustrates the use of
inclusion and exclusion criteria for a review of the effectiveness of prenatal care programs.

Example 5.7 Inclusion and Exclusion Criteria for a Review of Evaluated Prenatal Care Programs

The evaluators decide to include in their review of the literature on prenatal care programs any evaluations of
programs aiming to integrate medical and social services to improve the health outcomes of mothers and
newborns. They select only published evaluations because the editorial review process screens out the poorest
studies. They choose 2000 as a starting point so that their data will reflect fifteen years of the increased accessibility to prenatal care that began twenty years ago. The evaluators exclude research that was primarily medical (e.g., aggressive treatment of preterm births) or psychosocial (e.g., improving mothers' self-esteem), as well as research that focused on the organization of the medical care system (e.g., centralizing a region's prenatal care).

With these inclusion and exclusion criteria as guides, the evaluators in Example 5.7 can focus on a search
for studies published from 2000 forward and in all languages. Their search terms will probably include
“program evaluation,” “prenatal care,” and “health outcomes.”
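Criteria like those in Example 5.7 can be applied systematically once candidate studies have been retrieved. The following sketch screens a small set of hypothetical study records and notes the reason for each exclusion.

# Hypothetical candidate studies identified by the search.
candidate_studies = [
    {"id": 1, "year": 2005, "published": True,  "focus": "integrated medical and social services"},
    {"id": 2, "year": 1997, "published": True,  "focus": "integrated medical and social services"},
    {"id": 3, "year": 2010, "published": False, "focus": "integrated medical and social services"},
    {"id": 4, "year": 2008, "published": True,  "focus": "aggressive treatment of preterm births"},
]

EXCLUDED_FOCI = {
    "aggressive treatment of preterm births",      # primarily medical
    "improving mothers' self-esteem",              # primarily psychosocial
    "centralizing a region's prenatal care",       # organization of the care system
}

def screen(study):
    """Return (include, reason) for one candidate study."""
    if study["year"] < 2000:
        return False, "published before 2000"
    if not study["published"]:
        return False, "not a published evaluation"
    if study["focus"] in EXCLUDED_FOCI:
        return False, "excluded focus"
    return True, "meets all inclusion criteria"

for study in candidate_studies:
    include, reason = screen(study)
    print(study["id"], "include" if include else "exclude", "-", reason)
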
Example 5.8 shows some of the search terms used in reviews of the literature in two studies. The first
search compared the literature (randomized controlled trials only) to experts’ recommendations for treating
myocardial infarction (heart attack), and the second reviewed the efficacy of treatments for posttraumatic
stress disorder in victims of rape, combat veterans, torture victims, and the tragically bereaved.

Example 5.8 Search Terms Used in Two Studies to Search Electronic Bibliographic Databases

1. To search the literature on treatments for myocardial infarction: “myocardial infarction,” “clinical trials,”
“multicenter studies,” “double-blind method,” “meta-analysis,” “random”

2. To search the literature on posttraumatic stress disorder: “traumatic stress,” “treatment,” “psychotherapy,”
“flooding,” “PTSD,” “behavior therapy,” “pharmacotherapy,” “drugs,” “cognitive therapy”

Select the Relevant Literature

After you have conducted your searches and assembled the articles you found, you will usually need to
screen the articles for irrelevant material. Because few searches (or searchers) are perfect, they invariably turn
up studies that do not address the topic of interest or that are methodologically unsound.
Screening often consists of applying methodological criteria. For example, you may screen out reports of
studies that did not use control groups, those that include data from unreliable sources, those that present
insufficient data on some important variable, or those where the findings are preliminary.

Identify the Best Available Literature

Regardless of the scope of your literature review, you must employ a method that distinguishes among
articles with differing levels of quality. Selecting the best literature means identifying the studies with the least
bias.
At least two individuals are needed to make an adequate appraisal of a study’s quality. These individuals
should be given a definition of the parameters of quality and should be trained to apply that definition. Before
they begin the formal review, they should test out their understanding of the system by collaborating on the
appraisal of the quality of one to ten articles. A third knowledgeable person can act as adjudicator in cases of
disagreement.
As an example, Figure 5.6 presents the set of methodological features and definitions that the evaluators
from Example 5.7, shown above, used to appraise the quality of the published literature on prenatal care
program evaluations.
The evaluator must decide how a study is to be categorized as “best available”—for example, does it have to
meet all the criteria or some fraction (say, a score of more than 5 of 8), or does it have to achieve certain
minimum criteria (such as random assignment and valid data collection)? Needless to say, these choices are
somewhat arbitrary, and the evaluator must be able to defend their merits on a case-by-case basis.
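One way to operationalize such a decision rule is to score each study on the methodological features and then apply a threshold, a set of minimum criteria, or both. The sketch below uses hypothetical feature names and ratings; in practice the features would be those defined in Figure 5.6.

# Hypothetical methodological features, rated 1 (present) or 0 (absent).
features = ["random assignment", "valid data collection", "adequate sample size",
            "low attrition", "blinded outcome assessment", "appropriate analysis",
            "clear eligibility criteria", "prespecified outcomes"]

study_ratings = {
    "Study A": dict(zip(features, [1, 1, 1, 1, 0, 1, 1, 1])),
    "Study B": dict(zip(features, [0, 1, 1, 0, 0, 1, 1, 0])),
}

MINIMUM_CRITERIA = {"random assignment", "valid data collection"}
THRESHOLD = 5   # must score more than 5 of the 8 features

for name, ratings in study_ratings.items():
    score = sum(ratings.values())
    meets_minimum = all(ratings[f] == 1 for f in MINIMUM_CRITERIA)
    best_available = meets_minimum and score > THRESHOLD
    print(name, score, "best available" if best_available else "not selected")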

Figure 5.6 Rating Methodological Features of the Literature: Identifying the Best Published Prenatal
Care Program Evaluations

An important component of the process of deciding on a study’s quality is agreement among reviewers. To
identify the extent of agreement, one can examine each item in the review independently or compare
reviewers’ scores on the entire set of items.
When two or more persons measure the same item and their measurements are compared, an index of
interrater reliability is obtained. One statistic that is often used in deciding on the degree of agreement between
two reviewers on a dichotomous variable (valid data were collected: yes or no) is kappa (κ). The statistic used
to examine the relationship between two numerical characteristics (Reviewer A’s score of 5 points versus
Reviewer B’s score of 7 points) is correlation.
Not all literature reviews are done by two or more persons. When only one person conducts a review,
intrarater reliability can be calculated using kappa or correlation. In this procedure, a single reviewer rates or
scores a selection of articles at least twice and the results are compared.
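Both statistics are straightforward to compute with standard scientific software. The following sketch uses hypothetical ratings from two reviewers, the scikit-learn implementation of kappa, and the SciPy implementation of the Pearson correlation.

from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Dichotomous ratings from two reviewers (1 = valid data were collected, 0 = not).
reviewer_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
reviewer_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]
print("kappa:", round(cohen_kappa_score(reviewer_a, reviewer_b), 2))

# Numerical quality scores (0 to 10) assigned by the same two reviewers.
scores_a = [5, 7, 8, 4, 9, 6, 7, 3, 8, 5]
scores_b = [6, 7, 7, 5, 9, 5, 8, 4, 8, 6]
r, p_value = pearsonr(scores_a, scores_b)
print("correlation:", round(r, 2))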

Abstract the Information

The most efficient way to get data from the literature is to standardize the review process. A uniform
abstraction system guards against the possibility that some important information will be missed, ignored, or
misinterpreted. Abstraction forms often look like survey questionnaires (Figure 5.7).

Consider the Non-Peer-Reviewed Literature

Various analyses of the published literature have suggested the existence of a bias in favor of positive
results. This means that if a review is based solely on published, peer-reviewed evaluations, negative findings
may be underrepresented. It is not easy, however, to locate and gain access to unpublished evaluation reports.
Published studies have been reviewed by experts as well as the evaluator’s peers and colleagues, and probably
the most unreliable studies have been screened out. Nevertheless, when you are interpreting the findings from
the published literature, you should consider the potential biases that may exist because unpublished reports
are excluded. Unpublished reports include monographs, dissertations, blogs, and postings on professional research sites. Statistical methods are available to account for publication bias. They are controversial, however, partly
because they may create false positives (finding a benefit when none exists) under some circumstances.

Figure 5.7 Excerpts From a Form to Guide Reviewers in Abstracting Literature on Alcohol Misuse in
Older Adults

Summary and Transition to the Next Chapter on Evaluation Measures

This chapter explains the factors that the evaluator should take into account when deciding on measures or
sources of data. First, the evaluator must carefully review the evaluation question so that all variables are
known and clarified. Then, the evaluator considers the array of possible measures that are likely to provide
information on each variable. Among the possibilities are self-administered questionnaires, achievement and
performance tests, record reviews, observations and interviews, physical examinations, vignettes, and existing
databases. Each data source or measure has advantages and disadvantages. For example, interviews enable the
evaluator to question program participants in a relatively in-depth manner, but conducting them can be time
consuming and costly. Self-administered questionnaires may be less time consuming and costly than
interviews, but they lack that intimacy, and the evaluator must always worry about the response
rate. In selecting a source of data for your evaluation, you must look first at the evaluation question and decide
whether you have the technical and financial resources and the time to develop your own measure; if not, you
must consider adopting or adapting an already existing measure.
The next chapter addresses the concepts of reliability (the consistency of the information from each source)
and validity (the accuracy of the information). The chapter also discusses the steps you need to take to
develop and validate your own measure and outlines the activities involved in selecting or adapting a measure
that has been developed by other evaluators. Among the additional topics addressed in the next chapter are
“coding” (that is, making certain that your measure is properly formatted or described from the point of view
of the person who will enter the data into the computer) and ensuring that the components of the evaluation
are logically linked. A measurement chart is presented that is helpful for portraying the logical connections
among the evaluation’s variables and measures.

Exercises

Exercise 1

Directions

Match each of the following descriptions with the appropriate measure.

Exercise 2

Directions

Locate the following evaluations and name the types of measures used in each.

a. King, C. A., Kramer, A., Preuss, L., Kerr, D. C. R., Weisse, L., & Venkataraman, S. (2006). Youth-
nominated support team for suicidal adolescents (version 1): A randomized controlled trial. Journal of
Consulting and Clinical Psychology, 74, 199–206.

b. Garrow, D., & Egede, L. E. (2006). Association between complementary and alternative medicine use,
preventive care practices, and use of conventional medical services among adults with diabetes. Diabetes
Care, 29(1), 15–19.

c. Lieberman, P. M. A., Hochstadt, J., Larson, M., & Mather, S. (2005). Mount Everest: A space
analogue for speech monitoring of cognitive deficits and stress. Aviation Space and Environmental
Medicine, 76, 1093–1101.

Exercise 3

Directions

You are the evaluator of a prenatal care program. One of the evaluation questions asks about characteristics
of the babies born to participating mothers. The project routinely collects data on mothers and babies, but
you need data on birth weight, gestational age, whether the baby was stillborn, whether a drug toxicology
screen was performed at birth on the mother and the baby (and, if so, the results), the number of prenatal care
visits, and the birth date. Create a form that will aid you in obtaining this information.

References and Suggested Readings

Bailey, E. J., Kruske, S. G., Morris, P. S., Cates, C. J., & Chang, A. B. (2008, April). Culture-specific
programs for children and adults from minority groups who have asthma. Cochrane Database of Systematic
Reviews, 2, 1–22.
Dane, F. C. (2011). Evaluating research: Methodology for people who need to read research. Thousand Oaks, CA:
Sage.
Fink, A. (2013). Conducting research literature reviews: From the Internet to paper. Thousand Oaks, CA: Sage.
Fink, A., Parhami, I., Rosenthal, R. J., Campos, M. D., Siani, A., & Fong, T. W. (2012). How transparent
is behavioral intervention research on pathological gambling and other gambling-related disorders? A
systematic literature review. Addiction, 107(11), 1915–1928.
Hoffler, T. N., & Leutner, D. (2007). Instructional animation versus static pictures: A meta-analysis.
Learning and Instruction, 17, 722–738.
Hofmann, S. G., & Smits, J. A. (2008). Cognitive-behavioral therapy for adult anxiety disorders: A meta-
analysis of randomized placebo-controlled trials. Journal of Clinical Psychiatry, 69, 621–632.
Lemstra, M., Neudorf, C., D’Arcy, C., Kunst, A., Warren, L. M., & Bennett, N. R. (2008). A systematic
review of depressed mood and anxiety by SES in youth aged 10–15 years. Canadian Journal of Public
Health, 99(2), 125–129.
Nemet, D., Barkan, S., Epstein, Y., Friedland, O., Kowen, G., & Eliakim, A. (2005). Short- and long-term
beneficial effects of a combined dietary-behavioral-physical activity intervention for the treatment of
childhood obesity. Pediatrics, 115(4), 443–449.
Nigg, J. T., Lewis, K., Edinger, T., & Falk, M. (2012). Meta-analysis of attention-deficit/hyperactivity
disorder or attention-deficit/hyperactivity disorder symptoms, restriction diet, and synthetic food color
additives. Journal of the American Academy of Child and Adolescent Psychiatry, 51(1), 86–97. doi:
10.1016/j.jaac.2011.10.015
Reynolds, S., Wilson, C., Austin, J., & Hooper, L. (2012). Effects of psychotherapy for anxiety in children
and adolescents: A meta-analytic review. Clinical Psychology Review, 32(4), 251–262. doi:
10.1016/j.cpr.2012.01.005
Shor, E., Roelfs, D. J., Bugyi, P., & Schwartz, J. E. (2012). Meta-analysis of marital dissolution and
mortality: Reevaluating the intersection of gender and age. Social Science & Medicine, 75(1), 46–59. doi:
10.1016/j.socscimed.2012.03.010
Siegenthaler, E., Munder, T., & Egger, M. (2012). Effect of preventive interventions in mentally ill parents
on the mental health of the offspring: Systematic review and meta-analysis. Journal of the American
Academy of Child and Adolescent Psychiatry, 51(1), 8–17. doi: 10.1016/j.jaac.2011.10.018
Wood, S., & Mayo-Wilson, E. (2012). School-based mentoring for adolescents: A systematic review and
meta-analysis. Research on Social Work Practice, 22(3), 257–269. doi: 10.1177/1049731511430836

Purpose of This Chapter

An evaluation’s data sources or measures include surveys, achievement tests, observations, record
reviews, and interviews. Reliable and valid measures are the foundation of unbiased evaluation
information. A reliable measure is consistent in its findings, and a valid one is accurate. This
chapter discusses measurement reliability and validity in detail. It also explains how to develop reliable and valid
new measures, select appropriate measures among those that are currently available, and create a
measurement chart to establish logical connections among the evaluation’s questions and
hypotheses, design, and measures.

6
Evaluation Measures

A Reader’s Guide to Chapter 6

Reliability and Validity


Reliability and validity

A Note on Language: Data Collection Terms

Checklist for Creating a New Evaluation Measure


Boundaries, subject matter, content, item choices, rating scales, expert review, revision, format, and
testing

Checklist for Selecting an Already Existing Measure


Costs, content, reliability, validity, and format

The Measurement Chart: Logical Connections

Summary and Transition to the Next Chapter on Managing Evaluation Data

Exercises

References and Suggested Readings

Reliability and Validity

A measure is the specific data source or instrument that evaluators use to collect data. For instance, the 25-
item Open City Online Survey of Educational Achievement is a measure. The data source or data collection
method is an online survey. The measure is the Open City Online Survey of Educational Achievement. The
term metric is sometimes used to describe how a concept is measured. For instance, to measure the school
library’s usefulness, an evaluator may ask patrons to use a rating scale with choices ranging from 1 (not useful)
to 5 (extremely useful), or the evaluator may ask librarians to count the number of people who use the library
over a two-year period. In the first example, the metric is a rating scale, and in the second, the metric is a
count. Data collection measures take several formats (self-administered questionnaires or face-to-face
interviews) and rely on a variety of metrics (rating scales, ranks).
Evaluators sometimes create their own measures, and sometimes they adapt parts or all of already existing
measures. Because the conclusions of an evaluation are based on data from the measures used, the quality of
the measures must be demonstrably high for the evaluation’s results to be unbiased. (Otherwise, we have the
well-known phenomenon of “garbage in–garbage out.”) To determine the quality of their data collection
measures, evaluators must understand the concepts of reliability and validity.

Reliability

A reliable measure is one that is relatively free of “measurement error,” which causes individuals’ obtained
scores to be different from their true scores (which can be obtained only through perfect measures). What
causes measurement error? In some cases, it results from the measure itself, as when the measure is difficult to
understand or poorly administered. For example, a self-administered questionnaire regarding the value of
preventive health care can produce unreliable results if it requires a level of reading ability that is beyond that
of the targeted audience—in this case, the teen mothers are the intended target. If the respondents’ reading
level is not a problem, but the directions are unclear, the measure will also be unreliable. Of course, the
evaluator can simplify the measure’s language and clarify the directions and still find measurement error
because such error can also come directly from the people being questioned. For example, if the teen mothers
who are asked to complete the questionnaire are especially anxious or fatigued at the time, their obtained
scores could differ from their true scores.
In program evaluation, four kinds of reliability are often discussed: test-retest, equivalence, homogeneity,
and interrater and intrarater reliability. A measure has test-retest reliability if the correlation between scores on
the measure from one time to another is high. Suppose a survey of patient satisfaction is administered in
April and again in October to the same group of patients at Hospital A. If the survey is reliable, and no
special program or intervention has been introduced between April and October, on average, we would expect
satisfaction to remain the same. The major conceptual difficulty in establishing test-retest reliability is in
determining how much time is permissible between the first and second administrations. If too much time
elapses, external events might influence responses on the second administration; if too little time passes, the
respondents may remember and simply repeat their answers from the first administration.
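In practice, test-retest reliability is usually reported as the correlation between the two administrations. A minimal sketch with hypothetical satisfaction scores for the same ten patients:

from scipy.stats import pearsonr

# Hypothetical satisfaction scores (0 to 100) for the same ten patients.
april_scores = [72, 85, 64, 90, 78, 69, 88, 75, 81, 93]
october_scores = [70, 88, 66, 87, 80, 71, 85, 74, 83, 90]

r, p_value = pearsonr(april_scores, october_scores)
print("test-retest reliability (r):", round(r, 2))
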
Equivalence, or alternate-form reliability, refers to the extent to which two assessments measure the same
concepts at the same level of difficulty. Suppose that students were given an achievement test before
participating in a course and then again 2 months after completing it. Unless the evaluator is certain that the
two tests are of equal difficulty, better performance on the second test could be because the second test was
easier than the first rather than improved learning. And, as with test-retest reliability, because this approach
requires two administrations, the evaluator must worry about the appropriate interval between them.
As an alternative to establishing equivalence between two forms of the same measure, evaluators sometimes
compute a split-half reliability. This requires dividing the measure into two equal halves (or alternate forms)
and obtaining the correlation between the two halves. Problems arise if the two halves are not equivalent and so produce different results.
Homogeneity refers to the extent to which all the items or questions in a measure assess the same skill,
characteristic, or quality. Sometimes this type of reliability is referred to as internal consistency. The extent of
homogeneity is often determined through the calculation of Cronbach’s coefficient alpha, which is basically
the average of all the correlations between each item and the total score. For example, suppose that an
evaluator has created a questionnaire to find out about patients’ satisfaction with Hospital A. An analysis of
homogeneity will tell the extent to which all items on the questionnaire focus on satisfaction (rather than
some other variable like knowledge).
Some variables do not have a single dimension. Patient satisfaction, for example, may consist of satisfaction
with many elements of the hospital experience: nurses, doctors, financial arrangements, quality of care, quality
of surroundings, and so on. If you are unsure of the number of dimensions included in your instrument, you
can perform a factor analysis. This statistical procedure identifies relationships among the items or questions
in a measure.
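Coefficient alpha can be computed directly from item-level data. The following sketch uses hypothetical responses from six patients to a four-item satisfaction scale.

import numpy as np

# Hypothetical responses: six patients by four satisfaction items, each rated 1 to 5.
items = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [3, 4, 3, 3],
])

k = items.shape[1]                            # number of items
item_variances = items.var(axis=0, ddof=1)    # variance of each item
total_variance = items.sum(axis=1).var(ddof=1)

# Cronbach's coefficient alpha.
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(round(alpha, 2))
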
Interrater reliability refers to the extent to which two or more individuals agree on a given measurement.
Suppose that two individuals are sent to a prenatal care clinic to observe patient waiting times, the appearance
of the waiting and examination rooms, and the general atmosphere. If these observers agree perfectly in their
ratings of all these items, then interrater reliability is perfect. Evaluators can enhance interrater reliability by
training data collectors thoroughly, providing them with guidelines for recording their observations,
monitoring the quality of the data collection over time to ensure that data collectors are not “burning out,”
and offering data collectors opportunities to discuss any difficult issues or problems they encounter in their
work.
Intrarater reliability refers to a single individual’s consistency of measurement. Evaluators can also enhance
this form of reliability through training, monitoring, and continuous education.

Validity

Validity refers to the degree to which a measure assesses what it purports to measure. For example, a test
that asks students to recall information would be considered an invalid measure of their ability to apply
information. Similarly, a survey of patient satisfaction cannot be considered valid unless the evaluator can
prove that the people who are identified as satisfied on the basis of their responses to the survey think or
behave differently than people who are identified as dissatisfied.
Content validity refers to the extent to which a measure thoroughly and appropriately assesses the skills or
characteristics it is intended to measure. For example, an evaluator who is interested in developing a measure
of quality of life for cancer patients has to define quality of life and then make certain that items in the
measure adequately include all aspects of the definition. Because of the complexity of this task, evaluators
often consult the literature for theories, models, or conceptual frameworks from which to derive the
definitions they need. A conceptual model of “quality of life,” for instance, consists of the variables included
when the concept is discussed in differing kinds of patients, such as those with cancer, those who are
depressed, and those who are very old or very young. It is not uncommon for evaluators to make a statement
like the following in establishing content validity: “We used XYZ cognitive theory to select items on
knowledge, and we adapted the ABC Role Model Paradigm for questions about social relations.”
Face validity refers to how a measure appears on the surface: Does it seem to ask all the needed questions?
Is the language used in the questions both appropriate and geared to the respondents’ reading level? Face
validity, unlike content validity, does not rely on established theory for support.
Criterion validity is made up of two subcategories: predictive validity and concurrent validity. Predictive
validity is the extent to which a measure forecasts future performance. A medical school entry examination
that successfully predicts who will do well in medical school has predictive validity. Concurrent validity is
demonstrated when two assessments agree or a new measure compares favorably with one that is already
considered valid. For example, to establish the concurrent validity of a new aptitude test, the evaluator can
administer the new measure as well as a validated measure to the same group of examinees and compare the
scores. Alternatively, the evaluator can administer the new test to the examinees and compare the scores with
experts’ judgment of students’ aptitude. A high correlation between the new test and the criterion measure
indicates concurrent validity. Establishing concurrent validity is useful when the evaluator has created a new
measure that he or she believes is better (e.g., shorter, cheaper, fairer) than any previously validated measure.
Construct validity, which is established experimentally, demonstrates that a measure distinguishes between
people who have certain characteristics and those who do not. For example, an evaluator who claims construct
validity for a measure of compassionate nursing care has to prove in a scientific manner that nurses who do
well on the measure are more compassionate nurses than nurses who do poorly. Construct validity is
commonly established in at least two ways:

1. The evaluator hypothesizes that the new measure correlates with one or more measures of a similar
characteristic (convergent validity) and does not correlate with measures of dissimilar characteristics
(discriminant validity). For example, an evaluator who is validating a new quality-of-life measure might
hypothesize that it is highly correlated with another quality-of-life instrument, a measure of
functioning, and a measure of health status. At the same time, the evaluator might hypothesize that the
new measure does not correlate with selected measures of social desirability (e.g., the tendency by
individuals to answer questions in a manner that portrays them in a positive light) or selected measures
of hostility.

2. The evaluator hypothesizes that the measure can distinguish one group from the other on some
important variable. For example, a measure of compassion should be able to demonstrate that people
who are high scorers are compassionate and that people who are low scorers are unfeeling. This requires
that the evaluator translate a theory of compassionate behavior into measurable terms, identify people
who are compassionate and people who are unfeeling (according to the theory), and prove that the
measure consistently and correctly distinguishes between the two groups.
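Both approaches can be examined with simple statistics: correlations for convergent and discriminant validity, and a group comparison for the known-groups test described in the second approach. The scores below are hypothetical.

from scipy.stats import pearsonr, ttest_ind

# Convergent validity: the new quality-of-life measure should correlate with an
# established quality-of-life instrument (hypothetical scores).
new_measure = [55, 62, 70, 48, 66, 59, 73, 51]
established_measure = [58, 60, 72, 45, 68, 61, 75, 49]
print("convergent r:", round(pearsonr(new_measure, established_measure)[0], 2))

# Discriminant validity: it should not correlate with social desirability.
social_desirability = [12, 15, 9, 14, 11, 13, 10, 16]
print("discriminant r:", round(pearsonr(new_measure, social_desirability)[0], 2))

# Known-groups comparison: scores on a compassion measure for nurses identified
# (by theory) as compassionate versus unfeeling should differ significantly.
compassionate_group = [82, 78, 85, 90, 76, 88]
unfeeling_group = [55, 60, 48, 62, 58, 51]
t_statistic, p_value = ttest_ind(compassionate_group, unfeeling_group)
print("t =", round(t_statistic, 2), ", p =", round(p_value, 4))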

A Note on Language: Data Collection Terms

The language used to discuss reliability and validity (terms, such as examinees, scores, scales, tests, and measures)
comes from test theory, or psychometrics. Program evaluators often use the terms data source, measure, scale, test,
and instrument interchangeably. As you can imagine, this is sometimes confusing, especially because
evaluators also talk about outcome measures and outcome indicators when referring to evaluation study outcomes.
The following brief lexicon is helpful for sorting out data collection terms.

A Guide to Data Collection Terms

• Data source: Any source of information for the evaluation. This may include data from questionnaires or
tests, literature reviews, existing databases, and vital statistics (such as the number of live births in a given
year).
• Index: A way to rank the order of things. Scores on an index of function give an indication of where people
stand in relation to one another. This term is sometimes used interchangeably with scale.
• Instrument: A device or strategy used to collect data; instruments include laboratory tests, self-administered
questionnaires, and interviews. This term is often used interchangeably with measure.
• Measure: This term is often used interchangeably with instrument, test, and assessment. Measures are often
the specific devices used in an evaluation, such as the 25-item Open City Survey of Student Achievement.
• Metric: A way of measuring concepts. When asking for a rating of a commodity on a scale of 1 (excellent)
to 5 (poor), the evaluator is using a metric in which respondents put an order to their perceptions of the
commodity. If the evaluator uses scores on an achievement test to evaluate an education program, those
scores are then the metric.
• Outcome: The consequences of participating in a program. Outcomes may be changes in areas, such as
health status and emotional well-being.
• Outcome measure or outcome indicator: Often used as a synonym for outcome.
• Rating scale: A graded set of choices. Scales may be nominal (or categorical) (e.g., race, gender), ordered or
ordinal (with response categories, such as often, sometimes, and never; or Stages I, II, and III of a disease),
or numerical, including continuous (age, height) and discrete (number of pregnancies, number of arrests
for driving drunk). The most commonly used type of rating scale is the Likert scale, with response
categories, such as strongly agree, agree, neither agree nor disagree, disagree, and strongly disagree.
• Scale: A combination of items or questions that measure the same concept, such as a 10-item scale that
measures emotional well-being or a 36-item scale that measures health status.
• Test: Achievement test, laboratory test.

Checklist for Creating a New Measure

Knowing the types of measures that are available and how to demonstrate their reliability and validity enables
the evaluator to get down to the serious business of developing a measure that is tailored to the needs of the
investigation, or selecting and adapting one that is already in use. Before you attempt to create a new measure
for your evaluation study, you must make certain that you have identified the domain of content (through
observation or with the help of experts, research, and theory) and have the expertise, time, and money to
complete the task. The following is a checklist of the basic steps you need to take in creating a new measure.

1. Set boundaries.

Decide on the type of measure (e.g., questionnaire, observation).


Determine the amount of needed and available time for administration and scoring (e.g., a 15-minute
interview and 10 minutes for summarizing responses).

Select the kinds of reliability and validity information to be collected (e.g., to establish alternate-form
reliability, you must develop two forms; to establish concurrent validity, you need an already existing
instrument).

2. Define the subject matter or topics that will be covered. For definitions, consult the literature, experts,
or health care consumers. Example 6.1 illustrates how the definitions for an evaluation of prenatal care
programs were found in the literature and corroborated by nurses, physicians, and evaluators.

Example 6.1 Defining Terms: The Case of Prenatal Care Programs

Prenatal health care refers to pregnancy-related services provided to a woman between the time of conception
and delivery and consists of monitoring the health status of the woman; providing patient information to
foster optimal health, good dietary habits, and proper hygiene; and providing appropriate psychological and
social support. Programs have preset, specific purposes and activities for defined populations and groups.
Outcomes of prenatal care programs include the newborn’s gestational age and birth weight and the mother’s
medical condition and health habits.
Because of these definitions, the evaluator learns the following: If you want to evaluate prenatal care
programs, your measures should include attention to patient education, dietary habits and hygiene, and
psychosocial support. If you are interested in outcomes, you need measures of gestational age and mother’s
medical condition and health habits. You also need to decide which medical conditions (e.g., diabetes,
hypertension) and health habits (e.g., drinking, smoking) will be the focus of your evaluation.

3. Outline the content.

Suppose that an evaluator is concerned with the outcomes of a particular prenatal care program: the
Prenatal Care Access and Utilization Initiative. Assume also that the evaluator’s review of the literature and
consultation with experts reveal the importance of collecting data on the following variables: still or live birth,
birth weight, gestational age, number of prenatal visits, and drug toxicology status of mother and baby. An
outline of the contents might look like this:

a. Baby’s birth date

b. Birth weight

c. Gender

d. Gestational age

e. Whether a drug toxicology screen was performed on baby and results

f. Whether a drug toxicology screen was performed on mother and results

g. Number of visits

4. Select response choices for each question or item.

An item on a measure is a question asked of the respondent or a statement to which the respondent is asked to react. Example 6.2 presents a sample item and its response choices.

Example 6.2 Response Choices

Selecting response choices for items requires skill and practice. Whenever possible, you should use response
choices that others have used effectively. The possibilities of finding appropriate choices are greater when you
are collecting demographic information (e.g., age, gender, ethnicity, income, education, address)
than when you are collecting data on the knowledge, attitudes, or behaviors that result from a specific
program designed for a particular group of people. You can find effective item choices in the literature and
can obtain many from measures prepared by the U.S. Bureau of the Census; the health departments of cities,
counties, and states; and other public and private agencies.

5. Choose rating scales.

Whenever possible, adapt rating scales from scales that have already proven themselves in earlier research. As with item choices, rating scales are available from measures designed by public and private agencies and
those described in the literature. Example 6.3 displays an item that uses a simple true-and-false scale.

Example 6.3 Item With a True-False Rating Scale

Please circle the number that best describes whether each of the following statements is true or false for you.

6. Review the measure with experts and potential users.

It is wise to ask other evaluators, subject matter experts, and potential users to review your measure. The
following are some important questions to ask them.

Questions to Ask of Those Reviewing Your Measure

• Ask experts

1. Is all relevant content covered?

2. Is the content covered in adequate depth?

3. Are all response choices appropriate?

4. Is the measure too long?

• Ask users

1. Is all relevant content covered?

2. Is the content covered in adequate depth?

3. Do you understand without ambiguity all item choices and scales?

4. Did you have enough time to complete the measure?

5. Did you have enough time to administer the measure?

6. Is the measure too long?

7. Revise the measure based on comments from the reviewers.

8. Put the measure in an appropriate format. For example:

• Add an ID code, because without such coding you cannot collect data on the same person over time.

• Add directions for administration and completion.

• Add a statement regarding confidentiality (informing the respondent that participants are identifiable
by code) or anonymity (informing the respondent that you have no means of identifying
participants).

• Add a statement thanking the respondent.

• Give instructions for submitting the completed measure. If it is to be mailed, is an addressed and
stamped envelope provided? By what date should the measure be completed?

9. Review and test the measure before administration.

The importance of pilot testing a new measure cannot be overemphasized. To conduct a meaningful pilot
test, you must use the measure under realistic conditions. This means administering the measure to as many
participants as your resources allow. After participants complete the measure, you need to interview them to
find out about any problems they had in completing the measure. When your study involves interviews, you
must test the methods for interviewing as well as the measure itself.

Checklist for Selecting an Already Existing Measure

Many instruments and measures are available for use by program evaluators. Good sources for these are the
published evaluation reports found in journals. In some cases, whole instruments are published as part of these
articles. Even when the measures themselves do not appear, the evaluators usually describe all of their main
data sources and measures in the “methods” sections of their reports, and you can check the references for
additional information.
Using an already tested measure has many advantages, including saving you the time and other resources
needed to develop and validate a completely new instrument. Choosing a measure that has been used
elsewhere is not without pitfalls, however. For example, you may have to pay to use an established measure, or
you may be required to share your data. You may even have to modify the measure so substantially that its
reliability and validity are jeopardized, requiring you to establish them all over again.
The following is a checklist for choosing an already existing measure.

1. Find out the costs: Do you have to pay? Share data? Share authorship?

2. Check the content: In essence, you must do your own face and content validity study. Make sure that
the questions are the ones you would ask if you were developing the instrument. Check the item choices
and rating scales. Will they elicit the information you need?

3. Check the reliability and validity: Make sure that the types of reliability and validity that have been
confirmed are appropriate for your needs. For example, if you are interested in interrater reliability, but
only internal consistency statistics are provided, the measure may not be the right one for you. If you are
interested in a measure’s ability to predict, but only content validity data are available, think again before
adopting the instrument.

You need to check the context in which the measure was validated. Are the settings and groups similar to
those in your evaluation? If not, the instrument may not be valid for your purposes. For example, a measure of
compliance with counselors’ advice in a program to prevent child abuse and neglect that was tested on teen
mothers in Montana may not be applicable to nonteen mothers in Helena, Montana, or to teen mothers in
San Francisco, California.
You must also decide whether the measure is sufficiently reliable and valid for use. Reliability and validity
are often described as correlations (e.g., between experts or measures or among items). How high should the
correlations be? The fast answer is that the higher, the better, and .90 is best. But, the statistic by itself should
not be the only or even the most important criterion. A lower correlation may be acceptable if the measure
has other properties that are potentially more important. For example, the content may be especially
appropriate, or the measure might have been tested on participants who are very much like those in your
evaluation.

4. Check the measure’s format:

• Will the data collectors be able to score the measure?

• Does it make sense to use the particular measure, given your available technology? For example, if
the measure requires certain software or expertise, do you have it? Can you afford to get it?

• Will the participants in the evaluation be willing to complete the measure? Participants sometimes
object to spending more than 10 or 15 minutes on an interview, for example. Also, personal
questions and complicated instructions can result in incomplete data.

The Measurement Chart: Logical Connections

A measurement chart assists the evaluator with the logistics of the evaluation by helping to ensure that all variables are appropriately covered. The chart is also useful when the evaluator is writing proposals, because it
portrays the logical connections among what is being measured, how it is being measured, for how long, and
with whom. When the evaluator is writing reports, the chart provides a summary of some of the important
features of the evaluation’s data sources. Illustrated by the sample measurement chart in Figure 6.1, the
information in the chart’s columns enables the evaluator to make logical connections among the various
segments of data collection. Each column in the chart is explained briefly below.

Variables. To ensure that all independent and dependent variables are covered, the evaluator uses the chart to
check the evaluation questions and sampling strategies, including all strata and inclusion and exclusion
criteria. For example, suppose that an evaluation asks about the effectiveness of a yearlong combined diet and
exercise program in improving the health status and quality of life for people over 75 years of age. Also,
suppose that it excludes all persons with certain diseases, such as metastatic cancer and heart disease. Assume
that the evaluators plan to compare men and women to determine whether any differences exist between
them after program participation. The variables needing measurement in such an evaluation would include
quality of life, health status (to identify persons with metastatic cancer and heart disease and to assess
changes), and demographic characteristics (to determine who is male and who is female).

How measured. For each variable, the measure should be indicated. The measurement chart in Figure 6.1
shows that quality of life will be assessed through interviews with patients and observations of how they live,
health status will be measured through physical examination, demographic characteristics will be measured
through self-administered questionnaires or interviews, and costs will be measured through a review of
financial records.

Sample. This column in the measurement chart contains information on the number and characteristics of
individuals who will constitute the sample for each measure. For example, the measurement chart in Figure
6.1 shows that to measure quality of life the evaluator will interview all 100 patients (50 men and 50 women)
in the experimental group and all 100 patients (50 men and 50 women) in the control group as well as observe
a sample of the lifestyles of 50 patients (25 men and 25 women) in each group. Assessment of health status
will be based on physical examination of all persons in the experimental and control groups, and demographic
information will be collected on all experimental and control program participants. Data on costs will be
collected only for those individuals who use one of the two staffing models.

Timing of measures. The information in this column refers to when each measure is to be administered. For
example, the measurement chart in Figure 6.1 shows that interviews regarding quality of life and physical
examination will be conducted 1 month before the program, immediately after the program (within 1 month),
and 1 year after. Observations will be made 1 month before and 6 months after. Demographic information
will be obtained just once: 1 month before the start of the program.

Duration of measures. This column of the chart contains information on the amount of time it will take to
administer and summarize or score each measure. The measurement chart in Figure 6.1 shows that the
quality-of-life interviews will take 1 hour to conduct and a half hour to summarize. The observations will take
a half hour to conduct and 15 minutes to summarize. The physical examinations are expected to take 30
minutes, and collecting data on demographic characteristics will take less than 5 minutes.

Content. The evaluator should provide a brief description of the content in the measurement chart. For
example, if measurement of quality of life is to be based on a particular theory, the evaluator should note the
theory’s name. If the interview has several sections (e.g., social, emotional, and physical function), the
evaluator should mention them. It is important to remember that the chart’s purpose is really to serve as a
guide to the measurement features of an evaluation. Each one of its sections may require elaboration. For
example, for some measures, the evaluator may want to include the number of items in each subscale.

Reliability and validity. If the measures being used are adapted from some other study, the evaluator might
describe the relevant types of reliability and validity statistics in this part of the chart. For example, if the
quality-of-life measure has been used on elderly people in another evaluation that showed that higher scorers
had higher quality than low scorers, this information might be included. If additional reliability information is
to be collected in the current evaluation, that, too, may be reported. A review of medical records to gather
information on the number, types, and appropriateness of admissions to the hospital over a 1-year period, for
example, could require estimations of data collectors’ interrater reliability; this type of information belongs in
this section of the chart.

General concerns. In this portion of the chart, the evaluator notes any special features of the entire data
collection and measurement endeavor. These include information on costs, training, number of items, special
software or hardware requirements, and concerns related to informed consent.

Summary and Transition to the Next Chapter on Managing Evaluation Data

Reliability refers to the consistency of a measure, and validity refers to its accuracy. Having reliable and valid
measures is essential in an unbiased evaluation. Sometimes the evaluator is required or chooses to create a new
measure; at other times, a measure is available that appears to be suitable. Whether creating, adapting, or
adopting a measure, the evaluator must critically review the measure to ensure its appropriateness and
accuracy for the current study.
A measurement chart is a useful way of showing the relationships among variables, how and when the
variables are measured, and the content, reliability, and validity of the measures. Measurement charts are
useful tools for evaluators as they plan and report on evaluations.
The next chapter discusses how the evaluator engages in activities to ensure the proper management of
evaluation data to preserve them for analysis. These activities include drafting an analysis plan, creating a
codebook, establishing coder reliability, reviewing data collection instruments for incomplete or missing data,
entering data into a database, cleaning the data, creating the final data set, and archiving the data set.

Exercises

Exercise 1: Reliability and Validity

Directions

Read the following excerpts and determine which concepts of reliability and validity are covered in each.

Excerpt A

The self-administered questionnaire was adapted with minor revisions from the Student Health Risk
Questionnaire, which is designed to investigate knowledge, attitudes, behaviors, and various other cognitive
variables regarding HIV and AIDS among high school students. Four behavior scales measured sexual
activity (4 questions in each scale) and needle use (5 questions); 23 items determined a scale of factual
knowledge regarding AIDS. Cognitive variables derived from the health belief model and social learning
theory were employed to examine personal beliefs and social norms (12 questions).

Excerpt B

All school records were reviewed by a single reviewer with expertise in this area; a subset of 35 records was
reviewed by a second blinded expert to assess the validity of the review. Rates of agreement for single items
ranged from 81% (k = .77; p <.001) to 100% (k = 1; p <.001).

Excerpt C

Group A and Group B nurses were given a 22-question quiz testing evaluation principles derived from the
UCLA guidelines. The quizzes were not scored in a blinded manner, but each test was scored twice.

Exercise 2: Reviewing a Data Collection Plan

Directions

Read the following information collection scenario and, acting as an independent reviewer, provide the
evaluator with a description of your problems and concerns.

The School of Nursing is in the process of revising its elective course in research methods. As part of the
process, a survey was sent to all faculty who currently teach the methods courses to find out whether and
to what extent epidemiology topics were included. Among the expectations was that methods courses
would aim to improve students’ knowledge of epidemiology and their attitudes toward its usefulness in a
number of nursing subjects, ranging from public health nursing to home health care administration. The
results of the survey revealed little coverage of some important objectives. Many faculty indicated that
they would like to include more epidemiology, but they were lacking educational materials and did not
have the resources to prepare their own. To rectify this, a course with materials was developed and
disseminated. The Center for Nursing, Education, and Evaluation was asked to appraise the effectiveness
of the educational materials. Evaluators from the center prepared a series of knowledge and skill tests and
planned to administer them each year over a 5-year period. The evaluators are experts in test
construction, and so they decided to omit pilot testing and save the time and expense. Their purpose in
testing was to measure changes (if any) in nurses’ abilities. They also planned to interview a sample of
cooperating students to get an in-depth portrait of their knowledge of clinical epidemiology.

References and Suggested Readings

Cohen, R. J., Swerdik, M., & Sturman, E. (2012). Psychological testing and assessment: An introduction to tests
and measurement. New York: McGraw Hill.
Furr, R. M., & Bacharach, V. R. (2013). Psychometrics: An introduction. Thousand Oaks, CA: Sage.
Litwin, M. (2002). How to assess and interpret survey psychometrics. Thousand Oaks, CA: Sage.
Raykov, T., & Marcoulides, G. A. (2011). Introduction to psychometric theory. New York: Routledge.

Purpose of This Chapter

The term data management refers to the actions evaluators take to convert the data they collect
from all sources into an analytic database or data set that is ready for analysis. This chapter discusses
how to prepare a data analysis plan, enter and store data in an evaluation database, avoid data entry
errors, deal with missing data, and create a clean data set.

7
Managing Evaluation Data

A Reader’s Guide to Chapter 7

Managing Evaluation Data: The Road to Data Analysis

Drafting an Analysis Plan

Creating a Codebook or Data Dictionary


Establishing Reliable Coding
Measuring Agreement: The Kappa

Entering the Data

Searching for Missing Data


What to Do When Participants Omit Information

Cleaning the Data


Outliers
When Data Are in Need of Recoding

Creating the Final Database for Analysis

Storing and Archiving the Database

Summary and Transition to the Next Chapter on Data Analysis

Exercises

References and Suggested Readings

Managing Evaluation Data: The Road to Data Analysis

The term data management refers to the actions evaluators take to convert the data they collect from all
sources into an analytic database or data set that is ready for analysis. The terms database and data set are often
used interchangeably, but a data set is really a subset of a larger database. For example, suppose an evaluation
team collects data on all nurses who complete a three-year program, and then they enter the information into
a database. If the evaluation team later on decides to compare the program’s effects on nurses who went to
Schools A and B, it would create a data set that only included nurses who attended Schools A and B. Nurses
who attended other schools would not be included.
Evaluation data management includes at least 8 activities:

1. Drafting an analysis plan that defines the variables to be analyzed

2. Creating a codebook or data dictionary (an “operations” manual)

3. Ensuring the reliability of the coders

4. Reviewing the data for completeness and accuracy

5. Entering data into a database and validating the accuracy of the entry

6. Cleaning the data

7. Creating the final data set for analysis

8. Storing and archiving the data

Suppose that you are the evaluator of a 3-year program to encourage primary care patients to use preventive
health care services on a regular basis. Your plan is to survey patients four times over the 3 years. There are
400 patients in the evaluation, and you decide to survey them first at baseline (just before the program begins)
and then at intervals of 12, 24, and 36 months afterward. Each survey contains 25 questions. In addition, you
also plan to survey 25 physicians at baseline and at the conclusion of the program, 36 months later. The
physicians’ survey has 50 questions. If everyone in your original sample stays in the evaluation and answers all
of the questions on all of the surveys, you have to find a way to manage data from 400 patients who answer 25
questions each four times (40,000 answers) plus the data from 25 physicians who answer 50 questions twice
(2,500 answers). To complicate matters, you may need to combine answers from patients and physicians in
order to answer some of the evaluation questions. For example, say that one of the evaluation questions is,
“How do the male and female patients of female physicians compare with those of male physicians in their
use of two specific preventive health services?” To answer this question, you need to be able to easily and
accurately create a separate analytic file of female and male physicians, their male and female patients, and the
preventive services used by each male and each female patient for each male and each female physician as
described in Figure 7.1.
Before data collection begins, the evaluator must select the software to use for data management. The
complexity and importance of data management are sometimes overlooked, even though evaluators estimate
that data management activities take between 20% and 50% of the time typically allocated for the analytic
process. As you plan your own evaluation, it is important to be realistic about the amount of time you need
for data management and that you make certain you have sufficient resources (staff, time, and money) to do
the job. The principles that underlie good data management apply to qualitative data as well as to statistical
information.

Figure 7.1 Comparing Female and Male Physicians’ Female and Male Patients in Their Use of
Preventive Health Services

Drafting an Analysis Plan

An analysis plan contains a description and explanation of the statistical and other analytic methods that the
evaluator plans to use for each evaluation question or hypothesis. Suppose that you are the evaluator of a
health education program designed to reduce risks for harmful drinking among adults. The program consists
of three main components: (a) the Alcohol-Related Problems Screen (ARPS), a measure to determine
drinking classification (harmful or not harmful); (b) a report describing the reasons for an individual’s
classification (such as “A person who drinks four or more drinks at one sitting two or three times a week or
more is a harmful drinker”); and (c) educational materials (a booklet and a list of related websites) to help
individuals reduce their drinking to a healthier level. Example 7.1 shows portions of a simple plan for
analyzing the data to answer two of this study’s evaluation questions.

Example 7.1 Portion of an Analysis Plan for an Evaluation of a Program to Reduce Harmful
Drinking in Adults

Evaluation Question 1

Question: How do men and women compare in terms of numbers of harmful drinkers?

Hypothesis: More men than women will be harmful drinkers.

Independent variable: Sex

Data Source: Survey questionnaire (Are you male or female?)

Dependent variable: Drinking status (harmful or not harmful)

Data Source: The Alcohol-Related Problems Screen

Planned analysis: Chi-square to test for differences between numbers of men and women who are or are not
harmful drinkers, according to their scores on the Alcohol-Related Problems Screen

Evaluation Question 2

Question: How do men and women compare in their risks for harmful drinking as defined by scores on the
Alcohol-Related Problems Screen?

Independent variable: Sex

Data Source: Survey questionnaire (Are you male or female?)

Dependent variable: Risks for harmful drinking

Data Source: The Alcohol-Related Problems Screen (Higher scores mean greater risks. Scores are continuous
and range from 1 to 50.)

Planned analysis: t-test to test for differences in average scores obtained by men and by women

The evaluation questions and hypothesis contain the independent and dependent variables. Each variable
has a source of data (survey questionnaire, the Alcohol-Related Problems Screen) and the data have certain
characteristics. In Example 7.1, the data are categorical (male or female; harmful or not harmful) or continuous (a
score of 1 to 50). You need to understand the characteristics of the data in order to choose an appropriate
statistical technique. Categorical data lend themselves to different analytic methods (such as chi-square) than
continuous data (such as a t-test).
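
To make the connection between data characteristics and statistical technique concrete, here is a minimal Python sketch (not part of the original analysis plan) of the two tests planned in Example 7.1. The variable names and the simulated responses are illustrative assumptions only.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sex = rng.integers(0, 2, size=200)           # 0 = male, 1 = female (categorical)
harmful = rng.integers(0, 2, size=200)       # 0 = not harmful, 1 = harmful (categorical)
risk_score = rng.integers(1, 51, size=200)   # ARPS-style score of 1 to 50 (continuous)

# Evaluation Question 1: chi-square on the 2 x 2 table of sex by drinking status
table = [[np.sum((sex == s) & (harmful == h)) for h in (0, 1)] for s in (0, 1)]
chi2, p, dof, expected = stats.chi2_contingency(table)
print("Chi-square:", round(chi2, 2), "p =", round(p, 3))

# Evaluation Question 2: t-test comparing men's and women's mean risk scores
t_stat, p = stats.ttest_ind(risk_score[sex == 0], risk_score[sex == 1])
print("t:", round(t_stat, 2), "p =", round(p, 3))
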
Regardless of how well you plan your analysis, the realities of sampling and data collection may force you to
modify your plan. Suppose, for example, that you review preliminary data from the Alcohol-Related
Problems Screen (Evaluation Question 2 in Example 7.1). Based on the results of your review, you decide
that in addition to testing for differences in average scores, you also want to compare the number of men and
women who attain scores of 25 or higher with the number who obtain scores of 24 or lower. You would then
have to modify your original analysis plan to include a chi-square test (to compare proportions or numbers) as
well as the planned t-test (to compare averages). In general, evaluators can count on having to make
modifications to their original analysis plans, especially in large studies that collect a great deal of data.

Creating a Codebook or Data Dictionary

Codes are units that the evaluator uses to “speak” to the software. Suppose that in your evaluation of a program
to reduce harmful drinking, 1,000 men and women complete the Alcohol-Related Problems Screen. This
survey of alcohol use asks respondents how much and how often they drink. To determine how many men
report drinking four drinks at one sitting, two or three times a week, for example, you have to communicate
with the statistical software program and “tell” it which variables to look for (e.g., sex and quantity and
frequency of alcohol use). Software programs read about variables through codes. Example 7.2 displays two
items from the Alcohol-Related Problems Screen.

Example 7.2 Excerpts From a Screening Measure to Detect Harmful Drinking

1. Are you male or female? [SEX]

Male 0

Female 1

.
.
.

8. How often in the past 12 months have you had 4 or more drinks of alcohol at one sitting? [QFDRINK]

Choose one answer


Daily or almost daily 1

Four or five times a week 2

Two or three times a week 3

Two to four times a month 4

One time a month or less 5

Never 0

In Example 7.2, the codes are the numbers to the right of the response boxes. You use a statistical program
to tell you how many people who answered “1” to question 1 also answered “4” to question 8. To do this, you
must tell the program the names of the variables (Question 1 = SEX and Question 8 = QFDRINK) and their
values (0 = male and 1 = female; 1 = daily or almost daily to 0 = never). Many statistical programs also require
that you tell the computer where in the data line to find the variables. (See the section “Entering the Data,”
later in this chapter.) The “words” in brackets in Example 7.2 [SEX and QFDRINK] correspond to the
variables represented by each question.
The evaluator must create a codebook or data dictionary that contains descriptions of all of the questions,
codes, and variables associated with a survey. Example 7.3 displays portions of a typical codebook.

Example 7.3 Portions of a Codebook

As this example illustrates, each variable is broken down into discrete units, called values, that correspond to
the codes for that variable. For instance, a participant’s feeling guilty or sorry for something he or she did
because of alcohol use has five values: 1 = daily or almost daily; 2 = at least once a week, but less than daily; 3
= at least once a month, but less than weekly; 4 = less than once a month; 0 = never. The codes are thus 0, 1,
2, 3, and 4. The codebook also notes that if no information is available, nothing is entered.
The evaluator should assign missing data a code or value that is not numeric, because many statistical
programs treat all codes as data that need to be analyzed. Consider that you are conducting a survey of teens,
and one participant neglects to include his or her age on a questionnaire; if you code that missing information
as 99, the statistical program will likely assume that the respondent is 99 years old.
There are several ways to code missing data, including using a period (.) or inserting a blank space. Some
software programs allow users to select several codes for missing data, such as “.a” for missing, “.b” for “don’t
know,” and “.c” for “not applicable.” It is a good idea to distinguish among these three concepts. If a person in
an evaluation of harmful drinking does not drink, for instance, then all variables pertaining to the quantity
and frequency of that person’s drinking should be coded as “not applicable,” given that the data are not
actually “missing.” When you have selected the software you will use in your evaluation, check the manual to
see which coding system is appropriate and to determine what will work best for your analysis.
Although statistical software programs vary in the terminology they use, many (though by no means all)
require that variable labels appear in all capital letters (PQOL or EDUC) and do not permit the use of special
characters (such as commas and semicolons) in these labels. Some programs limit the number of characters
for variable labels to about eight. Variable labels should be based on the actual names of the variables (e.g.,
the variable name “perceived quality of life” is given the variable label PQOL). To understand the data, the
software needs to know the name of each variable, the variable’s label, and the variable’s values and their
labels. For the variable named “community,” for example, the software needs to know that its label is COMM
and that its values are 1 = urban and 2 = rural. (Note that although the statistical program you use may
employ slightly different terms for all these concepts, the ideas will be exactly the same.)
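
As a concrete illustration, the sketch below (not from the book) shows one way a codebook's value labels might be stored and applied with Python's pandas library; the variable names, codes, and labels follow Example 7.2 but are otherwise illustrative assumptions.

import pandas as pd

codebook = {
    "SEX": {0: "male", 1: "female"},
    "QFDRINK": {0: "never", 1: "daily or almost daily", 2: "four or five times a week",
                3: "two or three times a week", 4: "two to four times a month",
                5: "one time a month or less"},
}

data = pd.DataFrame({"SEX": [0, 1, 1], "QFDRINK": [3, 0, 5]})

# Translate numeric codes into readable value labels for reporting
labeled = data.apply(lambda col: col.map(codebook[col.name]))
print(labeled)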

Establishing Reliable Coding

To assure reliable data in a small evaluation—for instance, with just one person doing the coding—the
evaluator should recode all or a sample of the data to check for consistency. The second coding should take
place about a week after the first coding. This is enough time for the coder to forget the first set of codes so
that he or she does not just automatically reproduce them. After the data are coded a second time, the
evaluator should compare the two sets of codes for agreement. If there are disagreements, the evaluator may
resolve them by calling in a second person to arbitrate.
In a large evaluation, a second person should independently code a sample of the data. To assure reliability
between coders, the evaluator must provide the coders with formal training and make sure they have access to
the definitions of all terms.
Despite the evaluator’s best efforts at setting up a high-quality codebook and data management system, the
coders may not always agree with one another. To find out the extent of agreement between coders—
intercoder or interrater reliability—the evaluator can calculate a statistic called kappa, which measures the
agreement between a given pair of coders and how much better it is than chance. Typically, kappa is used in
assessing the degree to which two or more raters, examining the same data, agree when it comes to assigning
the data to categories. The following subsection explains the principle behind the kappa statistic.

Measuring Agreement: The Kappa

Suppose that two evaluators are asked to review independently 100 interviews with single working mothers
of two or more children. These mothers have just completed a month-long program, and the interviews are
about their health-related quality of life. The reviewers are to study the transcripts of the interviews to find
how many of the participants mention doing regular exercise during the discussion. The reviewers are asked
to code 0, for “no,” if a participant does not mention regular exercise at least once; and 1, for “yes,” if she does
mention regular exercise. Here are the reviewers’ codes:

Reviewer 1 says that 30 (A) of the 100 interviews do not contain reference to regular exercise, whereas
Reviewer 2 says that 35 (B) do not. The two reviewers agree that 20 (C) interviews do not include mention of
exercise.
What is the best way to describe the extent of agreement between these reviewers? The figure of 20 out of
100, or 20% (C), is probably too low, because the reviewers also agree that 55% (D) of the interviews include
mention of exercise. The total agreement of 75% (55% + 20%) is an overestimate, because with only two categories (yes
and no), some agreement may occur by chance. This is shown in the following formula, in which O is the
observed agreement and C is the chance agreement.

Measuring Agreement Between Two Coders: The Kappa (κ) Statistic

κ = (O − C) / (1 − C)

where O − C = agreement beyond chance and 1 − C = agreement possible beyond chance. Here is how this
formula works with the above example.

1. Calculate how many interviews the reviewers may agree by chance do not include mention of exercise.
One does this by multiplying the number of no’s and dividing by 100, because there are 100 interviews:
(30 × 35)/100 = 10.5.

2. Calculate how many interviews the reviewers may agree by chance do include mention of exercise by
multiplying the number of interviews that each reviewer found to include mention. One does this by
multiplying the number of yeses and dividing by 100: (70 × 65)/100 = 45.5.

3. Add the two numbers obtained in steps 1 and 2 and divide by 100 to get a proportion for chance
agreement: (10.5 + 45.5)/100 = 0.56.

The observed agreement is 20% + 55% = 75%, or 0.75. Therefore the agreement beyond chance is 0.75 −
0.56 = 0.19: the numerator.
The agreement possible beyond chance is 100% minus the chance agreement of 56%, or 1 − 0.56 = 0.44: the
denominator. Dividing the numerator by the denominator gives a kappa of 0.19/0.44, or approximately 0.43.

What is a “high” kappa? Some experts have attached the following qualitative terms to kappas: 0.0–0.2 =
slight, 0.2–0.4 = fair, 0.4–0.6 = moderate, 0.6–0.8 = substantial, and 0.8–1.0 = almost perfect.
How can an evaluator achieve substantial or almost perfect agreement—reliability—among reviewers? By
making certain that all reviewers collect and record data on exactly the same topics and that they agree in
advance on what each important variable means. The “moderate” kappa of 0.43 obtained by the reviewers in the
example above may be due to differences between the reviewers’ and the evaluator’s definitions, the evaluator’s
poor training of the reviewers in the use of the definitions, and/or mistakes in coding.
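
For readers who want to check the arithmetic, the short Python sketch below (not from the book) reproduces the kappa calculation using the counts given above: 30 and 35 interviews coded "no" by Reviewers 1 and 2, and 20 interviews coded "no" by both.

n = 100                                  # total interviews reviewed
no_1, no_2, both_no = 30, 35, 20         # "no" codes from each reviewer and jointly
yes_1, yes_2 = n - no_1, n - no_2
both_yes = n - (no_1 + no_2 - both_no)   # 55 interviews coded "yes" by both

observed = (both_no + both_yes) / n                 # O = 0.75
chance = (no_1 * no_2 + yes_1 * yes_2) / n ** 2     # C = 0.56
kappa = (observed - chance) / (1 - chance)          # (O - C) / (1 - C)
print(round(kappa, 2))                              # 0.43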

Entering the Data

Data entry is the process of getting evaluation information into a database from surveys, interviews,
observations, records, and other measures. Entry usually takes one of three forms. In the first form, the
evaluator enters the data by hand into a database management or spreadsheet program. In the second form,
the evaluator scans the data (such as a page of a completed survey) using a combined OCR/OMR system
(optical character recognition and optical mark recognition). The OCR/OMR system “reads” the data and
processes it electronically. In the third form of data entry, data are automatically entered into a database as the
evaluator or participant completes data collection. Data entry of this type is associated with computer-assisted
interviewing, online surveys, and electronic health records.
Each type of program—data entry, database management, and statistical—has its own conventions and
terminology. Some programs refer to entering data as setting up a “record” for each evaluation participant.
The record consists of the participant’s unique ID (identification code) and the participant’s “observations”
(response choices, scores, comments, and so on). Other programs consider the unit of analysis (such as the
individual participant) as the observation, and the data collected on each observation are referred to as
“variables” or “fields.”
Example 7.4 shows a simple data set for 6 people.

Example 7.4 Data on Six People

In this example, with the exception of the first column, the table is organized so that the respondent’s
identification (RESPID) numbers constitute the rows and the data on the participants are the columns. That
is, person 2’s data are 2, 4, 1, 3, and 2. Many statistical programs require the user to tell the computer where
on the data line a variable is located. For instance, in Example 7.4 the person’s sex is called SEX, and data on
biological sex can be found in column 2. Feeling guilty because of drinking is called GUILT, and the data for
this variable can be found in column 5.
Database management programs, statistical programs, and online surveys with automatic data entry can
facilitate the accuracy of data entry when they are programmed to allow the entry of only legal codes. For
instance, if the codes should be entered as 001, 002, and so on, up to 010, the user can write rules so that an
entry of 01 or 10 is not permitted—that is, if anyone tries to enter 01 or 10, the program responds with an
error message. With minimal programming, such software can also check each entry to ensure that it is
consistent with previously entered data and that skip patterns in the questionnaire are respected. That is, the
program can ensure that the fields for questions to be skipped by some participants are coded as skips and not
as missing data. Designing a computer-assisted data entry protocol requires skill and time. The evaluator
should never regard any such protocol as error free until it has been tested and retested in the field.
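
A legal-code check of the kind described above can be as simple as the following Python sketch (not from the book); the variable names and permitted codes are illustrative assumptions.

legal_codes = {"SEX": {0, 1}, "QFDRINK": {0, 1, 2, 3, 4, 5}}

def check_entry(variable, value):
    # Return True if the value is a legal code for the variable; otherwise report an error
    if value not in legal_codes[variable]:
        print(f"Error: {value!r} is not a legal code for {variable}")
        return False
    return True

check_entry("QFDRINK", 7)   # rejected: 7 is not a legal code
check_entry("SEX", 1)       # accepted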

Searching for Missing Data

The evaluator should review the first completed data collection instruments as soon as they are available,
before any data are entered. In any study using self-administered surveys (which includes many evaluations),
the evaluator can expect to find some survey questions unanswered. Participants may not answer questions for
many reasons. For example, they may not want to answer particular questions, or they may not understand
what they are being asked to do. Some participants may not understand the directions for answering the
questions or completing the survey because the questionnaire requires them to read too much or is composed
at a level beyond their reading capabilities. Participants in print surveys may be unsure about what methods
they should use in responding (e.g., whether they should fill in boxes completely or mark them with checks,
whether they should circle the correct answers). They may find that questions are presented in a format that is
difficult to use, such as the one in Example 7.5.

Example 7.5 A Question That May Be Confusing

Please mark an X through the choice that best describes you. Please answer each question.

DID YOU MARK ONE ANSWER TO EACH QUESTION EVEN IF YOUR ANSWER IS NEVER?

In this example, the participant has answered just one question even though he or she was asked (and
reminded) to answer all of the questions. What do the unanswered items mean? Has the participant declined
to answer the questions because he or she never does the actions described in them? If so, why didn’t the
participant mark “never”? In fact, it is not uncommon for participants on self-administered surveys to answer
only those questions that they believe are relevant to them.
The evaluator who designed the question in Example 7.5 probably used this tabular format in an attempt
to save space and to avoid repeating the response choices over and over. Participants may find such a format
confusing, however, because they are used to questionnaires in which items are formatted more like the one
shown in Example 7.6.

Example 7.6 A Question Format That Is Relatively Easy to Understand

Confusing question formats lead to missing data because participants do not know how to answer the
questions, and as a result leave them blank. The evaluator may overcome some of the problems that cause
participants to misunderstand survey questions by conducting extensive cognitive pretests and pilot tests.
Cognitive pretests are interviews with potential participants in which they are asked to interpret each question
and response choice on a survey. Pilot tests are tests of the survey questions in the actual evaluation setting.
These two activities tell evaluators if a particular question format is unusable or whether some questions do
not make sense, enabling them to address problems before data collection begins.

What to Do When Participants Omit Information

One of the major issues evaluators face is deciding on the best way to handle the problem of missing data.
Say that you mail 100 questionnaires and get 95 back. You proudly announce that you have a 95% response
rate for your questionnaire. Then, on closer examination, you discover that half the participants did not
answer question 5, and that each of the remaining 24 survey questions also has missing data. With all that
missing information, you cannot really claim to have a 95% response rate for all questions.
What should an evaluator do about missing responses? In some program evaluations, it may be possible to
go back to the participants and ask them to answer the questions they left unanswered. In small studies where
the participants are known (such as within one clinic), it may be easy for the evaluator to contact the
participants. But, in most surveys, collecting information a second time is usually impractical, if not
impossible. Some evaluations use one or more anonymous data collection methods, and so the evaluators do
not even know the identity of the participants. In an institutional setting, the evaluator may have to go back
to the institutional review board to get permission to contact the participants a second time—a process that
takes time and may delay completion of the evaluation.
In online data collection, surveys can be programmed so that the respondent must answer one question
before proceeding to the next. This approach can help to minimize the amount of missing data. However,
evaluation participants may find this restriction frustrating and give up on the survey. Some evaluators believe
that forcing participants to answer every question is coercive and unethical. In their view, a program that
forces a respondent to answer a question, even if he or she prefers not to, may be construed as violating the
ethical principle of autonomy, or respect for individuals. Moreover, using this approach may result in
unreliable information, as some people may enter meaningless answers to some questions just to be able to
move on in the survey.
Evaluators often use a statistical method called weighting to make up for nonresponse. Suppose that you
expect 50% of men and 50% of women in a sample to complete a survey, but only 40% of the men do. In such
a case, it is possible to weight the men’s returns so that they are equivalent to 50%. There are several strategies
used for weighting returns, among them logistic regression analysis. In this method, responses are
dichotomized: response or no response. Independent variables (age, sex, educational level) are used to predict
which members of the sample do or do not respond.
Dichotomous variables are divided into two components. For instance, if the variable age is dichotomous,
then it is divided into two parts, such as 65 and older and 64 and younger. Also, scores can be dichotomized:
for example, separated into scores of 50 and under and scores of 51 or more.
Suppose that you have information on age and sex, and you can predict whether or not a person responded
using these two variables—that is, your participants and nonparticipants differ on these characteristics.
Logistic regression provides you with an estimate of the probability that a given respondent’s data will be
missing. The probabilities are assigned as weights for all cases. Cases that have missing data are given higher
weights. All statistical software programs include mechanisms for weighting responses.
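
The chapter describes weighting in general terms; the Python sketch below shows one common variant, inverse-probability-of-response weighting, under illustrative assumptions (the simulated ages, sexes, and response pattern are not from the book).

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
age = rng.integers(20, 80, size=500)
sex = rng.integers(0, 2, size=500)              # 0 = male, 1 = female
responded = rng.binomial(1, 0.4 + 0.2 * sex)    # women respond more often in this simulation

X = np.column_stack([age, sex])
model = LogisticRegression().fit(X, responded)
p_respond = model.predict_proba(X)[:, 1]        # predicted probability of responding

# Respondents who resemble nonrespondents receive larger weights in the analysis
weights = np.where(responded == 1, 1.0 / p_respond, 0.0)
print(weights[responded == 1][:5])
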
Another commonly used statistical method to account for missing data is the last observation carried forward
(LOCF). LOCF uses the last value observed before the participant left, regardless of when the dropout
occurred. LOCF makes the probably unrealistic assumption that participants who drop out would continue
responding precisely as they did at the time of drop out. Because of this assumption, LOCF is losing favor.
Newer methods use statistical models in which each participant is fitted with his or her own regression line
over time so that when a participant drops out, his or her curve is projected to the end of the study rather
than simply holding at whatever the last value was, as is the case when using the LOCF method.
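
A minimal Python sketch of LOCF (not from the book), assuming illustrative visit data in a pandas DataFrame:

import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "baseline": [10, 12, 9],
    "month_12": [11, np.nan, 10],
    "month_24": [np.nan, np.nan, 12],   # dropouts leave later visits missing
})

# Carry each participant's last observed value forward across visits
locf = scores.ffill(axis=1)
print(locf)
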
Missing data is sometimes handled by filling in a “reasonable” value, such as an average score for each
participant who did respond. Suppose Participant A did not answer a survey question about his annual
household income. To fill in the blank with a reasonable value, the researcher can compute the average of all
participants’ responses to the question and use the average as the value for Participant A. The reasoning
behind this approach is that Participant A is unlikely to deviate substantially from the average person in the
study. This approach probably works, however, only if just a few respondents leave out a particular question,
and if the researchers have no reason to think that any given respondent is different, on average, from the
others.
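
A minimal Python sketch of this mean-imputation approach (not from the book), with illustrative income figures:

import numpy as np
import pandas as pd

income = pd.Series([42000, 55000, np.nan, 61000, 48000])

# Replace Participant A's missing answer with the average of those who responded
income_filled = income.fillna(income.mean())
print(income_filled)
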
It has become standard practice in randomized controlled trials to use an intention-to-treat analysis (ITT)
to handle the dropout problem. With this type of analysis, all participants are included in the study group to
which they were allocated, whether or not they received (or completed) the program given to that group:
Analyze as randomized! ITT’s critics assert that its use may underestimate the full effects of a program
because some people are included who did not receive the intended program. As an alternative to ITT, some
evaluators use per protocol analysis, in which the analysis includes only participants who received the
intended program, even if they did not complete the entire program or provide complete data. Critics of per
protocol analysis argue that this technique may result in an overestimation of effectiveness.
Loss of study information from attrition, refusal, or inability to complete data collection can be a very
serious problem. Statistical fixes are complex and controversial, and as a result most evaluators try to avoid the
problem at the outset.
Experienced evaluators avoid missing data by being realistic about the study’s inclusion and exclusion
criteria, training and monitoring project staff to recruit and work with participants effectively, reimbursing
participants for spending their time to complete study activities, providing participants with readable updates
on individual and study progress, ensuring informed consent, and keeping all information confidential.

Cleaning the Data

Once the data are entered, they need to be cleaned. When a data set is clean, anyone who uses it will get the
same results when running analyses. Data become “dirty” for a number of reasons, including miscoding and
incorrect data entry. To avoid dirty data, the evaluator must make certain that coders and data entry personnel
are experienced, well trained, and properly supervised. One way of checking the data is to compare variable
values against preset maximum and minimum levels; if, for example, someone enters 50 when the maximum
is 5, an error has clearly occurred. The evaluator can also minimize errors by making sure that the coding
scheme distinguishes truly missing data (no response or no data) from responses of “don’t know” or “not
applicable.”
The evaluator should run frequencies (such as tabulations of the responses to a survey’s question) on the
data as soon as about 10% of the responses are in, and then run them again and again until it is clear that data
collection is running smoothly. If the data set is relatively small, the evaluator can visually scan the frequencies
for errors. For large databases with data from many measures and variables and from surveys with skip
patterns and open-ended text responses, a systematic computerized check may be required. All leading
statistical programs provide for cleaning specifications that can be used during data entry and later as a
separate data cleaning process.
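
Both checks described above, comparing values against preset limits and running frequencies, can be sketched in a few lines of Python (not from the book); the variable and its legal range are illustrative assumptions.

import pandas as pd

ratings = pd.Series([3, 2, 50, 4, 1, 3, 3], name="SATISF")   # legal range is 1 to 5

print(ratings[(ratings < 1) | (ratings > 5)])    # flags the miskeyed value of 50
print(ratings.value_counts().sort_index())       # a frequency run to scan for oddities
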
Several other problems may require the evaluator to clean up the data. These include information collection
measures that have not been completed at all, others that have been only partially completed, and still others
that contain data that are very different from those of the average evaluation participant.

Outliers

Outliers are pieces of data or observations that appear inconsistent with the rest of the data set. They may
be detected by the built-in checks in a statistical program, or the evaluator may discover them by running
frequencies and other descriptive statistics and checking the results against acceptable values. For instance,
suppose that you survey 50 people to find out if they like a particular movie. You review the returns and find
that 48 participants assigned ratings of 2, 3, and 4 on a scale of 1 (hated it) to 5 (loved it). One respondent is
consistently negative and assigns ratings of 1 to all 75 questions about the movie. Another respondent assigns
ratings of 5 to all 75 questions. These two people’s answers may be outliers—the question is what to do about
them.
Many researchers simply discard outliers from their data analyses. Each evaluator must decide on a case-
by-case basis what to do with data that clearly deviate from the norm. Such decisions should be made with
caution: When you discard seemingly deviant data, you may also be tossing out important information.
Methods of detecting outliers include regression analysis and formal tests that assume that variables are
normally distributed. (See Chapter 8 for a discussion of normal distribution.)
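
One simple screening approach, sketched below in Python (not from the book), flags values that fall more than three standard deviations from the mean; the ratings are illustrative assumptions, and any flagged value still calls for the case-by-case judgment described above.

import numpy as np

ratings = np.array([3, 2, 4, 3, 3, 2, 4, 3, 2, 3, 4, 2, 3, 50])   # one suspicious entry
z = (ratings - ratings.mean()) / ratings.std()
print(ratings[np.abs(z) > 3])   # prints [50]; the evaluator then decides what to do with it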

When Data Are in Need of Recoding

Data management may continue until the last analysis is performed. For example, as time goes on, you may
want to add additional people to the evaluation, or you may want to consider studying additional variables.
These activities will require coding, data entry, and data cleaning. You may need to recode the data you have
collected, as illustrated in Example 7.7.

Example 7.7 Recoding Data

Example 1: A program is initiated to encourage 15 communities to participate in collaborative health services


research. After the 2-day program is completed, the evaluators interview the participants to find out whether
their attitudes have changed. The evaluators hypothesize that the older participants are more interested in
collaboration than the younger ones. They have data on each person’s birth date, and preliminary analysis
reveals that the participants’ ages range from 26 to 71. The median age of the sample is 42. The evaluators
decide to compare the attitudes of older and younger people. They recode and dichotomize the data to create
two age categories: 42 and younger and 43 and older.

Example 2: The evaluators of an intervention to improve family function adopt a standardized measure of
family stability. The measure has 10 items, each rated on a scale of 1 to 5. The evaluators discover that 7 of
the items are worded so that a score of 1 is high and a score of 5 is low. On the other 3 items, a score of 5 is
high and a score of 1 is low. Because the evaluators want a total score of family stability, they have to recode
the three reverse-worded questions so that a score of 1 becomes a score of 5, a score of 2 is recoded to become
a 4, and so on. If they do not recode the data, the items do not have a common direction, and the evaluators
will be unable to sum them to get a score.
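
A brief Python sketch (not from the book) of the two recodes in Example 7.7, using illustrative values:

import pandas as pd

# Example 1: dichotomize age at the sample median of 42
age = pd.Series([26, 39, 42, 57, 71])
age_group = pd.cut(age, bins=[0, 42, 200], labels=["42 and younger", "43 and older"])

# Example 2: reverse-code items scored 1 to 5 so all items run in the same direction
reverse_item = pd.Series([1, 2, 5, 4, 3])
recoded = 6 - reverse_item    # a 1 becomes a 5, a 2 becomes a 4, and so on

print(age_group.tolist())
print(recoded.tolist())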

Creating the Final Database for Analysis

Once the data are entered, evaluators typically transfer the database to a statistical or qualitative software
program. (Although some statistical software programs have data management capabilities, special database
management programs tend to allow users to edit more easily.) Once the analysis begins, the evaluator may
alter the characteristics of some of the variables so that continuous variables are dichotomized and new
categories and derived variables are created. Example 7.8 shows how this works.

Example 7.8 During the Process of Creating an Analytic Data Set, Continuous Variables Are Dichotomized

1. When using the Alcohol-Related Problems Screen (ARPS), scores on the DRINK measure range from 1
to 100. For the analysis, the evaluators compare persons with scores of 25 and less with those who
achieved scores of 26 or more.

2. The Alcohol-Related Problems Screen (ARPS) asks participants to complete this question:

During the past 12 months, how often did you have six or more drinks of alcohol at one sitting?

Please check one

Further examples of dichotomizing continuous variables include the following:

• The evaluators are concerned with comparing people who drink in great excess of recommended limits
with those who do not drink at that level. Because of this concern, they dichotomize the answers to the
above question by comparing the proportion of participants who answer “never” to those who give any
other response.

New Categories Are Created

• The ARPS lists a number of medications and asks participants to check off the ones they are taking.
The evaluators decide to group the specific medications into categories, such as antidepressants,
antihypertensives, and so on.

Derived Variables Are Produced

• The evaluators create a variable called “drink years” by multiplying a participant’s years of drinking times
the number of drinks per day consumed by the participant.
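
The same transformations can be sketched in Python (not from the book); the participant values below are illustrative assumptions.

import pandas as pd

arps = pd.DataFrame({
    "drink_score": [12, 30, 25, 48],      # DRINK scores of 1 to 100
    "years_drinking": [5, 20, 10, 30],
    "drinks_per_day": [1, 3, 2, 4],
})

arps["high_score"] = (arps["drink_score"] >= 26).astype(int)           # dichotomized score
arps["drink_years"] = arps["years_drinking"] * arps["drinks_per_day"]  # derived variable
print(arps)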

Storing and Archiving the Database

The evaluator must be certain to create backup copies of the database and the analytic data set (before
conversion to the statistical program). In addition, the evaluator must document in the codebook all changes
made to the database or data set. Once the evaluator is confident that the data are clean and that all changes
to the database are final, then the analysis can begin. The evaluator must then create a backup of the final,
analytic database and store it so that it is safe and password protected.

Summary and Transition to the Next Chapter on Data Analysis

This chapter discusses the activities that evaluators engage in to ensure the proper management of evaluation
data so that those data are ready for analysis. These activities include: drafting an analysis plan, creating a
codebook, establishing coder reliability, reviewing data collection instruments for incomplete or missing data,
entering data into a database, cleaning the data, creating the final data set, and archiving the data set.
The next chapter focuses on the data analysis methods that are particularly useful in program evaluations. It
describes how the evaluator should go about determining the characteristics of the data that describe or
measure each main variable, characteristics that subsequently determine which analytic strategy is appropriate. The
next chapter also discusses how to establish practical as well as statistical meaning through the use of
confidence intervals and other techniques. Special statistical tests involving odds and risks are discussed
because of their importance in evaluation studies. In addition, the chapter discusses the analysis of qualitative
data and meta-analysis.

Exercises

Exercise 1

Directions

Interpret the codes for the following question asked on a telephone interview. The codes are in brackets.
[T_9] How much of the time during the PAST 4 WEEKS have you felt calm and peaceful? Would you
say all of the time, most of the time, some of the time, a little of the time, or none of the time?

All of the time [100]


Most of the time [75]
Some of the time [50]
A little of the time [25]
None of the time [0]
DON’T KNOW [8]
REFUSED [9]

Exercise 2

Directions

Assign codes to this set of questions asked during the same telephone interview as in Exercise 1.
My illness has strengthened my faith. Would you say this statement is very much, quite a bit, somewhat, a
little bit, or not at all true?

Very much
Quite a bit
Somewhat
A little bit
Not at all
DON’T KNOW
REFUSED

How much of the time during the LAST 4 WEEKS have you wished that you could change your mind
about the kind of treatment you chose for prostate cancer? Would you say all of the time, most of the time, a
good bit of the time, some of the time, a little of the time, or none of the time?

All of the time


Most of the time
A good bit of the time
Some of the time
A little of the time
None of the time
DON’T KNOW
REFUSED

References and Suggested Readings

Hulley, S. B., Cummings, S. R., Browner, W. S., Grady, D., & Newman, T. B. (2013). Designing clinical
research (4th ed.). Philadelphia, PA: Lippincott Williams & Wilkins.
Shi, L. (2008). Health services research methods (2nd ed.). Clifton Park, NY: Delmar Learning.
Trochim, W. M. K. (2006). The research methods knowledge base (2nd ed.). Cincinnati, OH: Atomic Dog.
Retrieved from http://www.socialresearchmethods.net/kb/

Purpose of This Chapter

Program evaluators use statistical and nonquantitative methods to analyze data and answer
questions about each program’s effectiveness, quality, and value. This chapter explains why data
analysis depends as much on the characteristics of the evaluation’s questions and the quality of its
data as it does on the evaluator’s skills in conducting the analysis. The chapter does not focus on
programming or how to do statistics because these topics are covered better in statistics textbooks
and computer manuals. Instead, the chapter discusses the logic that evaluators use to select the most
appropriate analysis plan.
In keeping with the book’s goal of informing prospective evaluators of the key concepts used in
designing and understanding evaluation, this chapter covers hypothesis testing, confidence intervals,
odds and risks, and meta-analyses. Finally, because evaluators often use both statistical and
qualitative methods in a single study, the chapter includes a discussion of one commonly used
qualitative method: content analysis.

8
Analyzing
Evaluation Data

A Reader’s Guide to Chapter 8

A Suitable Analysis: Starting With the Evaluation Questions

Measurement Scales and Their Data


Categorical, ordinal, and numerical variables

Selecting a Method of Analysis

Hypothesis Testing and p Values: Statistical Significance

Guidelines for Hypothesis Testing, Statistical Significance, and p Values

Clinical or Practical Significance: Using Confidence Intervals

Screening and Transforming Data

Establishing Clinical or Practical Significance

Risks and Odds


Odds Ratios and Relative Risk

Qualitative Evaluation Data: Content Analysis

Assembling the data, learning the contents of the data, creating a codebook or data dictionary,
entering and cleaning the data, and doing the analysis

Meta-Analysis

Summary and Transition to the Next Chapter on Evaluation Reports

Exercises

References and Suggested Readings

A Suitable Analysis: Starting With the Evaluation Questions

To select the most appropriate analysis for evaluation, the evaluator must first answer these questions:

1. What are the characteristics of the data collected as a measure of each independent and dependent
variable? Quantitative data may be characterized as categorical (male, female), ordinal (high, medium,
low), or numerical (a score of 30 out of 100 possible points). Qualitative data are often summarized in
terms of the themes that emerge from participants’ comments and evaluators’ observations.

2. Given the characteristics of the data, which statistical or qualitative methods are appropriate for
answering the evaluation questions?

3. If a statistical method is appropriate, do the evaluation’s data meet all of its assumptions? (For example,
are the data “normally distributed”?)

Measurement Scales and Their Data

A first step in selecting an analytic method is to identify the characteristics of the data that are collected as a
measure of each independent and dependent variable. Independent variables are used to explain or predict a
program’s outcomes (or dependent variables). Typical independent variables in program evaluations include
group membership (experimental and control program), health status (excellent, very good, good, fair, poor),
age, and other demographic characteristics. Typical dependent variables in evaluations include knowledge,
attitudes, and social, educational, and health behavior and status.
The data used to describe independent and dependent variables come from measures (surveys, tests) that
ask people to choose among various options (e.g., male or female; strongly agree, agree, disagree, strongly
disagree) or to provide precise information (e.g., date of birth). The resulting data take three forms:
categorical, ordinal, and numerical.

Categorical Data

Categorical data come from asking for information that fits into categories. For example:

Both questions require the participant to select the category (male or female) into which the response to the
survey questions—the data—fit. When categorical data take on one of two values (e.g., male or female; yes or
no) they are termed dichotomous (as in “divided in two”).
Typically, categorical data are presented as percentages and proportions (e.g., 50 out of 100 individuals in
the sample, or 50%, were male). The statistic used to describe the center of their distribution is the mode, or
the number of observations that appears most frequently. Variables that are described categorically are often
called categorical variables.

Ordinal Data

If an inherent order exists among categories, the data are ordinal. For example:

Similar to categorical data, ordinal data are also presented as percentages and proportions; however, the
center of the distribution is called the median, or the observation that divides the distribution into halves. The
median is equal to the 50th percentile. Variables that are described ordinally are often called ordinal variables.

Numerical Data

Numerical data can theoretically be described in infinitely small units. For example, age is a numerical
variable; weight, length of survival, birth weight, and many laboratory values and standardized achievement
test results are also numerical variables. Differences between numbers have meaning on a numerical scale
(e.g., higher scores mean better achievement than lower scores, and a difference between 12 and 13 has the
same meaning as a difference between 99 and 100). Numerical data are amenable to precision; that is, the
evaluator can obtain data on age, for example, to the nearest second.
Numerical data may be continuous, such as height, weight, and age. Or they may be discrete, such as
the number of visits to a clinic or the number of absences from school. Means and standard deviations are used to
summarize the values of numerical measures. Variables that are described numerically are often called either
continuous or discrete variables.
Table 8.1 contrasts the three types of data.

Table 8.1 Categorical, Ordinal, and Numerical Data

Selecting a Method of Analysis

The analysis method is dependent on the following:

• The characteristics of the data collected for the independent variable. Are the data categorical, ordinal,
or numerical?
• The number of independent variables.
• The characteristics of the data collected for the dependent variable. Are the data categorical, ordinal, or
numerical?
• The number of dependent variables.
• Whether the design, sampling, and quality of the data meet the assumptions of the statistical method.
These assumptions vary from method to method and are based on expectations about the nature and
quality of the data and how they are distributed (normal or skewed).

Example 8.1 shows the relationships among evaluation questions and hypotheses, independent and
dependent variables, research design and sample, types of data, and data analysis. Note that in the
justification, the evaluator discusses why the chosen statistical test is appropriate because the data on hand
meet the test’s assumptions.

Example 8.1 Analyzing Evaluation Data: Connections
Among Questions, Designs, Samples, Measures, and Analysis

Evaluation question: Is quality of life satisfactory?

Hypothesis: Experimental and control program groups will not differ in quality of life. (The evaluator wants to
reject this hypothesis; the study design enables him or her to determine if the hypothesis can be rejected
rather than confirmed.)

Evidence: A statistically significant difference in quality of life favoring program versus control group
participants

Independent variable: Group membership (program participants versus controls)

Design: An experimental design with parallel controls

Sampling: Eligible participants are assigned at random to either the experimental or control group; 150
participants are in each group (a statistically derived sample size)

Dependent variable: Quality of life

Data: Group membership (categorical data: experimental or control), quality of life (continuous numerical
data from the CARES Questionnaire, a 100-point survey in which higher scores mean better quality)

Analysis: A two-sample independent groups t-test

Justification for the analysis: This t-test is appropriate if the independent variable’s data are categorical and the
dependent variable’s data are numerical. In Example 8.1, the assumptions of a t-test are met. These
assumptions are that each group has a sample size of at least 30, the sizes of both groups are about equal, the
two groups are independent (an assumption that is met most easily with a strong evaluation design and a
high-quality data collection effort), and the data are normally distributed. (If one of these assumptions is
seriously violated, other rigorous analytic methods should be used, such as the Wilcoxon rank-sum test, also
called the Mann-Whitney U test. This test makes no assumption about the normality of the distribution;
whereas the t-test is termed parametric, the Mann-Whitney U test is one of a number called nonparametric.)
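
A minimal Python sketch (not from the book) of the planned analysis in Example 8.1, with the Mann-Whitney U test shown as the nonparametric alternative; the simulated quality-of-life scores are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
program = np.clip(rng.normal(loc=72, scale=10, size=150), 0, 100)   # CARES-style scores
control = np.clip(rng.normal(loc=68, scale=10, size=150), 0, 100)

t_stat, p = stats.ttest_ind(program, control)        # two-sample independent groups t-test
print("t =", round(t_stat, 2), "p =", round(p, 4))

u_stat, p = stats.mannwhitneyu(program, control)      # used if t-test assumptions are violated
print("U =", round(u_stat, 1), "p =", round(p, 4))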

Unfortunately, no definitive rules can be set for all evaluations and their data. Table 8.2, however, provides
a general guide to the selection of 15 of the most commonly used statistical methods.

Table 8.2 A General Guide to Statistical Data-Analytic Methods in Program Evaluation

Hypothesis Testing and p Values: Statistical Significance

Evaluators often compare two or more groups to find out if differences in outcome exist that favor a program;
if differences are present, the evaluators examine the magnitude of those differences for significance. Consider
Example 8.2.

Example 8.2 Comparing Two Groups

Evaluation question: Do participants improve in their knowledge of how to interpret food label information in
making dietary choices?

Hypothesis: No difference will be found in knowledge

Evidence:

1. A statistically significant difference in knowledge between participants and nonparticipants must be found. The difference in scores must be at least 15 points.

2. If a 15-point difference is found, participants will be studied for 2 years to determine the extent to
which the knowledge is retained. The scores must be maintained (no significant differences) over the 2-
year period.

Measurements: Knowledge is measured on the Dietary Choices Test, a 25-item self-administered test.

Analysis: A t-test will be used to compare the two groups in their knowledge. Scores will be computed a
second time, and a t-test will be used to compare the average or mean differences over time.

In the evaluation described in Example 8.2, tests of statistical significance are called for twice: to compare
participants and nonparticipants at one point in time and to compare the same participants’ scores over time.
In addition, the evaluators stipulate that for the scores to have practical meaning, a 15-point difference
between participants and nonparticipants must be obtained and sustained. With experience, health program
evaluators have found that, in a number of situations, statistical significance is sometimes insufficient
evidence of a program’s merit. With very large samples, for example, very small differences in numerical
values (such as scores on an achievement test or laboratory values) can be statistically significant, but have
little practical, educational, or clinical meaning and may actually incur more costs than benefits.
In the evaluation presented in Example 8.2, the standard includes a 15-point difference in test scores. If
the difference between scores is statistically significant, but only 10 points, then the program will not be
considered significant in a practical sense.

Statistical significance and the p value. A statistically significant program evaluation effect is one that is
probably due to a planned intervention rather than to some chance occurrence. To determine statistical
significance, the evaluator restates the evaluation question (“Does a difference exist?”) as a null hypothesis
(“No difference exists”) and sets the level of significance and the value that the test statistic must obtain to be
significant. After this is completed, the calculations are performed. The following guidelines describe the
steps the evaluator takes in conducting a hypothesis test and in determining statistical significance.

Guidelines for Hypothesis Testing, Statistical Significance, and p Values

1. State the evaluation question as a null hypothesis. The null hypothesis (H0) is a statement that no difference
exists between the averages or means of two groups. The following are typical null hypotheses in program
evaluations:

• No difference exists between the experimental program’s mean score and the control program’s mean
score.
• No difference exists between the sample’s (the evaluation’s participants) mean score and the
population’s (the population from which the participants were sampled) mean score.

When evaluators find that a difference does not exist between means, they use the following terminology: “We failed to reject the null hypothesis.” They do not say, “We accepted the null hypothesis.” Rejecting the null hypothesis, in contrast, suggests that a difference probably exists between the means, say, between Program A and Program B. Until the evaluators examine the data, however, they do not know whether A or B is favored.
When the evaluators have no advance knowledge of which is better, they use a two-tailed hypothesis test.
When they have an alternative hypothesis in mind—say, A is larger (better) than B—they use a one-tailed
test. Before they can describe the properties of the test, other activities must take place.

2. State the level of significance for the statistical test (for example, the t-test) being used. The level of significance,
when chosen before the test is performed, is called the alpha value (denoted by the Greek letter alpha: α).
The alpha gives the probability of rejecting the null hypothesis when it is actually true. Tradition keeps
the alpha value small—.05, .01, or .001—because the last thing an evaluator needs is to reject a null hypothesis that is, in fact, true, claiming a difference between group means when none exists.

The p value is the probability that an observed result (or result of a statistical test) is due to chance (and
not to the program). It is calculated after the statistical test. If the p value is less than alpha, then the null
hypothesis is rejected.
Current practice requires the specification of exact p values. That is, if the obtained p is .03, the evaluator
should report that p = .03 rather than p < .05. Reporting the approximate p was common practice before the
widespread use of computers (when statistical tables were the primary source of probabilities). This practice
has not been eradicated, however. The merit of using the exact values is evidenced in this example: a finding
of p = .06 may be viewed as not significant, whereas the exact finding of p = .05 will be viewed as significant.

3. Determine the value that the test statistic must attain to be significant. Such values can be found in statistical
tables. For example, for the z distribution (a standard, normal distribution) with an alpha of .05 and a
two-tailed test, tabular values (found in practically all statistics textbooks) show that the area of acceptance
for the null hypothesis is the central 95% of the z distribution and that the areas of rejection are 2.5% of
the area in each tail. The value of z (found in statistical tables) that defines these areas is –1.96 for the lower tail and +1.96 for the upper tail. If the test statistic is less than –1.96 or greater than +1.96, the null hypothesis will be rejected. Figure 8.1 illustrates the areas of acceptance and rejection in a standard normal distribution using
α = .05.

4. Perform the calculation.

Figure 8.1 Defining Areas of Acceptance and Rejection in a Standard Normal Distribution Using α =
.05

Source: B. Dawson-Saunders and R. Trapp. (1990). Basic and clinical biostatistics. New York: Appleton and Lange, p. 74; used by permission.

Note: (A) two-tailed or nondirectional, (B) one-tailed or directional upper tail, (C) one-tailed or directional lower tail.
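As a minimal sketch (the observed test statistic is hypothetical), steps 3 and 4 can be carried out with a statistical package such as SciPy rather than with printed tables:

    from scipy import stats

    alpha = 0.05
    lower_critical = stats.norm.ppf(alpha / 2)        # approximately -1.96
    upper_critical = stats.norm.ppf(1 - alpha / 2)    # approximately +1.96

    z_observed = 2.17                                 # hypothetical test statistic
    exact_p = 2 * (1 - stats.norm.cdf(abs(z_observed)))   # exact two-tailed p value
    print(round(lower_critical, 2), round(upper_critical, 2), round(exact_p, 3))
    # The null hypothesis is rejected because 2.17 falls beyond +1.96 (exact p = .03, which is less than alpha).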

Clinical or Practical Significance: Using Confidence Intervals

The results of a statistical analysis may be significant, but not necessarily practical. In clinical research, this
duality is called “statistical significance versus clinical significance.” The following discussion is based on an
editorial that appeared in the Annals of Internal Medicine (for the complete reference, see the entry for
Leonard Braitman in the “Suggested Readings” section of this chapter). Although that editorial refers to the
clinical significance of a treatment for cancer, the issues addressed and the extreme clarity of the discussion
make it especially useful for program evaluators when reviewing the significance of an evaluation.

Evaluation 1. Suppose that in a large, multicenter evaluation of a program in pain management for cancer
patients, 480 out of 800 (60%) patients respond well to the new program, while 416 out of 800 (52%) patients
do well in the traditional or standard program. Using a chi-square test to assess the existence of a real
difference between the two treatments, a p value of .001 is obtained. This value is the probability of obtaining by chance the 8-point (60% − 52%) or even larger difference between patients in the new and traditional
programs. The point estimate is 8 percentage points, but because of sampling and measurement errors (which
always exist in evaluation research), the estimate is probably not identical to the true percentage difference
between the two groups of patients. A confidence interval provides a plausible range for the true value. A confidence interval is a range, computed from the sample data, that has a given probability of containing the unknown true value. Using a standard method, the 95% confidence interval (95% CI) of the 8-percentage
point difference comes out to be between 3% and 13%. A 95% CI means that about 95% of all such intervals
would include the unknown true difference and 5% would not. Suppose, however, that given the side effects
and other costs of the new program, the smallest practical and thus acceptable difference (determined during
the standard-setting step of the evaluation process) is 15%; then the evaluator will conclude that the 8-
percentage point difference between interventions is not significant from a practical, health perspective,
although it is statistically significant.
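A minimal sketch of how the 8-percentage point difference and its 95% CI in Evaluation 1 can be computed with the usual normal approximation (the function name here is ours, not part of any standard package):

    import math

    def diff_proportion_ci(x1, n1, x2, n2, z=1.96):
        p1, p2 = x1 / n1, x2 / n2
        difference = p1 - p2
        standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
        return difference - z * standard_error, difference + z * standard_error

    low, high = diff_proportion_ci(480, 800, 416, 800)   # new program versus traditional program
    print(f"95% CI: {low:.1%} to {high:.1%}")            # roughly 3% to 13%, as reported above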

Evaluation 2. Consider another evaluation with 15 of 25 patients (60%) responding to a new program in
cancer pain management and 13 of 25 (52%) responding to a traditional program. The sample
size is 1/32 of that in the first example. The p value is .57 in this evaluation, in contrast to p = .001 in the
larger evaluation. These probabilities (ps) correspond to the same observed 8-percentage point difference as in
the above example. In this evaluation, the 95% CI extends from –19% to 35%; these values are statistically
indistinguishable from the observed difference of 8 percentage points. The larger evaluation permits a more
precise estimate of the true value. The greater width of the interval also shows the greater uncertainty that is
produced in an estimate based on a smaller sample. Thus, the use of a confidence interval enables the
evaluator to assess statistical and practical significance.

Establishing Clinical or Practical Significance

As shown in Figure 8.2, a difference in outcome between two groups in a program evaluation is significant
in a practical sense when its 95% confidence interval is completely above the smallest practical or clinically
important difference. As this figure shows, the confidence interval (3% to 13%) obtained for Evaluation
1 falls below the desired 15-point difference; it is not practically significant. The confidence intervals in
Evaluation 2 (−19% to 35%) and in Evaluation 3 (−3% to 51%) contain negative and positive differences, so
no definite conclusion about practical and clinical significance is possible.

Evaluation 3. In this evaluation, 15 of 25 (60%) and 9 of 25 (36%) of patients benefit from the new and
traditional programs. The confidence interval for the difference (60% − 36% = 24%) is −3% to 51%. The p
value (found in a table in a statistics text) is equal to or greater than .05, which is not statistically significant.
The confidence interval and p are related; if the interval contains 0, then the p is not significant. In this case,
0% can be found in the −3% to 51% interval. But, 0% is only one of many values inside the confidence
interval. The evaluator cannot state, “No meaningful difference”; therefore, because much of the interval falls
above the cutoff of the 15-point difference, the results can be interpreted as practically or clinically
inconclusive.

Evaluation 4. In this evaluation, 240 of 400 patients (60%) respond to the new program, while 144 of 400
(36%) patients respond to the traditional or standard program. The difference is 24% and the 95% CI is 17%
to 31%. The difference is statistically (p < .05) and practically significant.

Risks and Odds


Evaluators commonly assess risk and odds in evaluation studies. These are alternative ways of describing the
likelihood that a particular outcome will occur. Suppose that you are conducting a survey to find out why a
weight-loss program was not effective. You decide to survey the people in the program to find out the nature
and characteristics of the problems they have with dieting. You find that out of every 100 people who have
trouble dieting, 20 people report frequent problems. To identify the risk of a person’s having frequent
problems, you divide the number of individuals reporting frequent problems by the number you are studying,
or 20/100: The risk is 0.20. The odds of a person’s having frequent problems are calculated differently. To get
the odds, you subtract the number of persons with frequent problems (20) from the total (100) and use the
result (80) as the denominator. Thus the odds of having frequent problems with dieting are 20/80, or 0.25.
Table 8.3 shows the difference between risk and odds.
Because assessing risk and calculating odds are really just different ways of talking about the same
relationship, risk can be derived from odds, and vice versa. We can convert risk to odds by dividing it by 1
minus the risk, and we can convert odds to risk by dividing odds by 1 plus the odds:

Odds = Risk/(1 − Risk)


Risk = Odds/(1 + Odds)

Table 8.3 Risk and Odds, Compared and Contrasted

When an outcome is infrequent, little difference exists in numerical values between odds and risk.
However, when the outcome is frequent, differences emerge. If, for instance, 20 out of 100 persons have
frequent problems following a diet, the risk and odds are similar: 0.20 and 0.25 respectively. If 90 out of 100
persons have frequent problems, then the risk is 0.90 and the odds are 9.00.
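The conversions are simple enough to verify directly; the following minimal sketch reproduces the dieting figures above:

    def risk_to_odds(risk):
        return risk / (1 - risk)

    def odds_to_risk(odds):
        return odds / (1 + odds)

    print(risk_to_odds(0.20))   # 0.25: odds when 20 of 100 report frequent problems
    print(odds_to_risk(0.25))   # 0.20: converting back to risk
    print(risk_to_odds(0.90))   # 9.0: odds diverge from risk when the outcome is frequent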

Odds Ratios and Relative Risk

Evaluators often use odds ratios to compare categorical dependent variables between groups. Odds ratios
allow an evaluator to estimate the strength of the relationship between particular variables. For example,
suppose that you have evidence that people who have trouble dieting also do not eat breakfast. (You have
reason to believe that people who don’t eat breakfast often decide to snack before lunchtime.) You want to
check out the strength of the relationship between not eating breakfast and subsequent problems with
keeping to a diet. You put together two groups of people: those who have problems adhering to a diet and
those who do not. You ask them all, “Do you eat breakfast?” Their response choices are categorical: yes or no.
To analyze the responses, you might create a 2 × 2 table like the following:

                                Problems Adhering to a Weight-Loss Diet
                                Yes                 No
No Breakfast        Yes
                    No

This is a 2 × 2 table because it has two variables with two levels: breakfast (yes or no) and problems adhering
(yes or no). Notice that the table has four cells for data entry. The two variables, “no breakfast” and “problems
adhering,” are categorical variables.
If you were interested only in whether statistical differences exist in the numbers of people in each of the
cells, you could use the chi-square test. A chi-square test does not measure the strength of the relationship
between the variables; it only allows you to infer whether differences exist. In contrast, an odds ratio allows you to posit this
type of statement: “The odds of having problems adhering to a diet are greater (or lesser) among people who
eat (or do not eat) breakfast.”
The odds ratio is employed in evaluations that use case control designs. In such evaluations, the researcher
decides how many “cases” (e.g., people who have problems) and “controls” (e.g., people who do not have
problems) to include in the sample. To calculate the odds ratio, the evaluator counts how many in each group
have the “risk factor” (e.g., no breakfast) and divides the odds of having the risk factor among the cases by the
odds of having the risk factor among the controls. (The use of terms such as case control and risk factor comes
from studies of public health problems.)
Odds ratios are integral components of other statistical methods. Analytic software programs that do
logistic regressions or interpret them automatically yield odds ratios. Example 8.3 gives the formula for the
odds ratio.

Example 8.3 The Formula for the Odds Ratio

Risk Factor Present?        Cases        Controls
Yes                           a              b
No                            c              d

Odds ratio (OR) = (a/c) / (b/d) = ad/bc

Example 8.4 illustrates how this formula works.

Example 8.4 The Odds Ratio

Suppose that you are interested in the relationship between not eating breakfast (the risk factor) and trouble
adhering to a diet. You ask this question: When compared with people who eat breakfast, what is the
likelihood that people who do not eat breakfast will have trouble adhering to a weight-loss diet?

You identify 400 people who have trouble adhering to a diet and 400 people who do not. You find that
among all people with problems, 100 people do not eat breakfast and 300 are breakfast eaters. Among people
without problems, 50 do not eat breakfast. To compare the odds of problems between the two groups, you
put the data into a 2 × 2 table:

                              Problems Keeping to a Diet
                              Yes (cases)     No (controls)
No breakfast (risk factor)        100 (a)          50 (b)
Eats breakfast                    300 (c)         350 (d)
Total                             400             400

The odds ratio (OR) is calculated as follows:

OR = ad/bc = (100 × 350)/(50 × 300) = 35,000/15,000 = 2.33.

The odds of being exposed to the risk factor (no breakfast) are 2.33 times higher for people in the sample who have problems adhering to a weight-loss diet than for people who do not have such problems. The answer to your question is that the odds of having problems adhering to a diet are 2.33 times higher for people who do not eat breakfast than for people who eat breakfast.
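A minimal sketch of the same calculation, with the cells labeled as in Example 8.3 (a = cases with the risk factor, b = controls with the risk factor, c = cases without it, d = controls without it):

    def odds_ratio(a, b, c, d):
        return (a * d) / (b * c)

    # 100 of 400 cases and 50 of 400 controls did not eat breakfast (the risk factor).
    print(round(odds_ratio(a=100, b=50, c=300, d=350), 2))   # 2.33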

Evaluators also use risk ratios to examine the strength of the relationship between two categorical variables; however, this method for studying the relationship is quite different from calculating odds ratios. Risk ratios
are calculated for evaluations using a cohort design (for more on cohort designs, see Chapter 3). Suppose that
you want to determine how likely it is that people who do not eat breakfast have trouble adhering to a
weight-loss diet. First you pick the cohort—in this situation, people who do not eat breakfast and, for comparison, people who do. Then you select a period of time to observe them, say, 12 months. At the end of the 12 months, you count the number of people in each group who had adherence problems and the number who did not. Do these numbers differ? If so, was the risk greater for people who did not eat breakfast?
When you rely on cohort designs to calculate risk ratios (also called relative risk), you
have no control over how many people are in each group. Obviously, you cannot control the number of
people in the study group who develop problems, although you do have control over the number of people in
your study group. The evaluator who uses a case control design has control over the numbers of people who
have and who do not have the problem. He or she can say: I will have 100 cases and 100 controls. It may take
the case control evaluator more or less than 12 months to identify enough cases for each group. In technical
terms, the risk ratio is the ratio of the incidence of “disease” (e.g., problems adhering to a diet) among the
“exposed” (people who do not eat breakfast) to the incidence of “disease” among the “nonexposed” (people
who do eat breakfast). Incidence rates can be estimated only prospectively.
In a case control study, you cannot estimate the probability of having a problem because you, the evaluator,
have determined in advance how many people are in the cases and how many are in the controls. All you can
do is determine the probability of having the risk factor. This is unusual given that normally evaluators are concerned with finding out the probability of having the problem, not the probability of having the risk factor. However, when you compare odds in an odds ratio, you will find that the ratio of the odds for having the risk factor is
identical to the ratio of odds for having the disease or problem. Thus, you can calculate the same odds ratio
for a case control as for a cohort study or randomized controlled trial. It doesn’t matter which variable is
independent and which is dependent—the odds ratio will have exactly the same value. This is not true of risk
ratios, and so researchers find odds ratios an excellent way to answer questions or test hypotheses involving
categorical dependent variables between groups defined by categorical independent variables.
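For contrast, here is a minimal sketch of a risk ratio from a hypothetical 12-month cohort; all of the counts are invented for illustration:

    def risk_ratio(exposed_with, exposed_without, unexposed_with, unexposed_without):
        risk_exposed = exposed_with / (exposed_with + exposed_without)
        risk_unexposed = unexposed_with / (unexposed_with + unexposed_without)
        return risk_exposed / risk_unexposed

    # Hypothetical cohort: 200 people who skip breakfast and 300 who eat it, followed for 12 months.
    print(round(risk_ratio(60, 140, 40, 260), 2))   # RR = 2.25: skipping breakfast carries 2.25 times the risk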

Qualitative Evaluation Data: Content Analysis

Content analysis is a set of procedures for analyzing qualitative information. The researcher may collect
information directly from people, by asking them questions about their knowledge, attitudes, and behavior; or
indirectly, through observations of their behavior. The behavior may be online (measuring how often a web
page is visited or an article is downloaded) or in person (how many informal groups of children are created in
a school yard during lunch period).
Surveys typically ask people to comment in their own words. Analyzing the resulting data requires the
evaluator to pore over the respondents’ written or verbal comments to look for ideas or themes that appear
repeatedly. Once the themes are identified, they can be coded, which allows them to be counted and
compared.
Another familiar type of qualitative data comes from observations of people. For example, two members of
an evaluation team might spend a month observing patients as they arrive at a hospital’s emergency
department to determine what happens to them as they wait to be seen by health professionals. The observers
would then compare notes and summarize their findings.
Evaluators might also obtain qualitative data from documents and from various forms of media that were
produced for other purposes. For instance, suppose that you are part of a program planning and evaluation
team that wants to start an antismoking media campaign in a particular community. A first step is to examine
how stories about smoking and health have been handled in the mass media, specifically in news coverage. Of
the several types of news media, you decide to restrict your review of stories about smoking and health to local
and national newspapers. As in any analysis, you also have to be concerned with research design, sampling,
and analysis. What time period should your review cover? The past 10 years of news? The past 20 years? How
many articles should you survey? All of them? A sample? How should you define and code each type of story?
Which data analytic methods are appropriate for comparing the number and types of stories over time? You
might hypothesize that the number of newspaper stories about the harmful effects of smoking increased
significantly from 2009 to 2012, but that the number of stories has remained relatively constant since that
time. You can use content analysis methods to test these hypotheses and to answer your research questions.
There are five main activities involved in conducting a content analysis:

1. Assembling the data from all sources

2. Learning the contents of the data

3. Producing a codebook

4. Entering and cleaning the data

5. Doing the analysis

Assembling the Data

Qualitative data often take the form of a great many pages of notes and interview transcripts. The unsorted
data are the foundation of a qualitative database. They are not the same as the database, and, on their own,
they are not interpretable or amenable to analysis.
In addition to transcripts of individual and group interviews and field notes from observations, qualitative
data include participants’ responses to open-ended survey questions and reviews of written, spoken, and
filmed materials.
Transcripts are written, printed, or recorded accounts of every word spoken in interviews or by observed
individuals during data collection. Producing a transcript can take a great deal of time. For example, a
verbatim report of a typical discussion among eight people during a 90-minute group interview may result in
50 or more pages of transcript text.
Of course, not all data in a complete transcript are necessarily relevant, and so producing complete
transcripts may be unnecessary. Sometimes people get sidetracked during a discussion—they tell jokes,
change the subject, and so on—and all of these digressions are included in a complete transcript. Because
producing complete transcripts is so time consuming, and because such transcripts often contain many data
that are irrelevant to the research topic, some evaluators rely on abridged transcripts that include all pertinent
discussions and omit irrelevant remarks. It should be noted, however, that by using abridged transcripts an
evaluator may run the risk of excluding important information.
Written transcripts do not capture the expressions on people’s faces during a discussion, nor do they
adequately describe the passion with which some people state their positions. Because of this fact, transcripts
are often supplemented by audio and visual documentation. Evaluators sometimes use portions of the visual
and audio records from their studies in their evaluation reports to illustrate the findings and to lend a
“human” touch to the reports’ words and statistics. However, evaluators should be aware that any use of
participants’ words, voices, or visual images to justify or explain the research findings or recommendations
may raise legal and ethical questions. Participants must be given the opportunity to consent to the use of their
words and visual images in advance of data collection. They should be told where and under what
circumstances the information will be used and the risks and benefits of use should be disclosed.
Field notes are the notes taken by observers or interviewers “in the field”—that is, while they are conducting
observations or interviews. Transcribing and sorting through field notes is an onerous process. The evaluator
must review the notes and fill in any missing information, which requires remembering what was said or
done, when, and by whom. Some people have great memories; others are less fortunate. Some observers and
interviewers take better notes than others. It is preferable to have two or more observers taking notes in any
given setting so that the evaluator can compare their findings in order to estimate interrater reliability (for
more on interrater reliability, see Chapter 6). Having two note takers or observers may also reduce the amount of recall needed to assemble a complete set of data, because if one observer forgets who did what to
whom, the other may have recorded the information in great detail. Nevertheless, using two or more
observers in a setting can increase a study’s personnel costs. Also, with two or more observers, disagreement is
inevitable. The need for a third person to arbitrate when disagreements occur also increases personnel costs,
as well as the time needed to organize notes and observations.
Focus groups are targeted discussion groups. To be effective, a focus group must be led by a skilled
moderator, an individual who is able to focus the group’s attention on a specific topic. Note takers are often
employed to record focus group discussions because the moderators are too busy keeping group members on
track to take notes as well. However, even skilled note takers may leave out a great deal of information due to
the difficulty of capturing every speaker’s point. Because of the fear of losing vital information, evaluators
usually record focus group discussions on audiotape or, when possible, videotape. In recent years, voice
recognition software has become available that can be a part of a focus group leader’s toolbox. This type of
technology eliminates some of the difficulties of transcribing and interpreting notes.

Learning the Contents of the Data

The evaluator’s second step in content analysis is to become extremely familiar with the data that were
collected. The evaluator must understand the data thoroughly before he or she can assign codes to the data in
anticipation of data entry and analysis. Learning the contents of the data can mean reading through hundreds
of recorded pages of text, watching videotapes, and listening to audiotapes for days or even weeks. Becoming
reacquainted with the discussion in a single group interview, for example, may require a full day of the
evaluator’s time, and transcription of even an hour of audiotape can take as much as four to eight hours.

Creating a Codebook or Data Dictionary

Surveys with closed (or closed-ended) questions assign codes or values in advance of data collection. Surveys
with open-ended questions assign them after the data are collected. This is illustrated in Example 8.5.

Example 8.5 Coding Closed and Open-Ended Questions

Closed Questions

The following are excerpts from a survey of visits to a clinic made by new mothers as part of an evaluation of
the effectiveness of a program to prevent postpartum depression.

Question 5. Did the mother visit the clinic within 2 weeks of delivery? (postpartum)

Yes (1)
No (0)
No data (.a)
Question 7. Did the mother visit the clinic within 6 weeks of delivery? (well visit)

Yes (1)
No (0)
No data (.a)

A corresponding portion of a codebook for this survey appears as follows:
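A minimal sketch, using hypothetical variable names, of how the entries for the two closed questions above might be recorded:

    codebook = {
        "postpartum": {   # Question 5: clinic visit within 2 weeks of delivery
            "values": {1: "Yes", 0: "No", ".a": "No data"},
        },
        "well_visit": {   # Question 7: clinic visit within 6 weeks of delivery
            "values": {1: "Yes", 0: "No", ".a": "No data"},
        },
    }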

Open-Ended Questions

The following are excerpts from a summarized transcript of a group interview discussion among participants
in an evaluation of a program to prevent postpartum depression in new mothers.

Entering and Cleaning the Data

In content analysis, data entry involves organizing and storing the contents of transcripts and notes. Data
may be entered and organized by person, place, observation, quotation, or some other feature. If you are conducting a very small program evaluation, you might be tempted to organize such data on index cards.
However, better options for storage include spreadsheets, database management programs, and word
processing programs. Special software designed specifically for qualitative analysis is also available.
Cleaning the data may mean deciding on which data to discard. Why would anyone discard data?
Sometimes evaluators have to discard data because they are indecipherable (e.g., incomplete or unreadable
notes; broken audiotape) or irrelevant. If the evaluator elects to discard data, he or she must create rules
regarding what to do about lost or missing data. How will the lost data be handled? How much will the
absence of particular data influence the conclusions that can be drawn? When data are missing, the evaluator
must be sure to discuss the implications of this fact in the evaluation report.
Once the data are cleaned and their strengths and weaknesses are understood, they can be organized into a
database. Only a clean database stands a chance of producing reliable information. Inconsistent and
incomprehensible information is invalid. Obtaining a clean database is an objective shared by all evaluators,
whether their data collection methods are qualitative or statistical. Accomplishing this objective is extremely
time-consuming, and regardless of the size of the evaluation, evaluators using qualitative data collection
should plan adequate time for this major task.
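As a minimal sketch (the file and variable names are hypothetical), a package such as pandas can take over much of the routine cleaning:

    import pandas as pd

    df = pd.read_csv("postpartum_survey.csv")       # hypothetical coded survey data
    df = df.replace({".a": pd.NA})                  # treat the "no data" code as missing
    print(df.isna().sum())                          # how much data is missing, by variable
    complete = df.dropna(subset=["postpartum", "well_visit"])   # keep records with complete key variables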

Doing the Analysis

The final step is actually doing the analysis. One approach to the analysis involves asking respondents to
give their views without prompting, whereas another approach involves prompting in the form of mentioning
specific topics or themes. Example 8.6 shows how a content analysis with prompting might work.

Example 8.6 Barriers to Attendance at Depression Therapy Classes

Evaluation data collection: Structured interview with 200 new mothers

Purpose: To discover barriers to attendance at depression therapy classes

Premise: Barriers include lack of transportation, lack of child care for older children, inability to get off from
work, lack of motivation (don’t think they need the classes)

Question: Which of the following [the barriers are listed] is the most important reason why you or other
mothers might not be able to attend the classes?

Analysis: Count each time a particular barrier (e.g., lack of transportation) is chosen as the most important
reason

Results: (table showing, for younger and older mothers, how often each barrier was mentioned*)

*The number in each cell represents the number of times that barrier is mentioned.

Conclusions: Overall, lack of child care is the most frequently mentioned reason for not attending classes. Older and younger women differed in
their barriers, with younger women believing strongly that they don’t need classes. A greater number of older women than younger women
cited lack of transportation as a barrier. Inability to get off from work was not cited frequently by either age group.

Prompting to collect evaluation information means identifying preselected themes. The evaluator must
derive the themes from the research literature and from past experience. If you were to prompt the mothers in
Example 8.6 with preselected themes, you might ask them to come together in a group interview or focus
group to answer questions like these: How important is lack of transportation (child care, inability to get off
work, need for classes) as a reason for your coming or not coming to classes? What other reasons can you
think of that might help us understand why women might choose to stay away from classes? After the group
session, you would analyze the data by reviewing the transcript of the interview and counting the number of
times the participants cited each of the barriers.
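A minimal sketch (the coded responses are invented) of the counting step that follows either approach:

    from collections import Counter

    coded_responses = [
        "child care", "transportation", "child care", "no need",
        "work", "child care", "no need", "transportation",
    ]   # hypothetical barrier codes assigned to interview excerpts

    for barrier, count in Counter(coded_responses).most_common():
        print(barrier, count)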

Meta-Analysis

Meta-analysis is a method for combining and analyzing the findings of studies that address the same research
questions. The idea is that the larger numbers obtained from contributing studies have greater statistical
power and generalizability together than any of the individual studies. Meta-analysis provides a quantitative
alternative to the traditional review article in which experts use judgment and intuition to reach conclusions
about the merits of a program or, alternatively, base their conclusions on a count of the number of positive
versus negative and inconclusive studies.
Suppose that you are conducting a meta-analysis to answer the question, “Do programs to educate
adolescents about health care result in improved decisions among adolescents about their own health care?”
(See Figure 8.3.) You would answer the question by completing these seven tasks:

1. State the problem (in this case, whether programs to educate adolescents improve their decisions about
their own health care).

2. Identify all studies that address the problem.

3. Prepare a scale to rate the quality of the studies.

4. Have at least two people review and rate the quality of the studies.

5. Include all studies that meet the criteria for quality, according to the reviewers’ ratings of quality.

6. Calculate the difference in improvement between adolescents who were educated and those who were
not and plot the difference as a point on a chart.

7. Calculate the chances that each study can be repeated and produce the same results; show the statistical
range as a line on the chart.

The hypothetical meta-analysis of three evaluations illustrated in Figure 8.3 shows that the weight of the
results suggests that programs to educate adolescents have no advantage over the controls.
Although meta-analysis has its origins in psychology and education, it has become associated with health
care, epidemiology, and medicine. Meta-analyses have been used to summarize a number of diverse
investigations, including studies of the care of pregnant women, the effects of estrogen replacement therapy
on breast cancer, the influence of oat bran on lipid levels, and the treatment of heart attack.
Meta-analysis methods are continually advancing. Techniques for cumulative meta-analyses, for example,
permit the identification of the year when the combined results of multiple studies first achieved a given level
of statistical significance. The technique also reveals whether the temporal trend seems to be toward
superiority of one program or intervention over another, and it allows assessment of the impact of each new
study on the pooled estimate of the treatment effect.

Figure 8.3 Meta-Analysis of Three Educational Programs for Adolescent Health Care
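A minimal sketch (with invented effect estimates and standard errors) of the fixed-effect, inverse-variance pooling that underlies a chart like Figure 8.3:

    studies = [          # (effect estimate, standard error) for each contributing evaluation
        (0.10, 0.06),
        (-0.02, 0.04),
        (0.05, 0.08),
    ]

    weights = [1 / se ** 2 for _, se in studies]
    pooled = sum(w * estimate for (estimate, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    print(f"pooled effect = {pooled:.3f}, "
          f"95% CI = {pooled - 1.96 * pooled_se:.3f} to {pooled + 1.96 * pooled_se:.3f}")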

Summary and Transition to the Next Chapter on Evaluation Reports

This chapter discusses data analysis methods that are particularly useful in program evaluations. Before
choosing a method, the evaluator should determine the number of variables and the characteristics and
distribution of the data. When using tests of significance, the evaluator should decide on clinical or practical
as well as statistical meaning.
The next chapter discusses written and oral evaluation reports. It describes the contents of written reports,
including objectives, methods, results, conclusions, discussion, and recommendations. Special emphasis is
placed on the use of tables and figures to present data. The chapter also explains the contents of a report’s
abstract or concise overview (typically about 250 words). Published evaluation reports are required to conform
to standard reporting guidelines, such as CONSORT (CONsolidated Standards of Reporting Trials) and
TREND (Transparent Reporting of Evaluations with Nonrandomized Designs) statements. Each of these is discussed in the next chapter along with evaluation ethics because reports are now required to demonstrate
that the evaluation adhered to ethical standards.
Since evaluation reporting is often oral, the chapter also provides guidelines for the preparation of oral
presentations and posters.

Exercises

Exercise 1

Directions

For each of the following situations, describe the independent and dependent variables and determine
whether they will be described with categorical, ordinal, or numerical data.

Exercise 2

Directions

Use the following information to select and justify a data analysis method.

Evaluation question: After program participation, is domestic violence decreased?

Evidence: A statistically significant difference in domestic violence is found between families who
participated in the experimental program and families in the control program.

Independent variable: Group or program membership (experimental versus control)

Design: An experimental design with parallel controls

Sampling: Eligible participants are assigned at random to either an experimental group or a control group;
100 participants are in each group (a statistically derived sample size).

Dependent variable: Domestic violence

Data: Data on domestic violence will come from the DONT Survey, a 50-point measure in which lower
scores mean less violence.

Exercise 3

Directions

Suppose that the evaluation of the program to reduce domestic violence described in Exercise 2 is
concerned with comparing younger and older persons in the experimental and control groups. Assuming the
use of the DONT Survey, which produces continuous scores, which statistical method would be appropriate?
Explain.

References and Suggested Readings

Braitman, L. E. (1991). Confidence intervals assess both clinical and statistical significance. Annals of Internal
Medicine, 114, 515–517.
See also the StatSoft Web site at http://www.statsoft.com/textbook/ for an online statistics textbook that
covers everything a good statistics text covers and then some. This volume, which includes an excellent
glossary, is highly recommended for those interested in learning about or reviewing statistics. StatSoft, the publisher of STATISTICA analysis and graphic software, offers this
textbook as a public service.

Purpose of This Chapter

This chapter explains how to prepare a transparent evaluation report. A transparent report provides
a detailed and accurate explanation of the evaluation’s purposes, methods, and conclusions.
Transparency improves with the use of standardized reporting checklists, such as the Consolidated
Standards of Reporting Trials (CONSORT Statement) and Transparent Reporting of Evaluations
with Nonrandomized Designs (TREND).

Evaluation reports take the form of printed or online manuscripts, oral presentations, or posters.
This chapter first discusses how to prepare manuscripts and describes the figures and tables you
need to show results. Next, the chapter examines oral reports and explains how to prepare slides and
posters.

Many program evaluations, particularly randomized and nonrandomized studies, include human
subjects. Increasingly, program evaluators are asked to state in their report whether their evaluations
measure up to ethical principles for conducting research with human subjects. Many journals in all
fields will only accept reports that provide evidence that the evaluation was reviewed and approved
by an ethics board. The chapter discusses how to conduct and report on evaluations that respect the
uniqueness and independence of individuals, evaluations that actively make an effort to secure their
well-being, and evaluations that balance the risks and benefits of participation.

9
Evaluation Reports

A Reader’s Guide to Chapter 9

The Written Evaluation Report


Composition of the Report: Introduction, methods, results, conclusions or discussion,
recommendations; the abstract, the executive summary

Reviewing the Report for Quality and Ethics


Need for the evaluation, justification of questions and choice of evidence, description of the program,
evaluation design and sampling, justification for and validity of the data sources, appropriateness of the
data analysis, completeness and accuracy of reporting, and limitations of the findings

Oral Presentations

Posters

Ethical Evaluations

The Internet and Ethical Evaluations

Sample Questionnaire: Maintaining Ethically Sound Online Data Collection

Example: Consent Form for an Online Survey

Research Misconduct

Exercises

Suggested Websites

The Written Evaluation Report

Evaluation reports can take the form of books, monographs, and journal articles. They may be available in
print, online, or both. A useful evaluation report provides enough information so that at least two interested
individuals who read the report will independently agree on the evaluation’s purposes, methods, and
conclusions.
If an evaluation report is prepared for submission to a funding organization, such as a foundation, trust, or
government agency, the composition and format of the report are usually established by that organization. In
many cases, however, evaluators are on their own in deciding on the length and content of their reports. Most program evaluation reports are somewhere between 2,500 and 15,000 words in length, or from 15 to 60
double-spaced printed pages (using common fonts, such as Arial and Times New Roman, 11- to 12-point
type, and 1-inch margins).
In addition to the text, evaluation reports include lists of relevant bibliographic references as well as tables
and figures (e.g., photographs, graphs) to illustrate evaluation findings; usually, no more than 10 tables and
10 figures are the norm. In addition, an abstract of approximately 250 words and a summary of up to 15 pages
are often helpful. Very long reports are rarely read. Evaluators who produce long reports should always make
certain that their executive summaries are concise and accurate, because most people will focus on the
summary when faced with reams of pages.
Evaluators sometimes post working documents online to supplement a report, such as résumés, project worksheets, survey
response frequencies, complex mathematical calculations, copies of measures (e.g., survey questionnaires or
medical or school record review forms), organizational charts, memorandums, training materials, videos, and
project planning documents.
Example 9.1 gives the table of contents for an evaluation report on an 18-month program combining diet
and exercise to improve health status and quality of life for persons 75 years of age or older.

Example 9.1 Sample Table of Contents for a Report: An Evaluation of the Living-at-Home Program for Elders

Abstract: 250 words

Summary: 8 pages

Text of report: 41 pages

I. Introduction: The Health Problem, the Program, the Evaluation’s Purpose and the Evaluation
Questions/Hypotheses and Evidence of Merit (6 pages)

II. Methods (10 pages)

A. Evaluation design

B. Objectives and activities of the intervention and control programs

C. Sample

1. Inclusion and exclusion criteria

2. Justification of sample sizes

3. How sample was selected and assigned to groups

D. Outcome Measures

1. Reliability and validity of measures of quality of life, health, and cost-effectiveness

2. Quality assurance system for the data collection

E. Analysis
(In this section, the evaluators cite and justify the specific method used to test each hypothesis or
answer each evaluation question. For example: “To compare men and women in their health, we used
a t-test, and to predict who benefited most from participation in the experimental program, we relied
on stepwise multiple regressions.”)

III. Results (15 pages)

A. Response Rates (such as how many eligible men and women agreed to participate in the evaluation;
how many completed the entire program; how many individuals completed all data collection
requirements). This information can be presented in a flow chart.

B. Demographic and other Descriptive Characteristics (for the experimental and control groups:
numbers and percentages of men and women; numbers and percentages under 65 to 75 years of age,
76 to 85 years of age, and 85 years and older; numbers and percentages choosing each of the two
health care staffing models).

C. Effectiveness: Quality of Life and Health Status

D. Cost-Effectiveness of Two Staffing Models of Care

IV. Conclusions (8 pages)

V. Recommendations (2 pages)

VI. Tables and Figure

A. Table 1. Demographic Characteristics of Participants

B. Table 2. Health Outcomes and Quality of Life for Men and Women With Varying Levels of Illness

C. Table 3. Costs of Two Clinic Staffing Models

D. Figure 1. Flowchart: How Participants Were Assigned to Groups by “Cluster”

VII. Online Supplements/Appendixes

A. Copies of all Measures

B. Calculations Linking Costs and Effectiveness

C. Final Calculations of Sample Size

D. Testimony From Program Participants Regarding Their Satisfaction With Participation in the
Experimental Program

E. Informed Consent Statements

F. List of Panel Participants and Affiliations

G. Training Materials for all Data Collection

H. Data Collection Quality Assurance Plan

Suppose that the Living-at-Home Program report outlined in Example 9.1 works this way:

• Program: The members of an experimental group of elderly people who still live at home receive the diet
and exercise program, whereas elderly individuals in another living-at-home group do not. Participants
in the evaluation who need medical services choose freely between two clinics offering differing models
of care, one staffed primarily by physicians and the other staffed primarily by nurses.
• Assignment to study groups: Participants are randomly assigned to the experimental or control programs
according to the streets on which they live. That is, participants living on Street A are randomly
assigned to either the experimental or control program, as are participants living on Streets B, C, and so
on.
• Main outcomes: (a) Whether program participation makes a difference in the health and quality of life of
elderly men and women and the role of patient mix in making those differences, and (b) the cost-
effectiveness of the two models of health care delivery.

Composition of the Report

Introduction

The introduction to a program evaluation report has three components: (a) a description of the problem,
(b) an explanation of the means that the experimental program will use to solve the problem, and (c) a list of
questions that the evaluation answers about the merits of the program’s solution to the problem. Example 9.2
illustrates the contents of the introduction to a written report.

Example 9.2 What to Include in the Introduction to a Written Report

1. The problem: Describe the problem that the program and its evaluation are designed to solve. In the
description, tell how many people are affected by the problem and discuss its human and financial costs.
Cite the literature to defend your estimates of the importance and costs of the problem.

2. The program: Give an overview of the program’s objectives and activities and any unique features (such as
its size, location, and number and types of participants). If the program is modelled on some other
intervention, describe the similarities and differences and cite supporting references.

3. The evaluation: State the objectives of the evaluation, the questions, and the evidence of program merit. If
you used an evaluation framework to guide planning and evaluation, describe it and its use in the
evaluation. Establish the connections among the general problem, the objectives of the program, and the
evaluation. In other words, tell how the evaluation provides knowledge about this particular program and
also provides new knowledge about the problem, as in the following example of the evaluation of a home
health care program for older adults.

Sample Introduction

The purpose of this evaluation is to identify whether community-dwelling elderly who participated in a home
health care program showed improvement in their health and quality of life. Participants were randomly
assigned to an experimental or a control group. Because evidence exists that home health care can improve
social, emotional, and physical functioning in the elderly [references regarding the potential benefits of home
health care should have been cited in the first part of the introduction], we asked about the effectiveness of
the program for men and women of differing ages and levels of medical and social problems and the nature,
characteristics, and costs of effective home health services. Using the literature as a basis for deciding on
evidence of effectiveness, we hypothesized that the experimental group would see greater improvements than
the control group and that the program would not be any more costly.

Methods

The methods section of the report should define terms and describe the program or interventions, design,
participants, outcomes and their measures, and analysis. Some specific recommendations are as follows:

• The program: Describe the experimental and comparison programs, carefully distinguishing between
them. If you prepared protocols to standardize the implementation of the programs, describe them and
any training in the use of the protocols that took place.
• Definitions: Define all potentially misleading terms, such as quality of life, health status, high risk,
accessible care, high quality of care, and efficiency.
• Design: Convey whether the evaluation used an experimental or observational design. If the design was
experimental, specify the type (e.g., parallel controls, in which participants are randomly assigned to
experimental and control groups).
• Setting: Report on where the evaluation data were collected. You might include the geography (e.g.,
name of city and state or country), the locale (e.g., urban versus rural), and the site (e.g., clinic, academic
medical center, community-based group practice).
• Participants: Give the inclusion and exclusion criteria for participation in the evaluation. Tell whether
the participants were randomly selected and randomly assigned. Explain how the sample sizes were
determined.
• Outcomes and measures: Describe each outcome and the characteristics of the data collection measures for
the main evaluation questions. Who administered each measure? Was training required for each? Is each
measure reliable? Is each valid? How much time is required to complete each measure? How many
questions does each contain? How were the questions selected for each measure? If appropriate, cite the
theory behind the choice of questions or the other measures on which they were based. How is each
measure scored?
• Analysis: Check each evaluation question for the main variables. Then, for the main variables, describe
and justify the analytic method. Have you used any unusual methods in the analysis? If so, you should
describe them. Name the statistical package used in case other evaluators want to perform a similar analysis using the same setup. If you have used a relatively new or complex data analytic method, provide
a reference for it.

Results

The results section of the report presents the results of the process or implementation evaluation and the
statistical analyses. All response rates should appear in this section, along with descriptions of the evaluation
participants’ characteristics. This section should include a comparison of the individuals who agreed to
participate with those who refused or participants who did not complete the entire program or provide
complete data. The results for each major evaluation question and its subquestions also should appear in this
section. For example, if one of your main questions asks whether patients’ quality of life improves, you should
present the results for that question. You might also provide data on the types of individuals (e.g., older men,
sicker patients) for whom the program was most and least effective in terms of quality of life.
Tables and figures (graphs or other illustrations) are useful for summarizing results in this section. Example
9.3 displays a typical table that often appears first in most reports.

Example 9.3 A Table to Describe the Characteristics of an Evaluation’s Participants

Figure 9.1 Variability in Transfusion Practice During Surgery (Hypothetical)

Source: D. P. Sulmasy, G. Geller, R. Faden, and D. M. Levine. (1992). The quality of mercy: Caring for patients with “do not resuscitate”
orders. Journal of the American Medical Association, 267, 682–686. Copyright 1981, 1991, 1992, American Medical Association.

Note: Number (percentage) of patients transfused with red blood cells among 30 first-time surgery patients at each institution. The distribution
of institutions is expressed as a histogram, indicating the number of institutions each transfusing zero to 30 patients. The distribution of
patients who received red blood cells was variable among institutions. Open bars indicate autologous blood; shaded bars indicate homologous
blood.

The results section is not the place for an interpretation of the data. Statements like “These results
contradict the findings of previous . . .” belong in the discussion section.
The following are some specific recommendations for the use of figures and tables in reporting evaluation
results.

Using figures. Figure 9.1 provides an example of a figure that is useful for reporting results. This figure comes
from a hypothetical study concerned with describing variability in transfusion practice during surgery. Based
on the data presented in the figure, the study reported that the variation in the percentage of patients
receiving red blood cells among institutions differed significantly (range, 17% to 100%; p < .001). Among the
study’s conclusions was that, in view of the NIH’s Consensus Conference programs that addressed blood
component use during surgery, conference recommendations need to be applied more effectively at the
institutional level.
In preparing figures for your evaluation report, you should conform to the same rules followed by the
creators of Figure 9.1:

1. Place variables that are being compared (in the case of Figure 9.1, institutions) along the x-axis.

2. Place numbers along the y-axis; when appropriate, include percentages on the figure.

3. Make sure that visual differences correspond to meaningful differences. (In the case of Figure 9.1, the
authors express the distribution of institutions as a histogram. In a histogram, the area of each bar is
proportional to the percentage of observations in that interval. For example, if 11 of 100 patients were found in an interval labeled 20 to 50 years of age, then the 11 patients would constitute 11/100, or 11%,
of the area of the histogram.)

4. Include an explanation of the findings (e.g., “The distribution of patients who received red blood cells
was variable among institutions.”)

5. Include a legend or key (e.g., “Open bars indicate autologous blood; shaded bars indicate homologous
blood.”).
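A minimal sketch (with invented counts) of a figure that follows these rules, using the matplotlib library:

    import matplotlib.pyplot as plt

    institutions = ["A", "B", "C", "D"]            # the compared variable goes on the x-axis
    autologous = [5, 12, 20, 8]                    # hypothetical numbers of patients transfused
    homologous = [10, 6, 9, 22]

    positions = range(len(institutions))
    plt.bar([p - 0.2 for p in positions], autologous, width=0.4, label="Autologous blood")
    plt.bar([p + 0.2 for p in positions], homologous, width=0.4, label="Homologous blood")
    plt.xticks(list(positions), institutions)
    plt.xlabel("Institution")
    plt.ylabel("Number of patients transfused")    # numbers go on the y-axis
    plt.legend()                                   # the legend identifies each bar type
    plt.show()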

Using tables. When you create a table for an evaluation report, you should keep three main rules in mind.
First, the most important values to be compared should appear in the table columns. For example, if you are
describing the characteristics (such as age or educational level) of users and nonusers of smokeless tobacco (see
Figure 9.2), the values (such as numbers and percentages of persons with the differing characteristics) go in
the columns.

Figure 9.2 The Columns: Users and Nonusers of Smokeless Tobacco

Second, if appropriate and possible, the statistical values should appear in descending order, with the largest values first. Suppose that the table depicted in Figure 9.3 displays the results of a nationwide survey of
734 people who were asked whether or not they preferred fish or meat for dinner. Note that, in this table, the
preferences for meat are in descending order. The choice of which values to place first depends on the points
being emphasized. If the focus of the evaluation in this case were on preferences for fish instead of for meat,
then the first cell of the table under the column head “Region” would be West.

Figure 9.3 Example of a Table With Statistical Values in Order: The National Dietary Preferences
Survey (hypothetical)

*P = .003

**P = .002

Note: Survey administered by the Center for Nutrition and Health, Washington, D.C.

Conclusions or Discussion

The conclusions or discussion section of the evaluation report should tell readers what the results mean by
answering questions similar to the following:

• Taking the broadest perspective, what can one conclude from the evaluation? Does the program have
any merit? For whom? Do the findings apply to the real world?

Example 9.4 presents an illustration of the conclusions to a hypothetical study of an online education program
for older adults.

Example 9.4 Conclusions of an Evaluation of an Online Education Program

Based on this study’s results, we conclude that a website like “Your Health Online” is feasible for use among
many older adults, and that it contains useful and usable information. People who work with older adults
should consider how to obtain and integrate web-based instruction into their practices. Among an online
program’s beneficial features are that it can be personalized (e.g., users may take as much time as they need),
made interactive (e.g., users can receive immediate feedback), and updated regularly. To our knowledge, this
is the first online educational program that combines these characteristics to meet the needs of the growing
population of older Internet users. Future research should focus on evaluating the website’s effectiveness in
improving the way divergent groups of older adults, especially those of differing ethnicities and
socioeconomic status, actually search for online health information; specifically, research is needed on the effects of better searches on the quality and value of health care, offline and ehealth (health care practice supported by electronic processes and communication) literacy, health outcomes, and quality of life.

In addition to explaining the evaluation results, this part of the report might answer the following kinds of questions:

• Did the program achieve its goals and objectives?


• For which participants was the program most effective?
• For which participants was the program least effective?
• Which components of the program were most/least effective?
• How do the results of this evaluation compare in whole or in part with the findings of other studies?
• Do the evaluation results contribute to any new knowledge about health, health program evaluation,
health care policy, and the health care system?
• What gaps in knowledge have been revealed by this evaluation?
• What are the limitations of this evaluation (due to imperfections in the design, sample, measurement,
and analysis), and how do these affect the conclusions?

Recommendations

In the recommendations section of the evaluation report, the evaluator might consider answering questions
like these:

1. Without changing its basic goals and objectives, if the program were redone to remove its flaws, what
are the top five changes or additions that should be considered?

2. If the program were applied to another setting or group of participants, who would likely benefit most?

3. If the program were instituted in the same or some other setting, what costs could be expected?

4. What objectives should be changed or added to the program to expand its scope and effectiveness?

The Abstract

The abstract of an evaluation report is usually quite brief: between 200 and 300 words. Its purpose is to
present the evaluation’s main objectives, methods, and findings. The following topics are usually addressed in
the abstract, although the amount of detail about each topic varies:

Objective: In one or two sentences, tell the purpose of the evaluation.

Design: Using standard terminology, name the design (e.g., randomized controlled trial or true experiment;
nonrandomized trial or quasi-experiment; survey). Describe any unique feature of the design, such as the
use of blinding.

Participants: Describe the characteristics of the participants, including the numbers of participants in the
experimental and control groups, demographics (such as age, income, and health status), region of the
country, and size of facility (such as hospital, clinic, school). Describe any unique features of the
participants, such as their location or special health characteristics.

Main outcome measures: Describe the main outcomes and how they were measured. Describe any unique
features of the measures, use their proper names if appropriate, and include any special notes on
reliability or validity.

Results: For each major dependent variable, give the results.

Conclusions: In one or two sentences, explain what the results mean. Did the program work? Is it
applicable to other participants?

Example 9.5 provides an illustration of an annotated abstract.

Example 9.5 An Annotated Abstract for an Evaluation of an Online Education Program for Older Adults

Objective: To develop and evaluate an online program to improve older adults’ skills in identifying high-
quality web-based health information

Design: Mixed methods: surveys and a randomized controlled trial. We conducted focus groups and individual interviews to collect data on older adults' preferences for online instruction and information. We used the findings to develop, pilot test, and evaluate an interactive website grounded in health behavior change models, adult education, and website construction

Programs: A newly designed web-based program, Your Health Online (the experimental group) compared to
an existing online slide show, Evaluating Health Information

Setting: Community senior center

Participants: 300 persons 55 years of age and older who used the Internet for health information at least once
in the last 12 months; 30 people participated in program development; 185 were assigned to the new or
existing program

Main Outcome Measures: Newly designed measures of usability, satisfaction, and knowledge; The Senior
Self-Efficacy Report

Results: Experimental participants assigned significantly higher ratings of usability and learning (p = .003) to the new site than control participants did to their tutorial, although no differences were found in self-efficacy or knowledge. Experimental participants reported that participation was likely to improve future searches (p = .02)

Conclusion: A website like Your Health Online is feasible for use among many older adults, in that it
contains useful and usable information. People who work with older adults should consider how to obtain and
integrate web-based instruction into their practices

(Word count, including subheadings is 240)

The Executive Summary

The evaluation report's executive summary provides all potential users with an easy-to-read summary
of the evaluation’s major purposes, methods, findings, and recommendations. Executive summaries are
relatively brief, usually from 3 to 15 pages in length. Evaluation funders nearly always require that reports
include such summaries, and they frequently specify the number of pages expected.
Three rules should govern your preparation of an executive summary for your evaluation report:

1. Include the important purposes, methods, findings, and recommendations.

2. Avoid the use of jargon. If necessary, define terms. For example:

Poor:
We investigated concurrent validity by correlating scores on Measure A with those on Measure B.

Better:
Concurrent validity means that two measures produce the same results. We examined the relationship
between scores on Measures A and B to investigate their concurrent validity.

3. Use active verbs.

Poor:
The use of health care services was found by the evaluation to be more frequent in people 45 years of age
or younger.

Better:
The evaluation found more frequent use of services by people 45 years of age or younger.

Poor:
It is recommended that the prevention of prematurity be the focus of prenatal care education.

Better:
The ABC Group recommends that the prevention of prematurity be the focus of prenatal care
education.

As you prepare the executive summary, make certain you provide information on the program, evaluation,
findings, conclusions, and recommendations.

The program or intervention: Describe the intervention or program’s purposes and objectives, settings, and
unique features—consider including answers to these types of questions:

• During which years did the intervention take place?


• Who were the funders?
• How did the intervention differ from others that have similar purposes?
• How great was the need for the intervention?

The evaluation: Present an overview of the purposes of the evaluation, describe the evaluation framework,
describe the questions and standards, review the design and sample, discuss the outcomes and how they
were measured, and explain the main analytic methods. Consider answering questions like these:

• How was the evaluation defined?


• Were the evaluation methods unique in any way?
• For whom are the evaluation’s findings and recommendations most applicable?
• Who performed the evaluation?
• During which years did the evaluation take place?

The findings: Give the answers to the evaluation questions and note whether the program or intervention
was effective. Did it achieve agreed-upon standards of merit? Additional questions to consider:

• Is the program likely to be sustained over time?


• Who benefited most (and least) from participation?

Conclusions: Tell whether the intervention solved the problem it was designed to solve. Were the findings
consistent with those of evaluations of similar interventions? What is the bottom line? Did the
intervention succeed? If so, explain the reasons for success. If the intervention did not succeed, explain the
reasons for failure.

Recommendations: Tell other evaluators, researchers, and policy makers about the implications of the
evaluation. Explain what changes to the program or intervention could make it more effective. Describe
the participants who are most likely to be in need of the intervention in the future. Describe ways to
improve future interventions and evaluations.

Reviewing the Report for Quality and Ethics

Evaluators who submit program evaluations to academic journals are increasingly required to complete a reporting checklist designed to ensure the report's comprehensiveness and transparency. Perhaps
the most famous of these checklists is the Consolidated Standards of Reporting Trials (CONSORT). The
CONSORT Statement consists of standards for reporting on randomized controlled trials (RCTs). The
Statement is available in several languages and has been endorsed by prominent medical, clinical, and
psychological journals.
CONSORT consists of both a checklist and a flow diagram. The checklist includes items that an evaluation report should address, while the flow diagram provides readers with a clear picture of the progress of all
participants in the research from the time they are randomized until the end of their involvement. The intent
is to make the experimental process more transparent so that consumers of RCT data can more appropriately
assess the evaluation’s validity and relevance.
Developed in 2001, the original CONSORT Statement is based on the standard two-group parallel
design. Today, the CONSORT Statement is continually revised to reflect new thinking, and recent checklists reflect different program/study designs (such as the cluster RCT).
Figure 9.4 contains a portion of the CONSORT Statement.
The Statement, which is a checklist, contains relatively general items like “eligibility criteria for
participants.” However, each item is defined in detail on the CONSORT website (http://www.consort-
statement.org/) and also in the many journals that contain the checklist, all of which have links on the
website. In addition to detailed definitions, the Statement’s developers give numerous examples of adequate
adherence to each checklist item.
Not all evaluations are RCTs, and the TREND (Transparent Reporting of Evaluations with
Nonrandomized Designs) Statement of the American Public Health Association and Centers for Disease
Control was designed for these studies. Figure 9.5 contains the portion of the TREND Statement that is
concerned with the factors evaluators must include when reporting on baseline differences.
Although the CONSORT and TREND statements were developed by health researchers, other fields like
psychology and substance abuse now require that evaluators demonstrate that their reports account for each
item on the appropriate checklist.

Oral Presentations

An oral presentation consists of an account of some or all of the evaluation’s objectives, methods, and
findings. Most commonly, oral evaluation reports take the form of slideshow presentations. If you need to
prepare an oral presentation, regardless of its duration, the following recommendations can be helpful.

Figure 9.5 Brief Portion of the Transparent Reporting of Evaluations With Nonrandomized Designs
(TREND) Statement: Descriptions of Information on Baseline Data and Equivalence of
Groups That Should Be Reported

Source: Des Jarlais, D. C., Lyles, C., Crepaz, N., & the TREND Group. (2004). Improving the reporting quality of nonrandomized evaluations of behavioral and public health interventions: The TREND statement. American Journal of Public Health, 94, 361–366.

Recommendations for Slide Presentations

1. Do the talking and explaining and let the audience listen. Use slides to focus the audience’s attention on the
key points of your talk. Do not require audience members to read and listen at the same time.

Poor:

Reliability

A reliable measure is one that is relatively free from measurement error. Because of this error, individuals’
obtained scores are different from their true scores. In some cases, the error results from the measure itself:
It may be difficult to understand or poorly administered.

Better:

Reliable Measures

• Reliable measures are relatively free of error.


• Two causes of error are common:

– Measure is hard to understand.


– Measure is poorly administered.

The second of the two slides above is better than the first because the listener can more easily keep the
main points in view without being distracted by a lot of reading requirements. If your objective is to have the
audience read something, a handout (and the time to read it) is more appropriate than a slide.

2. Make certain that each slide has a title.

3. During the talk, address the talk’s purposes and the evaluation’s purposes, main methods, main results, conclusions,
and recommendations. A typical oral presentation covers the following:

A. Title of the talk and names and affiliations of the evaluators

Children and Prevention (CAP): What the Evaluation Found

Prepared by

Jane Austen, PhD

Louis Pasteur, MD

Michael Jackson, RN

The Center for Program Evaluation in the Health Professions

B. What the evaluation is about

Goal of the Evaluation of CAP

• To appraise impact
• To determine costs
• To estimate benefits

C. A description of the purpose of the report

Purpose

• Describe and compare children in CAP with other children in the following areas:
• Knowledge of selected health promotion activities
• Health status

D. A description of the program

The Children and Prevention (CAP) Program

• Goal is to improve children’s health prevention knowledge and behavior


• Duration of program is 3 years
• Cost of program is $3 million
• Program is sponsored by the Education Trustees, a nonprofit health promotion group

E. A description of the participants

Who Was in CAP?

• 500 children between 4 and 7 years of age


• Six public schools: three in the experimental group and three in the control group
• Random assignment of schools to groups
• Control schools: No special health promotion activities (“Usual care”)

F. A description and explanation of the main outcome measures (The description and explanation can
include information on reliability and validity and samples of the content of the measures.)

How Was Information Collected?

• Tests of students’ knowledge


• Interviews with students
• Interviews with parents
• Review of attendance records
• Medical records review

G. An accounting of the main results, as in the following hypothetical table

Knowledge: How the CAPs and the Control Compare

HIGHER SCORES ARE BETTER

STATISTICAL SIGNIFICANCE: *p = .03; **p = .001

Notice that no decimals are used in the table above; numbers should be rounded to the nearest whole
number.

H. Conclusions

• Younger children benefit more than older children

I. Alternative explanations, limitations, problems

Do the Results Fit?

• Few valid evaluations of the effects of preventive programs on young children exist
• We could not find other programs on prevention during period of evaluation

– Checked content of health education classes


– Checked movie and television listings

J. Recommendations

• Adopt CAP for 4- and 5-year-olds.


• Revise the program and evaluate again for 6- and 7-year-olds.

4. Keep tables and figures simple. Explain the meaning of each table and figure, the title, the column and row
headings, and the statistics. For the table above headed “Knowledge: How the CAPs and the Control
Compare,” you could say something like this:

The next slide compares the knowledge of children in CAP with those in the controls. We used the CAP
Test, in which higher scores are better, and the highest possible score is 50 points.
As you can see [if possible, point to the appropriate place on the screen], children who are 4 and 5
years of age did significantly better in CAP. We found no differences in children who were aged 6 and 7.

5. Check all slides carefully for typographical errors.

6. Avoid the use of abbreviations and acronyms unless you are certain that your audience members know what they
mean. In these examples, the acronym CAP was explained in the first slide. If necessary, you should explain
and define each abbreviation and acronym.

7. Outline or write out what you plan to say.

8. Rehearse the presentation before you create the final copies of the slides. Then rehearse again. The purpose of the
first rehearsal is to make sure that the talk is logical, that the spelling on all slides is correct, and that the
arrangement of words, figures, and tables is meaningful. The second rehearsal is to make sure that you haven’t
introduced new errors.

9. Ensure that the slides are easy to see. Horizontal placement is better than vertical. All potential audience
members should be able to see the slides. In advance of the talk, check the room, the seating plan, and the
place where you will stand.

10. Use humor and rhetorical questions to engage listeners. Typical rhetorical questions are given in three of the
slides above: Who was in CAP? How was information collected? Do the results fit?

11. Allow the audience 1 to 2 minutes to view each slide.

12. Be consistent in using complete sentences or sentence fragments or parts of speech within a given slide.

Poor:

Implementation Evaluation

• Teachers were given a one-week in-service course


• Random observation of their classes
• There were two observers

• 10% of all classes
• Kappa = .80, which is high

Better:

Implementation Evaluation

• Teachers participated in an in-service course for one week.


• Two researchers monitored 10% of the classes at random for adherence to the protocol.
• Agreement between observers was excellent (kappa = .80).

The first of these two slides mixes sentence fragments (“random observation of classes”) and full sentences,
whereas the second uses only full sentences.

13. If you have downtime with no appropriate slides, use fillers. These are often opaque, blank slides or slides
showing only the title of the presentation. Consider using cartoons as fillers, but be careful with their use;
they can be distracting in the middle of a talk.

14. Use handouts to summarize information and provide technical details and references. Make sure that your
name, the name of the presentation, and the date are on each page of every handout. Do not distribute
handouts until you are finished speaking unless you plan to refer to them during your presentation.

15. When you create your slides, use both uppercase and lowercase letters in the text; in general, this is easier to read
than text that is all uppercase.

16. Be certain to address in your talk all information that is displayed on the screen.

17. Be careful not to overwhelm your audience with animation, graphics, and sound. Also, do not assume that you
can routinely use graphics or other materials downloaded from the Internet. A great deal of the information
accessible there is protected by copyright.

18. Use no more than four colors per slide. If you are unsure about which colors to use, consider using the slide
presentation colors that are preselected by the software you use to create the slides. Yellow or white letters on
royal blue are easy to see and read.

Posters

A poster is a summary of the evaluation that is designed to be read and understood without oral explanation.
Using a relatively large printout, you should rely on short text sections and give the results as bullet points.
Most evaluation posters also incorporate graphic elements to help illustrate key points. Remember: A poster is
not a thesis or journal article, so don't try to cram all the details onto it. A casual viewer should be able to get the message in 3 to 5 minutes and read all the text in no more than 10 minutes.
The poster is a report, so include the introduction and study objectives, methods (including research design,
programs, sampling, and data analysis), results, and conclusions. You may also want to include an abstract,
acknowledgments, and references.
A poster should have a main title that’s readable from 25 feet away. People will be wandering through the
poster session, so you need to catch their eye from a distance. A general rule is to use 72-point type in a common font, such as Times New Roman or Arial, for your poster title and to use a smaller size of the same
font for the section titles. Use a simple color scheme. Don’t distract people by using too many different
colors, fonts, and font sizes.
Think about how you plan to print your poster before you design it. Because not every printing option
offers the same paper dimensions and because larger poster sizes generally cost more to print, first choose the
paper size for printing, and then design your poster accordingly. Also, check with your printing vendor to find
out whether you should be aware of any specific limitations or guidelines.
Microsoft PowerPoint is a relatively easy-to-use software tool for creating posters. Adobe Illustrator (a vector graphics program) has more features and can provide very professional results for posters, including lots
of high-resolution images, but it is more complex and expensive. With a little searching, you can find free
online templates for setting up your poster. For instance, several online sites have collections of free
PowerPoint (.ppt and .pptx native formats) research poster templates. You download the appropriate
PowerPoint poster template, add text, images, and graphics and send it back to the company for printing.
Check out The University of North Carolina’s website for many excellent tips on poster presentations
(http://gradschool.unc.edu/academics/resources/postertips.html).

Ethical Evaluations

Evaluations That Need Institutional Review or Ethics Board Approval

If you intend to conduct an evaluation for a public or private nonprofit or for-profit organization or
institution that receives U.S. government support (even if you are a student), then you likely require approval
from an institutional review board, or IRB, before you can begin. The IRB is in charge of reviewing the
design of your evaluation study to guarantee that it is structured in a way that protects each participant’s
privacy and other rights. When you receive IRB approval, you can proceed with the evaluation. If the IRB
does not approve, you are not allowed to collect data. Although various state and local institutional review
boards may differ in their specific requirements, nearly all major social, health, and welfare agencies within
and outside of the United States have standards in place for the protection of human research subjects.
An IRB is an administrative body whose purpose is to protect the rights and welfare of individuals who are
recruited to participate in research activities. According to the U.S. government, all IRB activities related to
human research subjects should be guided by the ethical principles published in The Belmont Report: Ethical
Principles and Guidelines for the Protection of Human Subjects of Research (see the website for the U.S.
Government Office of Human Subject Protections at http://www.ohrp.osophs.dhhs.gov).

This 1979 report, which was prepared by the National Commission for the Protection of Human Subjects
of Biomedical and Behavioral Research, is the foundation for ethical standards in all research involving
human subjects, including evaluations. Three major principles originate from The Belmont Report:

Respect for persons: This principle incorporates at least two ethical convictions: first, that individuals should
be treated as autonomous agents; and second, that persons with diminished autonomy (for example, very
young children or people with dementing illnesses) are entitled to protection.

Beneficence: This principle holds that researchers must treat research participants in an ethical manner, not
only by respecting their decisions and protecting them from harm, but also by actively making efforts to
secure their well-being.

Justice: This principle concerns the balance for research participants between receiving the benefits of
research and bearing its burdens. For example, to ensure justice, researchers need to examine their selection
of research subjects in order to determine whether they are systematically choosing persons from some
classes (e.g., welfare recipients, persons in institutions) simply because of easy availability to those
individuals rather than for reasons directly related to the problems being studied.

U.S. government policy mandates that an IRB must have at least five members and that the members must
have varied backgrounds. When selecting IRB members, an institution must take into account the racial and
cultural heritage of potential members and must be sensitive to community attitudes. In addition to
possessing the professional competence necessary to review specific research activities, IRB members must be
able to ascertain the acceptability of proposed research in terms of institutional commitments and regulations,
applicable law, and standards of professional conduct and practice.
U.S. government policy also requires that if an IRB regularly reviews research that involves subjects within
vulnerable categories (such as children, prisoners, pregnant women, or handicapped or mentally disabled
persons), the institution must consider including among the IRB’s members one or more individuals who are
knowledgeable about and experienced in working with such subjects. Also, the institution must make every
effort to ensure that its IRB consists of a mix of male and female members.

Evaluations That Are Exempt From IRB Approval

A program evaluation is considered to be research (and thus needs IRB approval) when the evaluator
intends to create generalizable knowledge that will be shared outside of the program being evaluated, whether
in professional presentations, reports, or published articles. Process and implementation evaluations are less
likely to require IRB approval because the data gathered in such evaluations are typically used only to assess
progress and to identify areas for improvement within programs. Example 9.6 describes such an evaluation.

Example 9.6 An Evaluation for Which IRB Approval Is Not Required

A community health center wants to improve its rate of vaccination to prevent pneumonia in the elderly. A
system is set up to remind physicians by e-mail which of their patients are due for the vaccination. The health center conducts an evaluation of the effectiveness of the reminder system, collecting data each year for 2 years.
Information from the evaluation is not shared outside the health center.

The evaluation in Example 9.6 is not considered to be research because the findings are used only by the
health center. Evaluations of this type (which focus on the progress of a program—here the reminder system)
are called quality improvement evaluations because their purpose is to improve deficiencies in health care
quality. In the community center’s case, the deficiency is the poor rate of delivery of a vaccination that is
recommended for the elderly by many national and international agencies.

What the IRB Will Review

When determining whether an evaluation can be implemented, an IRB considers all of the elements listed
below.

Criteria Used by the Institutional Review Board

Study design: Many experts agree that the IRB should approve only research that is both valid and of
value. Poorly designed studies necessarily produce biased results. Study design includes how subjects or
participants are recruited, selected, and assigned to groups; the reliability and validity of measures or
instruments; and the method of data analysis.
Risks and benefits: The IRB evaluates whether the risks to participants are reasonable in relation to the
anticipated benefits, if any, to the participants, as well as the importance of the knowledge reasonably
expected to result from the evaluation research.
Equitable selection of participants: The IRB usually considers the purpose of the research and the
setting of the evaluation and closely examines any proposed study involving vulnerable subject
populations, such as children, prisoners, persons with cognitive disorders, or economically or
educationally disadvantaged persons.
Identification of participants and confidentiality: The IRB reviews the researcher’s planned methods
for prospectively identifying and contacting potential participants for ensuring the participants’ privacy
and confidentiality.
Qualifications: The IRB examines the qualifications of the evaluator and the evaluation team. In
addition, the IRB considers the adequacy of the facilities and equipment to be used in conducting the
research and maintaining the rights and welfare of the participants.
Informed consent: The process of obtaining participants’ informed consent to be included in the
evaluation study goes to the heart of the matter of ethical research. The IRB often focuses a great deal
of attention on the issue of informed consent.

If your evaluation is considered research, the IRB will ask to review the following documents:

Advance letters: Are you sending letters or e-mails from physicians or members of the community
supporting the evaluation and recommending participation? If so, the IRB will need to review the contents.

Posters or flyers: Are you planning to post notices inviting participation? Are you going to prepare flyers to
hand out at schools, clinics, or health fairs? If so, the IRB will need to review the contents.

Screening scripts: Will you be screening potential participants to determine if they meet the criteria for
inclusion in the evaluation? Will you be asking people to give their ages or describe their personal
behaviors? Regardless of how you approach people (in person, by e-mail, or on the phone), the IRB will
need to review the details of your planned approach, including any scripts used for screening.

Informed consent forms: Once you have determined that a potential participant is eligible, willing, and able to
participate, you must give him or her the opportunity to decide freely whether or not to participate. To do
this, you must provide the potential subject with complete information about the research through an
informed consent form. The IRB will need to review this form.

Informed Consent

Purpose

When participants give their informed consent to participate in a program evaluation, this means that they
are knowledgeable about the risks and benefits of participation and the activities entailed in that participation.
They also agree to the terms of participation and are knowledgeable about their rights as research
participants.
Participants are usually required to acknowledge their informed consent to participate in writing. If you
cannot obtain a participant’s signature on an informed consent form (for instance, if your only contact with
the participant is over the telephone), then you must be able to provide evidence that the contents of the form
were explained to the participant and that he or she understood them. Informed consent forms are designed
to protect all parties: the subject, the evaluator, and the institution. Therefore, it is important that such forms
present information in a well-organized and easily understood format.
If your evaluation includes children, you may have to design separate informed consent forms for their
parents and assent (verbal) forms for the children. If your evaluation does not include children, the IRB may
ask you to justify their exclusion.

Contents of an Informed Consent Form to Participate in an Evaluation

The required contents of informed consent forms vary from institution to institution, but the integral
principles do not vary. Program evaluators should include all of the elements shown in Example 9.7 in their
informed consent forms.

Example 9.7 An Informed Consent Form for Participants in a Study of Alcohol Use and Health

Dear Participant,

The Medical Foundation Clinic is taking part in an important research project conducted by university
professors and sponsored by the National Institutes of Health. You have been asked to participate in this
project because you told us over the telephone that you are at least 65 years old, that you are planning to stay
in the area and live in the community during the next 12 months, and that you have had at least one drink
with alcohol in it during the past 3 months.
Your participation in the project is completely voluntary. If you decide not to participate, or if you agree
and then change your mind before the project is finished, the care you receive at the Medical Foundation
Clinic will not be affected in any way.

What is the project about?

Older adults become more sensitive to alcohol, and alcohol also interacts with medications used by older
persons. The aim of the project is to see if educational reports can help you and your doctor understand more
about alcohol use and its influence on health, medications, and physical and mental functioning among
people 65 years of age and older.

What am I being asked to do?

The study has two parts. Today, you are being asked to participate in the first part of the study. For this part
of the study, we would like you to complete a written questionnaire about your alcohol use, health, and use of
health care. Please complete the questionnaire, sign this consent form, and mail both forms back to us in the
accompanying stamped envelope. The information you provide in this questionnaire will be analyzed for
research purposes.
Some of the patients who participate in the first part of the study will be chosen for the second part of the
study as well. If you are chosen for this next stage, we will be contacting you for more information over the
next year.
Beginning from this communication, we will ask you to complete questionnaires and mail them back to us
at intervals of 3 months, 6 months, and 12 months. It will take you about 30 to 40 minutes to complete
today's questionnaire, about 15 to 25 minutes to complete the second questionnaire, about 5 to 10 minutes to
complete the third questionnaire, and about 25 to 30 minutes to complete the fourth questionnaire.
We also ask for your consent to give your personalized report of alcohol-related risks and problems to your
doctor and to consult the administrative records to find out which medical services (if any) you used during
the time period being studied.
If you are asked and agree to participate in this second part of the study, you will be assigned to one of two
groups of patients. The first group of patients will receive a booklet about alcohol use in older adults at the
beginning of the study and will also receive periodic reports of their questionnaire results during the course of
the study. Patients assigned to the first group will be asked to read the reports and the booklet and will be
called by a health educator to answer questions about the materials. The second group of patients will receive
the same materials, but not until the end of the study. Whether you are assigned to the first or the second
group will depend on the random group assignment of your doctor, and not on any decision made by the
researchers or your physician.

Are there any risks or discomforts of participating in this project?

The only risks of participating in the project are the inconvenience of taking the time to answer the questions
and review the report and the possibility that some of the questions on the questionnaire may worry or
embarrass you. You are not required to answer any questions you may feel uncomfortable about answering.
There are no medical procedures involved, and there is no financial cost to you associated with this study in
any way, shape, or form.

What are the possible benefits?

The benefits of the study are the chance to learn about the possible risks of alcohol use and the possibility of
improving the health education of other older adults. However, you personally might not benefit from
participation in the study.

Will I be paid for my participation in the study?

To partially compensate you for your time, we will mail you $5 in cash after you complete the first
questionnaire. If you are chosen for the second part of the study, you will be paid $15 for the second
questionnaire, $10 for the third questionnaire, and $20 for the fourth questionnaire, for a total of $50 if you
complete all of the questionnaires.

How will my privacy be protected?

All the information you report is highly confidential and will be used only for study purposes. The consent
form you sign will be filed in a locked storage cabinet during the project and shredded after completion of the
project. Only the study’s principal investigator, project manager, and the Medical Foundation Clinic data
coordinators will have access to these files. All reports of information gained from your answers to the
questionnaire will be shown as summaries of all participants’ answers. No individual results will ever be used
by the project team. All members of the team will be educated about the importance of protecting the privacy
of patients and physicians.

Whom do I contact if I have any further questions?

You may call Dr. Lida Swan, the principal investigator, at 31X-794-ABCD, or e-mail her at
[email protected]. Dr. Swan is Professor of Medicine and Public Health at the University of Eastwich. You
may also contact Dr. Samuel Wygand, Professor of Medicine and Public Health, at 31X-459-DEFG, or Dr.
John Tsai, Professor of Medicine, at 31X-459-UVWX. The project office is located in the Department of
Medicine, Division of General Medicine and Evaluation, Box ABC, Eastwich, CA 90000.

What are my rights as a research subject?

You may withdraw your consent at any time and discontinue participation without penalty. You are not
waiving any legal claims, rights, or remedies because of your participation in this research study. If you have
questions regarding your rights as a research subject, contact the Office for Protection of Research Subjects, 832222 Humanities and Science Building, Box DEF, Eastwich, CA 90000; 31X-825-XYZA.

What if I agree to be in the study and then change my mind?

You can leave the project at any time. Again, the care you receive at the Medical Foundation Clinic will not
be affected in any way if you decide to withdraw from the study.

How do I sign up?

If you want to join the project, please sign below on the line marked “Signature of Research Subject,” write
down the date next to your signature, and mail this form together with your completed questionnaire in the
prepaid envelope provided. We will mail you a copy of the signed consent form.
Thank you very much for your interest in this research study!

SIGNATURE OF RESEARCH SUBJECT

Notice that the participant is given a copy of the consent form. In mail and online surveys, a completed and
returned questionnaire may be considered evidence of informed consent, but this depends on the rules
established by the IRB and the purpose of the survey.

The Internet and Ethical Evaluations

Evaluators are increasingly using the Internet to collect data from participants. Online data collection involves
a web of computers that interact with one another. Communications take place between the evaluator and the
participant, the participant and the web server, and the web server and the evaluator. Security breaches are
possible anywhere within the web unless you put protections in place.

Communication Between the Evaluator and the Participant

It is not uncommon for an evaluator to contact participants by e-mail, text, or social media. The message
will discuss the evaluation and invite the participant to click on a URL or paste it into a browser, such as
Mozilla Firefox, Google Chrome, Safari, or Internet Explorer. Unfortunately, e-mail and other contact forms
are not always secure or private. Many people are unaware of whether their computers or phones are secure or
even how to secure them. E-mail programs maintained by employers often are not private. If people do not

230
log off or are careless about passwords, their privacy can be compromised easily. Also, inadequate passwords
are easy to crack. If you require people to use a password, you must ensure that the password setup is secure.
Social media have their own privacy guidelines, but as is well-known, they are sometimes difficult to interpret
and change.

Communication Between the Participant and the Website

When a participant enters sensitive data in the blank spaces of a commonly used data collection measure
like a web-based questionnaire, it is similar to a shopper providing a credit card number when shopping
online. Reputable online merchants use a Secure Socket Layer (SSL) protocol that allows secure
communications across the Internet. An SSL protocol encrypts (converts into code) the user’s input, and it
“decrypts” it when it arrives at the website. Many potential participants are becoming aware of how easily
their online responses can be intercepted unless they are secured, and without guarantees that responses are
encrypted, some of them may refuse to participate. You must decide in advance whether to use SSL and how
to explain your security choices to participants.

Communication Between the Website and the Evaluator

Sensitive identifiable data need to be protected in transit by using either an SSL protocol or a secure file
transfer protocol.
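As a concrete illustration of protecting data in transit, the sketch below shows one way an evaluator's script might retrieve collected responses from the survey web server over an SSL/TLS-protected (HTTPS) connection using Python's third-party requests library. The URL and token are hypothetical placeholders for whatever interface and credentials your own data collection platform provides.

# A minimal sketch: downloading responses over HTTPS so the data are
# encrypted in transit. The endpoint and token below are hypothetical.
import requests

API_URL = "https://survey.example.org/api/responses"   # hypothetical endpoint
API_TOKEN = "replace-with-a-real-token"                 # hypothetical credential

reply = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,              # fail rather than hang if the server is unreachable
)
reply.raise_for_status()     # stop if the server refused or rejected the request

# requests verifies the server's SSL certificate by default (verify=True),
# so the connection is both encrypted and authenticated.
with open("responses.json", "wb") as f:
    f.write(reply.content)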

Data Protection

Some people are reluctant to complete online data collection measures or even connect to them for fear
that their privacy will be compromised. All databases storing sensitive and identifiable information must be
protected, regardless of whether they are created and maintained by commercial firms or by individuals.
Encrypting the databases probably provides the most security.
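As one illustration of encrypting data at rest, the sketch below uses the symmetric Fernet recipe from Python's third-party cryptography package to encrypt an identifiable record before it is written to storage. This is a simplified example, not a complete security plan; deciding where the key itself is kept and who may read it matters as much as the encryption step.

# A minimal sketch: encrypting an identifiable record before storage and
# decrypting it when an authorized analyst needs it.
from cryptography.fernet import Fernet

key = Fernet.generate_key()    # in practice, store the key separately from
                               # the data and restrict who can read it
cipher = Fernet(key)

record = b'{"name": "Jane Doe", "score": 42}'   # hypothetical identifiable data
encrypted = cipher.encrypt(record)              # safe to write to the database

with open("responses.enc", "wb") as f:
    f.write(encrypted)

# Later, someone holding the key can recover the original record.
with open("responses.enc", "rb") as f:
    restored = cipher.decrypt(f.read())
assert restored == record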
All reputable organizations develop or adapt rules for reassuring participants that privacy will be respected.
Here is a minimum set of rules for a privacy policy for Internet-based data collection:

Minimum Criteria for a Privacy Policy Using the Web

1. Describe exactly which evaluation data will be stored in the evaluation’s database

2. Explain why the data are being stored

3. Explain whether the organization gives, sells, or transfers information, and if it does, to whom and
under what circumstances

4. Explain how the site monitors unauthorized attempts to change the site’s contents

5. Discuss who maintains the site

6. If relevant, explain how cookies are used. Cookies are small amounts of information your browser
stores. Cookies allow web-based applications to store information about selected items, user preferences, registration information, and other information that can be retrieved later. Are the cookies
session-specific? If not, can users opt out of the web page feature that stores the cookies beyond the
session?
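To make the session-specific distinction concrete, the sketch below shows how a web application built with Python's Flask framework might set a session-only cookie versus a persistent one; the cookie names are hypothetical. A cookie set without an expiration is discarded when the browser closes, whereas one given a max_age remains on the participant's computer.

# A minimal sketch (Flask): session versus persistent cookies.
from flask import Flask, make_response

app = Flask(__name__)

@app.route("/survey")
def survey_page():
    resp = make_response("Survey page")
    # Session-specific: no expiration, so the browser discards it on close.
    resp.set_cookie("survey_progress", "page-1", secure=True, httponly=True)
    # Persistent: kept for 30 days unless the participant deletes it.
    resp.set_cookie("remember_participant", "yes",
                    max_age=30 * 24 * 60 * 60, secure=True, httponly=True)
    return resp

if __name__ == "__main__":
    app.run()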

Look at the excerpt below, which is taken from a privacy statement. The statement comes from a very large corporation that is conducting a survey as part of an evaluation. As you can see, the company is truthful about the potential for other companies to track customers' activities. However, consumers are left with the obligation to (1) be aware that unwanted cookies may be placed on their hard drive and (2) if they prefer, do something about it by contacting the company's Privacy Officer.

Excerpt from a Very Large Company’s Privacy Statement

Some of our business partners may use cookies on our sites (for example, links to business partners). We do
not want our business partners to use cookies to track our customers’ activities once they leave our sites.
However, we may not have total control [italics are added] over how our business partners may use cookies
on our websites. If you become aware that an [Name of company] business partner is placing an unwanted
cookie on your hard drive, please contact our Privacy Officer to assist us in resolving the problem.

This excerpt raises several questions: Will the participants actually know if cookies are on the hard drive?
How does the participant get in touch with the Privacy Officer? Information is available in the Contact Us
portion of the site, but the participant has to search for it. It makes sense that the public is increasingly
suspicious of online data collection and how their data are used.
You can help avoid some of these problems by being certain you have considered all the pitfalls of sending
surveys and survey information into cyberspace.
If you plan to use the Internet (including e-mail) to (1) communicate with study participants or (2) send
participant information to a collaborator or contractor, you should be able to complete this questionnaire for
maintaining ethically sound online evaluation data collection.

Sample Questionnaire: Maintaining Ethically Sound Online Data Collection

Describe the measures that will be taken to ensure the web server hosting the Internet site is protected. In the
description, provide information on physical security, firewalls, software patches/updates, and penetration
drills.

If a password or other secure authorization method is intended to allow access to the website,

• how will user passwords be distributed?


• how will passwords and web access be terminated?

If the user session is encrypted, describe the method of encryption that will be used.

Explain who will have administrative access to data on the web server. Give names, study roles, and
organizational affiliations.

Explain in detail the administrative safeguards put in place to restrict unauthorized and unnecessary access.

Describe how the information will be used. Will you give, sell, or transfer information to anyone?

Give the name and address for the application owner, that is, the persons or person who maintains the
application.

If e-mail is used to contact participants, describe the measures taken to assure participants that the
communication is from an authorized person.

If participants are asked to contact the evaluators using e-mail, describe how the participants will be
authenticated to adequately ensure the source of the e-mail communication.

Explain how the study consent form describes the potential risks to privacy associated with use of e-mail.

If e-mail is to be used to send study data to investigators, vendors, or others, explain if and how the e-mail
will be encrypted.

If participants are to send you attachments by e-mail, explain whether those attachments will be encrypted
or password protected.

If automated e-mail routing systems are used, describe the security controls that will be in place.
Specifically, describe the testing and disaster recovery procedures.

Explain whether contractors or vendors have access to personally identifiable or confidential information of the survey participants.

• Describe the language that is included in the contract to protect participant privacy.
• Describe the security requirements that will be provided to contractors or vendors who are designing
or hosting web-based services to the evaluation.

Communicate who on the evaluation team is responsible for ensuring that the outside organization’s
policies and procedures for confidentiality and security are followed. Provide the name of the person
responsible and his/her professional position and affiliation.

Communicate who is responsible for the general security administration for the information technology
associated with each online data collection measure. Provide the name of the person responsible and his/her
professional position and affiliation.

Each evaluation has different limits on what data it needs to collect and from whom. Some evaluation participants are more vulnerable than others and need different safeguards. The following is an
informed consent form typical of one that can be used in an online survey of teachers in a large school district.
The survey’s purpose is to identify needs for a program to improve morale in the school workplace.

Example: Consent Form for an Online Survey

Your individual responses to survey questions will be kept confidential by The Survey Project and its survey
contractor, Online Systems, Inc. Confidential data (i.e., individual or school identification) are data that may
not be released outside of The Survey Project, except with permission from the participant. Individuals may
grant The Survey Project permission to release confidential data that relates specifically to them. An
authorized representative of a Survey Project member school may grant The Survey Project permission to
release confidential data that describe his or her school. [Comment: Defines and describes limits of
confidentiality]
Online Systems, Inc. will generate aggregate reports that contain school-wide and departmental
information to help your school identify, prioritize, and implement improvements in the school workplace
that will increase student engagement. Information will not be reported in instances where participant groups
contain less than five individuals. [Comment: It may be possible to identify individual views in very small groups.
This would violate privacy.] Data from open-ended questions will be provided to your school in de-identified,
redacted form. Only de-identified record level data will be retained by The Survey Project, and only de-
identified aggregate analyses will be shared in publications and research presentations with the academic
community. [Comment: How the data will be used?] The Survey Project may release de-identified responses to
individuals who agree to protect the data and who agree to the confidentiality policies of The Survey Project.
Online Systems, Inc. will store data on secure servers and will destroy all identified data within 2 years of
survey administration. By participating, you will be contributing valuable information to your school.
[Comment: Servers will be secured. The vendor must destroy identifiable data within 2 years.] The Survey Project
and Online Systems, Inc. have taken numerous steps to protect participants in the Survey Project. Ethics
Board requirements specify that you are informed that if the information collected were to become public
with individual identification, it could prove personally uncomfortable. [Comment: This is a risk of
participation.]
This survey has been reviewed and approved according to The Survey Project’s policies and procedures. By
continuing, you acknowledge that you have read and understood the above information and agree to
participate in this survey. [Comment: This is an online survey, and the participant is not asked to “sign” to indicate
willingness to participate. Signing software is available, but most evaluators will accept a completed survey as
confirmation of informed consent.] If you have any questions about the survey, contact …. If you have any
questions about your rights as a research participant, contact …. [Comment: Whom to contact with questions?].
Some large institutions and companies have ethics boards and privacy officers who can help ensure that you
conduct an ethical evaluation. Many evaluators, however, are not technically sophisticated regarding privacy,
nor are they trained in online data collection ethics. You can learn more about ethical survey research by
going online to the National Institutes of Health's guidelines on ethical research (http://grants.nih.gov/grants/policy/hs/ethical_guidelines.htm) and the Collaborative Institutional Training
Initiative, which provides training in ethical research with human subjects
(https://www.citiprogram.org/Default.asp). You can also consult the American Evaluation Association’s
guiding principles for evaluators at their website: http://www.eval.org/publications/guidingprinciples.asp.

Research Misconduct

Research misconduct is becoming an increasingly important concern throughout the world. Documenting
such misconduct (if it occurs) is currently an issue for evaluators who work in academic institutions and in
organizations that receive funding from the U.S. government. Table 9.1 lists and defines some of the
problematic research behaviors that may occur in many situations in which evaluations are conducted.
Faking the data is a clear example of research misconduct. Subtler examples include the following:

• Exaggerating an evaluation’s findings to support the point of view of the evaluator, funder, or
community
• Changing the evaluation protocol or method of implementation prior to informing the IRB or ethics
board
• Failing to maintain adequate documentation of the evaluation’s methods (such as by preparing a
codebook or operations manual)
• Releasing participant information without permission
• Lacking sufficient resources to complete the evaluation as promised
• Exhibiting financial or other interests in common with the funders or supporters of the evaluation
(conflict of interest)

Table 9.1 Problematic Research Behaviors

Exercises

Exercise 1

Directions

Review the following slide prepared as a visual aid for an oral evaluation presentation and, if necessary,
improve it.

A stratified random sample is one in which the population is divided into subgroups or “strata,” and a random
sample is then selected from each group. For example, in a program to teach women about options for
treatment for breast cancer, the evaluator can sample from a number of subgroups, including women of
differing ages (under 19 years, 20 to 30, 31 to 35, over 35) and income (high, medium, low).

Exercise 2

Directions

The following table compares boys’ and girls’ sleeping patterns. Following the table is an explanation of the
data in the table. The explanation contains two errors. What are they?

Abbreviation: BMI = body mass index.

a One-way analysis of variance

b χ2-Test

Explanation of table:
The table presents unadjusted data on sleep duration. Boys slept, on average, significantly more than girls
(P<0.05). Socioeconomic level was inversely associated with sleep duration, and so was birth weight (P<0.02).
Significant associations were detected for maternal prepregnancy BMI or parity, for gestational age, maternal
smoking, and alcohol intake during pregnancy (data not shown). χ2-Tests confirmed a significantly increased
proportion of adolescents with longer sleep duration in girls, in adolescents from poorer backgrounds or of
lower level of maternal education, and with lower birth weight. Obesity was associated with maternal social status, with prevalence values from richest to poorest social status groups occurring at 17.1%, 16.7%, 13.6%,
9.0%, and 2.9% respectively.

Exercise 3

Directions

The following is an informed consent form for a hypothetical diabetes self-care program. Parts of the form
are completed, but others are not. Using the descriptions provided, complete the form by writing in the
needed content.

Informed Consent Form for the Diabetes Self-Care Program

You are asked to take part in three telephone interviews and three self-administered questionnaires on your
general health, your quality of life since being diagnosed with diabetes, and the quality of health care you have
received while in the program. Robert Fung, MD, MPH, is directing the project. Dr. Fung works in the
Department of Medicine at the University of East Hampton. You are being asked to take part in the
interviews and questionnaires because you are enrolled in the Diabetes Self-Care Program.

Disclosure Statement

Your health care provider may be an investigator in this study protocol. As an investigator, he/she is
interested in both your clinical welfare and your responses to the interview questions. Before entering this
study or at any time during the study, you may ask for a second opinion about your care from another doctor
who is in no way associated with the Diabetes Self-Care Program. You are not under any obligation to take
part in any research project offered by your physician.

Reason for the Telephone Interviews and Self-Administered Questionnaires

The interviews and the questionnaires are being done for the following reason: to find out if the Diabetes
Self-Care Program is meeting the needs of the patients enrolled in the program.
During the telephone interview, a trained member of the Diabetes Self-Care Program staff will ask you a
series of questions about:

• your health,
• your quality of life since being diagnosed with diabetes, and
• the quality of the health care you have received while in the Diabetes Self-Care Program.

The self-administered survey questions will cover the same topics, but you will be able to answer them on
your own and in any place that is convenient for you.

What You Will Be Asked to Do

If you agree to take part in this study, you will be asked to do the following things:

1. Answer three short (20-minute) telephone interviews. The telephone interviewer will ask you general
questions about your health, your quality of life since you were diagnosed with diabetes, and the quality
of health care you have received while in the Diabetes Self-Care Program. You will be called to
complete an interview when you first enroll in the program, 6 months after your enrollment, and when
you leave the program. The interviews will be completed at whatever times are most convenient for you.

Sample questions:

How confident are you in your ability to know what questions to ask a doctor?

During the PAST 4 WEEKS, how much did diabetes interfere with your normal work (including
both work outside the home and housework)?

Would you say not at all, a little bit, moderately, quite a bit, or extremely?

2. Answer three short (20-minute) self-administered questionnaires. The self-administered questionnaires will ask you general questions about your health, your quality of life since you were diagnosed with
diabetes, and the quality of health care you have received while in the Diabetes Self-Care Program. The
self-administered questionnaires will be mailed to you when you first enroll in the program, 6 months
after your enrollment, and when you leave the program. You can complete the self-administered
questionnaires at whatever times are best for you. You will be provided with a prepaid envelope to
return each questionnaire.

Sample questions:

How many drinks of alcohol did you usually drink each day during the past four weeks? (None,
one, two, three, or four or more)

On a scale of 1 (very confident) to 6 (not at all confident), how confident are you in your ability to
know what questions to ask a doctor?

3. If you do not understand a question or have a problem with a self-administered questionnaire, you will
be asked to call Ms. Estella Ruiz at the Diabetes Self-Care Program office at 1–800–000–0000. She will
be able to assist you.

Possible Risks and Discomforts

Potential Benefits to Subjects and/or to Society

The purpose of the telephone interviews and self-administered questionnaires is to improve the services
that the Diabetes Self-Care Program provides to the patients enrolled in the program. Your responses might
lead to changes in the program that would improve the services provided by the Diabetes Self-Care Program.

Payment for Taking Part

Confidentiality

Any information that is collected from you and that can be identified with you will remain confidential.
Your identity will not be revealed to anyone outside the research team unless we have your permission or as
required by law. You will not be identified in any reports or presentations. Confidentiality will be maintained
in the following ways:

1. All of your interviews and questionnaires will be coded with a number that identifies you. Your name
will not be on any of these materials.

2. A master list of names and code numbers will be kept in a completely separate, confidential, password-
protected computer database.

3. All copies of the self-administered questionnaires will be kept in a locked file cabinet in a locked
research office.

4. All telephone interviews will be recorded in a confidential computer database.

5. When analysis of the data is conducted, your name will not be associated with your data in any way.

6. Only research staff will have access to these files.

Taking Part and Choosing Not to Take Part in Telephone Interviews and Self-Administered Questionnaires

Identification of Investigators

If you have concerns or questions about this study, please contact Robert Fung, MD, MPH, by mailing inquiries to Box 000, Los Angeles, CA 90000–9990. Dr. Fung can also be reached by telephone at 1-800-
000-0000 or by e-mail at [email protected].

Rights of Participants

You may choose to end your agreement to take part in the telephone interviews and self-administered
questionnaires at any time. You may stop taking part without penalty. You are not giving up any legal claims,
rights, or remedies because you take part in the telephone interviews and self-administered questionnaires. If
you have questions about your rights as a research subject, contact the Office for Protection of Research
Subjects, 2107 QQQ Building, Box 951694, East Hampton, CA 90273, 1-800-123-XYZZ.
I understand the events described above. My questions have been answered to my satisfaction, and I agree to take part in this study. I have been given a copy of this consent form.

Suggested Websites

For reporting checklists for a wide variety of studies and study designs, including links to CONSORT and TREND, go to:

http://www.equator-network.org

To learn more about ethics in evaluations and to find information on ethical considerations in the evaluation profession, go to:

https://www.citiprogram.org/Default.asp

http://grants.nih.gov/grants/policy/hs/ethical_guidelines.htm

http://www.nigms.nih.gov/Research/Evaluation/standards_ethics.htm

http://www.jcsee.org/program-evaluation-standards

Answers to Exercises

Chapter 1

Exercise 1

Effectiveness of Home Visitation by Nurses

Evaluation Question

Did the program achieve its objective of preventing recurrence of child abuse and neglect?

Evidence

• At 3-year follow-up, significantly less recurrence of physical abuse in the intervention group than in the control group
• At 3-year follow-up, significantly less recurrence of abuse or neglect in the intervention group than in the control group

Data Collection Measures


Standardized review of child protection records; hospital records.

Evaluating a Mental Health Intervention for Schoolchildren Exposed to Violence: A Randomized Controlled
Trial

Evaluation Question

Did the program achieve its objectives of reducing symptoms of PTSD and depression, psychosocial dysfunction, and teacher-perceived behavior problems?

Evidence

• At 6-month follow-up, after both groups receive the intervention, no difference between the experimental and control groups.

Data Collection Measures

The Child PTSD Symptom Scale, the Child Depression Inventory, Pediatric Symptom Checklist, and the
Teacher-Child Rating Scale.

Exercise 2

Program evaluation is an unbiased exploration of a program’s merits, including its effectiveness, quality, and
value. An effective program provides substantial benefits to individuals, communities, and societies, and these benefits are greater than their human and financial costs. A high-quality program meets its users’ needs and is
based on sound theory and the best available research evidence. A program’s value is measured by its worth to
individuals, the community, and society.

Exercise 3

a. Yes. This is an evaluation study. The program is an intervention to prevent HIV-related high-risk sexual behaviors among Latina women in urban areas.

b. Yes. This is an evaluation study. The intervention is a spit tobacco intervention.

c. No. This is not an evaluation study. The researchers are not analyzing the effectiveness, quality, or value
of a program or intervention.

Chapter 2

Exercise 1
Evaluation question: Did the program (brief intervention) achieve its objective of reducing gambling?

Evidence: A reduction in gambling at week 6 and month 9

Independent variable: program participation (assessment-only control; 10 minutes of brief advice; one session of motivational enhancement therapy (MET); or one session of MET plus three sessions of cognitive–behavioral therapy (CBT))

Dependent variable: reduction in gambling as measured by the Addiction Severity Index (ASI-G)
module, which also assesses days and dollars wagered

Exercise 2
Evaluation Question: Did the revised curriculum achieve beneficial outcomes? The outcomes of interest include use of tobacco, alcohol, and marijuana, and school attendance.

Evidence: Implied: decreased use of tobacco, alcohol, and marijuana; improvements in attendance

Independent variable: Program participation (2006–2007 5th graders = comparison group; 2007–2008 5th graders = revised curriculum group)

Dependent variables: substance use; attendance

Chapter 3

Exercise 1

1. Experimental: randomized controlled trial or wait-list control

2. Hypothesis 1. When compared to usual practice (delayed intervention), the Cognitive Behavioral Intervention for Trauma in Schools improves symptoms of PTSD and depression, parent-reported psychosocial dysfunction, and teacher-reported classroom problems between baseline (before the intervention) and three months later.

Hypothesis 2. There will be no difference in symptoms, depression, dysfunction, and problems between
students in the experimental and comparison programs three months after control students receive the
experimental intervention.

3. Improvement in symptoms, depression, dysfunction, and problems for both the experimental and
comparison group over a three-month period.

4. Sixth-grade students at two large middle schools in Los Angeles who reported exposure to violence and
had clinical levels of symptoms of PTSD were eligible for participation in the evaluation.

5. No information is given on assignment.

6. Measures are administered twice: before the intervention and three months later.

Exercise 2

Answers are highlighted and in parentheses except for commentary on this author’s note concerning the
second study.

The Role of Alcohol in Boating Deaths


Although many potentially confounding variables were taken into account, we were unable to adjust for
other variables that might affect risk, such as the boater’s swimming ability, the operator’s boating skills and
experience, use of personal flotation devices, water and weather conditions, and the condition and
seaworthiness of the boat. Use of personal flotation devices was low among control subjects (about 6.7% of
adults in control boats), but because such use was assessed only at the boat level and not for individuals, it was
impossible to include it in our analyses (Selection resulting in potentially nonequivalent groups)…. Finally,
although we controlled for boating exposure with the random selection of control subjects, some groups may
have been underrepresented (Selection).

Violence Prevention in the Emergency Department


The study design would not facilitate a blinding process (Expectancy) that may provide more reliable
results…. The study was limited by those youth who were excluded, lost to follow-up, or had incomplete
documents (Selection; Attrition). Unfortunately, the study population has significant mobility and was
commonly unavailable when the case managers attempted to interview them (Attrition). The study was
limited by the turnover of case managers (Attrition; Instrumentation).
Note: The first statement regarding the length of time needed for the program’s effects to be observed is
not a limitation in the study’s research design, but in the evaluators’ rush to try out a program with
insufficient evidence. Limitations like this can be avoided in practice by relying only on programs that are
known to work and that define the circumstances in which they work best.
Regarding the second limitation concerning the evaluation tool, the problem here is that the evaluators
used a measure of unknown validity. This is a serious (and not uncommon) problem. Invalid measures do
harm to any study no matter how carefully it is designed. That is, a brilliantly designed RCT with an invalid test or other measure will produce inaccurate results.

Exercise 3
RCTs with parallel and wait-list controls can guard against most biases. They produce the most internally
and externally valid results. In their “purest” forms, they may be somewhat complex to implement. For
example, it is often difficult, if not impossible, to “blind” evaluation participants. Quasi-experimental designs are often more realistic for clinical and other real-world settings, such as schools and field settings. However, preexisting differences between groups may interfere with the results, so you cannot be certain whether those differences or the programs are responsible for the outcomes.
Observational designs are convenient because the evaluator does not have to develop and implement a
research protocol—complicated activities, to say the least. At the same time, the evaluator may have little
control over data collection or the assignment of participants.

Chapter 4

Exercise 1
Situation 1: C

Situation 2: B

Situation 3: A

Chapter 5

Exercise 1

a. Self-administered questionnaires

The Suicidal Ideation Questionnaire—Junior is a 15-item self-report questionnaire used to assess the
frequency of a wide range of suicidal thoughts.
The Spectrum of Suicide Behavior Scale is a 5-point rating of the history of suicidality (none, ideation,
intent/threat, mild attempt, serious attempt).
Measures of internalizing symptoms include the Youth Self-Report internalizing scale and the Reynolds
Adolescent Depression Scale. The YSR consists of 119 problem behavior items that form internalizing and
externalizing scales. The RADS is a 30-item self-report questionnaire that assesses frequency of depressive
symptoms on a 4-point scale with endpoints of almost never and most of the time.
The CAFAS (Child and Adolescent Functional Assessment Scale) assesses functional impairment in
multiple areas, including moods/self-harm. On the basis of parent responses to a structured interview, a
trained clinician rates level of functioning on a 4-point scale (0, 10, 20, 30) ranging from 0 (minimal or no
impairment) to 30 (severe impairment).

b. Large database

The 2002 National Health Interview Survey (NHIS), a national household survey sponsored by the
National Center for Health Statistics, was used to collect data on whether participants had a diagnosis of diabetes or other illness, the use of complementary and alternative medicine, demographic and socioeconomic
characteristics, preventive health care practices, and the use of conventional medical services.

c. Audio records

A computer-implemented acoustic voice measure was used to track slight as well as profound cognitive
impairment.

Exercise 2

Directions: This form is to be completed after each project participant’s delivery.
WHAT IS THE MOTHER’S EVALUATION ID?

1. Baby’s birth date: __/__/__

2. Birth weight: __________ grams OR ____ lbs ____ ozs

3. Baby’s sex: Male _ Female _ Unknown/Could not get _

4. Gestational age assessed by clinician at birth: ______ weeks

5. Was a drug toxicology screen performed on the mother at delivery?

Yes
No
Unknown/Not reported

5a. If yes, what were the results?

negative
positive for alcohol
positive for amphetamines
positive for opiates
positive for barbiturates
positive for cannabinoids
positive for cocaine or metabolites
positive, other, specify: ________________________
unknown/not reported

6. Was a drug toxicology screen performed at birth on the baby?

Yes
No
Unknown/Not reported

6a. If yes, what were the results?

negative
positive for alcohol
positive for amphetamines
positive for opiates
positive for barbiturates
positive for cannabinoids
positive for cocaine or metabolites
positive, other, specify: ________________________

7. What was the total number of clinical prenatal visits attended by the patient during this pregnancy?
total prenatal care visits __

8. Was the baby: stillborn _ live birth _

Chapter 6

Exercise 1
Excerpt A: The concept is content validity because the instrument is based on a number of theoretical
constructs (e.g., the health beliefs model and social learning theory).

Excerpt B: The concept is interrater reliability because agreement is correlated between scorers. If we also
assume that each expert’s ratings are true, then we have concurrent validity; κ (kappa) is a statistic that is
used to adjust for agreements that could have arisen by chance alone.

Excerpt C: The concept is test-retest reliability because each test is scored twice.
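For readers who want to see the kappa adjustment in action, the following minimal sketch (written in Python with the scikit-learn library, using made-up ratings rather than data from Excerpt B) compares simple percentage agreement with chance-corrected agreement:

# Minimal sketch: interrater agreement with and without a chance correction.
# The two sets of ratings are hypothetical.
from sklearn.metrics import cohen_kappa_score

rater_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]
rater_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]

# Percentage agreement counts matches but ignores agreement expected by chance.
percent_agreement = sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)

# Cohen's kappa adjusts the observed agreement for chance agreement.
kappa = cohen_kappa_score(rater_1, rater_2)

print(f"Percent agreement: {percent_agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")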

Exercise 2

This information collection plan has several serious flaws. First, it anticipates collecting data only on
student knowledge and skill despite the fact that the objectives also encompass attitude. Second, even if the
evaluators are expert test constructors, they must pilot test and evaluate the tests to determine their reliability
and validity. Finally, the evaluators do not plan to monitor the use of the educational materials. If faculty
cannot or will not use these materials, then any results may be spurious. The 5-year period planned for testing
is, however, probably of sufficient duration to allow the evaluators to observe changes.

Chapter 7

Exercise 1

• The “T” stands for “telephone” and the “_9” stands for the “ninth question.”
• “All of the time” is coded 100; “most of the time,” 75; “some of the time,” 50; “a little of the time,” 25;
“none of the time,” 0; “don’t know,” 8; and “refused,” 9.
• The codes of 100, 75, 50, and 25 allow the data analyst to compute a “score.”
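To make the scoring idea concrete, here is a minimal sketch (Python, with a hypothetical list of coded responses) that converts the codes above into a 0-to-100 scale score while treating the “don’t know” (8) and “refused” (9) codes as missing values rather than as scores:

# Minimal sketch: turning the response codes above into a scale score.
# The list of coded responses is hypothetical.
VALID_CODES = {100, 75, 50, 25, 0}   # substantive answers
MISSING_CODES = {8, 9}               # don't know, refused

responses = [100, 75, 8, 50, 25, 9, 0]  # hypothetical coded answers

# Keep only the codes that carry score information.
valid_answers = [code for code in responses if code in VALID_CODES]

# The scale score here is simply the mean of the non-missing item values.
scale_score = sum(valid_answers) / len(valid_answers)
print(f"Scale score: {scale_score:.1f}")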

Exercise 2

[T_1] My illness has strengthened my faith. Would you say this statement is very
much, quite a bit, somewhat, a little bit, or not at all true?

Very much _ (4)

Quite a bit _ (3)

Somewhat _ (2)

A little bit _ (1)

Not at all _ (0)

DON’T KNOW _ (8)

REFUSED _ (9)

2 (a) [T_2] How much of the time during the LAST 4 WEEKS have you wished that you could change
your mind about the kind of treatment you chose for prostate cancer? Would you say all of
the time, most of the time, a good bit of the time, some of the time, a little of the time, or
none of the time?

Very much _ (4)

Quite a bit _ (3)

Somewhat _ (2)

A little bit _ (1)

Not at all _ (0)

DON’T KNOW _ (8)

REFUSED _ (9)

Note that because 8 and 9 are used as codes in the first question, these codes are repeated in the answers to
the second. It is always a good idea to keep codes consistent in all evaluation data collection measures.

Chapter 8

Exercise 1

Exercise 2

Analysis: A two-sample independent groups t-test

Justification for the analysis: This t-test is appropriate when the independent variable is measured on a
nominal scale and the dependent variable is measured on a numerical scale. In this case, the assumptions of
a t-test are met. These assumptions are that each group has a sample size of at least 30, the two groups are
about equal in size, and the two groups are independent (an assumption that is met most easily with a
strong evaluation design and a high-quality data collection effort).
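If the two groups’ scores were in hand, the test itself would take only a few lines. The sketch below is a minimal illustration using the SciPy library and invented scores for two hypothetical groups of 30, so it shows the mechanics of the test rather than results from the exercise:

# Minimal sketch: two-sample independent groups t-test on hypothetical data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
program_group = rng.normal(loc=75, scale=10, size=30)  # hypothetical outcome scores
control_group = rng.normal(loc=70, scale=10, size=30)  # hypothetical outcome scores

t_statistic, p_value = stats.ttest_ind(program_group, control_group)
print(f"t = {t_statistic:.2f}, p = {p_value:.3f}")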

Exercise 3

If the evaluation aims to find out how younger and older persons in the experimental and control groups
compare in amount of domestic violence, and presuming that the statistical assumptions are met, then
analysis of variance is an appropriate technique.
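A minimal sketch of that analysis appears below. It uses the statsmodels library, invented data, and assumed column names (group, age_band, violence_score), so it illustrates the form of a two-way analysis of variance rather than the actual evaluation:

# Minimal sketch: two-way ANOVA (group by age category) on hypothetical data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
data = pd.DataFrame({
    "group": np.repeat(["experimental", "control"], 40),
    "age_band": np.tile(np.repeat(["younger", "older"], 20), 2),
    "violence_score": rng.normal(loc=10, scale=3, size=80),  # hypothetical outcome
})

model = smf.ols("violence_score ~ C(group) * C(age_band)", data=data).fit()
print(anova_lm(model, typ=2))  # main effects and the group-by-age interaction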

Chapter 9

Exercise 1

This slide requires the audience to do too much reading. Also, the title should be more informative. The
text suggests the following two slides:

A. Stratified Random Sampling

• Population is divided into subgroups or strata.


• Random sample is selected from each stratum.

B. Stratified Random Sampling: Blueprint

                               Age (years)
Income          Under 19        20–30        31–35        Over 35
High
Medium
Low
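The blueprint can also be read as a sampling procedure: form every age-by-income stratum, and then draw a simple random sample from each one. The sketch below (Python) is a minimal illustration; the sampling frame, stratum labels, and sample size per stratum are all hypothetical:

# Minimal sketch: stratified random sampling by age and income.
# The sampling frame and the sample size per stratum are hypothetical.
import random

random.seed(0)

AGE_BANDS = ["under 19", "20-30", "31-35", "over 35"]
INCOME_LEVELS = ["high", "medium", "low"]

# Hypothetical sampling frame: (participant_id, age_band, income_level)
frame = [(i, random.choice(AGE_BANDS), random.choice(INCOME_LEVELS)) for i in range(1000)]

def stratified_sample(records, n_per_stratum):
    """Draw a simple random sample of up to n_per_stratum records from each stratum."""
    strata = {}
    for record in records:
        strata.setdefault((record[1], record[2]), []).append(record)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, min(n_per_stratum, len(members))))
    return sample

sample = stratified_sample(frame, n_per_stratum=10)
print(f"Drew {len(sample)} participants from {len(AGE_BANDS) * len(INCOME_LEVELS)} strata")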

Exercise 2
The errors are:

1. Boys slept, on average, significantly LESS than girls (P<0.05).

2. NO significant associations were detected for maternal prepregnancy BMI or parity, nor for gestational
age, maternal smoking, and alcohol intake during pregnancy (data not shown).

Exercise 3

Informed Consent Form for the Diabetes Self-Care Program

1. Tell eligible subjects that participation is voluntary, they can leave whenever they choose, and their
health care will not be affected.

Reason for the Telephone Interviews and Self-Administered Questionnaires

2. Tell participants that the program will use their answers (and answers from others) to find out about
the appropriateness of services and any needed changes.

Possible Risks and Discomforts

3. Tell respondents that the surveys may include questions about their physical or emotional health or
their experience with the program that they may feel sensitive about answering, and that they do not have to answer questions that bother them.

Payment for Taking Part

4. Tell participants that they will not be paid for their time.

Taking Part and Choosing Not to Take Part in Telephone Interviews and Self-Administered Questionnaires

5. Tell participants that it is up to them whether or not they want to take part and that, if they do, they
can stop at any time and their health care will not be affected.

Index

Page references followed by (figure) indicate an illustrated figure; followed by (table) indicate a table.

Abstract
description of report, 228–229
online education program for older adults report annotated, 229–230
Achievement tests
description of, 125
example of multiple-choice question on, 126 (figure)
Advance letters, 244
Agency for Drug and Alcohol Misuse, 16
Alcohol-Related Problems Screen (ARPS), 167–169, 182
Alpha error (or Type I error)
description of, 112
example of, 113 (figure)
Alternative-form (or equivalence) reliability, 148, 149
Altman, D. G., 234
American Psychological Association, 136
ANCOVA (analysis of covariance)
description of, 82
to solve problem of incomparable participants, 82–83
Annals of Internal Medicine, 197
Attrition (drop-out), as threat to internal validity, 92–93

Baseline data
description and collection of, 15–16
example of, 16
The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research, 242
Beta error (or Type II error)
description of, 112
example of, 113 (figure)
Biases
how threats to validity promote, 94
interaction effects of selection, 93
“Big data” (large databases)
advantages and disadvantages of using, 132–133
as data source, 131–133

as program merit evidence source, 55–56
Blinding (double-blind) experiment, 80
Blocked randomization, 79–80
Braitman, L., 197

Case control designs


benefits and concerns of, 91 (table)
description and purpose of, 85–86
examples of, 86
Categorical data, 188–189, 190 (table)
Categorical variables
odds ratios comparing, 201–204
risk ratios of strength of the relationship of, 203
CATI (computer-assisted telephone interviewing), 130–131
CDC’s framework for program evaluation
illustration and description of, 26 (figure)–27
similarities to RE-AIM and PRECEDE-PROCEED, 27
Centers for Medicare and Medicaid Services, 132
Children and Prevention (CAP) slide presentation, 236–241
Cleaning data, 180–181
Clinical (or practical) significance
establishing, 198–200
statistical significance versus, 197–198
Closed questions, 207–208
Cluster sampling
description of, 109
in study of attitudes of Italian parent toward AIDS education, 110
Codebooks
content analysis and creating the, 207–208
creating a data dictionary or, 169–171
establishing reliable coding for, 172
measuring agreement using the kappa, 172–174
portions of a, 170–171
Codes
data analysis, 169–174
definition of, 169
qualitative data content analysis, 207–208
Coding
closed questions, 207–208
establishing codebooks using reliable, 172

open-ended questions, 208
Cognitive pretests, 177
Cohort designs
benefits and concerns of, 85, 90 (table)–91 (table)
calculating risk ratios in, 203–204
description and purpose of, 85
example of, 85
Communication
between evaluators and participants, 249
between participant and evaluation website, 249
between website and the evaluator, 250
Community
evaluation questions on program goals for, 40
participatory evaluation requiring involvement of, 34
program’s effectiveness measured by benefits to, 33
program’s value measured by worth to, 33
Community Cancer Center, 57–58
Comparative effectiveness evaluations or research (CER)
benefits and concerns of, 91 (table)
description of, 87
four defining characteristics of, 87–88
two examples of, 88–90
Concurrent validity, 150–151
Confidence intervals, 197–198
Consent form for online survey, 253–254
CONSORT (CONsolidated Standards of Reporting Trials)
checklist for reporting on RCTs, 232
description of, 212
excerpts from 2010 checklist of, 233 (figure)–234 (figure)
Construct validity, 151
Content analysis
description of, 204–205
steps in, 205–210
Content analysis steps
1: assembling the data, 205–206
2: learning the contents of the data for, 206–207
3: creating a codebook or data dictionary, 207–208
4: entering and cleaning the data, 209
5: doing the analysis, 209–210
Continuous numerical data, 190

Convenience sampling
description of, 104, 110
example of, 104–105
Cost analyses
cost minimization, 60
cost utility, 60
cost-benefit, 45, 60
cost-effectiveness, 45, 60
Cost minimization analysis, 60
Cost utility analysis, 60
Cost-benefit analysis, 45
Cost-benefit evaluation, 45
Cost-effectiveness analysis, 45
Cost-effectiveness evaluation, 45
Crepaz, N., 235
Criterion validity
concurrent, 150–151
predictive, 150
Cronbach’s coefficient alpha, 149
Cross-sectional designs
benefits and concerns of, 90 (table)
description and purpose of, 83
example of, 83–84

Data
baseline, 15–16
categorical, 188–189, 190 (table)
“dirty,” 180
interim, 16–17
managed so it can be analyzed, 12
missing, 178–179
numerical, 190 (table)–191 (table)
ordinal, 189–191 (table)
outliers, 180–181
recoding, 181
searching for missing, 175–177
Data analysis
cleaning the data during, 180–181
creating a codebook or data dictionary for, 169–174
creating the final database for, 181–182

to decide on program merit, 12
drafting a plan for, 167–169
entering the data for, 174–175
managing data for, 12
searching for missing data during, 175–179
secondary, 132
selecting a method of, 191–194 (table)
starting with the evaluation questions, 188
storing and archiving the database created during, 183
Data analysis methods
clinical (or practical) significance, 197–200
content analysis, 204–210
example of factors used to decide on, 191–192
general guide to statistical, 192 (table)–194 (table)
hypothesis testing, p values, and statistical significance, 194–197 (figure)
meta-analysis, 210–212 (figure)
odds ratios and relative risk, 201–204
risks and odds, 200–201 (table)
what is required for, 191
Data collection
of baseline data, 15–16
consent form for an online survey, 253–254
example and case study on failure of, 122–123
examples of questions, variables, evidence, and data sources, 120–121
first deciding on what the question is, 119–121
lexicon on terms related to, 151–152
primary, 132
on program merit, 11
qualitative evaluation data, 204–205
sample questionnaire maintaining ethically sound online, 251–253
Data dictionary. See Codebooks
Data entry
to create simple data set, 174–175
description of, 174
Data management
comparing female and male physicians’ preventive health services to patients by gender, 166, 167 (figure)
for data analysis, 12
description of, 165
eight activities included in, 166
selecting software for, 166–167

Data protection
minimum criteria for privacy policy using the web, 250
security and issues of, 250–251
Data sets
continuous variables dichotomized in analytic, 182
data entry to create a simple, 174–175
description of, 165–166
Data sources
achievement tests, 125, 126 (figure)
choosing appropriate, 121–123
definition of, 152
example and case study on failure of, 122–123
example of data collection, 120–121
interviews, 129 (figure)–131
large databases (“big data”), 55–56, 131–133
the literature, 134–141 (figure)
observations, 128 (figure)
physical examinations, 131
qualitative evaluation data, 204–205
questions to ask when choosing, 121–122
record reviews, 125–127
self-administered surveys, 123 (figure)–125
vignettes, 133–134
Databases
“big data” (large databases), 55–56, 131–133
CDC-maintained, 55–56, 132
created during data analysis for evaluation, 181–182
ERIC, 32–33 (figure)
PubMed, 30, 31 (figure)–32, 136
storing and archiving evaluation, 183
Dawson-Saunders, B., 197
Decision making
using existing data and large databases for, 55–56
using experts for, 53 (figure)–55
on program merit evidence, 53 (figure)–59
using research literature for, 57–58
Dependent (or outcome) variables
data collection on the, 120–121
evaluation questions on, 59
guide to statistical data-analytic methods using, 192 (table)–194 (table)

medical record abstracts program, 59
odds ratios comparing categorical, 201–204
purpose of, 188
QEV Reporting Form on, 61, 62 (figure), 63
Depression therapy class attendance, 209–210
Des Jarlais, D. C., 235
Dichotomous variables, 178–179, 182, 189
“Dirty data,” 180
Discrete numerical data, 190
Double-blind experiment, 80

Economic evaluations
cost minimization analysis, 60
cost utility analysis, 60
cost-benefit analysis, 45, 60
cost-effectiveness analysis, 45, 60
evaluation questions on, 44–46
evidence used in, 60–61
examples of evaluation questions on, 45–46
quality-adjusted life year (QALY), 60
E-mail communication, 249
Equivalence (or alternative-form) reliability, 148, 149
ERIC (Education Resources Information Center), 136
online evaluation reports created using, 32
search for recent articles evaluation elementary education programs, 33 (figure)
Ethical issues
consent form for an online survey, 253–254
informed consent, 245–249
the Internet and related, 249–251
IRB (institutional review board), 242–245
online data collection, 251–253
research misconduct, 254–255 (table)
reviewing the evaluation report for quality and, 232
Evaluation design examples
case control design, 86
cohort design, 85
comparative effectiveness evaluation, 88–90
cross-sectional designs, 83–84
experimental evaluation design with random assignment, 69
experimental evaluation without random assignment, 69

observational evaluation design, 69–70
randomized controlled trial (RCT) for effective literacy program, 71
Evaluation designs
comparative effectiveness evaluations or research (CER), 87–90
comparing six commonly used benefits and concerns, 90 (table)–91 (table)
description of, 10, 63–64, 68
external validity, 91, 93–94
factorial designs, 75–77, 91 (table)
internal validity, 91, 92–93
nonrandomized controlled trials, 81–83, 90 (table)
observational designs, 83–87
randomized controlled trial (RCT), 70–75, 90 (table), 179, 232
randomizing and blinding, 77–80
three examples of designs for one program, 68–70
Evaluation frameworks/models
CDC’s framework for planning and implementing practical program evaluation, 26 (figure)–27
logic models, 27–30
PRECEDE-PROCEED, 24–25 (figure), 27
RE-AIM, 26, 27
Evaluation of a Healthy Eating Program for Professionals Who Care for Preschoolers (Hardy, King, Kelly,
Farrell, & Howlett), 13
Evaluation questions
asked for the purpose of data collection, 120–121
data analysis as starting with the, 188
description of, 8
example of formative evaluation, 17–18
examples of hypothesis and, 8–9
on financial costs, 44–46
on independent and dependent variables, 59
medical record abstracts program, 59
on participants, 43
on program characteristics, 44
on program goals, 40–43
on program objectives, 41–44
on program’s environment, 46–47
QEV Reporting Form, 61, 62 (figure), 63
See also Survey questions
Evaluation reports
CONSORT standard for, 212, 232, 233 (figure)
oral presentations, 232, 235–241

posters, 241
reviewing the report for quality and ethics, 232
TREND standard for, 212, 232, 235 (figure)
written, 218–231
Evaluations
cost-benefit, 45
cost-effectiveness, 45
formative, 16–18, 34
mixed-methods, 20–21
online evaluation reports, 30–33 (figure)
participatory and community-based, 22–24, 34
process or implementation, 18–19, 34
qualitative, 19–20
summative, 19
See also Practitioners/evaluators; Program evaluations
Evidence
data collection for gathering, 120–121
program merit, 47–49, 53 (figure)–59, 61, 62 (figure), 63
Executive summary, 230–231
Expectancy threat, 93
Experimental evaluation design with random assignment, 69
Experimental evaluation without random assignment, 69
Experts
chosen to guide the choice of evaluation evidence, 55
guidelines for creating panels of, 54–55
program merit evidence decisions using, 53 (figure)–55
review of new measures by, 155
External validity
description of, 91
how threat promote bias, 94
threats to, 93

Factorial designs
benefits and concerns of, 91 (table)
description of, 75–77
example of Pleading (Factor 1) and Notification Status (Factor 2), 76 (figure)
Farrell, L., 13
Financial costs. See Economics
Food diaries/obese children program, 137
Formative evaluation

example of questions asked in a, 17–18
helping in determining feasibility, 34
interim data for, 16–17

Goals. See Program goals


Green, L., 25

Hardy, L., 13
Hawthorne Effect, 93
Health Promotion Planning: An Educational and Ecological Approach (Green & Kreuter), 25
History, 92
Homogeneity, 148, 149
Howlett, S., 13
Hypotheses
evaluations that are designed to test, 8
examples of evaluation questions and, 8–9
questions on program goals and objectives included in the, 40–43
RISK-FREE anti-smoking program, 60–61
See also Null hypotheses
Hypothesis testing
comparing for statistical significance and p values, 194–195
defining areas of acceptance and rejection in standard distribution, 195, 197 (figure)
guidelines for statistical significance, p values, and, 195–196

Implementation evaluation. See Process (or implementation) evaluation


Independent (or explanatory) variables
data collection on the, 120–121
evaluation questions on, 59
guide to statistical data-analytic methods using, 192 (table)–194 (table)
medical record abstracts program, 59
purpose of, 188
QEV Reporting Form on, 61, 62 (figure), 63
Index, 152
Informed consent
contents to participate in an evaluation, 245
example for a study of alcohol use and health form, 246–249
IRB review of forms for, 245
purpose of the, 245
Institute of Education Sciences (IES), 32, 33
Institute of Medicine, 53
Institutions, 40

Instrument, 152
Instrumentation threat, 92
Intention-to-treat analysis (ITT), 179
Interim data
description and collection of, 16
example of formative evaluation and, 16–17
Internal consistency, 149
Internal validity
description of, 91
how threat promote bias, 94
threats to, 92–93
Internet
communication between evaluators and participants via, 249
communication between participant and the evaluation website, 249
communication between website and the evaluator, 250
consent form for an online survey, 253–254
data protection issues, 250–251
ethically sound online data collection, 251–253
minimum criteria for privacy policy, 250
Interrater reliability, 148, 149–150
Interventions, 4–5
See also Programs
Interviews
advantages and disadvantages of, 129–130
cognitive pretests, 177
computer-assisted telephone interviewing (CATI), 130–131
description of, 129
pilot tests questions, 177
portion of an interview form, 129 (figure)
Intrarater reliability, 148, 150
IRB (institutional review board)
criteria used by the, 244
description and requirements for, 242–245
documents reviewed by the, 244–245
evaluations that are exempt from, 243–244

Kappa (k) statistic, 173–174


Kelly, B., 13
King, L., 13

Kreuter, M., 25

Large databases (“big data”)


advantages and disadvantages of using, 132–133
as data source, 131–133
as program merit evidence source, 55–56
Last observation carried forward (LOCF), 179
Left-to-right (or forward logic) logic model, 30
Likelihood ratios (or relative risks), 203–204
The literature
description of, 134
guidelines for reviewing, 135–141 (figure)
reasons for evaluators’ use of, 134–135
The literature guidelines
abstract the information, 139, 140 (figure)–141 (figure)
assemble the literature, 136
consider the non-peer reviewed literature, 139
identify inclusion and exclusion criteria, 136–137
identify the best available literature, 138 (figure)–139
overview of, 135
select the relevant literature, 137
Logic models
description and purpose of the, 27, 29
illustration of the components of a basic, 28 (figure)
left-to-right (or forward logic), 30
right-to-left (or reverse logic), 29 (figure)
Lyles, C., 235

Marczinski, C. A., 21
Maturation, 92
Measure checklist (existing measure)
1: find out the costs, 157
2: check the content, 157
3: check the reliability and validity, 157
4: check the measure’s format, 157
Measure checklist (new measures)
1: set boundaries, 153
2: define the subject matter or topics, 153
3: outline the content, 153–154
4: select response choices for each question or item, 154
5: choose rating scales, 155

263
6: review the measure with experts and potential users, 155–156
7: revise the measure based on reviewer comments, 156
8: put the measure in an appropriate format, 156
9: review and test the measure before administration, 156
Measurement chart
content, 159 (figure)–160 (figure), 161
description and purpose of, 158
duration of measures, 159 (figure)–160 (figure), 161
general concerns, 159 (figure)–160 (figure), 161
how measured, 158, 159 (figure)–160 (figure)
reliability and validity, 159 (figure)–160 (figure), 161
sample, 158, 159 (figure)–160 (figure)
timing of measures, 158, 159 (figure)–160 (figure), 161
variables, 158, 159 (figure)–160 (figure)
Measures
checklist for creating a new, 152–156
checklist for selecting an already existing measure, 156–157
definition of, 152
issues to consider for, 142
lexicon on terms for data collection, 151–152
reliability, 148–150
validity, 150–151
Medical record abstracts program, 59
Meta-analysis
description of, 210–211
seven tasks for, 211
of three educational programs for adolescent health care, 211, 212 (figure)
Methods
analyzing data to decide on program merit, 12, 165–183, 188–212 (figure)
collecting data on program merit, 11, 119–143
designing the evaluation, 10, 63–64, 68–94
evaluation questions and hypotheses, 8–9, 33–34, 40–47, 61–63
evidence of merit, 9–10, 33–34, 47–63
listed, 7–8
managing data so that it can be analyzed, 12
reporting on effectiveness, quality, and value, 12–13, 218–254
sampling participants for evaluation, 11, 94, 101–115 (figure)
Methods section, 221–222
Metrics, 152
Missing data

intention-to-treat analysis (ITT) to handle, 179
last observation carried forward (LOCF) response to, 179
nonresponse and weighting, 178
per protocol analysis to handle, 179
Mixed-methods evaluation
description and purpose of, 20
reasons to use, 21
Mixed-methods evaluation examples
to better understand experimental results, 21
to develop ways to improve use of web-based health information, 21
to examine elevated BrACs (breath alcohol concentrations), 21
Moher, D., 234
Multiple program interference, 93

National Center for Education Statistics, 132


National Center for Health Statistics, 132
National Health and Nutrition Examination Survey (NHANES), 132
National Library of Medicine, 30, 136
Nemet, D., 127
NIH’s Consensus Conference programs, 224–225
Nonprobability (or convenience) sampling
description of, 104, 110
example of, 104–105
Nonrandomized controlled trials
ANCOVA to solve incomparable participants problem of, 82–83
benefits and concerns of, 90 (table)
desirable features of, 81–82
parallel controls, 81
Nonresponses, 178
Null hypotheses
description of, 112
determine statistical significance using, 195
example of, 112 (figure)
stating the evaluation question as a, 195–196
See also Hypotheses
Numerical data
continuous and discrete, 190
description and examples of, 190 (table)–191 (table)

Objectives (program), 41–44


Observational designs

case control designs, 85–86, 91 (table)
cohort designs, 85, 90 (table)–91 (table)
cross-sectional designs, 83–84, 90 (table)
pretest-posttest only or self-controlled designs, 86–87
Observational evaluation design, 69–70
Observations
description and, 218
portion of an observation form, 128 (figure)
Obstetrical Access and Utilization Initiative, 56
OCR/OMR system, 174
Odds ratios
description of, 201
example of, 202–203
formula for the, 202
Online data collection
consent form for an online survey, 253–254
ethically sound, 251–253
Online education program evaluation, 227
Online evaluation reports
description of, 30
ERIC database used to create, 32–33 (figure)
PubMed database used to create, 30, 31 (figure)–32
Online health information
evaluation questions on program objectives on, 42–43
example of mixed-methods evaluation to improve, 21
program objectives to teach people about, 41
Online survey consent form, 253–254
Open City Online Survey of Educational Achievement, 148
Open-ended questions, 208
Oral presentations
description of, 232–235
recommendations for slides, 235–241
Ordinal data, 189–191 (table)
Outcomes
assessing risk and odds likelihood of a, 200–201 (table)
definition of, 152
Outcomes measure (or indicator), 152
Outliers, 180–181

p value

description of, 196
determining the, 196
Parallel controls
nonrandomized controlled trials, 81–82
randomization, 72–73
Participants
ANCOVA to solve problem of incomparable, 82–83
communication between evaluation website and, 249
evaluation questions on program, 43
informed consent of, 245–249
Internet communication between evaluators and, 249
question format that is relatively easy to understand by, 177
question format that may be confusing to, 176–177
respondent’s identification (RESPID) numbers assigned to, 175
selecting, 11
table to describe characteristics of, 223
See also Samples
Participatory evaluation
community involvement in, 34
description and purpose of, 22
examples of different applications of, 23–24
four reasons that findings are particularly useful, 22
Per protocol analysis, 179
Physical examination data source, 131
Pilot tests, 177
Populations
definition of, 101
example of, 101–102
Posters
IRB review of, 245
recommended format of, 241
Power analysis
description of, 111–112
null hypothesis, 112
PowerPoint, 241
Practical (or clinical) significance
establishing, 198–200
statistical significance versus, 197–198
Practitioners/evaluators
communication between evaluation website and, 250

communication between participants and, 249
evaluation questions on program goals for, 40
reasons for use of the literature by, 134–135
research misconduct and problematic behaviors by, 254–255 (table)
See also Evaluations
PRECEDE-PROCEED framework
description and phases of, 24
illustrated diagram of, 25 (figure)
similarities to RE-AIM and CDC’s approach to the, 27
Predictive validity, 150
Prenatal Care Access and Utilization Initiative, 153–154
Pretest-posttest designs
description of, 86–87
disadvantages of, 87
Primary data collection, 132
Privacy policy, 250
Privacy statements, 251
Probability sampling
cluster sampling, 109–110
description of, 104
random selection and random assignment, 105–107
simple random sampling, 105, 106 (figure)
stratified sampling, 107–109
systematic sampling, 107
Problematic research behaviors, 254, 255 (table)
Process (or implementation) evaluation
benefits of using, 34
description and purpose of, 18
follow-up of abnormal pap smears as example of, 18
Program characteristics
evaluation questions about, 44
what constitutes, 44
Program environment
description of, 46
evaluation questions on, 46–47
Program evaluations
definition and purpose of, 4, 33
economics of, 60–61
Evaluation of a Healthy Eating Program for Professionals Who Care for Preschoolers example of, 13
as an interdisciplinary discipline example of, 14–15

methods used for, 7–15
A Psychological Intervention for Children With Symptoms of Posttraumatic Stress Disorder example of, 13–14
understanding who uses, 15
See also Evaluations
Program goals
evaluation questions on, 40–43
for institutions, 40
for practitioners, 40
for the public or community at large, 40
for the system, 41
Program merit
analysis data to decide on, 12
collecting data on, 11
evidence of, 47–59
finding evidence of, 9–10
program evaluation as unbiased exploration of, 33
Program merit evidence
description of, 47
example of access to and use of prenatal care services, 48–49
example of clarifying terms for, 48
making decisions about, 53 (figure)–59
medical record abstracts program, 59
providing specific, 47–48
QEV Reporting Form on, 61, 62 (figure), 63
Program merit evidence decisions
using experts to help make, 53 (figure)–55
made from existing data and large databases, 55–56
using research literature to make, 57–58
when to make, 58–59
Program merit evidence sources
comparison of programs, 49–52
from existing data and large databases (“big data”), 55–56
from expert consultation, 52–55
from research literature, 57–58
Program objectives, 41–44
Program value
definition of, 7
measured by worth to stakeholders, 33
typical questions about, 7
Programs

anticipated outcomes as objective of a, 5
characteristics of, 5–6
as the core of an evaluation, 4–5
costs of, 6
impact of, 6
quality of, 6–7
value of, 7
See also Interventions
A Psychological Intervention for Children With Symptoms of Posttraumatic Stress Disorder (Stein et al.), 13–14
PsycINFO, 136
PubMed database
creating online evaluation reports using, 30, 32
description of, 30
evaluation search terms provided by, 31 (figure)
literature review using the, 136
restricting a search to type of study, years published, and availability of free text, 31 (figure)

Qualitative evaluation
description and purpose of, 19
examples of qualitative methods in program evaluation, 19–20
Qualitative evaluation data
content analysis, 204–210
sources and collection of, 204–205
Quality-adjusted life year (QALY), 60
Quasi-experiment. See Nonrandomized controlled trials

Random assignment
benefits and concerns of, 90 (table)
blinding (double-blind), 80
comparing random selection and, 105–107
description and purpose of, 77
example of, 77–78
random clusters, 78–79
stratifying and blocking, 79–80
Random cluster, 78–79
Random selection
comparing random assignment to, 105
description of, 107
example of, 106–107
Randomized controlled trial (RCT)
CONSORT checklist for reporting on, 232

270
description of, 70–71
effective literacy program using, 71
illustrated diagram on flow of, 74 (figure)–75 (figure)
intention-to-treat analysis (ITT) used in, 179
parallel controls used in, 72–73
wait-list controls used in, 73, 75, 76 (figure)
Rating scale
choosing a, 155
definition of, 152
REACH-OUT Program, 102
RE-AIM framework
description and five dimensions of the, 26
similarities to PRECEDE-PROCEED and CDC’s approach to the, 27
Recoding data, 181
Recommendations section, 228
Record reviews
advantages and disadvantages of, 127
description and two types of, 125–127
example for food diaries and obese children, 127
as unobtrusive measures, 126
Relative risks (or likelihood ratios), 203–204
Reliability
check the, 157
comparing validity to, 161
description of, 148
four kinds of, 148–149
measurement chart information on, 159 (figure)–160 (figure), 161
Research literature
description of, 57
program merit evidence from, 57–58
Research misconduct, 254–255 (table)
Respondent’s identification (RESPID) numbers, 175
Results section, 222–227
Right-to-left (or reverse logic) logic model, 29 (figure)
Risk ratios
cohort design and calculation of, 203–204
to examine relationship between categorical variables, 203
RISK-FREE anti-smoking program, 60–61
Risks and odds
comparing and contrasting, 200–201 (table)

description of, 200

Sample examples
inclusion/exclusion criteria for surgical cancer patient program, 103–104
sampling units, 111
Sample size
alpha error (or Type I error) and, 112, 113 (figure)
beta error (or Type II error), 112–113 (figure)
formulas used to select the appropriate, 113
null hypothesis, 112
power analysis of, 111–112
Samples
definition of, 101
inclusion or exclusion criteria or eligibility, 103–104
power analysis and alpha and beta errors, 111–113 (figure)
size of, 111–113 (figure)
two examples of, 102
units of, 110–111
See also Participants
Sampling
description of, 94
as internal validity threat, 92
methods used for, 104–110
reasons for, 102–103
Sampling examples
blueprint for program using strategized sampling, 108 (figure)–109
cluster sampling of attitudes toward AIDS education, 110
null hypothesis, 112
random selection and random assignment, 106–107
Sampling methods
nonprobability or convenience type of, 104–105, 110
probability and convenience types of, 104, 105–110
random selection and random assignment, 105–107
simple random sampling, 105, 106 (figure)
stratified random sampling, 107–109
systematic sampling, 107
Sampling report (SR)
description and purpose of, 114
example of a SR form, 114 (figure)–115 (figure)

Sampling units
description of, 110–111
example of, 111
Scales
definition of, 152
rating, 152, 155
Schultz, K. F., 234
Screening scripts, 245
Secondary data analysis, 132
Secure Sockets Layer (SSL) protocol, 249
Self-administered surveys
advantages and disadvantages of, 125
description and examples of, 123 (figure)–125
Simple random sampling, 105, 106 (figure)
Slide presentations
best practices for, 235–238, 240
recommendations for, 239, 240–241
Social media communication, 249
Society
evaluation questions on program goals for, 40
program’s effectiveness measured by benefits to, 33
program’s value measured by worth to, 33
Split-half reliability, 149
Stakeholders
evaluation to define program effectiveness to, 33
program value measured by worth to, 33
who use evaluations, 15
Stamates, A. L., 21
Statistical regression, 92
Statistical significance
clinical significance versus, 197–198
comparing two groups for, 194–195
guidelines for determining, 196
null hypothesis used to determine, 195
Stein, B. D., 13
Stratified blocked randomization, 79–80
Stratified random sampling
blueprint for program educating women in breast cancer treatment, 108 (figure)–109
description of, 107–108
Survey questions

coding closed, 207–208
coding open-ended, 208
format that is relatively easy to understand, 177
that may be confusing to participants, 176
nonresponses to, 178
when participants omit information, 178–179
See also Evaluation questions
Systematic sampling, 107
Systems (program goals for), 41

Target population
definition of, 101
two examples of, 102
Test-retest reliability, 148–149
Tests
definition of, 152
external validity threatened by reactive effects of, 93
as threat to internal validity, 92
Text communication, 249
Trapp, R., 197
TREND (Transparent Reporting of Evaluations with Nonrandomized Designs)
on baseline data and equivalence of groups, 235 (figure)
description of, 212
reviewing report for quality and ethics using, 232
Type I error (or alpha error), 112–113 (figure)
Type II error (or beta error), 112–113 (figure)

University of North Carolina website, 241


U.S. Centers for Disease Control and Prevention (CDC)
databases maintained by, 55–56, 132
evaluation framework recommended by the, 26 (figure)–27
Youth Risk Behavior Surveillance System of the, 56
U.S. Department of Education, 32
U.S. Department of Health, Education, and Social Services, 41
U.S. Government Office of Human Subjects of Research, 242

Validity
check the, 157
comparing reliability to, 161
construct, 151
content, 150

criterion, 150–151
definition of, 150
external, 91, 93–94
face, 150
internal, 91–93
measurement chart information on, 159 (figure)–160 (figure), 161
Variables
dependent (or outcome), 59, 61, 62 (figure), 63, 120–121, 188, 192 (table)–194 (table)
dichotomous, 178–179, 182, 189
independent (or explanatory), 59, 61, 62 (figure), 63, 120–121, 188, 192 (table)–194 (table)
odds ratios comparing categorical dependent, 201–204
risk ratios examining relationship between categorical, 203
Vignettes
advantages and disadvantages of using, 134
description of, 133
example on influence of physicians’ characteristics, 133

Wait-list controls, 73, 75, 76 (figure)


Weighting, 178
Written evaluation report
the abstract, 228–230
composition of the, 220
conclusions or discussion section, 227–228
example of statistical values table, 226 (figure)
executive summary, 230–231
using figures and tables, 224–227
methods section, 221–222
overview of, 218
recommendations section, 228
results section, 222–227
sample table of contents for a, 218–220
table to describe characteristics of participants, 223
users and nonusers of smokeless tobacco, 225 (figure)–226 (figure)
variability in transfusion practice during surgery, 224 (figure)
what to include in the introduction, 221

Your Health Online program


annotated abstract for evaluation of, 229–230
conclusions of evaluation on, 227–228
Youth Risk Behavior Surveillance System, 56
Yu, S., 21

